Planning and Learning with Tabular Methods (Chapter 8): Dyna-Q, Planning vs. No Planning

Controls

episodes: 0
Q-learning steps (last episode): -
Dyna-Q steps (last episode): -
speedup: -

Cheat Sheet

Both agents act in the same maze from start S to goal G. Q-learning updates only from real transitions, while Dyna-Q additionally performs planning updates from a learned model.
Real update: Q(S,A) ← Q(S,A) + α [R + γ max_a Q(S',a) − Q(S,A)]
Dyna-Q: after each real step, sample n transitions from the learned model and apply the same Q-learning update to each.
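The loop described above can be sketched in a few lines of Python. This is a minimal tabular Dyna-Q, assuming a deterministic environment (so the model can simply memorize the last observed (S', R) for each (S, A)); the `env_step` callback, the 1-D corridor example, and all parameter defaults are illustrative choices, not part of the demo above.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start, goal, actions, episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch. `env_step(s, a) -> (s', r)` is assumed
    deterministic. Returns the Q table and steps-to-goal per episode."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(s, a)] -> action value
    model = {}               # model[(s, a)] -> (s', r), last observed transition
    steps_per_episode = []

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, steps = start, 0
        while s != goal:
            # epsilon-greedy action selection
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s2, r = env_step(s, a)                    # real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                                  - Q[(s, a)])
            model[(s, a)] = (s2, r)                   # update the model
            # planning: replay n simulated transitions sampled from the model
            for _ in range(n_planning):
                (ps, pa), (ps2, pr) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions)
                                        - Q[(ps, pa)])
            s, steps = s2, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode

# Hypothetical example: a 1-D corridor of states 0..5, actions -1/+1,
# reward 1 on reaching the goal state 5, 0 otherwise.
def step(s, a):
    s2 = min(5, max(0, s + a))
    return s2, (1.0 if s2 == 5 else 0.0)

Q, steps = dyna_q(step, start=0, goal=5, actions=[-1, 1])
```

Setting `n_planning=0` recovers plain Q-learning, which is exactly the comparison this demo runs: the planning updates reuse stored experience, so Dyna-Q typically needs far fewer real steps per episode.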

Recent episodes

Run some episodes to compare sample efficiency.
ep | Q-learning steps | Dyna-Q steps | Dyna gain

Visualization

Greedy policy map (Q-learning)
Greedy policy map (Dyna-Q)
Learning curve: steps-to-goal by episode (lower is better)
Series: Q-learning, Dyna-Q, moving average (Dyna-Q)