Planning and Learning with Tabular Methods (Chapter 8): Dyna-Q, Planning vs. No Planning

Controls

episodes: 0
Q-learning steps (last episode): -
Dyna-Q steps (last episode): -
speedup: -

Cheat Sheet

Both agents act in the same maze from start S to goal G. Q-learning updates only from real transitions, while Dyna-Q additionally performs planning updates from a learned model.
Real update: Q(S,A) ← Q(S,A) + α [R + γ max_a Q(S',a) − Q(S,A)]
Dyna-Q: after each real step, sample n transitions from the learned model and apply the same Q-learning update to each.
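The loop described above can be sketched in a few lines of Python. This is a minimal tabular Dyna-Q, assuming a deterministic environment (so the model can simply memorize the last observed (S', R) for each (S, A)); the `env_step` callback, the 1-D corridor example, and all parameter defaults are illustrative choices, not part of the demo above.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start, goal, actions, episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch. `env_step(s, a) -> (s', r)` is assumed
    deterministic. Returns the Q table and steps-to-goal per episode."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(s, a)] -> action value
    model = {}               # model[(s, a)] -> (s', r), last observed transition
    steps_per_episode = []

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, steps = start, 0
        while s != goal:
            # epsilon-greedy action selection
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s2, r = env_step(s, a)                    # real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                                  - Q[(s, a)])
            model[(s, a)] = (s2, r)                   # update the model
            # planning: replay n simulated transitions sampled from the model
            for _ in range(n_planning):
                (ps, pa), (ps2, pr) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions)
                                        - Q[(ps, pa)])
            s, steps = s2, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode

# Hypothetical example: a 1-D corridor of states 0..5, actions -1/+1,
# reward 1 on reaching the goal state 5, 0 otherwise.
def step(s, a):
    s2 = min(5, max(0, s + a))
    return s2, (1.0 if s2 == 5 else 0.0)

Q, steps = dyna_q(step, start=0, goal=5, actions=[-1, 1])
```

Setting `n_planning=0` recovers plain Q-learning, which is exactly the comparison this demo runs: the planning updates reuse stored experience, so Dyna-Q typically needs far fewer real steps per episode.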

Recent episodes

Run some episodes to compare sample efficiency.
ep | Q-learning steps | Dyna-Q steps | Dyna gain

Visualization

Greedy policy map (Q-learning)
Greedy policy map (Dyna-Q)
Learning curve: steps-to-goal by episode (lower is better)
Series: Q-learning, Dyna-Q, moving average (Dyna-Q)