Dynamic Programming: Policy Evaluation, Policy Improvement, Value Iteration


DP chapter cheat sheet

Dynamic programming in Sutton & Barto is about solving known MDPs by iterating Bellman backups. The key ingredients are a model (transition probabilities P and rewards R), a policy π, and a value function V or Q. The three canonical algorithms are:
  • Policy evaluation: repeatedly apply the Bellman expectation backup until Vπ converges.
  • Policy improvement: make π greedy w.r.t. Vπ to get a better policy.
  • Value iteration: apply the Bellman optimality backup until V* converges.
Bellman expectation backup (policy evaluation):
  V(s) ← Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]
Bellman optimality backup (value iteration):
  V(s) ← max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]
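The two backups above can be sketched in code. This is a minimal sketch, assuming the classic 4×4 gridworld of Sutton & Barto's Example 4.1 (terminal corner states, reward −1 per step, γ = 1); the MDP details are assumptions for illustration, not taken from this page.

```python
# Assumed MDP: 4x4 grid, terminal corners, reward -1 per move, gamma = 1.
N = 4
GAMMA = 1.0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TERMINAL = {(0, 0), (N - 1, N - 1)}

def step(s, a):
    """Deterministic model: move one cell, clamped at the walls."""
    if s in TERMINAL:
        return s, 0.0
    r, c = s[0] + a[0], s[1] + a[1]
    return (min(max(r, 0), N - 1), min(max(c, 0), N - 1)), -1.0

def backup(s, a, V):
    """One Bellman backup term: R + gamma * V(s')."""
    s2, r = step(s, a)
    return r + GAMMA * V[s2]

def policy_evaluation(policy, theta=1e-8):
    """Sweep the Bellman expectation backup until max delta < theta."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            if s in TERMINAL:
                continue
            v = sum(p * backup(s, a, V) for a, p in policy[s].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v   # in-place update
        if delta < theta:
            return V

def greedy_policy(V):
    """Policy improvement: act greedily w.r.t. V in every state."""
    return {s: {max(ACTIONS, key=lambda a: backup(s, a, V)): 1.0} for s in V}
```

For this assumed gridworld, evaluating the equiprobable random policy reproduces the familiar −14/−18/−20/−22 value table from the book, and a single greedy improvement step w.r.t. those values already yields an optimal policy.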

Visualization

[Interactive panels: a 4×4 gridworld heatmap of state values (low → high) with greedy-action arrows, and a convergence plot of max Δ per sweep.]
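The convergence quantity tracked per sweep (max Δ) falls out of the value-iteration loop directly. A sketch under the same assumed 4×4 gridworld (terminal corners, reward −1 per step, γ = 1) that records it:

```python
# Value iteration on an assumed 4x4 gridworld, recording max delta per
# sweep -- the quantity a convergence plot like the one above would show.
N, GAMMA = 4, 1.0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
TERMINAL = {(0, 0), (N - 1, N - 1)}

def step(s, a):
    """Deterministic model: move one cell, clamped at the walls."""
    if s in TERMINAL:
        return s, 0.0
    r, c = s[0] + a[0], s[1] + a[1]
    return (min(max(r, 0), N - 1), min(max(c, 0), N - 1)), -1.0

def value_iteration(theta=1e-8):
    """Bellman optimality backups; returns V* and the per-sweep max deltas."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    deltas = []
    while True:
        delta = 0.0
        for s in V:
            if s in TERMINAL:
                continue
            # Bellman optimality backup: max over actions of R + gamma * V(s')
            v = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        deltas.append(delta)
        if delta < theta:
            return V, deltas
```

Here V* ends up as minus the shortest-path length to the nearest terminal corner, and the recorded deltas drop to zero once every state's backup stops changing, which is exactly where the convergence curve flattens.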