Dynamic programming, as presented in Sutton & Barto, solves a known MDP by
iterating Bellman backups. The key ingredients are a model (the transition
probabilities P and rewards R), a policy π, and a value function V or Q.
The three canonical algorithms are:
Policy evaluation: repeatedly apply the Bellman expectation backup until Vπ converges.
Policy improvement: make π greedy w.r.t. Vπ to get a better policy.
Value iteration: apply the Bellman optimality backup until V* converges.
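The first two algorithms, alternated until the policy stops changing, form policy iteration. A minimal sketch on a hypothetical 2-state, 2-action MDP (the transition and reward arrays here are invented for illustration, not from the source):

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] = transition probability,
# R[s, a, s'] = reward. Entering state 1 pays +1.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.0, 1.0]
P[1, 1] = [0.5, 0.5]
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0
gamma = 0.9

def policy_evaluation(pi, tol=1e-8):
    """Iterate the Bellman expectation backup until V converges."""
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = (pi * Q).sum(axis=1)  # average over pi(a|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(V):
    """Return the deterministic policy that is greedy w.r.t. V."""
    Q = (P * (R + gamma * V)).sum(axis=2)
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi

pi = np.full((n_states, n_actions), 0.5)  # start with the uniform random policy
for _ in range(10):                       # policy iteration loop
    V = policy_evaluation(pi)
    pi = policy_improvement(V)
```

In this toy MDP the loop settles on action 1 in state 0 and action 0 in state 1, the choices that steer toward the rewarding state.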
Expectation backup: V(s) ← Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
Optimality backup:  V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
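Value iteration is just the optimality backup applied repeatedly from any starting V. A sketch, again on an invented 2-state, 2-action MDP (all numbers here are illustrative assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality backup until V stops changing."""
    V = np.zeros(P.shape[0])
    while True:
        # V_new(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
        V_new = (P * (R + gamma * V)).sum(axis=2).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy MDP: entering state 1 pays +1; state 1 can loop on itself.
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.0, 1.0]
P[1, 1] = [0.5, 0.5]
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

V_star = value_iteration(P, R)
```

State 1 can earn +1 forever by looping, so V*(1) converges to 1/(1-γ) = 10; each sweep is a γ-contraction, which is why the iteration converges from any initialization.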
Selected-state backup: click a grid cell to inspect its Bellman backup terms.