Dynamic programming, as presented in Sutton & Barto, solves a known MDP by
iterating Bellman backups. The key ingredients are a model (the transition
probabilities P and rewards R), a policy π, and a value function V or Q.
The three canonical algorithms are:
Policy evaluation: repeatedly apply the Bellman expectation backup until Vπ converges.
Policy improvement: make π greedy w.r.t. Vπ to get a better policy.
Value iteration: apply the Bellman optimality backup until V* converges.
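The first two algorithms, alternated until the policy stops changing, form policy iteration. A minimal sketch on a hypothetical 2-state, 2-action MDP (the transition and reward arrays here are invented for illustration, not from the source):

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] = transition probability,
# R[s, a, s'] = reward. Entering state 1 pays +1.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.0, 1.0]
P[1, 1] = [0.5, 0.5]
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0
gamma = 0.9

def policy_evaluation(pi, tol=1e-8):
    """Iterate the Bellman expectation backup until V converges."""
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = (pi * Q).sum(axis=1)  # average over pi(a|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(V):
    """Return the deterministic policy that is greedy w.r.t. V."""
    Q = (P * (R + gamma * V)).sum(axis=2)
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi

pi = np.full((n_states, n_actions), 0.5)  # start with the uniform random policy
for _ in range(10):                       # policy iteration loop
    V = policy_evaluation(pi)
    pi = policy_improvement(V)
```

In this toy MDP the loop settles on action 1 in state 0 and action 0 in state 1, the choices that steer toward the rewarding state.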
Expectation backup: V(s) ← Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
Optimality backup:  V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
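Value iteration is just the optimality backup applied repeatedly from any starting V. A sketch, again on an invented 2-state, 2-action MDP (all numbers here are illustrative assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality backup until V stops changing."""
    V = np.zeros(P.shape[0])
    while True:
        # V_new(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
        V_new = (P * (R + gamma * V)).sum(axis=2).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy MDP: entering state 1 pays +1; state 1 can loop on itself.
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.0, 1.0]
P[1, 1] = [0.5, 0.5]
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

V_star = value_iteration(P, R)
```

State 1 can earn +1 forever by looping, so V*(1) converges to 1/(1-γ) = 10; each sweep is a γ-contraction, which is why the iteration converges from any initialization.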
Selected-state backup: click a grid cell to inspect its Bellman backup terms.