Markov Decision Processes — "Teleportation" Gridworld
A Bellman backup visualization for the MDP tuple (S, A, P, R, γ).

MDP concepts

An MDP formalizes sequential decision-making with the Markov property: the next state and reward depend only on the current state and action.
MDP tuple
S = states, A = actions
P(s′|s,a) = transition dynamics
R(s,a,s′) = expected reward
γ ∈ [0,1] = discount factor
Return (what we maximize)
Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …
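Because the return satisfies the recursion Gₜ = Rₜ₊₁ + γGₜ₊₁, it can be computed by folding a reward sequence from the back. A minimal sketch (the function name is illustrative, not part of the demo):

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward list [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for r in reversed(rewards):  # apply G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```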
Bellman expectation backup (policy evaluation)
V(s) ← Σₐ π(a|s) Σₛ′ P(s′|s,a) [ R(s,a,s′) + γ V(s′) ]
(this gridworld's transitions are deterministic, so the inner sum over s′ collapses to the single successor of (s, a))
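Under the equiprobable random policy π(a|s) = 1/4, one synchronous expectation sweep can be sketched as below. This assumes Sutton & Barto's Example 3.5 layout in (row, column) coordinates (A = (0,1) teleports to A′ = (4,1), B = (0,3) to B′ = (2,3)) with γ = 0.9; the function names are illustrative.

```python
GAMMA = 0.9
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A2, 10.0          # any action from A: +10, teleport to A'
    if s == B:
        return B2, 5.0           # any action from B: +5, teleport to B'
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < 5 and 0 <= c < 5:
        return (r, c), 0.0       # ordinary move: reward 0
    return s, -1.0               # off the grid: -1, state unchanged

def expectation_sweep(V):
    """One synchronous Bellman expectation sweep; returns (new V, max |change|)."""
    newV, delta = {}, 0.0
    for s in V:
        backup = 0.0
        for a in ACTIONS:
            s2, r = step(s, a)
            backup += 0.25 * (r + GAMMA * V[s2])  # pi(a|s) = 1/4
        newV[s] = backup
        delta = max(delta, abs(backup - V[s]))
    return newV, delta

V = {(r, c): 0.0 for r in range(5) for c in range(5)}
for _ in range(200):
    V, delta = expectation_sweep(V)
    if delta < 1e-6:
        break
print(round(V[A], 1))  # ≈ 8.8, matching the random-policy values in Sutton & Barto
```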
Bellman optimality backup (value iteration)
V(s) ← maxₐ Σₛ′ P(s′|s,a) [ R(s,a,s′) + γ V(s′) ]
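Value iteration swaps the policy average for a max over actions. A self-contained sketch under the same assumed layout (A = (0,1) → A′ = (4,1) for +10, B = (0,3) → B′ = (2,3) for +5, γ = 0.9):

```python
GAMMA = 0.9
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A2, 10.0
    if s == B:
        return B2, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    return ((r, c), 0.0) if 0 <= r < 5 and 0 <= c < 5 else (s, -1.0)

def optimality_sweep(V):
    """One synchronous Bellman optimality sweep; returns (new V, max |change|)."""
    newV, delta = {}, 0.0
    for s in V:
        best = max(r + GAMMA * V[s2]
                   for s2, r in (step(s, a) for a in ACTIONS))
        newV[s] = best
        delta = max(delta, abs(best - V[s]))
    return newV, delta

V = {(r, c): 0.0 for r in range(5) for c in range(5)}
delta = 1.0
while delta > 1e-6:
    V, delta = optimality_sweep(V)
print(round(V[A], 1))  # ≈ 24.4, matching the optimal values in Sutton & Barto
```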

Gridworld controls

sweep: 0
max Δ:
selected state:
Click a cell on the grid to inspect a state’s backup. In Sutton & Barto’s example: any action from A yields +10 and teleports to A′, any action from B yields +5 and teleports to B′, moving off the grid yields −1 (state unchanged), and all other transitions yield 0.
Convergence curve is on the right (max Δ per sweep).
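The convergence curve drops roughly linearly on a log scale: the synchronous optimality backup is a γ-contraction in the sup norm, so max Δ shrinks by at least a factor of γ per sweep (Δₖ₊₁ ≤ γ Δₖ). A self-contained check of that bound, again assuming the Example 3.5 layout with γ = 0.9:

```python
GAMMA = 0.9
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(s, a):
    if s == A:
        return A2, 10.0
    if s == B:
        return B2, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    return ((r, c), 0.0) if 0 <= r < 5 and 0 <= c < 5 else (s, -1.0)

def sweep(V):
    """Synchronous optimality backup; returns (new V, sup-norm change)."""
    newV = {s: max(r + GAMMA * V[s2]
                   for s2, r in (step(s, a) for a in ACTIONS)) for s in V}
    return newV, max(abs(newV[s] - V[s]) for s in V)

V = {(r, c): 0.0 for r in range(5) for c in range(5)}
deltas = []
for _ in range(30):
    V, d = sweep(V)
    deltas.append(d)

# Successive sup-norm differences contract by at least the factor gamma.
assert all(deltas[k + 1] <= GAMMA * deltas[k] + 1e-12
           for k in range(len(deltas) - 1))
print(deltas[0], deltas[-1])  # first and last max-delta of the run
```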

Backup breakdown (selected state)

Click a state to see its Bellman backup terms.
action | next state | reward | backup term
V(s) update will appear here after you select a state.

Visualization

Gridworld (5×5): values + policy arrows
Convergence: max Δ per sweep