Markov Decision Processes — "Teleportation" Gridworld
A Bellman backup visualization for the MDP tuple (S, A, P, R, γ).

MDP concepts

An MDP formalizes sequential decision-making with the Markov property: the next state and reward depend only on the current state and action.
MDP tuple
S = states, A = actions
P(s′|s,a) = transition dynamics
R(s,a,s′) = expected reward
γ ∈ [0,1] = discount factor
Return (what we maximize)
Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …
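Because the return satisfies the recursion Gₜ = Rₜ₊₁ + γGₜ₊₁, it can be computed by folding a reward sequence from the back. A minimal sketch (the function name is illustrative, not part of the demo):

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward list [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for r in reversed(rewards):  # apply G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```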
Bellman expectation backup (policy evaluation)
V(s) ← Σₐ π(a|s) Σₛ′ P(s′|s,a) [ R(s,a,s′) + γ V(s′) ]
(this gridworld's transitions are deterministic, so the inner sum over s′ collapses to the single successor of (s, a))
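Under the equiprobable random policy π(a|s) = 1/4, one synchronous expectation sweep can be sketched as below. This assumes Sutton & Barto's Example 3.5 layout in (row, column) coordinates (A = (0,1) teleports to A′ = (4,1), B = (0,3) to B′ = (2,3)) with γ = 0.9; the function names are illustrative.

```python
GAMMA = 0.9
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A2, 10.0          # any action from A: +10, teleport to A'
    if s == B:
        return B2, 5.0           # any action from B: +5, teleport to B'
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < 5 and 0 <= c < 5:
        return (r, c), 0.0       # ordinary move: reward 0
    return s, -1.0               # off the grid: -1, state unchanged

def expectation_sweep(V):
    """One synchronous Bellman expectation sweep; returns (new V, max |change|)."""
    newV, delta = {}, 0.0
    for s in V:
        backup = 0.0
        for a in ACTIONS:
            s2, r = step(s, a)
            backup += 0.25 * (r + GAMMA * V[s2])  # pi(a|s) = 1/4
        newV[s] = backup
        delta = max(delta, abs(backup - V[s]))
    return newV, delta

V = {(r, c): 0.0 for r in range(5) for c in range(5)}
for _ in range(200):
    V, delta = expectation_sweep(V)
    if delta < 1e-6:
        break
print(round(V[A], 1))  # ≈ 8.8, matching the random-policy values in Sutton & Barto
```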
Bellman optimality backup (value iteration)
V(s) ← maxₐ Σₛ′ P(s′|s,a) [ R(s,a,s′) + γ V(s′) ]
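Value iteration swaps the policy average for a max over actions. A self-contained sketch under the same assumed layout (A = (0,1) → A′ = (4,1) for +10, B = (0,3) → B′ = (2,3) for +5, γ = 0.9):

```python
GAMMA = 0.9
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A2, 10.0
    if s == B:
        return B2, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    return ((r, c), 0.0) if 0 <= r < 5 and 0 <= c < 5 else (s, -1.0)

def optimality_sweep(V):
    """One synchronous Bellman optimality sweep; returns (new V, max |change|)."""
    newV, delta = {}, 0.0
    for s in V:
        best = max(r + GAMMA * V[s2]
                   for s2, r in (step(s, a) for a in ACTIONS))
        newV[s] = best
        delta = max(delta, abs(best - V[s]))
    return newV, delta

V = {(r, c): 0.0 for r in range(5) for c in range(5)}
delta = 1.0
while delta > 1e-6:
    V, delta = optimality_sweep(V)
print(round(V[A], 1))  # ≈ 24.4, matching the optimal values in Sutton & Barto
```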

Gridworld controls

sweep: 0
max Δ:
selected state:
Click a cell on the grid to inspect a state’s backup. In Sutton & Barto’s example: any action from A yields +10 and teleports to A′, any action from B yields +5 and teleports to B′, moving off the grid yields −1 (state unchanged), and all other transitions yield 0.
Convergence curve is on the right (max Δ per sweep).
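The convergence curve drops roughly linearly on a log scale: the synchronous optimality backup is a γ-contraction in the sup norm, so max Δ shrinks by at least a factor of γ per sweep (Δₖ₊₁ ≤ γ Δₖ). A self-contained check of that bound, again assuming the Example 3.5 layout with γ = 0.9:

```python
GAMMA = 0.9
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(s, a):
    if s == A:
        return A2, 10.0
    if s == B:
        return B2, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    return ((r, c), 0.0) if 0 <= r < 5 and 0 <= c < 5 else (s, -1.0)

def sweep(V):
    """Synchronous optimality backup; returns (new V, sup-norm change)."""
    newV = {s: max(r + GAMMA * V[s2]
                   for s2, r in (step(s, a) for a in ACTIONS)) for s in V}
    return newV, max(abs(newV[s] - V[s]) for s in V)

V = {(r, c): 0.0 for r in range(5) for c in range(5)}
deltas = []
for _ in range(30):
    V, d = sweep(V)
    deltas.append(d)

# Successive sup-norm differences contract by at least the factor gamma.
assert all(deltas[k + 1] <= GAMMA * deltas[k] + 1e-12
           for k in range(len(deltas) - 1))
print(deltas[0], deltas[-1])  # first and last max-delta of the run
```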

Backup breakdown (selected state)

Click a state to see its Bellman backup terms.
action | next state | reward | backup term
V(s) update will appear here after you select a state.

Visualization

Gridworld (5×5): values + policy arrows
Convergence: max Δ per sweep