Temporal-Difference Learning
TD(0) · Bootstrapping · Random Walk
Controls
episodes: 0
TD RMSE: –
MC RMSE: –
TD chapter cheat sheet
Temporal-difference learning updates value estimates from other learned estimates (bootstrapping), before the final return is known.
In this undiscounted random-walk example, the terminal rewards are 0 (left exit) and 1 (right exit), and the true state values increase linearly from left to right.
TD(0): V(St) ← V(St) + α [Rt+1 + V(St+1) − V(St)]
MC: V(St) ← V(St) + α [Gt − V(St)]
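The two updates above can be sketched on the same walk. Below is a minimal, hypothetical reconstruction of the demo's setup (the state count, start state, α, and episode count are assumptions, not read from the page): five non-terminal states, exits rewarding 0 on the left and 1 on the right, with RMSE measured against the linear true values.

```python
import random

# Assumed setup: 5 non-terminal states (indices 0..4), start in the middle,
# terminal exits beyond each end; gamma = 1 (undiscounted).
TRUE_V = [i / 6 for i in range(1, 6)]  # true values are linear: 1/6 .. 5/6

def run(method, episodes=100, alpha=0.1, seed=0):
    rng = random.Random(seed)
    V = [0.5] * 5  # neutral initial estimates
    for _ in range(episodes):
        s = 2  # middle state
        visited = []
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s_next == 5 else 0.0       # reward only on the right exit
            terminal = s_next < 0 or s_next > 4
            if method == "td":
                # TD(0): bootstrap from the current estimate of the next state
                target = r + (0.0 if terminal else V[s_next])
                V[s] += alpha * (target - V[s])
            else:
                visited.append(s)
            if terminal:
                if method == "mc":
                    # MC: the undiscounted return G is just the terminal reward,
                    # applied to every visited state after the episode ends
                    G = r
                    for st in visited:
                        V[st] += alpha * (G - V[st])
                break
            s = s_next
    rmse = (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5) ** 0.5
    return V, rmse

V_td, rmse_td = run("td")
V_mc, rmse_mc = run("mc")
print(f"TD RMSE: {rmse_td:.3f}  MC RMSE: {rmse_mc:.3f}")
```

Note the structural difference: TD(0) updates inside the loop at every step, while MC must wait for the episode to terminate before it can update any visited state.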
Last trajectory
Run at least one episode to inspect the visited states and the terminal reward.