Temporal-Difference Learning: TD(0), Bootstrapping, and the Random Walk


TD chapter cheat sheet

Temporal-difference learning updates value estimates from other learned estimates (bootstrapping), before the episode's final return is known. In this five-state random walk (A–E), every episode ends at one of two terminal states: the reward is 0 off the left end and 1 off the right end, so the true values are linear, V(A) = 1/6 through V(E) = 5/6. Both updates below assume the undiscounted case (γ = 1):
TD(0): V(St) ← V(St) + α [Rt+1 + V(St+1) − V(St)]
MC: V(St) ← V(St) + α [Gt − V(St)]
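The two updates above can be sketched side by side. This is a minimal sketch, not the demo's actual code: it assumes states A–E are indexed 1–5 with terminals at 0 and 6, that episodes start at the center state C, a step size α = 0.1, and γ = 1.

```python
import random

def td0_episode(V, alpha=0.1):
    """One episode of TD(0) on the 5-state random walk, updating V in place."""
    s = 3  # assumed start: center state C
    while True:
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0                   # reward 1 only off the right end
        v_next = 0.0 if s_next in (0, 6) else V[s_next]   # V(terminal) = 0
        V[s] += alpha * (r + v_next - V[s])               # TD(0) bootstrap update
        if s_next in (0, 6):
            return
        s = s_next

def mc_episode(V, alpha=0.1):
    """One episode of constant-alpha, every-visit Monte Carlo."""
    s, visited = 3, []
    while s not in (0, 6):
        visited.append(s)
        s += random.choice((-1, 1))
    G = 1.0 if s == 6 else 0.0                # undiscounted return = terminal reward
    for st in visited:
        V[st] += alpha * (G - V[st])          # MC update toward the full return

random.seed(0)
V_td = [0.0] + [0.5] * 5   # index 0 unused; estimates start at 0.5
V_mc = [0.0] + [0.5] * 5
for _ in range(200):
    td0_episode(V_td)
    mc_episode(V_mc)
```

After a few hundred episodes both estimate vectors approach the true values 1/6 … 5/6; TD(0) typically does so with lower variance, because each of its updates leans on the next state's estimate rather than on the full (noisier) return.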

Last trajectory

Run at least one episode to inspect the visited states and the terminal reward.
step | state | next | reward
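A trajectory like the one in this table can be generated with a short helper. This is a hypothetical sketch, not the page's code: it assumes the walk starts at the center state C and labels both terminal states `T`.

```python
import random

STATES = " ABCDE"  # indices 1..5 map to A..E; 0 and 6 are terminal

def rollout(seed=None):
    """Return the (step, state, next, reward) rows of one random-walk episode."""
    rng = random.Random(seed)
    rows, s, step = [], 3, 0   # assumed start: center state C
    while s not in (0, 6):
        s_next = s + rng.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0               # only the right exit pays 1
        label = "T" if s_next in (0, 6) else STATES[s_next]
        rows.append((step, STATES[s], label, r))
        step += 1
        s = s_next
    return rows

for row in rollout(seed=1):
    print("{:>4} {:>5} {:>4} {:>6}".format(*row))
```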

Visualization

[Chart: state values A–E — true values vs TD(0) vs Monte Carlo]
[Chart: learning curve — RMSE by episode]
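The RMSE plotted in the learning curve is presumably the root-mean-square error of the five estimates against the true values 1/6 … 5/6; a minimal sketch:

```python
import math

TRUE_V = [i / 6 for i in range(1, 6)]  # true values for states A..E

def rmse(estimates):
    """Root-mean-square error of the five estimates against the true values."""
    return math.sqrt(sum((v - t) ** 2 for v, t in zip(estimates, TRUE_V)) / len(TRUE_V))

# The uniform 0.5 initialization gives the curve's starting error:
print(round(rmse([0.5] * 5), 4))  # → 0.2357
```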