Temporal-Difference Learning
TD(0) · Bootstrapping · Random Walk
Controls
episodes: 0
TD RMSE: –
MC RMSE: –
TD chapter cheat sheet
Temporal-difference learning updates value estimates from other learned estimates (bootstrapping), before the final return is known.
In this undiscounted random-walk example, the terminal rewards are 0 (left exit) and 1 (right exit), and the true state values increase linearly from left to right.
TD(0): V(St) ← V(St) + α [Rt+1 + V(St+1) − V(St)]
MC: V(St) ← V(St) + α [Gt − V(St)]
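The two updates above can be sketched on the same walk. Below is a minimal, hypothetical reconstruction of the demo's setup (the state count, start state, α, and episode count are assumptions, not read from the page): five non-terminal states, exits rewarding 0 on the left and 1 on the right, with RMSE measured against the linear true values.

```python
import random

# Assumed setup: 5 non-terminal states (indices 0..4), start in the middle,
# terminal exits beyond each end; gamma = 1 (undiscounted).
TRUE_V = [i / 6 for i in range(1, 6)]  # true values are linear: 1/6 .. 5/6

def run(method, episodes=100, alpha=0.1, seed=0):
    rng = random.Random(seed)
    V = [0.5] * 5  # neutral initial estimates
    for _ in range(episodes):
        s = 2  # middle state
        visited = []
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s_next == 5 else 0.0       # reward only on the right exit
            terminal = s_next < 0 or s_next > 4
            if method == "td":
                # TD(0): bootstrap from the current estimate of the next state
                target = r + (0.0 if terminal else V[s_next])
                V[s] += alpha * (target - V[s])
            else:
                visited.append(s)
            if terminal:
                if method == "mc":
                    # MC: the undiscounted return G is just the terminal reward,
                    # applied to every visited state after the episode ends
                    G = r
                    for st in visited:
                        V[st] += alpha * (G - V[st])
                break
            s = s_next
    rmse = (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5) ** 0.5
    return V, rmse

V_td, rmse_td = run("td")
V_mc, rmse_mc = run("mc")
print(f"TD RMSE: {rmse_td:.3f}  MC RMSE: {rmse_mc:.3f}")
```

Note the structural difference: TD(0) updates inside the loop at every step, while MC must wait for the episode to terminate before it can update any visited state.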
Last trajectory
Run at least one episode to inspect the visited states and the terminal reward.