Windy Gridworld
Chapter 5 Subpage: SARSA Control
Controls
- Algorithm: SARSA (on-policy) or Q-learning (off-policy)
- Episodes per batch
- Auto-play speed (ms)
- Step size alpha: 0.50
- Exploration epsilon: 0.10
- Buttons: Reset, Run batch, Play, Stop
Readouts: episodes run, steps in the last episode, best (fewest) steps so far.
Environment
A 7x10 grid. The start is S=(3,0) and the goal is G=(3,7). Actions are up/down/left/right, and each step gives reward -1. Wind pushes the agent upward by the per-column strengths [0,0,0,1,1,1,2,2,1,0].
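The dynamics above can be sketched as a small step function. This is a minimal sketch, not the demo's actual code: the `step` signature, (row, col) coordinates with row 0 at the top, and clamping at the grid edges are assumptions.

```python
# Hypothetical sketch of the windy-gridworld dynamics described above.
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push, per column
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply the move, then the current column's wind; reward is -1 per step."""
    r, c = state
    dr, dc = ACTIONS[action]
    r2 = min(max(r + dr - WIND[c], 0), ROWS - 1)  # wind lifts the agent (row 0 is the top)
    c2 = min(max(c + dc, 0), COLS - 1)
    return (r2, c2), -1, (r2, c2) == GOAL
```

Note that the wind strength is read from the column the agent leaves, which is why the agent overshoots the goal vertically in the strong-wind columns.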
SARSA: Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') − Q(s,a) ]
Q-learning: Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
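A minimal training loop implementing both updates might look like the following. It is a sketch under stated assumptions, not the demo's implementation: `train`, `move`, and `eps_greedy` are made-up names, γ=1 and first-index tie-breaking are choices, and the environment constants are re-declared so the block runs standalone.

```python
import random
from collections import defaultdict

WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # upward push, per column
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def move(s, a):
    """Apply the move, then the current column's wind, clamped to the grid."""
    r, c = s
    dr, dc = MOVES[a]
    return (min(max(r + dr - WIND[c], 0), ROWS - 1),
            min(max(c + dc, 0), COLS - 1))

def eps_greedy(Q, s, eps):
    """Epsilon-greedy action selection; ties break toward the first index."""
    if random.random() < eps:
        return random.randrange(4)
    qs = [Q[s, a] for a in range(4)]
    return qs.index(max(qs))

def train(episodes=500, alpha=0.5, eps=0.1, gamma=1.0, q_learning=False):
    """Tabular TD control; q_learning=False gives SARSA, True gives Q-learning."""
    Q = defaultdict(float)  # unvisited (s, a) pairs default to 0
    for _ in range(episodes):
        s, a = START, eps_greedy(Q, START, eps)
        while s != GOAL:
            s2 = move(s, a)
            a2 = eps_greedy(Q, s2, eps)
            if q_learning:
                # Off-policy target: best next action, regardless of a2.
                target = -1 + gamma * max(Q[s2, b] for b in range(4))
            else:
                # On-policy (SARSA) target: the action actually taken next.
                target = -1 + gamma * Q[s2, a2]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

The only difference between the two algorithms is the bootstrap target, matching the two update rules above; the terminal state's Q values stay at 0, so the reward of -1 per step is what drives the agent toward shorter paths.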
Last trajectory
Run at least one episode to inspect the path to the goal.
step | state | action | next
Visualization
Grid policy: greedy arrows plus the last trajectory, with cells shaded by value (low / mid / high).
Learning curve: steps per episode.