Windy Gridworld
Chapter 5 Subpage: SARSA Control
Controls
- Algorithm: SARSA (on-policy) or Q-learning (off-policy)
- Episodes per batch
- Auto-play speed (ms)
- Step size alpha: 0.50
- Exploration epsilon: 0.10
- Buttons: Reset, Run batch, Play, Stop
Readouts: episodes run, steps in the last episode, best (fewest) steps so far.
Environment
A 7x10 grid. The start is S=(3,0) and the goal is G=(3,7). Actions are up/down/left/right, and each step gives reward -1. Wind pushes the agent upward by the per-column strengths [0,0,0,1,1,1,2,2,1,0].
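The dynamics above can be sketched as a small step function. This is a minimal sketch, not the demo's actual code: the `step` signature, (row, col) coordinates with row 0 at the top, and clamping at the grid edges are assumptions.

```python
# Hypothetical sketch of the windy-gridworld dynamics described above.
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push, per column
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply the move, then the current column's wind; reward is -1 per step."""
    r, c = state
    dr, dc = ACTIONS[action]
    r2 = min(max(r + dr - WIND[c], 0), ROWS - 1)  # wind lifts the agent (row 0 is the top)
    c2 = min(max(c + dc, 0), COLS - 1)
    return (r2, c2), -1, (r2, c2) == GOAL
```

Note that the wind strength is read from the column the agent leaves, which is why the agent overshoots the goal vertically in the strong-wind columns.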
SARSA: Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') − Q(s,a) ]
Q-learning: Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
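A minimal training loop implementing both updates might look like the following. It is a sketch under stated assumptions, not the demo's implementation: `train`, `move`, and `eps_greedy` are made-up names, γ=1 and first-index tie-breaking are choices, and the environment constants are re-declared so the block runs standalone.

```python
import random
from collections import defaultdict

WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # upward push, per column
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def move(s, a):
    """Apply the move, then the current column's wind, clamped to the grid."""
    r, c = s
    dr, dc = MOVES[a]
    return (min(max(r + dr - WIND[c], 0), ROWS - 1),
            min(max(c + dc, 0), COLS - 1))

def eps_greedy(Q, s, eps):
    """Epsilon-greedy action selection; ties break toward the first index."""
    if random.random() < eps:
        return random.randrange(4)
    qs = [Q[s, a] for a in range(4)]
    return qs.index(max(qs))

def train(episodes=500, alpha=0.5, eps=0.1, gamma=1.0, q_learning=False):
    """Tabular TD control; q_learning=False gives SARSA, True gives Q-learning."""
    Q = defaultdict(float)  # unvisited (s, a) pairs default to 0
    for _ in range(episodes):
        s, a = START, eps_greedy(Q, START, eps)
        while s != GOAL:
            s2 = move(s, a)
            a2 = eps_greedy(Q, s2, eps)
            if q_learning:
                # Off-policy target: best next action, regardless of a2.
                target = -1 + gamma * max(Q[s2, b] for b in range(4))
            else:
                # On-policy (SARSA) target: the action actually taken next.
                target = -1 + gamma * Q[s2, a2]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

The only difference between the two algorithms is the bootstrap target, matching the two update rules above; the terminal state's Q values stay at 0, so the reward of -1 per step is what drives the agent toward shorter paths.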
Last trajectory
Run at least one episode to inspect the path to the goal.
step | state | action | next
Visualization
Grid policy: greedy arrows plus the last trajectory, with cells shaded by value (low / mid / high).
Learning curve: steps per episode.