Tips: Use Step to understand the logic. Use Play to show convergence / getting stuck.
With non-stationary rewards, constant-α updates track the drifting values better than sample averages.
Batch comparison (Sutton Fig 2.2 style)
This runs many independent bandit problems and plots the averages, like Sutton’s 10-armed testbed.
It’s async, but big runs still take a moment.
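A minimal sketch of what one such batch run does, assuming ε-greedy action selection with sample-average updates on a stationary Gaussian bandit (the function name `run_bandit` and all parameter values are illustrative, not the demo's actual code):

```python
import random

def run_bandit(n_arms=10, steps=1000, eps=0.1, seed=0):
    """One ε-greedy run on a stationary Gaussian bandit; returns fraction of optimal picks."""
    rng = random.Random(seed)
    q_true = [rng.gauss(0, 1) for _ in range(n_arms)]  # hidden q* values
    best = q_true.index(max(q_true))
    Q = [0.0] * n_arms  # value estimates
    N = [0] * n_arms    # pick counts
    optimal = 0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(n_arms)   # explore
        else:
            a = Q.index(max(Q))         # exploit current best estimate
        r = rng.gauss(q_true[a], 1)     # noisy reward around q*
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]       # sample-average update
        optimal += (a == best)
    return optimal / steps

# Average over independent problems, as in Sutton's 10-armed testbed
avg = sum(run_bandit(seed=s) for s in range(50)) / 50
print(avg)
```

Averaging over many independent problems is what smooths the curves: any single run is noisy, but the mean reveals the systematic difference between strategies.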
Idle
Visualization
Single run: Reward per step
Single run: % Optimal action (running)
Batch: Average reward
Batch: % Optimal action
Arms table (q*, Q, N)
Arm
q* (true)
Q (estimate)
N
In real problems q* is hidden; it is shown here to make the exploration-vs-exploitation
trade-off visually obvious.
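The table's Q and N columns can be maintained incrementally, without storing past rewards. A sketch of that update, assuming the standard sample-average rule (the helper name `update` is illustrative):

```python
def update(Q, N, arm, reward):
    """Incremental sample-average update for one arm: Q <- Q + (r - Q) / N."""
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]

# Two pulls of arm 0 with rewards 1.0 and 0.0
Q, N = [0.0, 0.0], [0, 0]
update(Q, N, 0, 1.0)
update(Q, N, 0, 0.0)
print(Q[0], N[0])  # Q[0] is the running mean 0.5 after 2 pulls
```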
What you should notice
Greedy can lock onto a suboptimal arm early (bad long-run performance).
ε-greedy keeps sampling every arm: a larger ε means more exploration, but too much
sacrifices short-term reward.
UCB explores “uncertain” arms more systematically than random ε-exploration.
In non-stationary settings, constant α updates track drift better than
sample-average.
Optimistic initialization drives early exploration even under a purely greedy policy.
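Two of the rules above can be sketched directly: UCB's uncertainty bonus and the constant-α update for non-stationary problems. This is a hedged sketch, not the demo's code; the exploration constant `c=2.0` and step size `alpha=0.1` are illustrative choices.

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Pick the arm maximizing Q + c*sqrt(ln(t)/N); untried arms go first."""
    for a, n in enumerate(N):
        if n == 0:
            return a  # force each arm to be tried once
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return scores.index(max(scores))

def constant_alpha_update(Q, arm, reward, alpha=0.1):
    """Exponential recency-weighting: recent rewards dominate, so drifting q* is tracked."""
    Q[arm] += alpha * (reward - Q[arm])

# Equal estimates, but arm 1 has been pulled least, so its uncertainty bonus is largest
Q, N = [0.5, 0.5, 0.5], [3, 1, 3]
a = ucb_select(Q, N, t=7)
print(a)  # → 1
constant_alpha_update(Q, a, reward=1.0)
print(Q[1])  # 0.5 + 0.1 * (1.0 - 0.5) = 0.55
```

Note the contrast with the sample-average rule: there the effective step size 1/N shrinks forever, so old rewards never stop mattering, while a constant α keeps the estimate responsive to drift.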