Multi-Armed Bandits
Greedy / ε-greedy / UCB / Softmax · Stationary + Non-stationary · Sample-average vs constant step-size

Controls

t: current step
action: arm selected this step
reward: reward just received
avg reward: running mean of all rewards so far
% optimal: running fraction of steps that chose the best arm
Tips: Use Step to walk through the logic one decision at a time. Use Play to show convergence, or to show a greedy agent getting stuck. The non-stationary setting with a constant step-size α demonstrates tracking better than the sample-average does.
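The demo's source isn't shown here, but the sample-average vs constant-α contrast it mentions comes down to one line in the update rule. A minimal sketch (function and parameter names are illustrative, not the demo's actual code), tracking a single drifting arm:

```python
import random

def run(alpha=None, steps=2000, drift=0.05, seed=0):
    """Track one drifting arm's value; return final estimation error |Q - q*|.
    alpha=None -> sample-average step (1/N); otherwise constant step-size."""
    rng = random.Random(seed)
    q_true, Q, N = 0.0, 0.0, 0
    for _ in range(steps):
        q_true += rng.gauss(0, drift)        # non-stationary: true value drifts
        r = q_true + rng.gauss(0, 1)         # noisy reward around q*
        N += 1
        step = (1 / N) if alpha is None else alpha
        Q += step * (r - Q)                  # incremental update, Sutton Eq. 2.4 form
    return abs(Q - q_true)
```

Averaged over several seeds, `run(alpha=0.1)` ends much closer to the drifting q* than `run(alpha=None)`: the constant step keeps weighting recent rewards, while the sample-average freezes onto the whole history.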

Batch comparison (Sutton Fig 2.2 style)

This runs many independent bandit problems and plots the averaged curves, like Sutton's 10-armed testbed (Fig. 2.2). It runs asynchronously, but large batches still take a moment.
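The batch logic is simple to sketch: generate many independent k-armed problems, run the same agent on each, and average the reward traces step by step. A minimal Python version under assumed defaults (ε-greedy, sample-average updates; all names are illustrative):

```python
import random

def bandit_run(k=10, steps=500, eps=0.1, rng=None):
    """One stationary k-armed bandit run with eps-greedy; returns the reward trace."""
    rng = rng or random.Random()
    q = [rng.gauss(0, 1) for _ in range(k)]       # true values, hidden in practice
    Q, N = [0.0] * k, [0] * k
    rewards = []
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit current estimates
        r = q[a] + rng.gauss(0, 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                 # sample-average update
        rewards.append(r)
    return rewards

def testbed(runs=200, **kw):
    """Average reward at each step across independent problems (Fig. 2.2 style)."""
    rng = random.Random(0)
    traces = [bandit_run(rng=random.Random(rng.random()), **kw) for _ in range(runs)]
    return [sum(col) / len(col) for col in zip(*traces)]
```

Averaging across runs is what smooths the noisy single-run curves into the familiar rising reward plot: `testbed()[-1]` is well above `testbed()[0]`.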

Visualization

  • Single run: Reward per step
  • Single run: % Optimal action (running)
  • Batch: Average reward
  • Batch: % Optimal action

Arms table (q*, Q, N)

Columns: Arm | q* (true value) | Q (current estimate) | N (pull count)
q* is hidden in real problems; it is shown here to make the exploration-vs-exploitation trade-off visually obvious.

What you should notice

  • Greedy can lock onto a suboptimal arm early (bad long-run performance).
  • ε-greedy keeps sampling: more ε → more exploration, but too much hurts short-term reward.
  • UCB explores “uncertain” arms more systematically than ε-random.
  • In non-stationary settings, constant α updates track drift better than sample-average.
  • Optimistic init encourages early exploration even with greedy.
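The UCB point above can be made concrete: instead of exploring at random, UCB adds an uncertainty bonus to each arm's estimate, so rarely tried arms win ties. A sketch of UCB1-style selection (names are illustrative, not the demo's code):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Pick the arm maximizing estimate + uncertainty bonus (UCB1 style).
    Q: value estimates, N: pull counts, t: current step, c: exploration weight.
    Arms never tried (N == 0) are forced first."""
    for a, n in enumerate(N):
        if n == 0:
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```

With equal estimates, the less-sampled arm gets the larger bonus and is chosen, e.g. `ucb_action([0.5, 0.5], [100, 5], t=105)` returns 1. That is the "systematic" exploration the bullet refers to: uncertainty, not a coin flip, decides when to try an arm again.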