Multi-Armed Bandits
Greedy / ε-greedy / UCB / Softmax · Stationary + Non-stationary · Sample-average vs constant step-size

Controls

t: current step
action: arm selected this step
reward: reward just received
avg reward: running mean of all rewards so far
% optimal: running fraction of steps that chose the best arm
Tips: Use Step to walk through the logic one decision at a time. Use Play to show convergence, or to show a greedy agent getting stuck. The non-stationary setting with a constant step-size α demonstrates tracking better than the sample-average does.
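The demo's source isn't shown here, but the sample-average vs constant-α contrast it mentions comes down to one line in the update rule. A minimal sketch (function and parameter names are illustrative, not the demo's actual code), tracking a single drifting arm:

```python
import random

def run(alpha=None, steps=2000, drift=0.05, seed=0):
    """Track one drifting arm's value; return final estimation error |Q - q*|.
    alpha=None -> sample-average step (1/N); otherwise constant step-size."""
    rng = random.Random(seed)
    q_true, Q, N = 0.0, 0.0, 0
    for _ in range(steps):
        q_true += rng.gauss(0, drift)        # non-stationary: true value drifts
        r = q_true + rng.gauss(0, 1)         # noisy reward around q*
        N += 1
        step = (1 / N) if alpha is None else alpha
        Q += step * (r - Q)                  # incremental update, Sutton Eq. 2.4 form
    return abs(Q - q_true)
```

Averaged over several seeds, `run(alpha=0.1)` ends much closer to the drifting q* than `run(alpha=None)`: the constant step keeps weighting recent rewards, while the sample-average freezes onto the whole history.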

Batch comparison (Sutton Fig 2.2 style)

This runs many independent bandit problems and plots the averaged curves, like Sutton's 10-armed testbed (Fig. 2.2). It runs asynchronously, but large batches still take a moment.
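The batch logic is simple to sketch: generate many independent k-armed problems, run the same agent on each, and average the reward traces step by step. A minimal Python version under assumed defaults (ε-greedy, sample-average updates; all names are illustrative):

```python
import random

def bandit_run(k=10, steps=500, eps=0.1, rng=None):
    """One stationary k-armed bandit run with eps-greedy; returns the reward trace."""
    rng = rng or random.Random()
    q = [rng.gauss(0, 1) for _ in range(k)]       # true values, hidden in practice
    Q, N = [0.0] * k, [0] * k
    rewards = []
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit current estimates
        r = q[a] + rng.gauss(0, 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                 # sample-average update
        rewards.append(r)
    return rewards

def testbed(runs=200, **kw):
    """Average reward at each step across independent problems (Fig. 2.2 style)."""
    rng = random.Random(0)
    traces = [bandit_run(rng=random.Random(rng.random()), **kw) for _ in range(runs)]
    return [sum(col) / len(col) for col in zip(*traces)]
```

Averaging across runs is what smooths the noisy single-run curves into the familiar rising reward plot: `testbed()[-1]` is well above `testbed()[0]`.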

Visualization

  • Single run: Reward per step
  • Single run: % Optimal action (running)
  • Batch: Average reward
  • Batch: % Optimal action

Arms table (q*, Q, N)

Columns: Arm | q* (true value) | Q (current estimate) | N (pull count)
q* is hidden in real problems; it is shown here to make the exploration-vs-exploitation trade-off visually obvious.

What you should notice

  • Greedy can lock onto a suboptimal arm early (bad long-run performance).
  • ε-greedy keeps sampling: more ε → more exploration, but too much hurts short-term reward.
  • UCB explores “uncertain” arms more systematically than ε-random.
  • In non-stationary settings, constant α updates track drift better than sample-average.
  • Optimistic init encourages early exploration even with greedy.
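The UCB point above can be made concrete: instead of exploring at random, UCB adds an uncertainty bonus to each arm's estimate, so rarely tried arms win ties. A sketch of UCB1-style selection (names are illustrative, not the demo's code):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Pick the arm maximizing estimate + uncertainty bonus (UCB1 style).
    Q: value estimates, N: pull counts, t: current step, c: exploration weight.
    Arms never tried (N == 0) are forced first."""
    for a, n in enumerate(N):
        if n == 0:
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```

With equal estimates, the less-sampled arm gets the larger bonus and is chosen, e.g. `ucb_action([0.5, 0.5], [100, 5], t=105)` returns 1. That is the "systematic" exploration the bullet refers to: uncertainty, not a coin flip, decides when to try an arm again.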