Monte Carlo ES Blackjack (Example 5.3): exploring starts and the optimal policy

Controls

episodes: 0
avg return:
selected state:

Monte Carlo ES (Example 5.3)

Exploring starts samples a random initial state and first action for every episode, so every state-action pair is eventually visited. We learn the optimal action-value function Q*, then derive the greedy policy π* and the state-value surface V* from it.
Q(s, a) ← average of returns following visits to (s, a).
State: player sum (12–21), dealer showing (A–10), usable ace (yes/no).
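The update and state encoding above can be sketched as a minimal, self-contained simulator, assuming the standard Example 5.3 rules (dealer hits until reaching 17; rewards +1 win, 0 draw, -1 loss). Names such as `mc_es` and `play_dealer` are illustrative, not the demo's internals.

```python
import random
from collections import defaultdict

STICK, HIT = 0, 1

def draw():
    """Draw a card uniformly: 1..10, with face cards counting as 10."""
    return min(random.randint(1, 13), 10)

def add_card(total, usable, card):
    """Add a card; demote a usable ace (counted as 11) if the hand busts."""
    total += card
    if total > 21 and usable:
        total -= 10
        usable = False
    return total, usable

def play_dealer(showing):
    """Dealer draws until reaching 17 or more."""
    total, usable = (11, True) if showing == 1 else (showing, False)
    while total < 17:
        total, usable = add_card(total, usable, draw())
    return total

def episode(policy):
    """One exploring-starts episode: random start state and first action."""
    player = random.randint(12, 21)
    dealer_show = random.randint(1, 10)
    usable = random.choice([True, False])
    a = random.choice([STICK, HIT])
    visited = []
    while True:
        visited.append(((player, dealer_show, usable), a))
        if a == STICK:
            break
        player, usable = add_card(player, usable, draw())
        if player > 21:
            return visited, -1          # player busts
        a = policy[(player, dealer_show, usable)]
    dealer = play_dealer(dealer_show)
    if dealer > 21 or player > dealer:
        return visited, 1
    return visited, 0 if player == dealer else -1

def mc_es(n_episodes, seed=0):
    """Q(s,a) <- average of returns following visits to (s,a), greedy policy."""
    random.seed(seed)
    q, n = defaultdict(float), defaultdict(int)
    policy = defaultdict(int)           # unseen states default to stick
    for _ in range(n_episodes):
        visited, g = episode(policy)
        for s, a in visited:
            n[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / n[(s, a)]   # incremental mean
            policy[s] = max((STICK, HIT), key=lambda act: q[(s, act)])
    return q, policy
```

Because blackjack states never repeat within an episode, every visit is a first visit, so the simple incremental mean matches the first-visit averaging described above.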

Selected-state action values

Click a state to inspect its Q(s, hit) and Q(s, stick).
Select a state to view values.

Visualization

Usable ace: optimal policy π*
No usable ace: optimal policy π*
Legend: stick / hit
Usable ace: V* surface (3D projection)
No usable ace: V* surface (3D projection)
Learning curve: average return
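The policy and value-surface panels are both derived from the learned Q table: π*(s) = argmax_a Q(s, a) and V*(s) = max_a Q(s, a). A minimal sketch, assuming the state encoding described earlier (the function name and dict layout are illustrative):

```python
def derive_pi_and_v(q):
    """Greedy policy pi*(s) = argmax_a Q(s,a) and V*(s) = max_a Q(s,a)."""
    pi, v = {}, {}
    for player in range(12, 22):
        for dealer in range(1, 11):
            for usable in (False, True):
                s = (player, dealer, usable)
                qs = [q.get((s, a), 0.0) for a in (0, 1)]  # 0=stick, 1=hit
                pi[s] = max((0, 1), key=lambda a: qs[a])
                v[s] = max(qs)
    return pi, v

# toy table: sticking on 21 is clearly better than hitting
q = {((21, 5, False), 0): 0.9, ((21, 5, False), 1): -1.0}
pi, v = derive_pi_and_v(q)
```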