Monte Carlo ES Blackjack (Example 5.3): exploring starts and the optimal policy

Controls

episodes: 0
avg return:
selected state:

Monte Carlo ES (Example 5.3)

Exploring starts samples a random initial state and first action for every episode, so every state-action pair is eventually visited. We learn the optimal action-value function Q*, then derive the greedy policy π* and the state-value surface V* from it.
Q(s, a) ← average of returns following visits to (s, a).
State: player sum (12–21), dealer showing (A–10), usable ace (yes/no).
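The update and state encoding above can be sketched as a minimal, self-contained simulator, assuming the standard Example 5.3 rules (dealer hits until reaching 17; rewards +1 win, 0 draw, -1 loss). Names such as `mc_es` and `play_dealer` are illustrative, not the demo's internals.

```python
import random
from collections import defaultdict

STICK, HIT = 0, 1

def draw():
    """Draw a card uniformly: 1..10, with face cards counting as 10."""
    return min(random.randint(1, 13), 10)

def add_card(total, usable, card):
    """Add a card; demote a usable ace (counted as 11) if the hand busts."""
    total += card
    if total > 21 and usable:
        total -= 10
        usable = False
    return total, usable

def play_dealer(showing):
    """Dealer draws until reaching 17 or more."""
    total, usable = (11, True) if showing == 1 else (showing, False)
    while total < 17:
        total, usable = add_card(total, usable, draw())
    return total

def episode(policy):
    """One exploring-starts episode: random start state and first action."""
    player = random.randint(12, 21)
    dealer_show = random.randint(1, 10)
    usable = random.choice([True, False])
    a = random.choice([STICK, HIT])
    visited = []
    while True:
        visited.append(((player, dealer_show, usable), a))
        if a == STICK:
            break
        player, usable = add_card(player, usable, draw())
        if player > 21:
            return visited, -1          # player busts
        a = policy[(player, dealer_show, usable)]
    dealer = play_dealer(dealer_show)
    if dealer > 21 or player > dealer:
        return visited, 1
    return visited, 0 if player == dealer else -1

def mc_es(n_episodes, seed=0):
    """Q(s,a) <- average of returns following visits to (s,a), greedy policy."""
    random.seed(seed)
    q, n = defaultdict(float), defaultdict(int)
    policy = defaultdict(int)           # unseen states default to stick
    for _ in range(n_episodes):
        visited, g = episode(policy)
        for s, a in visited:
            n[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / n[(s, a)]   # incremental mean
            policy[s] = max((STICK, HIT), key=lambda act: q[(s, act)])
    return q, policy
```

Because blackjack states never repeat within an episode, every visit is a first visit, so the simple incremental mean matches the first-visit averaging described above.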

Selected-state action values

Click a state to inspect its Q(s, hit) and Q(s, stick).
Select a state to view values.

Visualization

Usable ace: optimal policy π*
No usable ace: optimal policy π*
Legend: stick / hit
Usable ace: V* surface (3D projection)
No usable ace: V* surface (3D projection)
Learning curve: average return
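The policy and value-surface panels are both derived from the learned Q table: π*(s) = argmax_a Q(s, a) and V*(s) = max_a Q(s, a). A minimal sketch, assuming the state encoding described earlier (the function name and dict layout are illustrative):

```python
def derive_pi_and_v(q):
    """Greedy policy pi*(s) = argmax_a Q(s,a) and V*(s) = max_a Q(s,a)."""
    pi, v = {}, {}
    for player in range(12, 22):
        for dealer in range(1, 11):
            for usable in (False, True):
                s = (player, dealer, usable)
                qs = [q.get((s, a), 0.0) for a in (0, 1)]  # 0=stick, 1=hit
                pi[s] = max((0, 1), key=lambda a: qs[a])
                v[s] = max(qs)
    return pi, v

# toy table: sticking on 21 is clearly better than hitting
q = {((21, 5, False), 0): 0.9, ((21, 5, False), 1): -1.0}
pi, v = derive_pi_and_v(q)
```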