Monte Carlo ES Blackjack
Example 5.3 · Exploring starts · Optimal policy
Controls
episodes: 0
avg return: –
selected state: –
Monte Carlo ES (Example 5.3)
With exploring starts, each episode begins in a randomly chosen state with a
randomly chosen first action, so every state-action pair is eventually explored.
We learn an optimal action-value function Q*, then derive the greedy policy π*
and the state-value surface V*.
Q(s, a) ← average of returns following visits to (s, a).
State: player sum (12–21), dealer showing (A–10), usable ace (yes/no).
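The loop above can be sketched in plain Python. This is a minimal, simplified take on Monte Carlo ES for blackjack, not the page's actual implementation: all helper names (`draw_card`, `hand_value`, `play_episode`, `mc_es`) are made up for illustration, infinite-deck dealing is assumed, and Q is updated with an incremental first-visit average.

```python
import random
from collections import defaultdict

HIT, STICK = 1, 0

def draw_card(rng):
    # Infinite deck: ranks 1..13, face cards count as 10.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    # An ace (1) counts as 11 when that does not bust the hand ("usable ace").
    total = sum(cards)
    return (total + 10, True) if (1 in cards and total + 10 <= 21) else (total, False)

def play_episode(policy, rng):
    # Exploring start: random state (player sum 12-21, dealer card, usable ace)
    # and a random first action; thereafter follow the current greedy policy.
    player_sum = rng.randint(12, 21)
    dealer_show = rng.randint(1, 10)
    usable = rng.random() < 0.5
    action = rng.choice((HIT, STICK))
    trajectory = []
    while True:
        trajectory.append(((player_sum, dealer_show, usable), action))
        if action == STICK:
            break
        card = draw_card(rng)
        player_sum += card
        if player_sum > 21 and usable:      # demote the usable ace from 11 to 1
            player_sum -= 10
            usable = False
        if player_sum > 21:
            return trajectory, -1.0          # player busts
        action = policy.get((player_sum, dealer_show, usable), HIT)
    # Dealer plays a fixed policy: hit until reaching 17 or more.
    dealer_cards = [dealer_show, draw_card(rng)]
    while hand_value(dealer_cards)[0] < 17:
        dealer_cards.append(draw_card(rng))
    dealer_sum = hand_value(dealer_cards)[0]
    if dealer_sum > 21:
        return trajectory, 1.0               # dealer busts
    return trajectory, float((player_sum > dealer_sum) - (player_sum < dealer_sum))

def mc_es(num_episodes, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)       # Q(s, a): running average of returns
    N = defaultdict(int)         # visit counts per (s, a)
    policy = {}                  # greedy policy derived from Q
    for _ in range(num_episodes):
        trajectory, ret = play_episode(policy, rng)
        for s, a in set(trajectory):         # first-visit update
            N[(s, a)] += 1
            Q[(s, a)] += (ret - Q[(s, a)]) / N[(s, a)]
            policy[s] = max((STICK, HIT), key=lambda act: Q[(s, act)])
    return Q, policy
```

Calling `mc_es(500_000)` should reproduce the familiar shape of the optimal blackjack policy, e.g. sticking on 20 and 21 regardless of the dealer's card.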
Selected-state action values
Click a state to inspect its Q(s, hit) and Q(s, stick).