Monte Carlo Methods: Blackjack First-/Every-Visit Policy Evaluation

Blackjack policy (Example 5.1)

We evaluate the policy that sticks on 20 or 21 and otherwise hits. Each episode is a blackjack game against a fixed dealer who sticks on 17+. Rewards are +1 (win), −1 (lose), and 0 (draw); the return is the terminal reward.
V(s) ← average of returns G following visits to s (γ = 1). First-visit MC averages only the return after the first visit to s in each episode; every-visit MC averages the returns after all visits.
State variables: player sum (12–21), dealer showing (A–10), usable ace (yes/no).
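The evaluation described above can be sketched in plain Python. This is a minimal, self-contained illustration, not the page's actual implementation: cards are drawn with replacement from an infinite deck (a common simplification), and the helper names (`play_episode`, `mc_evaluate`) are my own.

```python
import random

def draw_card(rng):
    # Infinite-deck assumption: ranks 1..13, face cards count as 10, ace drawn as 1.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    # Returns (best sum, usable_ace): one ace counts as 11 if that does not bust.
    s = sum(cards)
    if 1 in cards and s + 10 <= 21:
        return s + 10, True
    return s, False

def play_episode(rng):
    """One game under the policy 'stick on 20 or 21, otherwise hit'.
    Returns (visited_states, terminal_reward), where a state is
    (player_sum, dealer_showing, usable_ace) with player_sum in 12..21."""
    player = [draw_card(rng), draw_card(rng)]
    dealer = [draw_card(rng), draw_card(rng)]
    dealer_show = dealer[0]
    states = []
    # Player's turn.
    while True:
        psum, usable = hand_value(player)
        if psum > 21:
            return states, -1                 # player busts
        if psum >= 12:                        # sums below 12 are trivial: always hit
            states.append((psum, dealer_show, usable))
        if psum >= 20:                        # policy: stick on 20 or 21
            break
        player.append(draw_card(rng))
    # Dealer's turn: fixed policy, stick on 17+.
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card(rng))
    dsum, _ = hand_value(dealer)
    psum, _ = hand_value(player)
    if dsum > 21 or psum > dsum:
        return states, +1
    if psum < dsum:
        return states, -1
    return states, 0

def mc_evaluate(n_episodes, first_visit=True, seed=0):
    """Tabular MC policy evaluation: V(s) = mean of returns following visits to s.
    With gamma = 1 and only a terminal reward, the return G is just that reward."""
    rng = random.Random(seed)
    returns_sum, returns_count = {}, {}
    for _ in range(n_episodes):
        states, g = play_episode(rng)
        seen = set()
        for s in states:
            if first_visit and s in seen:
                continue                      # first-visit: count s once per episode
            seen.add(s)
            returns_sum[s] = returns_sum.get(s, 0.0) + g
            returns_count[s] = returns_count.get(s, 0) + 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

As Sutton and Barto note, blackjack states never recur within an episode, so first-visit and every-visit averaging coincide here; the flag is kept to show where the two methods differ in general.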

Selected-state returns

Click a state on either value surface to inspect its recent returns (table columns: episode, state, return G).

Visualization

[Two state-value surfaces: estimates with a usable ace and without a usable ace, color scale −1 to +1, plus a learning curve of average return over episodes.]