Monte Carlo Methods
Blackjack policy (Example 5.1)
We evaluate the policy that sticks on 20 or 21 and hits otherwise. Each
episode is one blackjack game against a fixed dealer who sticks on any sum of
17 or greater. Rewards are +1 (win), −1 (loss), and 0 (draw). All rewards
within a game are zero, so with γ = 1 the return equals the terminal reward.
V(s) ← average of returns G following visits to s (γ = 1).
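The averaging rule above can be sketched as a running mean per state. This is a minimal illustration, not the page's actual implementation: the function name `mc_evaluate`, the `first_visit` flag, and the episode format (a list of visited states plus one return G) are assumptions made for the example. It relies on γ = 1 and a single terminal reward, so every visit in an episode is followed by the same return G.

```python
from collections import defaultdict


def mc_evaluate(episodes, first_visit=True):
    """Estimate V(s) as the average return following visits to s.

    `episodes` is an iterable of (states, G) pairs (a hypothetical format):
    `states` lists the states visited in order and G is the episode return.
    With gamma = 1 and only a terminal reward, G follows every visit.
    """
    value = defaultdict(float)  # running average of returns per state
    counts = defaultdict(int)   # number of returns averaged per state
    for states, G in episodes:
        # First-visit: count each distinct state once per episode.
        # Every-visit: count every occurrence.
        visited = set(states) if first_visit else states
        for s in visited:
            counts[s] += 1
            # Incremental mean: V <- V + (G - V) / N
            value[s] += (G - value[s]) / counts[s]
    return dict(value), dict(counts)
```

In blackjack the same state cannot recur within one game, so the first-visit and every-visit estimates coincide; the flag only matters for tasks where states repeat within an episode.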
State variables: player sum (12–21), dealer showing (A–10), usable ace (yes/no).
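To make the state variables concrete, here is a hedged sketch of one episode under the policy described above. The helper names (`draw_card`, `hand_value`, `play_episode`) are hypothetical, and several rules are assumptions in the spirit of Example 5.1 rather than details stated on this page: cards come from an infinite deck, the player always hits below 12 (those sums are not states), and a two-card 21 for the player wins unless the dealer also holds a two-card 21.

```python
import random


def draw_card(rng):
    # Assumed infinite deck: ace counts as 1 here, face cards as 10.
    return min(rng.randint(1, 13), 10)


def hand_value(cards):
    # An ace is "usable" if counting it as 11 does not bust the hand.
    total = sum(cards)
    usable = 1 in cards and total + 10 <= 21
    return (total + 10 if usable else total), usable


def play_episode(rng):
    """Play one game; return (visited states, terminal reward)."""
    player = [draw_card(rng), draw_card(rng)]
    dealer = [draw_card(rng), draw_card(rng)]
    showing = dealer[0]  # dealer's face-up card, 1 (ace) through 10

    # Assumed rule: a two-card 21 wins unless the dealer has one too.
    if hand_value(player)[0] == 21:
        reward = 0 if hand_value(dealer)[0] == 21 else 1
        return [(21, showing, True)], reward

    # Below 12 a hit can never bust, so these sums are not decision states.
    while hand_value(player)[0] < 12:
        player.append(draw_card(rng))

    states = []
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return states, -1  # player busts
        states.append((total, showing, usable))
        if total >= 20:
            break  # policy: stick on 20 or 21
        player.append(draw_card(rng))  # policy: otherwise hit

    # Dealer hits until reaching 17 or more.
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card(rng))
    dealer_total = hand_value(dealer)[0]

    if dealer_total > 21 or total > dealer_total:
        return states, 1
    return states, (0 if total == dealer_total else -1)
```

Each state is a `(player_sum, dealer_showing, usable_ace)` tuple, so there are 10 × 10 × 2 = 200 states in total. Passing a seeded generator such as `random.Random(0)` makes runs reproducible.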