Monte Carlo Methods: Blackjack First-/Every-Visit Policy Evaluation

Blackjack policy (Example 5.1)

We evaluate the policy that sticks on 20 or 21 and otherwise hits. Each episode is a blackjack game against a fixed dealer who sticks on 17+. Rewards are +1 (win), −1 (lose), and 0 (draw); the return is the terminal reward.
V(s) ← average of returns G following visits to s (γ = 1). First-visit MC averages only the return after the first visit to s in each episode; every-visit MC averages the returns after all visits.
State variables: player sum (12–21), dealer showing (A–10), usable ace (yes/no).
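The evaluation described above can be sketched in plain Python. This is a minimal, self-contained illustration, not the page's actual implementation: cards are drawn with replacement from an infinite deck (a common simplification), and the helper names (`play_episode`, `mc_evaluate`) are my own.

```python
import random

def draw_card(rng):
    # Infinite-deck assumption: ranks 1..13, face cards count as 10, ace drawn as 1.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    # Returns (best sum, usable_ace): one ace counts as 11 if that does not bust.
    s = sum(cards)
    if 1 in cards and s + 10 <= 21:
        return s + 10, True
    return s, False

def play_episode(rng):
    """One game under the policy 'stick on 20 or 21, otherwise hit'.
    Returns (visited_states, terminal_reward), where a state is
    (player_sum, dealer_showing, usable_ace) with player_sum in 12..21."""
    player = [draw_card(rng), draw_card(rng)]
    dealer = [draw_card(rng), draw_card(rng)]
    dealer_show = dealer[0]
    states = []
    # Player's turn.
    while True:
        psum, usable = hand_value(player)
        if psum > 21:
            return states, -1                 # player busts
        if psum >= 12:                        # sums below 12 are trivial: always hit
            states.append((psum, dealer_show, usable))
        if psum >= 20:                        # policy: stick on 20 or 21
            break
        player.append(draw_card(rng))
    # Dealer's turn: fixed policy, stick on 17+.
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card(rng))
    dsum, _ = hand_value(dealer)
    psum, _ = hand_value(player)
    if dsum > 21 or psum > dsum:
        return states, +1
    if psum < dsum:
        return states, -1
    return states, 0

def mc_evaluate(n_episodes, first_visit=True, seed=0):
    """Tabular MC policy evaluation: V(s) = mean of returns following visits to s.
    With gamma = 1 and only a terminal reward, the return G is just that reward."""
    rng = random.Random(seed)
    returns_sum, returns_count = {}, {}
    for _ in range(n_episodes):
        states, g = play_episode(rng)
        seen = set()
        for s in states:
            if first_visit and s in seen:
                continue                      # first-visit: count s once per episode
            seen.add(s)
            returns_sum[s] = returns_sum.get(s, 0.0) + g
            returns_count[s] = returns_count.get(s, 0) + 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

As Sutton and Barto note, blackjack states never recur within an episode, so first-visit and every-visit averaging coincide here; the flag is kept to show where the two methods differ in general.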

Selected-state returns

Click a state on either value surface to inspect its recent returns (table columns: episode, state, return G).

Visualization

[Two state-value surfaces: estimates with a usable ace and without a usable ace, color scale −1 to +1, plus a learning curve of average return over episodes.]