Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lesson 4: Monte Carlo CFR (MCCFR)

Module/Source: Lanctot et al. (2009) "Monte Carlo Sampling for Regret Minimization in Extensive Games" (NeurIPS 2009) — the paper that introduced and analyzed outcome sampling and external sampling MCCFR. Gibson et al. (2012) "Generalized Sampling and Variance in Counterfactual Regret Minimization" for variance analysis. Brown and Sandholm (2019) "Solving Imperfect-Information Games via Discounted Regret Minimization" for discounted and linear CFR variants. Background on importance sampling: Monte Carlo Statistical Methods (Robert and Casella, 2004). Game theory foundations: Osborne (2004) Chapters 6–7; Zinkevich et al. (2007).

Where this fits

Vanilla CFR (lesson 3) is correct but slow: every iteration traverses the entire game tree. For games beyond a certain size, this is hopeless. MCCFR replaces the full tree traversal with a sampled traversal, just like Monte Carlo (Module 1, lesson 3) replaces an intractable expectation with a sample-based estimate. The trade-off is the same: noisier per-iteration updates, but many more iterations are possible. MCCFR is the workhorse algorithm for medium-sized games and the foundation of deep CFR (next lesson).

The bottleneck of vanilla CFR

For a game tree with N nodes, vanilla CFR does O(N) work per iteration. To converge to ε-Nash, it needs O(1/ε²) iterations. Total work: O(N/ε²).

Concrete numbers for poker:

  • No-limit Hold'em: ~10^14 information sets
  • One iteration: at least 10^14 operations
  • For ε = 0.01: at least 10^14 × 10^4 = 10^18 operations total

This is infeasible. MCCFR aims to dramatically reduce per-iteration cost by sampling, accepting higher per-iteration variance in exchange.

The two main MCCFR variants

Outcome sampling

In outcome sampling, each iteration samples one complete trajectory (root to terminal) and updates regrets only along that trajectory. At each chance node, sample one outcome from the chance distribution. At each player decision, sample one action from the player's current strategy.

Per-iteration cost: O(D) where D is game depth (typically much less than N, the total tree size).

The variance is high because each iteration only updates one trajectory's worth of information sets. But you can run many more iterations per unit of compute.

External sampling

In external sampling (the most popular MCCFR variant), at each iteration:

  • For one player (the "traverser"), explore all of their actions at every information set
  • For the other player and chance nodes, sample one action

This explores more of the tree than outcome sampling but less than vanilla CFR. The per-iteration cost is intermediate.

External sampling has lower variance than outcome sampling and converges faster in practice. It is what most production CFR implementations use.

Outcome sampling in detail

Here is the algorithm for one iteration of outcome sampling:

1. Sample a complete trajectory by:
   - At chance nodes: sample one outcome from the chance distribution
   - At player nodes: sample one action from the player's current strategy
   
2. Walk back through the trajectory:
   At each information set I belonging to player i along the trajectory:
       Compute the regret for actions other than the one sampled
       (using counterfactual values estimated from the sample)
       Update R(I, a) by adding the regret divided by the sampling probability
       Update the strategy at I via regret matching
       Update the average strategy at I

The key trick is the importance weighting: divide the regret update by the probability of having sampled this trajectory. This ensures the estimator is unbiased: the expected update equals the vanilla CFR update.

A simplified outcome-sampling implementation

import numpy as np
from collections import defaultdict
import random

class OutcomeSamplingMCCFR:
    def __init__(self, game):
        self.game = game
        self.regrets       = defaultdict(lambda: np.zeros(game.num_actions()))
        self.strategy_sum  = defaultdict(lambda: np.zeros(game.num_actions()))
    
    def get_strategy(self, info_set, num_actions):
        regrets = self.regrets[info_set]
        positive = np.maximum(regrets, 0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.ones(num_actions) / num_actions
    
    def cfr_iteration(self, history, sample_prob, traversing_player):
        """
        Recursive outcome-sampling iteration.
        history: current state
        sample_prob: probability of having sampled this path
        traversing_player: which player we're updating regrets for this iteration
        Returns: (utility, sample_probability) tuple from this point onward
        """
        if history.is_terminal():
            return history.returns()[traversing_player], 1.0
        
        if history.is_chance_node():
            outcomes = history.chance_outcomes()
            actions, probs = zip(*outcomes)
            sampled_idx = np.random.choice(len(actions), p=probs)
            action = actions[sampled_idx]
            sampled_prob = probs[sampled_idx]
            next_history = history.apply(action)
            utility, downstream_prob = self.cfr_iteration(
                next_history, sample_prob * sampled_prob, traversing_player
            )
            return utility, sampled_prob * downstream_prob
        
        player = history.current_player()
        info_set = history.info_set()
        legal = history.legal_actions()
        strategy = self.get_strategy(info_set, len(legal))
        
        if player == traversing_player:
            # We need to estimate regrets for ALL actions, but only sample one.
            # Use importance weighting. For simplicity, this version just samples
            # one action and updates that one's regret estimate.
            sampled_idx = np.random.choice(len(legal), p=strategy)
            sampled_prob = strategy[sampled_idx]
            
            next_history = history.apply(legal[sampled_idx])
            utility, downstream_prob = self.cfr_iteration(
                next_history, sample_prob * sampled_prob, traversing_player
            )
            
            # Compute estimated regret for the sampled action
            # (in full outcome sampling, this is more sophisticated)
            estimated_regret = utility / (sample_prob * sampled_prob)
            
            # Simplified update: increment regret for sampled action, decrement for others
            for i in range(len(legal)):
                if i == sampled_idx:
                    self.regrets[info_set][i] += estimated_regret * (1 - strategy[i])
                else:
                    self.regrets[info_set][i] -= estimated_regret * strategy[i]
            
            # Average strategy
            for i in range(len(legal)):
                self.strategy_sum[info_set][i] += strategy[i]
            
            return utility, sampled_prob * downstream_prob
        else:
            # Other player: just sample their action and continue
            sampled_idx = np.random.choice(len(legal), p=strategy)
            sampled_prob = strategy[sampled_idx]
            next_history = history.apply(legal[sampled_idx])
            utility, downstream_prob = self.cfr_iteration(
                next_history, sample_prob * sampled_prob, traversing_player
            )
            return utility, sampled_prob * downstream_prob
    
    def run(self, iterations=100_000):
        for it in range(iterations):
            # Alternate which player we update each iteration
            player = it % self.game.num_players()
            initial = self.game.new_initial_state()
            self.cfr_iteration(initial, 1.0, player)

A note on this implementation: it is a simplified illustration. Real outcome-sampling MCCFR has additional bookkeeping for the importance weights and probability factorizations that make the estimator unbiased. The full algorithm is in the Lanctot et al. 2009 paper "Monte Carlo Sampling for Regret Minimization in Extensive Games."

External sampling

External sampling is more commonly implemented because it has better empirical convergence. The structure:

For each iteration:
    For each player p (alternating each iteration):
        Traverse the tree
        At chance nodes: sample one outcome
        At nodes belonging to player p: explore all actions
        At nodes belonging to other players: sample one action
        Update regrets at all visited p-nodes

Per-iteration cost is O(branching_factor^depth_for_traverser × 1^depth_for_others).

External sampling has the advantage that the regret estimates are exact for the traverser at each information set visited (no importance weighting needed for the traverser's actions).

The variance-vs-speed tradeoff

MethodPer-iteration costIterations neededTotal cost
Vanilla CFRO(N)O(1/ε²)O(N/ε²)
External samplingO(N/p^k)O(C/ε²) for C > 1varies
Outcome samplingO(D)O(D/ε²)O(D²/ε²)

(N = tree size, D = depth, p = player branching factor, k = path length, ε = target accuracy.)

Outcome sampling is the cheapest per iteration but needs the most iterations. External sampling is in the middle. Vanilla is most expensive per iteration but converges with the fewest.

For most large games, external sampling wins overall because the per-iteration savings outweigh the additional iterations. For huge games, outcome sampling combined with deep neural networks (deep CFR) is the state of the art.

When MCCFR works and when it does not

Works well when:

  • The game tree is too large for vanilla CFR
  • You can sample efficiently from chance distributions and strategies
  • You have a CPU/GPU budget that supports many iterations

Struggles when:

  • The game has very rare but high-value information sets that are unlikely to be sampled
  • You need very tight convergence (ε very small)
  • Memory for the regret table is the bottleneck (next lesson addresses this)

For our SSA conjunction game, MCCFR is overkill: vanilla CFR will converge fast enough. But understanding MCCFR is essential for understanding how the algorithm scales.

Variants and improvements

Probing variants (CFR+): only update positive regret estimates, smoother convergence behavior.

Discounted CFR: weight more recent iterations more heavily, helps with non-stationary regret patterns.

CFR with Linear weighting: weight recent iterations linearly more than earlier ones. Often dramatically faster than vanilla.

For the project, you can stick with vanilla CFR. But know that production CFR implementations always use one of these improved variants in practice.

Outcome sampling vs. external sampling: a closer look

Which nodes are sampled

The two variants differ in exactly which nodes they visit per iteration:

Outcome sampling visits a single root-to-leaf path. At every node — whether chance, traverser, or opponent — a single action is sampled. The result is one complete play-through of the game per iteration.

  • Visited nodes: where is the maximum depth of the game tree
  • Updated information sets: only the traverser's information sets that appear on the sampled path
  • Unvisited information sets: receive no update this iteration, regardless of how important they are

External sampling visits more of the tree. At opponent and chance nodes, one action is sampled. But at the traverser's nodes, all actions are explored.

  • Visited nodes: where is the traverser's branching factor and is the depth of the traverser's decisions
  • Updated information sets: all traverser information sets reachable under the sampled opponent/chance play
  • Every traverser information set reachable in this trajectory is updated with an exact regret estimate

Variance tradeoffs

Outcome sampling has high variance because a single trajectory is an extremely noisy estimate of the counterfactual values. An information set that is on the sampled path receives a large update; an information set just one step away receives nothing.

External sampling has lower variance for the traverser's estimates because all of the traverser's actions are explored exactly. The only noise comes from the sampled opponent/chance play. Across many iterations, the opponent play is sampled uniformly, providing an unbiased estimate of the counterfactual reach probabilities.

When each is preferred

ScenarioPreferred variantReason
Very deep game trees (depth > 100)Outcome samplingExternal sampling's per-iteration cost grows with branching factor^depth
Many information sets per depth levelExternal samplingFewer iterations needed to cover all traverser info sets
High-variance game (rare high-payoff outcomes)External samplingVariance from rare outcomes is absorbed by exact traversal of traverser nodes
Memory bottleneck (can't store all info sets)Outcome sampling (with Deep CFR)Enables sampling-based neural network training
Real-time decision under a tight time budgetOutcome samplingConstant per-iteration cost regardless of game size

In SSA contexts: for the satellite-vs-jammer hide-and-seek game with many frequency bands but a short decision horizon (depth ~ 5–10), external sampling works well. For a long-horizon ISR sensor scheduling game (depth ~ 50–100), outcome sampling is more practical.

The importance sampling correction

Why sampled outcomes need reweighting

When outcome sampling visits only a subset of the game tree, it must correct for the fact that some trajectories are sampled more frequently than others. Without correction, the algorithm would overweight common paths and underweight rare-but-important ones.

The correction factor is the importance weight: the ratio of the actual probability of an outcome to the sampling probability used to select it.

The w/q factor

In outcome sampling MCCFR, when a trajectory is sampled with probability under the current sampling distribution, the regret estimate for action at information set on that trajectory is:

Decoding:

  • : counterfactual reach of information set (opponents' and chance's contribution)
  • : the utility at the sampled terminal node if we had taken action at (holding the rest of the trajectory constant)
  • : the utility at the sampled terminal under the current strategy
  • : the probability of sampling this particular trajectory

The division by is the importance weight. It corrects for the fact that if we sample a trajectory with probability but it would naturally occur with probability , the update should be scaled by to make it unbiased. In outcome sampling, includes the player's own strategy probabilities, so this factor partially cancels with the strategy probabilities in .

Variance can increase with importance weighting

Importance weighting is unbiased in expectation, but it can dramatically increase variance. Consider:

  • A trajectory that naturally occurs with probability (very rare)
  • Under the sampling distribution, it is sampled with probability (10× undersampled)
  • The importance weight is
  • If this trajectory's terminal payoff is , the weighted update is — much larger than the actual effect on expected payoff

This variance amplification is the fundamental tension in MCCFR: sampling rare trajectories infrequently is efficient per iteration, but the resulting high-variance updates mean more iterations are needed for the estimates to stabilize.

import numpy as np

def importance_sampling_variance_demo(
    p_rare: float,
    q_rare: float,
    u_rare: float,
    u_common: float,
    n_samples: int = 10_000
):
    """
    Demonstrate variance amplification in importance-weighted estimators.

    We want to estimate E[U] = p_rare * u_rare + (1 - p_rare) * u_common.

    Two estimators:
    1. Direct sampling: sample from p, compute average.
    2. Importance-weighted sampling from q: correct with w = p/q.

    Args:
        p_rare: true probability of the rare event
        q_rare: sampling probability of the rare event (if q_rare < p_rare, undersampling)
        u_rare: utility of the rare event
        u_common: utility of the common event
        n_samples: number of samples
    """
    true_mean = p_rare * u_rare + (1 - p_rare) * u_common

    # Direct sampling
    samples_direct = np.where(
        np.random.random(n_samples) < p_rare,
        u_rare, u_common
    )
    direct_mean = samples_direct.mean()
    direct_var  = samples_direct.var()

    # Importance-weighted sampling from q
    is_samples = []
    for _ in range(n_samples):
        if np.random.random() < q_rare:
            # Sampled the rare event; importance weight p_rare / q_rare
            w = p_rare / q_rare
            is_samples.append(w * u_rare)
        else:
            w = (1 - p_rare) / (1 - q_rare)
            is_samples.append(w * u_common)
    is_samples = np.array(is_samples)
    is_mean = is_samples.mean()
    is_var  = is_samples.var()

    print(f"True mean: {true_mean:.4f}")
    print(f"Direct sampling:  mean={direct_mean:.4f}, variance={direct_var:.4f}")
    print(f"IS sampling:      mean={is_mean:.4f},    variance={is_var:.4f}")
    print(f"Variance ratio (IS / direct): {is_var / direct_var:.2f}x")


# SSA scenario: rare conjunction event (occurs 0.1% of time)
# Under IS, we sample it 10x less often (0.01%) to save computation
# Utility of conjunction if unhandled: -100 (catastrophic)
# Utility of nominal operations: +1
print("=== Rare conjunction event (IS undersamples by 10x) ===")
importance_sampling_variance_demo(
    p_rare=0.001, q_rare=0.0001, u_rare=-100.0, u_common=1.0, n_samples=50_000
)

print("\n=== Rare event (IS matches true probability, no correction needed) ===")
importance_sampling_variance_demo(
    p_rare=0.001, q_rare=0.001, u_rare=-100.0, u_common=1.0, n_samples=50_000
)
extern crate rand;
// rand = "0.10"
use rand::{Rng, RngExt, SeedableRng};

fn is_variance_demo(
    rng: &mut impl Rng,
    p_rare: f64, q_rare: f64,
    u_rare: f64, u_common: f64,
    n: usize,
) {
    let true_mean = p_rare * u_rare + (1.0 - p_rare) * u_common;

    // Direct sampling from the true distribution
    let direct: Vec<f64> = (0..n)
        .map(|_| if rng.random::<f64>() < p_rare { u_rare } else { u_common })
        .collect();
    let d_mean = direct.iter().sum::<f64>() / n as f64;
    let d_var  = direct.iter().map(|&x| (x - d_mean).powi(2)).sum::<f64>() / n as f64;

    // Importance-weighted sampling from q (may differ from p)
    let is: Vec<f64> = (0..n)
        .map(|_| {
            if rng.random::<f64>() < q_rare {
                (p_rare / q_rare) * u_rare           // importance weight p/q applied
            } else {
                ((1.0 - p_rare) / (1.0 - q_rare)) * u_common
            }
        })
        .collect();
    let is_mean = is.iter().sum::<f64>() / n as f64;
    let is_var  = is.iter().map(|&x| (x - is_mean).powi(2)).sum::<f64>() / n as f64;

    println!("True mean: {:.4}", true_mean);
    println!("Direct:  mean={:.4}  var={:.2}", d_mean, d_var);
    println!("IS:      mean={:.4}  var={:.2}", is_mean, is_var);
    println!("Variance ratio: {:.1}x", is_var / d_var);
}

fn main() {
    let mut rng = rand::rngs::SmallRng::seed_from_u64(42);

    println!("=== Rare conjunction event (IS undersamples by 10x) ===");
    is_variance_demo(&mut rng, 0.001, 0.0001, -100.0, 1.0, 50_000);

    println!("\n=== IS matches true probability (no correction needed) ===");
    is_variance_demo(&mut rng, 0.001, 0.001, -100.0, 1.0, 50_000);
}

The output demonstrates that when , variance increases proportionally to . Undersampling rare high-payoff events by 10× multiplies variance by up to 100×. This is why MCCFR practitioners are careful about the sampling distribution and why some variants use ε-greedy sampling (mixing the strategy with a small uniform component) to ensure all actions get sampled with a minimum probability.

Convergence bounds for MCCFR

The T^{-1/2} bound still holds

Like vanilla CFR, both outcome sampling and external sampling MCCFR converge at rate :

The convergence rate exponent is the same. This might seem surprising — sampling introduces variance, which should slow convergence. The reason the rate exponent is preserved: regret matching already achieves convergence regardless of the noise level in the per-iteration update, as long as the estimates are unbiased and have finite variance.

But the constant is larger

The crucial difference is in the constant . For vanilla CFR, the constant depends on the game structure (payoff ranges, number of information sets). For MCCFR:

where is the variance introduced by the sampling procedure. This variance depends on:

  • Outcome sampling: variance proportional to , where is the payoff range and is the minimum probability of sampling any terminal node.
  • External sampling: lower variance because the traverser's regrets are estimated exactly; variance only comes from the opponent's sampled actions.

Practical implication for iteration count

Let be the variance ratio. For a given target exploitability :

If outcome sampling has (estimated 10× higher constant due to variance), MCCFR needs 100× more iterations to match vanilla CFR's convergence. But since each MCCFR iteration is vs. for vanilla CFR:

For typical games, , so MCCFR wins overall. For shallow games with high branching (e.g., a 3-action game with depth 5, , ), the advantage is:

Only a 2× improvement — vanilla CFR is competitive. For a deeper game (depth 20, ):

A massive advantage for MCCFR. The deeper the game, the more MCCFR dominates.

import numpy as np

def mccfr_vs_vanilla_cost_comparison(
    game_branching: int,
    game_depth: int,
    variance_ratio_rho: float,
    target_epsilon: float,
    vanilla_constant: float = 1.0,
):
    """
    Compare total computational cost of vanilla CFR vs. MCCFR.

    Args:
        game_branching: average branching factor
        game_depth: game tree depth
        variance_ratio_rho: C_MCCFR / C_vanilla (how much variance MCCFR adds)
        target_epsilon: desired exploitability
        vanilla_constant: the C_vanilla constant in epsilon = C/sqrt(T)
    """
    N = game_branching ** game_depth  # approximate tree size
    D = game_depth

    mccfr_constant = vanilla_constant * variance_ratio_rho

    T_vanilla = (vanilla_constant / target_epsilon) ** 2
    T_mccfr   = (mccfr_constant   / target_epsilon) ** 2

    cost_vanilla = T_vanilla * N
    cost_mccfr   = T_mccfr   * D

    print(f"Game: branching={game_branching}, depth={game_depth}")
    print(f"  Tree size N = {N:,.0f}")
    print(f"  Target epsilon = {target_epsilon}")
    print(f"  Vanilla CFR:  {T_vanilla:,.0f} iters × {N:,.0f} nodes = {cost_vanilla:.2e} ops")
    print(f"  MCCFR:        {T_mccfr:,.0f} iters × {D:,.0f} nodes = {cost_mccfr:.2e} ops")
    print(f"  MCCFR speedup: {cost_vanilla / cost_mccfr:.1f}x")
    print()


# Small SSA conjunction game (realistic for vanilla CFR)
mccfr_vs_vanilla_cost_comparison(
    game_branching=2, game_depth=6,
    variance_ratio_rho=5.0, target_epsilon=0.01, vanilla_constant=1.0
)

# Medium ISR sensor scheduling game
mccfr_vs_vanilla_cost_comparison(
    game_branching=4, game_depth=10,
    variance_ratio_rho=8.0, target_epsilon=0.01, vanilla_constant=2.0
)

# Large satellite-jammer frequency game (many frequency bands)
mccfr_vs_vanilla_cost_comparison(
    game_branching=16, game_depth=15,
    variance_ratio_rho=10.0, target_epsilon=0.05, vanilla_constant=5.0
)

The comparison shows that for the small SSA game, vanilla CFR is competitive or even preferred. For the large frequency game, MCCFR is orders of magnitude faster — the depth savings overwhelm the variance penalty.

Key Takeaways

  • MCCFR replaces the full game tree traversal of vanilla CFR with a sampled traversal, trading per-iteration accuracy for the ability to run far more iterations; the same convergence rate applies, but with a larger constant.
  • Outcome sampling visits a single root-to-leaf trajectory per iteration (cost ), while external sampling explores all traverser actions but samples opponent/chance play (cost ); external sampling has lower variance and is preferred for medium-sized games.
  • Importance weighting (dividing regret updates by the sampling probability ) makes the MCCFR estimator unbiased, but can amplify variance by when rare events are undersampled — practitioners mitigate this with ε-greedy sampling to ensure all actions get a minimum sampling probability.
  • The convergence bound still holds for MCCFR, but the constant is larger than for vanilla CFR; the total cost advantage of MCCFR grows with game depth and is most dramatic for deep trees where .
  • For very large games (poker, large SSA scheduling problems), even tabular MCCFR cannot store regrets for all information sets; this motivates Deep CFR (Lesson 5), which replaces the regret table with a neural network.
  • In SSA applications, MCCFR is the practical algorithm for games with more than a few thousand information sets: the conjunction maneuver coordination game fits vanilla CFR, but a multi-satellite spectrum deconfliction game with many operators and frequency bands requires MCCFR or its deep variants.

Quiz