Lesson 5: Deep CFR

Module/Source: Brown et al. (2019) "Deep Counterfactual Regret Minimization" (ICML 2019) — the original Deep CFR paper. Heinrich and Silver (2016) "Deep Reinforcement Learning from Self-Play in Imperfect-Information Games" for Neural Fictitious Self-Play, a related approach. Brown and Sandholm (2019) "Superhuman AI for multiplayer poker" (Science) — Pluribus, which used a deep CFR blueprint strategy. Architecture and training details follow the notation in the Deep CFR paper and the OpenSpiel implementation. Game theory foundations: Zinkevich et al. (2007) and Lanctot et al. (2009). PyTorch implementation patterns follow the official PyTorch documentation.

Where this fits

Vanilla CFR and MCCFR store regrets in a table indexed by information set. For huge games, the table is too big. Deep CFR replaces the table with a neural network: at each information set, the network predicts the regrets for each action. This is the same idea as DQN (Module 3, lesson 4): replace a tabular representation with a function approximator that can generalize across similar inputs. Deep CFR has produced state-of-the-art results in poker games too large for tabular MCCFR. The pattern (table → network) is a recurring theme: every algorithm in this curriculum has both a tabular and a deep variant.

The basic structure

Deep CFR maintains:

A regret network $R_{i} (I, a; θ_{R})$ that predicts cumulative regret for player i, information set I, action a, parameterized by neural network weights $θ_{R}$ .
A strategy network $σ_{i} (I; θ_{σ})$ that predicts the average strategy at each information set.
A buffer of (information set, regret estimate) training examples for the regret network.
A buffer of (information set, strategy) training examples for the strategy network.

The high-level loop:

For each iteration:
    For each player p:
        Run external-sampling MCCFR using the regret network's predictions
        as if they were tabular regrets
        Collect new (info set, regret estimate) pairs into the buffer
        Train the regret network on the buffer
    Add (info set, current strategy) pairs to the strategy buffer
    Train the strategy network on the strategy buffer

Return the strategy network as the Nash equilibrium approximation

The conceptual structure

In tabular CFR, the regret update at information set I, action a is:

R(I, a) += counterfactual_regret(I, a)

In deep CFR, instead of updating a table cell, you generate a training example:

training_example = (encode(I), counterfactual_regret(I, a))
buffer.add(training_example)

Then periodically train the regret network on the buffer using MSE loss. The network's prediction of $R (I, a)$ becomes the target for regret matching.

The same logic applies to the strategy: instead of updating a strategy table, generate (information set, strategy) examples and train the strategy network.

Why deep CFR is a hard problem

In supervised learning, the labels are fixed. You train, you converge, you are done.

In deep CFR, the labels (regret values) are generated by the algorithm itself, which uses the current network. As the network changes, the labels change. As the labels change, the network has to change. This is similar to the moving-target problem in DQN, which is partially mitigated by target networks.

There are several practical engineering challenges:

Stale data: regrets generated by an old network are inconsistent with the current network. Most deep CFR variants weight recent examples more or use replay buffers that decay.

Network capacity: the regret network must be expressive enough to represent the regret function, which can have complex structure.

Convergence guarantees weaken: tabular CFR converges to Nash provably. Deep CFR converges in practice but the theoretical guarantees are weaker.

A simplified deep CFR sketch

A complete implementation is beyond what we will write in this curriculum, but here is the structure:

import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque
import numpy as np

class RegretNetwork(nn.Module):
    """Predicts regrets for each action given an information set encoding."""
    def __init__(self, info_set_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(info_set_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )
    
    def forward(self, info_set):
        return self.net(info_set)

class DeepCFR:
    def __init__(self, game, info_set_dim, num_actions):
        self.game = game
        self.regret_nets = [RegretNetwork(info_set_dim, num_actions) 
                           for _ in range(game.num_players())]
        self.strategy_net = RegretNetwork(info_set_dim, num_actions)
        
        self.regret_buffers   = [deque(maxlen=100_000) for _ in range(game.num_players())]
        self.strategy_buffer  = deque(maxlen=100_000)
        
        self.optimizers = [torch.optim.Adam(net.parameters(), lr=1e-3)
                          for net in self.regret_nets]
        self.strategy_opt = torch.optim.Adam(self.strategy_net.parameters(), lr=1e-3)
    
    def get_regret_strategy(self, info_set_tensor, player):
        """Compute current strategy from regret network's prediction."""
        with torch.no_grad():
            regrets = self.regret_nets[player](info_set_tensor)
            positive = torch.clamp(regrets, min=0)
            total = positive.sum()
            if total > 0:
                return (positive / total).cpu().numpy()
            return np.ones(len(positive)) / len(positive)
    
    def cfr_traverse(self, history, player):
        """External-sampling traversal, generating regret training examples."""
        if history.is_terminal():
            return history.returns()[player]
        
        if history.is_chance_node():
            outcomes = history.chance_outcomes()
            actions, probs = zip(*outcomes)
            sampled = np.random.choice(len(actions), p=probs)
            return self.cfr_traverse(history.apply(actions[sampled]), player)
        
        if history.current_player() == player:
            # Explore all actions; compute regret for this info set
            info_set_tensor = torch.tensor(history.info_set_tensor(), dtype=torch.float32)
            strategy = self.get_regret_strategy(info_set_tensor, player)
            legal = history.legal_actions()
            
            action_values = []
            for action in legal:
                v = self.cfr_traverse(history.apply(action), player)
                action_values.append(v)
            
            node_value = sum(strategy[i] * action_values[i] for i in range(len(legal)))
            
            # Compute regret estimates and add to buffer
            regret_estimates = np.array([action_values[i] - node_value for i in range(len(legal))])
            self.regret_buffers[player].append((info_set_tensor.numpy(), regret_estimates))
            
            return node_value
        else:
            # Sample one action for the other player
            info_set_tensor = torch.tensor(history.info_set_tensor(), dtype=torch.float32)
            other = history.current_player()
            strategy = self.get_regret_strategy(info_set_tensor, other)
            legal = history.legal_actions()
            sampled = np.random.choice(len(legal), p=strategy)
            
            # Add strategy training example
            self.strategy_buffer.append((info_set_tensor.numpy(), strategy))
            
            return self.cfr_traverse(history.apply(legal[sampled]), player)
    
    def train_networks(self, epochs=10, batch_size=128):
        # Train regret networks
        for player in range(self.game.num_players()):
            buf = self.regret_buffers[player]
            if len(buf) < batch_size:
                continue
            for _ in range(epochs):
                batch = random.sample(buf, batch_size)
                states, regrets = zip(*batch)
                states = torch.tensor(np.array(states), dtype=torch.float32)
                regrets = torch.tensor(np.array(regrets), dtype=torch.float32)
                
                preds = self.regret_nets[player](states)
                loss = F.mse_loss(preds, regrets)
                self.optimizers[player].zero_grad()
                loss.backward()
                self.optimizers[player].step()
        
        # Train strategy network similarly (omitted for brevity)
    
    def run(self, num_iterations=1000, traversals_per_iter=100):
        for it in range(num_iterations):
            for player in range(self.game.num_players()):
                for _ in range(traversals_per_iter):
                    initial = self.game.new_initial_state()
                    self.cfr_traverse(initial, player)
            self.train_networks()
            print(f"Iteration {it+1}/{num_iterations}: "
                  f"regret buffer = {len(self.regret_buffers[0])}, "
                  f"strategy buffer = {len(self.strategy_buffer)}")

The implementation is dense but the structure is clear: at each iteration, do many MCCFR-style traversals using the network for regret predictions, collect (info set, regret) examples, and train the network on the collected data.

Comparing tabular and deep CFR

Aspect	Tabular CFR / MCCFR	Deep CFR
Memory	O(N) for N info sets	O(network parameters)
Convergence	Provable to ε-Nash	Empirical, no proof
Best for	Small to medium games	Huge games with similar info set structure
Engineering complexity	Simple	Significant (replay, networks, scheduling)
Time to converge	Linear in N (rough)	Depends on network capacity

For our SSA conjunction game (small enough that vanilla CFR works), deep CFR is overkill. For poker-sized games, deep CFR (or its variants) is essential.

When to use deep CFR

In real research, deep CFR is the algorithm of choice when:

The game has too many information sets for a regret table
The information sets have meaningful structure (similar info sets get similar regrets)
You can afford the engineering complexity of the network training

Most CFR research today uses deep variants. Pluribus, the superhuman 6-player Hold'em bot, used a precomputed blueprint strategy from MCCFR-on-an-abstracted-game and then deep CFR variants for the actual gameplay (real-time subgame solving).

What to take away

The pattern from CFR is the same as everywhere else in this curriculum: tabular methods are simple and provably correct on small problems; neural network function approximation extends them to large problems at the cost of some theoretical guarantees and a lot of engineering. The same pattern from DQN (replacing Q tables with Q networks) appears in CFR.

For the project, you will implement vanilla CFR. Once you understand it concretely, the variants (MCCFR, deep CFR) are conceptual modifications, not new algorithms.

The advantage network

What it predicts: per-action instantaneous regrets

The regret network's prediction target is not cumulative regret but instantaneous counterfactual regret — how much better each action would have been than the current strategy at this information set in this traversal:

$\hat{A}_{i} (I, a) = v_{i} (I, a) - v_{i} (I) = v_{i} (I, a) - a^{'} \sum σ_{i} (I, a^{'}) \cdot v_{i} (I, a^{'})$

Decoding:

$v_{i} (I, a)$ : expected utility if action $a$ were always taken at $I$
$v_{i} (I)$ : expected utility under the current mixed strategy
$\hat{A}_{i} (I, a)$ : the "advantage" of $a$ over the average — positive means $a$ is better than the current mix

Why advantages, not values

Regret matching only cares about relative differences between actions: $σ^{t + 1} (a) \propto max (0, R^{t} (a))$ . A constant offset added to all regrets cancels out in the normalization and does not affect the resulting strategy. Predicting advantages (automatically zero-centered across actions at any given information set) removes this irrelevant degree of freedom and stabilizes training.

In the satellite-vs-jammer spectrum game, raw node values span -100 to +1; advantages span roughly -5 to +5 — a much easier regression target.

PyTorch code for the advantage network architecture

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class AdvantageNetwork(nn.Module):
    """Predicts per-action instantaneous counterfactual advantages for Deep CFR."""

    def __init__(self, info_set_dim: int, num_actions: int, hidden_dim: int = 256):
        super().__init__()
        self.input_layer = nn.Sequential(
            nn.Linear(info_set_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU()
        )
        self.residual = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim),
        )
        self.output_layer = nn.Linear(hidden_dim, num_actions)
        # No output activation: advantages can be any real value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.input_layer(x)
        x = x + F.relu(self.residual(x))   # residual connection
        return self.output_layer(x)

    def get_strategy(self, x: torch.Tensor) -> torch.Tensor:
        """Convert advantages to strategy via regret matching."""
        with torch.no_grad():
            adv = self.forward(x)
            pos = torch.clamp(adv, min=0.0)
            total = pos.sum(dim=-1, keepdim=True)
            uniform = torch.ones_like(pos) / pos.shape[-1]
            return torch.where(total > 0, pos / total, uniform)


def train_advantage_network(network, optimizer, buffer, batch_size=256, epochs=5):
    """Train on (info_set_encoding, instantaneous_regrets) pairs with MSE loss."""
    if len(buffer) < batch_size:
        return float('nan')
    final_loss = 0.0
    for _ in range(epochs):
        batch = [buffer[i] for i in np.random.choice(len(buffer), batch_size, replace=False)]
        states, targets = zip(*batch)
        s_t = torch.tensor(np.array(states),  dtype=torch.float32)
        r_t = torch.tensor(np.array(targets), dtype=torch.float32)
        loss = F.mse_loss(network(s_t), r_t)
        optimizer.zero_grad(); loss.backward()
        torch.nn.utils.clip_grad_norm_(network.parameters(), 1.0)
        optimizer.step()
        final_loss = loss.item()
    return final_loss


# Spectrum game: info set = [own freq one-hot (8), observed jammer history (8), round (1)]
NUM_FREQ_BANDS = 8
INFO_SET_DIM   = NUM_FREQ_BANDS * 2 + 1
adv_net = AdvantageNetwork(INFO_SET_DIM, NUM_FREQ_BANDS, hidden_dim=128)
dummy   = torch.randn(32, INFO_SET_DIM)
print(f"Output shape: {adv_net(dummy).shape}")          # (32, 8)
print(f"Strategy sums: {adv_net.get_strategy(dummy).sum(-1)[:3]}")  # all 1.0

The strategy network

What it predicts: average strategy

While the advantage network predicts instantaneous regrets (used to drive the current strategy), the strategy network predicts the time-averaged strategy — the Nash approximation returned at the end of CFR training.

In tabular CFR, the average strategy is maintained by accumulating $σ_{sum} (I, a) + = σ^{t} (I, a)$ each iteration. The strategy network fits this accumulated average directly, weighted by the player's reach probability at each information set each iteration.

Why it is a separate network

The advantage network tracks fast-changing current regrets; the strategy network tracks the slow-converging historical average. Sharing one network would cause the fast advantage updates to destabilize the strategy estimates, just as using the current strategy instead of the average would prevent CFR from converging.

An analogy: the strategy network is the neural equivalent of the strategy_sum table in the vanilla CFR code, while the advantage network is the neural equivalent of the regrets table.

Training with behavioral cloning

The strategy network is trained by behavioral cloning: at each CFR traversal, when the opponent visits an information set, we record (info_set_encoding, current_strategy) as a training example. The strategy network minimizes cross-entropy loss against these observed strategies.

import torch, torch.nn as nn, torch.nn.functional as F, numpy as np


class StrategyNetwork(nn.Module):
    """Predicts the time-averaged strategy (probability distribution over actions)."""

    def __init__(self, info_set_dim: int, num_actions: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(info_set_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),   nn.LayerNorm(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # logits; softmax applied in forward
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Returns log-probabilities (use .exp() for probabilities)."""
        return F.log_softmax(self.net(x), dim=-1)

    def get_strategy(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.forward(x).exp()


def train_strategy_network_bc(network, optimizer, buffer, batch_size=256, epochs=5):
    """Behavioral cloning: minimize cross-entropy against observed CFR strategies."""
    if len(buffer) < batch_size:
        return float('nan')
    final_loss = 0.0
    for _ in range(epochs):
        batch = [buffer[i] for i in np.random.choice(len(buffer), batch_size, replace=False)]
        info_sets, targets = zip(*batch)
        s_t = torch.tensor(np.array(info_sets), dtype=torch.float32)
        t_t = torch.tensor(np.array(targets),   dtype=torch.float32)
        loss = -(t_t * network(s_t)).sum(dim=-1).mean()  # cross-entropy
        optimizer.zero_grad(); loss.backward()
        torch.nn.utils.clip_grad_norm_(network.parameters(), 1.0)
        optimizer.step()
        final_loss = loss.item()
    return final_loss

Memory buffer design

Reservoir sampling for old iterations

Deep CFR needs training examples from all past CFR iterations. In standard RL (DQN), old replay buffer data becomes stale as the policy changes, so many RL algorithms discount or discard it. CFR is different: the Nash equilibrium is the time average over all past strategies, so old data is not stale — it is an essential part of the average being computed. Discarding it would corrupt the average and slow convergence.

Reservoir sampling maintains a uniform random sample of all items seen so far in a fixed-capacity buffer:

For each incoming item x at position t in the stream:
    If t <= capacity: add x to buffer
    Else: with probability capacity/t, replace a random buffer entry with x

After processing any number of items, every item has equal probability $capacity / t$ of being in the buffer — no bias toward recent data.

import random
from typing import Any, List, Optional


class ReservoirBuffer:
    """Uniform random sample over all items seen — correct for Deep CFR strategy buffers."""

    def __init__(self, capacity: int, seed: Optional[int] = None):
        self.capacity  = capacity
        self.buffer: List[Any] = []
        self.total_seen = 0
        self.rng = random.Random(seed)

    def add(self, item: Any) -> None:
        self.total_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(item)
        else:
            j = self.rng.randint(0, self.total_seen - 1)
            if j < self.capacity:
                self.buffer[j] = item

    def sample(self, n: int) -> List[Any]:
        return self.rng.sample(self.buffer, min(n, len(self.buffer)))

    def __len__(self) -> int:
        return len(self.buffer)

    @property
    def is_ready(self) -> bool:
        return len(self.buffer) >= min(self.capacity // 10, 1000)

Use ReservoirBuffer for the strategy network (all iterations equal weight). For the advantage network, a weighted variant that gives more weight to recent iterations can improve empirical convergence, since recent regrets better reflect the current strategy.

Practical considerations

Batch size and training frequency

Run $K \cdot ∣ I_{i} ∣$ traversals before each network update ( $K \in [5, 20]$ , $∣ I_{i} ∣$ = estimated info sets for player $i$ ). Too few traversals → overfitting to recent data; too many → network lags behind the true regret function. Mini-batch size 256–1024 is typical; gradient clipping (norm 1.0) prevents variance spikes from high-magnitude importance weights.

Warm-up period before iterates are useful

The advantage network is randomly initialized, so its early predictions are meaningless. Using them to guide strategy immediately would flood the buffer with misleading regret estimates. The standard fix:

Run $T_{warmup} = 10$ –100 iterations with uniform strategy (ignoring the network).
Collect regret estimates into the buffer under this uniform play.
Train the network once on the initial buffer.
Switch to network-guided play.

By the time the network starts influencing the strategy, it has seen enough diverse game states to make non-trivial predictions.

How to evaluate: exploitability estimation

The natural metric is exploitability: the maximum gain any player could achieve by best-responding to the current average strategy. At a Nash equilibrium, exploitability is zero.

Three practical methods:

Exact best response (small games): traverse the full game tree to compute best-response value exactly.
Local best response (LBR): run a greedy best-response search for a limited number of nodes; the gain is a lower bound on exploitability.
Head-to-head win rate against a fixed baseline (easy but less theoretically grounded).

Expected exploitability trajectory on the spectrum game:

Iteration 0 (uniform): ~0.33 (worst case for 3-action game)
Iteration 100: ~0.05–0.10
Iteration 1000: ~0.005–0.02

A plateauing exploitability curve signals insufficient network capacity or buffer size. An oscillating curve signals training instability — reduce the learning rate or increase the warm-up period.

Key Takeaways

The advantage network predicts per-action instantaneous counterfactual regrets (advantages) rather than raw values, because regret matching is invariant to constant offsets — advantages are automatically zero-centered and have uniform scale across information sets, making them easier to learn.
The strategy network is a separate network trained by behavioral cloning to predict the time-averaged strategy (the Nash approximation); the separation reflects the two-table structure of tabular CFR: fast-changing regrets table vs. slow-accumulating strategy_sum table.
Reservoir sampling is the correct buffer design for Deep CFR because the Nash approximation is a time average over all past iterations — old data is equally valuable to new data, unlike RL where old policy data becomes stale.
The warm-up period (10–100 iterations with uniform strategy) prevents the buffer from being polluted with meaningless regret estimates from the randomly initialized network before it has seen enough game states.
Exploitability (maximum gain from best-responding) is the principled evaluation metric; local best response (LBR) provides a cheap lower bound; a plateauing curve indicates capacity problems, an oscillating curve indicates training instability.
Deep CFR follows the same tabular-to-deep pattern as DQN and deep MCTS: neural function approximation scales CFR to games too large for a regret table, at the cost of weaker convergence guarantees and significantly more engineering complexity.

ML for Spacepower Simulations