Lesson 2: Extensive-Form Games in Detail

Module/Source: An Introduction to Game Theory (Osborne, 2004), Chapters 6–7 (extensive-form games, subgame perfect equilibrium, backward induction). Formal definitions of information sets and reach probabilities follow the notation in Zinkevich et al. (2007) "Regret Minimization in Games with Incomplete Information" and Lanctot et al. (2009) "Monte Carlo Sampling for Regret Minimization in Extensive Games." The OpenSpiel library uses the same vocabulary throughout its API.

Where this fits

Lesson 1 introduced extensive-form games at a high level. CFR (lesson 3) operates on the detailed structure of these games: information sets, reach probabilities, and strategies defined as policies over information sets. This lesson develops that structure precisely. The vocabulary here is exactly the vocabulary used in OpenSpiel's API and in CFR research papers.

The components of an extensive-form game

A formal extensive-form game has:

A finite set of players (we will mostly use two-player games)
A game tree with nodes representing decision points and edges representing actions
A player function that says whose turn it is at each non-terminal node (one of the players, or "chance")
A chance function that gives the probability distribution over actions at chance nodes
Information sets: for each player, a partition of their decision nodes into sets of nodes they cannot distinguish
Utility functions: for each player, a function from terminal nodes to real numbers (the payoff)

That is a lot. Let us walk through each piece with our running SSA-flavored example.

A small extensive-form SSA game

Two satellite operators, Alice and Bob, face a potential conjunction. Each operator has a "mission state" that is private (hidden from the other): either "high-priority" or "low-priority." Maneuvering costs more for high-priority operators (they want to stay on station for their mission).

The game proceeds:

Chance assigns each operator a mission state (50/50 high or low, independently)
Alice decides whether to maneuver (M) or hold (H)
Bob, who can see whether Alice maneuvered (but not Alice's mission state), decides M or H
Payoffs are determined by the joint action and both operators' mission states

This is a 2-player game with chance nodes (the random mission assignments) and information sets (each operator only knows their own mission state).

The game tree:

                              [Chance: assign Alice's mission]
                             /                                \
                  [A=high, p=0.5]                    [A=low, p=0.5]
                          |                                  |
              [Chance: assign Bob's mission]    [Chance: assign Bob's mission]
                /                  \                /                  \
       [B=high, p=0.5]     [B=low, p=0.5]   [B=high, p=0.5]    [B=low, p=0.5]
                |                  |                 |                  |
        [Alice decides]    [Alice decides]   [Alice decides]    [Alice decides]
            /     \           /     \           /     \           /     \
          M         H        M         H        M         H        M         H
          |         |        |         |        |         |        |         |
       [Bob]    [Bob]    [Bob]    [Bob]    [Bob]    [Bob]    [Bob]    [Bob]
       /  \    /  \    /  \    /  \    /  \    /  \    /  \    /  \
      M  H  M  H  M  H  M  H  M  H  M  H  M  H  M  H

There are 16 terminal nodes (4 chance combinations × 2 Alice actions × 2 Bob actions). Each terminal has a payoff for Alice and a payoff for Bob.

Information sets in this game

Alice's information sets: when Alice has to decide, she knows her own mission state but not Bob's. So:

"Alice's information set 1": all nodes where Alice has high mission and is to move (regardless of Bob's hidden mission). 2 nodes here.
"Alice's information set 2": all nodes where Alice has low mission and is to move. 2 nodes here.

Alice has 2 information sets total. Within each, she must use the same strategy because she cannot distinguish the underlying nodes.

Bob's information sets: when Bob has to decide, he knows his own mission state AND has observed Alice's action. So:

"Bob's information set 1": Bob high, Alice maneuvered (across both possibilities for Alice's hidden mission). 2 nodes.
"Bob's information set 2": Bob high, Alice held. 2 nodes.
"Bob's information set 3": Bob low, Alice maneuvered. 2 nodes.
"Bob's information set 4": Bob low, Alice held. 2 nodes.

Bob has 4 information sets total. He gets more information than Alice (he sees her move first), so he has finer information sets.

Strategies as functions over information sets

In CFR and other extensive-form game algorithms, a player's strategy is a function from their information sets to probability distributions over actions:

$σ_{i} (I) \in Δ (A (I))$

Decoding:

$σ_{i}$ : the strategy of player i (sigma is conventional notation for a strategy)
$I$ : an information set
$Δ (A (I))$ : the set of probability distributions over the actions available at information set I

For our game, Alice's strategy is two probability distributions:

$σ_{A} (Alice high) = (p_{M}, p_{H})$ : probability of maneuver and hold when Alice has high mission
$σ_{A} (Alice low) = (q_{M}, q_{H})$ : probability of maneuver and hold when Alice has low mission

Bob's strategy is four probability distributions, one per information set.

A complete strategy specifies all of these simultaneously. The size of the strategy space grows with the number of information sets, which is what makes large extensive-form games hard.

The crucial subtlety: strategies vs. policies

In RL (Module 3), we used the word "policy" for a function from states to action distributions. In game theory, we use "strategy" for a function from information sets to action distributions. These are subtly different concepts.

A policy is conditioned on the observable state. In a perfect-information game, every node is its own information set, and policy = strategy. In an imperfect-information game, multiple nodes share the same information set, and the strategy must be the same at all of them.

This distinction matters for CFR: regret is computed per information set, not per node, because the player cannot distinguish the underlying nodes.

Reach probabilities

Given a strategy profile (one strategy per player, plus the chance distributions), the reach probability of a node is the probability that the game actually arrives at that node when played according to the strategy profile.

For a node deep in the tree, the reach probability is the product of:

The chance probabilities along the path
The strategy probabilities of the actions along the path (each player's strategy applied to the relevant action)

Formally, the reach probability is decomposed:

$π^{c} (h)$ : chance reach probability (product of chance probabilities on path to h)
$π_{i}^{σ} (h)$ : player i's reach probability (product of player i's strategy probabilities on path to h)
$π^{σ} (h) = π^{c} (h) \cdot \prod_{i} π_{i}^{σ} (h)$ : total reach probability

The reach probability tells you how much the game "weights" each node when computing expected payoffs. Nodes with high reach probability contribute more to expected outcomes than nodes with low reach probability.

Counterfactual reach probabilities

Here is a concept that is crucial for CFR but takes some unpacking. The counterfactual reach probability of an information set, from player i's perspective, is the probability of reaching that information set if player i were trying to reach it.

Specifically, it is the product of chance probabilities and all OTHER players' strategy probabilities along the path:

$π_{- i}^{σ} (I) = π^{c} (I) \cdot j \neq = i \prod π_{j}^{σ} (I)$

The "-i" subscript means "everyone except player i." We are computing the probability of reaching this information set assuming player i played to reach it, while all other players played their strategies normally.

This is the weight that CFR uses when updating regrets. It says: "how often would I face this decision if I were trying to face it?" If a particular information set is rarely reached anyway (because of opponent or chance), it gets less weight in the update.

The math gets thick here. The intuition: we want to update strategies based on how relevant each information set is given the current play of the other players. Counterfactual reach captures that relevance.

Expected payoff and value

The expected payoff for player i under strategy profile σ is the average payoff over all terminal nodes, weighted by reach probability:

$u_{i} (σ) = z \in Z \sum π^{σ} (z) \cdot u_{i} (z)$

where Z is the set of terminal nodes and $u_{i} (z)$ is player i's payoff at terminal node z.

Players try to maximize this expected payoff. CFR's job is to find a strategy profile σ where everyone is approximately maximizing simultaneously: a Nash equilibrium.

What "approximate Nash" means

In the limit of infinite computation, CFR converges to an exact Nash equilibrium: a strategy profile where no player can improve by deviation.

In practice, CFR is run for some finite number of iterations, producing an ε-Nash equilibrium: a strategy profile where no player can improve by more than ε. As iterations increase, ε shrinks. For most practical purposes, ε in the range 0.01 to 0.001 is good enough.

The iteration counts needed depend on the size of the game. Vanilla CFR on a game with thousands of information sets might need millions of iterations. MCCFR (lesson 4) is the workhorse for larger games.

Why the structure matters for CFR

CFR exploits the structure of extensive-form games to make Nash equilibrium computation tractable. Specifically:

It computes regret per information set (not per node), exploiting the fact that strategies are constant within information sets.
It uses reach probabilities to weight updates, exploiting the recursive structure of the game tree.
It only needs to traverse the game tree, not enumerate strategy profiles, exploiting the compact representation.

Without the formal structure of extensive-form games, none of these exploitations would be possible. The next lesson uses this structure to define and run CFR on a small game.

Subgame perfect equilibrium

Nash equilibrium is the right concept for simultaneous-move games, but extensive-form games allow sequential moves, which introduces a new issue: non-credible threats.

Consider a simplified version of the conjunction game where Alice moves first and can threaten to sue Bob if he does not maneuver. In the normal-form representation, "I will sue if you don't maneuver" might be a Nash equilibrium strategy because Bob, fearing the lawsuit, maneuvers. But if actually carrying out the lawsuit would cost Alice more than she would gain, the threat is non-credible. Alice would not follow through.

Backward induction eliminates these non-credible threats. Starting from the terminal nodes and working back:

At each final decision point, the player chooses the action that maximizes their payoff.
Replace that decision point with the resulting payoff values.
Move one level up and repeat.

The strategies surviving backward induction form a subgame perfect equilibrium (SPE): a Nash equilibrium where the strategies remain Nash equilibria in every subgame (every sub-tree rooted at any reachable node).

Why SPE eliminates non-credible threats

A threat is credible only if carrying it out is the rational action at the point where it would be executed. Backward induction checks exactly this: it asks "would this player really do this if we actually reached this node?" If the answer is no, the equilibrium is thrown out.

SSA example: credible commitments in spectrum deconfliction

Suppose two operators share a frequency band. Operator A (incumbent) threatens to transmit at full power to jam Operator B if B transmits during A's window, even though doing so would also degrade A's own signal.

         [B considers transmitting]
        /                           \
    [B transmits]              [B stays silent]
         |                           |
    [A decides]                  (A: 0, B: 5)  <- B takes full slot
    /         \
[A jams]    [A ignores]
 (-2, -3)     (-5, 8)   <- A ignores, B takes the slot

Payoffs: (A's payoff, B's payoff).

Nash equilibrium analysis: is (A threatens to jam, B stays silent) a Nash equilibrium?

If B believes A will jam, B prefers to stay silent (5 vs. -3). So B staying silent is a best response to "A will jam."
If B is staying silent, A never has to execute the threat, so any threat strategy is technically a Nash equilibrium.

But applying backward induction to the subgame where B actually transmitted:

A's payoff from jamming = -2, from ignoring = -5.
A will jam. The threat is actually credible here.

Now suppose the payoffs change: jamming costs A severely (due to international telecommunications regulations), making the jam payoff (-8, -3) instead of (-2, -3):

A's payoff from jamming = -8, from ignoring = -5.
A will NOT jam. The threat is non-credible.
Knowing this, B will transmit (payoff 8 > 5).
The SPE is: B transmits, A ignores.

The SPE analysis correctly identifies that credibility depends on the actual payoffs at the point of execution, not just the announced threat.

Information sets and perfect recall

Formal definition of an information set

An information set $I$ for player $i$ is a set of decision nodes ${h_{1}, h_{2}, \dots, h_{k}}$ such that:

All nodes in $I$ belong to player $i$ (same player moves at each).
The same actions are available at every node in $I$ .
Player $i$ cannot distinguish between the nodes within $I$ based on their observable history.

Decoding: condition 3 is the key one. It says that if the game reaches any node in $I$ , player $i$ only knows "I am at some node in $I$ " but not which specific node. Their strategy must therefore be the same at all nodes in $I$ — they cannot condition on information they do not have.

In our Alice–Bob conjunction game, Alice's information set 1 contains two nodes: (chance says A=high, B=high, Alice's turn) and (chance says A=high, B=low, Alice's turn). Alice knows she is high-priority but not whether Bob is high or low priority. Both nodes are in the same information set because, from Alice's perspective, they are indistinguishable.

Perfect recall

A player has perfect recall if they always remember their own past actions and observations. Formally, within an information set, all nodes must have identical sequences of (player $i$ 's actions, information sets) along the path from the root.

Perfect recall is standard in game theory and in CFR: it ensures that information sets have a well-behaved structure that supports the counterfactual reasoning CFR relies on.

What happens when perfect recall fails: if a player can "forget" what they did earlier (imperfect recall), the same player's decision nodes can end up in the same information set even though they arose from different action sequences by that player. This creates fundamental problems for CFR:

The standard regret decomposition breaks down — you cannot cleanly separate regret by information set.
Nash equilibrium computation becomes PSPACE-hard rather than polynomial-time.
The "strategy" of a player can depend on where in the information set they are (which defeats the purpose of having information sets).

In SSA contexts, imperfect recall arises naturally in situations with limited telemetry: a ground station might issue commands to a satellite but not retain the record of what commands were sent. For the purposes of game-theoretic analysis, we generally assume perfect recall or explicitly model the information state to restore it.

When partial observability differs from imperfect recall

Partial observability (not knowing the opponent's state) is different from imperfect recall (forgetting your own past actions). Both create information sets, but:

Partial observability of the opponent's state is handled naturally by information sets with no complications for CFR.
Imperfect recall of your own actions creates information sets that violate the standard CFR assumptions.

In SSA, the classic partial observability scenario is a satellite-vs-jammer hide-and-seek game: the satellite does not know the jammer's location; the jammer does not know whether the satellite has detected it. This is partial observability of the opponent, handled cleanly by information sets.

Reach probabilities: detailed computation

Given a strategy profile $σ$ , the reach probability $π^{σ} (h)$ of a history $h$ is computed as a product over all actions taken along the path from the root to $h$ :

$π^{σ} (h) = h^{'} \cdot a ⊑ h \prod σ_{P (h^{'})} (h^{'}, a)$