Introduction

Module 1: Foundations

Where this module fits

Everything you'll build later in this curriculum (value functions, policy gradients, MCTS rollouts, CFR regret updates, neural network forward passes) reduces to three operations: pushing numbers through matrices, taking derivatives of those numbers with respect to other numbers, and reasoning about what those numbers mean when they're random. This module gives you working intuition for those three things and nothing else.

This is not a math course. We are picking exactly the pieces of probability, linear algebra, and calculus that show up when you read OpenSpiel source code, and skipping everything else. If a topic feels truncated, that's because the rest doesn't matter for our goal. When we hit a later algorithm that genuinely needs more (eigenvectors for alpha-rank, for instance), we'll handle it then, in context.

What we cover

Probability (lessons 1-4). Distributions, expectation, conditional probability, Monte Carlo sampling, and the information-theoretic quantities (entropy, cross-entropy, KL divergence) that show up everywhere from policy gradients to regret matching. The MCTS, MCCFR, and policy gradient methods later in the curriculum are, deep down, ways of making smart estimates from random samples. If you internalize "expectation under a distribution, estimated by sampling," you've already got the shape of half the algorithms in OpenSpiel.

Linear algebra (lessons 5-6). Vectors as state representations (an orbital state vector is, mechanically, just a vector). Dot products as similarity and projection. Matrix-vector multiplication as the operation that defines a single neural network layer. That's it for now. We're skipping eigendecomposition, determinants, matrix inverses, and rank, because they don't show up until later (and some never do).

Calculus (lesson 7). Derivatives as slopes, partial derivatives as slopes-along-one-axis, gradients as the vector pointing uphill, and the chain rule. The chain rule is the entire mathematical content of backpropagation; if you can see it visually, the rest of "how neural nets learn" is bookkeeping.

Lessons

  1. Probability, distributions, and expectation
  2. Conditional probability and Bayes' rule
  3. Sampling and Monte Carlo estimation
  4. Entropy, cross-entropy, and KL divergence
  5. Vectors and dot products
  6. Matrices and matrix-vector multiplication
  7. Derivatives, gradients, and the chain rule

Module project: Monte Carlo conjunction probability

You'll write a small Python program that estimates the probability of a collision between two satellites whose positions and velocities are known only to within some uncertainty. It uses every concept in this module: state vectors (linear algebra), uncertainty distributions (probability), Monte Carlo sampling (expectation under randomness), and a small sensitivity analysis that previews what gradients are good for.

This is a real problem in your field. JSpOC and the commercial conjunction services do something more sophisticated, but the bones are the same: simulate possible futures, average over them, use the average to make a decision. It is also a microcosm of the rest of the curriculum: every algorithm we build later is, in some way, doing exactly this with more structure on top.

What this module is not

We are not doing epsilon-delta proofs. We are not deriving Cauchy-Schwarz. We are not classifying matrices into normal forms. We are not covering measure-theoretic probability. If you came in hoping this would be the time you finally Get linear algebra, this module will frustrate you. It exists to make you fluent enough to read RL and game-theory code without bouncing off the notation, and that is the entire bar.

How to read the lessons

Every lesson follows the same shape: where it fits, the concept (intuition first), the math (only when load-bearing), code, a worked example small enough to verify by hand, and a quiz. If you find yourself stuck on math notation, that's a signal to reread the symbol-decoding paragraph rather than to push through. The notation is a compression of the intuition; if the intuition isn't there, the notation will not magically install it.

Lesson 1: Probability, Distributions, and Expectation

Where this fits

Every algorithm in this curriculum, from MCTS rollouts to CFR regret minimization to policy gradient training, is answering one core question: given that I am uncertain about the world, what should I expect? That question has a precise mathematical answer, and this lesson is that answer. Once you understand expectation, you understand the conceptual shape of half the algorithms in OpenSpiel. The rest is detail.

A space scenario to motivate everything

Imagine you are working a conjunction assessment shift at a space operations center. Your ground radar just detected a new Resident Space Object (RSO) in low Earth orbit. Based on the radar cross-section measurement and preliminary orbital elements, your analyst has assigned probabilities to what this object might be:

| Object type      | Probability |
|------------------|-------------|
| Active satellite | 0.80        |
| Debris           | 0.15        |
| Dead satellite   | 0.05        |
| Total            | 1.00        |

You cannot wait for perfect information. You need to decide right now how much sensor time to allocate to continued tracking, which operators to notify, and how urgently to treat this object. Different object types require different responses. This situation, having to reason and act when you do not know the truth for certain, is exactly the problem probability is designed for.

What is a random variable?

A random variable is a number (or category) whose value you do not know yet, but where you know something about what values it could take.

The object type in your scenario is a random variable. It could be active satellite, debris, or dead satellite. Right now, before you have more sensor data, you are uncertain which it is. The "random" part just means you are uncertain. The "variable" part means it is a slot waiting to be filled with a value.

You will constantly encounter random variables in this curriculum:

  • The action an adversary's satellite takes during a conjunction (uncertain because you cannot read their intentions)
  • The reward an RL agent receives after making a move (uncertain because the environment is stochastic)
  • The object type of a newly detected RSO (uncertain because your sensor is imperfect)
  • The exact position of a satellite given imperfect tracking data (uncertain because measurement errors exist)

They are all the same idea. A quantity that depends on something you have not fully observed yet.

What is a distribution?

A distribution is the complete description of your uncertainty. It lists every possible value the random variable could take, and the probability of each one.

Your RSO analysis produced a distribution:

| Object type      | Probability |
|------------------|-------------|
| Active satellite | 0.80        |
| Debris           | 0.15        |
| Dead satellite   | 0.05        |

Notice the probabilities sum to 1.00. This is always required. The distribution has to account for all possibilities. One of these outcomes will happen (or already has happened, you just do not know which yet). The probabilities just describe how likely each one is.

Two distributions you will see constantly

Categorical distribution: a distribution over a finite list of named categories. The object-type example above is categorical. In reinforcement learning, your policy is a categorical distribution over actions: "there are four possible moves, with these probabilities."

Gaussian (Normal) distribution: a distribution over all real numbers, shaped like a bell curve. The position of a satellite you are tracking is often modeled as Gaussian: you have a best estimate of where it is, and uncertainty spreads symmetrically around that estimate. A satellite with position uncertainty of 0.2 km is much more precisely tracked than one with uncertainty of 5 km, even if both have the same best estimate.
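
Both of these are one-liners in torch.distributions. A minimal sketch (the Gaussian's numbers are illustrative):

import torch
from torch.distributions import Categorical, Normal

# Categorical: the object-type belief from the table above.
object_type = Categorical(probs=torch.tensor([0.80, 0.15, 0.05]))

# Gaussian: position along one axis, best estimate 6371.0 km, uncertainty 0.2 km.
position = Normal(loc=6371.0, scale=0.2)

print(object_type.sample().item())     # 0, 1, or 2
print(position.sample((5,)).tolist())  # five plausible positions near 6371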

What is expectation? Building it from arithmetic

Now suppose each object type has an associated sensor priority score:

| Object type      | Probability | Priority score |
|------------------|-------------|----------------|
| Active satellite | 0.80        | 30             |
| Debris           | 0.15        | 90             |
| Dead satellite   | 0.05        | 80             |

Question: what is the average priority score for this object, given your uncertainty about its type?

Here is how to think about it without any formulas yet. Suppose you processed 1,000 similar radar detections using this same probability model. Based on those probabilities, you would expect:

  • About 800 to be active satellites (priority score 30)
  • About 150 to be debris (priority score 90)
  • About 50 to be dead satellites (priority score 80)

To find the average priority score across all 1,000 detections:

Total priority points from active satellites: 800 × 30 = 24,000
Total priority points from debris:            150 × 90 = 13,500
Total priority points from dead satellites:    50 × 80 =  4,000
                                              ─────────────────
Total priority points:                                   41,500

Average = 41,500 / 1,000 = 41.5

Now look at what those 800, 150, and 50 are. Divide each by 1,000 and you get 0.80, 0.15, and 0.05. Those are the probabilities. So the exact same arithmetic can be written more directly:

(0.80 × 30) + (0.15 × 90) + (0.05 × 80)
= 24.0 + 13.5 + 4.0
= 41.5

That is expectation. Multiply each value by its probability, then add up the products. The result is the probability-weighted average, called the expected value or expectation.

A few things to notice:

  • The expected value (41.5) is not one of the possible values (30, 90, 80). That is fine. Expectation is a property of the distribution, not a prediction of any single outcome.
  • If you actually processed that radar contact, you would see a priority score of 30, 90, or 80, nothing else. The 41.5 is what you should plan around on average, before you know which one you got.
  • If the debris probability were much higher (say, 0.95), the expected priority would be much higher too. The expected value follows the probability mass.

The formula, built from the arithmetic you just did

Here is the same calculation written compactly. Let us use symbols to represent the quantities:

  • Let \(n\) be the number of possible outcomes (in our case, 3)
  • Let \(p_i\) be the probability of outcome \(i\)
  • Let \(x_i\) be the value (priority score) for outcome \(i\)

The expected value is written:

\[ \mathbb{E}[X] = \sum_{i=1}^{n} p_i \cdot x_i \]

Decoding every symbol, one at a time:

\(\mathbb{E}[X]\): Read this as "the expected value of X" or just "E of X." The double-struck capital E is a conventional notation for "take the expectation of." X is the random variable (our priority score). The brackets just mean we are asking for the expectation of that particular thing.

\(\sum_{i=1}^{n}\): This is the capital Greek letter sigma, used here as a summation sign. Read it as "add up the following thing for every i, starting at i = 1 and ending at i = n." It is literally a for loop:

p = [0.80, 0.15, 0.05]   # p_1 ... p_n
x = [30.0, 90.0, 80.0]   # x_1 ... x_n
total = 0.0
for p_i, x_i in zip(p, x):  # one term per outcome
    total += p_i * x_i       # total is 41.5 after the loop

\(p_i\): The probability of outcome i. The subscript i (written below and slightly to the right of p) connects this probability to the i-th outcome. When i = 1, this is the probability of outcome 1 (active satellite, 0.80). When i = 2, it is the probability of outcome 2 (debris, 0.15). And so on.

\(x_i\): The value (priority score) for outcome i. Same subscript convention: x with subscript 1 is the priority score when outcome 1 occurs (30), x with subscript 2 is the score when outcome 2 occurs (90), and so on.

\(p_i \cdot x_i\): The dot means multiplication. This is "probability of outcome i times value of outcome i."

Reading the whole formula in English: "For each possible outcome (from i = 1 to i = n), multiply its probability by its value. Add up all those products. That total is the expected value."

That is the calculation you already did by hand.

Expectation of a function

One more version you will see often. Instead of taking the expectation of a raw value, sometimes you take the expectation of a function applied to the outcome:

\[ \mathbb{E}[f(X)] = \sum_{i=1}^{n} p_i \cdot f(x_i) \]

Here \(f(x_i)\) means "apply the function f to outcome i, then use that result." For example, if \(f(x) = x^2\), then \(f(x_i)\) is the priority score squared.

In RL, \(f\) is usually "the total reward you collect starting from this state." In CFR, \(f\) is "the regret from taking this action." The structure is always the same: for each outcome, compute f of that outcome, weight by probability, sum up.
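
Here is the same pattern in code, using \(f(x) = x^2\) on the priority-score distribution (a minimal sketch to show the shape of the computation):

import torch

probs  = torch.tensor([0.80, 0.15, 0.05])
scores = torch.tensor([30.0, 90.0, 80.0])

# E[f(X)] with f(x) = x^2: apply f to each outcome, weight by probability, sum.
expected_f = (probs * scores ** 2).sum()
print(expected_f.item())  # 0.80*900 + 0.15*8100 + 0.05*6400 = 2255.0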

Variance: how spread out is the distribution?

Expectation gives you the average. But two distributions can have the same average while behaving very differently.

Scenario A: You always track active satellites, every single contact. Priority score is always 30. Expected priority: 30. Variance: zero. Your planning is perfectly predictable.

Scenario B: 50% chance of a priority-10 object, 50% chance of a priority-50 object. Expected priority: (0.5 × 10) + (0.5 × 50) = 30. Same average, but your actual experience swings between 10 and 50.

Variance measures the average squared distance from the expected value. "Squared distance" means you take the difference between an outcome and the expected value, then square it.

For Scenario B:

  • Outcome 1 is priority 10. Distance from expected (30) is 10 - 30 = -20. Squared: 400.
  • Outcome 2 is priority 50. Distance from expected (30) is 50 - 30 = +20. Squared: 400.
  • Expected squared distance: (0.5 × 400) + (0.5 × 400) = 400.

So variance is 400. The square root of variance is the standard deviation: √400 = 20. A typical sample lands about 20 priority points away from the mean. In Scenario A, standard deviation is zero: you always land exactly on the mean.
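
The same arithmetic in code, verifying the by-hand numbers for Scenario B:

import torch

probs  = torch.tensor([0.5, 0.5])
values = torch.tensor([10.0, 50.0])

mean     = (probs * values).sum()                 # 30.0
variance = (probs * (values - mean) ** 2).sum()   # 400.0
std_dev  = variance.sqrt()                        # 20.0
print(mean.item(), variance.item(), std_dev.item())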

Variance will come back in lesson 3 when it determines how noisy your Monte Carlo estimates are. High variance means you need more samples to get a reliable estimate.

Code

import torch
from torch.distributions import Categorical

# Our RSO probability estimate.
probs = torch.tensor([0.80, 0.15, 0.05])
dist = Categorical(probs=probs)

# Sample from the distribution: returns 0 (active sat), 1 (debris), or 2 (dead sat).
sample = dist.sample()
print(f"Sampled object type index: {sample.item()}")

# Sample many times to see the frequencies.
many_samples = dist.sample(sample_shape=(10_000,))
for i, label in enumerate(["Active sat", "Debris", "Dead sat"]):
    freq = (many_samples == i).float().mean()
    print(f"  {label}: {freq:.3f}  (expected: {probs[i]:.3f})")

Computing expected priority directly:

import torch

probs           = torch.tensor([0.80, 0.15, 0.05])
priority_scores = torch.tensor([30.0, 90.0, 80.0])

# E[priority] = sum of (p_i * x_i).
# Step 1: multiply each probability by its priority score.
products = probs * priority_scores
print(f"Products:        {products.tolist()}")  # [24.0, 13.5, 4.0]

# Step 2: sum the products.
expected = products.sum()
print(f"Expected priority: {expected.item()}")  # 41.5

Notice how the Python arithmetic directly mirrors the formula. probs * priority_scores is the elementwise multiplication of all the \(p_i \cdot x_i\) terms. .sum() is the \(\sum\) symbol.

Worked example: dwell time planning across two RSOs

You are planning sensor dwell time for two simultaneously tracked RSOs on an upcoming pass. Each object type requires different dwell times:

| Object type      | Dwell time needed (seconds) |
|------------------|-----------------------------|
| Active satellite | 5                           |
| Debris           | 15                          |
| Dead satellite   | 10                          |

Your current probability estimates:

| Object type      | RSO Alpha | RSO Beta |
|------------------|-----------|----------|
| Active satellite | 0.70      | 0.10     |
| Debris           | 0.20      | 0.80     |
| Dead satellite   | 0.10      | 0.10     |

Expected dwell for RSO Alpha:

Step 1, for each object type, multiply probability by dwell time:

  • Active satellite: 0.70 × 5 = 3.50 seconds
  • Debris: 0.20 × 15 = 3.00 seconds
  • Dead satellite: 0.10 × 10 = 1.00 second

Step 2, sum:

  • 3.50 + 3.00 + 1.00 = 7.5 seconds expected dwell

Expected dwell for RSO Beta:

Step 1:

  • Active satellite: 0.10 × 5 = 0.50 seconds
  • Debris: 0.80 × 15 = 12.00 seconds
  • Dead satellite: 0.10 × 10 = 1.00 second

Step 2:

  • 0.50 + 12.00 + 1.00 = 13.5 seconds expected dwell

Total expected dwell: 7.5 + 13.5 = 21 seconds for this pass.

If your radar has 30 seconds of dwell capacity, you are comfortable. If it has 15 seconds, you have a prioritization problem to solve. Expectation gives you the planning number.

import torch

dwell_times = torch.tensor([5.0, 15.0, 10.0])
alpha_probs = torch.tensor([0.70, 0.20, 0.10])
beta_probs  = torch.tensor([0.10, 0.80, 0.10])

alpha_dwell = (alpha_probs * dwell_times).sum()
beta_dwell  = (beta_probs  * dwell_times).sum()

print(f"Alpha expected dwell: {alpha_dwell.item():.1f}s")  # 7.5s
print(f"Beta expected dwell:  {beta_dwell.item():.1f}s")   # 13.5s
print(f"Total:                {(alpha_dwell + beta_dwell).item():.1f}s")  # 21.0s

Quiz

Lesson 2: Conditional Probability and Bayes' Rule

Where this fits

In partially observable settings, which describes almost every real SSA scenario and every game with hidden information, an agent maintains a belief: a probability distribution over what it cannot directly observe. Every time a new observation arrives, that belief gets updated. The mechanism for that update is Bayes' rule. When you later read about belief states in POMDPs, reach probabilities in CFR, or opponent modeling in multi-agent RL, you are reading about Bayes' rule with domain-specific packaging. This lesson is the unpackaged version.

A scenario: classifying an unknown radar contact

Your ground radar just flagged a new contact. You can measure the object's radar cross-section (RCS), which gives you some evidence about what type of object it might be. But RCS is noisy and imperfect: a debris fragment and a small satellite can produce similar returns.

Before the measurement, your catalog tells you that in this particular orbital regime, the objects are distributed like this:

| Object type      | Fraction in catalog |
|------------------|---------------------|
| Active satellite | 60%                 |
| Debris           | 30%                 |
| Rocket body      | 10%                 |

This is your starting belief. You believe there is a 60% chance the contact is a satellite, 30% chance it is debris, 10% chance it is a rocket body. You have not seen the RCS measurement yet.

Then the RCS measurement comes in. It shows a medium-small return. Your sensor physics models tell you how likely that specific measurement is for each object type:

| Object type      | P(this RCS reading \| object is this type) |
|------------------|--------------------------------------------|
| Active satellite | 0.70                                       |
| Debris           | 0.20                                       |
| Rocket body      | 0.40                                       |

Now you have new evidence. The question is: given that measurement, how should your beliefs change?

That question is Bayes' rule.

What is conditional probability?

Conditional probability is the probability of one thing given that you know another thing has happened.

The notation is \(P(A \mid B)\), read as "probability of A given B." The vertical bar means "given that."

For your RCS example:

  • \(P(\text{debris} \mid \text{medium-small RCS})\) means: "what is the probability that this is debris, given that we observed a medium-small RCS return?"
  • \(P(\text{medium-small RCS} \mid \text{debris})\) means: "if this were debris, how likely is this particular RCS reading?"

These look similar but they are answering completely different questions. The first is what you want to know. The second is what your sensor model gives you. Bayes' rule connects them.

Conditioning means restricting your universe

Here is a concrete way to think about conditional probability.

Imagine you have a catalog of 1,000 past contacts in this orbital regime:

  • 600 active satellites
  • 300 debris
  • 100 rocket bodies

Of those 600 active satellites, suppose 420 produced a medium-small RCS reading (that is 70% of 600). Of those 300 debris, suppose 60 produced a medium-small RCS reading (that is 20% of 300). Of those 100 rocket bodies, suppose 40 produced a medium-small RCS reading (that is 40% of 100).

Now your sensor reports a medium-small RCS. How many objects in the catalog showed that reading?

420 + 60 + 40 = 520 objects produced a medium-small RCS reading.

Of those 520 objects, how many were active satellites? 420. So:

\(P(\text{active sat} \mid \text{medium-small RCS}) = 420 / 520 \approx 0.808\)

How many were debris? 60. So:

\(P(\text{debris} \mid \text{medium-small RCS}) = 60 / 520 \approx 0.115\)

How many were rocket bodies? 40. So:

\(P(\text{rocket body} \mid \text{medium-small RCS}) = 40 / 520 \approx 0.077\)

Your belief shifted. You started at 60% / 30% / 10%. After seeing the medium-small RCS, you are now at 80.8% / 11.5% / 7.7%. The measurement strongly favored active satellites (because satellites produce this return 70% of the time, while debris produce it only 20% of the time), so the satellite probability went up and debris went down.

Bayes' rule: the formula

Bayes' rule is just the efficient way to do the catalog calculation you just did, without needing an actual catalog.

Here it is, first in words:

Updated belief = (how well the evidence fits this hypothesis) × (prior belief) / (total probability of this evidence)

Now with symbols. We call the hypothesis H (e.g., "this is debris") and the evidence E (e.g., "medium-small RCS reading"):

\[ P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)} \]

Decoding each piece:

\(P(H \mid E)\): The posterior. This is what we want: the probability of hypothesis H after seeing evidence E. "Posterior" means "after." Before the evidence, we had a prior. After the evidence, we have a posterior.

\(P(H)\): The prior. This is what we believed about H before seeing the evidence. In the catalog example, this was 0.60 for active satellite, 0.30 for debris, 0.10 for rocket body. The word "prior" means "before."

\(P(E \mid H)\): The likelihood. This is how probable the evidence is, assuming H is true. Your sensor model gives you this. "If this contact is debris, how likely is this RCS reading?" That is a likelihood.

\(P(E)\): The marginal probability of the evidence, sometimes called the normalizing constant. This is the total probability of seeing this evidence, regardless of which hypothesis is true. In the catalog, it was 520/1000 = 0.52 (the fraction of contacts that produced a medium-small reading at all).

You calculate \(P(E)\) by summing over all hypotheses: \[ P(E) = P(E \mid H_1) \cdot P(H_1) + P(E \mid H_2) \cdot P(H_2) + P(E \mid H_3) \cdot P(H_3) + \ldots \]

This just says: the total probability of seeing this evidence is the sum over each possible explanation, weighted by how likely each explanation was.

The shortcut: you usually do not compute \(P(E)\) directly. Instead, compute the numerator (\(P(E \mid H) \cdot P(H)\)) for every hypothesis, then divide each by the sum. That sum is \(P(E)\), and you get it for free as a byproduct of normalization.

Applying Bayes' rule step by step

Let us walk through the calculation systematically.

Step 1: Write down your priors.

| Hypothesis \(H\) | Prior \(P(H)\) |
|------------------|----------------|
| Active satellite | 0.60           |
| Debris           | 0.30           |
| Rocket body      | 0.10           |

Step 2: Write down the likelihoods. For each hypothesis, how probable is the evidence (medium-small RCS)?

| Hypothesis \(H\) | Likelihood \(P(E \mid H)\) |
|------------------|----------------------------|
| Active satellite | 0.70                       |
| Debris           | 0.20                       |
| Rocket body      | 0.40                       |

Step 3: Multiply prior by likelihood for each hypothesis. This gives you the unnormalized posterior.

| Hypothesis       | Prior × Likelihood  | Unnormalized posterior |
|------------------|---------------------|------------------------|
| Active satellite | 0.60 × 0.70 = 0.420 | 0.420                  |
| Debris           | 0.30 × 0.20 = 0.060 | 0.060                  |
| Rocket body      | 0.10 × 0.40 = 0.040 | 0.040                  |
| Total            |                     | 0.520                  |

Step 4: Divide each unnormalized posterior by the total. The total (0.520) is \(P(E)\). Dividing by it makes the posteriors sum to 1.

| Hypothesis       | Posterior \(P(H \mid E)\) |
|------------------|---------------------------|
| Active satellite | 0.420 / 0.520 ≈ 0.808     |
| Debris           | 0.060 / 0.520 ≈ 0.115     |
| Rocket body      | 0.040 / 0.520 ≈ 0.077     |
| Total            | 1.000                     |

These match the catalog calculation from before. Active satellite went from 60% to 81%. Debris dropped from 30% to 12%. The RCS measurement was informative: it was much more likely under "active satellite" than under "debris," so it pushed probability mass toward the satellite hypothesis.

Code

import torch

# Step 1: Priors
prior = torch.tensor([0.60, 0.30, 0.10])  # active sat, debris, rocket body

# Step 2: Likelihoods P(medium-small RCS | each hypothesis)
likelihood = torch.tensor([0.70, 0.20, 0.40])

# Step 3: Unnormalized posterior = prior * likelihood
unnormalized = prior * likelihood
print(f"Unnormalized: {unnormalized.tolist()}")  # [0.42, 0.06, 0.04]

# Step 4: Normalize so they sum to 1
posterior = unnormalized / unnormalized.sum()
print(f"Posterior:    {posterior.tolist()}")
# approximately [0.808, 0.115, 0.077]

labels = ["Active sat", "Debris    ", "Rocket body"]
for label, p in zip(labels, posterior.tolist()):
    print(f"  {label}: {p:.3f}")

That four-line calculation is the complete Bayes update. Every belief update you will see in POMDPs and multi-agent RL follows this same structure.

Sequential updates: learning from multiple observations

Bayes' rule can be applied repeatedly. Each posterior becomes the new prior for the next observation.

Suppose after the RCS reading, you also get a photometric brightness measurement. Your sensor model says that this specific brightness reading would be observed with these probabilities:

| Hypothesis       | P(this brightness \| hypothesis)        |
|------------------|-----------------------------------------|
| Active satellite | 0.30 (satellites tend to be brighter)   |
| Debris           | 0.50 (debris often has tumbling glints) |
| Rocket body      | 0.40                                    |

Use the previous posterior as the new prior:

# Continuing the previous snippet: torch, posterior, and labels are in scope.
# The previous posterior becomes the new prior.
new_prior = posterior  # [0.808, 0.115, 0.077]

# New likelihood from brightness measurement
likelihood_2 = torch.tensor([0.30, 0.50, 0.40])

# Bayes update (same four steps as before)
unnormalized_2 = new_prior * likelihood_2
posterior_2 = unnormalized_2 / unnormalized_2.sum()

for label, p in zip(labels, posterior_2.tolist()):
    print(f"  {label}: {p:.3f}")

The brightness reading favored debris (50% likely for debris vs 30% for active satellite), so the active satellite probability will drop somewhat and debris will recover. Two observations have nudged our belief, and we could apply a third, fourth, and so on.

This is exactly what a tracking filter does in SSA: it takes a stream of sensor measurements and updates a probability distribution over the object's state at each step.

The base rate trap: the most common mistake in probabilistic reasoning

There is one error so common and so important that it deserves its own section.

The mistake: ignoring the prior and treating the likelihood as if it were the posterior.

Suppose someone tells you "our sensor has a 90% detection rate for rocket bodies" and you detect a signal. A naive interpretation: "90% chance this is a rocket body."

This is almost always wrong. The correct interpretation requires the prior.

If rocket bodies represent only 1% of all objects in this orbital regime (prior = 0.01), then even with a 90% detection rate, most signals will not be rocket bodies. The vast majority of detections will instead come from the 99% of objects that are not rocket bodies, triggered at whatever false-alarm rate the sensor has for them.

Bayes' rule mechanically prevents this error because the prior appears explicitly in the formula. The moment you write down \(P(H) = 0.01\) before doing the calculation, you cannot forget it.

For SSA, this matters practically. If you are looking for adversarial satellite maneuvers and only 1 in 1,000 maneuvers is adversarial (with 999 being routine station-keeping), a detector that is 95% accurate at identifying adversarial maneuvers will still produce many false positives if you ignore the base rate.
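
To see the numbers, here is the Bayes update for that maneuver example. The 1-in-1,000 prior and the 95% detection rate come from the scenario above; the 5% false-alarm rate on routine maneuvers is an assumed value for illustration:

import torch

prior = torch.tensor([0.001, 0.999])  # [adversarial, routine]
# P(detector fires | hypothesis). The 0.95 is from the scenario;
# the 0.05 false-alarm rate is an assumption for this sketch.
likelihood = torch.tensor([0.95, 0.05])

unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()
print(f"P(adversarial | alarm) = {posterior[0].item():.4f}")  # about 0.019

Even with a 95% detector, fewer than 2% of alarms are actually adversarial under these numbers. The base rate dominates.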

Why this matters going forward

In partially observable games, an agent cannot see the full game state. All it has is its own actions and the observations it has received. At each step it maintains a belief over what the hidden state might be, and it updates that belief using Bayes' rule every time a new observation arrives. This is the belief state in a POMDP.

CFR in extensive-form games uses "reach probabilities" that track, for each decision point, how likely it is that the game reached this point under a particular strategy. These are a form of conditional probability, and updating them follows the same logic you just practiced.

When you eventually read code that says belief_state.update(observation) or reach_prob *= policy[action], you will know what is happening inside those lines.

Quiz

Lesson 3: Sampling and Monte Carlo Estimation

Where this fits

In lesson 1 you learned that expectation is a weighted sum over all possible outcomes. That works perfectly when there are three object types or six dice faces. It becomes completely impossible when there are millions of possible game trajectories, or when the quantity you want to average does not have a tidy formula. Monte Carlo estimation is how you compute expectations when direct computation is hopeless. MCTS, MCCFR, and the REINFORCE policy gradient estimator are all, at their core, Monte Carlo estimation with extra structure on top. This lesson is where those algorithms get their conceptual foundation.

The problem: some expectations cannot be computed directly

Consider a simple version of an SSA planning problem. You are deciding whether to maneuver a satellite now or wait. The outcome depends on:

  • Whether an approaching RSO turns out to be debris or an active satellite (two possibilities)
  • What orbital regime it settles into after the next atmospheric drag update (many possibilities)
  • What other operators in the region decide to do (unknown)
  • Small stochastic perturbations from solar radiation pressure (continuous, infinite possibilities)

The exact expected cost of maneuvering versus not maneuvering involves averaging over all combinations of these factors. If each factor has only 10 possible values, you have 10 × 10 × 10 × 10 = 10,000 combinations to sum over. With 20 factors, you have 10^20 combinations, far more than you could ever enumerate. With continuous values, the sum is literally infinite.

In game theory the situation is similar. Chess has roughly 10^120 possible games (the Shannon number). Computing the exact expected value of a move by summing over all of them is not possible in any practical sense.

Monte Carlo estimation is the answer: instead of summing over all possibilities, you draw a random sample, compute the quantity for that sample, and repeat. The average of many samples is a reliable estimate of the true expectation.

The core idea: sampling and averaging

Here is the key insight, stated plainly:

If you cannot enumerate all outcomes, simulate some of them and average the results.

The average of your simulated outcomes will be close to the true expectation. The more samples you take, the closer it will be.

Let us see this concretely with a simple SSA-flavored example before we look at any formulas.

Scenario: Your satellite is about to pass through a debris field. You estimate there is a 30% chance of a collision with a piece of debris large enough to cause damage. If there is a collision, the mission cost is 1,000 (arbitrary units). If there is no collision, the cost is 0.

The expected cost is straightforward here: 0.30 × 1,000 + 0.70 × 0 = 300. You can compute it directly.

But suppose you did not know how to compute it directly, and instead you simulated 10 passes through the debris field:

  • Pass 1: no collision (cost 0)
  • Pass 2: no collision (cost 0)
  • Pass 3: collision (cost 1,000)
  • Pass 4: no collision (cost 0)
  • Pass 5: no collision (cost 0)
  • Pass 6: no collision (cost 0)
  • Pass 7: collision (cost 1,000)
  • Pass 8: no collision (cost 0)
  • Pass 9: no collision (cost 0)
  • Pass 10: no collision (cost 0)

Average cost: (0 + 0 + 1000 + 0 + 0 + 0 + 1000 + 0 + 0 + 0) / 10 = 2000 / 10 = 200.

That is not exactly 300, but it is in the right ballpark. With 10 samples, you got 2 collisions instead of the "expected" 3. With 1,000 samples you would typically get something much closer to 300.
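
Scaling that simulation up takes a few lines (a sketch; the seed is arbitrary):

import torch

torch.manual_seed(0)

p_collision, cost = 0.30, 1000.0
N = 10_000
collisions = torch.rand(N) < p_collision  # True with probability 0.30
costs = collisions.float() * cost         # 1000 on collision, 0 otherwise
print(f"Estimated expected cost: {costs.mean().item():.1f}")  # close to 300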

The formula for Monte Carlo estimation

Suppose you want to estimate \(\mathbb{E}[f(X)]\), the expected value of a function \(f\) applied to a random variable \(X\).

Instead of computing the infinite (or intractable) sum, you:

  1. Draw \(N\) samples \(x_1, x_2, \ldots, x_N\) from the distribution of \(X\)
  2. Compute \(f(x_i)\) for each sample
  3. Average the results

The Monte Carlo estimate is written:

\[ \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \]

Decoding the symbols:

\(\hat{\mu}\): Read as "mu hat." The Greek letter mu (\(\mu\)) is conventional notation for a mean or expected value. The hat (\(\hat{}\)) means "estimated." So \(\hat{\mu}\) is "our estimate of the true mean." The hat distinguishes the estimate (which we computed from samples) from the true value (which we might never know exactly).

\(\frac{1}{N}\): Divide by \(N\), the number of samples. This is just computing an average.

\(\sum_{i=1}^{N}\): Add up the following thing for i from 1 to N. Same summation sign as in lesson 1, now looping over samples rather than outcomes.

\(f(x_i)\): Apply function \(f\) to the i-th sample. In the debris example, f(x) = the cost of that pass (1,000 if collision, 0 if not).

Reading it in English: "Draw N samples, compute f for each one, add them all up, divide by N to get the average." That is the entire thing.

Two properties that make this useful

Property 1: The estimate is unbiased.

"Unbiased" means that if you ran your Monte Carlo estimator many times (each time drawing a fresh set of N samples), the average of your estimates would equal the true expectation. There is no systematic error in one direction or the other. Individual runs might be too high or too low, but they are wrong in a random, symmetric way.

Property 2: The error shrinks as \(1/\sqrt{N}\).

The standard error (a measure of how wrong a typical estimate is) follows this formula:

\[ \text{standard error} = \frac{\sigma}{\sqrt{N}} \]

Where \(\sigma\) (sigma, the Greek lowercase letter for standard deviation) describes how spread out your samples are, and \(N\) is the number of samples.

Decoding: the standard error gets smaller as N gets bigger. But it shrinks by a square root factor. To get twice as accurate, you need four times as many samples. To get ten times as accurate, you need one hundred times as many samples.

Let us see what that looks like numerically for the debris example:

| Samples (N) | Typical error in collision-probability estimate |
|-------------|--------------------------------------------------|
| 10          | ±0.145 (huge)                                    |
| 100         | ±0.046                                           |
| 1,000       | ±0.014                                           |
| 10,000      | ±0.005 (about ±0.5%)                             |
| 100,000     | ±0.0014 (about ±0.1%)                            |

At 10 samples, your estimate of a 30% probability might range from 15% to 45%. At 100,000 samples, it is accurate to within a tenth of a percent. The cost of that accuracy is 10,000 times more computation.
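
The table values come straight from the \(\sigma/\sqrt{N}\) formula. For a 0/1 outcome with probability p, the per-sample standard deviation is \(\sigma = \sqrt{p(1-p)}\); a quick sketch:

p = 0.30                      # collision probability from the debris scenario
sigma = (p * (1 - p)) ** 0.5  # per-sample standard deviation, about 0.458
for N in [10, 100, 1_000, 10_000, 100_000]:
    print(f"N={N:>6}: standard error = {sigma / N ** 0.5:.4f}")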

This 1/√N trade-off is fundamental. It is why MCTS does many rollouts to improve its value estimates. It is why MCCFR accumulates regret over many iterations rather than converging in a handful. Samples are cheap compared to exact computation, but they are not free.

Watching the convergence happen

Here is a Monte Carlo estimation of a simple orbital probability: the fraction of a 95-minute low Earth orbit that a satellite spends in eclipse (behind Earth's shadow). We do not need to derive the analytic answer; we just simulate orbital positions and count how many are in shadow.

We will use an extremely simplified model: a circular orbit at 400 km altitude, approximated as a random position on a circle. A position counts as "in eclipse" if its angle from the Sun direction exceeds roughly 90 degrees plus a small correction for Earth's shadow geometry.

import torch

def estimate_eclipse_fraction(N):
    """Estimate fraction of orbit spent in eclipse via Monte Carlo."""
    # Sample N random positions along a circular orbit (uniform angles).
    angles = torch.rand(N) * 2 * torch.pi  # uniform in [0, 2*pi]
    
    # An extremely simplified eclipse model: in eclipse if angle
    # from sun direction (0 radians) is between 110 and 250 degrees.
    # (This is a rough approximation; real eclipse geometry is more complex.)
    angle_deg = torch.rad2deg(angles)
    in_eclipse = (angle_deg >= 110) & (angle_deg <= 250)
    
    return in_eclipse.float().mean().item()

# True fraction for this simplified model: (250 - 110) / 360 ≈ 0.389
true_fraction = (250 - 110) / 360
print(f"True fraction (simplified model): {true_fraction:.4f}")
print()

# Watch convergence with increasing N
for N in [10, 100, 1_000, 10_000, 100_000]:
    # Run 5 times to see the spread
    runs = [estimate_eclipse_fraction(N) for _ in range(5)]
    runs_t = torch.tensor(runs)
    mean = runs_t.mean().item()
    std  = runs_t.std().item()
    error = abs(mean - true_fraction)
    print(f"N={N:>6}: mean={mean:.4f}, std={std:.4f}, error={error:.4f}")

When you run this, you will see the error and standard deviation shrink roughly by a factor of 3 each time N increases by a factor of 10 (because √10 ≈ 3.16). That is the 1/√N convergence, made visible.

The canonical Monte Carlo example: estimating pi

This example appears in virtually every introduction to Monte Carlo methods because it makes the sampling process visually obvious.

Imagine a unit square (width 1, height 1) with a quarter-circle of radius 1 inscribed in it. The area of the square is 1. The area of the quarter-circle is π/4. So the fraction of random points that fall inside the quarter-circle is π/4, and we can estimate π by multiplying that fraction by 4.

import torch

torch.manual_seed(42)  # makes results reproducible

def estimate_pi(N):
    # Draw N random points uniformly in the unit square [0,1] x [0,1].
    points = torch.rand(N, 2)
    
    # A point (x, y) is inside the quarter-circle if x^2 + y^2 <= 1.
    # points**2 squares each coordinate. .sum(dim=1) sums x^2 + y^2 per point.
    distance_squared = (points ** 2).sum(dim=1)
    inside = distance_squared <= 1.0
    
    # Fraction inside * 4 estimates pi.
    return 4 * inside.float().mean().item()

print(f"True pi:  {torch.pi:.6f}")
print()
for N in [100, 1_000, 10_000, 100_000, 1_000_000]:
    estimate = estimate_pi(N)
    error = abs(estimate - torch.pi)  # torch.pi is a plain float, so no .item()
    print(f"N={N:>7}: estimate={estimate:.5f}, error={error:.5f}")

With N = 100, you might get 3.08 or 3.24, off by a noticeable amount. With N = 1,000,000, you will reliably land within a few thousandths of the true value.

Notice that going from 100 samples to 1,000,000 samples is a factor of 10,000 increase in computation, but the typical error only shrank by a factor of √10,000 = 100, from roughly ±0.16 to roughly ±0.0016. That is the 1/√N scaling at work. Getting two more decimal places of pi costs 10,000 times more samples.

How this connects to game-playing algorithms

When MCTS evaluates a game position, it cannot sum over all possible continuations of the game (there are too many). Instead, it runs random rollouts: simulations of a complete game from that position to the end, following some approximate policy. The fraction of those rollouts that end in a win is the Monte Carlo estimate of the winning probability from that position.

Each rollout is one sample. The winning probability estimate improves as more rollouts are run. The estimate is noisy with few rollouts and reliable with many. That is exactly the convergence behavior you just saw with pi and eclipse fraction.

The first M in MCCFR (Monte Carlo Counterfactual Regret Minimization) refers to the same idea. Instead of computing counterfactual regret over all possible game trajectories, MCCFR samples trajectories and estimates the regret from those samples. It converges to the correct solution as the number of samples grows, at the 1/√N rate.

The REINFORCE policy gradient algorithm computes the gradient of expected return by running episodes of the agent's policy and averaging the resulting gradients. Each episode is a sample. The gradient estimate improves with more episodes, again at 1/√N.

The pattern is universal: whenever "exact computation is intractable," the answer involves some variant of "sample and average."

Variance reduction: getting more accuracy without more samples

Because standard error = σ/√N, there are two ways to get a more accurate estimate:

  1. Take more samples (increase N)
  2. Reduce how spread out the samples are (decrease σ)

Techniques that reduce σ without changing what you are estimating are called variance reduction methods. You will meet several of them later:

  • Baselines in policy gradient methods: subtract a fixed value from each reward before computing the gradient estimate. Does not change the expected gradient but reduces how much it varies from estimate to estimate.
  • Outcome sampling in MCCFR: choose which game trajectories to sample based on their importance rather than uniformly at random.
  • Importance sampling: sample from a different distribution that has lower variance, then correct for the bias.

You do not need to understand these yet. Just file away that "variance reduction" means "same answer, less noise per sample," and it is an active area of research precisely because 1/√N is expensive.

Quiz

Lesson 4: Entropy, Cross-Entropy, and KL Divergence

Where this fits

Three quantities, all measuring something about probability distributions. They show up in specific and important places downstream. Policy gradient methods add entropy bonuses to encourage exploration. PPO and TRPO constrain policy updates using KL divergence. Cross-entropy is the training loss for nearly every classification network. If you have ever seen a training log print "cross-entropy loss = 0.43" or a paper say "we constrain the KL between old and new policy," this lesson is where those terms become concrete.

The good news: all three quantities reduce to one idea, which we will build up from an SSA scenario.

Starting from scratch: what is surprise?

Your space operations center receives automated alerts whenever the catalog detects a significant event. These alerts are categorized:

| Alert type                    | Probability | Your reaction                  |
|-------------------------------|-------------|--------------------------------|
| Routine conjunction warning   | 0.70        | Expected, handled by procedure |
| Debris cloud update           | 0.20        | Notable, moderate attention    |
| Uncontrolled reentry warning  | 0.08        | Significant, escalate          |
| Adversarial maneuver detected | 0.02        | Urgent, emergency response     |

When a routine conjunction warning comes in (probability 0.70), you are barely surprised. This happens all the time. When an adversarial maneuver alert comes in (probability 0.02), you are very surprised. This almost never happens.

Surprise is inversely related to probability. Common events are unsurprising. Rare events are surprising.

We can make this precise. The mathematical definition of surprise for an event with probability \(p\) is:

\[ \text{surprise}(p) = -\log(p) \]

Let us compute surprise for each alert type and see if it matches our intuition:

| Alert type                   | Probability | \(-\log(p)\) (natural log)  |
|------------------------------|-------------|-----------------------------|
| Routine conjunction warning  | 0.70        | 0.357 (not very surprising) |
| Debris cloud update          | 0.20        | 1.609 (somewhat surprising) |
| Uncontrolled reentry warning | 0.08        | 2.526 (quite surprising)    |
| Adversarial maneuver         | 0.02        | 3.912 (very surprising)     |

Rare events (low probability) get high surprise scores. Common events (high probability) get low surprise scores. A guaranteed event (probability 1.0) gets a surprise of −log(1) = 0 (no surprise at all).

Why the negative sign? Because log(p) is negative when p < 1 (log of a fraction is negative), and we want surprise to be a positive number. The negative sign flips it to positive.

Why a logarithm? Two reasons. First, independent events should have additive surprise. If two independent events each occur, you should be exactly as surprised as the sum of surprises for each individually. Logarithms turn multiplication into addition: −log(p₁ × p₂) = −log(p₁) + (−log(p₂)). Second, the logarithm grows slowly at first then quickly, capturing the intuition that going from 50% to 10% feels less dramatic than going from 2% to 0.1%, even though both are 5× reductions in probability.
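
Computing the surprise column from the table above takes two lines:

import torch

probs = torch.tensor([0.70, 0.20, 0.08, 0.02])
surprise = -torch.log(probs)  # natural log, matching the table
print([round(s, 3) for s in surprise.tolist()])  # [0.357, 1.609, 2.526, 3.912]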

Entropy: average surprise

Now here is the key question: how surprising is your alert system, on average?

You do not know which alert will come next. But you know the probabilities. The expected surprise is the average amount of surprise per alert, weighted by how often each alert type occurs.

Using the expectation formula from lesson 1:

\[ \text{average surprise} = \sum_{\text{all types}} P(\text{type}) \times \text{surprise}(\text{type}) \]

Let us compute it:

| Alert type           | Prob \(p\) | Surprise \(-\log(p)\) | Contribution \(p \times (-\log p)\) |
|----------------------|------------|-----------------------|-------------------------------------|
| Routine conjunction  | 0.70       | 0.357                 | 0.250                               |
| Debris cloud         | 0.20       | 1.609                 | 0.322                               |
| Uncontrolled reentry | 0.08       | 2.526                 | 0.202                               |
| Adversarial maneuver | 0.02       | 3.912                 | 0.078                               |
| Total                |            |                       | 0.852                               |

The average surprise is 0.852. This quantity is the entropy of the alert distribution.

Entropy measures how uncertain a distribution is. High entropy means you are often surprised (the distribution is spread out, unpredictable). Low entropy means you are rarely surprised (one or a few outcomes dominate and you almost always know what is coming).

The entropy formula

Entropy of a distribution P is written:

\[ H(P) = -\sum_{x} P(x) \log P(x) \]

Decoding each symbol:

\(H(P)\): The entropy of distribution P. H is the conventional symbol for entropy in information theory, and P is the distribution. The parentheses just mean "the entropy of P."

\(-\): The negative sign. Without it, the expression would be negative (since log of a probability < 1 is negative). The negative sign makes entropy positive.

\(\sum_{x}\): Sum over all possible outcomes x. In our alert example, x ranges over the four alert types.

\(P(x)\): The probability of outcome x under distribution P.

\(\log P(x)\): The logarithm of that probability.

\(P(x) \log P(x)\): Probability times log-probability. Note that this is different from the surprise calculation: surprise is \(-\log P(x)\), but the contribution to entropy is \(P(x) \times (-\log P(x))\), the surprise weighted by how often it occurs.

Reading in English: "For each possible outcome, multiply its probability by its log-probability, sum all those products, and negate the result."

This is just the expectation of surprise: \(H(P) = \mathbb{E}[-\log P(X)]\).

Maximum and minimum entropy

Maximum entropy occurs when the distribution is uniform: all outcomes equally likely. For four outcomes, maximum entropy would be −4 × (0.25 × log 0.25) = log 4 ≈ 1.386. A uniform policy over actions is maximally uncertain; you have no idea which action the agent will take.

Minimum entropy (zero) occurs when one outcome has probability 1 and all others have probability 0. A completely determined distribution. A deterministic policy has zero entropy; you know exactly which action it will take.

Your alert distribution (entropy ≈ 0.852) is between these extremes, closer to the minimum. You are not completely surprised on average, because most alerts are routine.

Cross-entropy: surprise when using the wrong model

Now suppose a new analyst joins your team. Based on their prior experience at a different space ops center, they have a different model of alert probabilities:

| Alert type           | True probability \(P\) | Analyst's model \(Q\) |
|----------------------|------------------------|-----------------------|
| Routine conjunction  | 0.70                   | 0.40                  |
| Debris cloud         | 0.20                   | 0.30                  |
| Uncontrolled reentry | 0.08                   | 0.20                  |
| Adversarial maneuver | 0.02                   | 0.10                  |

The analyst thinks adversarial maneuvers are much more common than they actually are (10% vs 2%), and underestimates routine conjunctions (40% vs 70%).

When alerts actually arrive (following the true distribution P), how surprised will the analyst be on average?

The analyst's surprise when alert type x occurs is \(-\log Q(x)\), because they are using their model Q to form expectations. The actual frequency of each alert type follows P. So the analyst's average surprise is:

\[ \sum_{x} P(x) \times (-\log Q(x)) \]

This is the cross-entropy of P and Q:

\[ H(P, Q) = -\sum_{x} P(x) \log Q(x) \]

Let us compute it for your analyst:

| Alert type           | True prob \(P(x)\) | Analyst surprise \(-\log Q(x)\) | Contribution |
|----------------------|--------------------|---------------------------------|--------------|
| Routine conjunction  | 0.70               | \(-\log(0.40)\) = 0.916         | 0.641        |
| Debris cloud         | 0.20               | \(-\log(0.30)\) = 1.204         | 0.241        |
| Uncontrolled reentry | 0.08               | \(-\log(0.20)\) = 1.609         | 0.129        |
| Adversarial maneuver | 0.02               | \(-\log(0.10)\) = 2.303         | 0.046        |
| Total                |                    |                                 | 1.057        |

The analyst's average surprise is 1.057, compared to 0.852 for someone who knows the true distribution. The analyst experiences more surprise than necessary because their model is wrong.

Notice: when Q = P (the analyst's model matches reality perfectly), cross-entropy equals entropy. The cross-entropy is always at least as large as the entropy, and the gap tells you how much extra surprise the wrong model causes.

KL divergence: the extra surprise from being wrong

The extra surprise caused by using model Q instead of the true distribution P is:

\[ \text{KL}(P \| Q) = H(P, Q) - H(P) \]

Or expanded:

\[ \text{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \]

For your analyst: KL = 1.057 − 0.852 = 0.205. The analyst experiences 0.205 extra units of surprise per alert because their model is miscalibrated.

Decoding the expanded formula:

\(\text{KL}(P \| Q)\): The KL divergence from P to Q. The double bars and the order matter. \(\text{KL}(P \| Q)\) asks: "if reality is P, how much extra surprise does using model Q cause?"

\(\sum_x\): Sum over all outcomes.

\(P(x)\): Weight by the actual frequency (what really happens).

\(\log \frac{P(x)}{Q(x)}\): The log-ratio. When \(Q(x) = P(x)\), this is \(\log 1 = 0\): no extra surprise for that outcome. When \(Q(x) < P(x)\), you underestimated how often x occurs, and your surprise for that outcome is higher than it should be.

Key properties:

  • KL divergence is always ≥ 0. It equals 0 only when P and Q are identical.
  • KL divergence is asymmetric: \(\text{KL}(P \| Q) \neq \text{KL}(Q \| P)\) in general. "How surprised is the analyst when reality is P and model is Q" is a different question from "how surprised is the analyst when reality is Q and model is P."
  • The asymmetry matters in ML. Minimizing \(\text{KL}(P \| Q)\) over choices of Q produces a different result than minimizing \(\text{KL}(Q \| P)\). The first makes Q try to cover all regions where P is large. The second makes Q try to concentrate wherever P is large.

Code

import torch
from torch.distributions import Categorical

# True alert distribution P
P_probs = torch.tensor([0.70, 0.20, 0.08, 0.02])
P = Categorical(probs=P_probs)

# Analyst's model Q
Q_probs = torch.tensor([0.40, 0.30, 0.20, 0.10])
Q = Categorical(probs=Q_probs)

# Entropy of P (average surprise under the true distribution)
# Using the exact formula: -sum(p * log(p))
entropy_P = -(P_probs * torch.log(P_probs)).sum()
print(f"Entropy of P:         {entropy_P.item():.4f}")  # 0.852

# PyTorch also computes it directly
print(f"Entropy via PyTorch:  {P.entropy().item():.4f}")  # same answer

# Cross-entropy: average surprise analyst experiences
# Formula: -sum(P(x) * log(Q(x)))
cross_entropy = -(P_probs * torch.log(Q_probs)).sum()
print(f"Cross-entropy H(P,Q): {cross_entropy.item():.4f}")  # 1.057

# KL divergence: the extra surprise from the wrong model
kl_PQ = torch.distributions.kl_divergence(P, Q)
print(f"KL(P || Q):           {kl_PQ.item():.4f}")  # 0.205

# Verify: cross-entropy - entropy = KL
print(f"Verification:         {(cross_entropy - entropy_P).item():.4f}")  # 0.205

# KL is asymmetric: Q || P is different from P || Q
kl_QP = torch.distributions.kl_divergence(Q, P)
print(f"KL(Q || P):           {kl_QP.item():.4f}")  # different number

Entropy of a sensor allocation policy

Let us look at entropy through an SSA operations lens: your sensor allocation policy over five candidate target objects.

import torch
from torch.distributions import Categorical

# Policy A: uniform allocation (maximum uncertainty about which target gets sensor time)
policy_A = Categorical(probs=torch.ones(5) / 5)

# Policy B: focused primarily on target 1 (satellite in highest-risk conjunction)
policy_B = Categorical(probs=torch.tensor([0.80, 0.05, 0.05, 0.05, 0.05]))

# Policy C: deterministic (always sensor target 1)
policy_C = Categorical(probs=torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0]))

print(f"H(uniform policy A):        {policy_A.entropy().item():.4f}")  # 1.6094 = log(5)
print(f"H(focused policy B):        {policy_B.entropy().item():.4f}")  # lower
print(f"H(deterministic policy C):  {policy_C.entropy().item():.4f}")  # 0.0

Policy A has the maximum possible entropy for five targets: you have no idea which target will get sensor time, and you are equally uncertain about all of them. Policy C has zero entropy: you always know exactly which target will be observed. Policy B is in between.

In RL, an agent that has learned a near-deterministic policy has low entropy: it reliably takes the action it has determined is best. An agent still in early exploration has high entropy: its policy is spread across many actions.

Why KL divergence appears in policy gradient methods

When training a neural network policy, you want to update the policy to improve expected reward. But if you take a very large gradient step, the new policy might be dramatically different from the old one, to the point where your reward estimates (which were based on the old policy) are no longer valid.

PPO and TRPO solve this by adding a constraint: the new policy should not diverge too far from the old policy, as measured by KL divergence. Specifically, they constrain:

\[ \text{KL}(\pi_\text{old} \| \pi_\text{new}) \leq \delta \]

where \(\delta\) is a small threshold (like 0.01). This says: after the update, the average extra surprise under the old policy's expectations should not exceed \(\delta\). This keeps updates stable and prevents the policy from collapsing after a lucky or unlucky batch of experience.

Now that you know what KL divergence measures, this constraint makes intuitive sense: "update the policy, but not so much that the new policy would surprise the old policy."
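
A minimal sketch of that check, with made-up old and new policies (the probabilities and the \(\delta = 0.01\) threshold are illustrative):

import torch
from torch.distributions import Categorical, kl_divergence

pi_old = Categorical(probs=torch.tensor([0.25, 0.25, 0.25, 0.25]))
pi_new = Categorical(probs=torch.tensor([0.30, 0.24, 0.24, 0.22]))

delta = 0.01  # trust-region threshold
kl = kl_divergence(pi_old, pi_new).item()
print(f"KL(old || new) = {kl:.5f}, within threshold: {kl <= delta}")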

Quiz

Lesson 5: Vectors and Dot Products

Where this fits

Every state in a reinforcement learning system, every observation your agent receives, every action embedding, every intermediate representation inside a neural network, is a vector. The dot product is the single most common operation performed on those vectors: it is how a neural network layer evaluates whether its input "matches" a learned pattern. If you can look at a vector and say what it represents, and look at a dot product and say what it is measuring, you have the geometric intuition for 90% of what deep learning does internally.

What is a vector? Start with the orbital state vector

Suppose you are tracking a satellite. At any given moment, you need to know two things to describe its complete dynamical state: where it is and how fast it is moving. In three-dimensional space, position requires three numbers and velocity requires three numbers. You need six numbers total.

You could write these numbers separately:

Position x: 6,371 km     Velocity x: 7.5 km/s
Position y: 0 km          Velocity y: 0.0 km/s
Position z: 0 km          Velocity z: 0.0 km/s

Or you could write them as an ordered list:

state = [6371, 0, 0, 7.5, 0.0, 0.0]

That ordered list is a vector. It contains six numbers, so we say it is a "six-dimensional vector" or a "vector of length 6." The order matters: the first number is always the x-position, the fourth is always the x-velocity, and so on.

A vector is nothing more than an ordered list of numbers. The numbers can represent anything: positions, velocities, sensor readings, action probabilities, learned features. The abstract mathematical concept does not care what the numbers mean; it just provides tools for working with lists of numbers.

Here are some vectors that appear constantly in our work:

A 6D orbital state vector (position + velocity in Earth-centered inertial frame):

[x, y, z, vx, vy, vz]
[6371.0, 500.0, -200.0, 7.2, 0.3, -0.1]

A 4D action probability distribution for an RL agent with 4 possible actions:

[P(action 0), P(action 1), P(action 2), P(action 3)]
[0.10, 0.20, 0.30, 0.40]

A 3D observation vector from a tracking sensor (range, angle, angular rate):

[range_km, azimuth_deg, elevation_deg]
[850.3, 42.1, 15.6]

Each of these is just a list of numbers. The mathematics of vectors applies equally to all of them.
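
In PyTorch, each of these is a one-dimensional tensor, and the shape tells you the dimensionality:

import torch

state  = torch.tensor([6371.0, 500.0, -200.0, 7.2, 0.3, -0.1])  # 6D orbital state
policy = torch.tensor([0.10, 0.20, 0.30, 0.40])                 # 4D action distribution
obs    = torch.tensor([850.3, 42.1, 15.6])                      # 3D sensor observation

print(state.shape, policy.shape, obs.shape)
# torch.Size([6]) torch.Size([4]) torch.Size([3])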

Visualizing vectors as arrows

In two dimensions, a vector [a, b] can be drawn as an arrow starting at the origin and ending at the point (a, b). The length of the arrow is the magnitude of the vector, and the direction the arrow points captures the relationship between the two components.

This geometric picture generalizes to higher dimensions even though we cannot draw a 6D vector. The key properties of vectors as arrows:

  • Longer arrows represent vectors with larger magnitude (we will make this precise shortly with norms)
  • Direction captures the ratio and sign relationship between components
  • Two arrows pointing in the same direction represent vectors with the same ratios between components, even if one is longer

For two velocity vectors representing satellites in similar orbits:

v1 = [7.5, 0.0, 0.1]   # moving mostly in +x direction
v2 = [7.2, 0.3, 0.0]   # also mostly in +x, slightly in +y

These arrows point in nearly the same direction. Both satellites are moving primarily along the x-axis with small components in other directions. This "similarity of direction" is exactly what the dot product measures.

The length of a vector: norms

Before we talk about dot products, we need to know how to measure the length of a vector.

For a 2D vector [a, b], the Pythagorean theorem gives the length: √(a² + b²). A vector [3, 4] has length √(9 + 16) = √25 = 5.

For a vector of any length [v₁, v₂, ..., vₙ], the same idea extends:

\[ \|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2} = \sqrt{\sum_{i=1}^{n} v_i^2} \]

Decoding each symbol:

\(|\mathbf{v}|\): The "norm" or "magnitude" of vector v. The double vertical bars mean "length of." The bold v (as opposed to a plain v) indicates that v is a vector rather than a single number.

\(\sqrt{\ldots}\): Square root.

\(v_i^2\): The i-th component of the vector, squared. The subscript i selects one component.

\(\sum_{i=1}^{n}\): Sum all n components.

In plain English: "Square every component, add them all up, take the square root." This is the Euclidean distance from the origin to the point represented by the vector.

For the orbital state vector example:

v = [6371.0, 500.0, -200.0, 7.2, 0.3, -0.1]
||v|| = sqrt(6371^2 + 500^2 + 200^2 + 7.2^2 + 0.3^2 + 0.1^2)
      = sqrt(40,589,641 + 250,000 + 40,000 + 51.84 + 0.09 + 0.01)
      ≈ sqrt(40,879,693)
      ≈ 6,394 (a mix of km and km/s, so not physically meaningful,
               but mathematically valid)

In code:

import torch
v = torch.tensor([6371.0, 500.0, -200.0, 7.2, 0.3, -0.1])
norm = torch.linalg.norm(v)
print(norm.item())  # approximately 6394

The dot product: measuring alignment

The dot product of two vectors of the same length is computed by:

  1. Multiplying corresponding components together
  2. Adding all the products

For vectors v = [v₁, v₂, ..., vₙ] and w = [w₁, w₂, ..., wₙ]:

\[ \mathbf{v} \cdot \mathbf{w} = v_1 w_1 + v_2 w_2 + \ldots + v_n w_n = \sum_{i=1}^{n} v_i w_i \]

Decoding:

\(\mathbf{v} \cdot \mathbf{w}\): "The dot product of v and w." The bold letters indicate vectors. The centered dot is the dot product operation (not regular multiplication, which would give a vector).

\(v_i w_i\): Component i of v times component i of w. Subscripts connect corresponding components.

\(\sum_{i=1}^{n} v_i w_i\): Add up all those pairwise products.

Let us compute it step by step for two small vectors:

v = [2, 3, -1]
w = [4, -2, 5]

Step 1: Multiply corresponding components
  2 × 4  = 8
  3 × (-2) = -6
  (-1) × 5 = -5

Step 2: Add the products
  8 + (-6) + (-5) = -3

Dot product: -3

In code:

import torch
v = torch.tensor([2.0, 3.0, -1.0])
w = torch.tensor([4.0, -2.0, 5.0])

# Three equivalent ways to compute the dot product
print((v * w).sum().item())      # -3.0
print(torch.dot(v, w).item())    # -3.0
print((v @ w).item())            # -3.0

What the dot product is measuring: alignment

The arithmetic definition is straightforward. But what does the dot product actually tell us?

The dot product has a geometric interpretation:

\[ \mathbf{v} \cdot \mathbf{w} = \|\mathbf{v}\| \, \|\mathbf{w}\| \, \cos(\theta) \]

where \(\theta\) (the Greek letter theta) is the angle between the two vectors.

Decoding:

\(\|\mathbf{v}\|\): The norm (length) of v. \(\|\mathbf{w}\|\): The norm (length) of w. \(\cos(\theta)\): The cosine of the angle between them.

You do not need to remember the details of cosine, but you need to know these key facts:

  • cos(0°) = 1: vectors pointing in exactly the same direction
  • cos(90°) = 0: vectors that are perpendicular (at right angles)
  • cos(180°) = -1: vectors pointing in exactly opposite directions

So what does the dot product tell you?

  • Large positive dot product: the vectors point in roughly the same direction
  • Zero dot product: the vectors are perpendicular (completely "unrelated" in direction)
  • Large negative dot product: the vectors point in roughly opposite directions

In SSA terms: if two satellites have velocity vectors with a large positive dot product, they are moving in roughly the same direction. Their relative speed is low and they will not approach each other quickly. If their dot product is large and negative, they are moving toward each other on nearly head-on trajectories: higher collision risk. This is not how real conjunction analysis works, but the intuition is correct.

An SSA example: comparing approach geometries

Two satellites are on potential collision courses. You want a quick sense of how "head-on" versus "overtaking" the geometry is.

import torch

# Satellite 1 moving in +x direction at orbital velocity
v1 = torch.tensor([7.5, 0.0, 0.0])  # km/s

# Case A: satellite 2 moving in -x direction (head-on collision geometry)
v2_headon = torch.tensor([-7.5, 0.0, 0.0])

# Case B: satellite 2 moving in +x direction but slower (overtaking geometry)
v2_overtake = torch.tensor([6.8, 0.1, 0.0])

# Case C: satellite 2 on an inclined orbit (crossing geometry)
v2_cross = torch.tensor([0.0, 7.5, 0.0])

# Dot products
dot_headon   = torch.dot(v1, v2_headon)
dot_overtake = torch.dot(v1, v2_overtake)
dot_cross    = torch.dot(v1, v2_cross)

print(f"Head-on geometry:    dot product = {dot_headon.item():.1f}")    # -56.25 (strongly negative)
print(f"Overtaking geometry: dot product = {dot_overtake.item():.1f}")  # +51.0  (positive)
print(f"Crossing geometry:   dot product = {dot_cross.item():.1f}")     # 0.0    (perpendicular)

# Cosine similarity: normalize out the lengths to get just the direction
norm_v1 = torch.linalg.norm(v1)
norm_headon   = torch.linalg.norm(v2_headon)
norm_overtake = torch.linalg.norm(v2_overtake)
norm_cross    = torch.linalg.norm(v2_cross)

cos_headon   = dot_headon   / (norm_v1 * norm_headon)
cos_overtake = dot_overtake / (norm_v1 * norm_overtake)
cos_cross    = dot_cross    / (norm_v1 * norm_cross)

print(f"\nCosine similarity:")
print(f"Head-on:    {cos_headon.item():.3f}   (angle: {torch.rad2deg(torch.acos(cos_headon)).item():.1f}°)")
print(f"Overtaking: {cos_overtake.item():.3f}   (angle: {torch.rad2deg(torch.acos(cos_overtake)).item():.1f}°)")
print(f"Crossing:   {cos_cross.item():.3f}   (angle: {torch.rad2deg(torch.acos(cos_cross)).item():.1f}°)")

The head-on geometry gives a cosine of -1 (angle 180°), the crossing geometry gives zero (exactly 90°), and the overtaking geometry gives a value near +1 (small angle, similar direction).

Dot products as scoring: the bridge to neural networks

Here is the connection to machine learning that makes the dot product so important.

Suppose you want to score how much an observation "favors" a particular action. For example, you are operating a sensor, and based on the current observation vector (describing the state of the space environment), you want to score each possible pointing action.

You define a weight vector \(\mathbf{w}\) for each action. The weight vector describes what kind of observation the action is best suited for. The score for taking that action given observation \(\mathbf{o}\) is the dot product \(\mathbf{w} \cdot \mathbf{o}\).

If the observation looks like what the weight vector describes (same direction, high alignment), the score is high. If the observation is perpendicular or opposite to the weight vector, the score is low or negative.

This is exactly what a single neuron in a neural network computes. The neuron has a learned weight vector; its pre-activation output is the dot product of that weight vector with the input (plus a bias term we will meet in the next lesson). The network learns weight vectors that give high scores to the kinds of inputs that should lead to good outputs.

In the next lesson, we will stack many neurons in parallel. That will give us matrix-vector multiplication: the operation that defines a neural network layer.

Worked example: hand-computing a dot product for a sensor scoring task

Your sensor has a 4D observation vector describing the current environment:

o = [conjunction_risk, debris_density, solar_activity, comms_window_fraction]
o = [0.8, 0.2, 0.1, 0.6]

You have two candidate sensor pointing strategies, each represented by a weight vector that describes what conditions each strategy "cares about":

Strategy A (conjunction-focused): w_A = [1.0, 0.3, 0.0, 0.2]
  (heavily weights conjunction risk, somewhat weights debris)

Strategy B (comms-window-focused): w_B = [0.1, 0.0, 0.0, 1.0]
  (mainly weights communications window availability)

Score for Strategy A:

Step 1: Multiply corresponding components:

  • 0.8 × 1.0 = 0.80
  • 0.2 × 0.3 = 0.06
  • 0.1 × 0.0 = 0.00
  • 0.6 × 0.2 = 0.12

Step 2: Add:

  • 0.80 + 0.06 + 0.00 + 0.12 = 0.98

Score for Strategy B:

Step 1:

  • 0.8 × 0.1 = 0.08
  • 0.2 × 0.0 = 0.00
  • 0.1 × 0.0 = 0.00
  • 0.6 × 1.0 = 0.60

Step 2:

  • 0.08 + 0.00 + 0.00 + 0.60 = 0.68

Strategy A scores 0.98, Strategy B scores 0.68. Given the current high conjunction risk (0.8) and available comms window (0.6), the conjunction-focused strategy is more strongly indicated by the dot-product scoring.

import torch

o   = torch.tensor([0.8, 0.2, 0.1, 0.6])
w_A = torch.tensor([1.0, 0.3, 0.0, 0.2])
w_B = torch.tensor([0.1, 0.0, 0.0, 1.0])

score_A = torch.dot(o, w_A)
score_B = torch.dot(o, w_B)
print(f"Strategy A score: {score_A.item():.2f}")  # 0.98
print(f"Strategy B score: {score_B.item():.2f}")  # 0.68

In a neural network, the weight vectors are learned from data rather than hand-designed. But the scoring mechanism is exactly this dot product. When we stack many weight vectors in the next lesson, we compute scores for many strategies simultaneously. That is a matrix-vector multiplication.
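
As a preview, here is the same scoring with the two weight vectors stacked as rows of a single matrix (reusing the hypothetical o, w_A, w_B from above); one @ computes both dot products at once:

import torch

o = torch.tensor([0.8, 0.2, 0.1, 0.6])

W = torch.stack([
    torch.tensor([1.0, 0.3, 0.0, 0.2]),   # row 1: Strategy A's weights
    torch.tensor([0.1, 0.0, 0.0, 1.0]),   # row 2: Strategy B's weights
])

scores = W @ o   # one matrix-vector product = both dot products at once
print(scores.tolist())   # ≈ [0.98, 0.68]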

Quiz

Lesson 6: Matrices and Matrix-Vector Multiplication

Where this fits

In lesson 5, you saw that the dot product of a weight vector and an observation vector scores how well the observation matches that weight vector's "interest." A neural network layer does that simultaneously for many weight vectors at once, producing a score for each one. That simultaneous scoring is matrix-vector multiplication. Once you understand what \(W\mathbf{x} + \mathbf{b}\) computes, you know what a neural network layer does. Every modern deep learning architecture, from the policy networks in AlphaZero to the value networks in deep CFR, is built by stacking this operation repeatedly with nonlinearities in between.

What is a matrix?

A matrix is a rectangular grid of numbers arranged in rows and columns.

When we say a matrix is "m by n" (written m × n), we mean it has:

  • m rows (horizontal lines of numbers)
  • n columns (vertical lines of numbers)

Here is a 3 × 4 matrix (3 rows, 4 columns):

\[ W = \begin{pmatrix} 1 & 0 & -1 & 2 \\ 0 & 1 & 1 & 0 \\ -1 & 1 & 0 & 1 \end{pmatrix} \]

Each row is a list of 4 numbers. There are 3 such rows. In total, the matrix contains 3 × 4 = 12 numbers.

We refer to individual entries using row and column indices. The notation \(W_{ij}\) means "the entry in row \(i\), column \(j\)." Row index first, column index second.

From the matrix above:

  • \(W_{11} = 1\) (row 1, column 1)
  • \(W_{13} = -1\) (row 1, column 3)
  • \(W_{23} = 1\) (row 2, column 3)
  • \(W_{31} = -1\) (row 3, column 1)

The key insight: each row of a matrix is a vector. A 3 × 4 matrix contains three row-vectors, each of length 4. Matrix-vector multiplication uses each of those row vectors to compute a dot product.
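
In code, the same matrix and the same entry lookups look like this. Note that PyTorch indexes from 0, so the math notation \(W_{13}\) (row 1, column 3) becomes W[0, 2]:

import torch

W = torch.tensor([
    [ 1.0, 0.0, -1.0, 2.0],
    [ 0.0, 1.0,  1.0, 0.0],
    [-1.0, 1.0,  0.0, 1.0],
])

print(W.shape)          # torch.Size([3, 4]): 3 rows, 4 columns
print(W[0, 2].item())   # -1.0: math's W_13 (row 1, column 3)
print(W[2, 0].item())   # -1.0: math's W_31 (row 3, column 1)
print(W[0])             # row 1 as a length-4 vector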

Matrix-vector multiplication: the core idea

Suppose we have a weight matrix \(W\) with shape m × n and an input vector \(\mathbf{x}\) of length n. The matrix-vector product \(W\mathbf{x}\) produces an output vector \(\mathbf{y}\) of length m.

The rule: each entry of the output \(\mathbf{y}\) is the dot product of one row of \(W\) with the input \(\mathbf{x}\).

Specifically:

  • \(y_1\) = (row 1 of W) · x
  • \(y_2\) = (row 2 of W) · x
  • ...
  • \(y_m\) = (row m of W) · x

In formula form:

\[ y_i = \sum_{j=1}^{n} W_{ij} \cdot x_j \]

Decoding:

\(y_i\): The i-th component of the output vector.

\(\sum_{j=1}^{n}\): Sum over j from 1 to n. This loops through the columns.

\(W_{ij}\): The entry in row i, column j of the weight matrix.

\(x_j\): The j-th component of the input vector.

\(W_{ij} \cdot x_j\): Multiply the matrix entry by the input component.

Reading in English: "The i-th output is computed by taking each entry in row i of W, multiplying it by the corresponding entry in x, and adding all those products up." That is a dot product.

Step-by-step example

Let us work through a complete example by hand.

Scenario: You have a sensor processing pipeline. Your sensor returns a 4-dimensional observation:

\[ \mathbf{x} = \begin{pmatrix} 0.8 \\ 0.2 \\ 0.1 \\ 0.6 \end{pmatrix} \]

These represent [conjunction_risk, debris_density, solar_activity, comms_window].

You want to compute scores for 3 possible operational responses. Your scoring matrix (one row per response, one column per observation feature) is:

\[ W = \begin{pmatrix} 1.0 & 0.5 & 0.0 & 0.2 \\ 0.1 & 0.0 & 0.0 & 1.0 \\ 0.3 & 0.8 & 0.2 & 0.1 \end{pmatrix} \]

Row 1 weights for Response A (conjunction-focused). Row 2 weights for Response B (comms-focused). Row 3 weights for Response C (debris-monitoring).

Computing y = Wx:

Output y₁ (score for Response A): Dot product of row 1 with x:

Row 1 entry   ×   x entry    =   Product
1.0 (col 1)   ×   0.8 (x₁)   =   0.80
0.5 (col 2)   ×   0.2 (x₂)   =   0.10
0.0 (col 3)   ×   0.1 (x₃)   =   0.00
0.2 (col 4)   ×   0.6 (x₄)   =   0.12
                            Sum:   1.02

Output y₂ (score for Response B): Dot product of row 2 with x:

Row 2 entry   ×   x entry    =   Product
0.1 (col 1)   ×   0.8 (x₁)   =   0.08
0.0 (col 2)   ×   0.2 (x₂)   =   0.00
0.0 (col 3)   ×   0.1 (x₃)   =   0.00
1.0 (col 4)   ×   0.6 (x₄)   =   0.60
                            Sum:   0.68

Output y₃ (score for Response C): Dot product of row 3 with x:

Row 3 entry   ×   x entry    =   Product
0.3 (col 1)   ×   0.8 (x₁)   =   0.24
0.8 (col 2)   ×   0.2 (x₂)   =   0.16
0.2 (col 3)   ×   0.1 (x₃)   =   0.02
0.1 (col 4)   ×   0.6 (x₄)   =   0.06
                            Sum:   0.48

Result:

\[ \mathbf{y} = W\mathbf{x} = \begin{pmatrix} 1.02 \\ 0.68 \\ 0.48 \end{pmatrix} \]

Response A scores highest (1.02), Response B is second (0.68), Response C is lowest (0.48). Given the high conjunction risk (0.8) in the input, the conjunction-focused response dominates. Makes operational sense.

In code:

import torch

W = torch.tensor([
    [1.0, 0.5, 0.0, 0.2],
    [0.1, 0.0, 0.0, 1.0],
    [0.3, 0.8, 0.2, 0.1]
])

x = torch.tensor([0.8, 0.2, 0.1, 0.6])

y = W @ x  # @ is the matrix-vector multiplication operator in Python
print(y.tolist())  # [1.02, 0.68, 0.48]

# Verify by computing row 1's dot product manually
row1_dot = torch.dot(W[0], x)
print(f"Row 1 dot product: {row1_dot.item()}")  # 1.02

The @ operator is Python's matrix multiplication operator. For a matrix times a vector, it does exactly the row-by-row dot products you just computed by hand.

Shape rules: why dimensions must match

A matrix-vector multiplication \(W\mathbf{x}\) is only defined when the number of columns in \(W\) matches the length of \(\mathbf{x}\).

  • If \(W\) is m × n and \(\mathbf{x}\) has length n, the result \(\mathbf{y} = W\mathbf{x}\) has length m.
  • The "inner" dimension (columns of W, length of x) must match.
  • The "outer" dimensions (rows of W, length of output) determine the result's shape.

In our example: W is 3 × 4, x has length 4. The 4s match (column dimension of W equals length of x). The result has length 3 (the number of rows).

This matters practically: if you have an observation of length 4 and want to compute scores for 3 responses, your weight matrix must be 3 × 4. Not 4 × 3. Not 3 × 3. The dimensions encode the data flow.
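
PyTorch enforces these shape rules at the call site; a mismatched inner dimension raises an error rather than silently producing wrong numbers:

import torch

W = torch.rand(3, 4)     # 3 outputs, expects length-4 inputs
x_good = torch.rand(4)
x_bad  = torch.rand(3)

print((W @ x_good).shape)   # torch.Size([3]): inner dimensions match

try:
    W @ x_bad               # inner dimensions are 4 and 3: undefined
except RuntimeError as e:
    print(f"Shape error: {e}")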

Adding a bias: the full neural network layer

In a real neural network layer, matrix multiplication is followed by adding a bias vector \(\mathbf{b}\):

\[ \mathbf{y} = W\mathbf{x} + \mathbf{b} \]

The bias vector \(\mathbf{b}\) has length m (same as the output). Adding the bias shifts each output score by a fixed amount, regardless of the input. This lets the network set a baseline level for each output even when the input is zero.

Extending the example:

\[ \mathbf{b} = \begin{pmatrix} -0.5 \\ 0.3 \\ -0.1 \end{pmatrix} \]

\[ \mathbf{y} = W\mathbf{x} + \mathbf{b} = \begin{pmatrix} 1.02 \\ 0.68 \\ 0.48 \end{pmatrix} + \begin{pmatrix} -0.5 \\ 0.3 \\ -0.1 \end{pmatrix} = \begin{pmatrix} 0.52 \\ 0.98 \\ 0.38 \end{pmatrix} \]

Now Response B scores highest. The bias shifted the scores, making Response B look more attractive even though its raw dot product was second. In a learned network, the bias values are adjusted during training to capture the prior attractiveness of each output independently of the input.
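
In code, the bias is one extra addition on top of the earlier example:

import torch

W = torch.tensor([
    [1.0, 0.5, 0.0, 0.2],
    [0.1, 0.0, 0.0, 1.0],
    [0.3, 0.8, 0.2, 0.1],
])
x = torch.tensor([0.8, 0.2, 0.1, 0.6])
b = torch.tensor([-0.5, 0.3, -0.1])

y = W @ x + b   # the full linear layer: matrix multiply, then shift by the bias
print(y.tolist())   # ≈ [0.52, 0.98, 0.38]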

PyTorch's nn.Linear: what it does internally

PyTorch's nn.Linear module is a pre-packaged version of \(W\mathbf{x} + \mathbf{b}\):

import torch
import torch.nn as nn

# Create a linear layer: input dimension 4, output dimension 3
layer = nn.Linear(in_features=4, out_features=3)

# What does it contain?
print(f"Weight shape: {layer.weight.shape}")  # torch.Size([3, 4]) = 3 rows, 4 columns
print(f"Bias shape:   {layer.bias.shape}")    # torch.Size([3])   = 3 entries

# Apply it to an input
x = torch.tensor([0.8, 0.2, 0.1, 0.6])
y = layer(x)
print(f"Output shape: {y.shape}")  # torch.Size([3])

# Verify it is computing W @ x + b
y_manual = layer.weight @ x + layer.bias
print(f"Manual matches: {torch.allclose(y, y_manual)}")  # True

The weight matrix is stored as shape (out_features, in_features), meaning rows correspond to output dimensions and columns correspond to input dimensions. This is the same convention we have been using: each row is a weight vector for one output neuron, and its dot product with the input gives that neuron's pre-activation value.

Why stacking layers requires nonlinearities

You might wonder: if each layer is just \(W\mathbf{x} + \mathbf{b}\), what happens when you stack two layers?

y = W₂(W₁x + b₁) + b₂
  = W₂W₁x + W₂b₁ + b₂
  = W'x + b'

Where \(W' = W_2 W_1\) and \(b' = W_2 b_1 + b_2\). Stacking two linear layers gives you... another linear layer. The composition of linear functions is still linear.

This means that without anything else, a deep network with many layers would be no more powerful than a single layer. It could only learn linear transformations of the input.

What breaks this is the activation function: a nonlinear function applied elementwise to the output of each layer before passing it to the next. The most common is ReLU (Rectified Linear Unit): max(0, x). It is literally just: if the value is negative, set it to zero. If positive, leave it alone.

With a nonlinearity between layers, the composition is no longer equivalent to a single linear layer. The network can represent curved decision boundaries, complex patterns, and sophisticated functions of its input. That is what makes deep neural networks powerful.
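
You can verify the collapse numerically. The sketch below composes two random linear layers, confirms the result equals a single collapsed layer, then shows a ReLU breaking the equivalence:

import torch

torch.manual_seed(0)
W1, b1 = torch.randn(5, 4), torch.randn(5)
W2, b2 = torch.randn(3, 5), torch.randn(3)
x = torch.randn(4)

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse to one: W' = W2 W1, b' = W2 b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(torch.allclose(two_layers, one_layer))   # True: still linear

# Insert a ReLU between the layers and the collapse no longer holds
with_relu = W2 @ torch.relu(W1 @ x + b1) + b2
print(torch.allclose(with_relu, one_layer))    # False (almost surely)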

In module 2 you will see this in full, including how the weights W are learned from data using the gradients from lesson 7. For now, just hold onto the idea: a layer is \(W\mathbf{x} + \mathbf{b}\), you can compute it as row-by-row dot products, and the W and b are what the network learns.

Quiz

Lesson 7: Derivatives, Gradients, and the Chain Rule

Where this fits

This is the final foundational lesson before we start building neural networks. Every learning algorithm in the rest of this curriculum trains by gradient descent: compute a loss function, figure out which direction to adjust the parameters to reduce that loss, take a small step in that direction. The gradient is the mathematical object that tells you which direction that is. The chain rule is what makes computing the gradient tractable when the loss function involves a long composition of operations (which it always does in a neural network). Backpropagation is the chain rule applied systematically to a computational graph. If you understand this lesson, the training of neural networks is bookkeeping, not magic.

What is a derivative? Starting from slope

Suppose you are controlling a satellite's orbit-raising thruster. The altitude of the satellite (in km) after a burn of duration t seconds is given by some function:

altitude = h(t)

If you increase the burn duration by a tiny amount, how much does the altitude change?

That is the question a derivative answers. The derivative of h with respect to t is the rate of change of h as t changes. It tells you:

"If I change t by a tiny amount Δt (delta-t, a small change), the altitude changes by approximately h'(t) × Δt."

The notation \(h'(t)\) (read "h prime of t") is one way to write the derivative. Another common notation is \(\frac{dh}{dt}\) (read "dh by dt"), which emphasizes that we are asking how much h changes per unit change in t.

A concrete simple example

Let the altitude function be \(h(t) = t^2\) (a simple made-up example for illustration).

At \(t = 3\) seconds:

  • Current altitude: \(h(3) = 3^2 = 9\) km
  • One second later: \(h(4) = 4^2 = 16\) km
  • Change over 1 second: 16 - 9 = 7 km

If we zoom in to a much smaller interval (0.01 seconds):

  • \(h(3.01) = 3.01^2 = 9.0601\) km
  • Change over 0.01 seconds: 9.0601 - 9 = 0.0601 km
  • Rate of change: 0.0601 / 0.01 = 6.01 km/s

If we zoom in even more (0.001 seconds):

  • \(h(3.001) = 3.001^2 = 9.006001\) km
  • Change over 0.001 seconds: 9.006001 - 9 = 0.006001 km
  • Rate of change: 0.006001 / 0.001 = 6.001 km/s

As we take smaller and smaller intervals, the rate of change converges to exactly 6 km/s. The derivative of \(h(t) = t^2\) at \(t = 3\) is 6.

This derivative can be computed using calculus rules that you do not need to derive yourself. For the function \(h(t) = t^2\), the derivative is \(h'(t) = 2t\). At \(t = 3\): \(h'(3) = 2 \times 3 = 6\). This matches what we computed numerically.
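
The shrinking-interval computation is easy to script; this sketch reproduces the numbers above and pushes the interval smaller still:

def h(t):
    return t ** 2

t0 = 3.0
for dt in [1.0, 0.01, 0.001, 1e-6]:
    rate = (h(t0 + dt) - h(t0)) / dt
    print(f"dt = {dt}: rate of change = {rate:.6f}")
# The rates approach 6.0, which is h'(t) = 2t evaluated at t = 3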

The derivative as slope: if you plotted \(h(t)\) on a graph and drew a tangent line at \(t = 3\), the slope of that tangent line would be 6. That is all a derivative is: the slope of the function at a specific point.

Key interpretations of the derivative:

  • Positive derivative: the function is increasing at this point. Moving t in the positive direction increases h.
  • Negative derivative: the function is decreasing. Moving t in the positive direction decreases h.
  • Zero derivative: you are at a flat spot (a local minimum, local maximum, or saddle point).

Partial derivatives: functions of multiple inputs

In lesson 5 you saw that a sensor scoring function might take a 4-dimensional input:

score = f(conjunction_risk, debris_density, solar_activity, comms_window)
score = f(x₁, x₂, x₃, x₄)

The output depends on all four inputs simultaneously. A derivative asks "how does the output change if I vary one input?" When there are multiple inputs, we need to specify which input we are varying. That is what a partial derivative does.

The partial derivative of \(f\) with respect to \(x_1\) is written \(\frac{\partial f}{\partial x_1}\) (using the curly ∂ instead of d to indicate "partial"). It means: "how does f change if I vary \(x_1\) while holding \(x_2, x_3, x_4\) fixed?"

The curly ∂ symbol (read "partial," sometimes "del") is just a stylistic convention to distinguish partial derivatives from regular derivatives. It means the same thing: rate of change, but with respect to one specific variable.

A concrete partial derivative example

Let us use a simple two-variable function: the combined risk score:

\[ \text{risk} = f(c, d) = c^2 + 2cd + d \]

where c = conjunction_risk and d = debris_density.

At the point (c = 0.5, d = 0.3):

  • \(f(0.5, 0.3) = 0.5^2 + 2(0.5)(0.3) + 0.3 = 0.25 + 0.30 + 0.30 = 0.85\)

Partial derivative with respect to c (∂f/∂c):

Treat d as a constant and differentiate with respect to c:

  • \(c^2\) differentiates to \(2c\)
  • \(2cd\) differentiates to \(2d\) (since d is treated as a constant)
  • \(d\) differentiates to \(0\) (no c dependence)

Result: \(\frac{\partial f}{\partial c} = 2c + 2d\)

At our point: \(2(0.5) + 2(0.3) = 1.0 + 0.6 = 1.6\)

This means: if I increase conjunction_risk by a small amount while keeping debris_density fixed, the risk score increases by approximately 1.6 times that amount.

Partial derivative with respect to d (∂f/∂d):

Treat c as a constant:

  • \(c^2\) → 0 (no d dependence)
  • \(2cd\) → 2c (c is constant)
  • \(d\) → 1

Result: \(\frac{\partial f}{\partial d} = 2c + 1\)

At our point: \(2(0.5) + 1 = 2.0\)

This means: if I increase debris_density by a small amount while keeping conjunction_risk fixed, the risk score increases by approximately 2.0 times that amount.
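
A quick numerical check of both partials, using the same hold-everything-else-fixed idea (perturb one input at a time and watch the output):

def f(c, d):
    return c**2 + 2*c*d + d

c0, d0, eps = 0.5, 0.3, 1e-6

df_dc = (f(c0 + eps, d0) - f(c0, d0)) / eps   # vary c, hold d fixed
df_dd = (f(c0, d0 + eps) - f(c0, d0)) / eps   # vary d, hold c fixed
print(f"df/dc ≈ {df_dc:.4f}")   # ≈ 1.6 (analytic: 2c + 2d)
print(f"df/dd ≈ {df_dd:.4f}")   # ≈ 2.0 (analytic: 2c + 1)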

The gradient: all partial derivatives together

The gradient collects all the partial derivatives of a function into a single vector. For a function \(f(x_1, x_2, \ldots, x_n)\), the gradient is:

\[ \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) \]

Decoding:

\(\nabla f\): The gradient of f. The symbol ∇ is called "nabla" or "del." Read it as "the gradient of f" or "grad f."

\(\left( \ldots \right)\): A vector (the parentheses enclose the components of the gradient vector).

\(\frac{\partial f}{\partial x_i}\): The partial derivative with respect to the i-th input. This is one component of the gradient vector.

Reading in English: "The gradient is a vector containing the partial derivative of f with respect to each of its inputs."

For our risk function at (c = 0.5, d = 0.3):

\[ \nabla f = \left( \frac{\partial f}{\partial c}, \frac{\partial f}{\partial d} \right) = (1.6, 2.0) \]

What the gradient tells you: the gradient points in the direction that increases f most steeply. Each component of the gradient tells you how sensitive f is to changes in that input. A large gradient component means f is very sensitive to that input. A small component means f barely changes when that input changes.

For gradient descent (the training algorithm for neural networks), you want to minimize f (the loss). So you move in the direction opposite to the gradient: decrease each parameter by a small multiple of the gradient component for that parameter. That small multiple is the learning rate.

The chain rule: derivatives through compositions

Neural networks are compositions of functions. The input goes through layer 1, then layer 2, then layer 3, and so on. Each layer applies \(W\mathbf{x} + \mathbf{b}\) followed by a nonlinearity. If you want to compute how the final output (the loss) changes as you change a weight in layer 1, you need to trace the effect all the way through every subsequent layer.

The chain rule tells you how to do this.

Simple case: two composed functions

If \(y = f(u)\) and \(u = g(x)\), so overall \(y = f(g(x))\), then:

\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \]

Decoding:

\(\frac{dy}{dx}\): How much does y change per unit change in x? This is what we want.

\(\frac{dy}{du}\): How much does y change per unit change in u? We can compute this from the definition of f.

\(\frac{du}{dx}\): How much does u change per unit change in x? We can compute this from the definition of g.

The multiplication: the rate of change of y with respect to x is the product of these two rates.

Intuition with a pipeline analogy: imagine water flowing through two pipes in series. The first pipe takes in x and outputs u, with flow rate 3 (meaning 3 units of u per unit of x). The second pipe takes u and outputs y, with flow rate 2 (2 units of y per unit of u). If x increases by 1, u increases by 3, and then y increases by 3 × 2 = 6. The total rate from x to y is the product of the individual rates: 3 × 2 = 6.

Working through an example

Let \(y = (2x + 1)^2\).

We can split this into two operations:

  1. \(u = 2x + 1\) (the inner function g)
  2. \(y = u^2\) (the outer function f)

Step 1: find du/dx

\(u = 2x + 1\). The derivative of 2x is 2 (constant times x), and the derivative of 1 is 0. So:

\[ \frac{du}{dx} = 2 \]

Step 2: find dy/du

\(y = u^2\). The derivative of \(u^2\) with respect to u is \(2u\):

\[ \frac{dy}{du} = 2u \]

Step 3: apply the chain rule

\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 2 = 4u = 4(2x + 1) \]

Step 4: verify with a numerical check

At \(x = 1\): \(y = (2 \cdot 1 + 1)^2 = 9\). At \(x = 1.001\): \(y = (2 \cdot 1.001 + 1)^2 = (3.002)^2 \approx 9.012\).

Rate of change ≈ (9.012 - 9) / 0.001 = 12.0

Analytic answer at x = 1: \(4(2 \cdot 1 + 1) = 4 \times 3 = 12\). They match.

import torch

x = torch.tensor(1.0, requires_grad=True)
y = (2 * x + 1) ** 2

y.backward()  # PyTorch computes the derivative using the chain rule internally
print(f"dy/dx at x=1: {x.grad.item()}")  # 12.0

How PyTorch computes gradients automatically

When you write y.backward(), PyTorch walks backward through the computational graph it recorded during the forward pass, applying the chain rule at each operation. This is backpropagation.

The graph for \(y = (2x + 1)^2\) looks like:

x → [multiply by 2] → [add 1] → u → [square] → y

Going backward (from right to left):

  • Start at y. We want dy/dy = 1.
  • Apply chain rule through "square": dy/du = 2u
  • Apply chain rule through "add 1": du/(u_before_add) = 1 (adding a constant does not change the rate)
  • Apply chain rule through "multiply by 2": d(u_before_add)/dx = 2
  • Total: dy/dx = 1 × 2u × 1 × 2 = 4u = 4(2x+1) = 12 at x=1

Every neural network, regardless of how many layers, uses this same backward walk through the computational graph. The graph is more complex (involving matrices and nonlinearities), but the principle is identical.
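
You can write the backward walk out by hand. This sketch accumulates the chain-rule product from right to left, exactly as in the bullets above, then confirms against autograd:

import torch

x_val = 1.0
u = 2 * x_val + 1        # forward pass: u = 3

grad = 1.0               # start at y: dy/dy = 1
grad *= 2 * u            # through "square":        dy/du = 2u
grad *= 1.0              # through "add 1":         derivative is 1
grad *= 2.0              # through "multiply by 2": derivative is 2
print(grad)              # 12.0

x = torch.tensor(x_val, requires_grad=True)
y = (2 * x + 1) ** 2
y.backward()
print(x.grad.item())     # 12.0: autograd agrees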

A complete training step

Here is a full gradient descent step on a simple problem, showing every part:

Problem: find the value of x that minimizes \(L(x) = (x - 3)^2\). The minimum is clearly at x = 3, but we will find it by gradient descent.

import torch

# Start with an initial guess
x = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.2

print("Starting gradient descent to minimize L(x) = (x - 3)^2")
print(f"{'Step':>5} | {'x':>8} | {'L(x)':>8} | {'dL/dx':>8}")
print("-" * 40)

for step in range(10):
    # Forward pass: compute the loss
    L = (x - 3) ** 2
    
    # Backward pass: compute dL/dx using the chain rule
    L.backward()
    
    # Read the gradient
    gradient = x.grad.item()
    
    # Print the current state
    print(f"{step:>5} | {x.item():>8.4f} | {L.item():>8.4f} | {gradient:>8.4f}")
    
    # Update step: move x in the opposite direction of the gradient
    with torch.no_grad():
        x -= learning_rate * x.grad
    
    # Reset the gradient for the next iteration
    x.grad.zero_()

What you will see:

At step 0: x = 0, L = 9, gradient = -6. The gradient is negative, meaning increasing x decreases L. So we add a positive amount to x: x += 0.2 × 6 = 1.2. New x = 1.2.

At step 1: x = 1.2, L = 3.24, gradient = -3.6. Still moving toward x = 3. New x = 1.2 + 0.2 × 3.6 = 1.92.

Each step, x gets closer to 3 and L gets closer to 0. By step 10, x is very close to 3.

Why with torch.no_grad():? When we update x, we do not want PyTorch to record this update as part of the computational graph. That context manager tells PyTorch to pause its graph-recording temporarily.

Why x.grad.zero_()? PyTorch accumulates gradients by default (adds new gradients to existing ones). In a training loop, you almost always want a fresh gradient each step, so you clear it before the next forward pass.

The learning rate: why not take the full gradient step?

Notice we multiplied the gradient by learning_rate = 0.2 instead of just subtracting the gradient directly. Why?

The gradient tells you the slope at your current location. It is a local approximation: it is accurate close to where you are, but the function might curve away from the linear approximation if you move too far.

If you take too large a step, you might overshoot the minimum and end up on the other side, potentially further away than you started. A small learning rate keeps you in the regime where the linear approximation is trustworthy.

Choosing the learning rate is one of the most practically important decisions in training a neural network. Too large and training oscillates or diverges. Too small and training is painfully slow. We will discuss this more in module 2 when we actually train networks.
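
You can watch both failure modes on the same toy problem. For \(L(x) = (x - 3)^2\), the update \(x \leftarrow x - \text{lr} \cdot 2(x - 3)\) scales the distance to the minimum by \((1 - 2 \cdot \text{lr})\) each step, so any learning rate above 1.0 diverges:

def run(lr, steps=15, x0=0.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * (x - 3)   # gradient of (x - 3)^2 is 2(x - 3)
    return x

for lr in [0.01, 0.2, 1.1]:
    print(f"lr = {lr}: x after 15 steps = {run(lr):.3f}")
# lr = 0.01: x ≈ 0.78, painfully slow progress toward 3
# lr = 0.2:  x ≈ 2.999, essentially converged
# lr = 1.1:  x ≈ 49.2 and growing: each step overshoots further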

Why this matters for the rest of the curriculum

Every algorithm from here on trains by gradient descent:

  • Policy gradient methods (module 3): compute the gradient of expected return with respect to policy parameters, step parameters in the positive direction (we want to maximize, not minimize).
  • Q-learning with neural networks (module 3): compute the gradient of a temporal-difference error with respect to value function parameters, step to reduce the error.
  • Deep CFR (module 5): compute gradients of a regret prediction loss and step to make regret predictions more accurate.

In each case, you will write a forward pass (compute the loss from the current parameters), call .backward() (chain rule through the computational graph), and update the parameters (subtract learning_rate × gradient). The specific loss function and what you are minimizing will differ. The gradient descent structure will be the same.

Quiz

Module 1 Project: Monte Carlo Conjunction Probability

What you're building

You're going to write a small Python program that estimates the probability that two satellites will pass within some unsafe distance of each other, given that we know their states only with some uncertainty. This is a real problem in your field. The 18th Space Defense Squadron does an industrial-strength version of it for every conjunction screen they publish, and commercial services like LeoLabs and ComSpOC do dressed-up versions for their customers.

We are using a simplified version: linear motion, position-only uncertainty, isotropic Gaussian noise. That's not realistic. It is, however, the right level of complexity to exercise everything we learned in this module without drowning in orbital mechanics we haven't covered.

What this exercises

  • Vectors and matrices (lessons 5-6): satellite states and velocity propagation.
  • Probability distributions (lesson 1): Gaussian uncertainty in initial position.
  • Sampling and Monte Carlo (lesson 3): the actual estimator.
  • Bayes intuition (lesson 2): conditioning on the observed nominal trajectory.
  • Variance and convergence (lesson 3 again): the sensitivity analysis.
  • Gradient intuition (lesson 7): we'll do a numerical sensitivity analysis that mirrors what gradients give you.

Setup

Two satellites at time \(t = 0\), both with known nominal positions and velocities. Each has uncertainty in its initial position, modeled as an isotropic 3D Gaussian (the same standard deviation in each axis, no correlations between axes). Velocities are assumed known exactly. (This is the unrealistic part; in reality, velocity uncertainty matters a lot. We'll fix this in a later module when we have proper covariance propagation.)

Satellite A:
  nominal position (km):   [0, 0, 0]
  nominal velocity (km/s): [7.5, 0.0, 0.0]
  position uncertainty:    sigma = 0.10 km in each axis

Satellite B:
  nominal position (km):   [100, 0.5, 0]
  nominal velocity (km/s): [-7.5, 0.0, 0.0]
  position uncertainty:    sigma = 0.10 km in each axis

Safety threshold: 1.0 km
Time window:     [0, 20] seconds, sampled at 0.1 s intervals

The two satellites are moving directly toward each other, with a half-kilometer cross-track offset, and meet in the middle of the window. Their minimum distance is going to be small. The question is: given the position uncertainty, how often will they actually come within 1 km of each other?

Step-by-step plan

Step 1: Encode the nominal scenario

Use vectors. Don't use individual x, y, z variables; that defeats the point.

import torch

# Nominals
r0_A = torch.tensor([  0.0, 0.0, 0.0])  # km
r0_B = torch.tensor([100.0, 0.5, 0.0])  # km
v_A  = torch.tensor([ 7.5, 0.0, 0.0])   # km/s
v_B  = torch.tensor([-7.5, 0.0, 0.0])   # km/s

# Uncertainty
sigma = 0.10  # km in each axis, both satellites

# Time grid
dt = 0.1
t = torch.arange(0.0, 20.0 + dt, dt)  # shape: (T,) where T = 201

Step 2: Compute the nominal minimum distance

Before adding noise, do the deterministic version. This is a sanity check and gives you something to compare your Monte Carlo estimate against. Propagate linearly: \(\mathbf{r}(t) = \mathbf{r}_0 + \mathbf{v} \cdot t\). For each \(t\), compute \(|\mathbf{r}_A(t) - \mathbf{r}_B(t)|\). Find the minimum over the time window.

Hint: PyTorch broadcasting will let you do this without a loop. t.unsqueeze(1) gives shape (T, 1), and v.unsqueeze(0) gives shape (1, 3), and their product broadcasts to (T, 3).

You should find a nominal minimum distance of about 0.71 km on this grid. The continuous-time minimum is 0.5 km (the cross-track offset, which the two satellites can't close given their parallel-but-opposite velocities along x), but the true closest-approach time (\(t \approx 6.67\) s) falls between the 0.1 s grid points, and at 15 km/s relative speed the geometry changes by 1.5 km per grid step, so the sampled minimum lands a bit above the true one.
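
One way to write step 2 (a sketch; it assumes the names defined in step 1 are in scope):

# Propagate the nominal trajectories: r(t) = r0 + v*t, vectorized over the grid
pos_A = r0_A + t.unsqueeze(1) * v_A.unsqueeze(0)         # shape (T, 3)
pos_B = r0_B + t.unsqueeze(1) * v_B.unsqueeze(0)         # shape (T, 3)

nominal_dists = torch.linalg.norm(pos_A - pos_B, dim=1)  # shape (T,)
print(f"Nominal min distance: {nominal_dists.min().item():.3f} km")  # ≈ 0.707 on this grid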

Step 3: Add uncertainty and sample

Now the Monte Carlo part. For each of \(N\) trials:

  1. Sample a perturbation \(\delta_A \sim \mathcal{N}(0, \sigma^2 I)\) and add it to \(\mathbf{r}_0^A\).
  2. Sample \(\delta_B \sim \mathcal{N}(0, \sigma^2 I)\) and add it to \(\mathbf{r}_0^B\).
  3. Propagate both linearly over the time window.
  4. Find the minimum distance over the window.
  5. Record whether that minimum was below the safety threshold.

The probability of conjunction \(P_c\) is then the fraction of trials with a min-distance below threshold.

def estimate_pc(N, sigma=0.10, threshold=1.0):
    # Sample perturbations: shape (N, 3)
    deltas_A = sigma * torch.randn(N, 3)
    deltas_B = sigma * torch.randn(N, 3)
    
    # Perturbed initial positions: shape (N, 3)
    r0A = r0_A + deltas_A
    r0B = r0_B + deltas_B
    
    # Propagate. We want positions of shape (N, T, 3).
    # r(t) = r0 + v*t
    # t has shape (T,), v has shape (3,), so v*t.unsqueeze(1) is (T, 3)
    trajA = r0A.unsqueeze(1) + (v_A.unsqueeze(0) * t.unsqueeze(1)).unsqueeze(0)
    trajB = r0B.unsqueeze(1) + (v_B.unsqueeze(0) * t.unsqueeze(1)).unsqueeze(0)
    # trajA, trajB: shape (N, T, 3)
    
    # Distances at each timestep: shape (N, T)
    diffs = trajA - trajB
    dists = torch.linalg.norm(diffs, dim=2)
    
    # Min distance per trial: shape (N,)
    min_dists = dists.min(dim=1).values
    
    # Probability estimate
    pc = (min_dists < threshold).float().mean()
    return pc.item(), min_dists

If the broadcasting is making your head hurt, write it with a for loop first, get correct numbers, then refactor to vectorized form. Vectorized PyTorch will be 10 to 100 times faster, which matters when we crank \(N\) up.

Step 4: Convergence study

Run the estimator with \(N \in \{100,\, 1{,}000,\, 10{,}000,\, 100{,}000\}\), repeating 10 times for each \(N\). Plot or print the mean and standard deviation of \(P_c\) across the 10 runs. You should see the standard deviation shrink as roughly \(1/\sqrt{N}\), exactly as lesson 3 promised.

import torch

for N in [100, 1_000, 10_000, 100_000]:
    runs = [estimate_pc(N)[0] for _ in range(10)]
    runs_t = torch.tensor(runs)
    print(f"N={N:>6}: Pc mean = {runs_t.mean():.4f}, std = {runs_t.std():.4f}")

Your absolute \(P_c\) value will depend on your scenario and threshold. The point is the convergence behavior, not any specific number.

Step 5: Sensitivity analysis

This is the part that previews gradients without using them yet. We want to know: how sensitive is \(P_c\) to the uncertainty level \(\sigma\)?

Compute \(P_c\) for \(\sigma \in \{0.05,\, 0.10,\, 0.20,\, 0.50,\, 1.00\}\) km, all with \(N = 50{,}000\). In this geometry you should see \(P_c\) decrease as \(\sigma\) grows. The reason: the nominal minimum distance (about 0.7 km on the grid) is already below the 1.0 km threshold, so the baseline scenario is almost always a conjunction. Wider uncertainty scatters samples further from the nominal, pushing more of them above threshold and lowering \(P_c\).
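
A sketch of the sweep, reusing estimate_pc from step 3:

for sig in [0.05, 0.10, 0.20, 0.50, 1.00]:
    pc, _ = estimate_pc(50_000, sigma=sig)
    print(f"sigma = {sig:.2f} km: Pc ≈ {pc:.4f}")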

The direction flips if you move the nominal above threshold. Change r0_B[1] from 0.5 to 1.5 (a 1.5 km cross-track offset, giving a nominal miss of 1.5 km, safely outside the 1.0 km threshold) and rerun. Now \(P_c\) increases with \(\sigma\): uncertainty is occasionally bridging the gap into the danger zone.

The lesson: "more uncertainty means more Pc" is not a law. The direction of \(dP_c/d\sigma\) depends on which side of the threshold the nominal sits. This matters operationally: a maneuver that nudges the nominal further from the threshold can either increase or decrease the sensitivity of Pc to state uncertainty, depending on the geometry. This is, conceptually, a finite-difference approximation to \(dP_c/d\sigma\): you're seeing how the output changes as the input parameter changes. If you wrote the entire pipeline in PyTorch with requires_grad=True on \(\sigma\), you could get this gradient analytically with .backward(). We're doing it the slow way for now, but the fact that "sensitivity of an output to an input" is exactly what gradients give you is the bridge to module 2.

Step 6: Reflect

At the end of your script (or in a comment block), write down answers to:

  1. How does the standard deviation of your \(P_c\) estimates scale with \(N\)?
  2. What's the smallest \(N\) at which you'd trust the answer to two decimal places?
  3. If \(P_c\) for the nominal scenario is, say, 0.04, and you have a "decision threshold" of 0.001 (the value above which JSpOC might issue a maneuver recommendation), how many samples do you need before your estimator's noise is small compared to that threshold? (Hint: the standard error needs to be much smaller than 0.001.)
  4. What's missing from this model that you'd want to add to make it realistic? (Velocity uncertainty? Correlated position uncertainty? Nonlinear dynamics? Time-varying covariance?)

These questions are not graded; they are the things you should be able to answer after doing the project, and they are the things that distinguish "I ran the code" from "I understood the code."

Stretch goals (optional)

Pick one or two if you want extra reps:

  • Different geometries. Modify the scenario so the nominal trajectories cross exactly (head-on with no offset). What does \(P_c\) do as a function of \(\sigma\) here, vs. the near-miss case?
  • Velocity uncertainty. Add Gaussian noise to the initial velocities too. Does this change the convergence rate of your estimator? (Spoiler: no, it's still \(1/\sqrt{N}\). What changes is the variance of individual trial outcomes, which affects the multiplicative constant.)
  • A baseline comparison. Implement a "delta-r" approximation: assume the relative position vector at the nominal time of closest approach is Gaussian, and use the analytic formula for the probability that a Gaussian falls inside a sphere of radius threshold around the origin (for isotropic noise this reduces to a noncentral chi-squared CDF; a sketch follows after this list). Compare to your Monte Carlo estimate. They should agree to within Monte Carlo noise. This is a useful sanity check.
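
A sketch of that analytic check, assuming scipy is available (the miss distance, sigma, and threshold come from the main scenario):

from scipy.stats import ncx2

# At the nominal TCA the relative position is ~ N(mu, 2*sigma^2 * I): the two
# independent isotropic position errors add. So ||r||^2 / (2*sigma^2) follows a
# noncentral chi-squared with 3 degrees of freedom and noncentrality ||mu||^2 / (2*sigma^2).
sigma, threshold = 0.10, 1.0
miss = 0.5                # nominal miss distance at closest approach, km
var = 2 * sigma**2        # per-axis variance of the relative position

p_analytic = ncx2.cdf(threshold**2 / var, df=3, nc=miss**2 / var)
print(f"Analytic delta-r Pc: {p_analytic:.4f}")   # compare to the Monte Carlo estimate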

What you should hand yourself afterward

A single Python file (or notebook) that:

  • Imports torch.
  • Defines the scenario as named variables.
  • Has a function estimate_pc(N, sigma, threshold) that returns the estimate.
  • Runs the convergence study and prints results.
  • Runs the sensitivity analysis and prints results.
  • Has a comment block at the bottom with your answers to the reflection questions.

Don't make this fancy. The code should be 100 to 200 lines, including comments. The point isn't a polished tool; it's that you've now used every concept from this module on a single small problem and seen them work together.

What's next

Module 2 will build neural networks: stacks of the matrix-vector multiplications you saw in lesson 6, with the gradients you saw in lesson 7 doing the training. The Monte Carlo machinery you built here will come back when we get to RL in module 3, where each "rollout" of a policy is exactly the same kind of sample-and-average loop you just wrote.