Lesson 9: IMPALA and Distributed Reinforcement Learning

Module: Reinforcement Learning — M03: Sequential Decision-Making Source: [cite: Espeholt et al. "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" ICML 2018; Schulman et al. "Proximal Policy Optimization Algorithms" 2017; Liang et al. "RLlib: Abstractions for Distributed Reinforcement Learning" ICML 2018]

Where this fits

Lessons 5 and 6 built policy gradient and actor-critic methods that work correctly on a single machine running a single environment. Those algorithms are conceptually complete, but they have a throughput ceiling that matters for this curriculum. The SSA orbital dominance wargame being developed in later modules requires training over millions of interactions across hundreds of parallel game instances. A single synchronous actor-critic loop running one environment at a time would take days or weeks to collect enough data. Research backing this curriculum explicitly recommends IMPALA/APPO as the training backbone, targeting 250,000 frames per second throughput.

This lesson explains why synchronous on-policy methods hit a wall at scale, how IMPALA's decoupled actor-learner architecture breaks through it, how V-trace corrects the resulting off-policy bias, and how to configure APPO in RLlib for the SSA wargame setup. Module 6 (Multi-Agent RL) runs the same distributed infrastructure for multi-agent training — the architecture introduced here is reused there directly.

The scaling problem with on-policy RL

Recall the synchronous A2C training loop from lesson 6:

1. Run N environment steps to collect a trajectory batch
2. Compute advantages using the critic
3. Update the policy and critic with gradient descent
4. Discard the trajectory (data is stale after the update)
5. Repeat

The GPU executes step 3. Everything else — environment simulation, advantage computation, data transfer — runs on CPU. The GPU sits idle during steps 1, 2, 4, and 5. For a typical SSA game instance running in Python:

Collecting one step: ~10 ms (Python environment overhead)
One gradient update over a batch of 512 steps: ~5 ms (GPU)

In a synchronous loop, the timeline is: collect (10 ms) → update (5 ms) → collect (10 ms) → update (5 ms) → ...

GPU utilization: 5 / (10 + 5) = 33%. The GPU is idle two-thirds of the time.
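
The utilization figure follows from a one-line calculation. A minimal sketch, using the illustrative 10 ms and 5 ms timings from this section (the function name is hypothetical):

```python
def sync_gpu_utilization(collect_ms: float, update_ms: float) -> float:
    """Fraction of wall-clock time the GPU is busy in a strict collect -> update loop."""
    return update_ms / (collect_ms + update_ms)


util = sync_gpu_utilization(collect_ms=10.0, update_ms=5.0)
print(f"GPU utilization: {util:.0%}")  # 5 / (10 + 5)
```

Halving the collection time to 5 ms would only raise utilization to 50%, which is why faster collection alone does not remove the serialization.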

Scaling up to 500 parallel environments in synchronous mode helps throughput — you collect 500 environments' data simultaneously — but the GPU still waits for the slowest actor to finish its batch before each update. Stragglers, garbage collection pauses, and Python GIL contention can make the slowest actor significantly slower than the average, wasting even more time. This is sometimes called the straggler problem.
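
The straggler effect can be illustrated with a small simulation. This is a sketch under an assumed timing model (10 ms nominal collection time plus exponential jitter with a 2 ms mean); the jitter distribution and the function name are illustrative choices, not measurements from the SSA environment.

```python
import random


def mean_slowest_collect_ms(n_actors: int, trials: int = 500, seed: int = 0) -> float:
    """Average time a synchronous learner waits for the slowest of n_actors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumed timing model: 10 ms nominal plus exponential jitter (mean 2 ms).
        total += max(10.0 + rng.expovariate(1 / 2.0) for _ in range(n_actors))
    return total / trials


single = mean_slowest_collect_ms(1)
many = mean_slowest_collect_ms(500)
print(f"1 actor:    {single:.1f} ms per round")
print(f"500 actors: {many:.1f} ms per round")
```

Under this model the average actor is no slower with 500 workers, but the synchronous learner waits for the maximum, so the per-round wait grows with the number of actors.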

The fundamental issue: on-policy algorithms require that every gradient update uses data collected under the current policy. This creates a hard serialization: collect → update → collect → update. You cannot overlap collection and learning.

Decoupled actor-learner architecture

IMPALA's key insight is to break the serial dependency by separating actors and the learner into independent processes with a shared queue between them.

Actor 0  ─────────────────────────┐
Actor 1  ─────────────────────────┤
Actor 2  ─────────────────────────┤──► Trajectory Queue ──► Learner (GPU)
   ...                            │        (FIFO)
Actor N  ─────────────────────────┘

Actors (CPU workers): Each actor holds a local copy of the policy, which may lag the learner by a few updates. It runs one or more environment instances continuously, collecting (state, action, reward, done) tuples into short trajectory segments, together with the probability $μ (a ∣ s)$ its policy assigned to each chosen action, which V-trace needs later. When a segment is complete, the actor pushes it onto the trajectory queue and immediately starts collecting the next segment; it never waits for the learner.

Learner (GPU): The learner pulls trajectory segments from the queue continuously. It runs a gradient update on each batch of segments and broadcasts the updated policy weights back to all actors. It never waits for a specific actor to finish.

Trajectory queue: A shared FIFO buffer (typically an in-memory queue managed by Ray) that decouples the production rate (actors) from the consumption rate (learner).

The result, provided the actors produce data at least as fast as the learner consumes it: near-100% GPU utilization, because the learner always has data available, and near-100% CPU utilization, because actors always have work to do. The two sides proceed at their own natural rates.
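
The producer/consumer structure can be sketched with the standard library alone. In this toy version, threads stand in for Ray actor processes, dictionaries stand in for trajectory segments, and there is no weight broadcast; all names are hypothetical.

```python
import queue
import threading

N_ACTORS = 4
SEGMENTS_PER_ACTOR = 5

trajectory_queue = queue.Queue(maxsize=16)  # bounded FIFO shared by all actors
consumed = []


def actor(actor_id: int) -> None:
    for seg in range(SEGMENTS_PER_ACTOR):
        segment = {"actor": actor_id, "segment": seg}  # stand-in for real transitions
        trajectory_queue.put(segment)  # blocks only when the queue is full


def learner(total_segments: int) -> None:
    for _ in range(total_segments):
        segment = trajectory_queue.get()  # blocks only when the queue is empty
        consumed.append(segment)  # a real learner would run a gradient step here


threads = [threading.Thread(target=actor, args=(i,)) for i in range(N_ACTORS)]
threads.append(threading.Thread(target=learner, args=(N_ACTORS * SEGMENTS_PER_ACTOR,)))
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"learner consumed {len(consumed)} segments")
```

The order in which segments arrive depends on thread scheduling, which mirrors the real system: the learner trains on whatever reaches the queue first, not on a fixed actor order.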

This decoupling is the entire architectural contribution of IMPALA. The mathematical challenge it creates (the learner is now training on data generated by an older policy) is what V-trace solves.

The off-policy problem and why it matters

In the decoupled architecture, there is always a lag between when actors collect experience and when the learner trains on it. By the time a trajectory segment reaches the front of the queue, the learner may have performed several gradient updates since the actors generated that segment.

Concretely: suppose actors are running policy $μ$ (the behavior policy — the policy that actually generated the actions in the trajectory). The learner updates and is now running policy $π$ (the target policy — the current learner policy that we want to improve). If the learner has updated 5 times since the actors sent that trajectory, $π$ and $μ$ differ.

Using standard on-policy gradient estimates on off-policy data (data generated by $μ$ but evaluated as if generated by $π$ ) introduces bias. The policy gradient theorem requires that the data distribution matches the current policy. When it does not, the gradient estimate can point in a systematically wrong direction.

Numerically, the problem appears through the importance ratio: the ratio $π (a ∣ s) / μ (a ∣ s)$ measures how much more (or less) likely the current policy is to take the same action the old policy took. If an actor used a slightly exploratory policy $μ$ that assigned probability 0.1 to action $a$ , but the learner's new policy $π$ now assigns 0.8 to that action, the importance ratio is 8.0. Multiplying gradient estimates by this ratio corrects for the off-policy distribution shift, but a ratio of 8 dramatically amplifies the variance of the estimate. With many such large ratios in a trajectory, the gradient update can become unstable.

SSA context: with 512 parallel environments feeding the queue and a GPU updating the policy every 50 ms, actors will typically be 2–10 policy versions behind the learner. In a fast-moving training run, the behavior policy can diverge enough from the target policy to make naive on-policy gradient estimates biased and noisy. V-trace handles this lag by clipping the importance ratios rather than letting them compound along the trajectory.

V-trace: off-policy correction with clipped importance ratios

V-trace is IMPALA's correction mechanism. It modifies the standard TD target to account for the behavior/target policy mismatch, but clips the importance ratios to limit variance.

The importance ratios

Define the per-step importance ratio:

$ρ_{t} = \frac{π ( a _{t} ∣ s _{t} )}{μ ( a _{t} ∣ s _{t} )}$

Decoding:

$π (a_{t} ∣ s_{t})$ : the probability that the current learner policy would take action $a_{t}$ in state $s_{t}$
$μ (a_{t} ∣ s_{t})$ : the probability that the behavior policy (the actor's policy at the time of collection) assigned to the action $a_{t}$ it actually took
$ρ_{t} > 1$ : the current policy is more likely to take this action than the old policy was — the action has become more preferred
$ρ_{t} < 1$ : the current policy is less likely to take this action — the action has become less preferred
$ρ_{t} = 1$ : the policies agree on this action — no correction needed

V-trace uses two clipped versions of this ratio. From this point on, $ρ_{t}$ denotes the clipped value:

$ρ_{t} = \min \left( \bar{ρ}, \frac{π (a_{t} ∣ s_{t})}{μ (a_{t} ∣ s_{t})} \right), \quad \bar{ρ} \in [1, \infty)$

$c_{t} = \min \left( \bar{c}, \frac{π (a_{t} ∣ s_{t})}{μ (a_{t} ∣ s_{t})} \right), \quad \bar{c} \in [1, \infty)$

Decoding:

$\bar{ρ}$ (rho-bar): clips the importance ratio for the TD error weight. Typically set to 1.0. This directly bounds how much any single transition can influence the value estimate.
$\bar{c}$ (c-bar): clips the importance ratio for the trace accumulation across time steps. Also typically 1.0. This controls how far back in the trajectory the correction propagates.
Both are set to 1.0 by default in IMPALA. Larger values trust the correction over longer time lags; smaller values are conservative and stable.

The V-trace target

The V-trace target for the value function at position $s$ in a trajectory of length $n$ is:

$v_{s} = V (x_{s}) + \sum_{t = s}^{s + n - 1} γ^{t - s} \left( \prod_{i = s}^{t - 1} c_{i} \right) ρ_{t} \left( r_{t} + γ V (x_{t + 1}) - V (x_{t}) \right)$

Decoding each symbol:

$V (x_{s})$ : the current value estimate at the start of the trajectory segment. This is the baseline from which the correction is measured.
$\sum_{t = s}^{s + n - 1}$ : sum over the $n$ steps of the trajectory segment, starting at position $s$
$γ^{t - s}$ : the standard discount factor applied to rewards further in the future
$\prod_{i = s}^{t - 1} c_{i}$ : the product of clipped importance ratios from $s$ to $t - 1$ . This determines how much a TD error at a later step $t$ propagates back into the target at $s$ . With $\bar{c} = 1$ , $c_{i} \leq 1$ always, so this product can only shrink as $t - s$ grows: TD errors far ahead of $s$ contribute less to $v_{s}$ .
$ρ_{t}$ : the clipped importance ratio at step $t$ , scaling the TD error at that step
$r_{t} + γV (x_{t + 1}) - V (x_{t})$ : the one-step TD error at step $t$ — the difference between the bootstrapped return and the current value estimate

Reading the formula as a whole: the V-trace target starts from the current value estimate $V (x_{s})$ and adds a discounted, importance-weighted sum of TD errors. Each TD error is clipped (via $ρ_{t}$ ) to prevent any single step from dominating, and the accumulation is clipped (via $\prod c_{i}$ ) to prevent corrections from old data from propagating too far back.

When $π = μ$ (on-policy case), all importance ratios equal 1.0 and the clipping has no effect. V-trace reduces exactly to an $n$ -step return. V-trace generalizes the standard on-policy TD target to the off-policy case.
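
The on-policy reduction can be checked directly. Assuming $\bar{ρ} \geq 1$ and $\bar{c} \geq 1$ so that a ratio of exactly 1 is never clipped, every $ρ_{t}$ and $c_{i}$ equals 1 and the TD errors telescope:

$v_{s} = V (x_{s}) + \sum_{t = s}^{s + n - 1} γ^{t - s} \left( r_{t} + γ V (x_{t + 1}) - V (x_{t}) \right)$

$= V (x_{s}) + \sum_{t = s}^{s + n - 1} γ^{t - s} r_{t} + \sum_{t = s}^{s + n - 1} \left( γ^{t - s + 1} V (x_{t + 1}) - γ^{t - s} V (x_{t}) \right)$

$= \sum_{t = s}^{s + n - 1} γ^{t - s} r_{t} + γ^{n} V (x_{s + n})$

The last sum collapses to $γ^{n} V (x_{s + n}) - V (x_{s})$ , which cancels the leading $V (x_{s})$ and leaves the discounted rewards plus a bootstrapped value at the end of the segment.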

V-trace policy gradient

The policy gradient update in V-trace uses a modified advantage estimate based on the V-trace target:

$\hat{A}_{s} = ρ_{s} (r_{s} + γ v_{s + 1} - V (x_{s}))$

Decoding:

$ρ_{s}$ : the clipped importance ratio at step $s$ — scales how much this step's gradient contributes based on policy divergence
$r_{s} + γ v_{s + 1} - V (x_{s})$ : the V-trace-corrected advantage — how much better was this step than the V-trace value estimate predicted?
The clipping bounds the contribution of any single off-policy step at $\bar{ρ}$ , limiting how much stale data can shift the policy

Intuition: why clipping rather than full correction?

A naive off-policy correction would multiply the gradient by the full importance ratio $π / μ$ . If this ratio is large (say, 20), the gradient step becomes 20 times larger than intended. Over a trajectory, these ratios multiply: five steps each with a ratio of 2 give a trajectory-level ratio of 32. This makes training catastrophically unstable.
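
The compounding effect is easy to reproduce. A small sketch using the five-step example from the paragraph above, comparing the unclipped product of ratios with the product after capping each factor at 1.0 (variable names are illustrative):

```python
import math

ratios = [2.0] * 5  # five consecutive steps, each with pi/mu = 2
c_bar = 1.0

unclipped_product = math.prod(ratios)                      # ratios compound: 2^5
clipped_product = math.prod(min(c_bar, r) for r in ratios)  # each factor capped at c_bar

print(f"unclipped: {unclipped_product:.0f}, clipped: {clipped_product:.0f}")
```

With every factor capped at 1.0 the product can never exceed 1, no matter how long the trajectory segment is.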

V-trace clips ratios at 1.0, accepting some bias in exchange for bounded variance. The bias means that when policies diverge, the learned value function tracks a policy lying somewhere between the behavior policy $μ$ and the target policy $π$ rather than $π$ exactly, so it understates how much the target policy's performance differs from the behavior policy's experience. In practice this is a good tradeoff: stable training with a small bias toward the behavior policy is far more useful than unbiased but exploding gradients.

A PyTorch illustration of the clipping mechanism:

import torch

def vtrace_correction(
    log_probs_target: torch.Tensor,   # log π(a_t | s_t), shape (T,)
    log_probs_behavior: torch.Tensor, # log μ(a_t | s_t), shape (T,)
    rewards: torch.Tensor,            # r_t, shape (T,)
    values: torch.Tensor,             # V(x_t), shape (T+1,) -- last is bootstrap
    gamma: float = 0.99,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute V-trace targets and advantages for one trajectory segment.

    Returns:
        vtrace_targets: shape (T,), used as value function regression targets
        pg_advantages:  shape (T,), used to weight the policy gradient
    """
    T = rewards.shape[0]

    # Raw importance ratios: π(a) / μ(a) = exp(log π - log μ)
    log_rho = log_probs_target - log_probs_behavior
    rho      = torch.exp(log_rho).clamp(max=rho_bar)  # clip for value targets
    c        = torch.exp(log_rho).clamp(max=c_bar)     # clip for trace product

    # TD errors: δ_t = r_t + γ V(x_{t+1}) - V(x_t)
    td_errors = rewards + gamma * values[1:] - values[:-1]

    # V-trace targets: accumulate backward through the trajectory
    vtrace_targets = torch.zeros_like(rewards)  # same dtype and device as the inputs
    running = 0.0
    for t in reversed(range(T)):
        running      = rho[t] * td_errors[t] + gamma * c[t] * running
        vtrace_targets[t] = values[t] + running

    # Policy gradient advantages: ρ_s * (r_s + γ v_{s+1} - V(x_s))
    # Use the next V-trace target as v_{s+1}
    v_next = torch.cat([vtrace_targets[1:], values[-1:]])
    pg_advantages = rho * (rewards + gamma * v_next - values[:-1])

    return vtrace_targets, pg_advantages


# Demonstrate with a small example
torch.manual_seed(42)
T = 5

# Simulate a mild policy lag: behavior slightly more exploratory than target
log_probs_target   = torch.tensor([-0.5, -0.8, -0.4, -1.0, -0.6])
log_probs_behavior = torch.tensor([-0.9, -1.1, -0.7, -1.3, -1.0])  # lower probs (more exploratory)

rewards = torch.tensor([1.0, 0.5, 2.0, 0.0, 1.5])
values  = torch.tensor([3.0, 2.8, 2.5, 2.0, 1.5, 0.0])  # length T+1

targets, advantages = vtrace_correction(
    log_probs_target, log_probs_behavior, rewards, values
)

raw_rho = torch.exp(log_probs_target - log_probs_behavior)

print("V-trace correction example:")
print(f"{'t':>3}  {'raw ρ':>8}  {'clipped ρ':>10}  {'V-trace target':>16}  {'PG advantage':>14}")
for t in range(T):
    clipped = min(raw_rho[t].item(), 1.0)
    print(
        f"{t:>3}  {raw_rho[t].item():>8.3f}  {clipped:>10.3f}  "
        f"{targets[t].item():>16.4f}  {advantages[t].item():>14.4f}"
    )
print("\nImportance ratios > 1.0 are clipped: the correction is bounded.")
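
As a sanity check on the recursion, the on-policy case can be verified without PyTorch. The sketch below is a plain-Python re-implementation of the same backward pass (the function names are hypothetical) with all ratios set to 1, compared against an ordinary bootstrapped n-step return on the rewards and values from the demo above.

```python
def vtrace_targets(rhos, cs, rewards, values, gamma):
    """Backward recursion: v_t - V(x_t) = rho_t * delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1}))."""
    T = len(rewards)
    targets = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = rhos[t] * delta + gamma * cs[t] * running
        targets[t] = values[t] + running
    return targets


def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of rewards plus a discounted bootstrap value at the end."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.99
rewards = [1.0, 0.5, 2.0, 0.0, 1.5]
values = [3.0, 2.8, 2.5, 2.0, 1.5, 0.0]  # length T+1, last entry is the bootstrap
ones = [1.0] * len(rewards)              # pi == mu, so every ratio is 1

on_policy_targets = vtrace_targets(ones, ones, rewards, values, gamma)
expected = [n_step_return(rewards[t:], values[-1], gamma) for t in range(len(rewards))]

for t, (v, g) in enumerate(zip(on_policy_targets, expected)):
    print(f"t={t}  v-trace={v:.4f}  n-step={g:.4f}")
```

The two columns agree at every position, which is the numerical counterpart of the claim that V-trace reduces to the n-step return when the policies match.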

APPO in RLlib: the practical implementation

IMPALA is the full architecture. APPO (Asynchronous PPO) is RLlib's implementation that combines IMPALA's actor-learner decoupling with PPO's clipped surrogate objective. It is the recommended algorithm for large-scale training in this curriculum.

Configuration

from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("SSAConjunctionEnv")
    .rollouts(
        num_rollout_workers=32,     # number of Ray actor processes (CPU workers)
        num_envs_per_worker=16,     # parallel game instances per worker
        rollout_fragment_length=50, # steps per trajectory segment before pushing to queue
    )
    .training(
        train_batch_size=4096,      # total steps per gradient update
        lr=5e-4,                    # learning rate
        gamma=0.99,                 # discount factor
        vtrace=True,                # enable V-trace off-policy correction
        vtrace_clip_rho_threshold=1.0,    # rho-bar: clips TD error importance ratios
        vtrace_clip_pg_rho_threshold=1.0, # rho-bar for policy gradient
        entropy_coeff=0.01,         # entropy bonus coefficient
        grad_clip=40.0,             # gradient clipping norm
    )
    .resources(num_gpus=1)
)

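Decoding each parameter (the names follow the Ray 2.x .rollouts() / .training() API used above; later Ray releases rename some of them, so check the documentation for your installed version):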

num_rollout_workers=32: 32 separate Ray actor processes. Each runs as an independent Python process, bypassing the GIL. These are the "actors" in the IMPALA architecture.
num_envs_per_worker=16: each worker runs 16 game instances simultaneously. Total parallel environments: 32 × 16 = 512.
rollout_fragment_length=50: each actor collects 50 steps from each of its environments (50 × 16 = 800 steps per worker) before pushing a trajectory segment to the queue. Shorter fragments mean lower latency (fresher data); longer fragments amortize the overhead of pushing to the queue.
train_batch_size=4096: the learner pulls enough segments from the queue to accumulate 4,096 steps before running one gradient update.
vtrace=True: enables the V-trace off-policy correction. Without this, APPO uses the data as if it were on-policy, which is biased.
vtrace_clip_rho_threshold=1.0: sets $\bar{ρ} = 1$ in the V-trace formula, the conservative default. Increasing this allows more aggressive off-policy correction but risks instability if actors are very stale.
grad_clip=40.0: clips the gradient norm before each optimizer step. V-trace-corrected gradients can spike if the behavior and target policies diverge suddenly; clipping prevents a single bad batch from destabilizing training.
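
The sizes implied by this configuration can be worked out by hand. The sketch below assumes, as the RLlib documentation describes for vectorized workers, that the fragment length is counted per environment, so a worker returns rollout_fragment_length × num_envs_per_worker steps per sample; the derived variable names are illustrative.

```python
import math

num_rollout_workers = 32
num_envs_per_worker = 16
rollout_fragment_length = 50
train_batch_size = 4096

total_envs = num_rollout_workers * num_envs_per_worker
# Assumption: fragment length is per environment, so one worker sample
# holds rollout_fragment_length steps from each of its environments.
steps_per_worker_sample = rollout_fragment_length * num_envs_per_worker
# Worker samples the learner must pull to reach at least one train batch.
samples_per_update = math.ceil(train_batch_size / steps_per_worker_sample)

print(f"total envs:              {total_envs}")
print(f"steps per worker sample: {steps_per_worker_sample}")
print(f"samples per update:      {samples_per_update}")
```

Under this assumption a single gradient update draws on only a handful of worker samples, so the data in one batch comes from a small subset of the 32 workers at a time.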

Registering a custom SSA environment

import ray
from ray.tune.registry import register_env
from ray.rllib.algorithms.appo import APPOConfig

# Define the custom environment factory
def ssa_env_creator(config):
    from ssa_wargame import SSAConjunctionEnv
    return SSAConjunctionEnv(
        n_objects=config.get("n_objects", 20),
        horizon=config.get("horizon", 200),
        seed=config.get("seed", None),
    )

# Register with Ray's environment registry
register_env("SSAConjunctionEnv", ssa_env_creator)

# Initialize Ray (connect to existing cluster or start a local one)
ray.init(ignore_reinit_error=True)

# Build the algorithm
config = (
    APPOConfig()
    .environment(
        "SSAConjunctionEnv",
        env_config={"n_objects": 20, "horizon": 200},
    )
    .rollouts(num_rollout_workers=32, num_envs_per_worker=16)
    .training(train_batch_size=4096, lr=5e-4, gamma=0.99, vtrace=True)
    .resources(num_gpus=1)
)

algo = config.build()

Training loop with checkpointing

import os

checkpoint_dir = "/tmp/ssa_appo_checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

best_mean_reward = float("-inf")
n_iterations = 500

for i in range(n_iterations):
    result = algo.train()

    mean_reward = result["episode_reward_mean"]
    timesteps   = result["timesteps_total"]
    throughput  = result.get("num_env_steps_sampled_this_iter", 0)

    if (i + 1) % 10 == 0:
        print(
            f"Iter {i+1:>4} | "
            f"reward={mean_reward:>8.2f} | "
            f"steps={timesteps:>10,} | "
            f"throughput={throughput:>6} steps/iter"
        )

    # Checkpoint whenever performance improves
    if mean_reward > best_mean_reward:
        best_mean_reward = mean_reward
        checkpoint_path = algo.save(checkpoint_dir)
        print(f"  New best! Saved checkpoint: {checkpoint_path}")

print(f"\nTraining complete. Best mean reward: {best_mean_reward:.2f}")
algo.stop()
ray.shutdown()

What result contains: each call to algo.train() returns a dictionary with keys including episode_reward_mean, episode_reward_max, episode_len_mean, timesteps_total, and learner statistics (loss, entropy, explained variance). Dividing the steps sampled in an iteration by that iteration's wall-clock duration gives frames per second.
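
A minimal helper for the frames-per-second conversion might look like the following. The function name is hypothetical and the numbers in the usage line are made up for illustration; in the training loop above, the step count would come from result.get("num_env_steps_sampled_this_iter", 0) and the duration from timing the algo.train() call.

```python
def steps_per_second(steps_this_iter: int, iter_seconds: float) -> float:
    """Environment steps sampled per second of wall-clock time for one iteration."""
    if iter_seconds <= 0:
        return 0.0
    return steps_this_iter / iter_seconds


# In the loop one would time each iteration, for example:
#   start = time.perf_counter()
#   result = algo.train()
#   iter_seconds = time.perf_counter() - start
example_fps = steps_per_second(steps_this_iter=4096, iter_seconds=0.5)
print(f"{example_fps:,.0f} steps/s")
```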

Throughput and hardware math

The case for the IMPALA architecture becomes concrete when you calculate expected throughput.

Python game logic

512 parallel SSA environments, each environment step takes 20 ms (typical for a Python-based orbital mechanics simulation):

$\text{Throughput} = \frac{512 \text{ envs}}{0.020 \text{ s/step}} = 25{,}600 \text{ steps/second}$

For a 50M-step training run:

$\text{Training time} = \frac{50{,}000{,}000}{25{,}600} \approx 1{,}953 \text{ seconds} \approx 32.6 \text{ minutes}$

Rust game logic

512 parallel environments with a Rust-based game engine, where each environment step takes 2 ms:

$\text{Throughput} = \frac{512 \text{ envs}}{0.002 \text{ s/step}} = 256{,}000 \text{ steps/second}$

For the same 50M-step training run:

$\text{Training time} = \frac{50{,}000{,}000}{256{,}000} \approx 195 \text{ seconds} \approx 3.3 \text{ minutes}$

The 10x step time improvement in the game engine translates directly to a 10x reduction in wall-clock training time. This is why Module 8 of the curriculum discusses a Rust implementation of the SSA wargame: the bottleneck for a well-configured IMPALA setup is environment simulation speed, not GPU compute. When environment throughput falls short of the GPU's processing capacity, adding more GPUs does not help; you need faster environments.
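
The arithmetic in this section can be reproduced with two small functions. This is a sketch of the idealized calculation only: it assumes every environment steps in lockstep at the stated per-step time and ignores learner and communication overhead. The function names are hypothetical.

```python
def env_throughput(n_envs: int, step_time_s: float) -> float:
    """Aggregate environment steps per second across n_envs parallel instances."""
    return n_envs / step_time_s


def training_minutes(total_steps: float, steps_per_s: float) -> float:
    """Wall-clock minutes needed to collect total_steps at a given throughput."""
    return total_steps / steps_per_s / 60.0


python_tp = env_throughput(512, 0.020)  # Python game logic, 20 ms per step
rust_tp = env_throughput(512, 0.002)    # Rust game logic, 2 ms per step

for name, tp in [("Python", python_tp), ("Rust", rust_tp)]:
    minutes = training_minutes(50_000_000, tp)
    print(f"{name:>6}: {tp:>9,.0f} steps/s, 50M steps in {minutes:.1f} min")
```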

Sanity check: are actors the bottleneck?

With 32 workers × 16 envs × (1/0.002 steps/s) = 256,000 steps/s from actors, and a modern GPU capable of processing roughly 500,000 steps/s in gradient updates at a typical network size, the actors are the bottleneck for Rust environments. This means:

Adding more GPUs will not improve throughput until you also add more actors
At 32 workers the GPU already runs at roughly half capacity (256,000 of ~500,000 steps/s); reducing the actor count further leaves it even more underutilized
For Rust environments, 64–96 workers keep a single GPU near-saturated
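
The worker count needed to saturate the learner can be estimated the same way. The sketch below takes the rough 500,000 steps/s GPU figure from the text as given (it is an estimate, not a benchmark) and uses a hypothetical function name.

```python
import math


def workers_to_saturate(learner_steps_per_s: float,
                        envs_per_worker: int,
                        step_time_s: float) -> int:
    """Smallest worker count whose combined sampling rate meets the learner's rate."""
    per_worker_steps_per_s = envs_per_worker / step_time_s
    return math.ceil(learner_steps_per_s / per_worker_steps_per_s)


rust_workers = workers_to_saturate(500_000, envs_per_worker=16, step_time_s=0.002)
print(f"Rust environments: about {rust_workers} workers to saturate one GPU")
```

The result lands at the low end of the 64–96 worker range quoted above; the extra headroom in that range allows for stragglers and per-worker overhead that this idealized estimate ignores.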

Synchronous vs. asynchronous: when to use which

Algorithm	Architecture	Off-policy correction	Stability	Throughput at scale	When to use
A3C	Async, gradient push	None (ignored)	Unstable at scale	High (biased)	Largely superseded
A2C	Sync, single actor	N/A (on-policy)	Stable	Low (GPU idle)	Small-scale baselines
PPO	Sync, batched actors	N/A (clipped surrogate)	Very stable	Medium	Single-machine production
IMPALA	Async, actor-learner	V-trace	Stable	Very high	Large-scale multi-machine
APPO	Async, actor-learner	V-trace + PPO clip	Very stable	Very high	Recommended for SSA wargame

A3C (2016) was an early asynchronous actor-critic algorithm. Its workers compute gradients locally and push them to a shared set of parameters, with no trajectory queue and no off-policy correction. This works when the policy changes slowly, but at scale the gradient staleness causes systematic bias that degrades performance. IMPALA replaced it by pushing trajectories (not gradients) and adding V-trace correction.

A2C is synchronous: all actors collect a batch, the learner updates, repeat. It has zero off-policy bias but keeps the GPU idle most of the time. It is the right choice when you have a single machine with a few CPU cores and need a stable reference implementation.

PPO is the current industry standard for single-machine training. Its clipped surrogate objective prevents large policy updates without requiring V-trace. At scale, PPO's synchronous collect-then-update loop becomes the bottleneck even with many parallel workers.

APPO inherits the best of both: IMPALA's asynchronous throughput and PPO's clipped surrogate stability. For the SSA wargame with 500+ environments and a research compute budget of 1–4 GPUs and 32–128 CPU cores, APPO is the right choice.
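
For reference, the PPO clipped surrogate mentioned above can be written for a single sample in a few lines. This is a sketch of the standard objective from Schulman et al. with the common clip range of 0.2; it is not RLlib's APPO loss, which combines this term with V-trace advantages and value and entropy terms. The function name is hypothetical.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio already above 1 + eps: the gain is capped.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# Negative advantage, ratio above 1 + eps: the penalty is NOT capped.
full_penalty = ppo_clipped_objective(ratio=1.5, advantage=-1.0)
# Ratio inside the clip range: identical to the unclipped surrogate.
inside_range = ppo_clipped_objective(ratio=1.1, advantage=2.0)

print(f"{capped_gain:.2f} {full_penalty:.2f} {inside_range:.2f}")
```

The asymmetry is the point of the design: the objective stops rewarding moves that push the ratio further outside the clip range, while still penalizing moves that made a bad action more likely.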

Multi-GPU scaling

For larger training runs, the learner can be sharded across multiple GPUs. Each GPU handles a portion of the batch.

from ray.rllib.algorithms.appo import APPOConfig

# Multi-GPU learner configuration for SSA wargame research setup
config = (
    APPOConfig()
    .environment("SSAConjunctionEnv")
    .rollouts(
        num_rollout_workers=64,     # more actors to feed multiple GPUs
        num_envs_per_worker=16,     # 64 x 16 = 1024 parallel environments
        rollout_fragment_length=50,
    )
    .training(
        train_batch_size=8192,      # larger batch to distribute across GPUs
        lr=5e-4,
        gamma=0.99,
        vtrace=True,
        num_sgd_iter=1,             # APPO: one pass per batch (unlike PPO's multiple)
    )
    .resources(
        num_gpus=2,                 # learner sharded across 2 GPUs
        num_cpus_per_worker=1,
    )
)

How multi-GPU sharding works in RLlib: with num_gpus=2, the learner splits each training batch in half. GPU 0 handles the first half; GPU 1 handles the second half. Gradients are averaged across GPUs before the optimizer step. The actors are unaffected — they see a single policy and push to a single queue regardless of how many GPUs the learner uses.
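
The reason gradient averaging works can be seen on a toy problem. The sketch below uses a one-parameter linear model with a mean-squared-error loss (not RLlib's learner) and shows that averaging the gradients of two equal-size shards reproduces the full-batch gradient. All names and data are illustrative.

```python
def mse_grad(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 1.5

full_batch_grad = mse_grad(w, xs, ys)

# Split the batch in half, as a 2-GPU learner would, and average the shard gradients.
grad_gpu0 = mse_grad(w, xs[:2], ys[:2])
grad_gpu1 = mse_grad(w, xs[2:], ys[2:])
averaged_grad = (grad_gpu0 + grad_gpu1) / 2.0

print(f"full batch: {full_batch_grad:.4f}, averaged shards: {averaged_grad:.4f}")
```

The equality holds because the loss is a mean over samples and the shards are the same size; with unequal shards the average would need to be weighted by shard size.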

Practical recommendations for SSA wargame research:

1 GPU: 32 workers, 16 envs/worker (512 envs total). Good for initial experiments and hyperparameter search.
2 GPUs: 64 workers, 16 envs/worker (1,024 envs total). Appropriate for longer training runs where 1 GPU becomes the bottleneck.
4 GPUs: 96 workers, 16 envs/worker (1,536 envs total). Best for final policy training with long horizons and Rust game logic where actor throughput is very high.

The learner benefits from multiple GPUs only if the actor throughput can keep the queue full. With Python environments and 512 envs, a single GPU is typically the right starting point.

A complete working example

The following is a minimal but complete distributed training script that a student can run on a machine with at least 4 CPU cores and optionally one GPU:

"""
Minimal APPO training script for an SSA-like scheduling environment.
Runs locally with Ray; no cluster required.

Requirements:
    pip install "ray[rllib]" gymnasium torch
"""

import ray
from ray.tune.registry import register_env
from ray.rllib.algorithms.appo import APPOConfig
import gymnasium as gym
import numpy as np


class SimpleSatelliteEnv(gym.Env):
    """
    Simplified satellite scheduling environment for demonstration.
    5 satellites, 20-step episodes, discrete action space.
    State: [time_remaining, staleness_0..4, priority_0..4] (11-dimensional)
    Action: integer 0-4, which satellite to observe
    Reward: priority * freshness * success
    """
    def __init__(self, config=None):
        self.n_satellites = 5
        self.episode_len  = 20
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0,
            shape=(1 + self.n_satellites * 2,),
            dtype=np.float32,
        )
        self.action_space = gym.spaces.Discrete(self.n_satellites)
        self.reset()

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t          = 0
        self.priorities = self.np_random.uniform(0.1, 1.0, self.n_satellites).astype(np.float32)
        self.staleness  = np.zeros(self.n_satellites, dtype=np.float32)
        return self._obs(), {}

    def _obs(self):
        time_remaining = np.array(
            [(self.episode_len - self.t) / self.episode_len], dtype=np.float32
        )
        return np.concatenate([
            time_remaining,
            self.staleness / self.episode_len,
            self.priorities,
        ])

    def step(self, action):
        success   = self.np_random.random() > 0.2
        freshness = 1.0 / (1.0 + self.staleness[action])
        reward    = (
            float(self.priorities[action] * freshness * 10.0) * success
            - 0.5 * (not success)
        )
        self.staleness    += 1.0
        self.staleness[action] = 0.0
        self.t            += 1
        terminated = self.t >= self.episode_len
        return self._obs(), reward, terminated, False, {}


def main():
    register_env("SimpleSatelliteEnv", lambda cfg: SimpleSatelliteEnv(cfg))
    ray.init(ignore_reinit_error=True, num_cpus=4)

    # Small-scale config: works on a laptop (no GPU required)
    config = (
        APPOConfig()
        .environment("SimpleSatelliteEnv")
        .rollouts(
            num_rollout_workers=3,      # 3 actor processes
            num_envs_per_worker=4,      # 12 total parallel envs
            rollout_fragment_length=20,
        )
        .training(
            train_batch_size=240,
            lr=5e-4,
            gamma=0.99,
            vtrace=True,
            vtrace_clip_rho_threshold=1.0,
            entropy_coeff=0.01,
            grad_clip=40.0,
        )
        .resources(num_gpus=0)  # set to 1 if a GPU is available
    )

    algo = config.build()

    print("Training APPO on SimpleSatelliteEnv...")
    print(f"{'Iter':>6}  {'Mean reward':>14}  {'Total steps':>14}")
    print("-" * 42)

    for i in range(50):
        result = algo.train()
        if (i + 1) % 5 == 0:
            print(
                f"{i+1:>6}  "
                f"{result['episode_reward_mean']:>14.3f}  "
                f"{result['timesteps_total']:>14,}"
            )

    algo.stop()
    ray.shutdown()
    print("Done.")


if __name__ == "__main__":
    main()

What to observe when running this script:

Early iterations: mean reward is low and noisy while the policy is close to random
After 10–20 iterations: the policy learns to prefer high-priority satellites (mean reward climbs)
The timesteps_total counter grows quickly despite only 12 environments — the asynchronous architecture keeps the small learner busy
Increasing num_rollout_workers to 16 and num_envs_per_worker to 8 (128 total envs) on a machine with enough CPU cores, and raising num_cpus in ray.init to match, will roughly 10x the per-iteration throughput

For the full SSA wargame, replace SimpleSatelliteEnv with the wargame environment from Module 8, scale up workers, and enable a GPU.

Key Takeaways

Synchronous on-policy RL wastes GPU time: collecting experience is CPU-bound; gradient updates are GPU-bound; doing them sequentially keeps each idle while the other runs. With 10 ms collection and 5 ms updates, synchronous training achieves only 33% GPU utilization, and less once stragglers appear across 500+ parallel environments.
IMPALA's decoupled actor-learner architecture achieves near-100% GPU and CPU utilization by making actors push trajectory segments to a shared queue continuously, while the learner pulls from the queue continuously. Neither side waits for the other.
V-trace corrects off-policy bias by weighting TD errors with clipped importance ratios $ρ_{t} = \min (\bar{ρ}, π / μ)$ . Clipping at $\bar{ρ} = 1$ sacrifices a small amount of correction fidelity in exchange for bounded variance: stale actor data produces conservative rather than explosive gradient updates.
APPO combines IMPALA throughput with PPO stability: the asynchronous actor-learner queue provides throughput; V-trace handles off-policy correction; the PPO-style clipped surrogate prevents destructive large policy updates. It is the recommended training backbone for the SSA wargame.
Throughput scales with environment speed: with 512 Python environments at 20 ms per step, APPO achieves ~25,600 steps/second; with Rust game logic at 2 ms per step, it reaches ~256,000 steps/second. A 50M-step training run shrinks from about 33 minutes to about 3 minutes, which is the direct motivation for Module 8's Rust game implementation.
Multi-GPU scaling is actor-limited: adding GPUs helps only if the actor throughput can keep the queue full. For Python environments and 512 envs, start with 1 GPU and 32 workers; for Rust environments, 2–4 GPUs and 64–96 workers are appropriate for the SSA wargame research setup.
