Lesson 7: Proximal Policy Optimization

Module: Reinforcement Learning — M03: Sequential Decision-Making Source: Schulman et al. (2017) "Proximal Policy Optimization Algorithms"; Schulman et al. (2015) "Trust Region Policy Optimization"; Schulman et al. (2016) "High-Dimensional Continuous Control Using Generalized Advantage Estimation"; Liang et al. (2018) "RLlib: Abstractions for Distributed Reinforcement Learning"

Where this fits

Lessons 5 and 6 established the two sides of modern deep RL: policy gradient methods (REINFORCE) and actor-critic architecture. REINFORCE has high variance; actor-critic reduces variance by using a learned critic as a baseline. But actor-critic still has a fundamental instability: if one gradient step is too large, the policy can collapse — moving so far from the previous policy that subsequent rollouts are generated by a qualitatively different policy than the one that produced the training data, invalidating the gradient estimate.

Proximal Policy Optimization (PPO) is the standard solution to this instability. It is the most widely deployed RL algorithm in practice: OpenAI used it to train the Dota 2 agent that beat professional players; DeepMind uses it in its robotics work; Ray RLlib's APPO (Asynchronous PPO), used in Module 8's distributed training pipeline, is a direct descendant. Understanding PPO is a prerequisite for understanding what Module 8's RLlib configuration is actually doing.

The trust region problem

Policy gradient methods compute:

∇_θ J(π_θ) = E[∇_θ log π_θ(a|s) · A(s, a)]

One gradient step updates θ by α * ∇J. The problem: the gradient was computed assuming the current policy π_θ. After the gradient step, the new policy π_{θ+Δθ} may be very different from π_θ. The advantage estimates A(s, a) — computed under π_θ — are no longer accurate for π_{θ+Δθ}. If the step is large enough, the new policy is worse than the old one, and the next gradient step (now computed under the worse policy) can be equally destructive. Policy training can catastrophically collapse.

The trust region intuition: constrain each update to stay close to the current policy, where "close" is measured in KL divergence between the old and new policy distributions. TRPO (Schulman et al. 2015) formalizes this as a constrained optimization problem:

maximize E[r_t(θ) · Â_t]
subject to E[KL(π_θ_old || π_θ)] ≤ δ

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the importance ratio between new and old policy.

TRPO works but is expensive: the KL constraint requires computing second-order derivatives (the Fisher information matrix) and solving a constrained optimization problem at each update. For large neural network policies, this is computationally prohibitive.

PPO: the clipped surrogate objective

PPO approximates the TRPO trust region constraint with a simple clipping operation that requires no second-order optimization:

L_CLIP(θ) = E_t[min(r_t(θ) · Â_t,  clip(r_t(θ), 1-ε, 1+ε) · Â_t)]

where:

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) — the probability ratio between new and old policy
Â_t — the advantage estimate at time t
ε — the clip range (typically 0.1 or 0.2)

The intuition: when r_t > 1+ε, the new policy assigns much higher probability to this action than the old one — the update is being too aggressive. When r_t < 1-ε, the new policy assigns much lower probability. In both cases, the clipped objective stops the gradient from pushing the policy further outside the [1-ε, 1+ε] range.

Taking the minimum of the clipped and unclipped objectives ensures:

When the advantage is positive (this was a good action): the update is encouraged but capped at 1+ε
When the advantage is negative (this was a bad action): the update is encouraged but capped at 1-ε
The objective is always a lower bound on the TRPO objective

The result is a conservative but consistent update: PPO never takes a policy step so large that the importance ratio leaves the trust region, without requiring any constrained optimization.

Generalized Advantage Estimation

PPO uses Generalized Advantage Estimation (GAE, Schulman et al. 2016) rather than the Monte Carlo returns of REINFORCE or the single-step TD advantage of basic actor-critic. GAE interpolates between them with a parameter λ ∈ [0, 1]:

δ_t     = r_t + γ V(s_{t+1}) - V(s_t)      # TD error at step t
Â_t^GAE = Σ_{k=0}^{T-t} (γλ)^k δ_{t+k}    # discounted sum of TD errors

When λ=0: Â_t = δ_t = r_t + γV(s_{t+1}) - V(s_t) — single-step TD advantage, low variance, high bias. When λ=1: Â_t = Σ(γ^k r_{t+k}) - V(s_t) — full Monte Carlo return, high variance, zero bias. Typical λ=0.95 balances these well.

GAE requires that you have a trained value function V(s) — provided by the critic. This is why actor-critic is the prerequisite: PPO inherits the actor-critic architecture and extends it with the clipped objective and GAE.

Complete PPO implementation

PPO collects a fixed batch of experience under the current policy, then performs K epochs of gradient updates on that batch before collecting new experience. This is the key efficiency gain over REINFORCE: rather than discarding each batch after a single gradient step, PPO reuses it for K steps (typically K=4–10) while the clipping constraint ensures the policy doesn't move too far.

import torch
import torch.nn as nn
import numpy as np

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor = nn.Linear(hidden, action_dim)   # logits
        self.critic = nn.Linear(hidden, 1)            # state value

    def forward(self, x):
        h = self.shared(x)
        return self.actor(h), self.critic(h).squeeze(-1)

def ppo_update(
    model: ActorCritic,
    optimizer: torch.optim.Optimizer,
    states: torch.Tensor,
    actions: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    returns: torch.Tensor,
    clip_eps: float = 0.2,
    vf_coef: float = 0.5,
    entropy_coef: float = 0.01,
    n_epochs: int = 4,
):
    for _ in range(n_epochs):
        logits, values = model(states)
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy().mean()

        # Probability ratio
        ratio = torch.exp(log_probs - old_log_probs)

        # Clipped surrogate objective
        adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
        policy_loss = -torch.min(surr1, surr2).mean()

        # Value function loss
        value_loss = nn.functional.mse_loss(values, returns)

        # Combined loss
        loss = policy_loss + vf_coef * value_loss - entropy_coef * entropy

        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
        optimizer.step()

Key implementation details:

Normalize advantages: (adv - adv.mean()) / adv.std() reduces variance across the batch and stabilizes training. Do this inside the update, not during collection.
Entropy bonus: - entropy_coef * entropy encourages exploration by penalizing a collapsed policy that always picks the same action. Typical entropy_coef = 0.01.
Gradient clipping: Even with PPO's policy constraint, the value function loss can produce large gradients. Clip to max_norm=0.5.
Recompute log probs: Do not cache log probabilities from collection — compute them fresh from the current policy at each epoch of the update. The clipping constraint depends on the ratio of current to old policy, so old log probs must be fixed at collection time and new log probs must be recomputed during each update step.

PPO vs. PPO-Penalty

The clipped objective (PPO-Clip) described above is the standard variant. An alternative, PPO-Penalty, uses an adaptive KL penalty instead of clipping:

L_KL(θ) = E[r_t(θ) · Â_t] - β * KL(π_θ_old || π_θ)

If KL exceeds a target threshold, β is increased; if KL is below the threshold, β is decreased. This approximates the TRPO constraint without the Fisher matrix computation.

In practice, PPO-Clip outperforms PPO-Penalty on most benchmarks and is simpler to implement. Use PPO-Clip unless you have specific reasons to prefer KL-based regularization.

APPO in RLlib: the distributed descendant

Module 8's RLlib training pipeline uses APPO (Asynchronous PPO). APPO applies PPO's clipped objective in the IMPALA decoupled actor-learner architecture (Lesson 8): multiple rollout workers collect experience asynchronously, the learner updates the policy using the PPO-Clip objective, and V-trace corrects for the off-policy bias introduced by the staleness of the actors' policies relative to the learner.

The RLlib configuration maps directly to PPO's hyperparameters:

from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .training(
        clip_param=0.2,       # ε in the clipped objective
        vf_loss_coeff=0.5,    # vf_coef in our implementation
        entropy_coeff=0.01,   # entropy_coef
        gamma=0.99,           # discount factor
        lambda_=0.95,         # GAE λ
        num_sgd_iter=4,       # K epochs of gradient updates per batch
    )
    .rollouts(num_rollout_workers=8)
)

Understanding PPO lets you read this configuration meaningfully: clip_param=0.2 is ε, num_sgd_iter=4 is K, lambda_=0.95 is the GAE λ. Without this lesson, Module 8's RLlib pipeline is a configuration file with opaque hyperparameter names.

Key Takeaways

PPO solves the policy collapse problem of vanilla actor-critic. Large gradient steps can move the policy so far from the data-collection policy that subsequent training diverges. PPO prevents this with a clipped importance ratio that stops the update when the new policy diverges too far.
The clipped surrogate objective min(r_t · Â, clip(r_t, 1-ε, 1+ε) · Â) is a conservative lower bound on the TRPO objective. It achieves trust-region-like behavior without second-order optimization — the key engineering simplification that makes PPO practical.
PPO reuses each batch for K epochs of gradient updates. Unlike REINFORCE (one update per rollout) or basic actor-critic, PPO milks each batch for K gradient steps while the clipping constraint ensures the policy doesn't move too far. Typical K = 4–10.
GAE (λ=0.95) interpolates between TD and Monte Carlo advantage estimates. λ=0 is single-step TD (low variance, high bias); λ=1 is full Monte Carlo return (high variance, zero bias). λ=0.95 provides a practical balance.
Normalize advantages within each batch. Subtract the batch mean and divide by batch standard deviation before computing the clipped objective. This reduces sensitivity to reward scaling and stabilizes training.
APPO in RLlib is PPO's clipped objective applied in IMPALA's decoupled architecture. clip_param, num_sgd_iter, and lambda_ in the RLlib config correspond directly to ε, K, and GAE λ in this lesson's implementation.

ML for Spacepower Simulations