
Proximal Policy Optimization Algorithms — In-Depth Technical Review

Abstract

Reinforcement learning (RL) has become a cornerstone of modern AI — from training robots to walk, to aligning large language models with human preferences via RLHF. At the heart of many of these breakthroughs lies a deceptively simple algorithm called Proximal Policy Optimization (PPO), introduced by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI in 2017.

The paper proposes a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. Unlike standard policy gradient methods that perform one gradient update per data sample, PPO introduces a novel clipped objective function that enables multiple epochs of minibatch updates without catastrophically large policy changes. The result is an algorithm that inherits the stability of Trust Region Policy Optimization (TRPO) while being dramatically simpler to implement — requiring only a few lines of code change to a vanilla policy gradient implementation.

The paper's experiments demonstrate PPO's superiority across a wide range of benchmarks: continuous control tasks (MuJoCo), complex 3D humanoid locomotion (Roboschool), and Atari game playing. PPO outperforms other online policy gradient methods and strikes an excellent balance between sample complexity, simplicity, and wall-time performance.

Why this paper matters today: PPO is arguably the most influential RL algorithm of the deep learning era. It became the default algorithm for training policies in robotics, game AI, and — most consequentially — the RLHF pipeline that aligns large language models like ChatGPT, Claude, and Gemini. Understanding PPO is prerequisite knowledge for anyone working in AI alignment, LLM training, or modern RL systems.


1. Prerequisites: What You Need to Know Before Reading This Paper

1.1 Reinforcement Learning Basics: Agents, States, Actions, and Rewards

Before we dive into PPO, let's establish the fundamental framework of reinforcement learning (RL). If you've never encountered RL before, think of it as teaching a computer to make decisions by trial and error, much like training a dog — good behaviors get rewarded, bad behaviors get penalized, and over time the learner figures out what to do.

The RL Setup. An RL problem consists of an agent (the learner/decision-maker) interacting with an environment (the world). At each time step t:

  1. The agent observes the current state s_t (e.g., the positions of joints in a robot, or pixels on a screen)
  2. The agent selects an action a_t based on its policy π (the decision-making strategy)
  3. The environment transitions to a new state s_{t+1} and provides a scalar reward r_t
  4. The goal is to maximize the cumulative (discounted) reward: R = Σ_{t=0}^{∞} γ^t r_t

Here, γ ∈ [0, 1) is the discount factor — it determines how much the agent cares about future rewards versus immediate ones. A γ close to 1 means the agent is farsighted; close to 0 means it's myopic.

Policy π. The policy is the agent's strategy, mapping states to actions. In deep RL, the policy is typically a neural network parameterized by θ:

  • For discrete action spaces (e.g., Atari games with joystick directions): π_θ(a|s) outputs a probability distribution over actions
  • For continuous action spaces (e.g., robot joint torques): π_θ outputs the mean (and sometimes variance) of a Gaussian distribution from which actions are sampled

Value Function V(s). The value function estimates how good it is to be in a particular state — specifically, the expected cumulative reward from state s onward if we follow policy π:

V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]

Action-Value Function Q(s, a). Similarly, the Q-function estimates the expected return from taking action a in state s and then following π:

Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]

Advantage Function A(s, a). The advantage function measures how much better action a is compared to the average action in state s:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

A positive advantage means the action is better than average; negative means worse. The advantage function is central to PPO because it tells us which actions to encourage and which to discourage.

1.2 Policy Gradient Methods: Learning by Gradient Ascent

Policy gradient methods are a family of RL algorithms that directly optimize the policy parameters θ by computing the gradient of the expected reward with respect to θ and performing gradient ascent.

The Policy Gradient Theorem. The fundamental result (due to Sutton et al., 1999) says that the gradient of the expected return J(θ) with respect to the policy parameters can be written as:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^{\pi_\theta}(s_t, a_t)\right]

Intuition: This gradient says "increase the probability of actions that have positive advantage (better than average), and decrease the probability of actions with negative advantage." The magnitude of the update is proportional to how much better or worse the action was.

The Vanilla Policy Gradient (REINFORCE). In practice, we estimate this gradient from sampled trajectories:

\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t\right]

where Â_t is an estimator of the advantage function (e.g., using rewards-to-go minus a baseline). This estimator is unbiased but has very high variance, leading to noisy, unstable training. The corresponding loss function whose gradient gives us ĝ is:

L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t|s_t) \hat{A}_t\right]

The Problem with Vanilla Policy Gradients. While mathematically elegant, vanilla policy gradients have a critical practical limitation: you can only use each batch of sampled data for one gradient step. If you try to take multiple gradient steps on the same batch, the updates compound and the policy can change so drastically that performance collapses. This is wildly data-inefficient — you collect an entire trajectory, use it once, then throw it away.
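As a concrete sketch (hypothetical NumPy code, not from the paper), the loss L^PG for a batch of sampled actions is just the mean of log-probability times advantage:

```python
import numpy as np

def pg_loss(log_probs, advantages):
    """Vanilla policy-gradient surrogate: mean(log pi(a_t|s_t) * A_t).

    log_probs:  log-probabilities of the actions actually taken, shape (T,)
    advantages: advantage estimates for those actions, shape (T,)
    Gradient ascent on this quantity raises the probability of
    positive-advantage actions and lowers that of negative-advantage ones.
    """
    return float(np.mean(log_probs * advantages))

# Toy batch: two better-than-average actions, one worse-than-average
logp = np.log(np.array([0.5, 0.3, 0.2]))
adv = np.array([1.0, 0.5, -1.0])
loss = pg_loss(logp, adv)
```

In a deep-learning framework this scalar would be maximized by autodiff; the key point is that it is recomputed from a fresh batch after every single gradient step.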

1.3 Trust Region Policy Optimization (TRPO): The Predecessor

TRPO (Schulman et al., 2015) addresses the instability problem by introducing a constraint on how much the policy can change in each update. The idea is beautifully intuitive: if we only allow small policy changes, we can safely reuse data for multiple gradient steps without worrying about catastrophic performance drops.

The TRPO Objective. TRPO maximizes a "surrogate" objective while constraining the KL divergence between the old and new policies:

\max_\theta \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} \hat{A}_t\right]

\text{subject to } \hat{\mathbb{E}}_t\left[\text{KL}[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right] \leq \delta

The Probability Ratio. The ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is called the importance sampling ratio. When θ = θ_old, this ratio equals 1. As the policy changes, the ratio moves away from 1. The ratio allows us to evaluate how the new policy performs using data collected by the old policy — this is what makes data reuse possible.

Why TRPO is Problematic. While TRPO works well, it has significant practical drawbacks:

  1. Implementation complexity: TRPO requires computing second-order derivatives (the Fisher information matrix) and solving a constrained optimization problem using the conjugate gradient algorithm. This is non-trivial to implement correctly.
  2. Incompatible with shared architectures: TRPO doesn't easily work with architectures that share parameters between the policy and value function networks, or with auxiliary tasks — which are common in practice.
  3. Incompatible with noise: It doesn't work well with dropout or other forms of stochasticity in the network.

PPO was designed to get TRPO's benefits without these drawbacks.

1.4 KL Divergence: Measuring Policy Distance

KL divergence (Kullback-Leibler divergence) measures how different two probability distributions are. For policies π_old and π_new:

\text{KL}[\pi_\text{old} \| \pi_\text{new}] = \sum_a \pi_\text{old}(a|s) \log \frac{\pi_\text{old}(a|s)}{\pi_\text{new}(a|s)}

Key properties:

  • KL ≥ 0 always, and KL = 0 if and only if the distributions are identical
  • It is not symmetric: KL[π_old ‖ π_new] ≠ KL[π_new ‖ π_old]
  • Small KL divergence means the policies make similar decisions
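A quick numerical check of these properties (illustrative NumPy sketch; the two three-action distributions are invented):

```python
import numpy as np

def kl(p, q):
    """KL[p || q] for discrete distributions given as arrays summing to 1."""
    return float(np.sum(p * np.log(p / q)))

p_old = np.array([0.7, 0.2, 0.1])
p_new = np.array([0.5, 0.3, 0.2])

forward = kl(p_old, p_new)   # KL[pi_old || pi_new]
reverse = kl(p_new, p_old)   # KL[pi_new || pi_old]
# Both values are positive, they differ (asymmetry), and
# KL of a distribution with itself is exactly zero.
```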

In TRPO, the KL constraint ensures the policy doesn't change too much in any single update, preventing catastrophic performance drops.

1.5 Importance Sampling: Reusing Data from Old Policies

Importance sampling is a technique for estimating expectations under one distribution using samples from a different distribution. If we have samples from π_old but want to evaluate the expected reward under π_θ, we use:

\mathbb{E}_{\pi_\theta}[f(x)] = \mathbb{E}_{\pi_\text{old}}\left[\frac{\pi_\theta(x)}{\pi_\text{old}(x)} f(x)\right]

This is what allows PPO to reuse trajectories collected by the old policy to update the new policy — the probability ratio r_t(θ) is precisely this importance sampling correction. However, importance sampling becomes unreliable when the distributions are very different (high variance), which is why PPO needs mechanisms to keep the policy close to the old one.
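A numerical illustration (hypothetical NumPy sketch; the two-action policies and payoff function f are invented): sampling actions from the old policy while reweighting by the ratio recovers the new policy's expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

pi_old = np.array([0.5, 0.5])   # old policy over 2 actions
pi_new = np.array([0.9, 0.1])   # new policy we want to evaluate
f = np.array([1.0, 0.0])        # f(action): payoff of each action

# Sample from pi_old, reweight each sample by pi_new(a)/pi_old(a)
actions = rng.choice(2, size=200_000, p=pi_old)
ratios = pi_new[actions] / pi_old[actions]
estimate = np.mean(ratios * f[actions])
# The true expectation under pi_new is 0.9*1.0 + 0.1*0.0 = 0.9
```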

1.6 Generalized Advantage Estimation (GAE)

GAE (Schulman et al., 2015b) is a technique for computing advantage estimates that provides a smooth tradeoff between bias and variance. It defines:

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}

where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the temporal difference (TD) error, γ is the discount factor, and λ ∈ [0, 1] is a new hyperparameter controlling the bias-variance tradeoff:

  • λ = 0: Uses only the one-step TD error δ_t — low variance but high bias
  • λ = 1: Reduces to the full Monte Carlo advantage estimate — unbiased but high variance
  • Intermediate λ (typically 0.95): Balances both

PPO uses GAE with λ = 0.95 in all experiments, which has become the standard practice.
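In practice the sum is computed backwards in a single pass over a finite segment, using the recursion Â_t = δ_t + γλÂ_{t+1}. A sketch (NumPy assumed; the toy rewards and values are invented):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T)  (one extra bootstrap value at the end)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # recursion
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.5])
adv = gae(rewards, values)
```

With λ = 0 each advantage collapses to the one-step TD error; larger λ mixes in longer-horizon information.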


2. The PPO Algorithm: Core Contributions

2.1 Motivation: First-Order TRPO

The central question driving PPO is: Can we get TRPO's stability benefits using only first-order optimization (regular gradient descent), without the complexity of constrained optimization?

The authors explore two approaches:

  1. Clipped surrogate objective (PPO-Clip): Modify the objective function to naturally penalize large policy changes
  2. Adaptive KL penalty (PPO-Penalty): Use a penalty term instead of a hard constraint, with an adaptive coefficient

The clipped version turns out to work best and has become the standard "PPO" algorithm.

2.2 The Clipped Surrogate Objective (PPO-Clip)

This is the key innovation of the paper. Recall the CPI surrogate objective from TRPO:

L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta) \hat{A}_t\right]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio.

Without any constraint, maximizing L^CPI can lead to destructively large policy updates — the ratio r_t(θ) can become very large, causing the policy to jump to a completely different behavior.

The PPO Solution: Clip the ratio. PPO proposes:

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right)\right]

where ε is a hyperparameter (typically ε = 0.2).

Breaking this down piece by piece:

  1. r_t(θ) Â_t: This is the standard surrogate objective — multiply the probability ratio by the advantage.

  2. clip(r_t(θ), 1 − ε, 1 + ε): This clips the probability ratio to the interval [1 − ε, 1 + ε] = [0.8, 1.2] with ε = 0.2. This means the clipped version ignores any policy change that moves the ratio outside this interval.

  3. min(·, ·): Take the minimum of the clipped and unclipped objectives. This creates a pessimistic (lower-bound) estimate of the true objective.

Why does this work? Two cases to consider:

Case 1: Positive advantage (Â_t > 0). The action was better than average. We want to increase its probability (increase r_t). The unclipped objective r_t Â_t increases as r_t increases. But the clipped version caps the benefit at r_t = 1 + ε. The min operation means: "You can increase the probability of this good action, but only up to 1 + ε times its old probability — beyond that, you get no additional objective value."

Case 2: Negative advantage (Â_t < 0). The action was worse than average. We want to decrease its probability (decrease r_t). The unclipped objective r_t Â_t gets "better" (less negative) as r_t decreases. But the clipped version caps the benefit at r_t = 1 − ε. The min operation means: "You can decrease the probability of this bad action, but only down to 1 − ε times its old probability."

The key insight: The clipping acts as a one-sided constraint. We only clip changes that would improve the objective — we never clip changes that make the objective worse. This means the clipped objective is always a lower bound on the unclipped objective, creating a "pessimistic" estimate that discourages overly aggressive policy updates. It is precisely this asymmetry that gives PPO its stability.

Near the starting point: At θ = θ_old (where r_t = 1), the clipped and unclipped objectives are identical. They only diverge as θ moves away from θ_old. To first order, L^CLIP(θ) = L^CPI(θ) — the clipping has zero effect on small updates.
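The two cases can be checked numerically. A small sketch (illustrative NumPy, not library code) evaluating the per-sample objective min(r·Â, clip(r, 1−ε, 1+ε)·Â):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate objective from L^CLIP."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped)

# Positive advantage: the reward for raising r saturates at 1 + eps = 1.2,
# but lowering r (a mistake) is penalized without any clipping.
good = ppo_clip_objective(np.array([1.5, 0.5]), np.array([1.0, 1.0]))

# Negative advantage: the reward for lowering r saturates at 1 - eps = 0.8,
# but raising r (a mistake) is penalized without any clipping.
bad = ppo_clip_objective(np.array([0.5, 1.5]), np.array([-1.0, -1.0]))
```

The asymmetry is visible directly: beneficial ratio changes are capped, harmful ones are not, so the objective is a pessimistic lower bound.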

2.3 Adaptive KL Penalty (PPO-Penalty)

The paper also proposes an alternative: use a KL penalty instead of a hard constraint (as in TRPO), but adapt the penalty coefficient automatically. In each policy update:

  1. Optimize the KL-penalized objective for several epochs:

L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} \hat{A}_t - \beta\, \text{KL}[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right]

  2. Compute the actual KL divergence: d = Ê_t[KL[π_θ_old(·|s_t), π_θ(·|s_t)]]

  3. Adapt β:

    • If d < d_targ / 1.5: the update was too conservative, so halve β: β ← β / 2
    • If d > d_targ × 1.5: the update was too aggressive, so double β: β ← β × 2

This adaptive scheme automatically adjusts the penalty strength based on whether the actual policy change was too large or too small relative to the target. The heuristic multipliers (1.5 and 2) are not carefully tuned — the algorithm is robust to their exact values.
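The adaptation rule translates directly into a few lines of code (a minimal sketch of the paper's heuristic; the function name is ours):

```python
def adapt_beta(beta, d, d_targ):
    """Adaptive-KL coefficient update from the PPO paper.

    beta:   current KL penalty coefficient
    d:      measured mean KL divergence after the update
    d_targ: target KL divergence per update
    """
    if d < d_targ / 1.5:
        beta /= 2        # policy changed too little: weaken the penalty
    elif d > d_targ * 1.5:
        beta *= 2        # policy changed too much: strengthen the penalty
    return beta          # within the dead zone: leave beta unchanged
```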

Experimental verdict: The KL penalty works reasonably well but consistently underperforms the clipped objective across benchmarks. The clipped version is simpler and more effective, which is why "PPO" in practice almost always refers to PPO-Clip.

2.4 The Complete PPO Objective

In practice, PPO is often used with a neural network architecture that shares parameters between the policy π_θ and the value function V_θ. In that case the combined objective function is:

L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_\theta](s_t)\right]

where:

  • L^CLIP_t(θ) is the clipped surrogate policy loss
  • L^VF_t(θ) = (V_θ(s_t) − V_t^targ)² is the value function loss (mean squared error between predicted and target values)
  • S[π_θ](s_t) is an entropy bonus that encourages exploration by penalizing overly deterministic policies
  • c_1 is the value function coefficient (typically 1.0)
  • c_2 is the entropy coefficient (typically 0.01 for Atari)

Why an entropy bonus? Without it, the policy can collapse to always choosing the same action, losing the ability to explore and potentially missing better strategies. The entropy term keeps the action distribution "spread out," ensuring the agent continues to try different things.
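For a single state with a discrete action distribution, the combined objective is easy to write out (a scalar sketch with invented names and numbers, assuming NumPy; not the paper's code):

```python
import numpy as np

def combined_objective(l_clip, v_pred, v_targ, probs, c1=1.0, c2=0.01):
    """L^{CLIP+VF+S} for one state: L^CLIP - c1*(V - V_targ)^2 + c2*entropy."""
    l_vf = (v_pred - v_targ) ** 2
    entropy = -np.sum(probs * np.log(probs))   # S[pi](s), in nats
    return l_clip - c1 * l_vf + c2 * entropy

# Same surrogate and value terms; only the action distribution differs.
uniform = np.ones(4) / 4                       # maximum entropy (ln 4)
peaked = np.array([0.97, 0.01, 0.01, 0.01])    # nearly deterministic
u = combined_objective(0.5, 1.0, 1.2, uniform)
p = combined_objective(0.5, 1.0, 1.2, peaked)
# The entropy bonus makes the spread-out policy score higher
```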

2.5 The Full Algorithm: Actor-Critic with Fixed-Length Trajectory Segments

The practical PPO algorithm combines all the above components:

Algorithm: PPO, Actor-Critic Style

for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do                           // N parallel actors
        Run policy π_θ_old in environment for T timesteps // Collect data
        Compute advantage estimates Â_1, ..., Â_T         // Using GAE
    end for
    Optimize surrogate L w.r.t. θ, with K epochs          // Multiple passes
        and minibatch size M ≤ NT                         // over the data
    θ_old ← θ                                             // Update old policy
end for

Key design decisions:

  1. Parallel actors: N actors collect data simultaneously from N copies of the environment. This provides N × T transitions per iteration, enabling efficient use of parallel hardware.

  2. Fixed-length segments: Each actor collects exactly T timesteps, regardless of episode boundaries. This simplifies the implementation and ensures consistent batch sizes.

  3. Multiple epochs (K): The same batch of data is used for K optimization epochs. This is the key efficiency gain over vanilla policy gradients (which use data once). The clipping mechanism prevents the multiple updates from causing harmful policy drift.

  4. Minibatch SGD: Within each epoch, the N × T transitions are shuffled and divided into minibatches of size M for stochastic gradient updates.

  5. GAE for advantage estimation: Using generalized advantage estimation with the truncated form:

\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}

where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD error.
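The epoch/minibatch structure of the optimization phase can be sketched as follows (illustrative NumPy code, not the paper's implementation; only the index bookkeeping is shown, with the gradient step omitted):

```python
import numpy as np

def minibatch_schedule(n_transitions, minibatch_size, epochs, seed=0):
    """Yield index minibatches: K epochs, each a fresh shuffle of the
    whole batch of N*T transitions split into chunks of size M."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(n_transitions)
        for start in range(0, n_transitions, minibatch_size):
            yield idx[start:start + minibatch_size]

# MuJoCo-style settings: N*T = 2048 transitions, M = 64, K = 10
batches = list(minibatch_schedule(2048, 64, 10))
# Each iteration therefore performs 10 * (2048 / 64) = 320 gradient steps
```

Each epoch visits every transition exactly once, which is what makes the K-epoch reuse sample-efficient relative to vanilla policy gradients.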


3. Detailed Method Analysis

3.1 Why Clipping Works: A Geometric Perspective

To truly understand why the clipped objective is effective, it helps to think about the optimization landscape geometrically.

Consider the objective as a function of the policy parameters θ. Starting from θ_old:

  • The unclipped objective L^CPI(θ) = Ê_t[r_t(θ)Â_t] is a smooth function that can increase without bound as θ moves away from θ_old. This is dangerous because the surrogate objective becomes a poor approximation of the true objective for large policy changes.

  • The clipped objective L^CLIP(θ) is identical to L^CPI near θ_old but "flattens out" once the probability ratio moves outside [1 − ε, 1 + ε]. This creates a "trust region" effect without explicitly computing KL divergence.

The paper's Figure 2 beautifully illustrates this: as we interpolate along the update direction, L^CLIP is a lower bound on L^CPI and peaks at a KL divergence of about 0.02, then decreases. This self-regulating behavior — the objective naturally discouraging further optimization beyond a certain point — is what makes PPO stable without explicit constraints.

3.2 Comparison of Surrogate Objectives

The paper systematically compares several variants of the surrogate objective:

Variant | Description | Avg. Normalized Score
No clipping or penalty | L_t(θ) = r_t(θ)Â_t | −0.39
Clipping, ε = 0.1 | Too conservative | 0.76
Clipping, ε = 0.2 | Sweet spot | 0.82
Clipping, ε = 0.3 | Slightly too aggressive | 0.70
Adaptive KL, d_targ = 0.003 | Conservative | 0.68
Adaptive KL, d_targ = 0.01 | Moderate | 0.74
Adaptive KL, d_targ = 0.03 | Aggressive | 0.71
Fixed KL, β = 0.3 | Weak penalty | 0.62
Fixed KL, β = 1.0 | Moderate penalty | 0.71
Fixed KL, β = 3.0 | Strong penalty | 0.72
Fixed KL, β = 10.0 | Very strong penalty | 0.69

Key observations:

  1. No constraint at all is catastrophic (negative score due to training collapse on HalfCheetah)
  2. Clipping with ε = 0.2 is best — it outperforms all KL penalty variants
  3. The KL penalty approaches (both fixed and adaptive) are competitive but less robust across hyperparameter choices
  4. Clipping has only one hyperparameter (ε) while KL penalty has two (β or d_targ, plus initialization)

3.3 Architecture and Implementation Details

Network Architecture for Continuous Control (MuJoCo):

  • Fully-connected MLP with two hidden layers of 64 units each
  • Tanh nonlinearities
  • Output: mean of a Gaussian distribution with learned (variable) standard deviation
  • Policy and value function do NOT share parameters (separate networks)
  • No entropy bonus used

Hyperparameters for Continuous Control:

Hyperparameter | Value
Horizon T | 2048
Adam stepsize | 3 × 10⁻⁴
Num. epochs K | 10
Minibatch size M | 64
Discount γ | 0.99
GAE parameter λ | 0.95
Clip parameter ε | 0.2

Network Architecture for Atari:

  • Convolutional neural network (same as Mnih et al., 2016)
  • Policy and value function share the convolutional backbone
  • Linear annealing of learning rate and clip parameter

Hyperparameters for Atari:

Hyperparameter | Value
Horizon T | 128
Adam stepsize | 2.5 × 10⁻⁴ × α
Num. epochs K | 3
Minibatch size M | 32 × 8 = 256
Discount γ | 0.99
GAE parameter λ | 0.95
Number of actors N | 8
Clip parameter ε | 0.1 × α
VF coefficient c_1 | 1
Entropy coefficient c_2 | 0.01

where α is linearly annealed from 1 to 0 over the course of training. This annealing schedule means both the learning rate and the clipping range shrink as training progresses, allowing finer adjustments in later stages.
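One way to implement the schedule (a sketch; `linear_anneal` is our name, not from the paper):

```python
def linear_anneal(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal a coefficient from start to end over training."""
    frac = min(step / total_steps, 1.0)   # clamp after training ends
    return start + frac * (end - start)

# Both the Adam step size and the clip range are scaled by alpha:
alpha = linear_anneal(step=500_000, total_steps=1_000_000)  # 0.5 halfway
lr = 2.5e-4 * alpha
eps = 0.1 * alpha
```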


4. Experimental Results and Analysis

4.1 Continuous Control: MuJoCo Benchmarks

PPO was tested on 7 simulated robotics tasks implemented in OpenAI Gym using the MuJoCo physics engine: HalfCheetah, Hopper, InvertedDoublePendulum, InvertedPendulum, Reacher, Swimmer, and Walker2d.

PPO vs. Other Algorithms:

PPO (with ε = 0.2) was compared against:

  • TRPO (Trust Region Policy Optimization)
  • CEM (Cross-Entropy Method)
  • Vanilla PG with adaptive stepsize
  • A2C (Advantage Actor-Critic, synchronous A3C)
  • A2C + Trust Region

Results: PPO outperforms all baseline algorithms on almost all environments. Specifically:

  • On HalfCheetah, PPO achieves ~2000 reward vs. ~1500 for TRPO and ~500 for A2C
  • On Hopper, PPO reaches ~2500 while most baselines plateau around ~1500-2000
  • On Walker2d, PPO achieves ~3000 vs. ~1000-2000 for baselines
  • Only on Swimmer does PPO show comparable (not superior) performance to some baselines

The significance here is that PPO achieves these results with much simpler code than TRPO — no conjugate gradient solver, no line search, no Fisher information matrix computation.

4.2 3D Humanoid Control: Roboschool

To demonstrate PPO's capability on high-dimensional continuous control, the authors test on three 3D humanoid tasks using Roboschool:

  1. RoboschoolHumanoid: Forward locomotion only — the robot must learn to walk
  2. RoboschoolHumanoidFlagrun: The target position changes randomly every 200 timesteps or when reached — the robot must navigate to changing goals
  3. RoboschoolHumanoidFlagrunHarder: Same as above, but the robot is pelted by cubes and must recover from being knocked down

These tasks have observation spaces of ~44 dimensions and action spaces of ~17 dimensions (controlling individual joints of a humanoid body). PPO successfully learns all three tasks, demonstrating natural-looking locomotion, turning behavior, and recovery from being knocked over. The learning curves show steady improvement over ~100M timesteps, with the Flagrun tasks reaching rewards of ~2500 and ~3000 respectively.

This was particularly impressive at the time of publication because humanoid locomotion was considered a very challenging benchmark. The robot learns to coordinate many degrees of freedom to produce stable, adaptive walking behavior.

4.3 Atari Game Playing

PPO was tested on all 49 Atari games in OpenAI Gym, compared against A2C and ACER (Actor-Critic with Experience Replay).

Scoring metrics:

  1. Average reward per episode over all of training (favors fast learning)
  2. Average reward per episode over last 100 episodes (favors final performance)

Results (games "won" by each algorithm):

Metric | A2C | ACER | PPO | Tie
Avg. over all of training | 1 | 18 | 30 | 0
Avg. over last 100 episodes | 1 | 28 | 19 | 1

Analysis:

  • PPO dominates in fast learning (winning 30/49 games) — it reaches good performance quickly
  • ACER has an edge in final performance (28/49) — with experience replay, it eventually converges to better policies on some games
  • A2C is almost always the worst performer
  • PPO achieves these results while being significantly simpler than ACER (no experience replay buffer, no bias correction, no trust region computation)

Notable game-by-game results from Table 6:

  • Atlantis: PPO achieves 2,311,815 vs. A2C's 729,265 and ACER's 1,841,376
  • BattleZone: PPO scores 17,367 vs. ACER's 8,983 and A2C's 3,080
  • Kangaroo: PPO scores 9,929 vs. A2C's 45 and ACER's 50 — a 200x improvement
  • Zaxxon: PPO scores 5,009 vs. A2C's 16 and ACER's 29 — dramatic improvement

5. PPO's Role in RLHF and LLM Alignment

While the original PPO paper focused on robotics and game environments, PPO's most consequential application has been in Reinforcement Learning from Human Feedback (RLHF) — the technique used to align large language models with human preferences.

5.1 The RLHF Pipeline

The standard RLHF pipeline (as used in InstructGPT, ChatGPT, Claude, etc.) consists of three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune a pretrained LLM on high-quality demonstration data
  2. Reward Model Training: Train a reward model on human preference comparisons (which of two responses is better?)
  3. PPO Fine-Tuning: Use PPO to optimize the SFT model to maximize the reward model's scores

In Stage 3, the "environment" is the text generation process:

  • State s_t: The prompt + tokens generated so far
  • Action a_t: The next token to generate
  • Reward r_t: Given by the reward model (typically only at the end of generation)
  • Policy π_θ: The language model being fine-tuned

PPO's clipped objective is critical here because uncontrolled optimization against the reward model can lead to reward hacking — the model finds degenerate outputs that score highly according to the reward model but are actually low quality (e.g., repetitive text, exploitation of reward model biases).

5.2 Why PPO for RLHF?

Several properties of PPO make it particularly well-suited for LLM alignment:

  1. Stability: The clipped objective prevents the model from changing too drastically in any single update, which is crucial when fine-tuning a model that already has valuable capabilities
  2. Simplicity: RLHF implementations are already complex; PPO's simplicity compared to alternatives like TRPO reduces the implementation burden
  3. Sample efficiency: Multiple epochs of minibatch updates allow better use of expensive-to-generate trajectories
  4. KL regularization compatibility: In RLHF, an additional KL penalty against the SFT model is typically added to prevent the model from straying too far from its initial behavior:

\text{reward}(x, y) = r_\phi(x, y) - \beta\, \text{KL}[\pi_\theta \| \pi_\text{SFT}]

PPO's clipping combined with this KL penalty provides a double safeguard against reward hacking.
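In many RLHF implementations the KL term is applied per token, with the reward-model score added on the final token. The sketch below is a hypothetical simplification (function name and log-probabilities invented, using the common sample-based KL estimate log π_θ − log π_SFT), not a specific library's API:

```python
import numpy as np

def rlhf_rewards(rm_score, logp_policy, logp_sft, beta=0.1):
    """Per-token PPO rewards for one generated response.

    Every token is penalized by beta * (log pi_theta - log pi_SFT);
    the reward-model score is added on the final token only.
    """
    kl_per_token = logp_policy - logp_sft   # sample-based KL estimate
    rewards = -beta * kl_per_token
    rewards[-1] += rm_score
    return rewards

# Hypothetical 4-token response
logp_policy = np.array([-1.0, -0.5, -2.0, -0.8])
logp_sft = np.array([-1.2, -0.6, -1.5, -0.8])
r = rlhf_rewards(rm_score=2.0, logp_policy=logp_policy, logp_sft=logp_sft)
```

Tokens where the policy has drifted above the SFT model's probability are penalized; tokens where it has drifted below receive a small positive correction.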

5.3 Alternatives and Successors

PPO's dominance in RLHF has spurred research into alternatives:

  • DPO (Direct Preference Optimization): Eliminates the reward model entirely by reframing the problem as supervised learning on preferences
  • GRPO (Group Relative Policy Optimization): Used by DeepSeek; replaces the critic model with group-relative advantages
  • REINFORCE variants: Some recent work (e.g., RLOO) shows that simpler policy gradient methods with proper baselines can match PPO in RLHF settings
  • KTO (Kahneman-Tversky Optimization): Uses prospect theory-inspired objectives instead of paired comparisons

Despite these alternatives, PPO remains the most widely validated approach for RLHF and is used in production by multiple major AI labs.


6. Limitations and Boundary Conditions

6.1 Sensitivity to Hyperparameters

While PPO is more robust than TRPO, it is not hyperparameter-free:

  • The clip parameter ε has a meaningful effect (0.1 vs. 0.2 vs. 0.3 give noticeably different performance)
  • Learning rate, number of epochs, and batch size all matter
  • The GAE parameters (γ and λ) require tuning for new domains

In the RLHF setting specifically, hyperparameter sensitivity is amplified because the reward signal is noisy (coming from an imperfect reward model).

6.2 Sample Efficiency Limitations

PPO is an on-policy algorithm — it only uses data collected by the current policy. While the clipping mechanism allows multiple gradient steps per batch, PPO is still significantly less sample-efficient than off-policy methods like SAC (Soft Actor-Critic) that maintain experience replay buffers. This means:

  • More environment interactions are needed compared to off-policy methods
  • In settings where data collection is expensive (e.g., real robots, RLHF with human feedback), this is a significant cost

6.3 Credit Assignment Over Long Horizons

PPO uses GAE for advantage estimation, which relies on the value function to propagate credit across time. When episodes are very long or rewards are extremely sparse, the value function may fail to provide useful gradients, leading to slow or failed learning. This is particularly relevant in:

  • Long-horizon planning tasks
  • Tasks where the reward is only given at the very end (e.g., game won/lost)
  • LLM alignment where the reward is given after generating the entire response

6.4 Reward Hacking in RLHF

In the RLHF application specifically, PPO can still exploit imperfections in the reward model:

  • The model may learn to generate outputs that game the reward model without actually being helpful
  • Additional techniques like KL penalties, reward model ensembles, and process supervision are needed to mitigate this
  • The clip parameter provides some protection but is not a complete solution

6.5 Scalability Concerns

The original PPO paper uses relatively small networks (2-layer MLPs with 64 units). When scaling to very large models (e.g., 175B parameter LLMs), new challenges arise:

  • Memory for storing both old and new policy parameters
  • Computational cost of computing probability ratios for all tokens
  • Distributed training coordination across many GPUs
  • Gradient computation through the entire model for each PPO step
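The memory point can be made concrete with back-of-envelope arithmetic (illustrative numbers only, assuming bf16 weights):

```python
# Rough memory cost of one full weight copy for a 175B-parameter model.
# These are illustrative figures, not measurements from any specific system.
params = 175e9
bytes_per_param = 2  # bf16 storage

copy_gb = params * bytes_per_param / 1e9  # GB per weight copy

# An RLHF PPO step typically keeps several such sets resident: the policy
# being trained, the value model, and a frozen reference policy for the KL
# penalty -- before counting activations or optimizer state (Adam adds two
# fp32 moment tensors per trainable parameter).
```

At 350 GB per copy, even the weights alone exceed any single accelerator, which is why RLHF at this scale forces model-parallel and offloading machinery onto what is otherwise a simple first-order algorithm.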

7. Reproducibility and Practical Considerations

7.1 Implementation Availability

PPO is implemented in numerous open-source libraries:

  • Stable Baselines3 (Python): The most popular RL library, with well-tested PPO implementation
  • RLlib (Ray): Scalable distributed RL framework
  • CleanRL: Single-file implementations for education
  • TRL (Hugging Face): PPO implementation specifically for LLM fine-tuning via RLHF
  • DeepSpeed-Chat: Microsoft's scalable RLHF training framework using PPO

7.2 Common Implementation Pitfalls

Based on extensive community experience, common mistakes when implementing PPO include:

  1. Incorrect advantage normalization: Per-minibatch normalization (subtracting the mean, dividing by the standard deviation) is critical for stable training but is not mentioned in the paper
  2. Incorrect value target computation: The GAE computation must be done correctly at the trajectory level
  3. Missing observation normalization: Running mean/std normalization of observations significantly improves performance
  4. Incorrect gradient accumulation: When using multiple epochs, the old log-probabilities must be detached/stopped from the computation graph
  5. Missing reward scaling: Normalizing rewards to have roughly unit variance can significantly improve stability
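The first pitfall above is mechanically trivial but easy to omit. A minimal sketch (function name mine):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Per-minibatch advantage normalization: zero mean, unit variance.

    Without this step the scale of the surrogate loss varies wildly from
    batch to batch, which interacts badly with a fixed learning rate and
    a fixed clip range. The old log-probabilities used in the ratio must
    likewise be treated as constants (detached from the autograd graph).
    """
    return (adv - adv.mean()) / (adv.std() + eps)
```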

The ICLR 2022 paper "The 37 Implementation Details of Proximal Policy Optimization" by Huang et al. catalogues these details exhaustively.

7.3 Hyperparameter Recommendations

Based on the paper and community experience:

| Setting     | Clip ε  | Epochs K | Horizon T     | Learning rate | γ    | λ    |
|-------------|---------|----------|---------------|---------------|------|------|
| MuJoCo      | 0.2     | 10       | 2048          | 3e-4          | 0.99 | 0.95 |
| Atari       | 0.1 → 0 | 3        | 128           | 2.5e-4 → 0    | 0.99 | 0.95 |
| RLHF (LLM)  | 0.2     | 1–4      | full response | 1e-5 to 5e-6  | 1.0  | 0.95 |

Arrows indicate values linearly annealed to zero over the course of training.
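As a concrete reference point, the MuJoCo settings above map onto Stable Baselines3's PPO constructor roughly as follows (argument names follow SB3's API as I understand it; verify against the version you use):

```python
# MuJoCo-style PPO hyperparameters expressed as Stable Baselines3 keyword
# arguments. Names are SB3 conventions; the values are from the table above.
mujoco_ppo_kwargs = {
    "clip_range": 0.2,      # epsilon, the clip parameter
    "n_epochs": 10,         # K, optimization epochs per batch
    "n_steps": 2048,        # T, rollout horizon per environment
    "learning_rate": 3e-4,  # Adam step size
    "gamma": 0.99,          # discount factor
    "gae_lambda": 0.95,     # GAE lambda
}

# Typical usage (requires stable-baselines3 and an environment):
#   from stable_baselines3 import PPO
#   model = PPO("MlpPolicy", env, **mujoco_ppo_kwargs)
```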

8. Historical Context and Impact

8.1 Timeline

  • 2015: TRPO published — first stable policy gradient method via trust regions
  • 2015: GAE published — variance reduction for advantage estimation
  • 2016: A3C/A2C published — parallel data collection for policy gradients
  • 2017: PPO published — simplified TRPO with clipped objective
  • 2018: PPO used to train OpenAI Five (the Dota 2 system that went on to defeat the reigning world champion team in 2019)
  • 2022: InstructGPT uses PPO for RLHF — aligning GPT-3 with human preferences
  • 2022-present: PPO becomes the default RLHF algorithm for ChatGPT, Claude, Gemini, Llama, etc.

8.2 Citation Impact

PPO is one of the most cited papers in deep reinforcement learning, with thousands of citations. More importantly, it has become a practical standard — when people say "we trained with RL," they usually mean PPO.

8.3 The Simplicity Principle

PPO embodies an important engineering principle: the best algorithm is often not the most theoretically elegant one, but the one that's simplest to implement, debug, and scale. TRPO has stronger theoretical guarantees (monotonic improvement under certain conditions), but PPO's practical advantages — first-order optimization, compatibility with shared architectures, ease of distributed implementation — made it the winner in practice.


9. Summary

PPO represents a masterclass in practical algorithm design. By replacing TRPO's constrained optimization with a simple clipped surrogate objective, Schulman et al. created an algorithm that is:

  1. Simple: Requires only a few lines of code change to vanilla policy gradients
  2. Stable: The clipping mechanism delivers TRPO-like stability in practice, without TRPO's second-order optimization (though without its formal monotonic-improvement guarantee)
  3. General: Works with shared policy-value architectures, with dropout, with auxiliary tasks
  4. Scalable: First-order optimization scales naturally to large models and parallel implementations
  5. Effective: Matches or exceeds TRPO performance across diverse benchmarks

The algorithm's most profound impact has been as the optimization engine for RLHF, enabling the alignment of large language models with human preferences. Every time you interact with ChatGPT, Claude, or Gemini, you're benefiting from a model whose behavior was shaped by PPO's clipped objective function.

For practitioners, PPO remains the first algorithm to reach for when solving an RL problem. Its combination of simplicity, robustness, and proven effectiveness across domains — from simulated robots to billion-parameter language models — makes it one of the most important algorithms of the deep learning era.


References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  2. Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML.
  3. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
  4. Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.
  5. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS.
  6. Huang, S., et al. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track.
  7. Kakade, S. & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. ICML.
  8. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS.