
Proximal Policy Optimization Algorithms — In-Depth Technical Review

Abstract

Reinforcement learning (RL) has become a cornerstone of modern AI — from training robots to walk, to aligning large language models with human preferences via RLHF. At the heart of many of these breakthroughs lies a deceptively simple algorithm called Proximal Policy Optimization (PPO), introduced by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI in 2017.

The paper proposes a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. Unlike standard policy gradient methods that perform one gradient update per data sample, PPO introduces a novel clipped objective function that enables multiple epochs of minibatch updates without catastrophically large policy changes. The result is an algorithm that inherits the stability of Trust Region Policy Optimization (TRPO) while being dramatically simpler to implement — requiring only a few lines of code change to a vanilla policy gradient implementation.

The paper's experiments demonstrate PPO's superiority across a wide range of benchmarks: continuous control tasks (MuJoCo), complex 3D humanoid locomotion (Roboschool), and Atari game playing. PPO outperforms other online policy gradient methods and strikes an excellent balance between sample complexity, simplicity, and wall-time performance.

Why this paper matters today: PPO is arguably the most influential RL algorithm of the deep learning era. It became the default algorithm for training policies in robotics, game AI, and — most consequentially — the RLHF pipeline that aligns large language models like ChatGPT, Claude, and Gemini. Understanding PPO is prerequisite knowledge for anyone working in AI alignment, LLM training, or modern RL systems.


1. Prerequisites: What You Need to Know Before Reading This Paper

1.1 Reinforcement Learning Basics: Agents, States, Actions, and Rewards

Before we dive into PPO, let's establish the fundamental framework of reinforcement learning (RL). If you've never encountered RL before, think of it as teaching a computer to make decisions by trial and error, much like training a dog — good behaviors get rewarded, bad behaviors get penalized, and over time the learner figures out what to do.

The RL Setup. An RL problem consists of an agent (the learner/decision-maker) interacting with an environment (the world). At each time step t:

  1. The agent observes the current state s_t (e.g., the positions of joints in a robot, or pixels on a screen)
  2. The agent selects an action a_t based on its policy π (the decision-making strategy)
  3. The environment transitions to a new state s_{t+1} and provides a scalar reward r_t
  4. The goal is to maximize the cumulative (discounted) reward: R = Σ_{t=0}^{∞} γ^t r_t

Here, γ ∈ [0, 1) is the discount factor — it determines how much the agent cares about future rewards versus immediate ones. A γ close to 1 means the agent is farsighted; close to 0 means it's myopic.

Policy π. The policy is the agent's strategy, mapping states to actions. In deep RL, the policy is typically a neural network parameterized by θ:

  • For discrete action spaces (e.g., Atari games with joystick directions): π_θ(a|s) outputs a probability distribution over actions
  • For continuous action spaces (e.g., robot joint torques): π_θ outputs the mean (and sometimes variance) of a Gaussian distribution from which actions are sampled

Value Function V(s). The value function estimates how good it is to be in a particular state — specifically, the expected cumulative reward from state s onward if we follow policy π:

V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]

Action-Value Function Q(s, a). Similarly, the Q-function estimates the expected return from taking action a in state s and then following π:

Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]

Advantage Function A(s, a). The advantage function measures how much better action a is compared to the average action in state s:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

A positive advantage means the action is better than average; negative means worse. The advantage function is central to PPO because it tells us which actions to encourage and which to discourage.

1.2 Policy Gradient Methods: Learning by Gradient Ascent

Policy gradient methods are a family of RL algorithms that directly optimize the policy parameters θ by computing the gradient of the expected reward with respect to θ and performing gradient ascent.

The Policy Gradient Theorem. The fundamental result (due to Sutton et al., 1999) says that the gradient of the expected return J(θ) with respect to the policy parameters can be written as:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^{\pi_\theta}(s_t, a_t)\right]

Intuition: This gradient says "increase the probability of actions that have positive advantage (better than average), and decrease the probability of actions with negative advantage." The magnitude of the update is proportional to how much better or worse the action was.

The Vanilla Policy Gradient (REINFORCE). In practice, we estimate this gradient from sampled trajectories:

\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t\right]

where Â_t is an estimator of the advantage function (e.g., using rewards-to-go minus a baseline). This estimator is unbiased but has very high variance, leading to noisy, unstable training. The corresponding loss function whose gradient gives us ĝ is:

L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t|s_t) \hat{A}_t\right]

The Problem with Vanilla Policy Gradients. While mathematically elegant, vanilla policy gradients have a critical practical limitation: you can only use each batch of sampled data for one gradient step. If you try to take multiple gradient steps on the same batch, the updates compound and the policy can change so drastically that performance collapses. This is wildly data-inefficient — you collect an entire trajectory, use it once, then throw it away.
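As a concrete sketch (hypothetical NumPy code, not from the paper), the loss L^PG for a batch of sampled actions is just the mean of log-probability times advantage:

```python
import numpy as np

def pg_loss(log_probs, advantages):
    """Vanilla policy-gradient surrogate: mean(log pi(a_t|s_t) * A_t).

    log_probs:  log-probabilities of the actions actually taken, shape (T,)
    advantages: advantage estimates for those actions, shape (T,)
    Gradient ascent on this quantity raises the probability of
    positive-advantage actions and lowers that of negative-advantage ones.
    """
    return float(np.mean(log_probs * advantages))

# Toy batch: two better-than-average actions, one worse-than-average
logp = np.log(np.array([0.5, 0.3, 0.2]))
adv = np.array([1.0, 0.5, -1.0])
loss = pg_loss(logp, adv)
```

In a deep-learning framework this scalar would be maximized by autodiff; the key point is that it is recomputed from a fresh batch after every single gradient step.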

1.3 Trust Region Policy Optimization (TRPO): The Predecessor

TRPO (Schulman et al., 2015) addresses the instability problem by introducing a constraint on how much the policy can change in each update. The idea is beautifully intuitive: if we only allow small policy changes, we can safely reuse data for multiple gradient steps without worrying about catastrophic performance drops.

The TRPO Objective. TRPO maximizes a "surrogate" objective while constraining the KL divergence between the old and new policies:

\max_\theta \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} \hat{A}_t\right]

\text{subject to } \hat{\mathbb{E}}_t\left[\text{KL}[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right] \leq \delta

The Probability Ratio. The ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is called the importance sampling ratio. When θ = θ_old, this ratio equals 1. As the policy changes, the ratio moves away from 1. The ratio allows us to evaluate how the new policy performs using data collected by the old policy — this is what makes data reuse possible.

Why TRPO is Problematic. While TRPO works well, it has significant practical drawbacks:

  1. Implementation complexity: TRPO requires computing second-order derivatives (the Fisher information matrix) and solving a constrained optimization problem using the conjugate gradient algorithm. This is non-trivial to implement correctly.
  2. Incompatible with shared architectures: TRPO doesn't easily work with architectures that share parameters between the policy and value function networks, or with auxiliary tasks — which are common in practice.
  3. Incompatible with noise: It doesn't work well with dropout or other forms of stochasticity in the network.

PPO was designed to get TRPO's benefits without these drawbacks.

1.4 KL Divergence: Measuring Policy Distance

KL divergence (Kullback-Leibler divergence) measures how different two probability distributions are. For policies π_old and π_new:

\text{KL}[\pi_\text{old} \| \pi_\text{new}] = \sum_a \pi_\text{old}(a|s) \log \frac{\pi_\text{old}(a|s)}{\pi_\text{new}(a|s)}

Key properties:

  • KL ≥ 0 always, and KL = 0 if and only if the distributions are identical
  • It is not symmetric: KL[π_old ‖ π_new] ≠ KL[π_new ‖ π_old]
  • Small KL divergence means the policies make similar decisions
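A quick numerical check of these properties (illustrative NumPy sketch; the two three-action distributions are invented):

```python
import numpy as np

def kl(p, q):
    """KL[p || q] for discrete distributions given as arrays summing to 1."""
    return float(np.sum(p * np.log(p / q)))

p_old = np.array([0.7, 0.2, 0.1])
p_new = np.array([0.5, 0.3, 0.2])

forward = kl(p_old, p_new)   # KL[pi_old || pi_new]
reverse = kl(p_new, p_old)   # KL[pi_new || pi_old]
# Both values are positive, they differ (asymmetry), and
# KL of a distribution with itself is exactly zero.
```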

In TRPO, the KL constraint ensures the policy doesn't change too much in any single update, preventing catastrophic performance drops.

1.5 Importance Sampling: Reusing Data from Old Policies

Importance sampling is a technique for estimating expectations under one distribution using samples from a different distribution. If we have samples from π_old but want to evaluate the expected reward under π_θ, we use:

\mathbb{E}_{\pi_\theta}[f(x)] = \mathbb{E}_{\pi_\text{old}}\left[\frac{\pi_\theta(x)}{\pi_\text{old}(x)} f(x)\right]

This is what allows PPO to reuse trajectories collected by the old policy to update the new policy — the probability ratio r_t(θ) is precisely this importance sampling correction. However, importance sampling becomes unreliable when the distributions are very different (high variance), which is why PPO needs mechanisms to keep the policy close to the old one.
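A numerical illustration (hypothetical NumPy sketch; the two-action policies and payoff function f are invented): sampling actions from the old policy while reweighting by the ratio recovers the new policy's expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

pi_old = np.array([0.5, 0.5])   # old policy over 2 actions
pi_new = np.array([0.9, 0.1])   # new policy we want to evaluate
f = np.array([1.0, 0.0])        # f(action): payoff of each action

# Sample from pi_old, reweight each sample by pi_new(a)/pi_old(a)
actions = rng.choice(2, size=200_000, p=pi_old)
ratios = pi_new[actions] / pi_old[actions]
estimate = np.mean(ratios * f[actions])
# The true expectation under pi_new is 0.9*1.0 + 0.1*0.0 = 0.9
```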

1.6 Generalized Advantage Estimation (GAE)

GAE (Schulman et al., 2015b) is a technique for computing advantage estimates that provides a smooth tradeoff between bias and variance. It defines:

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}

where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the temporal difference (TD) error, γ is the discount factor, and λ ∈ [0, 1] is a new hyperparameter controlling the bias-variance tradeoff:

  • λ = 0: Uses only the one-step TD error δ_t — low variance but high bias
  • λ = 1: Reduces to the full Monte Carlo advantage estimate — unbiased but high variance
  • Intermediate λ (typically 0.95): Balances both

PPO uses GAE with λ = 0.95 in all experiments, which has become the standard practice.
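In practice the sum is computed backwards in a single pass over a finite segment, using the recursion Â_t = δ_t + γλÂ_{t+1}. A sketch (NumPy assumed; the toy rewards and values are invented):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T)  (one extra bootstrap value at the end)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # recursion
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.5])
adv = gae(rewards, values)
```

With λ = 0 each advantage collapses to the one-step TD error; larger λ mixes in longer-horizon information.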


2. The PPO Algorithm: Core Contributions

2.1 Motivation: First-Order TRPO

The central question driving PPO is: Can we get TRPO's stability benefits using only first-order optimization (regular gradient descent), without the complexity of constrained optimization?

The authors explore two approaches:

  1. Clipped surrogate objective (PPO-Clip): Modify the objective function to naturally penalize large policy changes
  2. Adaptive KL penalty (PPO-Penalty): Use a penalty term instead of a hard constraint, with an adaptive coefficient

The clipped version turns out to work best and has become the standard "PPO" algorithm.

2.2 The Clipped Surrogate Objective (PPO-Clip)

This is the key innovation of the paper. Recall the CPI surrogate objective from TRPO:

L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta) \hat{A}_t\right]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio.

Without any constraint, maximizing L^CPI can lead to destructively large policy updates — the ratio r_t(θ) can become very large, causing the policy to jump to a completely different behavior.

The PPO Solution: Clip the ratio. PPO proposes:

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right)\right]

where ε is a hyperparameter (typically ε = 0.2).

Breaking this down piece by piece:

  1. r_t(θ) Â_t: This is the standard surrogate objective — multiply the probability ratio by the advantage.

  2. clip(r_t(θ), 1 − ε, 1 + ε): This clips the probability ratio to the interval [1 − ε, 1 + ε] = [0.8, 1.2] with ε = 0.2. This means the clipped version ignores any policy change that moves the ratio outside this interval.

  3. min(·, ·): Take the minimum of the clipped and unclipped objectives. This creates a pessimistic (lower-bound) estimate of the true objective.

Why does this work? Two cases to consider:

Case 1: Positive advantage (Â_t > 0). The action was better than average. We want to increase its probability (increase r_t). The unclipped objective r_t Â_t increases as r_t increases. But the clipped version caps the benefit at r_t = 1 + ε. The min operation means: "You can increase the probability of this good action, but only up to 1 + ε times its old probability — beyond that, you get no additional objective value."

Case 2: Negative advantage (Â_t < 0). The action was worse than average. We want to decrease its probability (decrease r_t). The unclipped objective r_t Â_t gets "better" (less negative) as r_t decreases. But the clipped version caps the benefit at r_t = 1 − ε. The min operation means: "You can decrease the probability of this bad action, but only down to 1 − ε times its old probability."

The key insight: The clipping acts as a one-sided constraint. We only clip changes that would improve the objective — we never clip changes that make the objective worse. This means the clipped objective is always a lower bound on the unclipped objective, creating a "pessimistic" estimate that discourages overly aggressive policy updates. It is precisely this asymmetry that gives PPO its stability.

Near the starting point: At θ = θ_old (where r_t = 1), the clipped and unclipped objectives are identical. They only diverge as θ moves away from θ_old. To first order, L^CLIP(θ) = L^CPI(θ) — the clipping has zero effect on small updates.
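The two cases can be checked numerically. A small sketch (illustrative NumPy, not library code) evaluating the per-sample objective min(r·Â, clip(r, 1−ε, 1+ε)·Â):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate objective from L^CLIP."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped)

# Positive advantage: the reward for raising r saturates at 1 + eps = 1.2,
# but lowering r (a mistake) is penalized without any clipping.
good = ppo_clip_objective(np.array([1.5, 0.5]), np.array([1.0, 1.0]))

# Negative advantage: the reward for lowering r saturates at 1 - eps = 0.8,
# but raising r (a mistake) is penalized without any clipping.
bad = ppo_clip_objective(np.array([0.5, 1.5]), np.array([-1.0, -1.0]))
```

The asymmetry is visible directly: beneficial ratio changes are capped, harmful ones are not, so the objective is a pessimistic lower bound.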

2.3 Adaptive KL Penalty (PPO-Penalty)

The paper also proposes an alternative: use a KL penalty instead of a hard constraint (as in TRPO), but adapt the penalty coefficient automatically. In each policy update:

  1. Optimize the KL-penalized objective for several epochs:

L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} \hat{A}_t - \beta\, \text{KL}[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right]

  2. Compute the actual KL divergence: d = Ê_t[KL[π_θ_old(·|s_t), π_θ(·|s_t)]]

  3. Adapt β:

    • If d < d_targ / 1.5: the update was too conservative, so halve β: β ← β / 2
    • If d > d_targ × 1.5: the update was too aggressive, so double β: β ← β × 2

This adaptive scheme automatically adjusts the penalty strength based on whether the actual policy change was too large or too small relative to the target. The heuristic multipliers (1.5 and 2) are not carefully tuned — the algorithm is robust to their exact values.
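The adaptation rule translates directly into a few lines of code (a minimal sketch of the paper's heuristic; the function name is ours):

```python
def adapt_beta(beta, d, d_targ):
    """Adaptive-KL coefficient update from the PPO paper.

    beta:   current KL penalty coefficient
    d:      measured mean KL divergence after the update
    d_targ: target KL divergence per update
    """
    if d < d_targ / 1.5:
        beta /= 2        # policy changed too little: weaken the penalty
    elif d > d_targ * 1.5:
        beta *= 2        # policy changed too much: strengthen the penalty
    return beta          # within the dead zone: leave beta unchanged
```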

Experimental verdict: The KL penalty works reasonably well but consistently underperforms the clipped objective across benchmarks. The clipped version is simpler and more effective, which is why "PPO" in practice almost always refers to PPO-Clip.

2.4 The Complete PPO Objective

In practice, PPO is often used with a neural network architecture that shares parameters between the policy π_θ and the value function V_θ. In that case the combined objective function is:

L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_\theta](s_t)\right]

where:

  • L^CLIP_t(θ) is the clipped surrogate policy loss
  • L^VF_t(θ) = (V_θ(s_t) − V_t^targ)² is the value function loss (mean squared error between predicted and target values)
  • S[π_θ](s_t) is an entropy bonus that encourages exploration by penalizing overly deterministic policies
  • c_1 is the value function coefficient (typically 1.0)
  • c_2 is the entropy coefficient (typically 0.01 for Atari)

Why an entropy bonus? Without it, the policy can collapse to always choosing the same action, losing the ability to explore and potentially missing better strategies. The entropy term keeps the action distribution "spread out," ensuring the agent continues to try different things.
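For a single state with a discrete action distribution, the combined objective is easy to write out (a scalar sketch with invented names and numbers, assuming NumPy; not the paper's code):

```python
import numpy as np

def combined_objective(l_clip, v_pred, v_targ, probs, c1=1.0, c2=0.01):
    """L^{CLIP+VF+S} for one state: L^CLIP - c1*(V - V_targ)^2 + c2*entropy."""
    l_vf = (v_pred - v_targ) ** 2
    entropy = -np.sum(probs * np.log(probs))   # S[pi](s), in nats
    return l_clip - c1 * l_vf + c2 * entropy

# Same surrogate and value terms; only the action distribution differs.
uniform = np.ones(4) / 4                       # maximum entropy (ln 4)
peaked = np.array([0.97, 0.01, 0.01, 0.01])    # nearly deterministic
u = combined_objective(0.5, 1.0, 1.2, uniform)
p = combined_objective(0.5, 1.0, 1.2, peaked)
# The entropy bonus makes the spread-out policy score higher
```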

2.5 The Full Algorithm: Actor-Critic with Fixed-Length Trajectory Segments

The practical PPO algorithm combines all the above components:

Algorithm: PPO, Actor-Critic Style

for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do                           // N parallel actors
        Run policy π_θ_old in environment for T timesteps // Collect data
        Compute advantage estimates Â_1, ..., Â_T         // Using GAE
    end for
    Optimize surrogate L w.r.t. θ, with K epochs          // Multiple passes
        and minibatch size M ≤ NT                         // over the data
    θ_old ← θ                                             // Update old policy
end for

Key design decisions:

  1. Parallel actors: N actors collect data simultaneously from N copies of the environment. This provides N × T transitions per iteration, enabling efficient use of parallel hardware.

  2. Fixed-length segments: Each actor collects exactly T timesteps, regardless of episode boundaries. This simplifies the implementation and ensures consistent batch sizes.

  3. Multiple epochs (K): The same batch of data is used for K optimization epochs. This is the key efficiency gain over vanilla policy gradients (which use data once). The clipping mechanism prevents the multiple updates from causing harmful policy drift.

  4. Minibatch SGD: Within each epoch, the N × T transitions are shuffled and divided into minibatches of size M for stochastic gradient updates.

  5. GAE for advantage estimation: Using generalized advantage estimation with the truncated form:

\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}

where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD error.
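The epoch/minibatch structure of the optimization phase can be sketched as follows (illustrative NumPy code, not the paper's implementation; only the index bookkeeping is shown, with the gradient step omitted):

```python
import numpy as np

def minibatch_schedule(n_transitions, minibatch_size, epochs, seed=0):
    """Yield index minibatches: K epochs, each a fresh shuffle of the
    whole batch of N*T transitions split into chunks of size M."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(n_transitions)
        for start in range(0, n_transitions, minibatch_size):
            yield idx[start:start + minibatch_size]

# MuJoCo-style settings: N*T = 2048 transitions, M = 64, K = 10
batches = list(minibatch_schedule(2048, 64, 10))
# Each iteration therefore performs 10 * (2048 / 64) = 320 gradient steps
```

Each epoch visits every transition exactly once, which is what makes the K-epoch reuse sample-efficient relative to vanilla policy gradients.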


3. Detailed Method Analysis

3.1 Why Clipping Works: A Geometric Perspective

To truly understand why the clipped objective is effective, it helps to think about the optimization landscape geometrically.

Consider the objective as a function of the policy parameters θ. Starting from θ_old:

  • The unclipped objective L^CPI(θ) = Ê_t[r_t(θ)Â_t] is a smooth function that can increase without bound as θ moves away from θ_old. This is dangerous because the surrogate objective becomes a poor approximation of the true objective for large policy changes.

  • The clipped objective L^CLIP(θ) is identical to L^CPI near θ_old but "flattens out" once the probability ratio moves outside [1 − ε, 1 + ε]. This creates a "trust region" effect without explicitly computing KL divergence.

The paper's Figure 2 beautifully illustrates this: as we interpolate along the update direction, L^CLIP is a lower bound on L^CPI and peaks at a KL divergence of about 0.02, then decreases. This self-regulating behavior — the objective naturally discouraging further optimization beyond a certain point — is what makes PPO stable without explicit constraints.

3.2 Comparison of Surrogate Objectives

The paper systematically compares several variants of the surrogate objective:

Variant | Description | Avg. Normalized Score
No clipping or penalty | L_t(θ) = r_t(θ)Â_t | −0.39
Clipping, ε = 0.1 | Too conservative | 0.76
Clipping, ε = 0.2 | Sweet spot | 0.82
Clipping, ε = 0.3 | Slightly too aggressive | 0.70
Adaptive KL, d_targ = 0.003 | Conservative | 0.68
Adaptive KL, d_targ = 0.01 | Moderate | 0.74
Adaptive KL, d_targ = 0.03 | Aggressive | 0.71
Fixed KL, β = 0.3 | Weak penalty | 0.62
Fixed KL, β = 1.0 | Moderate penalty | 0.71
Fixed KL, β = 3.0 | Strong penalty | 0.72
Fixed KL, β = 10.0 | Very strong penalty | 0.69

Key observations:

  1. No constraint at all is catastrophic (negative score due to training collapse on HalfCheetah)
  2. Clipping with ε = 0.2 is best — it outperforms all KL penalty variants
  3. The KL penalty approaches (both fixed and adaptive) are competitive but less robust across hyperparameter choices
  4. Clipping has only one hyperparameter (ε) while KL penalty has two (β or d_targ, plus initialization)

3.3 Architecture and Implementation Details

Network Architecture for Continuous Control (MuJoCo):

  • Fully-connected MLP with two hidden layers of 64 units each
  • Tanh nonlinearities
  • Output: mean of a Gaussian distribution with learned (variable) standard deviation
  • Policy and value function do NOT share parameters (separate networks)
  • No entropy bonus used

Hyperparameters for Continuous Control:

Hyperparameter | Value
Horizon T | 2048
Adam stepsize | 3 × 10⁻⁴
Num. epochs K | 10
Minibatch size M | 64
Discount γ | 0.99
GAE parameter λ | 0.95
Clip parameter ε | 0.2

Network Architecture for Atari:

  • Convolutional neural network (same as Mnih et al., 2016)
  • Policy and value function share the convolutional backbone
  • Linear annealing of learning rate and clip parameter

Hyperparameters for Atari:

Hyperparameter | Value
Horizon T | 128
Adam stepsize | 2.5 × 10⁻⁴ × α
Num. epochs K | 3
Minibatch size M | 32 × 8 = 256
Discount γ | 0.99
GAE parameter λ | 0.95
Number of actors N | 8
Clip parameter ε | 0.1 × α
VF coefficient c_1 | 1
Entropy coefficient c_2 | 0.01

where α is linearly annealed from 1 to 0 over the course of training. This annealing schedule means both the learning rate and the clipping range shrink as training progresses, allowing finer adjustments in later stages.
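One way to implement the schedule (a sketch; `linear_anneal` is our name, not from the paper):

```python
def linear_anneal(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal a coefficient from start to end over training."""
    frac = min(step / total_steps, 1.0)   # clamp after training ends
    return start + frac * (end - start)

# Both the Adam step size and the clip range are scaled by alpha:
alpha = linear_anneal(step=500_000, total_steps=1_000_000)  # 0.5 halfway
lr = 2.5e-4 * alpha
eps = 0.1 * alpha
```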


4. Experimental Results and Analysis

4.1 Continuous Control: MuJoCo Benchmarks

PPO was tested on 7 simulated robotics tasks implemented in OpenAI Gym using the MuJoCo physics engine: HalfCheetah, Hopper, InvertedDoublePendulum, InvertedPendulum, Reacher, Swimmer, and Walker2d.

PPO vs. Other Algorithms:

PPO (with ε = 0.2) was compared against:

  • TRPO (Trust Region Policy Optimization)
  • CEM (Cross-Entropy Method)
  • Vanilla PG with adaptive stepsize
  • A2C (Advantage Actor-Critic, synchronous A3C)
  • A2C + Trust Region

Results: PPO outperforms all baseline algorithms on almost all environments. Specifically:

  • On HalfCheetah, PPO achieves ~2000 reward vs. ~1500 for TRPO and ~500 for A2C
  • On Hopper, PPO reaches ~2500 while most baselines plateau around ~1500-2000
  • On Walker2d, PPO achieves ~3000 vs. ~1000-2000 for baselines
  • Only on Swimmer does PPO show comparable (not superior) performance to some baselines

The significance here is that PPO achieves these results with much simpler code than TRPO — no conjugate gradient solver, no line search, no Fisher information matrix computation.

4.2 3D Humanoid Control: Roboschool

To demonstrate PPO's capability on high-dimensional continuous control, the authors test on three 3D humanoid tasks using Roboschool:

  1. RoboschoolHumanoid: Forward locomotion only — the robot must learn to walk
  2. RoboschoolHumanoidFlagrun: The target position changes randomly every 200 timesteps or when reached — the robot must navigate to changing goals
  3. RoboschoolHumanoidFlagrunHarder: Same as above, but the robot is pelted by cubes and must recover from being knocked down

These tasks have observation spaces of ~44 dimensions and action spaces of ~17 dimensions (controlling individual joints of a humanoid body). PPO successfully learns all three tasks, demonstrating natural-looking locomotion, turning behavior, and recovery from being knocked over. The learning curves show steady improvement over ~100M timesteps, with the Flagrun tasks reaching rewards of ~2500 and ~3000 respectively.

This was particularly impressive at the time of publication because humanoid locomotion was considered a very challenging benchmark. The robot learns to coordinate many degrees of freedom to produce stable, adaptive walking behavior.

4.3 Atari Game Playing

PPO was tested on all 49 Atari games in OpenAI Gym, compared against A2C and ACER (Actor-Critic with Experience Replay).

Scoring metrics:

  1. Average reward per episode over all of training (favors fast learning)
  2. Average reward per episode over last 100 episodes (favors final performance)

Results (games "won" by each algorithm):

Metric | A2C | ACER | PPO | Tie
Avg. over all of training | 1 | 18 | 30 | 0
Avg. over last 100 episodes | 1 | 28 | 19 | 1

Analysis:

  • PPO dominates in fast learning (winning 30/49 games) — it reaches good performance quickly
  • ACER has an edge in final performance (28/49) — with experience replay, it eventually converges to better policies on some games
  • A2C is almost always the worst performer
  • PPO achieves these results while being significantly simpler than ACER (no experience replay buffer, no bias correction, no trust region computation)

Notable game-by-game results from Table 6:

  • Atlantis: PPO achieves 2,311,815 vs. A2C's 729,265 and ACER's 1,841,376
  • BattleZone: PPO scores 17,367 vs. ACER's 8,983 and A2C's 3,080
  • Kangaroo: PPO scores 9,929 vs. A2C's 45 and ACER's 50 — a 200x improvement
  • Zaxxon: PPO scores 5,009 vs. A2C's 16 and ACER's 29 — dramatic improvement

5. PPO's Role in RLHF and LLM Alignment

While the original PPO paper focused on robotics and game environments, PPO's most consequential application has been in Reinforcement Learning from Human Feedback (RLHF) — the technique used to align large language models with human preferences.

5.1 The RLHF Pipeline

The standard RLHF pipeline (as used in InstructGPT, ChatGPT, Claude, etc.) consists of three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune a pretrained LLM on high-quality demonstration data
  2. Reward Model Training: Train a reward model on human preference comparisons (which of two responses is better?)
  3. PPO Fine-Tuning: Use PPO to optimize the SFT model to maximize the reward model's scores

In Stage 3, the "environment" is the text generation process:

  • State s_t: The prompt + tokens generated so far
  • Action a_t: The next token to generate
  • Reward r_t: Given by the reward model (typically only at the end of generation)
  • Policy π_θ: The language model being fine-tuned

PPO's clipped objective is critical here because uncontrolled optimization against the reward model can lead to reward hacking — the model finds degenerate outputs that score highly according to the reward model but are actually low quality (e.g., repetitive text, exploitation of reward model biases).

5.2 Why PPO for RLHF?

Several properties of PPO make it particularly well-suited for LLM alignment:

  1. Stability: The clipped objective prevents the model from changing too drastically in any single update, which is crucial when fine-tuning a model that already has valuable capabilities
  2. Simplicity: RLHF implementations are already complex; PPO's simplicity compared to alternatives like TRPO reduces the implementation burden
  3. Sample efficiency: Multiple epochs of minibatch updates allow better use of expensive-to-generate trajectories
  4. KL regularization compatibility: In RLHF, an additional KL penalty against the SFT model is typically added to prevent the model from straying too far from its initial behavior:

\text{reward}(x, y) = r_\phi(x, y) - \beta\, \text{KL}[\pi_\theta \| \pi_\text{SFT}]

PPO's clipping combined with this KL penalty provides a double safeguard against reward hacking.
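In many RLHF implementations the KL term is applied per token, with the reward-model score added on the final token. The sketch below is a hypothetical simplification (function name and log-probabilities invented, using the common sample-based KL estimate log π_θ − log π_SFT), not a specific library's API:

```python
import numpy as np

def rlhf_rewards(rm_score, logp_policy, logp_sft, beta=0.1):
    """Per-token PPO rewards for one generated response.

    Every token is penalized by beta * (log pi_theta - log pi_SFT);
    the reward-model score is added on the final token only.
    """
    kl_per_token = logp_policy - logp_sft   # sample-based KL estimate
    rewards = -beta * kl_per_token
    rewards[-1] += rm_score
    return rewards

# Hypothetical 4-token response
logp_policy = np.array([-1.0, -0.5, -2.0, -0.8])
logp_sft = np.array([-1.2, -0.6, -1.5, -0.8])
r = rlhf_rewards(rm_score=2.0, logp_policy=logp_policy, logp_sft=logp_sft)
```

Tokens where the policy has drifted above the SFT model's probability are penalized; tokens where it has drifted below receive a small positive correction.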

5.3 Alternatives and Successors

PPO's dominance in RLHF has spurred research into alternatives:

  • DPO (Direct Preference Optimization): Eliminates the reward model entirely by reframing the problem as supervised learning on preferences
  • GRPO (Group Relative Policy Optimization): Used by DeepSeek; replaces the critic model with group-relative advantages
  • REINFORCE variants: Some recent work (e.g., RLOO) shows that simpler policy gradient methods with proper baselines can match PPO in RLHF settings
  • KTO (Kahneman-Tversky Optimization): Uses prospect theory-inspired objectives instead of paired comparisons

Despite these alternatives, PPO remains the most widely validated approach for RLHF and is used in production by multiple major AI labs.


6. Limitations and Boundary Conditions

6.1 Sensitivity to Hyperparameters

While PPO is more robust than TRPO, it is not hyperparameter-free:

  • The clip parameter ε has a meaningful effect (0.1 vs. 0.2 vs. 0.3 give noticeably different performance)
  • Learning rate, number of epochs, and batch size all matter
  • The GAE parameters (γ and λ) require tuning for new domains

In the RLHF setting specifically, hyperparameter sensitivity is amplified because the reward signal is noisy (coming from an imperfect reward model).

6.2 Sample Efficiency Limitations

PPO is an on-policy algorithm — it only uses data collected by the current policy. While the clipping mechanism allows multiple gradient steps per batch, PPO is still significantly less sample-efficient than off-policy methods like SAC (Soft Actor-Critic) that maintain experience replay buffers. This means:

  • More environment interactions are needed compared to off-policy methods
  • In settings where data collection is expensive (e.g., real robots, RLHF with human feedback), this is a significant cost

6.3 Credit Assignment Over Long Horizons

PPO uses GAE for advantage estimation, which relies on the value function to propagate credit across time. When episodes are very long or rewards are extremely sparse, the value function may fail to provide useful gradients, leading to slow or failed learning. This is particularly relevant in:

  • Long-horizon planning tasks
  • Tasks where the reward is only given at the very end (e.g., game won/lost)
  • LLM alignment where the reward is given after generating the entire response

6.4 Reward Hacking in RLHF

In the RLHF application specifically, PPO can still exploit imperfections in the reward model:

  • The model may learn to generate outputs that game the reward model without actually being helpful
  • Additional techniques like KL penalties, reward model ensembles, and process supervision are needed to mitigate this
  • The clip parameter provides some protection but is not a complete solution

6.5 Scalability Concerns

The original PPO paper uses relatively small networks (2-layer MLPs with 64 units). When scaling to very large models (e.g., 175B parameter LLMs), new challenges arise:

  • Memory for storing both old and new policy parameters
  • Computational cost of computing probability ratios for all tokens
  • Distributed training coordination across many GPUs
  • Gradient computation through the entire model for each PPO step
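The memory point can be made concrete with back-of-envelope arithmetic (illustrative numbers only, assuming bf16 weights):

```python
# Rough memory cost of one full weight copy for a 175B-parameter model.
# These are illustrative figures, not measurements from any specific system.
params = 175e9
bytes_per_param = 2  # bf16 storage

copy_gb = params * bytes_per_param / 1e9  # GB per weight copy

# An RLHF PPO step typically keeps several such sets resident: the policy
# being trained, the value model, and a frozen reference policy for the KL
# penalty -- before counting activations or optimizer state (Adam adds two
# fp32 moment tensors per trainable parameter).
```

At 350 GB per copy, even the weights alone exceed any single accelerator, which is why RLHF at this scale forces model-parallel and offloading machinery onto what is otherwise a simple first-order algorithm.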

7. Reproducibility and Practical Considerations

7.1 Implementation Availability

PPO is implemented in numerous open-source libraries:

  • Stable Baselines3 (Python): The most popular RL library, with well-tested PPO implementation
  • RLlib (Ray): Scalable distributed RL framework
  • CleanRL: Single-file implementations for education
  • TRL (Hugging Face): PPO implementation specifically for LLM fine-tuning via RLHF
  • DeepSpeed-Chat: Microsoft's scalable RLHF training framework using PPO

7.2 Common Implementation Pitfalls

Based on extensive community experience, common mistakes when implementing PPO include:

  1. Incorrect advantage normalization: Per-minibatch normalization (subtracting the mean, dividing by the standard deviation) is critical for stable training but is not mentioned in the paper
  2. Incorrect value target computation: The GAE computation must be done correctly at the trajectory level
  3. Missing observation normalization: Running mean/std normalization of observations significantly improves performance
  4. Incorrect gradient accumulation: When using multiple epochs, the old log-probabilities must be detached/stopped from the computation graph
  5. Missing reward scaling: Normalizing rewards to have roughly unit variance can significantly improve stability
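The first pitfall above is mechanically trivial but easy to omit. A minimal sketch (function name mine):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Per-minibatch advantage normalization: zero mean, unit variance.

    Without this step the scale of the surrogate loss varies wildly from
    batch to batch, which interacts badly with a fixed learning rate and
    a fixed clip range. The old log-probabilities used in the ratio must
    likewise be treated as constants (detached from the autograd graph).
    """
    return (adv - adv.mean()) / (adv.std() + eps)
```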

The ICLR 2022 paper "The 37 Implementation Details of Proximal Policy Optimization" by Huang et al. catalogues these details exhaustively.

7.3 Hyperparameter Recommendations

Based on the paper and community experience:

| Setting     | Clip ε  | Epochs K | Horizon T     | Learning rate | γ    | λ    |
|-------------|---------|----------|---------------|---------------|------|------|
| MuJoCo      | 0.2     | 10       | 2048          | 3e-4          | 0.99 | 0.95 |
| Atari       | 0.1 → 0 | 3        | 128           | 2.5e-4 → 0    | 0.99 | 0.95 |
| RLHF (LLM)  | 0.2     | 1–4      | full response | 1e-5 to 5e-6  | 1.0  | 0.95 |

Arrows indicate values linearly annealed to zero over the course of training.
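As a concrete reference point, the MuJoCo settings above map onto Stable Baselines3's PPO constructor roughly as follows (argument names follow SB3's API as I understand it; verify against the version you use):

```python
# MuJoCo-style PPO hyperparameters expressed as Stable Baselines3 keyword
# arguments. Names are SB3 conventions; the values are from the table above.
mujoco_ppo_kwargs = {
    "clip_range": 0.2,      # epsilon, the clip parameter
    "n_epochs": 10,         # K, optimization epochs per batch
    "n_steps": 2048,        # T, rollout horizon per environment
    "learning_rate": 3e-4,  # Adam step size
    "gamma": 0.99,          # discount factor
    "gae_lambda": 0.95,     # GAE lambda
}

# Typical usage (requires stable-baselines3 and an environment):
#   from stable_baselines3 import PPO
#   model = PPO("MlpPolicy", env, **mujoco_ppo_kwargs)
```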

8. Historical Context and Impact

8.1 Timeline

  • 2015: TRPO published — first stable policy gradient method via trust regions
  • 2015: GAE published — variance reduction for advantage estimation
  • 2016: A3C/A2C published — parallel data collection for policy gradients
  • 2017: PPO published — simplified TRPO with clipped objective
  • 2018: PPO used to train OpenAI Five (the Dota 2 system that went on to defeat the reigning world champion team in 2019)
  • 2022: InstructGPT uses PPO for RLHF — aligning GPT-3 with human preferences
  • 2022-present: PPO becomes the default RLHF algorithm for ChatGPT, Claude, Gemini, Llama, etc.

8.2 Citation Impact

PPO is one of the most cited papers in deep reinforcement learning, with thousands of citations. More importantly, it has become a practical standard — when people say "we trained with RL," they usually mean PPO.

8.3 The Simplicity Principle

PPO embodies an important engineering principle: the best algorithm is often not the most theoretically elegant one, but the one that's simplest to implement, debug, and scale. TRPO has stronger theoretical guarantees (monotonic improvement under certain conditions), but PPO's practical advantages — first-order optimization, compatibility with shared architectures, ease of distributed implementation — made it the winner in practice.


9. Summary

PPO represents a masterclass in practical algorithm design. By replacing TRPO's constrained optimization with a simple clipped surrogate objective, Schulman et al. created an algorithm that is:

  1. Simple: Requires only a few lines of code change to vanilla policy gradients
  2. Stable: The clipping mechanism delivers TRPO-like stability in practice, without TRPO's second-order optimization (though without its formal monotonic-improvement guarantee)
  3. General: Works with shared policy-value architectures, with dropout, with auxiliary tasks
  4. Scalable: First-order optimization scales naturally to large models and parallel implementations
  5. Effective: Matches or exceeds TRPO performance across diverse benchmarks

The algorithm's most profound impact has been as the optimization engine for RLHF, enabling the alignment of large language models with human preferences. Every time you interact with ChatGPT, Claude, or Gemini, you're benefiting from a model whose behavior was shaped by PPO's clipped objective function.

For practitioners, PPO remains the first algorithm to reach for when solving an RL problem. Its combination of simplicity, robustness, and proven effectiveness across domains — from simulated robots to billion-parameter language models — makes it one of the most important algorithms of the deep learning era.


References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  2. Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML.
  3. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
  4. Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.
  5. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS.
  6. Huang, S., et al. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track.
  7. Kakade, S. & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. ICML.
  8. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS.