Abstract
Reinforcement learning (RL) has become a cornerstone of modern AI — from training robots to walk, to aligning large language models with human preferences via RLHF. At the heart of many of these breakthroughs lies a deceptively simple algorithm called Proximal Policy Optimization (PPO), introduced by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI in 2017.
PPO proposes a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. Unlike standard policy gradient methods that perform one gradient update per data sample, PPO introduces a novel clipped objective function that enables multiple epochs of minibatch updates without catastrophically large policy changes. The result is an algorithm that inherits the stability of Trust Region Policy Optimization (TRPO) while being dramatically simpler to implement — requiring only a few lines of code change to a vanilla policy gradient implementation.
The paper's experiments demonstrate PPO's superiority across a wide range of benchmarks: continuous control tasks (MuJoCo), complex 3D humanoid locomotion (Roboschool), and Atari game playing. PPO outperforms other online policy gradient methods and strikes an excellent balance between sample complexity, simplicity, and wall-time performance.
Why this paper matters today: PPO is arguably the most influential RL algorithm of the deep learning era. It became the default algorithm for training policies in robotics, game AI, and — most consequentially — the RLHF pipeline that aligns large language models like ChatGPT, Claude, and Gemini. Understanding PPO is prerequisite knowledge for anyone working in AI alignment, LLM training, or modern RL systems.
1. Prerequisites: What You Need to Know Before Reading This Paper
1.1 Reinforcement Learning Basics: Agents, States, Actions, and Rewards
Before we dive into PPO, let's establish the fundamental framework of reinforcement learning (RL). If you've never encountered RL before, think of it as teaching a computer to make decisions by trial and error, much like training a dog — good behaviors get rewarded, bad behaviors get penalized, and over time the learner figures out what to do.
The RL Setup. An RL problem consists of an agent (the learner/decision-maker) interacting with an environment (the world). At each time step $t$:
- The agent observes the current state $s_t$ (e.g., the positions of joints in a robot, or pixels on a screen)
- The agent selects an action $a_t$ based on its policy (the decision-making strategy)
- The environment transitions to a new state $s_{t+1}$ and provides a scalar reward $r_t$
- The goal is to maximize the cumulative (discounted) reward:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

Here, $\gamma \in [0, 1]$ is the discount factor — it determines how much the agent cares about future rewards versus immediate ones. A $\gamma$ close to 1 means the agent is farsighted; close to 0 means it's myopic.
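To make the discounted return concrete, here is a minimal Python sketch (the helper name `discounted_return` is ours, not the paper's) that evaluates the sum above with a single backward pass:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    Computed backwards in one pass, so each reward is multiplied by gamma
    once per step of distance from the start."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With `gamma=0.5`, three rewards of 1.0 give `1 + 0.5 + 0.25 = 1.75`, illustrating how the discount shrinks the weight of later rewards.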
Policy $\pi_\theta(a \mid s)$. The policy is the agent's strategy, mapping states to actions. In deep RL, the policy is typically a neural network parameterized by $\theta$:
- For discrete action spaces (e.g., Atari games with joystick directions): outputs a probability distribution over actions
- For continuous action spaces (e.g., robot joint torques): outputs the mean (and sometimes variance) of a Gaussian distribution from which actions are sampled
Value Function $V^\pi(s)$. The value function estimates how good it is to be in a particular state — specifically, the expected cumulative reward from state $s$ onward if we follow policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s\right]$$
Action-Value Function $Q^\pi(s, a)$. Similarly, the Q-function estimates the expected return from taking action $a$ in state $s$ and then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s, a_t = a\right]$$
Advantage Function $A^\pi(s, a)$. The advantage function measures how much better action $a$ is compared to the average action in state $s$:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
A positive advantage means the action is better than average; negative means worse. The advantage function is central to PPO because it tells us which actions to encourage and which to discourage.
1.2 Policy Gradient Methods: Learning by Gradient Ascent
Policy gradient methods are a family of RL algorithms that directly optimize the policy parameters $\theta$ by computing the gradient of the expected reward with respect to $\theta$ and performing gradient ascent.
The Policy Gradient Theorem. The fundamental result (due to Sutton et al., 1999) says that the gradient of the expected return $J(\theta)$ with respect to the policy parameters can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$$
Intuition: This gradient says "increase the probability of actions that have positive advantage (better than average), and decrease the probability of actions with negative advantage." The magnitude of the update is proportional to how much better or worse the action was.
The Vanilla Policy Gradient (REINFORCE). In practice, we estimate this gradient from sampled trajectories:

$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

where $\hat{A}_t$ is an estimator of the advantage function (e.g., using rewards-to-go minus a baseline). This estimator is unbiased but has very high variance, leading to noisy, unstable training. The corresponding loss function whose gradient gives us $\hat{g}$ is:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
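The vanilla policy-gradient loss can be sketched in numpy for a batch of discrete actions (function name ours; a real implementation would use a differentiable framework so gradients flow through the log-probabilities):

```python
import numpy as np

def vanilla_pg_loss(logits, actions, advantages):
    """L^PG as a loss: -mean( log pi(a_t|s_t) * A_t ) over a batch.
    Minimizing this loss is gradient ascent on the expected return."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)             # softmax over actions
    log_probs = np.log(probs[np.arange(len(actions)), actions])
    return -float(np.mean(log_probs * advantages))
```

For a uniform two-action policy with advantage 1, the loss is $\log 2$: the gradient pushes up the probability of every sampled action in proportion to its advantage.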
The Problem with Vanilla Policy Gradients. While mathematically elegant, vanilla policy gradients have a critical practical limitation: you can only use each batch of sampled data for one gradient step. If you try to take multiple gradient steps on the same batch, the updates compound and the policy can change so drastically that performance collapses. This is vastly data-inefficient — you collect an entire trajectory, use it once, then throw it away.
1.3 Trust Region Policy Optimization (TRPO): The Predecessor
TRPO (Schulman et al., 2015) addresses the instability problem by introducing a constraint on how much the policy can change in each update. The idea is beautifully intuitive: if we only allow small policy changes, we can safely reuse data for multiple gradient steps without worrying about catastrophic performance drops.
The TRPO Objective. TRPO maximizes a "surrogate" objective while constraining the KL divergence between the old and new policies:

$$\max_\theta \ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$
The Probability Ratio. The ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is called the importance sampling ratio. When $\theta = \theta_{\text{old}}$, this ratio equals 1. As the policy changes, the ratio moves away from 1. The ratio allows us to evaluate how the new policy performs using data collected by the old policy — this is what makes data reuse possible.
Why TRPO is Problematic. While TRPO works well, it has significant practical drawbacks:
- Implementation complexity: TRPO requires computing second-order derivatives (the Fisher information matrix) and solving a constrained optimization problem using the conjugate gradient algorithm. This is non-trivial to implement correctly.
- Incompatible with shared architectures: TRPO doesn't easily work with architectures that share parameters between the policy and value function networks, or with auxiliary tasks — which are common in practice.
- Incompatible with noise: It doesn't work well with dropout or other forms of stochasticity in the network.
PPO was designed to get TRPO's benefits without these drawbacks.
1.4 KL Divergence: Measuring Policy Distance
KL divergence (Kullback-Leibler divergence) measures how different two probability distributions are. For policies $\pi_{\theta_{\text{old}}}$ and $\pi_\theta$ over a discrete action space:

$$\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\right) = \sum_a \pi_{\theta_{\text{old}}}(a \mid s) \log \frac{\pi_{\theta_{\text{old}}}(a \mid s)}{\pi_\theta(a \mid s)}$$
Key properties:
- $\mathrm{KL}(p \,\|\, q) \ge 0$ always, and $\mathrm{KL}(p \,\|\, q) = 0$ if and only if the distributions are identical
- It is not symmetric: $\mathrm{KL}(p \,\|\, q) \ne \mathrm{KL}(q \,\|\, p)$ in general
- Small KL divergence means the policies make similar decisions
In TRPO, the KL constraint ensures the policy doesn't change too much in any single update, preventing catastrophic performance drops.
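For discrete distributions the definition above is a one-line sum; here is a small numpy sketch (helper name ours) that also demonstrates the non-negativity and asymmetry properties:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    Terms where p_i = 0 contribute zero by convention."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

Evaluating `kl_divergence([0.9, 0.1], [0.5, 0.5])` and the reverse order gives two different positive numbers, while identical distributions give exactly zero.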
1.5 Importance Sampling: Reusing Data from Old Policies
Importance sampling is a technique for estimating expectations under one distribution using samples from a different distribution. If we have samples from $q$ but want to evaluate an expectation under $p$, we use:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)} f(x)\right]$$
This is what allows PPO to reuse trajectories collected by the old policy to update the new policy — the probability ratio $r_t(\theta)$ is precisely this importance sampling correction. However, importance sampling becomes unreliable when the distributions are very different (high variance), which is why PPO needs mechanisms to keep the new policy close to the old one.
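A toy numerical check makes the reweighting concrete. In this sketch (names and the two-action setup are ours), samples drawn under a uniform "old" policy are reweighted to estimate an expectation under a skewed "new" policy:

```python
import numpy as np

def importance_sampling_estimate(samples, f, p_new, p_old):
    """Estimate E_{x ~ p_new}[f(x)] from samples drawn under p_old by
    reweighting each sample with the ratio p_new(x) / p_old(x)."""
    x = np.asarray(samples)
    weights = p_new[x] / p_old[x]          # the importance ratio
    return float(np.mean(weights * f(x)))

# Two actions: old policy uniform, new policy puts 0.2 mass on action 1.
p_old = np.array([0.5, 0.5])
p_new = np.array([0.8, 0.2])
rng = np.random.default_rng(0)
samples = rng.choice(2, size=100_000, p=p_old)
est = importance_sampling_estimate(samples, lambda x: x.astype(float), p_new, p_old)
```

The true expectation of $x$ under the new policy is $0.2$, and the reweighted estimate from old-policy samples converges to it.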
1.6 Generalized Advantage Estimation (GAE)
GAE (Schulman et al., 2015b) is a technique for computing advantage estimates that provides a smooth tradeoff between bias and variance. It defines:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the temporal difference (TD) error, $\gamma$ is the discount factor, and $\lambda \in [0, 1]$ is a new hyperparameter controlling the bias-variance tradeoff:
- $\lambda = 0$: Uses only the one-step TD error — low variance but high bias
- $\lambda = 1$: Reduces to the full Monte Carlo advantage estimate — unbiased but high variance
- Intermediate $\lambda$ (typically 0.95): Balances both
PPO uses GAE with $\lambda = 0.95$ in all experiments, which has become the standard practice.
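The recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$ makes GAE a single backward pass. A minimal numpy sketch (function name ours, assuming a bootstrap value for the state after the last step):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation, computed backwards:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + (gamma * lam) * A_{t+1}"""
    values = np.append(np.asarray(values, dtype=float), last_value)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0` recovers the one-step TD errors, while `lam=1` (with `gamma=1` and zero values) recovers the plain rewards-to-go, matching the two extremes described above.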
2. The PPO Algorithm: Core Contributions
2.1 Motivation: First-Order TRPO
The central question driving PPO is: Can we get TRPO's stability benefits using only first-order optimization (regular gradient descent), without the complexity of constrained optimization?
The authors explore two approaches:
- Clipped surrogate objective (PPO-Clip): Modify the objective function to naturally penalize large policy changes
- Adaptive KL penalty (PPO-Penalty): Use a penalty term instead of a hard constraint, with an adaptive coefficient
The clipped version turns out to work best and has become the standard "PPO" algorithm.
2.2 The Clipped Surrogate Objective (PPO-Clip)
This is the key innovation of the paper. Recall the conservative policy iteration (CPI) surrogate objective that TRPO maximizes:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t\right] = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\, \hat{A}_t\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio.
Without any constraint, maximizing $L^{CPI}$ can lead to destructively large policy updates — the ratio $r_t(\theta)$ can become very large, causing the policy to jump to a completely different behavior.
The PPO Solution: Clip the ratio. PPO proposes:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t\right)\right]$$

where $\epsilon$ is a hyperparameter (typically $\epsilon = 0.2$).
Breaking this down piece by piece:
- $r_t(\theta)\, \hat{A}_t$: This is the standard surrogate objective — multiply the probability ratio by the advantage.
- $\mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t$: This clips the probability ratio to the interval $[1-\epsilon,\, 1+\epsilon]$. This means the clipped version ignores any policy change that moves the ratio outside this interval.
- $\min(\cdot,\, \cdot)$: Take the minimum of the clipped and unclipped objectives. This creates a pessimistic (lower-bound) estimate of the true objective.
Why does this work? Two cases to consider:
Case 1: Positive advantage ($\hat{A}_t > 0$). The action was better than average. We want to increase its probability (increase $r_t$). The unclipped objective increases as $r_t$ increases. But the clipped version caps the benefit at $r_t = 1+\epsilon$. The $\min$ operation means: "You can increase the probability of this good action, but only up to $1+\epsilon$ times its old probability — beyond that, you get no additional objective value."
Case 2: Negative advantage ($\hat{A}_t < 0$). The action was worse than average. We want to decrease its probability (decrease $r_t$). The unclipped objective gets "better" (less negative) as $r_t$ decreases. But the clipped version caps the benefit at $r_t = 1-\epsilon$. The $\min$ operation means: "You can decrease the probability of this bad action, but only down to $1-\epsilon$ times its old probability."
The key insight: The clipping acts as a one-sided constraint. We only clip changes that would improve the objective — we never clip changes that make the objective worse. This means the clipped objective is always a lower bound on the unclipped objective, creating a "pessimistic" estimate that discourages overly aggressive policy updates. It is precisely this asymmetry that gives PPO its stability.
Near the starting point: At $\theta = \theta_{\text{old}}$ (where $r_t = 1$), the clipped and unclipped objectives are identical. They only diverge as $\theta$ moves away from $\theta_{\text{old}}$. To first order around $\theta_{\text{old}}$, $L^{CLIP}(\theta) = L^{CPI}(\theta)$ — the clipping has zero effect on small updates.
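The clipped objective is only a few lines of code, which is the paper's point. Here is a numpy sketch (function name ours; a real implementation would compute this in a differentiable framework so the ratio carries gradients), returned negated so that minimizing the loss maximizes $L^{CLIP}$:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """-L^CLIP = -E[ min( r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t ) ],
    where r_t = exp(log pi_new - log pi_old)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -float(np.mean(np.minimum(unclipped, clipped)))
```

The two cases above show up directly: doubling the probability of a good action ($r_t = 2$, $\hat{A}_t = 1$) yields an objective capped at $1+\epsilon = 1.2$ rather than 2, while at the old policy ($r_t = 1$) clipping has no effect.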
2.3 Adaptive KL Penalty (PPO-Penalty)
The paper also proposes an alternative: use a KL penalty instead of a hard constraint (as in TRPO), but adapt the penalty coefficient automatically. In each policy update:
- Optimize the KL-penalized objective for several epochs of minibatch SGD:

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\, \hat{A}_t - \beta\, \mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right]$$

- Compute the actual KL divergence: $d = \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right]$
- Adapt $\beta$:
  - If $d < d_{\text{targ}} / 1.5$: the update was too conservative, so halve the penalty → $\beta \leftarrow \beta / 2$
  - If $d > d_{\text{targ}} \times 1.5$: the update was too aggressive, so double the penalty → $\beta \leftarrow \beta \times 2$
This adaptive scheme automatically adjusts the penalty strength based on whether the actual policy change was too large or too small relative to the target. The heuristic multipliers (1.5 and 2) are not carefully tuned — the algorithm is robust to their exact values.
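The adaptation rule itself is a tiny piece of control logic; a sketch (function name ours) following the thresholds above:

```python
def update_kl_coef(beta, kl, kl_target):
    """Adaptive-KL heuristic: halve beta when the measured KL is well below
    the target (update too conservative), double it when well above
    (update too aggressive). Thresholds are kl_target / 1.5 and * 1.5."""
    if kl < kl_target / 1.5:
        beta /= 2.0
    elif kl > kl_target * 1.5:
        beta *= 2.0
    return beta
```

Because the multipliers are coarse, $\beta$ only changes when the measured KL is clearly off target, which is why the scheme is robust to their exact values.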
Experimental verdict: The KL penalty works reasonably well but consistently underperforms the clipped objective across benchmarks. The clipped version is simpler and more effective, which is why "PPO" in practice almost always refers to PPO-Clip.
2.4 The Complete PPO Objective
In practice, PPO uses a neural network architecture that shares parameters between the policy $\pi_\theta$ and the value function $V_\theta(s)$. The combined objective function is:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$$

where:
- $L_t^{CLIP}$ is the clipped surrogate policy loss
- $L_t^{VF} = (V_\theta(s_t) - V_t^{\text{targ}})^2$ is the value function loss (squared error between predicted and target values)
- $S[\pi_\theta](s_t)$ is an entropy bonus that encourages exploration by penalizing overly deterministic policies
- $c_1$ is the value function coefficient (typically 1.0)
- $c_2$ is the entropy coefficient (typically 0.01 for Atari)
Why an entropy bonus? Without it, the policy can collapse to always choosing the same action, losing the ability to explore and potentially missing better strategies. The entropy term keeps the action distribution "spread out," ensuring the agent continues to try different things.
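Assembling the three terms is mechanical; a numpy sketch of the combined loss (names ours, with the clipped objective passed in as a precomputed scalar and the whole expression negated for gradient descent):

```python
import numpy as np

def ppo_total_loss(clip_objective, value_pred, value_target, entropy,
                   c1=1.0, c2=0.01):
    """-(L^CLIP - c1 * L^VF + c2 * S): the combined PPO objective, negated
    so that a standard optimizer minimizing the loss maximizes it."""
    value_loss = float(np.mean(
        (np.asarray(value_pred) - np.asarray(value_target)) ** 2))
    return -(clip_objective - c1 * value_loss + c2 * entropy)
```

For example, a clipped objective of 1.0 exactly offset by a value error of 1.0 (with zero entropy) gives a total loss of 0, showing how the value term trades off against the policy term under shared parameters.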
2.5 The Full Algorithm: Actor-Critic with Fixed-Length Trajectory Segments
The practical PPO algorithm combines all the above components:
Algorithm: PPO, Actor-Critic Style
    for iteration = 1, 2, ... do
        for actor = 1, 2, ..., N do
            Run policy π_θ_old in environment for T timesteps
            Compute advantage estimates Â_1, ..., Â_T
        end for
        Optimize surrogate L with respect to θ, with K epochs and minibatch size M ≤ NT
        θ_old ← θ
    end for
Key design decisions:
- Parallel actors: $N$ actors collect data simultaneously from copies of the environment. This provides $NT$ transitions per iteration, enabling efficient use of parallel hardware.
- Fixed-length segments: Each actor collects exactly $T$ timesteps, regardless of episode boundaries. This simplifies the implementation and ensures consistent batch sizes.
- Multiple epochs ($K$): The same batch of data is used for $K$ optimization epochs. This is the key efficiency gain over vanilla policy gradients (which use data once). The clipping mechanism prevents the multiple updates from causing harmful policy drift.
- Minibatch SGD: Within each epoch, the $NT$ transitions are shuffled and divided into minibatches of size $M \le NT$ for stochastic gradient updates.
- GAE for advantage estimation: Using generalized advantage estimation with the truncated form:

$$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
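The epochs-of-shuffled-minibatches structure can be sketched as a small index generator (helper name ours), independent of any particular network or environment:

```python
import numpy as np

def minibatch_indices(num_transitions, minibatch_size, num_epochs, seed=0):
    """Yield index arrays for K epochs of shuffled minibatches over the
    N*T transitions collected in one PPO iteration. Each epoch reshuffles
    and then visits every transition exactly once."""
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        perm = rng.permutation(num_transitions)
        for start in range(0, num_transitions, minibatch_size):
            yield perm[start:start + minibatch_size]
```

A training loop would compute the clipped loss on each yielded index set and take one optimizer step, giving `K * (NT / M)` gradient updates per batch of collected data instead of the single update of vanilla policy gradients.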
3. Detailed Method Analysis
3.1 Why Clipping Works: A Geometric Perspective
To truly understand why the clipped objective is effective, it helps to think about the optimization landscape geometrically.
Consider the objective as a function of the policy parameters $\theta$. Starting from $\theta_{\text{old}}$:
- The unclipped objective $L^{CPI}$ is a smooth function that can increase without bound as $\theta$ moves away from $\theta_{\text{old}}$. This is dangerous because the surrogate objective becomes a poor approximation of the true objective for large policy changes.
- The clipped objective $L^{CLIP}$ is identical to $L^{CPI}$ near $\theta_{\text{old}}$ but "flattens out" once the probability ratio moves outside $[1-\epsilon,\, 1+\epsilon]$. This creates a "trust region" effect without explicitly computing KL divergence.
The paper's Figure 2 beautifully illustrates this: as we interpolate along the update direction, $L^{CLIP}$ is a lower bound on $L^{CPI}$ and peaks at a KL divergence of about 0.02, then decreases. This self-regulating behavior — the objective naturally discouraging further optimization beyond a certain point — is what makes PPO stable without explicit constraints.
3.2 Comparison of Surrogate Objectives
The paper systematically compares several variants of the surrogate objective:
| Variant | Description | Avg. Normalized Score |
|---|---|---|
| No clipping or penalty | Unconstrained $L^{CPI}$ | -0.39 |
| Clipping, $\epsilon = 0.1$ | Too conservative | 0.76 |
| Clipping, $\epsilon = 0.2$ | Sweet spot | 0.82 |
| Clipping, $\epsilon = 0.3$ | Slightly too aggressive | 0.70 |
| Adaptive KL, $d_{\text{targ}} = 0.003$ | Conservative | 0.68 |
| Adaptive KL, $d_{\text{targ}} = 0.01$ | Moderate | 0.74 |
| Adaptive KL, $d_{\text{targ}} = 0.03$ | Aggressive | 0.71 |
| Fixed KL, $\beta = 0.3$ | Weak penalty | 0.62 |
| Fixed KL, $\beta = 1.0$ | Moderate penalty | 0.71 |
| Fixed KL, $\beta = 3.0$ | Strong penalty | 0.72 |
| Fixed KL, $\beta = 10.0$ | Very strong penalty | 0.69 |
Key observations:
- No constraint at all is catastrophic (negative score due to training collapse on HalfCheetah)
- Clipping with $\epsilon = 0.2$ is best — it outperforms all KL penalty variants
- The KL penalty approaches (both fixed and adaptive) are competitive but less robust across hyperparameter choices
- Clipping has only one hyperparameter ($\epsilon$) while KL penalty has two ($d_{\text{targ}}$ or $\beta$, plus the initial $\beta$)
3.3 Architecture and Implementation Details
Network Architecture for Continuous Control (MuJoCo):
- Fully-connected MLP with two hidden layers of 64 units each
- Tanh nonlinearities
- Output: mean of a Gaussian distribution with learned (variable) standard deviation
- Policy and value function do NOT share parameters (separate networks)
- No entropy bonus used
Hyperparameters for Continuous Control:
| Hyperparameter | Value |
|---|---|
| Horizon ($T$) | 2048 |
| Adam stepsize | $3 \times 10^{-4}$ |
| Num. epochs | 10 |
| Minibatch size | 64 |
| Discount ($\gamma$) | 0.99 |
| GAE parameter ($\lambda$) | 0.95 |
| Clip parameter ($\epsilon$) | 0.2 |
Network Architecture for Atari:
- Convolutional neural network (same as Mnih et al., 2016)
- Policy and value function share the convolutional backbone
- Linear annealing of learning rate and clip parameter
Hyperparameters for Atari:
| Hyperparameter | Value |
|---|---|
| Horizon ($T$) | 128 |
| Adam stepsize | $2.5 \times 10^{-4} \times \alpha$ |
| Num. epochs | 3 |
| Minibatch size | $32 \times 8$ |
| Discount ($\gamma$) | 0.99 |
| GAE parameter ($\lambda$) | 0.95 |
| Number of actors | 8 |
| Clip parameter ($\epsilon$) | $0.1 \times \alpha$ |
| VF coefficient ($c_1$) | 1 |
| Entropy coefficient ($c_2$) | 0.01 |
where $\alpha$ is linearly annealed from 1 to 0 over the course of training. This annealing schedule means both the learning rate and the clipping range shrink as training progresses, allowing finer adjustments in later stages.
4. Experimental Results and Analysis
4.1 Continuous Control: MuJoCo Benchmarks
PPO was tested on 7 simulated robotics tasks implemented in OpenAI Gym using the MuJoCo physics engine: HalfCheetah, Hopper, InvertedDoublePendulum, InvertedPendulum, Reacher, Swimmer, and Walker2d.
PPO vs. Other Algorithms:
PPO (with clipping parameter $\epsilon = 0.2$) was compared against:
- TRPO (Trust Region Policy Optimization)
- CEM (Cross-Entropy Method)
- Vanilla PG with adaptive stepsize
- A2C (Advantage Actor-Critic, synchronous A3C)
- A2C + Trust Region
Results: PPO outperforms all baseline algorithms on almost all environments. Specifically:
- On HalfCheetah, PPO achieves ~2000 reward vs. ~1500 for TRPO and ~500 for A2C
- On Hopper, PPO reaches ~2500 while most baselines plateau around ~1500-2000
- On Walker2d, PPO achieves ~3000 vs. ~1000-2000 for baselines
- Only on Swimmer does PPO show comparable (not superior) performance to some baselines
The significance here is that PPO achieves these results with much simpler code than TRPO — no conjugate gradient solver, no line search, no Fisher information matrix computation.
4.2 3D Humanoid Control: Roboschool
To demonstrate PPO's capability on high-dimensional continuous control, the authors test on three 3D humanoid tasks using Roboschool:
- RoboschoolHumanoid: Forward locomotion only — the robot must learn to walk
- RoboschoolHumanoidFlagrun: The target position changes randomly every 200 timesteps or when reached — the robot must navigate to changing goals
- RoboschoolHumanoidFlagrunHarder: Same as above, but the robot is pelted by cubes and must recover from being knocked down
These tasks have observation spaces of ~44 dimensions and action spaces of ~17 dimensions (controlling individual joints of a humanoid body). PPO successfully learns all three tasks, demonstrating natural-looking locomotion, turning behavior, and recovery from being knocked over. The learning curves show steady improvement over ~100M timesteps, with the Flagrun tasks reaching rewards of ~2500 and ~3000 respectively.
This was particularly impressive at the time of publication because humanoid locomotion was considered a very challenging benchmark. The robot learns to coordinate many degrees of freedom to produce stable, adaptive walking behavior.
4.3 Atari Game Playing
PPO was tested on all 49 Atari games in OpenAI Gym, compared against A2C and ACER (Actor-Critic with Experience Replay).
Scoring metrics:
- Average reward per episode over all of training (favors fast learning)
- Average reward per episode over last 100 episodes (favors final performance)
Results (games "won" by each algorithm):
| Metric | A2C | ACER | PPO | Tie |
|---|---|---|---|---|
| Avg over all training | 1 | 18 | 30 | 0 |
| Avg over last 100 | 1 | 28 | 19 | 1 |
Analysis:
- PPO dominates in fast learning (winning 30/49 games) — it reaches good performance quickly
- ACER has an edge in final performance (28/49) — with experience replay, it eventually converges to better policies on some games
- A2C is almost always the worst performer
- PPO achieves these results while being significantly simpler than ACER (no experience replay buffer, no bias correction, no trust region computation)
Notable game-by-game results from Table 6:
- Atlantis: PPO achieves 2,311,815 vs. A2C's 729,265 and ACER's 1,841,376
- BattleZone: PPO scores 17,367 vs. ACER's 8,983 and A2C's 3,080
- Kangaroo: PPO scores 9,929 vs. A2C's 45 and ACER's 50 — a 200x improvement
- Zaxxon: PPO scores 5,009 vs. A2C's 16 and ACER's 29 — dramatic improvement
5. PPO's Role in RLHF and LLM Alignment
While the original PPO paper focused on robotics and game environments, PPO's most consequential application has been in Reinforcement Learning from Human Feedback (RLHF) — the technique used to align large language models with human preferences.
5.1 The RLHF Pipeline
The standard RLHF pipeline (as used in InstructGPT, ChatGPT, Claude, etc.) consists of three stages:
- Supervised Fine-Tuning (SFT): Fine-tune a pretrained LLM on high-quality demonstration data
- Reward Model Training: Train a reward model on human preference comparisons (which of two responses is better?)
- PPO Fine-Tuning: Use PPO to optimize the SFT model to maximize the reward model's scores
In Stage 3, the "environment" is the text generation process:
- State : The prompt + tokens generated so far
- Action : The next token to generate
- Reward : Given by the reward model (typically only at the end of generation)
- Policy : The language model being fine-tuned
PPO's clipped objective is critical here because uncontrolled optimization against the reward model can lead to reward hacking — the model finds degenerate outputs that score highly according to the reward model but are actually low quality (e.g., repetitive text, exploitation of reward model biases).
5.2 Why PPO for RLHF?
Several properties of PPO make it particularly well-suited for LLM alignment:
- Stability: The clipped objective prevents the model from changing too drastically in any single update, which is crucial when fine-tuning a model that already has valuable capabilities
- Simplicity: RLHF implementations are already complex; PPO's simplicity compared to alternatives like TRPO reduces the implementation burden
- Sample efficiency: Multiple epochs of minibatch updates allow better use of expensive-to-generate trajectories
- KL regularization compatibility: In RLHF, an additional KL penalty against the SFT model is typically added to prevent the model from straying too far from its initial behavior, e.g. a reward of the form:

$$R(x, y) = r_\phi(x, y) - \beta \log \frac{\pi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$
PPO's clipping combined with this KL penalty provides a double safeguard against reward hacking.
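Many RLHF implementations distribute the SFT-KL penalty per token and add the reward model's scalar score only on the final token of the response. A hypothetical sketch of that shaping (names and the per-token convention are assumptions, not details from the PPO paper):

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_sft, beta=0.1):
    """Hypothetical per-token RLHF reward: a KL penalty
    -beta * (log pi_RL(a_t|s_t) - log pi_SFT(a_t|s_t)) at every token,
    with the reward model's score added on the final token only."""
    rewards = -beta * (np.asarray(logp_policy) - np.asarray(logp_sft))
    rewards[-1] += rm_score
    return rewards
```

Tokens where the fine-tuned policy agrees with the SFT model are unpenalized; tokens where it assigns higher probability than the SFT model pay a small negative reward, discouraging drift.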
5.3 Alternatives and Successors
PPO's dominance in RLHF has spurred research into alternatives:
- DPO (Direct Preference Optimization): Eliminates the reward model entirely by reframing the problem as supervised learning on preferences
- GRPO (Group Relative Policy Optimization): Used by DeepSeek; replaces the critic model with group-relative advantages
- REINFORCE variants: Some recent work (e.g., RLOO) shows that simpler policy gradient methods with proper baselines can match PPO in RLHF settings
- KTO (Kahneman-Tversky Optimization): Uses prospect theory-inspired objectives instead of paired comparisons
Despite these alternatives, PPO remains the most widely validated approach for RLHF and is used in production by multiple major AI labs.
6. Limitations and Boundary Conditions
6.1 Sensitivity to Hyperparameters
While PPO is more robust than TRPO, it is not hyperparameter-free:
- The clip parameter $\epsilon$ has a meaningful effect (0.1 vs. 0.2 vs. 0.3 give noticeably different performance)
- Learning rate, number of epochs, and batch size all matter
- The discount and GAE parameters ($\gamma$ and $\lambda$) require tuning for new domains
In the RLHF setting specifically, hyperparameter sensitivity is amplified because the reward signal is noisy (coming from an imperfect reward model).
6.2 Sample Efficiency Limitations
PPO is an on-policy algorithm — it only uses data collected by the current policy. While the clipping mechanism allows multiple gradient steps per batch, PPO is still significantly less sample-efficient than off-policy methods like SAC (Soft Actor-Critic) that maintain experience replay buffers. This means:
- More environment interactions are needed compared to off-policy methods
- In settings where data collection is expensive (e.g., real robots, RLHF with human feedback), this is a significant cost
6.3 Credit Assignment Over Long Horizons
PPO uses GAE for advantage estimation, which relies on the value function to propagate credit across time. When episodes are very long or rewards are extremely sparse, the value function may fail to provide useful gradients, leading to slow or failed learning. This is particularly relevant in:
- Long-horizon planning tasks
- Tasks where the reward is only given at the very end (e.g., game won/lost)
- LLM alignment where the reward is given after generating the entire response
6.4 Reward Hacking in RLHF
In the RLHF application specifically, PPO can still exploit imperfections in the reward model:
- The model may learn to generate outputs that game the reward model without actually being helpful
- Additional techniques like KL penalties, reward model ensembles, and process supervision are needed to mitigate this
- The clip parameter provides some protection but is not a complete solution
6.5 Scalability Concerns
The original PPO paper uses relatively small networks (2-layer MLPs with 64 units). When scaling to very large models (e.g., 175B parameter LLMs), new challenges arise:
- Memory for storing both old and new policy parameters
- Computational cost of computing probability ratios for all tokens
- Distributed training coordination across many GPUs
- Gradient computation through the entire model for each PPO step
7. Reproducibility and Practical Considerations
7.1 Implementation Availability
PPO is implemented in numerous open-source libraries:
- Stable Baselines3 (Python): The most popular RL library, with well-tested PPO implementation
- RLlib (Ray): Scalable distributed RL framework
- CleanRL: Single-file implementations for education
- TRL (Hugging Face): PPO implementation specifically for LLM fine-tuning via RLHF
- DeepSpeed-Chat: Microsoft's scalable RLHF training framework using PPO
7.2 Common Implementation Pitfalls
Based on extensive community experience, common mistakes when implementing PPO include:
- Incorrect advantage normalization: Per-minibatch normalization (subtracting mean, dividing by std) is critical for stable training but not mentioned in the paper
- Incorrect value target computation: The GAE computation must be done correctly at the trajectory level
- Missing observation normalization: Running mean/std normalization of observations significantly improves performance
- Incorrect gradient accumulation: When using multiple epochs, the old log-probabilities must be detached/stopped from the computation graph
- Reward scaling: Normalizing rewards to have unit variance can significantly improve stability
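The first pitfall in the list, per-minibatch advantage normalization, is a one-liner worth spelling out (helper name ours):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Per-minibatch advantage normalization: subtract the mean and divide
    by the standard deviation, with eps guarding against division by zero."""
    adv = np.asarray(advantages, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)
```

Applied inside each minibatch (after GAE, before the clipped loss), this keeps the scale of the policy gradient roughly constant across tasks and training stages.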
The ICLR 2022 paper "The 37 Implementation Details of Proximal Policy Optimization" by Huang et al. catalogues these details exhaustively.
7.3 Hyperparameter Recommendations
Based on the paper and community experience:
| Setting | $\epsilon$ | Epochs | Horizon | LR | $\gamma$ | $\lambda$ |
|---|---|---|---|---|---|---|
| MuJoCo | 0.2 | 10 | 2048 | 3e-4 | 0.99 | 0.95 |
| Atari | 0.1→0 | 3 | 128 | 2.5e-4→0 | 0.99 | 0.95 |
| RLHF (LLM) | 0.2 | 1-4 | full response | 1e-5 to 5e-6 | 1.0 | 0.95 |
8. Historical Context and Impact
8.1 Timeline
- 2015: TRPO published — first stable policy gradient method via trust regions
- 2015: GAE published — variance reduction for advantage estimation
- 2016: A3C/A2C published — parallel data collection for policy gradients
- 2017: PPO published — simplified TRPO with clipped objective
- 2018: PPO used to train OpenAI Five (Dota 2 bot that defeated world champions)
- 2022: InstructGPT uses PPO for RLHF — aligning GPT-3 with human preferences
- 2022-present: PPO becomes the default RLHF algorithm for ChatGPT, Claude, Gemini, Llama, etc.
8.2 Citation Impact
PPO is one of the most cited papers in deep reinforcement learning, with thousands of citations. More importantly, it has become a practical standard — when people say "we trained with RL," they usually mean PPO.
8.3 The Simplicity Principle
PPO embodies an important engineering principle: the best algorithm is often not the most theoretically elegant one, but the one that's simplest to implement, debug, and scale. TRPO has stronger theoretical guarantees (monotonic improvement under certain conditions), but PPO's practical advantages — first-order optimization, compatibility with shared architectures, ease of distributed implementation — made it the winner in practice.
9. Summary
PPO represents a masterclass in practical algorithm design. By replacing TRPO's constrained optimization with a simple clipped surrogate objective, Schulman et al. created an algorithm that is:
- Simple: Requires only a few lines of code change to vanilla policy gradients
- Stable: The clipping mechanism provides TRPO-like stability guarantees without second-order optimization
- General: Works with shared policy-value architectures, with dropout, with auxiliary tasks
- Scalable: First-order optimization scales naturally to large models and parallel implementations
- Effective: Matches or exceeds TRPO performance across diverse benchmarks
The algorithm's most profound impact has been as the optimization engine for RLHF, enabling the alignment of large language models with human preferences. Every time you interact with ChatGPT, Claude, or Gemini, you're benefiting from a model whose behavior was shaped by PPO's clipped objective function.
For practitioners, PPO remains the first algorithm to reach for when solving an RL problem. Its combination of simplicity, robustness, and proven effectiveness across domains — from simulated robots to billion-parameter language models — makes it one of the most important algorithms of the deep learning era.
References
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
- Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.
- Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS.
- Huang, S., et al. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track.
- Kakade, S. & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. ICML.
- Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS.