Actor-Critic and PPO: Stable Policy Optimization
Updated 2026-04-06
Actor-Critic Architecture
In the previous article we saw the idea behind REINFORCE + Baseline: using the state value $V(s_t)$ as a baseline to reduce variance. However, REINFORCE still requires sampling a complete trajectory before updating.
Actor-Critic takes this further: it uses a separate neural network (the Critic) to learn $V(s)$, so that at every step we can compute the TD error as an Advantage signal:

$$\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
This is a one-step estimate of the Advantage: if the actual reward plus the value of the next state exceeds the expected value of the current state, it means the action was “better than expected.”
The two networks each have a distinct role:
- Actor $\pi_\theta(a \mid s)$: the policy network, which decides what action to take
- Critic $V_\phi(s)$: the value network, which evaluates how good a state is
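To make the interplay concrete, here is a minimal PyTorch sketch of one online Actor-Critic update. The `actor` and `critic` modules (returning action logits and a scalar state value) and the single shared optimizer are assumptions for illustration, not a complete training loop:

```python
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One online Actor-Critic update using the TD error as the advantage signal."""
    v_s = critic(s).squeeze(-1)                              # V(s_t)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1) * (1.0 - done)   # V(s_{t+1}), zero at episode end
        td_target = r + gamma * v_next
    delta = td_target - v_s.detach()                         # TD error = one-step advantage estimate

    log_prob = F.log_softmax(actor(s), dim=-1).gather(-1, a.unsqueeze(-1)).squeeze(-1)
    actor_loss = -(delta * log_prob).mean()                  # policy gradient weighted by the advantage
    critic_loss = F.mse_loss(v_s, td_target)                 # regress V(s_t) toward the TD target

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```

Note how the advantage `delta` is detached: only the Critic is trained to reduce the TD error, while the Actor simply uses it as a learning signal.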
GAE: Balancing Bias and Variance
The one-step TD error has high bias but low variance; the Monte Carlo return has low bias but high variance. GAE (Generalized Advantage Estimation) uses a parameter $\lambda \in [0, 1]$ to smoothly interpolate between the two:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$
- $\lambda = 0$: uses only the one-step TD error (high bias, low variance)
- $\lambda = 1$: equivalent to the Monte Carlo return (low bias, high variance)
- In practice, $\lambda \approx 0.95$ works well
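The sum above has a convenient recursive form, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, which lets us compute all advantages in a single backward pass over a rollout. A minimal NumPy sketch (array names and the bootstrap convention for `values` are assumptions for illustration):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE over one rollout: `rewards`/`dones` have length T,
    `values` has length T+1 (includes the bootstrap value of the final state)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # one-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]   # value-function regression targets
    return advantages, returns
```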
The Trust Region Problem
Policy Gradient has a critical practical issue: if the step size is too large the policy collapses; if it’s too small convergence is painfully slow.
Standard gradient descent cannot guarantee that performance won’t plummet after a policy update. A seemingly reasonable gradient direction, if the step is too large, can push the policy into a completely different behavioral mode, causing catastrophic performance collapse.
TRPO (Trust Region Policy Optimization) solves this by adding a KL divergence constraint at each update, ensuring the new and old policies stay "close enough":

$$\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta$$
PPO: A Simple and Effective Trust Region Method
While TRPO is theoretically elegant, constrained optimization is computationally expensive. PPO (Proximal Policy Optimization) achieves a similar effect with a clever clip operation:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio between new and old policies.
Intuition behind clipping:
- When $\hat{A}_t > 0$ (good action): once the ratio exceeds $1+\epsilon$, the objective stops growing, preventing excessive probability increases
- When $\hat{A}_t < 0$ (bad action): once the ratio drops below $1-\epsilon$, the objective stops improving, so there is no incentive to push the probability down further
This “pessimistic” strategy ensures every update stays within a safe range.
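The clipped objective itself is only a few lines of code. A hedged PyTorch sketch (tensor names are illustrative; `log_probs_old` and `advantages` come from the rollout and carry no gradient):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # pessimistic bound: take the smaller of the two objectives, negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```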
PPO’s Role in LLMs
When PPO meets LLMs, RL concepts map in interesting ways:
| Game RL | LLM RLHF |
|---|---|
| Environment state | Prompt + generated tokens so far |
| Action | Next token |
| Policy $\pi(a \mid s)$ | The LLM itself (its next-token distribution) |
| Trajectory | A complete response |
| Reward | RM score - beta * KL penalty |
| Episode ends | Generating the EOS token |
The KL penalty is a critical addition in LLM RLHF: it prevents the LLM from drifting too far from the pretrained distribution, avoiding reward hacking (discussed in detail in later articles).
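One common way to implement this reward shaping (used in several open RLHF codebases, though details vary) is a per-token KL penalty against the frozen reference model, with the reward-model score added only on the final token of the response; the function below is an illustrative sketch, not a specific library's API:

```python
import torch

def rlhf_token_rewards(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    """Per-token rewards for PPO in RLHF: -beta * KL estimate on every token,
    plus the scalar RM score on the last token of the response."""
    kl = logprobs_policy - logprobs_ref        # per-token log-ratio (KL estimate)
    rewards = (-beta * kl).clone()
    rewards[-1] = rewards[-1] + rm_score       # the RM scores the full response once
    return rewards
```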
Summary
This article covered the evolution from Actor-Critic to PPO:
- Actor-Critic uses a Critic network to provide step-by-step Advantage signals, eliminating the need for complete trajectories
- GAE uses the $\lambda$ parameter to elegantly trade off between bias and variance
- Trust Region addresses the policy update step size problem, preventing performance collapse
- PPO simplifies the trust region constraint with a clip operation, making it the most practical policy optimization algorithm
- PPO + LLM maps token generation to RL action sequences, with a KL penalty to prevent drift
PPO is the core engine of RLHF. In the next article, we will introduce the complete RLHF pipeline: SFT, Reward Model, and PPO, showing how PPO aligns LLMs in practice.