
Actor-Critic and PPO: Stable Policy Optimization

Updated 2026-04-06

Actor-Critic Architecture

In the previous article we saw the idea behind REINFORCE + Baseline: using $V(s)$ as a baseline to reduce variance. However, REINFORCE still requires sampling a complete trajectory before updating.

Actor-Critic takes this further: it uses a separate neural network (the Critic) to learn $V(s)$, so that at every step we can compute the TD error as an advantage signal:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This $\delta_t$ is a one-step estimate of the advantage: if the reward actually received plus the discounted value of the next state exceeds the expected value of the current state, the action was “better than expected.”
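
As a tiny numeric illustration (all values hypothetical):

```python
# One-step TD error as an advantage signal; all numbers are hypothetical.
gamma = 0.99
r_t = 1.0          # reward actually received
v_s = 2.5          # Critic's estimate V(s_t)
v_s_next = 2.8     # Critic's estimate V(s_{t+1})

delta_t = r_t + gamma * v_s_next - v_s   # δ_t = 1.272
print(delta_t > 0)  # True: the action did better than the Critic expected
```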

[Figure: Actor-Critic dual-network architecture. 1. The Actor $\pi(a|s;\theta)$ outputs an action distribution and samples $a$; 2. the environment returns $s'$ and $r$; 3. the Critic $V(s;w)$ evaluates $V(s)$ and $V(s')$; 4. the advantage $A = r + \gamma V(s') - V(s)$ is computed; 5. the advantage updates the Actor.]

The two networks each have a distinct role:

  • Actor $\pi(a|s;\theta)$: the policy network, which decides what action to take
  • Critic $V(s;w)$: the value network, which evaluates how good a state is
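
A minimal PyTorch sketch of this dual-network setup for a discrete-action task (layer sizes and names are illustrative, not a reference implementation):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Two separate networks: the Actor decides, the Critic evaluates."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),        # logits over discrete actions
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),                # scalar state value V(s; w)
        )

    def forward(self, obs):
        dist = torch.distributions.Categorical(logits=self.actor(obs))
        value = self.critic(obs).squeeze(-1)     # V(s), shape (batch,)
        return dist, value
```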

GAE: Balancing Bias and Variance

The one-step TD error $\delta_t$ has high bias but low variance; the Monte Carlo return has low bias but high variance. GAE (Generalized Advantage Estimation) uses a parameter $\lambda$ to smoothly interpolate between the two:

$$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

  • $\lambda = 0$: uses only the one-step TD error (high bias, low variance)
  • $\lambda = 1$: equivalent to the Monte Carlo return (low bias, high variance)
  • In practice, $\lambda = 0.95 \sim 0.97$ works best
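
Expanding the sum gives a simple backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. A minimal NumPy sketch, assuming the segment does not end mid-episode (no terminal-state handling):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one trajectory segment.

    rewards: r_t for t = 0..T-1
    values:  V(s_t) for t = 0..T (one extra entry to bootstrap the last step)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # one-step TD error: δ_t = r_t + γ V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # recursion: Â_t = δ_t + γλ Â_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```
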
[Figure: GAE $\lambda$ and the bias-variance tradeoff. $\lambda$ near 0 behaves like one-step TD (high bias, low variance); $\lambda$ near 1 behaves like the Monte Carlo return over the full trajectory (low bias, high variance).]

The Trust Region Problem

Policy Gradient has a critical practical issue: if the step size is too large, the policy collapses; if it is too small, convergence is painfully slow.

Standard gradient descent cannot guarantee that performance won’t plummet after a policy update. A seemingly reasonable gradient direction, if the step is too large, can push the policy into a completely different behavioral mode, causing catastrophic performance collapse.

TRPO (Trust Region Policy Optimization) solves this by adding a KL divergence constraint at each update, ensuring the new and old policies stay “close enough”:

$$\max_\theta \; \hat{\mathbb{E}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}\right] \quad \text{s.t. } KL(\pi_{\theta_{\text{old}}} \| \pi_\theta) \leq \delta
$$
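
For intuition, the constraint can be checked directly from the two policies' action distributions. A minimal PyTorch sketch for a discrete action space (the function name is illustrative; actual TRPO enforces the constraint via a second-order approximation rather than a direct check):

```python
import torch

def mean_kl(logp_old, logp_new):
    """Estimate KL(π_old ‖ π_new), averaged over a batch of states.

    logp_old, logp_new: log-probabilities over actions, shape (batch, n_actions)
    """
    p_old = logp_old.exp()
    return (p_old * (logp_old - logp_new)).sum(dim=-1).mean()

# TRPO only accepts an update if this stays below the threshold δ (e.g. 0.01).
```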

[Figure: Trust region comparison in policy parameter space. An unconstrained update step can be too large; a policy change that is too large may crash performance.]

PPO: A Simple and Effective Trust Region Method

While TRPO is theoretically elegant, constrained optimization is computationally expensive. PPO (Proximal Policy Optimization) achieves a similar effect with a clever clip operation:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}\left[\min\left(r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio between the new and old policies.
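
In code, the clipped objective is only a few lines. A minimal PyTorch sketch, negated so it can be minimized with a standard optimizer (names are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, negated for gradient descent."""
    # logp_old and advantages come from the rollout and carry no gradient
    ratio = torch.exp(logp_new - logp_old)                  # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # maximize L^CLIP
```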

Intuition behind clipping:

  • When $\hat{A} > 0$ (good action): once the ratio exceeds $1+\epsilon$, the objective stops growing, preventing excessive probability increases
  • When $\hat{A} < 0$ (bad action): once the ratio drops below $1-\epsilon$, the objective stops improving, so there is no incentive to push the probability down further

This “pessimistic” strategy of always taking the minimum ensures every update stays within a safe range.

[Figure: The PPO clipped surrogate objective with $\epsilon = 0.2$. When $\hat{A} > 0$, the objective flattens once the ratio exceeds $1+\epsilon$, preventing excessive probability increases.]
[Figure: PPO vs. vanilla policy gradient training curves. Vanilla PG shows high variance and occasional crashes from oversized policy updates; PPO climbs stably under clip protection.]

PPO’s Role in LLMs

When PPO meets LLMs, RL concepts map in interesting ways:

  • Environment state $s$ → the prompt plus the tokens generated so far
  • Action $a$ → the next token
  • Policy $\pi(a|s)$ → the LLM's next-token distribution $\pi_\theta(\text{token} \mid \text{context})$
  • Trajectory → a complete response
  • Reward → RM score minus the $\beta$-weighted KL penalty
  • Episode ends → generating the EOS token

The KL penalty $\beta \cdot KL(\pi_\theta \| \pi_{\text{ref}})$ is a critical addition in LLM RLHF: it prevents the LLM from drifting too far from the pretrained distribution, avoiding reward hacking (discussed in detail in later articles).
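
A common formulation applies the KL penalty at every generated token and adds the RM score at the final token. A minimal sketch of that per-token reward (shapes and names are assumptions, not a specific library's API):

```python
import torch

def rlhf_token_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for PPO on one generated response.

    rm_score:    scalar reward-model score for the whole response
    logp_policy: log π_θ of each generated token, shape (T,), no gradient
    logp_ref:    log π_ref of the same tokens under the frozen reference model
    """
    kl = logp_policy - logp_ref            # per-token KL estimate
    rewards = -beta * kl                   # KL penalty applied at every token
    rewards[-1] = rewards[-1] + rm_score   # RM score arrives at the final token
    return rewards
```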

[Figure: Step 1 of RLHF generation. The policy LLM $\pi_\theta(\text{token} \mid \text{context})$ takes a prompt and generates a response token by token. In RL terms: state = prompt + tokens generated so far, action = next token, trajectory = the full response.]

Summary

This article covered the evolution from Actor-Critic to PPO:

  1. Actor-Critic uses a Critic network to provide step-by-step advantage signals, eliminating the need for complete trajectories
  2. GAE uses the $\lambda$ parameter to elegantly trade off between bias and variance
  3. Trust Region addresses the policy update step size problem, preventing performance collapse
  4. PPO simplifies the trust region constraint with a clip operation, making it one of the most widely used policy optimization algorithms
  5. PPO + LLM maps token generation to RL action sequences, with a KL penalty to prevent drift

PPO is the core engine of RLHF. In the next article, we will introduce the complete RLHF pipeline: SFT, Reward Model, and PPO, showing how PPO aligns LLMs in practice.