
Policy Gradient: Directly Optimizing the Policy

Updated 2026-04-06

Why Directly Optimize the Policy

In the previous article, we saw that Value-Based methods (Q-Learning / DQN) work by learning $Q(s,a)$ and then greedily selecting the action with the highest Q-value. This approach has several fundamental limitations:

  1. Only handles discrete action spaces: It requires computing a Q-value for every possible action. When the action space is very large (e.g., an LLM’s vocabulary of 32K+ tokens) or continuous (e.g., robot control), this is impractical
  2. Cannot express stochastic policies: Sometimes the optimal policy requires randomness (e.g., rock-paper-scissors), but Q-Learning can only output deterministic policies
  3. Small changes, big effects: Tiny changes in Q-values can flip the argmax, causing abrupt policy changes

The core idea of Policy Gradient: Since an LLM is already a parameterized policy $\pi_\theta(a|s)$ (given a context, output a token probability distribution), why not optimize this policy directly?
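
To make the LLM-as-policy view concrete, here is a minimal PyTorch sketch (the tiny linear layer is just a stand-in for an LLM's final projection; all names and numbers are illustrative, not from the article): given a context representation, the model produces a distribution over the vocabulary, and the log-probability of the sampled token is exactly the $\log \pi_\theta(a|s)$ term that Policy Gradient will later scale by a reward.

```python
import torch
import torch.nn.functional as F

vocab_size = 32_000
lm_head = torch.nn.Linear(16, vocab_size)   # stand-in for an LLM's final projection layer
hidden = torch.randn(16)                    # stand-in for the hidden state of the current context s

logits = lm_head(hidden)                    # one score per token in the vocabulary
log_probs = F.log_softmax(logits, dim=-1)   # log pi_theta(. | s)

token = torch.multinomial(log_probs.exp(), num_samples=1)  # sample the next token a ~ pi_theta(.|s)
log_pi = log_probs[token]                   # log pi_theta(a|s): the quantity Policy Gradient scales by reward
```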

(Interactive demo: Policy Gradient intuition. Clicking an action yields a reward; a positive reward increases that action's probability and a negative reward decreases it, following $\nabla J \approx \nabla \log \pi(a|s) \cdot R$. Repeated clicks show the policy converging toward the actions with higher hidden expected reward.)

The Policy Gradient Theorem

The objective of Policy Gradient is to maximize the expected cumulative return $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \gamma^t r_t\right]$.

The Policy Gradient Theorem gives the gradient of $J(\theta)$ with respect to the parameters $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]$$

where $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted return starting from time step $t$.

Intuition:

  • $\nabla_\theta \log \pi_\theta(a|s)$ points in the direction that increases the probability of action $a$
  • $G_t$ measures how good this trajectory was
  • Their product: when $G_t > 0$, push up the probability of good actions; when $G_t < 0$, push down the probability of bad actions

The beauty of this formula: no environment model is needed — you only need to sample trajectories to estimate the gradient.
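
In code, the sampled estimator is usually implemented as a loss of the form $-\log \pi_\theta(a_t|s_t) \cdot G_t$, so that gradient descent on the loss is gradient ascent on $J(\theta)$. A minimal PyTorch sketch (the toy policy and the return value are illustrative, not from the article):

```python
import torch

policy = torch.nn.Linear(4, 3)            # toy policy: 4-dim state -> 3 action logits
state = torch.randn(4)                    # one observed state s_t

dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                    # a_t ~ pi_theta(. | s_t)

G = 1.7                                   # discounted return observed from step t (placeholder value)

loss = -dist.log_prob(action) * G         # descending this loss ascends J(theta)
loss.backward()                           # gradients now hold -grad log pi(a_t|s_t) * G_t
```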

The REINFORCE Algorithm

REINFORCE is the simplest Policy Gradient implementation:

  1. Sample: Use the current policy $\pi_\theta$ to sample a complete trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$
  2. Compute Returns: For each time step in the trajectory, calculate the discounted return $G_t$
  3. Update Parameters: $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$
(Interactive demo: REINFORCE with $\gamma = 0.9$. Sample complete trajectories, compute the discounted return $G_t$, and observe how widely the sampled returns spread, which is the core problem of REINFORCE.)
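
Putting the three steps together, here is a compact REINFORCE sketch, assuming a Gymnasium-style environment interface and a policy network that maps a state vector to action logits (both are assumptions for illustration, not part of the article):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} by scanning the episode backwards."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

def reinforce_update(policy, env, optimizer, gamma=0.99):
    """Sample one complete trajectory with the current policy, then take one gradient step."""
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))

    returns = torch.tensor(discounted_returns(rewards, gamma))
    # theta <- theta + alpha * sum_t grad log pi(a_t|s_t) * G_t, written as a loss to minimize
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```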

The High Variance Problem

While REINFORCE is simple and elegant, it has a fatal flaw: extremely high variance in gradient estimates.

Each update relies on a single sampled trajectory, whose return is heavily affected by randomness. The same policy can produce vastly different returns across two samples: one might happen to follow a good path and yield a high return, while another follows a bad path and yields a low return.

This makes gradient estimates very unstable, requiring many samples to obtain a reliable gradient direction.
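
The effect is easy to reproduce with a toy simulation (the numbers below are made up purely for illustration): a single-trajectory return estimate carries the full spread of the return distribution, while averaging over a batch shrinks the standard deviation by roughly $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, spread = 5.0, 4.0   # pretend the true expected return is 5 with a large spread

def estimate(n_trajectories):
    """Estimate the expected return by averaging n sampled trajectory returns."""
    return rng.normal(true_mean, spread, size=n_trajectories).mean()

single = [estimate(1) for _ in range(1000)]    # plain REINFORCE: one trajectory per update
batched = [estimate(32) for _ in range(1000)]  # averaging over a batch of 32 trajectories

print(f"std of 1-trajectory estimates:  {np.std(single):.2f}")   # ~4.0
print(f"std of 32-trajectory estimates: {np.std(batched):.2f}")  # ~4.0 / sqrt(32) ≈ 0.7
```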

(Interactive demo: scatter plot of single-trajectory gradient estimates in parameter space versus the true gradient direction. Individual samples point in widely different directions; only their average aligns with the true gradient, which is why REINFORCE converges slowly.)

Baseline and Advantage

The key insight for solving high variance: the absolute value of the reward doesn’t matter — what matters is “how much better than average”.

Introduce a Baseline $b(s)$ (typically an estimate of $V(s)$), modifying the gradient to:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \big(G_t - b(s_t)\big)\right]$$

It can be shown that subtracting any baseline that does not depend on the action leaves the expected gradient unchanged (the estimator remains unbiased), while a well-chosen baseline significantly reduces its variance.
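
The proof is short: because $b(s)$ does not depend on $a$, the baseline term has zero expectation under the policy:

$$\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right] = b(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0$$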

This leads to the Advantage Function:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

$A > 0$ means this action is better than average; $A < 0$ means it is worse than average. Using the Advantage instead of the raw return gives gradients that are both unbiased and low-variance.
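
A small numerical sketch of the difference (the returns and log-probabilities are illustrative placeholders): with raw returns of 8, 6, and 3, every action gets pushed up; subtracting the mean as a baseline turns them into advantages with mixed signs, so only the better-than-average actions are reinforced.

```python
import torch

returns = torch.tensor([8.0, 6.0, 3.0])                             # G_t for three sampled actions
log_probs = torch.tensor([-1.2, -0.7, -2.1], requires_grad=True)    # stand-ins for log pi(a_t|s_t)

baseline = returns.mean()                  # simple baseline; in practice, an estimate of V(s_t)
advantages = returns - baseline            # [ 2.33,  0.33, -2.67]: mixed signs instead of all positive

loss = -(log_probs * advantages).sum()     # advantage-weighted policy-gradient loss
loss.backward()                            # below-average actions are now pushed down, not up
```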

(Interactive demo: effect of a baseline on policy-gradient variance. With raw returns such as G = 8, 6, and 3 for three actions, every probability is pushed up and the gradient oscillates even though the actions differ in quality; subtracting a baseline removes this shared positive offset.)

From REINFORCE to Actor-Critic

REINFORCE + Baseline still requires sampling complete trajectories to compute returns. Can we update after every step?

The answer is Actor-Critic: use a neural network (the Critic) to approximate $V(s)$ as the baseline. This way:

  • The Actor (policy network) decides which action to take
  • The Critic (value network) evaluates how good the current state is

The Critic’s TD estimate provides an Advantage signal at every step, without waiting for the trajectory to end. This dramatically improves sample efficiency.
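
A sketch of one online Actor-Critic update, where the TD error $r + \gamma V(s') - V(s)$ serves as the per-step Advantage estimate (the network and transition interfaces are assumptions for illustration: the actor maps a state to action logits, the critic maps a state to a scalar value):

```python
import torch

def actor_critic_step(actor, critic, transition, actor_opt, critic_opt, gamma=0.99):
    """One update from a single (state, action, reward, next_state, done) transition."""
    state, action, reward, next_state, done = transition

    value = critic(state)                                    # V(s)
    with torch.no_grad():
        next_value = torch.zeros(()) if done else critic(next_state)   # V(s'), 0 at episode end
    td_target = reward + gamma * next_value
    advantage = (td_target - value).detach()                 # TD error used as the Advantage

    # Critic: regress V(s) toward the one-step TD target.
    critic_loss = (td_target - value).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy-gradient step weighted by the Advantage, no full trajectory needed.
    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = (-dist.log_prob(action) * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```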

Actor-Critic is the key stepping stone to PPO and RLHF, which we will explore in depth in the next article.

(Interactive diagram: evolution of Policy Gradient algorithms. REINFORCE → REINFORCE + Baseline (adds $b(s)$) → Actor-Critic (adds a Critic network) → A2C (adds parallel sampling) → PPO (adds a clipped trust region), the core algorithm for RLHF.)

Summary

This article covered the core ideas of Policy Gradient:

  1. Directly optimizing the policy $\pi_\theta$ is better suited for large action spaces (like LLMs) than Value-Based methods
  2. The Policy Gradient Theorem provides gradient estimates that do not depend on an environment model
  3. REINFORCE is the simplest implementation, but suffers from extremely high variance
  4. Baseline / Advantage significantly reduce variance while maintaining unbiasedness
  5. Actor-Critic uses a neural network to approximate the baseline, enabling step-by-step updates

In the next article, we will dive into the Actor-Critic architecture and PPO to understand the core optimization algorithm behind RLHF.