Policy Gradient: Directly Optimizing the Policy
Updated 2026-04-06
Why Directly Optimize the Policy
In the previous article, we saw that Value-Based methods (Q-Learning / DQN) work by learning $Q(s, a)$ and then greedily selecting the action with the highest Q-value. This approach has several fundamental limitations:
- Only handles discrete action spaces: It requires computing a Q-value for every possible action. When the action space is very large (e.g., an LLM’s vocabulary of 32K+ tokens) or continuous (e.g., robot control), this is impractical
- Cannot express stochastic policies: Sometimes the optimal policy requires randomness (e.g., rock-paper-scissors), but Q-Learning can only output deterministic policies
- Small changes, big effects: Tiny changes in Q-values can flip the argmax, causing abrupt policy changes
The core idea of Policy Gradient: Since an LLM is already a parameterized policy (given a context, output a token probability distribution), why not optimize this policy directly?
The Policy Gradient Theorem
The objective of Policy Gradient is to maximize the expected cumulative return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$$

where $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T)$ is a trajectory sampled by running the policy $\pi_\theta$.
The Policy Gradient Theorem gives the gradient of $J(\theta)$ with respect to the parameters $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted return starting from time step $t$.
Intuition:
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction that increases the probability of action $a_t$
- $G_t$ measures how good the trajectory was from time step $t$ onward
- Their product: when $G_t > 0$, push up the probability of good actions; when $G_t < 0$, push down the probability of bad actions
The beauty of this formula: no environment model is needed — you only need to sample trajectories to estimate the gradient.
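To see why no model is needed, here is a brief sketch of the standard log-derivative (likelihood-ratio) argument behind the theorem:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$

Because the trajectory probability factorizes as $p_\theta(\tau) = p(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, the dynamics terms do not depend on $\theta$ and vanish from the gradient, leaving $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. Replacing $R(\tau)$ with $G_t$ uses the fact that rewards received before time step $t$ cannot be influenced by action $a_t$.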
The REINFORCE Algorithm
REINFORCE is the simplest Policy Gradient implementation:
- Sample: Use the current policy $\pi_\theta$ to sample a complete trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T)$
- Compute Returns: For each time step $t$ in the trajectory, calculate the discounted return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$
- Update Parameters: take a gradient ascent step $\theta \leftarrow \theta + \alpha \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ (see the sketch below)
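As a concrete illustration, here is a minimal REINFORCE sketch in PyTorch. The CartPole-v1 environment, network size, and hyperparameters are illustrative assumptions, not part of the algorithm itself:

```python
import torch
import torch.nn as nn
import gymnasium as gym  # assumption: gymnasium is installed; CartPole is purely illustrative

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # 4-dim state -> 2 action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    # 1. Sample: roll out one complete trajectory with the current policy.
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))
        done = terminated or truncated

    # 2. Compute Returns: G_t = sum_{k>=t} gamma^(k-t) * r_k, built backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # 3. Update Parameters: gradient ascent on sum_t log pi(a_t|s_t) * G_t,
    #    implemented as gradient descent on its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```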
The High Variance Problem
While REINFORCE is simple and elegant, it has a fatal flaw: extremely high variance in gradient estimates.
Each update relies on a single sampled trajectory, whose return is heavily affected by randomness. The same policy can produce vastly different returns across two rollouts: one might happen to follow a good path and yield a high return, while another follows a bad path and yields a low return.
This makes gradient estimates very unstable, requiring many samples to obtain a reliable gradient direction.
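A tiny numerical illustration of this effect, using a made-up one-dimensional setup rather than anything from the algorithm above: a Gaussian "policy" $a \sim \mathcal{N}(\theta, 1)$ with reward $R(a) = -(a-3)^2$. Single-sample score-function estimates of the gradient are correct on average, but each individual estimate is dominated by noise.

```python
import numpy as np

# Toy 1-D example (illustrative assumptions): policy a ~ N(theta, 1), reward R(a) = -(a - 3)^2.
# The true gradient of E[R] with respect to theta is 2 * (3 - theta).
rng = np.random.default_rng(0)
theta = 0.0
true_grad = 2 * (3 - theta)

# REINFORCE-style single-sample estimates: R(a) * d/dtheta log N(a; theta, 1) = R(a) * (a - theta)
a = rng.normal(theta, 1.0, size=100_000)
g = -(a - 3) ** 2 * (a - theta)

print(f"true gradient             : {true_grad:.2f}")
print(f"mean of sample estimates  : {g.mean():.2f}   (unbiased)")
print(f"std of a single estimate  : {g.std():.2f}  (well above the signal itself)")
```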
Baseline and Advantage
The key insight for solving high variance: the absolute value of the reward doesn’t matter — what matters is “how much better than average”.
Introduce a Baseline $b(s_t)$ (typically an estimate of the state value $V(s_t)$), modifying the gradient to:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$
It can be shown mathematically that subtracting any baseline that does not depend on the action leaves the expected gradient unchanged (the estimator remains unbiased), while a well-chosen baseline significantly reduces its variance.
This leads to the Advantage Function:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

$A(s_t, a_t) > 0$ means this action is better than average, $A(s_t, a_t) < 0$ means it is worse than average. Using the Advantage instead of the raw return gives gradients that are both unbiased and low-variance.
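Continuing the toy example from the previous section (still purely illustrative), subtracting the analytic average return as a baseline leaves the mean of the estimates unchanged while visibly shrinking their spread:

```python
import numpy as np

# Same illustrative setup as before: a ~ N(theta, 1), R(a) = -(a - 3)^2, theta = 0.
rng = np.random.default_rng(0)
theta = 0.0
a = rng.normal(theta, 1.0, size=100_000)
score = a - theta                         # d/dtheta log N(a; theta, 1)
returns = -(a - 3) ** 2

baseline = -((theta - 3) ** 2 + 1)        # analytic E[R]; plays the role of V(s)
g_plain = returns * score                 # raw REINFORCE estimate
g_base = (returns - baseline) * score     # advantage-style estimate

print(f"without baseline: mean {g_plain.mean():.2f}, std {g_plain.std():.2f}")
print(f"with baseline   : mean {g_base.mean():.2f}, std {g_base.std():.2f}  (same mean, lower variance)")
```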
From REINFORCE to Actor-Critic
REINFORCE + Baseline still requires sampling complete trajectories to compute returns. Can we update after every step?
The answer is Actor-Critic: use a neural network (the Critic) to approximate $V(s_t)$ as the baseline. This way:
- The Actor (policy network) decides which action to take
- The Critic (value network) evaluates how good the current state is
The Critic’s TD estimate $A_t \approx r_t + \gamma V(s_{t+1}) - V(s_t)$ provides an Advantage signal at every step, without waiting for the trajectory to end. This dramatically improves sample efficiency.
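A minimal sketch of what this can look like in code, reusing the illustrative CartPole setup from the REINFORCE example; the one-step TD formulation shown here is one common variant, not the only way to build the Critic's estimate:

```python
import torch
import torch.nn as nn
import gymnasium as gym  # assumption: same illustrative CartPole setup as above

env = gym.make("CartPole-v1")
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # policy pi_theta(a|s)
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # value estimate V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        s = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=actor(s))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        # TD target and advantage: A_t ≈ r_t + gamma * V(s_{t+1}) - V(s_t)
        s_next = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            v_next = 0.0 if terminated else critic(s_next).squeeze()
            td_target = reward + gamma * v_next
        v = critic(s).squeeze()
        advantage = td_target - v

        # Actor: policy gradient weighted by the advantage; Critic: regress V(s_t) toward the TD target.
        actor_loss = -dist.log_prob(action) * advantage.detach()
        critic_loss = advantage.pow(2)
        opt.zero_grad()
        (actor_loss + critic_loss).backward()
        opt.step()
```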
Actor-Critic is the key stepping stone to PPO and RLHF, which we will explore in depth in the next article.
Summary
This article covered the core ideas of Policy Gradient:
- Directly optimizing the policy is better suited to large or continuous action spaces (such as an LLM’s vocabulary) than Value-Based methods
- The Policy Gradient Theorem provides gradient estimates that do not depend on an environment model
- REINFORCE is the simplest implementation, but suffers from extremely high variance
- Baseline / Advantage significantly reduce variance while maintaining unbiasedness
- Actor-Critic uses a neural network to approximate the baseline, enabling step-by-step updates
In the next article, we will dive into the Actor-Critic architecture and PPO to understand the core optimization algorithm behind RLHF.