From DPO to GRPO: Direct Preference Optimization
Updated 2026-04-06
Pain Points of RLHF
In the previous article, we covered the complete three-stage RLHF pipeline. While it successfully powered the creation of ChatGPT, it also exposed several core pain points:
- High training complexity: requires running 4 models simultaneously (policy, reference, RM, critic), and PPO training is unstable
- The Reward Model is a bottleneck: RM quality directly caps the alignment ceiling, and the RM is easy for the policy to exploit (reward hacking)
- Hyperparameter sensitivity: PPO’s clip epsilon, learning rate, KL penalty beta, and other parameters require careful tuning
Can we skip the RM and PPO entirely and optimize the policy directly from preference data? This is the core motivation behind DPO.
Core Derivation of DPO
DPO’s key insight is that, within the RLHF framework, the optimal policy and the reward function are linked by a closed-form relationship.
Starting from RLHF’s KL-constrained optimization objective:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$

we can derive the optimal policy in closed form:

$$\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

where $Z(x)$ is the partition function. Conversely, the reward can be expressed in terms of the policy:

$$r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)$$

Substituting this relationship into the Bradley-Terry model, the intractable $Z(x)$ term cancels out, yielding the DPO loss:

$$\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$
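To make the loss concrete, here is a minimal PyTorch sketch of the DPO loss. It assumes you have already computed summed per-token log-probabilities for the chosen and rejected responses under both the policy and the frozen reference model; the tensor names are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss: log-sigmoid of the scaled implicit-reward margin.

    All inputs are summed log-probabilities log pi(y|x), shape (batch,).
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Maximize the margin between chosen and rejected via log-sigmoid
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

As the formula suggests, this is just a forward pass through two frozen-plus-trainable log-probability computations and a backward pass, which is why training looks like SFT in practice.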
Advantages and Issues of DPO
Advantages:
- Eliminates the Reward Model and PPO, requiring only 2 models (policy + reference)
- Training is as simple as SFT (forward pass + backward pass)
- No online sampling needed; trains directly on offline preference data
Issues:
- Offline distribution shift: training data comes from an old policy; as the model updates, the data no longer matches the current policy
- Sensitive to data quality: noise in preference pairs directly affects the optimization direction
- Prone to overfitting: especially evident on small datasets
IPO and KTO
To address DPO’s problems, researchers have proposed several variants:
IPO (Identity Preference Optimization): regularizes the objective by regressing the preference margin toward a fixed target rather than pushing it toward infinity, which prevents overfitting on preference pairs.
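A minimal sketch of the IPO objective under the same assumptions as the DPO sketch above (summed log-probabilities as inputs); `tau` here plays the role of IPO’s regularization strength:

```python
def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    """IPO: regress the log-ratio margin toward the fixed target 1/(2*tau)
    with a squared loss, instead of pushing it toward infinity as in DPO."""
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```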
KTO (Kahneman-Tversky Optimization): the biggest innovation is that it doesn’t require paired preference data — it only needs to know whether each response is “good” or “bad,” significantly reducing annotation costs.
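A simplified, illustrative sketch of a KTO-style loss on unpaired data. It assumes each response carries only a good/bad label and that a batch-level reference point `z0` (a KL estimate in the KTO paper) is computed elsewhere; the full KTO recipe additionally weights desirable and undesirable examples separately, which is omitted here:

```python
import torch

def kto_style_loss(policy_logps, ref_logps, is_desirable, z0, beta=0.1):
    """Simplified KTO-style loss on unpaired data.

    Each response carries only a good/bad label (is_desirable, bool tensor).
    Desirable responses are pushed above the reference point z0,
    undesirable ones below it.
    """
    log_ratio = policy_logps - ref_logps                 # log pi/pi_ref
    value_good = torch.sigmoid(beta * (log_ratio - z0))  # want this large
    value_bad = torch.sigmoid(beta * (z0 - log_ratio))   # want this large
    value = torch.where(is_desirable, value_good, value_bad)
    return (1.0 - value).mean()
```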
GRPO: DeepSeek’s Approach
GRPO (Group Relative Policy Optimization) comes from DeepSeek, with a core innovation of eliminating the Critic network:
- For the same prompt, sample a group of G responses
- Use a reward function (rules-based or an RM) to score each response
- Compute each response’s Advantage from its within-group relative standing: $A_i = \dfrac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$
- Update the policy using a PPO-style clipped objective
GRPO’s advantage is that it doesn’t need a Critic network (saving one large model’s worth of GPU memory), and online sampling avoids distribution shift. DeepSeek-R1 used GRPO with rule-based rewards to train a model that exhibited emergent thinking capabilities.
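A minimal sketch of the group-relative Advantage computation, assuming `rewards` holds the scores of the G responses sampled for a single prompt; the clipped PPO-style policy update itself is unchanged and omitted here:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages: normalize each response's reward by the
    mean and std of its own group, so no learned critic is needed.

    rewards: shape (G,) -- scores for the G responses to one prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rule-based rewards for G = 4 sampled answers to one math prompt
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # 1 = correct, 0 = incorrect
advantages = group_relative_advantages(rewards)
```

Because the group itself serves as the baseline, the advantage estimate comes for free from sampling, which is exactly what lets GRPO drop the Critic.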
Method Selection
There is no perfect alignment method. The choice depends on your constraints:
| Dimension | RLHF | DPO | GRPO |
|---|---|---|---|
| Training complexity | High (4 models) | Low (2 models) | Medium (2 models + online generation) |
| Data requirements | Preference pairs + prompts | Preference pairs | Prompts + reward rules |
| Training stability | PPO is unstable | Stable like SFT | Fairly stable |
| Performance ceiling | High (online optimization) | Medium (offline limitations) | High (online + emergence) |
| Use cases | Best alignment quality | Fast iteration, limited resources | Math/reasoning tasks |
Summary
- DPO uses a closed-form relationship to eliminate the RM and PPO, making alignment training as simple as SFT
- IPO adds regularization to prevent overfitting; KTO removes the need for paired data
- GRPO removes the Critic and uses group sampling to compute Advantage, balancing efficiency with online optimization
- Choosing a method requires trade-offs: training resources / data quality / performance requirements
- DeepSeek-R1 demonstrated GRPO’s enormous potential for reasoning tasks
In the next article, we will dive into reward design: ORM vs PRM, the deeper causes of reward hacking, and how reward models are evolving into verifiers.