
From DPO to GRPO: Direct Preference Optimization

Updated 2026-04-06

Pain Points of RLHF

In the previous article, we covered the complete three-stage RLHF pipeline. While it successfully powered the creation of ChatGPT, it also exposed several core pain points:

  1. High training complexity: requires running 4 models simultaneously (policy, reference, RM, critic), and PPO training is unstable
  2. The Reward Model is a bottleneck: RM quality directly caps the alignment ceiling, and the RM is easily exploited (reward hacking)
  3. Hyperparameter sensitivity: PPO’s clip epsilon, learning rate, KL penalty beta, and other parameters require careful tuning

Can we skip the RM and PPO entirely and optimize the policy directly from preference data? This is the core motivation behind DPO.

Figure: RLHF vs DPO architecture comparison. RLHF keeps 4 models on the GPU (policy π_θ, reference π_ref, reward model, critic V(s;w)), and each update runs prompt → policy generation → RM scoring → critic value estimate → advantage → PPO update; DPO keeps only 2 models. DPO's key insight: the optimal policy and the reward have a closed-form relationship, so the RM can be eliminated.

Core Derivation of DPO

DPO’s key insight is: within the RLHF framework, there exists a closed-form relationship between the optimal policy and the reward function.

Starting from RLHF’s KL-constrained optimization objective:

\max_\pi \mathbb{E}[r(x,y)] - \beta \cdot KL(\pi \| \pi_{ref})

We can derive the optimal policy as:

\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right)

where Z(x) = \sum_y \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right) is the partition function, which is intractable to compute directly.

Conversely, the reward can be expressed in terms of the policy:

r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)

Substituting this relationship into the Bradley-Terry model, P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right), the Z(x) term cancels out (it depends only on x, not on y), yielding the DPO Loss:

\mathcal{L}_{DPO}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right)\right]

Figure: DPO loss L_{DPO} = -\log \sigma(\beta \cdot \text{margin}) as a function of the margin \log\frac{\pi_\theta}{\pi_{ref}}(y_w|x) - \log\frac{\pi_\theta}{\pi_{ref}}(y_l|x), plotted for several values of β. A higher β gives a steeper loss curve that is more sensitive to the preference gap; a lower β is smoother and tolerates larger deviation from the reference.
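To make the loss concrete, here is a minimal PyTorch sketch of the DPO objective. It assumes the per-token log-probabilities of each response have already been summed into sequence-level log-probs; the tensor names and the default beta are illustrative, not taken from a specific implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sequence-level DPO loss.

    Each argument has shape (batch,) and holds the summed log-probability
    of the chosen (y_w) or rejected (y_l) response under the trainable
    policy pi_theta or the frozen reference pi_ref.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # L_DPO = -log sigma(chosen - rejected); logsigmoid is numerically stable
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Only the policy receives gradients; the reference log-probs can be computed once under torch.no_grad(), which is why DPO needs just two models in memory.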

Advantages and Issues of DPO

Advantages:

  • Eliminates the Reward Model and PPO, requiring only 2 models (policy + reference)
  • Training is as simple as SFT (forward pass + backward pass)
  • No online sampling needed; trains directly on offline preference data

Issues:

  • Offline distribution shift: training data comes from an old policy; as the model updates, the data no longer matches the current policy
  • Sensitive to data quality: noise in preference pairs directly affects the optimization direction
  • Prone to overfitting: especially evident on small datasets

IPO and KTO

To address DPO’s problems, researchers have proposed several variants:

IPO (Identity Preference Optimization): adds a regularization term to prevent overfitting, so the model doesn’t need to push the margin of preference pairs to infinity.
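For intuition, the IPO objective (as introduced in the IPO paper; the notation below is not from this article) replaces DPO's logistic loss with a squared loss that pulls the preference margin toward a fixed target, so there is nothing to gain from pushing it further:

\mathcal{L}_{IPO}(\theta) = \mathbb{E}\left[\left(\log \frac{\pi_\theta(y_w|x)\,\pi_{ref}(y_l|x)}{\pi_\theta(y_l|x)\,\pi_{ref}(y_w|x)} - \frac{1}{2\tau}\right)^2\right]

where τ is a regularization strength playing a role analogous to DPO's β.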

KTO (Kahneman-Tversky Optimization): the biggest innovation is that it doesn’t require paired preference data — it only needs to know whether each response is “good” or “bad,” significantly reducing annotation costs.

GRPO: DeepSeek’s Approach

GRPO (Group Relative Policy Optimization) comes from DeepSeek, with a core innovation of eliminating the Critic network:

  1. For the same prompt, sample a group (G) of responses
  2. Use a reward function (rules-based or an RM) to score each response
  3. Compute the Advantage from within-group relative ranking: A_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}
  4. Update the policy using a PPO-style clipped objective (a minimal sketch of steps 3 and 4 follows after the figure below)
Figure: GRPO group sampling. For the same prompt, sample G responses (G = 8 shown), score each one, and normalize the scores within the group: A_i = (r_i - mean(r)) / std(r). Responses above the group mean get a positive advantage, those below get a negative one. A larger G gives a lower-variance advantage estimate but increases cost linearly; in practice G is around 8 to 64.
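The following PyTorch sketch implements steps 3 and 4 for a single prompt, under simplifying assumptions (sequence-level rather than per-token importance ratios, and no KL penalty toward the reference model); the names and the clip range are illustrative.

```python
import torch

def grpo_loss(policy_logps: torch.Tensor,
              old_logps: torch.Tensor,
              rewards: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """Group-relative PPO-style loss for one prompt.

    policy_logps: (G,) log-prob of each sampled response under the current policy
    old_logps:    (G,) log-prob under the (frozen) policy that generated the samples
    rewards:      (G,) scalar reward for each response in the group
    """
    # Step 3: advantage from within-group normalization -- no critic needed
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Step 4: PPO-style clipped surrogate objective
    ratio = torch.exp(policy_logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The full GRPO objective described by DeepSeek also adds a KL term against the reference model and averages over tokens; those details are omitted here for brevity.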

GRPO’s advantage is that it doesn’t need a Critic network (saving one large model’s worth of GPU memory), and online sampling avoids distribution shift. DeepSeek-R1 used GRPO with rule-based rewards to train a model that exhibited emergent thinking capabilities.
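As an illustration of what a rule-based reward can look like for a math task (a toy example, not DeepSeek's actual implementation), a reward function can simply check the extracted final answer and add a small format bonus:

```python
import re

def math_rule_reward(response: str, reference_answer: str) -> float:
    """Toy rule-based reward: small format bonus plus +1 for a correct final answer."""
    reward = 0.0

    # Format check: did the model wrap its final answer in \boxed{...}?
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match:
        reward += 0.1
        # Correctness check: exact string match against the reference answer
        if match.group(1).strip() == reference_answer.strip():
            reward += 1.0

    return reward
```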

Figure: DPO is offline, training on a fixed preference dataset. The data is collected before training, goes through one-time DPO training, and the model never sees responses sampled from its updated policy. This keeps training simple and requires no online sampling, but it is the source of DPO's distribution-shift and overfitting issues.

Method Selection

Figure: method evolution map (RLHF → DPO → IPO / KTO → GRPO). Each method solves issues of its predecessor while introducing new challenges; there is no perfect solution, and the choice depends on resources, data quality, and performance needs.

There is no perfect alignment method. The choice depends on your constraints:

| Dimension | RLHF | DPO | GRPO |
| --- | --- | --- | --- |
| Training complexity | High (4 models) | Low (2 models) | Medium (2 models + online generation) |
| Data requirements | Preference pairs + prompts | Preference pairs | Prompts + reward rules |
| Training stability | PPO is unstable | Stable, like SFT | Fairly stable |
| Performance ceiling | High (online optimization) | Medium (offline limitations) | High (online + emergence) |
| Use cases | Best alignment quality | Fast iteration, limited resources | Math/reasoning tasks |
Figure: training resource comparison (RLHF vs DPO vs GRPO). Concurrent models: 4 / 2 / 2; relative GPU memory: 100% / 40% / 60%; relative training time: 100% / 30% / 60%. DPO is the lightest (no RM or critic), GRPO sits in between (no critic, but online generation), and RLHF is the heaviest.

Summary

  1. DPO uses a closed-form relationship to eliminate the RM and PPO, making alignment training as simple as SFT
  2. IPO adds regularization to prevent overfitting; KTO removes the need for paired data
  3. GRPO removes the Critic and uses group sampling to compute Advantage, balancing efficiency with online optimization
  4. Choosing a method requires trade-offs: training resources / data quality / performance requirements
  5. DeepSeek-R1 demonstrated GRPO’s enormous potential for reasoning tasks

In the next article, we will dive into reward design: ORM vs PRM, the deeper causes of reward hacking, and how reward models are evolving into verifiers.