
From DPO to GRPO: Direct Preference Optimization

Updated 2026-04-06

Pain Points of RLHF

In the previous article, we covered the complete three-stage RLHF pipeline. While it successfully powered the creation of ChatGPT, it also exposed several core pain points:

  1. High training complexity: requires running 4 models simultaneously (policy, reference, RM, critic), and PPO training is unstable
  2. The Reward Model is a bottleneck: RM quality directly caps the alignment ceiling, and the RM is easily exploited (reward hacking)
  3. Hyperparameter sensitivity: PPO’s clip epsilon, learning rate, KL penalty beta, and other parameters require careful tuning

Can we skip the RM and PPO entirely and optimize the policy directly from preference data? This is the core motivation behind DPO.

Figure: RLHF vs DPO architecture comparison. RLHF keeps 4 models on the GPU (policy π_θ, reference π_ref, reward model, critic V(s;w)), and each update runs prompt → policy generation → RM scoring → critic value estimate → advantage → PPO update; DPO keeps only 2 models. DPO's key insight: the optimal policy and the reward have a closed-form relationship, so the RM can be eliminated.

Core Derivation of DPO

DPO’s key insight is: within the RLHF framework, there exists a closed-form relationship between the optimal policy and the reward function.

Starting from RLHF’s KL-constrained optimization objective:

\max_\pi \mathbb{E}[r(x,y)] - \beta \cdot KL(\pi \| \pi_{ref})

We can derive the optimal policy as:

\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right)

where Z(x) = \sum_y \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right) is the partition function, which is intractable to compute directly.

Conversely, the reward can be expressed in terms of the policy:

r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)

Substituting this relationship into the Bradley-Terry model, P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right), the Z(x) term cancels out (it depends only on x, not on y), yielding the DPO Loss:

\mathcal{L}_{DPO}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right)\right]

Figure: DPO loss L_{DPO} = -\log \sigma(\beta \cdot \text{margin}) as a function of the margin \log\frac{\pi_\theta}{\pi_{ref}}(y_w|x) - \log\frac{\pi_\theta}{\pi_{ref}}(y_l|x), plotted for several values of β. A higher β gives a steeper loss curve that is more sensitive to the preference gap; a lower β is smoother and tolerates larger deviation from the reference.
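To make the loss concrete, here is a minimal PyTorch sketch of the DPO objective. It assumes the per-token log-probabilities of each response have already been summed into sequence-level log-probs; the tensor names and the default beta are illustrative, not taken from a specific implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sequence-level DPO loss.

    Each argument has shape (batch,) and holds the summed log-probability
    of the chosen (y_w) or rejected (y_l) response under the trainable
    policy pi_theta or the frozen reference pi_ref.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # L_DPO = -log sigma(chosen - rejected); logsigmoid is numerically stable
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Only the policy receives gradients; the reference log-probs can be computed once under torch.no_grad(), which is why DPO needs just two models in memory.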

Advantages and Issues of DPO

Advantages:

  • Eliminates the Reward Model and PPO, requiring only 2 models (policy + reference)
  • Training is as simple as SFT (forward pass + backward pass)
  • No online sampling needed; trains directly on offline preference data

Issues:

  • Offline distribution shift: training data comes from an old policy; as the model updates, the data no longer matches the current policy
  • Sensitive to data quality: noise in preference pairs directly affects the optimization direction
  • Prone to overfitting: especially evident on small datasets

IPO and KTO

To address DPO’s problems, researchers have proposed several variants:

IPO (Identity Preference Optimization): adds a regularization term to prevent overfitting, so the model doesn’t need to push the margin of preference pairs to infinity.
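For intuition, the IPO objective (as introduced in the IPO paper; the notation below is not from this article) replaces DPO's logistic loss with a squared loss that pulls the preference margin toward a fixed target, so there is nothing to gain from pushing it further:

\mathcal{L}_{IPO}(\theta) = \mathbb{E}\left[\left(\log \frac{\pi_\theta(y_w|x)\,\pi_{ref}(y_l|x)}{\pi_\theta(y_l|x)\,\pi_{ref}(y_w|x)} - \frac{1}{2\tau}\right)^2\right]

where τ is a regularization strength playing a role analogous to DPO's β.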

KTO (Kahneman-Tversky Optimization): the biggest innovation is that it doesn’t require paired preference data — it only needs to know whether each response is “good” or “bad,” significantly reducing annotation costs.

GRPO: DeepSeek’s Approach

GRPO (Group Relative Policy Optimization) comes from DeepSeek, with a core innovation of eliminating the Critic network:

  1. For the same prompt, sample a group (G) of responses
  2. Use a reward function (rules-based or an RM) to score each response
  3. Compute the Advantage from within-group relative ranking: A_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}
  4. Update the policy using a PPO-style clipped objective (a minimal sketch of steps 3 and 4 follows after the figure below)
Figure: GRPO group sampling. For the same prompt, sample G responses (G = 8 shown), score each one, and normalize the scores within the group: A_i = (r_i - mean(r)) / std(r). Responses above the group mean get a positive advantage, those below get a negative one. A larger G gives a lower-variance advantage estimate but increases cost linearly; in practice G is around 8 to 64.
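The following PyTorch sketch implements steps 3 and 4 for a single prompt, under simplifying assumptions (sequence-level rather than per-token importance ratios, and no KL penalty toward the reference model); the names and the clip range are illustrative.

```python
import torch

def grpo_loss(policy_logps: torch.Tensor,
              old_logps: torch.Tensor,
              rewards: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """Group-relative PPO-style loss for one prompt.

    policy_logps: (G,) log-prob of each sampled response under the current policy
    old_logps:    (G,) log-prob under the (frozen) policy that generated the samples
    rewards:      (G,) scalar reward for each response in the group
    """
    # Step 3: advantage from within-group normalization -- no critic needed
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Step 4: PPO-style clipped surrogate objective
    ratio = torch.exp(policy_logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The full GRPO objective described by DeepSeek also adds a KL term against the reference model and averages over tokens; those details are omitted here for brevity.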

GRPO’s advantage is that it doesn’t need a Critic network (saving one large model’s worth of GPU memory), and online sampling avoids distribution shift. DeepSeek-R1 used GRPO with rule-based rewards to train a model that exhibited emergent thinking capabilities.
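As an illustration of what a rule-based reward can look like for a math task (a toy example, not DeepSeek's actual implementation), a reward function can simply check the extracted final answer and add a small format bonus:

```python
import re

def math_rule_reward(response: str, reference_answer: str) -> float:
    """Toy rule-based reward: small format bonus plus +1 for a correct final answer."""
    reward = 0.0

    # Format check: did the model wrap its final answer in \boxed{...}?
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match:
        reward += 0.1
        # Correctness check: exact string match against the reference answer
        if match.group(1).strip() == reference_answer.strip():
            reward += 1.0

    return reward
```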

Figure: DPO is offline, training on a fixed preference dataset. The data is collected before training, goes through one-time DPO training, and the model never sees responses sampled from its updated policy. This keeps training simple and requires no online sampling, but it is the source of DPO's distribution-shift and overfitting issues.

Method Selection

Figure: method evolution map (RLHF → DPO → IPO / KTO → GRPO). Each method solves issues of its predecessor while introducing new challenges; there is no perfect solution, and the choice depends on resources, data quality, and performance needs.

There is no perfect alignment method. The choice depends on your constraints:

| Dimension | RLHF | DPO | GRPO |
| --- | --- | --- | --- |
| Training complexity | High (4 models) | Low (2 models) | Medium (2 models + online generation) |
| Data requirements | Preference pairs + prompts | Preference pairs | Prompts + reward rules |
| Training stability | PPO is unstable | Stable, like SFT | Fairly stable |
| Performance ceiling | High (online optimization) | Medium (offline limitations) | High (online + emergence) |
| Use cases | Best alignment quality | Fast iteration, limited resources | Math/reasoning tasks |
Figure: training resource comparison (RLHF vs DPO vs GRPO). Concurrent models: 4 / 2 / 2; relative GPU memory: 100% / 40% / 60%; relative training time: 100% / 30% / 60%. DPO is the lightest (no RM or critic), GRPO sits in between (no critic, but online generation), and RLHF is the heaviest.

Summary

  1. DPO uses a closed-form relationship to eliminate the RM and PPO, making alignment training as simple as SFT
  2. IPO adds regularization to prevent overfitting; KTO removes the need for paired data
  3. GRPO removes the Critic and uses group sampling to compute Advantage, balancing efficiency with online optimization
  4. Choosing a method requires trade-offs: training resources / data quality / performance requirements
  5. DeepSeek-R1 demonstrated GRPO’s enormous potential for reasoning tasks

In the next article, we will dive into reward design: ORM vs PRM, the deeper causes of reward hacking, and how reward models are evolving into verifiers.