RLHF: Learning from Human Feedback

Updated 2026-04-06

Why Alignment Is Needed

Pretrained language models have a fundamental problem: their training objective is “predict the next token,” not “be a helpful, honest, and harmless assistant.” This means:

  • The model may generate harmful content (the training data contains harmful text)
  • The model doesn’t follow instructions (it learned to complete text, not answer questions)
  • The model fabricates facts (it doesn’t understand “truth,” only “what looks plausible”)
  • The model’s format and style don’t match user expectations

Alignment aims to make the model’s behavior match human expectations and values. The challenge is: “helpful and safe” cannot be directly encoded as a loss function. We need some way to turn human preferences into an optimization signal — this is the core motivation behind RLHF.

The Three Stages of RLHF

InstructGPT (2022) established the standard three-stage RLHF pipeline:

Figure: RLHF three-stage pipeline. SFT (π_SFT): learn to follow instructions → Reward Model (r_φ): quantify human preferences → PPO optimization: optimize the policy for LLM alignment. This is the core training pipeline used by InstructGPT (2022) and ChatGPT.

Stage 1: SFT (Supervised Fine-Tuning) Use human-written, high-quality (prompt, response) data to perform supervised fine-tuning on the pretrained model. This teaches the model the basic format and ability to “follow instructions.”

Stage 2: Reward Model Training Collect human preference data — for two responses to the same prompt, annotate which one is better — and train a Reward Model to quantify human preferences as scalar scores.

Stage 3: PPO Policy Optimization Use the Reward Model as the reward signal and optimize the LLM policy with PPO so that its generated responses achieve higher RM scores. A KL penalty is added to keep the policy from drifting too far from the SFT model.
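
Stage 1 has no dedicated section below, so here is a minimal sketch of what it optimizes: plain supervised learning, i.e. cross-entropy on the response tokens only, with the prompt tokens masked out of the loss. The batch layout and the helper function are illustrative, assuming a Hugging Face-style causal LM.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_lens):
    """Supervised fine-tuning loss: cross-entropy on response tokens only.

    logits:      (batch, seq_len, vocab) from a causal LM
    input_ids:   (batch, seq_len), prompt tokens followed by response tokens
    prompt_lens: (batch,) number of prompt tokens per example (illustrative layout)
    """
    # Standard causal-LM shift: position t predicts token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask prompt positions so that only the response is supervised
    positions = torch.arange(shift_labels.size(1), device=input_ids.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_lens.unsqueeze(1) - 1)
    shift_labels[prompt_mask] = -100  # ignore_index for cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```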

Reward Model Training

Training the Reward Model is the most critical step in RLHF. It needs to transform fuzzy “human preferences” into precise mathematical signals.

Figure: preference labeling simulator. For the prompt "Explain quantum entanglement", the annotator chooses the better of two candidate responses. Each preference pair (y_w ≻ y_l) feeds the Bradley-Terry model, P(y_w ≻ y_l) = σ(r(y_w) - r(y_l)): the RM learns to give higher scores to "better" responses and lower scores to "worse" ones, quantifying human preference as a scalar reward.
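
Architecturally, a reward model is typically the SFT backbone with the language-model head replaced by a scalar value head. A minimal sketch follows; the backbone name ("gpt2") and the last-non-padding-token pooling are illustrative choices, not something prescribed by the original pipeline.

```python
import torch
import torch.nn as nn
from transformers import AutoModel  # assumed dependency

class RewardModel(nn.Module):
    """Backbone LM plus a scalar head: maps a (prompt, response) pair to r_phi(x, y)."""

    def __init__(self, backbone_name="gpt2"):  # backbone choice is illustrative
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Read the score from the last non-padding token (one common convention)
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)  # (batch,) scalar scores
```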

The training data consists of preference pairs (y_w, y_l): for the same prompt x, human annotators choose y_w (winner) as preferred over y_l (loser).

The Bradley-Terry model frames preferences as probabilities:

P(y_w \succ y_l \mid x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))
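
For example, if the RM assigns r_φ(x, y_w) = 1.2 and r_φ(x, y_l) = 0.4, then P(y_w ≻ y_l | x) = σ(0.8) ≈ 0.69, i.e. the model predicts a roughly 69% chance that an annotator prefers y_w.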

Training loss function:

\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]
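
This loss translates almost line for line into code; a sketch, assuming the reward model returns one scalar score per sequence:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for the reward model.

    r_chosen:   (batch,) scores r_phi(x, y_w) for the preferred responses
    r_rejected: (batch,) scores r_phi(x, y_l) for the rejected responses
    Minimizing this pushes sigma(r_chosen - r_rejected) toward 1.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```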

Preference Pair Collection
Figure: Step 1, collect human preference data. For each prompt, annotators mark a winning response y_w and a losing response y_l, producing preference pairs (w ≻ l); the resulting preference dataset typically contains on the order of 10K-100K human-annotated pairs.
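
Concretely, each record is just a prompt with a chosen and a rejected response; the field names and example strings below are hypothetical.

```python
# Illustrative record layout for a preference dataset; field names are hypothetical
preference_pairs = [
    {
        "prompt": "Explain quantum entanglement",
        "chosen": "Quantum entanglement is a phenomenon in which two particles ...",
        "rejected": "Quantum entanglement is complicated and hard to explain ...",
    },
    # ... typically on the order of 10K-100K human-annotated pairs
]
```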

PPO Alignment Optimization

With the Reward Model in hand, we can use PPO to optimize the LLM. The optimization objective is:

\max_\theta \; \mathbb{E}_{x \sim D, \; y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y) - \beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})\right]

where π_ref is the SFT model (serving as the reference policy) and β is the KL penalty coefficient.

The meaning of this objective: maximize the RM score, but don't drift too far from the reference (SFT) model.
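
In practice this objective is commonly implemented by folding the KL term into the per-token rewards that PPO optimizes, with the RM score added only at the final token. The sketch below follows that InstructGPT-style convention; the tensor shapes and the simple log-ratio KL estimate are assumptions.

```python
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Combine the RM score with a per-token KL penalty before running PPO.

    rm_score:        scalar r_phi(x, y) for the whole response
    policy_logprobs: (resp_len,) log pi_theta(y_t | x, y_<t)
    ref_logprobs:    (resp_len,) log pi_ref(y_t | x, y_<t)
    """
    kl = policy_logprobs - ref_logprobs      # per-token log-ratio (KL estimate)
    rewards = -beta * kl                     # KL penalty applied at every token
    rewards[-1] = rewards[-1] + rm_score     # RM score only at the final token
    return rewards

# Example: a 5-token response with RM score 0.8
r = shaped_rewards(0.8,
                   torch.tensor([-1.2, -0.5, -0.9, -1.1, -0.3]),
                   torch.tensor([-1.0, -0.7, -0.8, -1.3, -0.4]))
```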

The Importance of the KL Constraint

The KL divergence KL(π_θ ‖ π_ref) measures the "distance" between the new policy and the reference policy. β controls the strength of this constraint.

Figure: effect of the KL penalty coefficient β on RM score, KL penalty, and total reward over training steps. With β = 0 (no penalty) the policy drifts into reward hacking; with a large β (e.g. 0.5) the constraint dominates and little optimization happens; a moderate β (e.g. 0.1) balances the RM score against the KL penalty and yields stable alignment.

What happens without the KL penalty? The model discovers weaknesses in the RM and exploits them aggressively — this is Reward Hacking.

Figure: reward hacking demo. Without the KL constraint, the model learns hacks such as verbose padding, flattering phrasing, and formatting tricks; because the RM prefers long answers, the model learns to pad (e.g. "Thank you so much for this excellent question! Let me give you a detailed answer to this very important question. First, I would like to say..." followed by hundreds of characters of filler). The RM score is high (0.92) while the true quality is low. This is Goodhart's Law in RL: "When a measure becomes a target, it ceases to be a good measure." The RM score is only an imperfect approximation of the true reward, so the model finds shortcuts that maximize the score without genuinely improving quality.
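
A practical guard is to watch proxies of hacking during training: if the RM score keeps rising while the KL from the reference policy and the average response length blow up, the "gains" are probably not real quality. The logging helper and thresholds below are purely hypothetical illustrations.

```python
def log_hacking_signals(step, rm_scores, kls, response_lens,
                        kl_limit=10.0, len_limit=512):
    """Crude reward-hacking monitor: a rising RM score combined with runaway KL or
    response length usually means the policy is gaming the RM, not improving.
    Thresholds here are illustrative, not tuned values."""
    mean_rm = sum(rm_scores) / len(rm_scores)
    mean_kl = sum(kls) / len(kls)
    mean_len = sum(response_lens) / len(response_lens)
    print(f"step={step}  rm_score={mean_rm:.3f}  kl={mean_kl:.2f}  len={mean_len:.0f}")
    if mean_kl > kl_limit or mean_len > len_limit:
        print("warning: possible reward hacking (policy drift / verbosity inflation)")
```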

Limitations of RLHF

Although RLHF has achieved tremendous success (InstructGPT, ChatGPT), it also has clear limitations:

  1. The Reward Model is a bottleneck

    • RM quality directly caps the alignment ceiling
    • Human preferences are inconsistent (different annotators may give opposite judgments)
    • RMs are easy to exploit (reward hacking)
  2. High training complexity

    • Requires running 4 models simultaneously: policy, reference policy, reward model, critic
    • PPO training is unstable and hyperparameter-sensitive
    • Significant computational resource requirements
  3. High annotation costs

    • Requires large amounts of high-quality preference data
    • Human annotations are noisy and biased

These limitations have given rise to alternatives like DPO and GRPO — discussed in detail in the next article.

LLM Alignment Method Timeline

  • 2017: "Deep RL from Human Preferences" (the original RLHF paper)
  • 2019: Fine-tuning language models with PPO + Reward Model
  • 2022.01: InstructGPT, SFT + RM + PPO (the 1.3B aligned model is preferred over the 175B base)
  • 2022.11: ChatGPT, RLHF at scale
  • 2023.07: Llama 2, RLHF + safety RM
  • 2023.12: DPO (Direct Preference Optimization)
  • 2024.02: GRPO (Group Relative Policy Optimization)
  • 2025.01: DeepSeek-R1, GRPO + rule-based rewards

Summary

This article provided a complete overview of the three-stage RLHF pipeline:

  1. SFT teaches the model the basic ability to follow instructions
  2. Reward Model quantifies human preferences as scalar scores (Bradley-Terry model)
  3. PPO optimizes the policy to align the LLM, with KL penalty to prevent reward hacking
  4. Reward Hacking is the primary risk when there is no KL constraint
  5. RLHF is successful but complex, giving rise to simpler alternatives like DPO and GRPO

In the next article, we will dive into DPO and GRPO, exploring how to skip the Reward Model and optimize the policy directly from preference data.