RLHF: Learning from Human Feedback
Updated 2026-04-06
Why Alignment Is Needed
Pretrained language models have a fundamental problem: their training objective is “predict the next token,” not “be a helpful, honest, and harmless assistant.” This means:
- The model may generate harmful content (the training data contains harmful text)
- The model doesn’t follow instructions (it learned to complete text, not answer questions)
- The model fabricates facts (it doesn’t understand “truth,” only “what looks plausible”)
- The model’s format and style don’t match user expectations
Alignment aims to make the model’s behavior match human expectations and values. The challenge is: “helpful and safe” cannot be directly encoded as a loss function. We need some way to turn human preferences into an optimization signal — this is the core motivation behind RLHF.
The Three Stages of RLHF
InstructGPT (2022) established the standard three-stage RLHF pipeline:
Stage 1: SFT (Supervised Fine-Tuning). Use human-written, high-quality (prompt, response) data to perform supervised fine-tuning on the pretrained model. This teaches the model the basic format and the ability to follow instructions (a minimal loss sketch follows this overview).
Stage 2: Reward Model Training. Collect human preference data (for two responses to the same prompt, annotators mark which one is better) and train a Reward Model to quantify human preferences as scalar scores.
Stage 3: PPO Policy Optimization. Use the Reward Model as the reward signal and run PPO to optimize the LLM policy so that its generated responses achieve higher RM scores. A KL penalty is added to prevent the policy from drifting too far from the SFT model.
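To make Stage 1 concrete, here is a minimal PyTorch sketch of the SFT loss: next-token cross-entropy computed only over response tokens, with prompt tokens masked out. The function name, tensor names, and the assumption that labels are already shifted for next-token prediction are illustrative, not taken from any specific codebase.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, response_mask):
    """Stage 1 (SFT): next-token cross-entropy on response tokens only.

    logits:        (batch, seq_len, vocab) model outputs for prompt + response
    labels:        (batch, seq_len)        target token ids, already shifted by one
    response_mask: (batch, seq_len)        1.0 for response tokens, 0.0 for prompt / padding
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * seq_len, vocab)
        labels.reshape(-1),                    # (batch * seq_len,)
        reduction="none",
    ).view(labels.shape)
    # Mask out prompt tokens so only the response contributes to the loss.
    return (per_token * response_mask).sum() / response_mask.sum()
```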
Reward Model Training
Training the Reward Model is the most critical step in RLHF. It needs to transform fuzzy “human preferences” into precise mathematical signals.
The training data consists of preference pairs $(x, y_w, y_l)$: for the same prompt $x$, human annotators choose $y_w$ (winner) as preferred over $y_l$ (loser).
The Bradley-Terry model frames preferences as probabilities:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

where $r_\theta$ is the reward model's scalar score and $\sigma$ is the sigmoid function.
Training loss function:

$$\mathcal{L}_{\text{RM}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$
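A minimal PyTorch sketch of this pairwise loss, assuming the reward model already maps each (prompt, response) pair to a single scalar score; the function and variable names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Bradley-Terry pairwise loss: -log sigma(r(x, y_w) - r(x, y_l)).

    chosen_scores:   (batch,) RM scores for the preferred responses y_w
    rejected_scores: (batch,) RM scores for the rejected responses y_l
    Minimizing this pushes the winner's score above the loser's.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy check: when the chosen scores are already higher, the loss is small.
loss = reward_model_loss(torch.tensor([2.0, 1.5]), torch.tensor([0.0, -0.5]))
```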
PPO Alignment Optimization
With the Reward Model in hand, we can use PPO to optimize the LLM. The optimization objective is:

$$\max_{\pi_\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

where $\pi_{\mathrm{ref}}$ is the SFT model (serving as the reference policy) and $\beta$ is the KL penalty coefficient.
The meaning of this objective: maximize the RM score, but don't drift too far from the reference (SFT) model.
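In InstructGPT-style implementations this objective is typically applied per token during rollouts: the reward passed to PPO is a KL penalty at every token, plus the RM score added at the final token. A hedged sketch of that computation (tensor names and shapes are assumptions):

```python
import torch

@torch.no_grad()  # rewards are treated as constants during the PPO update
def kl_shaped_rewards(rm_scores, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token PPO rewards: RM score minus a KL penalty toward the SFT reference.

    rm_scores:       (batch,)           reward-model score for each sampled response
    policy_logprobs: (batch, resp_len)  log pi_theta(y_t | x, y_<t) for sampled tokens
    ref_logprobs:    (batch, resp_len)  log pi_ref(y_t | x, y_<t) under the frozen SFT model
    beta:            KL penalty coefficient (the beta in the objective above)
    """
    # Per-token estimate of KL(pi_theta || pi_ref) along the sampled trajectory.
    kl = policy_logprobs - ref_logprobs      # (batch, resp_len)
    rewards = -beta * kl                     # KL penalty at every token
    rewards[:, -1] += rm_scores              # sequence-level RM score at the last token
    return rewards
```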
The Importance of the KL Constraint
The KL divergence $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ measures the "distance" between the new policy and the reference policy. The coefficient $\beta$ controls the strength of this constraint.
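For intuition, the KL term compares the next-token distributions of the two models. A tiny illustrative computation with made-up distributions (not taken from any real model):

```python
import torch

# Hypothetical next-token distributions over a 4-token vocabulary.
policy_probs = torch.tensor([0.70, 0.20, 0.05, 0.05])   # current policy pi_theta
ref_probs    = torch.tensor([0.40, 0.40, 0.10, 0.10])   # frozen SFT reference pi_ref

# KL(pi_theta || pi_ref) = sum_i pi_theta(i) * log(pi_theta(i) / pi_ref(i))
kl = torch.sum(policy_probs * (policy_probs.log() - ref_probs.log()))
print(f"KL divergence: {kl.item():.4f}")  # grows as the policy drifts from the reference
```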
What happens without the KL penalty? The model discovers weaknesses in the RM and exploits them aggressively — this is Reward Hacking.
Limitations of RLHF
Although RLHF has achieved tremendous success (InstructGPT, ChatGPT), it also has clear limitations:
- The Reward Model is a bottleneck
  - RM quality directly caps the alignment ceiling
  - Human preferences are inconsistent (different annotators may give opposite judgments)
  - RMs are easy to exploit (reward hacking)
- High training complexity
  - Requires running four models simultaneously: policy, reference policy, reward model, critic
  - PPO training is unstable and hyperparameter-sensitive
  - Significant computational resource requirements
- High annotation costs
  - Requires large amounts of high-quality preference data
  - Human annotations are noisy and biased
These limitations have given rise to alternatives like DPO and GRPO — discussed in detail in the next article.
Summary
This article provided a complete overview of the three-stage RLHF pipeline:
- SFT teaches the model the basic ability to follow instructions
- Reward Model quantifies human preferences as scalar scores (Bradley-Terry model)
- PPO optimizes the policy to align the LLM, with KL penalty to prevent reward hacking
- Reward Hacking is the primary risk when there is no KL constraint
- RLHF is successful but complex, giving rise to simpler alternatives like DPO and GRPO
In the next article, we will dive into DPO and GRPO, exploring how to skip the Reward Model and optimize the policy directly from preference data.