Reward Design and Scaling
Updated 2026-04-06
The Reward Model Is Central to Alignment
Whether you choose RLHF, DPO, or GRPO, everything ultimately depends on some form of reward signal. RLHF explicitly trains an RM; DPO learns an implicit reward directly from preference pairs; GRPO scores groups of sampled responses with rules or an RM.
The quality of the RM directly determines the ceiling of alignment effectiveness. A perfect RM would mean perfect alignment — but in reality, RMs are always imperfect, which gives rise to a series of core challenges.
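To ground the "explicitly trains an RM" part, here is a minimal sketch of the standard pairwise (Bradley-Terry) loss used to train a scalar reward head. It assumes PyTorch, and the tensors below stand in for the scores a real RM would produce on (prompt, chosen, rejected) triples; nothing here comes from a specific library's RM API.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: widen the margin r(chosen) - r(rejected),
    # i.e. minimize -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scores a scalar-head RM might assign to three preference pairs.
r_chosen = torch.tensor([1.3, 0.2, 2.1])
r_rejected = torch.tensor([0.4, -0.1, 1.9])
loss = pairwise_rm_loss(r_chosen, r_rejected)  # in practice, backpropagated through the RM
```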
Outcome Reward vs Process Reward
Traditional RMs are Outcome Reward Models (ORM): they look only at the final result and assign a single score. But for reasoning tasks (math, code, logic), this coarse-grained signal has clear shortcomings.
Process Reward Models (PRM) score each step of the reasoning process, providing a more fine-grained supervision signal.
Key advantages of PRM:
- Can identify cases where “the answer happens to be correct but the reasoning is wrong”
- Provides node-level evaluation signals for MCTS-style search
- Better credit assignment (pinpoints which step went wrong)
However, PRM annotation costs are significantly higher — each step’s correctness must be labeled individually. OpenAI’s “Let’s Verify Step by Step” paper showed that PRM significantly outperforms ORM on mathematical reasoning.
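A toy sketch of the difference in granularity (pure Python, all names illustrative): an ORM returns one number for the final answer, while a PRM scores each step and aggregates; product and min are both common aggregation choices.

```python
from typing import List

def orm_score(outcome_correct_prob: float) -> float:
    # ORM: a single score based only on the final answer.
    return outcome_correct_prob

def prm_score(step_correct_probs: List[float]) -> float:
    # PRM: score every reasoning step, then aggregate.
    # Product = probability that all steps are correct; min is another common choice.
    score = 1.0
    for p in step_correct_probs:
        score *= p
    return score

# A solution whose final answer is right but whose second step is flawed:
steps = [0.95, 0.10, 0.92, 0.97]         # illustrative per-step probabilities from a PRM
print(orm_score(0.9), prm_score(steps))  # ORM looks fine (0.9); PRM exposes the bad chain (~0.08)
```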
Deep Dive into Reward Hacking
Goodhart’s Law (“when a measure becomes a target, it ceases to be a good measure”) manifests in RL alignment: once the RM score becomes the optimization target, the model finds “shortcuts” that maximize the score without genuinely improving quality.
Common reward hacking patterns:
- Verbose padding: RM prefers detailed responses, so the model learns to produce redundant content
- Sycophantic language: RM prefers friendly tone, so the model substitutes praise for substance
- Format wrapping: RM prefers structured output, so form trumps content
- Safety evasion: Safety RM penalizes too aggressively, so the model refuses to answer even normal questions
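These patterns are usually countered by shaping the reward rather than trusting the raw RM score. Below is a minimal sketch with illustrative (untuned) coefficients: a per-token length penalty to blunt verbose padding, plus a KL penalty to the reference policy so the model cannot drift arbitrarily far just to please the RM; the function and argument names are placeholders, not from any particular framework.

```python
def shaped_reward(rm_score: float,
                  num_response_tokens: int,
                  kl_to_reference: float,
                  length_coef: float = 0.001,
                  kl_coef: float = 0.05) -> float:
    # Subtract a per-token penalty so padding stops paying off,
    # and a KL penalty so the policy stays close to its reference model.
    return rm_score - length_coef * num_response_tokens - kl_coef * kl_to_reference

# Example: a verbose response with a high RM score loses its edge after shaping.
print(shaped_reward(rm_score=2.0, num_response_tokens=1200, kl_to_reference=8.0))  # 2.0 - 1.2 - 0.4 = 0.4
```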
Reward Model Scaling
The good news: larger RMs are harder to hack. Gao et al. (2022) fit scaling laws for reward-model overoptimization and found that its severity varies smoothly and predictably with both RM parameter count and preference-data volume: bigger RMs trained on more data hold up better as the policy is optimized harder against them.
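Paraphrasing the fitted forms from that paper (α and β are coefficients that depend on the proxy RM's size and training data; d measures how far the optimized policy has drifted from its initialization):

```
R_BoN(d) = d · (α_BoN - β_BoN · d)        (best-of-n sampling)
R_RL(d)  = d · (α_RL - β_RL · log d)      (RL optimization)
d        = sqrt( KL(π || π_init) )
```

Here R is the true (“gold”) reward: it rises, peaks, and then falls as the policy optimizes harder against the proxy RM, and larger RMs push that peak further out.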
This provides clear engineering guidance: invest in a larger, better RM rather than a more complex training algorithm.
Constitutional AI and Automated Rewards
Human preference annotation is expensive and hard to scale. Anthropic’s Constitutional AI proposed an alternative: let the LLM generate its own preference judgments.
The core idea behind this RLAIF (RL from AI Feedback) approach:
- Humans only define high-level principles (a “constitution”)
- The LLM self-evaluates and revises its responses according to these principles
- The before-and-after response pairs form training data
This dramatically reduces annotation costs, enabling alignment training to be automated at scale.
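A schematic sketch of the critique-and-revise loop described above (pure Python; `generate` stands in for any LLM call, and the two principles are illustrative, not an actual constitution):

```python
from typing import Callable, Tuple

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Prefer responses that are honest about uncertainty.",
]

def make_preference_pair(prompt: str, generate: Callable[[str], str]) -> Tuple[str, str]:
    # 1. Draft an initial response.
    draft = generate(prompt)
    # 2. Self-critique the draft against the constitution.
    critique = generate(
        "Critique this response against the principles:\n"
        + "\n".join(CONSTITUTION) + "\n\nResponse:\n" + draft
    )
    # 3. Revise the draft according to the critique.
    revision = generate(
        "Revise the response to address this critique:\n" + critique + "\n\nResponse:\n" + draft
    )
    # The (revised, original) pair becomes (chosen, rejected) training data.
    return revision, draft
```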
From Reward to Verifier
The Reward Model is evolving from a scorer, which assigns a single preference score, into a verifier, which checks whether a response (or each step of it) is actually correct.
This evolution paves the way for Test-Time Scaling: with a verifier, we can generate multiple candidate responses at inference time and use the verifier to select the best one — this is the central topic of the next article.
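As a preview, a minimal best-of-N sketch (pure Python; `sample` and `verify` are placeholders for a policy and a verifier, not real APIs):

```python
from typing import Callable, List

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              verify: Callable[[str, str], float],
              n: int = 8) -> str:
    # Draw n candidate responses, then keep the one the verifier scores highest.
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))
```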
Summary
- The RM is central to alignment, and its quality directly determines the alignment ceiling
- PRM outperforms ORM: step-by-step scoring provides more fine-grained signals, especially for reasoning tasks
- Reward Hacking is Goodhart’s Law in action; larger RMs are harder to hack
- Constitutional AI replaces human annotation with LLM self-evaluation, enabling large-scale RLAIF
- RM to PRM to Verifier: this evolution lays the foundation for test-time scaling
In the next article, we will explore test-time scaling: how to invest more computation at inference time to improve LLM output quality.