Reward Design and Scaling
Updated 2026-04-06
The Reward Model Is Central to Alignment
Whether you choose RLHF, DPO, or GRPO, everything ultimately depends on some form of reward signal. RLHF explicitly trains an RM; DPO learns an implicit reward directly from preference pairs; GRPO scores groups of sampled responses with rules or an RM.
The quality of the RM directly determines the ceiling of alignment effectiveness. A perfect RM would mean perfect alignment — but in reality, RMs are always imperfect, which gives rise to a series of core challenges.
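To ground the "explicitly trains an RM" part, here is a minimal sketch of the standard pairwise (Bradley-Terry) loss used to train a scalar reward head. It assumes PyTorch, and the tensors below stand in for the scores a real RM would produce on (prompt, chosen, rejected) triples; nothing here comes from a specific library's RM API.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: widen the margin r(chosen) - r(rejected),
    # i.e. minimize -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scores a scalar-head RM might assign to three preference pairs.
r_chosen = torch.tensor([1.3, 0.2, 2.1])
r_rejected = torch.tensor([0.4, -0.1, 1.9])
loss = pairwise_rm_loss(r_chosen, r_rejected)  # in practice, backpropagated through the RM
```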
Outcome Reward vs Process Reward
Traditional RMs are Outcome Reward Models (ORM): they look only at the final result and assign a single score. But for reasoning tasks (math, code, logic), this coarse-grained signal has clear shortcomings.
Process Reward Models (PRM) score each step of the reasoning process, providing a more fine-grained supervision signal.
Key advantages of PRM:
- Can identify cases where “the answer happens to be correct but the reasoning is wrong”
- Provides node-level evaluation signals for MCTS-style search
- Better credit assignment (pinpoints which step went wrong)
However, PRM annotation costs are significantly higher — each step’s correctness must be labeled individually. OpenAI’s “Let’s Verify Step by Step” paper showed that PRM significantly outperforms ORM on mathematical reasoning.
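A toy sketch of the difference in granularity (pure Python, all names illustrative): an ORM returns one number for the final answer, while a PRM scores each step and aggregates; product and min are both common aggregation choices.

```python
from typing import List

def orm_score(outcome_correct_prob: float) -> float:
    # ORM: a single score based only on the final answer.
    return outcome_correct_prob

def prm_score(step_correct_probs: List[float]) -> float:
    # PRM: score every reasoning step, then aggregate.
    # Product = probability that all steps are correct; min is another common choice.
    score = 1.0
    for p in step_correct_probs:
        score *= p
    return score

# A solution whose final answer is right but whose second step is flawed:
steps = [0.95, 0.10, 0.92, 0.97]         # illustrative per-step probabilities from a PRM
print(orm_score(0.9), prm_score(steps))  # ORM looks fine (0.9); PRM exposes the bad chain (~0.08)
```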
Deep Dive into Reward Hacking
Goodhart’s Law (“when a measure becomes a target, it ceases to be a good measure”) manifests in RL alignment: once the RM score becomes the optimization target, the model finds “shortcuts” that maximize the score without genuinely improving quality.
Common reward hacking patterns:
- Verbose padding: RM prefers detailed responses, so the model learns to produce redundant content
- Sycophantic language: RM prefers friendly tone, so the model substitutes praise for substance
- Format wrapping: RM prefers structured output, so form trumps content
- Safety evasion: Safety RM penalizes too aggressively, so the model refuses to answer even normal questions
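These patterns are usually countered by shaping the reward rather than trusting the raw RM score. Below is a minimal sketch with illustrative (untuned) coefficients: a per-token length penalty to blunt verbose padding, plus a KL penalty to the reference policy so the model cannot drift arbitrarily far just to please the RM; the function and argument names are placeholders, not from any particular framework.

```python
def shaped_reward(rm_score: float,
                  num_response_tokens: int,
                  kl_to_reference: float,
                  length_coef: float = 0.001,
                  kl_coef: float = 0.05) -> float:
    # Subtract a per-token penalty so padding stops paying off,
    # and a KL penalty so the policy stays close to its reference model.
    return rm_score - length_coef * num_response_tokens - kl_coef * kl_to_reference

# Example: a verbose response with a high RM score loses its edge after shaping.
print(shaped_reward(rm_score=2.0, num_response_tokens=1200, kl_to_reference=8.0))  # 2.0 - 1.2 - 0.4 = 0.4
```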
Reward Model Scaling
The good news: larger RMs are harder to hack. Gao et al. (2022) fit scaling laws for reward-model overoptimization and found that its severity varies smoothly and predictably with both RM parameter count and preference-data volume: bigger RMs trained on more data hold up better as the policy is optimized harder against them.
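Paraphrasing the fitted forms from that paper (α and β are coefficients that depend on the proxy RM's size and training data; d measures how far the optimized policy has drifted from its initialization):

```
R_BoN(d) = d · (α_BoN - β_BoN · d)        (best-of-n sampling)
R_RL(d)  = d · (α_RL - β_RL · log d)      (RL optimization)
d        = sqrt( KL(π || π_init) )
```

Here R is the true (“gold”) reward: it rises, peaks, and then falls as the policy optimizes harder against the proxy RM, and larger RMs push that peak further out.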
This provides clear engineering guidance: invest in a larger, better RM rather than a more complex training algorithm.
Constitutional AI and Automated Rewards
Human preference annotation is expensive and hard to scale. Anthropic’s Constitutional AI proposed an alternative: let the LLM generate its own preference judgments.
The core idea behind this RLAIF (RL from AI Feedback) approach:
- Humans only define high-level principles (a “constitution”)
- The LLM self-evaluates and revises its responses according to these principles
- The before-and-after response pairs form training data
This dramatically reduces annotation costs, enabling alignment training to be automated at scale.
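A schematic sketch of the critique-and-revise loop described above (pure Python; `generate` stands in for any LLM call, and the two principles are illustrative, not an actual constitution):

```python
from typing import Callable, Tuple

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Prefer responses that are honest about uncertainty.",
]

def make_preference_pair(prompt: str, generate: Callable[[str], str]) -> Tuple[str, str]:
    # 1. Draft an initial response.
    draft = generate(prompt)
    # 2. Self-critique the draft against the constitution.
    critique = generate(
        "Critique this response against the principles:\n"
        + "\n".join(CONSTITUTION) + "\n\nResponse:\n" + draft
    )
    # 3. Revise the draft according to the critique.
    revision = generate(
        "Revise the response to address this critique:\n" + critique + "\n\nResponse:\n" + draft
    )
    # The (revised, original) pair becomes (chosen, rejected) training data.
    return revision, draft
```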
From Reward to Verifier
The Reward Model is evolving from a scorer, which assigns a single preference score, into a verifier, which checks whether a response (or each step of it) is actually correct.
This evolution paves the way for Test-Time Scaling: with a verifier, we can generate multiple candidate responses at inference time and use the verifier to select the best one — this is the central topic of the next article.
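As a preview, a minimal best-of-N sketch (pure Python; `sample` and `verify` are placeholders for a policy and a verifier, not real APIs):

```python
from typing import Callable, List

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              verify: Callable[[str, str], float],
              n: int = 8) -> str:
    # Draw n candidate responses, then keep the one the verifier scores highest.
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))
```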
Summary
- The RM is central to alignment, and its quality directly determines the alignment ceiling
- PRM outperforms ORM: step-by-step scoring provides more fine-grained signals, especially for reasoning tasks
- Reward Hacking is Goodhart’s Law in action; larger RMs are harder to hack
- Constitutional AI replaces human annotation with LLM self-evaluation, enabling large-scale RLAIF
- RM to PRM to Verifier: this evolution lays the foundation for test-time scaling
In the next article, we will explore test-time scaling: how to invest more computation at inference time to improve LLM output quality.