Test-Time Scaling and Reasoning Enhancement
Updated 2026-04-06
Train-Time vs Test-Time Scaling
Traditional scaling laws focus on train-time scaling: increasing model parameters, training data, and compute leads to steady performance improvements. But this curve is flattening — the gains from scaling from 100B to 1T parameters are becoming increasingly marginal.
Test-Time Scaling proposes a different approach: keep the model size fixed and invest more compute at inference time to improve output quality.
The key finding from Snell et al. (2024): on certain tasks, increasing inference-time compute is more cost-effective than scaling up the model. A 14B model with sufficient test-time compute can outperform the direct output of a 70B model.
Chain-of-Thought from an RL Perspective
Chain-of-Thought (CoT) is more than just a prompting trick. From an RL perspective:
- Each reasoning step = an action
- The chain of thought = a trajectory
- The correctness of the final answer = the reward
This means we can use RL to optimize how the model thinks — not just optimizing the final answer, but optimizing the entire reasoning process. This is the core idea behind DeepSeek-R1.
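To make the mapping concrete, here is a minimal sketch of a chain of thought framed as an RL trajectory with a terminal reward. The `ReasoningStep`, `Trajectory`, and `terminal_reward` names are illustrative only, not from any specific library:

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    text: str                    # one reasoning step = one action

@dataclass
class Trajectory:
    steps: list[ReasoningStep] = field(default_factory=list)  # the chain of thought
    final_answer: str = ""

def terminal_reward(traj: Trajectory, gold_answer: str) -> float:
    """Reward is assigned only at the end: 1.0 if the final answer is correct,
    0.0 otherwise. Intermediate steps get no direct reward; credit assignment
    over the whole chain is left to the RL algorithm."""
    return 1.0 if traj.final_answer.strip() == gold_answer.strip() else 0.0
```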
Best-of-N and Rejection Sampling
The simplest test-time scaling method: generate N responses and use a verifier to select the best one.
Best-of-N’s advantage is its simplicity, but computational cost scales linearly with N. For easy problems, N=1 suffices; for hard problems, you may need N=64 or more. The key requirement is a good verifier to select the best answer.
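A minimal sketch of Best-of-N, assuming a `generate(prompt)` call that samples one response and a `verifier(prompt, response)` call that returns a scalar score; both are hypothetical stand-ins for your model and verifier APIs:

```python
def best_of_n(prompt: str, generate, verifier, n: int = 16) -> str:
    """Sample n candidate responses independently and return the one the
    verifier scores highest. Compute cost grows linearly with n."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [verifier(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```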
MCTS + LLM
A more advanced test-time scaling approach models the reasoning process as a tree search. Drawing from the AlphaGo/AlphaZero paradigm:
- Each node represents a reasoning step
- A PRM (Process Reward Model) evaluates the “value” of each node
- The MCTS policy determines which reasoning paths to explore
The four-step MCTS loop:
- Select: Starting from the root, use the UCB formula to choose the most promising path
- Expand: Generate new reasoning steps at the leaf node
- Evaluate: Use the PRM to assess the quality of the new step
- Backpropagate: Propagate the evaluation result back to update the entire path
This approach is more efficient than Best-of-N because instead of independently sampling N paths, it intelligently explores and prunes.
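The sketch below shows one way the four-step loop could look in code. `expand_step` (the LLM proposing candidate next reasoning steps) and `prm_score` (the process reward model) are hypothetical placeholders, and the selection rule is the standard UCB1 formula: mean value plus an exploration bonus.

```python
import math
import random

class Node:
    def __init__(self, state: str, parent=None):
        self.state = state          # the partial reasoning trace so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c: float = 1.4) -> float:
        # UCB1: exploitation (mean value) + exploration (visit-count bonus)
        if self.visits == 0:
            return float("inf")
        return (self.value_sum / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state: str, expand_step, prm_score, n_iters: int = 100) -> str:
    root = Node(root_state)
    for _ in range(n_iters):
        # 1. Select: walk down by UCB until reaching an unexpanded leaf
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ch.ucb())
        # 2. Expand: let the LLM propose candidate next reasoning steps
        for step in expand_step(node.state):
            node.children.append(Node(node.state + "\n" + step, parent=node))
        leaf = random.choice(node.children) if node.children else node
        # 3. Evaluate: the PRM scores the quality of the new partial trace
        value = prm_score(leaf.state)
        # 4. Backpropagate: update statistics along the path back to the root
        while leaf is not None:
            leaf.visits += 1
            leaf.value_sum += value
            leaf = leaf.parent
    # Return the reasoning trace of the most-visited child as the chosen path
    best = max(root.children, key=lambda ch: ch.visits)
    return best.state
```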
DeepSeek-R1 Style Thinking
DeepSeek-R1 demonstrates a more profound form of test-time scaling: using RL to train the model to learn how to “think”.
Key finding: given only a simple reward signal (answer correctness) and trained with GRPO, the model spontaneously develops a range of sophisticated reasoning behaviors, including:
- Step-by-step reasoning: Automatically decomposing complex problems into sub-steps
- Self-verification: Proactively checking its own computations
- Backtracking and error correction: Going back and re-reasoning upon discovering mistakes
- Strategy selection: Trying different approaches to solve the same problem
Nobody explicitly taught the model these behaviors — they emerged entirely from the simple signal of “is the answer correct or not.”
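A minimal sketch of the group-relative advantage at the heart of GRPO, assuming we already have scalar rewards for a group of responses sampled for the same prompt (the KL penalty and clipped policy-gradient update are omitted here):

```python
import numpy as np

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """GRPO replaces a learned value baseline with group statistics: each
    response's advantage is its reward normalized by the mean and standard
    deviation of the rewards of all responses sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 6 sampled answers to one math problem, reward = answer correctness
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
```

Correct answers in the group get a positive advantage and incorrect ones a negative advantage, so the policy is pushed toward whatever reasoning produced the correct answers, with no separate value model required.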
Compute-Optimal Inference
Not all problems require extensive reasoning compute. “2+2=?” doesn’t need MCTS search, but an IMO competition problem justifies investing heavily in search.
The core idea of compute-optimal inference is to dynamically allocate inference budget based on problem difficulty:
- Easy problems: Direct answer (1x compute)
- Medium problems: CoT reasoning (3-5x compute)
- Hard problems: Best-of-N + Verifier (8-24x compute)
- Extremely hard problems: Deep MCTS search (40-60x compute)
Automatically assessing problem difficulty and selecting the appropriate strategy is the key to making test-time scaling practical.
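A sketch of such a router; the `estimate_difficulty` function, the strategy functions, and the thresholds are all illustrative assumptions, and in practice the thresholds would be tuned against an accuracy-vs-compute budget:

```python
def route(prompt: str, estimate_difficulty, strategies: dict) -> str:
    """Dispatch a query to a test-time strategy based on an estimated
    difficulty score in [0, 1]."""
    d = estimate_difficulty(prompt)
    if d < 0.25:
        return strategies["direct"](prompt)           # ~1x compute
    elif d < 0.5:
        return strategies["cot"](prompt)               # ~3-5x compute
    elif d < 0.8:
        return strategies["best_of_n"](prompt, n=16)   # ~8-24x compute
    else:
        return strategies["mcts"](prompt)              # ~40-60x compute
```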
Summary
This article covered the core ideas and methods of test-time scaling:
- Test-Time Scaling is a new paradigm complementary to train-time scaling
- Best-of-N is the simplest approach, using a verifier to select the best from multiple candidates
- MCTS models the reasoning process as a tree search, intelligently exploring reasoning paths
- DeepSeek-R1 uses GRPO to train emergent thinking behaviors, representing the pinnacle of test-time scaling
- Compute-Optimal strategies dynamically allocate inference compute based on problem difficulty
From MDP fundamentals to test-time scaling, we have traced the complete chain of RL in the LLM domain. RL not only teaches LLMs to "do the right thing" (alignment) but also "how to think" (reasoning), and this may well be the critical path toward more capable AI.