Test-Time Scaling and Reasoning Enhancement

Updated 2026-04-06

Train-Time vs Test-Time Scaling

Traditional scaling laws focus on train-time scaling: increasing model parameters, training data, and compute leads to steady performance improvements. But this curve is flattening — the gains from scaling from 100B to 1T parameters are becoming increasingly marginal.

Test-Time Scaling proposes a different approach: keep the model size fixed and invest more compute at inference time to improve output quality.

Figure: Train-time vs test-time scaling, two strategies for scaling up compute. Train-time scaling (model parameters, training data) shows slowing gains but incurs no inference overhead: train once. Test-time scaling (inference samples, search depth) keeps the model fixed and continues to improve performance, at the cost of extra compute per inference.

The key finding from Snell et al. (2024): on certain tasks, increasing inference-time compute is more cost-effective than scaling up the model. A 14B model with sufficient test-time compute can outperform the direct output of a 70B model.

Chain-of-Thought from an RL Perspective

Chain-of-Thought (CoT) is more than just a prompting trick. From an RL perspective:

  • Each reasoning step = an action
  • The chain of thought = a trajectory
  • The correctness of the final answer = the reward

This means we can use RL to optimize how the model thinks — not just optimizing the final answer, but optimizing the entire reasoning process. This is the core idea behind DeepSeek-R1.
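This mapping can be sketched minimally. The code below is illustrative only (the trajectory class, steps, and outcome reward are assumptions, not DeepSeek-R1's actual implementation): reasoning steps are collected as actions, and a single outcome reward is assigned at the end based on answer correctness.

```python
from dataclasses import dataclass, field

# Illustrative sketch: a chain of thought viewed as an RL trajectory.
# Each reasoning step is an action; the reward arrives only at the end,
# based on whether the final answer matches the gold answer.

@dataclass
class CoTTrajectory:
    steps: list = field(default_factory=list)  # reasoning steps = actions

    def add_step(self, step: str) -> None:
        self.steps.append(step)

    def reward(self, final_answer: str, gold_answer: str) -> float:
        # Outcome reward: 1 if the final answer is correct, else 0.
        return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

traj = CoTTrajectory()
traj.add_step("48 / 6 = 8")
traj.add_step("2 * 3 = 6")
traj.add_step("8 + 6 = 14")
print(traj.reward("14", "14"))  # -> 1.0
```

Note that the reward is sparse: only the full trajectory is scored, which is exactly what makes optimizing the intermediate steps an RL problem rather than supervised learning.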

Best-of-N and Rejection Sampling

The simplest test-time scaling method: generate N responses and use a verifier to select the best one.

Figure: Best-of-N sampling: generate N responses, then a verifier selects the best. Accuracy rises with N (N=1: 40%, N=2: 64%, N=4: 87%, N=8: 98%, N=16 and up: ~100%) while compute cost grows linearly with N. N=8 offers a good cost-performance ratio: roughly 98% accuracy at 8x cost.

Best-of-N’s advantage is its simplicity, but computational cost scales linearly with N. For easy problems, N=1 suffices; for hard problems, you may need N=64 or more. The key requirement is a good verifier to select the best answer.
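The whole method fits in a few lines. In this sketch, `generate` and `verifier_score` are placeholder stand-ins (assumptions) for an LLM sampler and a trained verifier or reward model:

```python
import random

# Minimal Best-of-N sketch. `generate` and `verifier_score` are placeholders
# for an LLM sampler and a trained verifier; both are assumptions here.

def generate(prompt: str, rng: random.Random) -> str:
    # Placeholder sampler: returns a noisy candidate answer.
    return f"{prompt} -> candidate {rng.randint(0, 9)}"

def verifier_score(candidate: str) -> float:
    # Placeholder verifier: higher score = judged more likely correct.
    return float(candidate.split()[-1])

def best_of_n(prompt: str, n: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=verifier_score)  # verifier picks the best

print(best_of_n("12*15=?", n=8))
```

The compute cost is visible in the structure: N independent `generate` calls, so cost scales linearly with N, and the final quality is bounded by how well `verifier_score` ranks candidates.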

MCTS + LLM

A more advanced test-time scaling approach models the reasoning process as a tree search. Drawing from the AlphaGo/AlphaZero paradigm:

  • Each node represents a reasoning step
  • A PRM (Process Reward Model) evaluates the “value” of each node
  • The MCTS policy determines which reasoning paths to explore

Figure: MCTS + LLM reasoning tree search on "12×15=?". Each node is a reasoning step scored by the PRM (e.g. "10×15=150" at 0.7, "2×15=30" at 0.8, "150+30=180" at 0.9), and the search cycles through four phases: Select, Expand, Evaluate, Backpropagate.

The four-step MCTS loop:

  1. Select: Starting from the root, use the UCB formula to choose the most promising path
  2. Expand: Generate new reasoning steps at the leaf node
  3. Evaluate: Use the PRM to assess the quality of the new step
  4. Backpropagate: Propagate the evaluation result back to update the entire path

This approach is more efficient than Best-of-N because instead of independently sampling N paths, it intelligently explores and prunes.
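The four-phase loop can be sketched in simplified form. Here `expand_steps` and `prm_score` are placeholder stand-ins (assumptions) for an LLM step generator and a Process Reward Model, and the selection rule is the standard UCB1 formula, value/visits + c·sqrt(ln(parent visits)/visits):

```python
import math
import random

# Simplified MCTS sketch over reasoning steps. `expand_steps` and `prm_score`
# are placeholders for an LLM step proposer and a PRM (assumptions).

class Node:
    def __init__(self, step, parent=None):
        self.step, self.parent = step, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")  # always try unvisited nodes first
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def expand_steps(node):
    # Placeholder: propose candidate next reasoning steps for this node.
    return [f"{node.step}.{i}" for i in range(2)]

def prm_score(node, rng):
    # Placeholder PRM: returns a random value in [0, 1], ignoring the node.
    return rng.random()

def mcts(root, iterations=50, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        node = root
        while node.children:                      # 1. Select: follow max UCB
            node = max(node.children, key=Node.ucb)
        for step in expand_steps(node):           # 2. Expand the leaf
            node.children.append(Node(step, parent=node))
        value = prm_score(node, rng)              # 3. Evaluate with the PRM
        while node:                               # 4. Backpropagate the value
            node.visits += 1
            node.value += value
            node = node.parent
    return max(root.children, key=lambda n: n.visits)

best = mcts(Node("root"))
print(best.step, best.visits)
```

The pruning the text mentions happens implicitly: low-value branches stop being selected by UCB, so the compute budget concentrates on promising reasoning paths.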

DeepSeek-R1 Style Thinking

DeepSeek-R1 demonstrates a more profound form of test-time scaling: using RL to train the model to learn how to “think”.

Cold-start data
Step 1 is a cold start on a small set of high-quality (prompt, long-CoT answer) pairs. The goal is not to teach the model "how to reason correctly" but "how to output the reasoning format": wrapping its reasoning in <think> tags. The actual reasoning capability comes from the subsequent RL training step.

Key finding: when given a simple reward signal (answer correctness) and trained with GRPO, the model spontaneously develops a range of sophisticated reasoning behaviors:

Figure: R1-Zero emergent behavior on "48÷6+2×3". The model reasons step by step (48÷6 = 8, then 2×3 = 6), pauses to self-verify ("Wait, let me verify..."), combines the results (8 + 6 = 14), re-checks every step, and outputs 14. Step-by-step reasoning, self-verification, and correct order of operations all appeared without anyone explicitly teaching them.

These emergent behaviors include:

  • Step-by-step reasoning: Automatically decomposing complex problems into sub-steps
  • Self-verification: Proactively checking its own computations
  • Backtracking and error correction: Going back and re-reasoning upon discovering mistakes
  • Strategy selection: Trying different approaches to solve the same problem

Nobody explicitly taught the model these behaviors — they emerged entirely from the simple signal of “is the answer correct or not.”
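GRPO's core trick, which makes this simple reward signal usable, can be sketched numerically: sample a group of responses per prompt and use the group-normalized reward as each response's advantage, with no separate value network. The reward values below are made up for illustration:

```python
# GRPO advantage sketch: advantage = (reward - group mean) / group std.
# Correct answers in a group get positive advantage, wrong ones negative;
# no learned critic is needed. Rewards here are illustrative.

def grpo_advantages(rewards):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid dividing by zero when rewards are equal
    return [(r - mean) / std for r in rewards]

# Group of 4 sampled answers: only the first two got the answer right.
print(grpo_advantages([1.0, 1.0, 0.0, 0.0]))  # -> [1.0, 1.0, -1.0, -1.0]
```

Because the advantage is relative within the group, the model is pushed toward whatever behaviors (verification, backtracking, decomposition) happen to produce correct answers more often than its other samples, which is how the behaviors above can emerge from a binary correctness signal.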

Compute-Optimal Inference

Not all problems require extensive reasoning compute. “2+2=?” doesn’t need MCTS search, but an IMO competition problem justifies investing heavily in search.

Figure: Compute-optimal inference: think less for simple problems, think more for hard ones. Inference compute is allocated dynamically by problem difficulty: direct answer (1x), CoT (3-5x), Best-of-N (8-24x), MCTS (40-60x).

The core idea of compute-optimal inference is to dynamically allocate inference budget based on problem difficulty:

  • Easy problems: Direct answer (1x compute)
  • Medium problems: CoT reasoning (3-5x compute)
  • Hard problems: Best-of-N + Verifier (8-24x compute)
  • Extremely hard problems: Deep MCTS search (40-60x compute)

Automatically assessing problem difficulty and selecting the appropriate strategy is the key to making test-time scaling practical.
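A difficulty-based router can be sketched as a simple threshold table. Everything here is hypothetical: the thresholds and cost multipliers just mirror the tiers above, and in practice the difficulty estimate might come from a small classifier or the model's own uncertainty:

```python
# Hypothetical compute-optimal router. Thresholds and cost multipliers are
# illustrative; difficulty is assumed to be an estimate in [0, 1].

STRATEGIES = [
    (0.25, "direct", 1),      # easy: answer directly (1x)
    (0.50, "cot", 4),         # medium: chain-of-thought (3-5x)
    (0.75, "best_of_n", 16),  # hard: Best-of-N + verifier (8-24x)
    (1.00, "mcts", 50),       # extremely hard: deep MCTS search (40-60x)
]

def route(difficulty: float):
    """Map an estimated difficulty in [0, 1] to (strategy, relative cost)."""
    for threshold, name, cost in STRATEGIES:
        if difficulty <= threshold:
            return name, cost
    return STRATEGIES[-1][1:]  # clamp anything above 1.0 to the top tier

print(route(0.1))  # -> ('direct', 1)
print(route(0.9))  # -> ('mcts', 50)
```

The hard part, as the text notes, is not the routing itself but producing a reliable difficulty estimate to feed into it.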

Summary

This article covered the core ideas and methods of test-time scaling:

  1. Test-Time Scaling is a new paradigm complementary to train-time scaling
  2. Best-of-N is the simplest approach, using a verifier to select the best from multiple candidates
  3. MCTS models the reasoning process as a tree search, intelligently exploring reasoning paths
  4. DeepSeek-R1 uses GRPO to train emergent thinking behaviors, representing the pinnacle of test-time scaling
  5. Compute-Optimal strategies dynamically allocate inference compute based on problem difficulty

From MDP fundamentals to test-time scaling, we have traced the complete chain of RL in the LLM domain. RL not only teaches LLMs to "do the right thing" (alignment) but also "how to think" (reasoning); this may well be the critical path toward more capable AI.