When RL Meets LLM: From Language Generation to Policy Optimization
Updated 2026-04-13
In the previous three articles, we built a complete RL toolbox: MDP and Bellman equations (value estimation), Policy Gradient (directly optimizing policies), and PPO (stable policy optimization). These algorithms were originally developed for games and robotic control.
But you may have been wondering all along: What do these have to do with LLMs?
This article is that bridge. We will answer three core questions:
- Why must LLM post-training use RL, rather than relying solely on SFT?
- How does the text generation process of an LLM precisely map to an MDP?
- What do PG, Advantage, and PPO concretely mean in the LLM context?
The Ceiling of SFT: Why Behavioral Cloning Is Not Enough
The first stage of LLM post-training is SFT (Supervised Fine-Tuning): using human-written, high-quality (prompt, response) data to fine-tune the pretrained model. SFT teaches the model the basic format and ability to follow instructions.
But SFT is, at its core, behavioral cloning — imitating expert behavior. This approach has four fundamental limitations.
Distribution Shift: The Mismatch Between Training and Inference
During SFT training, the model sees “perfect” trajectories written by humans, and every step uses teacher forcing: regardless of what the model would generate on its own, training force-feeds the correct next token.
But at inference time, the model must use its own generated tokens as subsequent inputs. Once a step produces a token not seen in the training data, the model enters “uncharted territory” — it never learned how to recover from such a state during training.
This is exposure bias / distribution shift (Ross et al., 2011): the training distribution does not match the inference distribution, and errors compound — one small deviation triggers the next, larger deviation.
Analogy: Learning to drive solely by watching expert driving footage. Performance is fine on common road conditions, but the moment an unseen situation arises (a sudden construction zone), you have no idea how to correct course, and things go increasingly off track.
The Ceiling Problem: SFT Cannot Surpass Its Training Data
The ceiling of SFT is the quality of its training data. If human annotators write 90-point answers, SFT can at best learn to produce 90-point answers. It will not “invent” new strategies that surpass the training data.
RL is not bound by this limitation. Through exploration, RL can discover better strategies that do not exist in the training data. A striking example is DeepSeek-R1-Zero: with absolutely no SFT data, trained purely through RL (GRPO), the model spontaneously developed chain-of-thought reasoning, self-verification, and reflection behaviors — these capabilities were not taught by anyone but were “discovered” by the model itself through RL’s exploration mechanism.
The final DeepSeek-R1 added cold-start SFT and multi-round RL on top of R1-Zero, further improving readability and stability. But the core finding stands: the reasoning capabilities produced during the RL phase surpassed the ceiling of what SFT data could teach.
Sequence-Level Objectives Cannot Backpropagate Through Discrete Sampling
“Be helpful,” “be safe,” “reason correctly” — these post-training objectives are defined at the entire response level, not the individual token level.
SFT’s cross-entropy loss is computed token by token: “Does the token you generated match the reference token?” It cannot express a sequence-level concept like “overall response quality.”
What if we had a Reward Model that could score complete responses? The problem is: sampling a discrete token from a probability distribution is non-differentiable. You cannot compute gradients through argmax or categorical sampling — the gradient is broken here.
Policy Gradient was made for exactly this. Its core contribution is: rather than backpropagating through the sampling process, it uses the REINFORCE estimator to indirectly estimate gradients. The $\nabla_\theta \log \pi_\theta$ trick we learned earlier finds its true calling here.
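To make this concrete, here is a toy sketch (a hypothetical 5-token vocabulary and a made-up scalar reward). For a softmax policy, the gradient of the REINFORCE surrogate loss $-R \log \pi(a)$ with respect to the logits has a closed form, so we can compute it without any autograd framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 5-token vocabulary (illustrative stand-in for an LLM output head).
logits = rng.normal(size=5)
probs = softmax(logits)

# Sampling a token is a discrete operation: no gradient flows through it.
token = rng.choice(5, p=probs)

# REINFORCE estimator: gradient of  -reward * log pi(token)  w.r.t. logits.
# For a softmax policy this is  reward * (probs - onehot(token)).
reward = 1.0  # e.g. an RM score for the full response (assumed)
onehot = np.eye(5)[token]
grad = reward * (probs - onehot)

# Descending this gradient raises the sampled token's probability when
# reward > 0: the sampled token's logit gradient is negative.
assert grad[token] < 0
```

The gradient never "passes through" the sampling step; it only needs the log-probability of the token that happened to be sampled, which is exactly why the estimator sidesteps the non-differentiability problem.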
On-Policy Self-Evolution
The first point above described the “problem” (training distribution ≠ inference distribution); here we see why RL naturally does not have this problem.
RL is on-policy: training data is generated by the model’s own current policy. This means:
- Mistakes the model makes appear in its own training data, so it can learn to correct them
- As the model improves, the quality of generated data improves too — forming a virtuous cycle of self-evolution
SFT training data, by contrast, is fixed. No matter how much the model improves, it always sees the same set of human demonstrations.
The Markov Decision Process of Language Generation
Above we argued for the necessity of RL. But to use RL, we first need to formally model the LLM’s text generation process as an MDP (Markov Decision Process).
The Complete MDP Five-Tuple Mapping
Recall the MDP five-tuple from RL Foundations. We can precisely map each element to a step in LLM token generation:
State $s_t$: The prompt $x$ plus all previously generated tokens $y_{<t} = (y_1, \dots, y_{t-1})$. The initial state $s_0$ is simply the prompt itself.
Action $a_t$: Selecting one token $y_t$ from the vocabulary $\mathcal{V}$ (typically 32K-128K tokens).
Policy $\pi_\theta(a_t \mid s_t)$: This is the LLM itself — given the current context (prompt + already generated tokens), it outputs a probability distribution over the next token (the softmax output layer).
State Transition $P(s_{t+1} \mid s_t, a_t)$: Deterministic — the new state is simply the selected token appended to the sequence: $s_{t+1} = s_t \oplus a_t$.
Reward $r_t$: Depends on task design. The most typical setup is sparse: intermediate steps receive $r_t = 0$, with a reward given only upon completion (e.g., an RM score).
Termination condition: Generating the EOS (End of Sequence) token or reaching the maximum length. A complete generation process constitutes one episode.
Discount factor $\gamma$: Since LLM generation episodes are finite-length (bounded by a maximum length), there is no need for $\gamma < 1$ to ensure convergence — unlike the infinite-horizon setting in classical RL, $\gamma = 1$ is commonly used.
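The five-tuple above can be condensed into a toy generation loop. Everything here is illustrative — a uniform stand-in policy over a four-token vocabulary — but the loop structure mirrors the real setup: deterministic transitions, zero intermediate reward, and termination on `<EOS>` or max length:

```python
import random

# Toy sketch of LLM generation viewed as an MDP episode.
VOCAB = ["Bei", "jing", ".", "<EOS>"]

def toy_policy(state):
    """Stand-in for the LLM: state (token list) -> action probabilities."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def generate_episode(prompt_tokens, policy, max_len=10, seed=0):
    rng = random.Random(seed)
    state = list(prompt_tokens)              # s_0 = the prompt
    trajectory = []
    for _ in range(max_len):
        probs = policy(state)                # pi(a | s_t)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        next_state = state + [action]        # deterministic transition
        reward = 0.0                         # intermediate reward is 0;
        trajectory.append((tuple(state), action, reward))
        state = next_state
        if action == "<EOS>":                # termination condition
            break
    # A Reward Model would score the completed state here (sparse reward).
    return state, trajectory

final_state, traj = generate_episode(
    ["The", "capital", "of", "China", "is"], toy_policy)
```

Note that the "environment" has no dynamics of its own: the only stochasticity in the episode comes from the policy's sampling step.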
What Makes the LLM MDP Unique
Compared to classical RL domains like Atari games or robotic control, the LLM MDP has several notable distinctions:
- Deterministic transitions: Classical environments have stochasticity (the bounce direction of a ball in Breakout, physical perturbations in robotics), but LLM state transitions are fully deterministic — once a token is chosen, the new state is uniquely determined
- Enormous action space: The vocabulary has 32K-128K possible tokens, far exceeding Atari’s 18 actions. This is precisely why Value-Based methods (Q-Learning) are impractical for LLMs — you cannot maintain a Q-value for every token
- Variable-length episodes: Generation length is not fixed, ranging from a few tokens to thousands
- Sparse reward: Rewards are typically given only at the end of an episode. Across hundreds of tokens of generation, intermediate steps all receive $r_t = 0$ — this creates a severe credit assignment problem (detailed in Section 4)
Concrete Example: A Complete MDP Trajectory for Token Generation
Let us walk through a complete round of MDP decision-making with a simple example:
Prompt: “The capital of China is”
| Step | State $s_t$ | Policy $\pi_\theta(\cdot \mid s_t)$ | Action $a_t$ | Reward $r_t$ |
|---|---|---|---|---|
| 1 | "The capital of China is" | {"Bei": 0.82, "Shang": 0.05, "Nan": 0.03, ...} | "Bei" | 0 |
| 2 | "The capital of China is Bei" | {"jing": 0.97, "fang": 0.01, ...} | "jing" | 0 |
| 3 | "The capital of China is Beijing" | {".": 0.70, ",": 0.15, ...} | "." | 0 |
| 4 | "The capital of China is Beijing." | {"<EOS>": 0.90, ...} | <EOS> | RM score |
Note: Only the final step has a non-zero reward (scored by the Reward Model). The reward for all intermediate steps is 0.
On the Markov Property
An attentive reader might ask: each decision step in LLM generation depends on the full context history — does this satisfy the Markov property?
The answer is yes, because we define the state $s_t$ as the full prompt plus all generated tokens so far. This state contains all historical information. The Transformer’s attention mechanism allows the model to attend to the entire sequence at each step, so the state truly is a “sufficient statistic” — given the current state, future decisions require no additional historical information.
This is consistent with how classical RL handles such situations: if the environment itself does not satisfy the Markov property (e.g., partially observable environments), we can expand the state definition (to include more history) to make it satisfy the property.
Token-Level vs. Response-Level: Two Granularities of Operation
The MDP established above is token-level — each token is an action. This is the most fundamental and precise modeling approach, and the perspective used by PPO in RLHF.
But in subsequent articles you will see methods that operate at different granularities:
- Token-level MDP (RLHF + PPO): Compute the advantage for each token, adjusting probabilities token by token
- Response-level perspective (DPO, GRPO): Treat the entire response as a single “action,” comparing across complete answers
The two perspectives are mathematically equivalent — the response-level log probability is simply the sum of token-level log probabilities:

$$\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$$
Token-level is more granular (able to pinpoint which tokens are good/bad) but computationally heavier; response-level is more concise, and is the key to how DPO/GRPO greatly simplify training. We will clearly mark the granularity in subsequent articles when different levels are used.
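As a quick numeric check, this factorization can be verified directly with the (hypothetical) per-step probabilities from the worked example earlier:

```python
import math

# Per-step conditional probabilities P(y_t | x, y_<t) from the toy example.
step_probs = [0.82, 0.97, 0.70, 0.90]

# Response-level probability is the product of the conditionals...
response_prob = math.prod(step_probs)

# ...so the response-level log-prob is the sum of token-level log-probs.
sum_of_logs = sum(math.log(p) for p in step_probs)

assert math.isclose(math.log(response_prob), sum_of_logs)
```

This is just the chain rule of probability, but it is the identity that lets response-level methods like DPO and GRPO work with a single scalar per response.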
From PG to PPO: Re-Understanding in the LLM Context
With the MDP mapping in hand, we can revisit the formulas from the previous three articles — this time, every symbol has a concrete meaning in the LLM context.
Policy Gradient -> LLM Fine-Tuning Gradient
In Policy Gradient we learned the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]$$

Translated to the LLM context:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\, A_t\right]$$
Intuition: If generating token $y_t$ leads to an overall response quality above average ($A_t > 0$), increase the probability of generating $y_t$; otherwise, decrease it. This is how the model gradually learns “which token to generate at which position to improve the overall response.”
Advantage -> Token-Level “Good or Bad” Judgment
Advantage in the LLM context answers a very specific question:
Given that $x$ and $y_{<t}$ have already been generated, how much better is choosing token $y_t$ compared to the “average choice”?
If $A_t > 0$, this token improved overall response quality; if $A_t < 0$, this token dragged it down.
In PPO, GAE (Generalized Advantage Estimation) is used to compute the Advantage, which requires a Critic network to estimate the value of each state. In RLHF, this Critic is typically initialized from the Reward Model — leveraging the RM’s knowledge of “what makes a good answer” to estimate intermediate state values.
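The backward recursion GAE uses can be sketched in a few lines. This is a minimal illustration with made-up reward and value numbers (not a production implementation); `values` stands in for per-token Critic estimates:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response of T tokens."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae    # discounted sum of TD errors
        advantages[t] = gae
    return advantages

# Sparse reward: only the final token carries the RM score (here 1.0).
rewards = np.array([0.0, 0.0, 0.0, 1.0])
values = np.array([0.2, 0.3, 0.5, 0.8])   # hypothetical Critic estimates
adv = gae_advantages(rewards, values)
```

Even though only the last position has a non-zero reward, the recursion mixes the Critic's value estimates into every earlier position, which is how the terminal signal is propagated backwards.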
PPO Clip -> Preventing LLM “Mutations” in a Single Update
The core PPO formula in the LLM context becomes:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_t,\ \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t\right)\right]$$

where the probability ratio $r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}$ is “the ratio of the new model’s probability to the old model’s probability for the same token.”
What clip means for LLMs: No single token’s generation probability is allowed to change too much in one update. If a token’s probability suddenly jumps from 5% to 80% (or vice versa), this could destroy the model’s language capabilities. $\epsilon$ is typically set to 0.2, meaning the probability ratio is constrained to the range $[0.8,\, 1.2]$.
This is PPO’s role in LLM alignment: conservatively adjusting the token probability distribution, ensuring that each update does not cause the model to “collapse.”
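A minimal sketch of the clipped surrogate, with illustrative per-token numbers. Note how the first token's extreme ratio (a jump from 5% to 80%, i.e. a ratio of 16) is clipped down to 1.2, so it cannot dominate the update:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate over a handful of tokens (toy sketch)."""
    ratio = np.exp(logp_new - logp_old)         # pi_new / pi_old, per token
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    # Pessimistic bound: element-wise minimum of unclipped and clipped terms.
    return np.minimum(ratio * advantages, clipped * advantages).mean()

logp_old = np.log(np.array([0.05, 0.50, 0.30]))
logp_new = np.log(np.array([0.80, 0.52, 0.10]))  # first token: 5% -> 80%
adv = np.array([1.0, 1.0, -1.0])

obj = ppo_clip_objective(logp_new, logp_old, adv)
```

For the third token ($A_t < 0$, ratio 0.33), the minimum selects the clipped term $0.8 \cdot (-1)$, which is the more pessimistic of the two — the surrogate never rewards moving further than the clip range in either direction.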
Unique Challenges of RL for LLMs
Applying RL to LLMs is not simply “plugging in formulas.” The LLM setting introduces a series of technical challenges that are absent or not prominent in classical RL.
Sparse Reward and Credit Assignment
In the typical RLHF setup, the Reward Model produces a scalar score after the entire response is generated. This means:
- For a 200-token response, only the position of the last token has a non-zero reward
- The reward for the preceding 199 tokens is all 0
The credit assignment problem: Which tokens actually made the response better? Was it the problem understanding at the beginning? The reasoning steps in the middle? Or the summary at the end? Sparse reward cannot directly answer this.
There are currently two coping strategies:
Outcome Reward Model (ORM): Scores only the final result. Simple but makes credit assignment difficult. GAE in PPO propagates the reward backwards through multi-step bootstrapping, but this is still indirect: the signal originates at the last token, and estimation quality degrades with distance from it.
Process Reward Model (PRM): Scores each step of the reasoning process, providing a dense reward signal. PRM directly tells you “Step 3 of reasoning was correct, Step 5 went wrong.” OpenAI’s “Let’s Verify Step by Step” (Lightman et al., 2023) demonstrated that PRM significantly outperforms ORM on mathematical reasoning. The cost, however, is extremely high annotation overhead — requiring human step-by-step labeling of each step’s correctness.
This problem is discussed in greater depth in Reward Design and Scaling.
KL Penalty: Don’t Forget You’re a Language Model
The optimization objective of RLHF is not merely “maximize reward” — it includes a critical regularization term, the KL divergence penalty:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\big[R(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big]$$

where $\pi_{\text{ref}}$ is the SFT model (the reference policy) and $\beta$ controls the strength of the KL constraint.
Why is KL penalty needed? Without it, the model will find and ruthlessly exploit weaknesses in the Reward Model — this is reward hacking. For example, the model might learn to generate responses that “look long and professional but are actually nonsense,” because the RM tends to assign higher scores to longer responses.
The intuition behind KL penalty: keep the model aligned while not forgetting the language capabilities learned during pretraining. It acts like an elastic cord, allowing the model to deviate moderately from the SFT model but pulling it back if it strays too far.
In practice, the KL penalty is decomposed to the token level, becoming part of the per-step reward:

$$r_t = -\beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})} + \mathbb{1}[t = T]\, R(x, y)$$

A key detail: $r_t$ consists solely of the KL term for all intermediate tokens ($t < T$); only the last token carries the task reward. This means the reward signal for intermediate steps comes entirely from the KL penalty.
This leads to an interesting side effect: the KL penalty plays an unexpectedly important role in credit assignment — it provides a “baseline signal” for intermediate tokens (penalizing excessive deviation from the reference policy), giving GAE more information to estimate the advantage.
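The per-step reward assembly described above can be sketched as follows, using the common per-token approximation of the KL term, $\log(\pi_\theta / \pi_{\text{ref}})$. The value of `beta`, the log-probs, and the RM score are all illustrative:

```python
import numpy as np

def per_token_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Fold the KL penalty into a per-token reward vector (toy sketch)."""
    # Per-token KL estimate: log(pi / pi_ref); penalized with weight beta.
    kl_penalty = -beta * (logp_policy - logp_ref)
    rewards = kl_penalty.copy()
    rewards[-1] += rm_score          # task reward lands on the final token
    return rewards

logp_policy = np.array([-1.0, -0.5, -2.0])   # hypothetical policy log-probs
logp_ref = np.array([-1.2, -0.5, -1.0])      # frozen SFT reference log-probs
r = per_token_rewards(logp_policy, logp_ref, rm_score=1.0)
```

Tokens where the policy already matches the reference (position 2) contribute zero; tokens that drift (positions 1 and 3) receive small positive or negative nudges, which is exactly the "baseline signal" for intermediate steps mentioned above.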
The trade-off:
- $\beta$ too large: The model can hardly deviate from SFT and learns nothing new
- $\beta$ too small: High risk of reward hacking; the model may become incoherent
Generation Is Sampling: The Cost of Being On-Policy
RL optimization requires on-policy data — responses must be generated using the current model parameters to compute policy gradients. This means:
- After each parameter update, all previously collected response data becomes “stale”
- A new batch of responses must be generated with the updated model
- The generation process itself is autoregressive — tokens are generated one at a time, which is not fast
This is one of the fundamental reasons why RLHF training costs far exceed SFT. The InstructGPT paper mentions needing to run 4 large models simultaneously: the policy (the model being trained), the reference policy (SFT model), the reward model, and the critic (value network).
This is also why DPO (Direct Preference Optimization) is so attractive — it is offline, training directly on existing preference data without requiring on-policy sampling. But this introduces new problems (distribution shift), which we discuss in detail in the DPO article.
The Post-Training Landscape: From SFT to Reasoning Reinforcement
Now let us step back and survey the complete post-training landscape. Each stage solves a specific problem and corresponds to an article in this learning path:
Pretrained LLM: Learns language patterns from massive text (next-token prediction). At this point the model can “speak” but cannot follow instructions or judge what makes a good response.
SFT (Supervised Fine-Tuning): Behavioral cloning with high-quality (prompt, response) pairs. The model learns the basic format for following instructions. But it is limited by the quality ceiling of training data and distribution shift. (This article, Section 1)
RLHF: The complete pipeline of SFT -> Reward Model -> PPO. A reward model is trained on human preferences, then PPO optimizes the LLM policy. Highly effective but complex to train, requiring 4 models running simultaneously. (-> RLHF: Learning from Human Feedback)
DPO: Skips the Reward Model and directly optimizes the policy from preference data. Mathematically equivalent to RLHF’s implicit reward, but the training process is as simple as SFT. The trade-off is distribution shift from offline training. (-> From DPO to GRPO)
GRPO: Proposed by DeepSeek. No Critic network needed; generates multiple responses for the same prompt and uses within-group relative ranking instead of absolute reward. Lighter than PPO and the core training algorithm behind DeepSeek-R1. (-> From DPO to GRPO)
Reward Design: Regardless of which alignment method is used, the quality of the reward signal determines the ceiling. ORM vs PRM, reward hacking defenses, reward scaling. (-> Reward Design and Scaling)
Test-Time Scaling: Fix the model size and invest more computation at inference time for better outputs. Best-of-N, MCTS + PRM, CoT as RL trajectory. (-> Test-Time Scaling and Reasoning Enhancement)
Summary
This article established the complete bridge from classical RL to LLM post-training:
- Limitations of SFT: Distribution shift, ceiling problem, non-differentiability of sequence-level objectives, and the off-policy nature of fixed data — these explain why post-training must use RL
- LLM generation = MDP: The state is the generated sequence so far, the action is choosing the next token, and the policy is the LLM itself. Deterministic transitions, enormous action space, and sparse reward are the distinctive features of the LLM MDP
- RL toolbox translation: Policy Gradient adjusts token probabilities, Advantage judges whether each token is good or bad, and PPO Clip prevents excessively large single updates
- Unique challenges of LLM RL: Credit assignment difficulty with sparse reward, KL penalty to prevent reward hacking and forgetting, and the high cost of on-policy sampling
- Post-training landscape: SFT -> RLHF / DPO / GRPO -> Reward Design -> Test-Time Scaling
In the next article, we will dive deep into the complete RLHF pipeline: SFT -> Reward Model -> PPO, and see how these RL tools are concretely used to align large language models.