
Reinforcement Learning Foundations: From Agent to Bellman Equation


Updated 2026-04-13

What Is Reinforcement Learning

Reinforcement Learning (RL) is the third major paradigm of machine learning, alongside supervised learning and unsupervised learning. Its core idea is simple yet profound: an Agent learns the optimal behavioral policy through trial and error in an Environment.

The key differences from supervised learning are:

  • No labels: No one tells the Agent what the correct answer is — it can only learn from reward signals provided by the environment
  • Delayed rewards: A good decision may not show its effects until much later (e.g., opening moves in chess)
  • Exploration-Exploitation Dilemma: The Agent must balance between “trying new strategies” and “using known good strategies”
Interactive demo: the Agent-Environment loop. Click the action buttons (Up / Right / Down / Left) to control the Agent; S₄ is the goal (+5) and S₃ is a trap (−1).

At each time step, this loop repeats:

  1. The Agent observes the current State $s_t$
  2. The Agent selects an Action $a_t$ based on its policy
  3. The Environment returns a Reward $r_t$ and a new state $s_{t+1}$
  4. The Agent adjusts its policy based on experience
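
To make the loop concrete, here is a minimal sketch using the Gymnasium API (listed in the resources at the end), with a random policy standing in for a learned one; the environment choice is just an example:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)      # step 1: observe the initial state

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # step 2: select an action (random policy here)
    obs, reward, terminated, truncated, info = env.step(action)  # step 3: receive r_t and s_{t+1}
    total_reward += reward              # step 4 would update the policy from this experience
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```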

Markov Decision Process (MDP)

The mathematical foundation of RL is the Markov Decision Process (MDP), defined by the five-tuple $(S, A, P, R, \gamma)$:

  • $S$: State space (the set of all possible states)
  • $A$: Action space (the set of all possible actions)
  • $P(s'|s,a)$: State transition probability (the probability of transitioning to $s'$ after taking action $a$ in state $s$)
  • $R(s,a,s')$: Reward function (the immediate reward received during a state transition)
  • $\gamma \in [0,1)$: Discount factor (the decay coefficient for future rewards — the further away a reward, the less it is worth)
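
To make the five-tuple concrete, here is one possible encoding of a tiny MDP in plain Python dictionaries (the three-state layout is invented purely for illustration; later sketches reuse this encoding):

```python
# States, actions, transitions P(s'|s,a), rewards R(s,a,s'), and discount gamma.
S = ["s0", "s1", "goal"]
A = ["left", "right"]

# P[s][a] is a list of (next_state, probability) pairs.
P = {
    "s0":   {"left": [("s0", 1.0)],   "right": [("s1", 1.0)]},
    "s1":   {"left": [("s0", 1.0)],   "right": [("goal", 1.0)]},
    "goal": {"left": [("goal", 1.0)], "right": [("goal", 1.0)]},  # absorbing state
}

# R[(s, a, s')] is the immediate reward; unlisted transitions yield 0.
R = {("s1", "right", "goal"): 1.0}

gamma = 0.9
```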

The “Markov” property means memorylessness: the next state depends only on the current state and action, not on history. This assumption makes the problem tractable.

Interactive demo: MDP grid world ($\gamma = 0.9$, step reward −0.04, goal +1, trap −1). Use the arrow keys to control the Agent.

Policy and Value Functions

With the MDP framework in place, we need to define the Agent’s behavioral model and evaluation criteria:

Policy $\pi(a|s)$: The probability distribution over actions $a$ given state $s$. A policy can be deterministic (one fixed action per state) or stochastic (a probability distribution).

State Value Function $V^\pi(s)$: The expected cumulative discounted reward starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

Action Value Function $Q^\pi(s,a)$: The expected return after taking action $a$ in state $s$ and then following policy $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

In these formulas, $s_0 = s$ means “starting from state $s$ at time step 0”, and $a_0 = a$ means “taking action $a$ at the first step”. The only difference between the two: $V$ fixes only the starting state (the first action is chosen by policy $\pi$), while $Q$ fixes both the starting state and the first action.
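
As a quick numeric check of the return inside these expectations, a small helper (hypothetical, for illustration) computes $\sum_t \gamma^t r_t$ for one sampled reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: sum of gamma^t * r_t over one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# The same +1 reward is worth less the later it arrives:
print(discounted_return([1, 0, 0]))  # 1.0
print(discounted_return([0, 0, 1]))  # 0.81 = 0.9^2 * 1
```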

The V-Q Duality:

From Q to V — In state $s$, the Agent can choose among multiple actions, each with its own Q-value. The state’s value is the weighted average of all action Q-values, weighted by the policy’s selection probabilities:

$$V^\pi(s) = \sum_a \pi(a|s) \cdot Q^\pi(s,a)$$

From V to Q — Conversely, after choosing action $a$ in state $s$, the environment may transition to multiple next states $s'$. The action’s value is the probability-weighted sum over all reachable next states of (immediate reward + discounted next-state value):

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s')\right]$$

Together, these two directions form the complete recursive structure of the Bellman equation in the next section.
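
Both directions translate almost line-for-line into code. A sketch using the dictionary-based MDP encoding from earlier (P, R, gamma as defined above; pi[s] is an assumed action-to-probability dict):

```python
def v_from_q(s, Q, pi):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)."""
    return sum(pi[s][a] * Q[(s, a)] for a in pi[s])

def q_from_v(s, a, V, P, R, gamma):
    """Q^pi(s, a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V^pi(s')]."""
    return sum(prob * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, prob in P[s][a])
```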

Bellman Equation

Why Do We Need the Bellman Equation?

The ultimate goal of RL is to find the optimal policy $\pi^*$ — the one that chooses the best action in every state. If we already knew $Q^*(s,a)$ (the optimal value of each state-action pair), the policy would be trivial: just pick the action with the highest Q-value, $\pi^*(s) = \arg\max_a Q^*(s,a)$. So the core problem of RL reduces to computing $V^*$ or $Q^*$.

The brute-force approach: for each state $s$, enumerate all possible future trajectories, compute the cumulative discounted reward for each, and take the expectation. The problem is that the number of trajectories grows exponentially with the horizon ($T$ steps with $|A|$ actions per step gives $|A|^T$ trajectories), which is completely intractable. The brilliance of the Bellman equation is that it turns this exponential global search into a local recursive relation: you don’t need to see all the way to the endpoint, only the values of your “one-step-away neighbors.”

The Recursive Relation

Core intuition: the value of a state = immediate reward from taking one step + discounted value of the state you land in.

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s')\right]$$

This is exactly the V→Q→V combination from the previous section: at state $s$, choose action $a$ according to policy $\pi$ (outer sum), the environment transitions to next state $s'$ according to $P(s'|s,a)$ (inner sum), and the value = immediate reward $R$ + discount factor $\gamma$ × next-state value $V(s')$.

The optimal Bellman equation — instead of weighting by policy probabilities, directly take the action that achieves the highest value:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^*(s')\right]$$

Bellman Backup Visualization

The three-step visualization below shows how the Bellman equation goes from “a recursive formula” to “an executable algorithm”:

Step 1 — Single-Step Backup: Given the values $V(s')$ of next states, compute $R + \gamma \cdot V(s')$ for each possible action and take the maximum — that’s the current state’s $V(s)$. This is one Bellman “backup”: propagating value backward from successor states to the current state.

Step 2 — Chain Propagation: The terminal state’s value is known (e.g., goal cell V = 10). Through successive backups, we compute the values of s₂, s₁, s₀ — information propagates layer by layer from the endpoint back to the start.

Step 3 — Value Iteration: In real environments, state relationships form a graph, not a chain. Value Iteration performs backups on all states repeatedly: in each iteration, values near the terminal states stabilize first, then spread outward like a wave, until every state’s $V$ value converges to $V^*$.

Interactive demo: single-step backup with $\gamma = 0.9$. Given three successor states $s'_1$ ($V = 3.0$, $r = +1$), $s'_2$ ($V = 1.0$, $r = 0$), and $s'_3$ ($V = 2.0$, $r = +2$), the candidates $R + \gamma V(s')$ are $1 + 0.9 \times 3.0 = 3.70$, $0 + 0.9 \times 1.0 = 0.90$, and $2 + 0.9 \times 2.0 = 3.80$, so $V(s) = 3.80$ and the optimal action is the one leading to $s'_3$.

From Equation to Algorithm

The Bellman equation is not just a mathematical property — it directly yields the core algorithms of RL:

  • Value Iteration (requires environment model): Repeatedly apply the Bellman update $V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R + \gamma V(s')]$ to every state until convergence, as sketched after this list. But this requires knowing the transition probabilities $P(s'|s,a)$, making it a model-based method.
  • Q-Learning (no environment model needed): The Agent collects samples $(s, a, r, s')$ through interaction with the environment and uses these samples to approximate the Bellman update — turning “requires full model” into “only needs samples.” This is the model-free method discussed in the next section.
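
Here is a minimal value-iteration sketch over the dictionary-based MDP from earlier (S, A, P, R, gamma as defined above); it illustrates the update rather than being a production implementation:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-6):
    """Apply the Bellman optimality backup to every state until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup: V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
            best = max(
                sum(prob * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, prob in P[s][a])
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once no state changed by more than tol
            return V
```

On the toy MDP above this converges to V(goal) = 0, V(s1) = 1.0, and V(s0) = 0.9 — the discounted values of the one-step and two-step paths to the goal.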

Value-Based Methods

The core idea of Value-Based methods is: first learn an accurate Q function, then derive the optimal policy from it (greedy: select the action with the highest Q-value).

Q-Learning is the most classic Value-Based algorithm, with the update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

where $\alpha$ is the learning rate. This update rule directly approximates the optimal Q function without needing to know the environment’s transition probabilities (model-free).
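
A tabular sketch of this update loop, assuming a Gymnasium-style environment with small discrete state and action spaces (the structure mirrors the live demo below; names and defaults are illustrative):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.3):
    """Tabular Q-learning with epsilon-greedy exploration."""
    n_actions = env.action_space.n
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialized to 0

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s2, r, terminated, truncated, _ = env.step(a)
            # Sample-based Bellman target: r + gamma * max_a' Q(s', a')
            # (no bootstrapping from a terminal state).
            target = r if terminated else r + gamma * max(
                Q[(s2, a2)] for a2 in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, done = s2, terminated or truncated
    return Q
```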

When the state space is large (e.g., image inputs), storing Q-values in a table is impractical. DQN (Deep Q-Network) uses a neural network to approximate the Q function, marking a milestone in deep reinforcement learning.

Interactive demo: Q-Learning live training on a grid world (α = 0.1, γ = 0.9, ε = 0.3). Train one or more episodes and watch the per-cell Q(s,a) values and the learned policy evolve under ε-greedy exploration.
Interactive demo: value function ↔ policy correspondence. Adjust the V values of individual cells and watch the policy arrows follow; each arrow points to the highest-value neighbor.

From Value to Policy

Q-Learning and DQN have been very successful in discrete action spaces like Atari games, but they have a fundamental limitation: they are not well-suited for LLM scenarios.

The reason is that an LLM’s “action space” is the entire vocabulary (typically 32K-128K tokens), and generation is sequential decision-making: each generated token extends the context that conditions the next decision. Using Q-Learning would require computing a Q-value for every token at every step, which is computationally expensive and unnatural.

A more natural approach is to directly parameterize the policy: have a network directly output “the probability of each action given the current state” — which is exactly what LLMs already do (next-token prediction is a policy).
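
In code terms, sampling one token from an LLM is literally one policy step. A minimal PyTorch-style sketch (the model call and tensor shapes here are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def policy_step(model, token_ids):
    """Treat next-token prediction as a stochastic policy pi(a|s).

    State s: the token prefix. Action a: the next token.
    Assumes model(token_ids) returns logits of shape (seq_len, vocab_size).
    """
    logits = model(token_ids)
    probs = F.softmax(logits[-1], dim=-1)             # pi(a|s) over the vocabulary
    action = torch.multinomial(probs, num_samples=1)  # sample the next token
    return action, probs[action]                      # token id and its probability
```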

This is the motivation behind Policy Gradient methods, and the topic of the next article.

RL Method Landscape

Interactive diagram: RL methods taxonomy. RL methods branch into Value-Based (the DQN family), Policy-Based (Policy Gradient → PPO / TRPO), and Actor-Critic; RLHF / DPO / GRPO sit on the LLM-alignment-related path (marked with purple dots).

As the diagram shows, LLM alignment (RLHF, DPO, GRPO) follows the Policy-Based → Actor-Critic → PPO path, not the Value-Based path. Understanding this evolutionary trajectory is key to learning the subsequent topics.

If you want to dive deeper into reinforcement learning, here are our curated resources:

Classic Textbooks

  • Sutton & Barto “Reinforcement Learning: An Introduction” (2nd Edition) — The bible of the RL field, available free online. Ideal for systematic study of MDP, dynamic programming, Monte Carlo methods, TD learning, and other fundamentals.
  • Csaba Szepesvari “Algorithms for Reinforcement Learning” — A more mathematically oriented concise textbook, suitable for readers who prefer theoretical derivations.

Video Courses

  • David Silver UCL RL Course — A classic course by DeepMind’s Chief Scientist. 10 lectures covering fundamentals through function approximation and policy gradients. Each lecture is about 1.5 hours; best paired with slides.
  • Sergey Levine UC Berkeley CS285 — A deep RL course focused on research frontiers, covering model-based RL, offline RL, and other advanced topics.
  • Hugging Face Deep RL Course — A free interactive course with hands-on practice, complete with companion code and assignment environments. Ideal for hands-on learners.
  • Stanford CS234 — Taught by Emma Brunskill, with a stronger focus on theory and analysis.

Blogs and Tutorials

  • Lilian Weng’s Blog Series — Covers RL fundamentals, Policy Gradient, RLHF, Reward Hacking, and more. Each post is a high-quality survey with excellent illustrations.
  • OpenAI Spinning Up — Official RL introduction tutorial, from concepts to code implementation. Particularly suited for those who want to understand algorithm details from scratch.
  • Andrej Karpathy “Pong from Pixels” — A classic introductory blog post that trains Pong from scratch using Policy Gradient, with excellent intuitive explanations.
  • Nathan Lambert (interconnects.ai) — An in-depth blog focused on RLHF and LLM alignment, tracking the latest research developments.
  • Chip Huyen’s RLHF Overview — An RLHF introductory article aimed at engineers — clear and direct.

Interactive Experiments

  • Gymnasium (formerly OpenAI Gym) — The standard RL environment library, providing classic environments like CartPole, MountainCar, and more.
  • CleanRL — Single-file RL implementations, one file per algorithm, ideal for learning and modification.

Summary

This article introduced the core concepts of reinforcement learning:

  1. The Agent-Environment loop is the fundamental framework of RL
  2. MDP provides the mathematical foundation for RL (states, actions, transitions, rewards, discount)
  3. The Bellman equation is the recursive relationship for value functions and the foundation of virtually all RL algorithms
  4. Q-Learning / DQN are classic Value-Based methods
  5. The unique characteristics of LLMs (huge action space, sequential decision-making) make Policy-Based methods more suitable

In the next article, we will dive into Policy Gradient to understand how to directly optimize a policy — the key bridge connecting classical RL and LLM alignment.