
Reinforcement Learning Foundations: From Agent to Bellman Equation


Updated 2026-04-13

What Is Reinforcement Learning

Reinforcement Learning (RL) is the third major paradigm of machine learning, alongside supervised learning and unsupervised learning. Its core idea is simple yet profound: an Agent learns the optimal behavioral policy through trial and error in an Environment.

The key differences from supervised learning are:

  • No labels: No one tells the Agent what the correct answer is — it can only learn from reward signals provided by the environment
  • Delayed rewards: A good decision may not show its effects until much later (e.g., opening moves in chess)
  • Exploration-Exploitation Dilemma: The Agent must balance between “trying new strategies” and “using known good strategies”
Interactive demo: the Agent-Environment loop. Click the action buttons (Up / Right / Down / Left) to control the Agent; S₄ is the goal (+5) and S₃ is a trap (−1).

At each time step, this loop repeats:

  1. The Agent observes the current State $s_t$
  2. The Agent selects an Action $a_t$ based on its policy
  3. The Environment returns a Reward $r_t$ and a new state $s_{t+1}$
  4. The Agent adjusts its policy based on experience
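
To make the loop concrete, here is a minimal sketch using the Gymnasium API (listed in the resources at the end), with a random policy standing in for a learned one; the environment choice is just an example:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)      # step 1: observe the initial state

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # step 2: select an action (random policy here)
    obs, reward, terminated, truncated, info = env.step(action)  # step 3: receive r_t and s_{t+1}
    total_reward += reward              # step 4 would update the policy from this experience
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```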

Markov Decision Process (MDP)

The mathematical foundation of RL is the Markov Decision Process (MDP), defined by the five-tuple $(S, A, P, R, \gamma)$:

  • $S$: State space (the set of all possible states)
  • $A$: Action space (the set of all possible actions)
  • $P(s'|s,a)$: State transition probability (the probability of transitioning to $s'$ after taking action $a$ in state $s$)
  • $R(s,a,s')$: Reward function (the immediate reward received during a state transition)
  • $\gamma \in [0,1)$: Discount factor (the decay coefficient for future rewards — the further away a reward, the less it is worth)
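
To make the five-tuple concrete, here is one possible encoding of a tiny MDP in plain Python dictionaries (the three-state layout is invented purely for illustration; later sketches reuse this encoding):

```python
# States, actions, transitions P(s'|s,a), rewards R(s,a,s'), and discount gamma.
S = ["s0", "s1", "goal"]
A = ["left", "right"]

# P[s][a] is a list of (next_state, probability) pairs.
P = {
    "s0":   {"left": [("s0", 1.0)],   "right": [("s1", 1.0)]},
    "s1":   {"left": [("s0", 1.0)],   "right": [("goal", 1.0)]},
    "goal": {"left": [("goal", 1.0)], "right": [("goal", 1.0)]},  # absorbing state
}

# R[(s, a, s')] is the immediate reward; unlisted transitions yield 0.
R = {("s1", "right", "goal"): 1.0}

gamma = 0.9
```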

The “Markov” property means memorylessness: the next state depends only on the current state and action, not on history. This assumption makes the problem tractable.

Interactive demo: MDP grid world ($\gamma = 0.9$, step reward −0.04, goal +1, trap −1). Use the arrow keys to control the Agent.

Policy and Value Functions

With the MDP framework in place, we need to define the Agent’s behavioral model and evaluation criteria:

Policy $\pi(a|s)$: The probability distribution over actions $a$ given state $s$. A policy can be deterministic (one fixed action per state) or stochastic (a probability distribution).

State Value Function $V^\pi(s)$: The expected cumulative discounted reward starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

Action Value Function $Q^\pi(s,a)$: The expected return after taking action $a$ in state $s$ and then following policy $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

In these formulas, $s_0 = s$ means “starting from state $s$ at time step 0”, and $a_0 = a$ means “taking action $a$ at the first step”. The only difference between the two: $V$ fixes only the starting state (the first action is chosen by policy $\pi$), while $Q$ fixes both the starting state and the first action.
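
As a quick numeric check of the return inside these expectations, a small helper (hypothetical, for illustration) computes $\sum_t \gamma^t r_t$ for one sampled reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: sum of gamma^t * r_t over one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# The same +1 reward is worth less the later it arrives:
print(discounted_return([1, 0, 0]))  # 1.0
print(discounted_return([0, 0, 1]))  # 0.81 = 0.9^2 * 1
```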

The V-Q Duality:

From Q to V — In state $s$, the Agent can choose among multiple actions, each with its own Q-value. The state’s value is the weighted average of all action Q-values, weighted by the policy’s selection probabilities:

$$V^\pi(s) = \sum_a \pi(a|s) \cdot Q^\pi(s,a)$$

From V to Q — Conversely, after choosing action $a$ in state $s$, the environment may transition to multiple next states $s'$. The action’s value is the probability-weighted sum over all reachable next states of (immediate reward + discounted next-state value):

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s')\right]$$

Together, these two directions form the complete recursive structure of the Bellman equation in the next section.
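
Both directions translate almost line-for-line into code. A sketch using the dictionary-based MDP encoding from earlier (P, R, gamma as defined above; pi[s] is an assumed action-to-probability dict):

```python
def v_from_q(s, Q, pi):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)."""
    return sum(pi[s][a] * Q[(s, a)] for a in pi[s])

def q_from_v(s, a, V, P, R, gamma):
    """Q^pi(s, a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V^pi(s')]."""
    return sum(prob * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, prob in P[s][a])
```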

Bellman Equation

Why Do We Need the Bellman Equation?

The ultimate goal of RL is to find the optimal policy $\pi^*$ — the one that chooses the best action in every state. If we already knew $Q^*(s,a)$ (the optimal value of each state-action pair), the policy would be trivial: just pick the action with the highest Q-value, $\pi^*(s) = \arg\max_a Q^*(s,a)$. So the core problem of RL reduces to computing $V^*$ or $Q^*$.

The brute-force approach: for each state $s$, enumerate all possible future trajectories, compute the cumulative discounted reward for each, and take the expectation. The problem is that the number of trajectories grows exponentially with the horizon ($T$ steps with $|A|$ actions per step gives $|A|^T$ trajectories), which is completely intractable. The brilliance of the Bellman equation is that it turns this exponential global search into a local recursive relation: you don’t need to see all the way to the endpoint, only the values of your “one-step-away neighbors.”

The Recursive Relation

Core intuition: the value of a state = immediate reward from taking one step + discounted value of the state you land in.

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s')\right]$$

This is exactly the V→Q→V combination from the previous section: at state $s$, choose action $a$ according to policy $\pi$ (outer sum), the environment transitions to next state $s'$ according to $P(s'|s,a)$ (inner sum), and the value = immediate reward $R$ + discount factor $\gamma$ × next-state value $V(s')$.

The optimal Bellman equation — instead of weighting by policy probabilities, directly take the action that achieves the highest value:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^*(s')\right]$$

Bellman Backup Visualization

The three-step visualization below shows how the Bellman equation goes from “a recursive formula” to “an executable algorithm”:

Step 1 — Single-Step Backup: Given the values $V(s')$ of next states, compute $R + \gamma \cdot V(s')$ for each possible action and take the maximum — that’s the current state’s $V(s)$. This is one Bellman “backup”: propagating value backward from successor states to the current state.

Step 2 — Chain Propagation: The terminal state’s value is known (e.g., goal cell V = 10). Through successive backups, we compute the values of s₂, s₁, s₀ — information propagates layer by layer from the endpoint back to the start.

Step 3 — Value Iteration: In real environments, state relationships form a graph, not a chain. Value Iteration performs backups on all states repeatedly: in each iteration, values near the terminal states stabilize first, then spread outward like a wave, until every state’s $V$ value converges to $V^*$.

Interactive demo: single-step backup with $\gamma = 0.9$. Given three successor states $s'_1$ ($V = 3.0$, $r = +1$), $s'_2$ ($V = 1.0$, $r = 0$), and $s'_3$ ($V = 2.0$, $r = +2$), the candidates $R + \gamma V(s')$ are $1 + 0.9 \times 3.0 = 3.70$, $0 + 0.9 \times 1.0 = 0.90$, and $2 + 0.9 \times 2.0 = 3.80$, so $V(s) = 3.80$ and the optimal action is the one leading to $s'_3$.

From Equation to Algorithm

The Bellman equation is not just a mathematical property — it directly yields the core algorithms of RL:

  • Value Iteration (requires environment model): Repeatedly apply the Bellman update $V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R + \gamma V(s')]$ to every state until convergence, as sketched after this list. But this requires knowing the transition probabilities $P(s'|s,a)$, making it a model-based method.
  • Q-Learning (no environment model needed): The Agent collects samples $(s, a, r, s')$ through interaction with the environment and uses these samples to approximate the Bellman update — turning “requires full model” into “only needs samples.” This is the model-free method discussed in the next section.
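
Here is a minimal value-iteration sketch over the dictionary-based MDP from earlier (S, A, P, R, gamma as defined above); it illustrates the update rather than being a production implementation:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-6):
    """Apply the Bellman optimality backup to every state until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup: V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
            best = max(
                sum(prob * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, prob in P[s][a])
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once no state changed by more than tol
            return V
```

On the toy MDP above this converges to V(goal) = 0, V(s1) = 1.0, and V(s0) = 0.9 — the discounted values of the one-step and two-step paths to the goal.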

Value-Based Methods

The core idea of Value-Based methods is: first learn an accurate Q function, then derive the optimal policy from it (greedy: select the action with the highest Q-value).

Q-Learning is the most classic Value-Based algorithm, with the update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

where $\alpha$ is the learning rate. This update rule directly approximates the optimal Q function without needing to know the environment’s transition probabilities (model-free).
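
A tabular sketch of this update loop, assuming a Gymnasium-style environment with small discrete state and action spaces (the structure mirrors the live demo below; names and defaults are illustrative):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.3):
    """Tabular Q-learning with epsilon-greedy exploration."""
    n_actions = env.action_space.n
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialized to 0

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s2, r, terminated, truncated, _ = env.step(a)
            # Sample-based Bellman target: r + gamma * max_a' Q(s', a')
            # (no bootstrapping from a terminal state).
            target = r if terminated else r + gamma * max(
                Q[(s2, a2)] for a2 in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, done = s2, terminated or truncated
    return Q
```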

When the state space is large (e.g., image inputs), storing Q-values in a table is impractical. DQN (Deep Q-Network) uses a neural network to approximate the Q function, marking a milestone in deep reinforcement learning.

Interactive demo: Q-Learning live training on a grid world (α = 0.1, γ = 0.9, ε = 0.3). Train one or more episodes and watch the per-cell Q(s,a) values and the learned policy evolve under ε-greedy exploration.
Interactive demo: value function ↔ policy correspondence. Adjust the V values of individual cells and watch the policy arrows follow; each arrow points to the highest-value neighbor.

From Value to Policy

Q-Learning and DQN have been very successful in discrete action spaces like Atari games, but they have a fundamental limitation: they are not well-suited for LLM scenarios.

The reason is that an LLM’s “action space” is the entire vocabulary (typically 32K-128K tokens), and generation is sequential decision-making: each generated token extends the context that conditions the next decision. Using Q-Learning would require computing a Q-value for every token at every step, which is computationally expensive and unnatural.

A more natural approach is to directly parameterize the policy: have a network directly output “the probability of each action given the current state” — which is exactly what LLMs already do (next-token prediction is a policy).
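
In code terms, sampling one token from an LLM is literally one policy step. A minimal PyTorch-style sketch (the model call and tensor shapes here are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def policy_step(model, token_ids):
    """Treat next-token prediction as a stochastic policy pi(a|s).

    State s: the token prefix. Action a: the next token.
    Assumes model(token_ids) returns logits of shape (seq_len, vocab_size).
    """
    logits = model(token_ids)
    probs = F.softmax(logits[-1], dim=-1)             # pi(a|s) over the vocabulary
    action = torch.multinomial(probs, num_samples=1)  # sample the next token
    return action, probs[action]                      # token id and its probability
```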

This is the motivation behind Policy Gradient methods, and the topic of the next article.

RL Method Landscape

Interactive diagram: RL methods taxonomy. RL methods branch into Value-Based (the DQN family), Policy-Based (Policy Gradient → PPO / TRPO), and Actor-Critic; RLHF / DPO / GRPO sit on the LLM-alignment-related path (marked with purple dots).

As the diagram shows, LLM alignment (RLHF, DPO, GRPO) follows the Policy-Based → Actor-Critic → PPO path, not the Value-Based path. Understanding this evolutionary trajectory is key to learning the subsequent topics.

If you want to dive deeper into reinforcement learning, here are our curated resources:

Classic Textbooks

  • Sutton & Barto “Reinforcement Learning: An Introduction” (2nd Edition) — The bible of the RL field, available free online. Ideal for systematic study of MDP, dynamic programming, Monte Carlo methods, TD learning, and other fundamentals.
  • Csaba Szepesvari “Algorithms for Reinforcement Learning” — A more mathematically oriented concise textbook, suitable for readers who prefer theoretical derivations.

Video Courses

  • David Silver UCL RL Course — A classic course by DeepMind’s Chief Scientist. 10 lectures covering fundamentals through function approximation and policy gradients. Each lecture is about 1.5 hours; best paired with slides.
  • Sergey Levine UC Berkeley CS285 — A deep RL course focused on research frontiers, covering model-based RL, offline RL, and other advanced topics.
  • Hugging Face Deep RL Course — A free interactive course with hands-on practice, complete with companion code and assignment environments. Ideal for hands-on learners.
  • Stanford CS234 — Taught by Emma Brunskill, with a stronger focus on theory and analysis.

Blogs and Tutorials

  • Lilian Weng’s Blog Series — Covers RL fundamentals, Policy Gradient, RLHF, Reward Hacking, and more. Each post is a high-quality survey with excellent illustrations.
  • OpenAI Spinning Up — Official RL introduction tutorial, from concepts to code implementation. Particularly suited for those who want to understand algorithm details from scratch.
  • Andrej Karpathy “Pong from Pixels” — A classic introductory blog post that trains Pong from scratch using Policy Gradient, with excellent intuitive explanations.
  • Nathan Lambert (interconnects.ai) — An in-depth blog focused on RLHF and LLM alignment, tracking the latest research developments.
  • Chip Huyen’s RLHF Overview — An RLHF introductory article aimed at engineers — clear and direct.

Interactive Experiments

  • Gymnasium (formerly OpenAI Gym) — The standard RL environment library, providing classic environments like CartPole, MountainCar, and more.
  • CleanRL — Single-file RL implementations, one file per algorithm, ideal for learning and modification.

Summary

This article introduced the core concepts of reinforcement learning:

  1. The Agent-Environment loop is the fundamental framework of RL
  2. MDP provides the mathematical foundation for RL (states, actions, transitions, rewards, discount)
  3. The Bellman equation is the recursive relationship for value functions and the foundation of virtually all RL algorithms
  4. Q-Learning / DQN are classic Value-Based methods
  5. The unique characteristics of LLMs (huge action space, sequential decision-making) make Policy-Based methods more suitable

In the next article, we will dive into Policy Gradient to understand how to directly optimize a policy — the key bridge connecting classical RL and LLM alignment.