Reinforcement Learning: From Foundations to LLM Alignment & Reasoning
From MDP to Policy Gradient, from RLHF to GRPO, from Reward Modeling to Test-Time Scaling: a systematic guide to how reinforcement learning drives LLM alignment, optimization, and reasoning.
1. Reinforcement Learning Foundations: From Agent to Bellman Equation
   Intermediate · #reinforcement-learning #mdp #bellman-equation #value-function #q-learning
2. Policy Gradient: Directly Optimizing the Policy
   Intermediate · #policy-gradient #reinforce #baseline #variance-reduction #advantage
3. Actor-Critic and PPO: Stable Policy Optimization
   Advanced · #actor-critic #ppo #gae #advantage #clipping #trust-region
4. When RL Meets LLM: From Language Generation to Policy Optimization
   Intermediate · #reinforcement-learning #llm #post-training #rlhf #policy-optimization #alignment
5. RLHF: Learning from Human Feedback
   Advanced · #rlhf #reward-model #alignment #instruct-gpt #kl-divergence
6. From DPO to GRPO: Direct Preference Optimization
   Advanced · #dpo #grpo #ipo #preference-optimization #offline-rl
7. Reward Design and Scaling
   Advanced · #reward-model #reward-hacking #process-reward #outcome-reward #constitutional-ai
8. Test-Time Scaling and Reasoning Enhancement
   Advanced · #test-time-scaling #chain-of-thought #mcts #deepseek-r1 #thinking #verifier