Reinforcement Learning: From Foundations to LLM Alignment & Reasoning

From MDP to Policy Gradient, from RLHF to GRPO, from Reward Modeling to Test-Time Scaling: a systematic guide to how reinforcement learning drives LLM alignment, optimization, and reasoning.

  1. Reinforcement Learning Foundations: From Agent to Bellman Equation

     Intermediate
     #reinforcement-learning #mdp #bellman-equation #value-function #q-learning

  2. Policy Gradient: Directly Optimizing the Policy

     Intermediate
     #policy-gradient #reinforce #baseline #variance-reduction #advantage

  3. Actor-Critic and PPO: Stable Policy Optimization

     Advanced
     #actor-critic #ppo #gae #advantage #clipping #trust-region

  4. When RL Meets LLM: From Language Generation to Policy Optimization

     Intermediate
     #reinforcement-learning #llm #post-training #rlhf #policy-optimization #alignment

  5. RLHF: Learning from Human Feedback

     Advanced
     #rlhf #reward-model #alignment #instruct-gpt #kl-divergence

  6. From DPO to GRPO: Direct Preference Optimization

     Advanced
     #dpo #grpo #ipo #preference-optimization #offline-rl

  7. Reward Design and Scaling

     Advanced
     #reward-model #reward-hacking #process-reward #outcome-reward #constitutional-ai

  8. Test-Time Scaling and Reasoning Enhancement

     Advanced
     #test-time-scaling #chain-of-thought #mcts #deepseek-r1 #thinking #verifier