Resources
Reference materials organized by learning path, auto-aggregated from article citations.
Transformer Core Mechanisms (53 resources)
📄 Paper
Attention Is All You Need
arxiv.org · Source:
Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, Positional Encoding — Giving Transformers a Sense of Order
📄 Paper
Language Models are Unsupervised Multitask Learners (GPT-2)
cdn.openai.com · Source:
Transformer Architecture Overview, Sampling & Decoding — From Probabilities to Text
🌐 Website
The Illustrated Transformer
jalammar.github.io · Source:
Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, KV Cache Fundamentals
🌐 Website
LLM Visualization — Brendan Bycroft
bbycroft.net · Source:
Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, MQA and GQA, KV Cache Fundamentals, Prefill vs Decode Phases
🌐 Website
Transformer Explainer — Georgia Tech / Polo Club
poloclub.github.io · Source:
Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, MQA and GQA
📄 Paper
GLU Variants Improve Transformer
arxiv.org · Source:
Transformer Architecture Overview
📄 Paper
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
arxiv.org · Source:
MQA and GQA
📄 Paper
Fast Transformer Decoding: One Write-Head is All You Need
arxiv.org · Source:
MQA and GQA
📄 Paper
Mistral 7B
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA
📄 Paper
Gemma 2 Technical Report
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA
📄 Paper
Jamba: A Hybrid Transformer-Mamba Language Model
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA, Hybrid Architectures: Fusing Mamba with Attention
📄 Paper
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA
📄 Paper
Flamingo: a Visual Language Model for Few-Shot Learning
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA
📄 Paper
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA, Mixture of Experts: Sparsely Activated Large Model Architecture
📄 Paper
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA
📄 Paper
Retentive Network: A Successor to Transformer for Large Language Models
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA
📄 Paper
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA, Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
📄 Paper
Gated Delta Networks: Improving Mamba2 with Delta Rule
arxiv.org · Source:
Attention Variants: From Sliding Window to MLA, Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
📄 Paper
Efficient Memory Management for Large Language Model Serving with PagedAttention
arxiv.org · Source:
KV Cache Fundamentals
📄 Paper
Efficiently Scaling Transformer Inference
arxiv.org · Source:
Prefill vs Decode Phases
📄 Paper
LLM Inference Unveiled: Survey and Roofline Model Insights
arxiv.org · Source:
Prefill vs Decode Phases
📄 Paper
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
arxiv.org · Source:
Flash Attention Tiling Principles
📄 Paper
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
arxiv.org · Source:
Flash Attention Tiling Principles
📄 Paper
Self-Attention with Relative Position Representations
arxiv.org · Source:
Positional Encoding — Giving Transformers a Sense of Order
📄 Paper
RoFormer: Enhanced Transformer with Rotary Position Embedding
arxiv.org · Source:
Positional Encoding — Giving Transformers a Sense of Order
📄 Paper
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
arxiv.org · Source:
Positional Encoding — Giving Transformers a Sense of Order
📄 Paper
The Curious Case of Neural Text Degeneration
arxiv.org · Source:
Sampling & Decoding — From Probabilities to Text
📄 Paper
Hierarchical Neural Story Generation
arxiv.org · Source:
Sampling & Decoding — From Probabilities to Text
📄 Paper
Perplexity — a Measure of the Difficulty of Speech Recognition Tasks
ieeexplore.ieee.org · Source:
Sampling & Decoding — From Probabilities to Text
📄 Paper
Fast Inference from Transformers via Speculative Decoding
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
Accelerating Large Language Model Decoding with Speculative Sampling
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
Better & Faster Large Language Models via Multi-Token Prediction
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
DeepSeek-V3 Technical Report
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing, Mixture of Experts: Sparsely Activated Large Model Architecture
📄 Paper
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
arxiv.org · Source:
Speculative Decoding — Accelerating LLM Inference via Guessing
📄 Paper
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
arxiv.org · Source:
Mixture of Experts: Sparsely Activated Large Model Architecture
📄 Paper
Mixtral of Experts
arxiv.org · Source:
Mixture of Experts: Sparsely Activated Large Model Architecture
📄 Paper
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
arxiv.org · Source:
Mixture of Experts: Sparsely Activated Large Model Architecture
📄 Paper
Efficiently Modeling Long Sequences with Structured State Spaces (S4)
arxiv.org · Source:
State Space Models and Mamba
📄 Paper
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
arxiv.org · Source:
State Space Models and Mamba
📄 Paper
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
arxiv.org · Source:
State Space Models and Mamba
📄 Paper
HiPPO: Recurrent Memory with Optimal Polynomial Projections
arxiv.org · Source:
State Space Models and Mamba
📄 Paper
Hungry Hungry Hippos: Towards Language Modeling with State Space Models (H3)
arxiv.org · Source:
State Space Models and Mamba
📄 Paper
On the Parameterization and Initialization of Diagonal State Space Models (S4D)
arxiv.org · Source:
State Space Models and Mamba
🌐 Website
Zamba2-Small: A Hybrid SSM-Transformer Model
zyphra.com · Source:
Hybrid Architectures: Fusing Mamba with Attention
📄 Paper
Hymba: A Hybrid-head Architecture for Small Language Models
arxiv.org · Source:
Hybrid Architectures: Fusing Mamba with Attention
📄 Paper
An Empirical Study of Mamba-based Language Models
arxiv.org · Source:
Hybrid Architectures: Fusing Mamba with Attention
📄 Paper
Repeat After Me: Transformers are Better than State Space Models at Copying
arxiv.org · Source:
Hybrid Architectures: Fusing Mamba with Attention
📄 Paper
Qwen3 Technical Report
arxiv.org · Source:
Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
💻 Code
Ollama - Qwen3-Next Model Implementation
github.com · Source:
Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
Transformer Across Modalities (39 resources)
📄 Paper
Efficient Estimation of Word Representations in Vector Space
arxiv.org · Source:
From Text to Vectors: Tokenization and Word Embeddings
📄 Paper
Neural Machine Translation of Rare Words with Subword Units
arxiv.org · Source:
From Text to Vectors: Tokenization and Word Embeddings
📄 Paper
GloVe: Global Vectors for Word Representation
nlp.stanford.edu · Source:
From Text to Vectors: Tokenization and Word Embeddings
📄 Paper
SentencePiece: A simple and language independent subword tokenizer
arxiv.org · Source:
From Text to Vectors: Tokenization and Word Embeddings
🌐 Website
The Illustrated Word2Vec
jalammar.github.io · Source:
From Text to Vectors: Tokenization and Word Embeddings
🌐 Website
Hugging Face Tokenizer Summary
huggingface.co · Source:
From Text to Vectors: Tokenization and Word Embeddings
📄 Paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
arxiv.org · Source:
BERT and GPT: Two Paths — Understanding vs Generation
📄 Paper
Improving Language Understanding by Generative Pre-Training
cdn.openai.com · Source:
BERT and GPT: Two Paths — Understanding vs Generation
📄 Paper
Language Models are Unsupervised Multitask Learners
cdn.openai.com · Source:
BERT and GPT: Two Paths — Understanding vs Generation
📄 Paper
Language Models are Few-Shot Learners
arxiv.org · Source:
BERT and GPT: Two Paths — Understanding vs Generation
📄 Paper
BERT for Joint Intent Classification and Slot Filling
arxiv.org · Source:
BERT and GPT: Two Paths — Understanding vs Generation
📄 Paper
Scaling Laws for Neural Language Models
arxiv.org · Source:
BERT and GPT: Two Paths — Understanding vs Generation
📄 Paper
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
arxiv.org · Source:
Sentence Embeddings: From Token-Level to Semantic Retrieval
📄 Paper
Text Embeddings by Weakly-Supervised Contrastive Pre-training
arxiv.org · Source:
Sentence Embeddings: From Token-Level to Semantic Retrieval
📄 Paper
C-Pack: Packaged Resources To Advance General Chinese Embedding
arxiv.org · Source:
Sentence Embeddings: From Token-Level to Semantic Retrieval
📄 Paper
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
arxiv.org · Source:
Sentence Embeddings: From Token-Level to Semantic Retrieval
📄 Paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
arxiv.org · Source:
Vision Transformer: When Images Become Token Sequences
📄 Paper
Training data-efficient image transformers & distillation through attention
arxiv.org · Source:
Vision Transformer: When Images Become Token Sequences
📄 Paper
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
arxiv.org · Source:
Vision Transformer: When Images Become Token Sequences
📄 Paper
Learning Transferable Visual Models From Natural Language Supervision
arxiv.org · Source:
Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
📄 Paper
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
arxiv.org · Source:
Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
📄 Paper
Sigmoid Loss for Language Image Pre-Training
arxiv.org · Source:
Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
📄 Paper
Visual Instruction Tuning
arxiv.org · Source:
Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
📄 Paper
Denoising Diffusion Probabilistic Models
arxiv.org · Source:
Diffusion Model Fundamentals: Generating from Noise
📄 Paper
Denoising Diffusion Implicit Models
arxiv.org · Source:
Diffusion Model Fundamentals: Generating from Noise
📄 Paper
High-Resolution Image Synthesis with Latent Diffusion Models
arxiv.org · Source:
Diffusion Model Fundamentals: Generating from Noise
📄 Paper
Classifier-Free Diffusion Guidance
arxiv.org · Source:
Diffusion Model Fundamentals: Generating from Noise
📄 Paper
Scalable Diffusion Models with Transformers
arxiv.org · Source:
Diffusion Transformer: Image Generation with Transformers
📄 Paper
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
arxiv.org · Source:
Diffusion Transformer: Image Generation with Transformers
📄 Paper
Video generation models as world simulators
openai.com · Source:
Video Generation: Spatiotemporal Attention and the Sora Architecture
📄 Paper
Make-A-Video: Text-to-Video Generation without Text-Video Data
arxiv.org · Source:
Video Generation: Spatiotemporal Attention and the Sora Architecture
📄 Paper
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
arxiv.org · Source:
Video Generation: Spatiotemporal Attention and the Sora Architecture
📄 Paper
Robust Speech Recognition via Large-Scale Weak Supervision
arxiv.org · Source:
Speech and Transformers: From Whisper to VALL-E
📄 Paper
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
arxiv.org · Source:
Speech and Transformers: From Whisper to VALL-E
📄 Paper
High Fidelity Neural Audio Compression
arxiv.org · Source:
Speech and Transformers: From Whisper to VALL-E
📄 Paper
Simple and Controllable Music Generation
arxiv.org · Source:
Music Generation: When Transformers Learn to Compose
📄 Paper
Jukebox: A Generative Model for Music
arxiv.org · Source:
Music Generation: When Transformers Learn to Compose
📄 Paper
MusicLM: Generating Music From Text
arxiv.org · Source:
Music Generation: When Transformers Learn to Compose
📄 Paper
Fast Timing-Conditioned Latent Audio Diffusion
arxiv.org · Source:
Music Generation: When Transformers Learn to Compose
LLM Quantization Techniques (27 resources)
📄 Paper
A Survey of Quantization Methods for Efficient Neural Network Inference
arxiv.org · Source:
Quantization Fundamentals
📄 Paper
Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
arxiv.org · Source:
Quantization Fundamentals
📄 Paper
FP8 Formats for Deep Learning
arxiv.org · Source:
Quantization Fundamentals, Inference-Time Quantization: KV Cache and Activation Quantization
📄 Paper
GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers
arxiv.org · Source:
PTQ Weight Quantization: From GPTQ to AWQ, llama.cpp Quantization Methods
📄 Paper
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arxiv.org · Source:
PTQ Weight Quantization: From GPTQ to AWQ, llama.cpp Quantization Methods
📄 Paper
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
arxiv.org · Source:
PTQ Weight Quantization: From GPTQ to AWQ
📄 Paper
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
arxiv.org · Source:
Quantization-Aware Training (QAT)
📄 Paper
BitNet: Scaling 1-bit Transformers for Large Language Models
arxiv.org · Source:
Quantization-Aware Training (QAT)
📄 Paper
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arxiv.org · Source:
Quantization-Aware Training (QAT)
📄 Paper
QLoRA: Efficient Finetuning of Quantized LLMs
arxiv.org · Source:
Quantization-Aware Training (QAT)
📄 Paper
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
arxiv.org · Source:
Quantization-Aware Training (QAT)
📄 Paper
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
arxiv.org · Source:
Inference-Time Quantization: KV Cache and Activation Quantization
📄 Paper
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
arxiv.org · Source:
Inference-Time Quantization: KV Cache and Activation Quantization
💻 Code
llama.cpp Quantization Types
github.com · Source:
llama.cpp Quantization Methods
💻 Code
K-quant PR
github.com · Source:
llama.cpp Quantization Methods
🌐 Website
NVIDIA Model Optimizer GitHub
github.com · Source:
Quantization and Model Conversion Toolchain Landscape
🌐 Website
vLLM Quantization - LLM Compressor
github.com · Source:
Quantization and Model Conversion Toolchain Landscape
🌐 Website
Microsoft Olive Documentation - Why Olive
microsoft.github.io · Source:
Quantization and Model Conversion Toolchain Landscape
🌐 Website
Apple coremltools Optimization Overview
apple.github.io · Source:
Quantization and Model Conversion Toolchain Landscape
🌐 Website
AMD Quark Documentation
quark.docs.amd.com · Source:
Quantization and Model Conversion Toolchain Landscape
🌐 Website
Google AI Edge Torch GitHub
github.com · Source:
Quantization and Model Conversion Toolchain Landscape
🌐 Website
NNCF GitHub Repository
github.com · Source:
Quantization and Model Conversion Toolchain Landscape
🌐 Website
Optimum Intel Documentation
huggingface.co · Source:
Quantization and Model Conversion Toolchain Landscape, Hands-On: HF → GGUF / ONNX / OpenVINO — Three End-to-End Paths
🌐 Website
llama.cpp GitHub Repository
github.com · Source:
Hands-On: HF → GGUF / ONNX / OpenVINO — Three End-to-End Paths
🌐 Website
ONNX Runtime Documentation
onnxruntime.ai · Source:
Hands-On: HF → GGUF / ONNX / OpenVINO — Three End-to-End Paths
🌐 Website
OpenVINO Documentation
docs.openvino.ai · Source:
Hands-On: HF → GGUF / ONNX / OpenVINO — Three End-to-End Paths
🌐 Website
lm-evaluation-harness
github.com · Source:
Hands-On: HF → GGUF / ONNX / OpenVINO — Three End-to-End Paths
vLLM + SGLang Inference Engine Deep Dive (13 resources)
📄 Paper
Efficient Memory Management for Large Language Model Serving with PagedAttention
arxiv.org · Source:
LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM, PagedAttention and Continuous Batching, Scheduling and Preemption: The Inference Engine Scheduler
📄 Paper
SGLang: Efficient Execution of Structured Language Model Programs
arxiv.org · Source:
LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM, Prefix Caching and RadixAttention, SGLang Programming Model and Structured Output
🌐 Website
NVIDIA TensorRT-LLM Documentation
nvidia.github.io · Source:
LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM
🌐 Website
Ollama GitHub Repository
github.com · Source:
LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM
📄 Paper
Orca: A Distributed Serving System for Transformer-Based Generative Models
arxiv.org · Source:
PagedAttention and Continuous Batching
🌐 Website
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
blog.vllm.ai · Source:
PagedAttention and Continuous Batching
📄 Paper
Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
arxiv.org · Source:
Scheduling and Preemption: The Inference Engine Scheduler
📄 Paper
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
arxiv.org · Source:
Scheduling and Preemption: The Inference Engine Scheduler
🌐 Website
vLLM Automatic Prefix Caching
docs.vllm.ai · Source:
Prefix Caching and RadixAttention
📄 Paper
Trie Memory
dl.acm.org · Source:
Prefix Caching and RadixAttention
📄 Paper
Efficient Guided Generation for Large Language Models
arxiv.org · Source:
SGLang Programming Model and Structured Output
🌐 Website
Fast JSON Decoding for Local LLMs with Compressed Finite State Machine
lmsys.org · Source:
SGLang Programming Model and Structured Output
🌐 Website
SGLang Documentation — Structured Outputs
docs.sglang.io · Source:
SGLang Programming Model and Structured Output
Model Routing Deep Dive (16 resources)
📄 Paper
RouteLLM: Learning to Route LLMs with Preference Data
arxiv.org · Source:
Model Routing Landscape: Why One Model Isn't Enough, Routing Classifiers: Letting Small Models Decide Who Answers, RouteLLM in Practice: From Preference Data to Production Routing, Factorization Machines and LLM Routing: From FM Theory to MF Router, Online Learning and Cost Optimization: Routers Need to Evolve Too
📄 Paper
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
arxiv.org · Source:
Model Routing Landscape: Why One Model Isn't Enough, Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
📄 Paper
AutoMix: Automatically Mixing Language Models
arxiv.org · Source:
Model Routing Landscape: Why One Model Isn't Enough, Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
🌐 Website
RouteLLM GitHub Repository
github.com · Source:
Model Routing Landscape: Why One Model Isn't Enough, RouteLLM in Practice: From Preference Data to Production Routing
📄 Paper
Evaluating Small Language Models for Front-Door Routing
arxiv.org · Source:
Routing Classifiers: Letting Small Models Decide Who Answers
🌐 Website
semantic-router: Superfast Decision-Making Layer
github.com · Source:
Routing Classifiers: Letting Small Models Decide Who Answers
📄 Paper
Factorization Machines
csie.ntu.edu.tw · Source:
Factorization Machines and LLM Routing: From FM Theory to MF Router
📄 Paper
Factorization Machines with libFM
dl.acm.org · Source:
Factorization Machines and LLM Routing: From FM Theory to MF Router
📄 Paper
Confidence-Driven LLM Router
arxiv.org · Source:
Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
📄 Paper
ConsRoute: Consistency-Driven LLM Routing for Cloud-Edge-Device
arxiv.org · Source:
Hybrid LLM: Intelligent Routing Between Local and Cloud
📄 Paper
HybridFlow: Subtask-level DAG Routing
arxiv.org · Source:
Hybrid LLM: Intelligent Routing Between Local and Cloud
📄 Paper
PRISM: Privacy-Sensitive Entity-Level LLM Routing
arxiv.org · Source:
Hybrid LLM: Intelligent Routing Between Local and Cloud
📄 Paper
Bridging On-Device and Cloud LLMs for Collaborative Reasoning
arxiv.org · Source:
Hybrid LLM: Intelligent Routing Between Local and Cloud
📄 Paper
Robust Batch-Level LLM Routing
arxiv.org · Source:
Online Learning and Cost Optimization: Routers Need to Evolve Too
📄 Paper
Council Mode: Multi-LLM Collaboration for Hallucination Reduction
arxiv.org · Source:
Multi-Model Collaboration: From Picking One to Using Many
🌐 Website
Mixture of Agents - Together AI
together.ai · Source:
Multi-Model Collaboration: From Picking One to Using Many
LLM Evaluation and Benchmarks Deep Dive (27 resources)
📄 Paper
Measuring Massive Multitask Language Understanding (MMLU)
arxiv.org · Source:
Benchmark Landscape and Evaluation Methodology
🌐 Website
lm-evaluation-harness
github.com · Source:
Benchmark Landscape and Evaluation Methodology, Impact of Optimization on Accuracy, lm-eval-harness Practical Guide
📄 Paper
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
arxiv.org · Source:
Benchmark Landscape and Evaluation Methodology
🌐 Website
LiveBench
livebench.ai · Source:
Benchmark Landscape and Evaluation Methodology, Interpreting Leaderboards and Model Selection
📄 Paper
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
arxiv.org · Source:
Knowledge & Reasoning Benchmarks
📄 Paper
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
arxiv.org · Source:
Knowledge & Reasoning Benchmarks
📄 Paper
Measuring Mathematical Problem Solving With the MATH Dataset
arxiv.org · Source:
Knowledge & Reasoning Benchmarks
📄 Paper
Evaluating Large Language Models Trained on Code
arxiv.org · Source:
Code Benchmarks
📄 Paper
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
arxiv.org · Source:
Code Benchmarks, SWE-bench Practical Guide
📄 Paper
Is Your Code Generated by ChatGPT Really Correct? (EvalPlus)
arxiv.org · Source:
Code Benchmarks
🌐 Website
Berkeley Function Calling Leaderboard (BFCL)
gorilla.cs.berkeley.edu · Source:
Agent & Tool Use Benchmarks, BFCL Practical Guide
📄 Paper
GAIA: A Benchmark for General AI Assistants
arxiv.org · Source:
Agent & Tool Use Benchmarks
📄 Paper
WebArena: A Realistic Web Environment for Building Autonomous Agents
arxiv.org · Source:
Agent & Tool Use Benchmarks
🌐 Website
Google Gemma 2 Technical Report
ai.google.dev · Source:
Anatomy of Model Release Benchmark Standard Sets
📄 Paper
Microsoft Phi-3 Technical Report
arxiv.org · Source:
Anatomy of Model Release Benchmark Standard Sets
📄 Paper
Qwen2.5 Technical Report
arxiv.org · Source:
Anatomy of Model Release Benchmark Standard Sets
🌐 Website
Meta Llama 3.1 Model Card
huggingface.co · Source:
Anatomy of Model Release Benchmark Standard Sets
🌐 Website
Open LLM Leaderboard
huggingface.co · Source:
Anatomy of Model Release Benchmark Standard Sets, Interpreting Leaderboards and Model Selection
🌐 Website
OpenVINO Neural Network Compression Framework (NNCF)
github.com · Source:
Impact of Optimization on Accuracy
🌐 Website
Optimum Intel
huggingface.co · Source:
Impact of Optimization on Accuracy
🌐 Website
llama.cpp
github.com · Source:
Impact of Optimization on Accuracy
🌐 Website
Chatbot Arena (LMSYS)
lmarena.ai · Source:
Interpreting Leaderboards and Model Selection
🌐 Website
Artificial Analysis LLM Leaderboard
artificialanalysis.ai · Source:
Interpreting Leaderboards and Model Selection
🌐 Website
lm-eval Documentation
lm-evaluation-harness.readthedocs.io · Source:
lm-eval-harness Practical Guide
🌐 Website
SWE-bench GitHub
github.com · Source:
SWE-bench Practical Guide
🌐 Website
SWE-agent GitHub
github.com · Source:
SWE-bench Practical Guide
🌐 Website
Gorilla / BFCL GitHub
github.com · Source:
BFCL Practical Guide
Ollama + llama.cpp Deep Dive (20 resources)
💻 Code
Ollama GitHub
github.com · Source:
Ollama + llama.cpp Architecture Overview, The Complete Journey of a Single Inference, KV Cache and Batch Scheduling, Server Layer and Scheduling
💻 Code
llama.cpp GitHub
github.com · Source:
Ollama + llama.cpp Architecture Overview, The Complete Journey of a Single Inference, Compute Graphs and Inference Engines
💻 Code
GGML GitHub
github.com · Source:
Ollama + llama.cpp Architecture Overview, Compute Graphs and Inference Engines, Hardware Backends
📄 Paper
Qwen3 Technical Report
arxiv.org · Source:
The Complete Journey of a Single Inference
💻 Code
GGUF Specification
github.com · Source:
The GGUF Model Format
🌐 Website
Safetensors Documentation
huggingface.co · Source:
The GGUF Model Format
🌐 Website
ONNX
onnx.ai · Source:
The GGUF Model Format
💻 Code
llama.cpp Quantization Types
github.com · Source:
llama.cpp Quantization Methods
💻 Code
K-quant PR
github.com · Source:
llama.cpp Quantization Methods
📄 Paper
GPTQ: Accurate Post-Training Quantization
arxiv.org · Source:
llama.cpp Quantization Methods
📄 Paper
AWQ: Activation-aware Weight Quantization
arxiv.org · Source:
llama.cpp Quantization Methods
📄 Paper
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
arxiv.org · Source:
Compute Graphs and Inference Engines
📄 Paper
Efficient Memory Management for LLM Serving with PagedAttention
arxiv.org · Source:
KV Cache and Batch Scheduling
🌐 Website
CUDA Programming Guide
docs.nvidia.com · Source:
Hardware Backends
🌐 Website
Metal Shading Language
developer.apple.com · Source:
Hardware Backends
🌐 Website
Vulkan Compute
khronos.org · Source:
Hardware Backends
💻 Code
Ollama FAQ
github.com · Source:
Server Layer and Scheduling
💻 Code
Ollama Modelfile
github.com · Source:
Model Ecosystem
💻 Code
Ollama API
github.com · Source:
Model Ecosystem
📄 Paper
LLaVA: Visual Instruction Tuning
arxiv.org · Source:
Model Ecosystem
llama.cpp Source Code Walkthrough (1 resource)
AI Compute Stack (21 resources)
🌐 Website
NVIDIA CUDA C++ Programming Guide
docs.nvidia.com · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA, GPU Architecture — From Transistors to Threads, CUDA Programming Model — From Code to Hardware
🌐 Website
Khronos OpenCL Specification
khronos.org · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA
🌐 Website
Khronos SYCL Specification
khronos.org · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA
🌐 Website
Intel oneAPI Level Zero Specification
spec.oneapi.io · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA
🌐 Website
AMD ROCm HIP Programming Guide
rocm.docs.amd.com · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA
🌐 Website
Apple Metal Shading Language Specification
developer.apple.com · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA
💻 Code
ggml / llama.cpp
github.com · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA
🌐 Website
ONNX Runtime Documentation
onnxruntime.ai · Source:
AI Compute Stack Overview — From Inference Frameworks to Hardware ISA
🌐 Website
NVIDIA H100 Tensor Core GPU Architecture Whitepaper
resources.nvidia.com · Source:
GPU Architecture — From Transistors to Threads, Matrix Acceleration Units — Tensor Core and XMX
📄 Paper
Why Systolic Architectures? — H.T. Kung
cs.virginia.edu · Source:
Matrix Acceleration Units — Tensor Core and XMX
🌐 Website
NVIDIA PTX ISA — Matrix Multiply-Accumulate
docs.nvidia.com · Source:
Matrix Acceleration Units — Tensor Core and XMX
🌐 Website
Intel Xe2 Architecture — Xe-Core and XMX
intel.com · Source:
Matrix Acceleration Units — Tensor Core and XMX, CUDA Programming Model — From Code to Hardware
📄 Paper
DeepSeek-V3 Technical Report
arxiv.org · Source:
Matrix Acceleration Units — Tensor Core and XMX
🌐 Website
NVIDIA Kernel Profiling Guide — Memory Coalescing
docs.nvidia.com · Source:
CUDA Programming Model — From Code to Hardware
🌐 Website
CUDA Occupancy Calculator
docs.nvidia.com · Source:
CUDA Programming Model — From Code to Hardware
🌐 Website
SYCL 2020 Specification
registry.khronos.org · Source:
CUDA Programming Model — From Code to Hardware
🌐 Website
CUTLASS: Fast Linear Algebra in CUDA C++
github.com · Source:
GEMM Optimization — From Naive to Peak Performance
🌐 Website
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance
siboehm.com · Source:
GEMM Optimization — From Naive to Peak Performance
🌐 Website
CUDA C++ Programming Guide — Warp Matrix Functions
docs.nvidia.com · Source:
GEMM Optimization — From Naive to Peak Performance
🌐 Website
Intel oneAPI DPC++ — joint_matrix Extension
github.com · Source:
GEMM Optimization — From Naive to Peak Performance
📄 Paper
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
arxiv.org · Source:
GEMM Optimization — From Naive to Peak Performance
Graph Compilation & Optimization 79 resources
π Blog
PyTorch 2.0: Our next generation release
pytorch.org Β· Source:
Panorama: The World of ML Compilers , Graph Capture: TorchDynamo, AOTAutograd & Functionalization
π Website
MLIR: Multi-Level Intermediate Representation
mlir.llvm.org Β· Source:
Panorama: The World of ML Compilers
π Website
Triton Language and Compiler
triton-lang.org Β· Source:
Panorama: The World of ML Compilers
π Paper
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
arxiv.org Β· Source:
Panorama: The World of ML Compilers
π Paper
MLIR: A Compiler Infrastructure for the End of Moore's Law
arxiv.org Β· Source:
Panorama: The World of ML Compilers , IR Design (Part 1): SSA, FX IR & MLIR Dialects , IR Design (Part 2): Progressive Lowering and Multi-Level IR
π Blog
TorchDynamo: An Experiment in Dynamic Python Bytecode Transformation
dev-discuss.pytorch.org Β· Source:
Panorama: The World of ML Compilers , Graph Capture: TorchDynamo, AOTAutograd & Functionalization
π Website
PEP 523 β Adding a frame evaluation API to CPython
peps.python.org Β· Source:
Graph Capture: TorchDynamo, AOTAutograd & Functionalization
π Website
torch.compiler β PyTorch Documentation
pytorch.org Β· Source:
Graph Capture: TorchDynamo, AOTAutograd & Functionalization
π Website
AOT Autograd β How to use and optimize?
pytorch.org Β· Source:
Graph Capture: TorchDynamo, AOTAutograd & Functionalization
π Paper
Efficiently Computing Static Single Assignment Form and the Control Dependence Graph
dl.acm.org Β· Source:
IR Design (Part 1): SSA, FX IR & MLIR Dialects , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
π Website
torch.fx β PyTorch Documentation
pytorch.org Β· Source:
IR Design (Part 1): SSA, FX IR & MLIR Dialects , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
π Website
MLIR Language Reference
mlir.llvm.org Β· Source:
IR Design (Part 1): SSA, FX IR & MLIR Dialects
π Website
MLIR Dialects
mlir.llvm.org Β· Source:
IR Design (Part 1): SSA, FX IR & MLIR Dialects
π Website
MLIR Dialect Conversion
mlir.llvm.org Β· Source:
IR Design (Part 2): Progressive Lowering and Multi-Level IR
π Website
MLIR Bufferization
mlir.llvm.org Β· Source:
IR Design (Part 2): Progressive Lowering and Multi-Level IR
π Website
MLIR Pass Infrastructure
mlir.llvm.org Β· Source:
IR Design (Part 2): Progressive Lowering and Multi-Level IR , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
π» Code
torch-mlir: PyTorch to MLIR compiler
github.com Β· Source:
IR Design (Part 2): Progressive Lowering and Multi-Level IR
π Paper
A Unified Approach to Global Program Optimization
dl.acm.org Β· Source:
Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
π Website
MLIR Canonicalization
mlir.llvm.org Β· Source:
Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
π Paper
Constant Propagation with Conditional Branches
dl.acm.org Β· Source:
Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
Website
PyTorch FX Subgraph Rewriter
pytorch.org · Source:
Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
Website
MLIR Declarative Rewrite Rules (DRR)
mlir.llvm.org · Source:
Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
Website
MLIR PDL – Pattern Description Language
mlir.llvm.org · Source:
Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
Website
torch.fx – Subgraph Rewriting
pytorch.org · Source:
Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
Paper
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
arxiv.org · Source:
Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
Website
NVIDIA Tensor Core Programming
docs.nvidia.com · Source:
Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
Paper
A Practical Automatic Polyhedral Parallelizer and Locality Optimizer
dl.acm.org · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Website
MLIR Affine Dialect
mlir.llvm.org · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Paper
Polyhedral Compilation as a Design Pattern for Compiler Construction
link.springer.com · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Website
MLIR Transform Dialect
mlir.llvm.org · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations, Autotuning and End-to-End Practice
Paper
Optimizing Compilers for Modern Architectures
elsevier.com · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Website
Polly - Polyhedral optimizations for LLVM
polly.llvm.org · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Paper
Integer Set Library: A Library for Manipulating Integer Sets
libisl.sourceforge.io · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Website
MLIR Linalg Dialect
mlir.llvm.org · Source:
Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations, Operator Fusion (Part II): Cost Models & Fusion in Practice
Paper
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
arxiv.org · Source:
Operator Fusion (Part I): Taxonomy & Decision Algorithms, Operator Fusion (Part II): Cost Models & Fusion in Practice, Tiling Strategies & Memory Hierarchy Optimization
Paper
PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
dl.acm.org · Source:
Operator Fusion (Part I): Taxonomy & Decision Algorithms, Operator Fusion (Part II): Cost Models & Fusion in Practice, Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
Website
TorchInductor: a PyTorch-native Compiler
dev-discuss.pytorch.org · Source:
Operator Fusion (Part I): Taxonomy & Decision Algorithms
Paper
XLA: Optimizing Compiler for Machine Learning
tensorflow.org · Source:
Operator Fusion (Part I): Taxonomy & Decision Algorithms
Website
Roofline Model
docs.nersc.gov · Source:
Operator Fusion (Part I): Taxonomy & Decision Algorithms, Operator Fusion (Part II): Cost Models & Fusion in Practice
Paper
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
arxiv.org · Source:
Operator Fusion (Part II): Cost Models & Fusion in Practice
Paper
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
arxiv.org · Source:
Operator Fusion (Part II): Cost Models & Fusion in Practice
Paper
Roofline: An Insightful Visual Performance Model for Multicore Architectures
www2.eecs.berkeley.edu · Source:
Tiling Strategies & Memory Hierarchy Optimization
Website
NVIDIA CUDA C++ Programming Guide – Shared Memory
docs.nvidia.com · Source:
Tiling Strategies & Memory Hierarchy Optimization
Website
CUTLASS: CUDA Templates for Linear Algebra Subroutines
github.com · Source:
Tiling Strategies & Memory Hierarchy Optimization, Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
Website
Triton Language Documentation
triton-lang.org · Source:
Tiling Strategies & Memory Hierarchy Optimization, Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
Website
NVIDIA A100 GPU Architecture Whitepaper
images.nvidia.com · Source:
Tiling Strategies & Memory Hierarchy Optimization
Website
torch.compile Dynamic Shapes Documentation
pytorch.org · Source:
Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
Website
TorchDynamo Deep Dive
pytorch.org · Source:
Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
Website
MLIR Tensor Type – Dynamic Dimensions
mlir.llvm.org · Source:
Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
Website
NVIDIA CUDA C++ Programming Guide – PTX ISA
docs.nvidia.com · Source:
Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
Website
LLVM Code Generator Documentation
llvm.org · Source:
Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
Website
NVIDIA GPU Architecture – Execution Units
docs.nvidia.com · Source:
Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
Paper
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
eecs.harvard.edu · Source:
Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation, Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness, Autotuning and End-to-End Practice
Website
MLIR GPU Dialect
mlir.llvm.org · Source:
Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
Website
IREE Compiler and Runtime
iree.dev · Source:
Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
Website
TensorRT Developer Guide
docs.nvidia.com · Source:
Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
Website
What Every Computer Scientist Should Know About Floating-Point Arithmetic
docs.oracle.com · Source:
Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
Paper
A Survey of Quantization Methods for Efficient Neural Network Inference
arxiv.org · Source:
Quantization Compilation and Mixed-Precision Optimization
Paper
GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers
arxiv.org · Source:
Quantization Compilation and Mixed-Precision Optimization
Paper
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arxiv.org · Source:
Quantization Compilation and Mixed-Precision Optimization
Paper
FP8 Formats for Deep Learning
arxiv.org · Source:
Quantization Compilation and Mixed-Precision Optimization
Website
PyTorch Quantization Documentation
pytorch.org · Source:
Quantization Compilation and Mixed-Precision Optimization
Website
TensorRT Quantization Toolkit
docs.nvidia.com · Source:
Quantization Compilation and Mixed-Precision Optimization
Paper
GSPMD: General and Scalable Parallelization for ML Computation Graphs
arxiv.org · Source:
Distributed Compilation and Graph Partitioning
Paper
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
arxiv.org · Source:
Distributed Compilation and Graph Partitioning
Paper
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
arxiv.org · Source:
Distributed Compilation and Graph Partitioning
Paper
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
arxiv.org · Source:
Distributed Compilation and Graph Partitioning
Website
PyTorch Distributed Overview
pytorch.org · Source:
Distributed Compilation and Graph Partitioning
Website
XLA SPMD Partitioner
openxla.org · Source:
Distributed Compilation and Graph Partitioning
Website
CUDA C++ Programming Guide – Streams
docs.nvidia.com · Source:
Scheduling and Execution Optimization
Website
CUDA Graphs
docs.nvidia.com · Source:
Scheduling and Execution Optimization
Paper
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
arxiv.org · Source:
Scheduling and Execution Optimization
Paper
Dynamic Tensor Rematerialization
arxiv.org · Source:
Scheduling and Execution Optimization
Website
TorchInductor: a PyTorch-native Compiler
dev-discuss.pytorch.org · Source:
Scheduling and Execution Optimization
Website
PyTorch Activation Checkpointing
pytorch.org · Source:
Scheduling and Execution Optimization
Paper
Ansor: Generating High-Performance Tensor Programs for Deep Learning
arxiv.org · Source:
Autotuning and End-to-End Practice
Paper
Learning to Optimize Tensor Programs
arxiv.org · Source:
Autotuning and End-to-End Practice
Website
Triton Autotune Documentation
triton-lang.org · Source:
Autotuning and End-to-End Practice
Website
torch.compile Troubleshooting
pytorch.org · Source:
Autotuning and End-to-End Practice
Website
Reinforcement Learning: An Introduction (Sutton & Barto, 2nd Edition)
incompleteideas.net · Source:
Reinforcement Learning Foundations: From Agent to Bellman Equation
Website
David Silver UCL Reinforcement Learning Course
davidsilver.uk · Source:
Reinforcement Learning Foundations: From Agent to Bellman Equation
Website
OpenAI Spinning Up: Introduction to RL
spinningup.openai.com · Source:
Reinforcement Learning Foundations: From Agent to Bellman Equation
Website
Hugging Face Deep RL Course
huggingface.co · Source:
Reinforcement Learning Foundations: From Agent to Bellman Equation, Test-Time Scaling and Reasoning Enhancement
Website
A (Long) Peek into Reinforcement Learning – Lilian Weng
lilianweng.github.io · Source:
Reinforcement Learning Foundations: From Agent to Bellman Equation
Website
UC Berkeley CS285: Deep Reinforcement Learning
rail.eecs.berkeley.edu · Source:
Reinforcement Learning Foundations: From Agent to Bellman Equation, Policy Gradient: Directly Optimizing the Policy, Actor-Critic and PPO: Stable Policy Optimization
Website
Deep Reinforcement Learning: Pong from Pixels – Andrej Karpathy
karpathy.github.io · Source:
Reinforcement Learning Foundations: From Agent to Bellman Equation, Test-Time Scaling and Reasoning Enhancement
Paper
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992)
link.springer.com · Source:
Policy Gradient: Directly Optimizing the Policy
Paper
Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton et al., 1999)
proceedings.neurips.cc · Source:
Policy Gradient: Directly Optimizing the Policy
Website
Policy Gradient Algorithms – Lilian Weng
lilianweng.github.io · Source:
Policy Gradient: Directly Optimizing the Policy, Actor-Critic and PPO: Stable Policy Optimization, When RL Meets LLM: From Language Generation to Policy Optimization
Website
OpenAI Spinning Up: Vanilla Policy Gradient
spinningup.openai.com · Source:
Policy Gradient: Directly Optimizing the Policy
Paper
Proximal Policy Optimization Algorithms (Schulman et al., 2017)
arxiv.org · Source:
Actor-Critic and PPO: Stable Policy Optimization
Paper
High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2016)
arxiv.org · Source:
Actor-Critic and PPO: Stable Policy Optimization
Paper
Trust Region Policy Optimization (Schulman et al., 2015)
arxiv.org · Source:
Actor-Critic and PPO: Stable Policy Optimization
Website
Hugging Face Deep RL Course: PPO
huggingface.co · Source:
Actor-Critic and PPO: Stable Policy Optimization
Paper
Training language models to follow instructions with human feedback (Ouyang et al., 2022)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization, RLHF: Learning from Human Feedback
Paper
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization, From DPO to GRPO: Direct Preference Optimization
Paper
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization, From DPO to GRPO: Direct Preference Optimization
Paper
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization, Test-Time Scaling and Reasoning Enhancement
Paper
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (Ross et al., 2011)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization
Paper
Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization, RLHF: Learning from Human Feedback
Paper
Learning to summarize from human feedback (Stiennon et al., 2020)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization
Website
RLHF: Reinforcement Learning from Human Feedback – Chip Huyen
huyenchip.com · Source:
When RL Meets LLM: From Language Generation to Policy Optimization, RLHF: Learning from Human Feedback
Paper
Let's Verify Step by Step (Lightman et al., 2023)
arxiv.org · Source:
When RL Meets LLM: From Language Generation to Policy Optimization, Reward Design and Scaling, Test-Time Scaling and Reasoning Enhancement
Paper
Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)
arxiv.org · Source:
RLHF: Learning from Human Feedback
Website
RLHF Series – Nathan Lambert (interconnects.ai)
interconnects.ai · Source:
RLHF: Learning from Human Feedback, Reward Design and Scaling
Website
Reward Hacking in Reinforcement Learning – Lilian Weng
lilianweng.github.io · Source:
RLHF: Learning from Human Feedback, From DPO to GRPO: Direct Preference Optimization, Reward Design and Scaling
Paper
A General Theoretical Paradigm to Understand Learning from Human Feedback (Azar et al., 2023)
arxiv.org · Source:
From DPO to GRPO: Direct Preference Optimization
Paper
KTO: Model Alignment as Prospect Theoretic Optimization (Ethayarajh et al., 2024)
arxiv.org · Source:
From DPO to GRPO: Direct Preference Optimization
Website
Hugging Face TRL Documentation: DPO Trainer
huggingface.co · Source:
From DPO to GRPO: Direct Preference Optimization
Paper
Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021)
arxiv.org · Source:
Reward Design and Scaling
Paper
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
arxiv.org · Source:
Reward Design and Scaling
Paper
Scaling Laws for Reward Model Overoptimization (Gao et al., 2022)
arxiv.org · Source:
Reward Design and Scaling
Paper
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al., 2024)
arxiv.org · Source:
Test-Time Scaling and Reasoning Enhancement
Paper
AlphaZero-like Tree-Search can Guide Large Language Model Decoding and Training (Feng et al., 2024)
arxiv.org · Source:
Test-Time Scaling and Reasoning Enhancement
Website
Intel Xe2 Architecture – Intel
intel.com · Source:
Xe2 GPU Architecture
Website
Intel Data Center GPU Max Series Architecture – Intel
intel.com · Source:
Xe2 GPU Architecture
Website
oneAPI GPU Optimization Guide – Intel
intel.com · Source:
Xe2 GPU Architecture
Website
oneAPI GPU Optimization Guide – Thread Hierarchy – Intel
intel.com · Source:
Xe2 Execution Model and Programming Abstractions
Website
SYCL 2020 Specification – Khronos Group
registry.khronos.org · Source:
Xe2 Execution Model and Programming Abstractions
Website
Intel GPU Occupancy Calculator – Intel
intel.com · Source:
Xe2 Execution Model and Programming Abstractions
Website
DPC++ Language Extensions for SYCL – Intel
github.com · Source:
Xe2 Execution Model and Programming Abstractions
Website
SPIR-V Specification – Khronos Group
registry.khronos.org · Source:
SPIR-V Compilation and Level Zero Runtime
Website
oneAPI Level Zero Specification – Intel
spec.oneapi.io · Source:
SPIR-V Compilation and Level Zero Runtime, LLM Inference on NPU: KV Cache and the Software Stack, NPU Execution Model and the Boundaries of Its Programming Model
Code
Intel Graphics Compiler (IGC) – GitHub
github.com · Source:
SPIR-V Compilation and Level Zero Runtime
Website
SPIR-V Guide – Khronos
github.com · Source:
SPIR-V Compilation and Level Zero Runtime
Website
oneDNN Developer Guide – Intel
oneapi-src.github.io · Source:
oneDNN Primitive System
Code
oneAPI Deep Neural Network Library (oneDNN) – GitHub
github.com · Source:
oneDNN Primitive System
Website
oneDNN Programming Model – Intel
oneapi-src.github.io · Source:
oneDNN Primitive System
Website
Memory Format Propagation – oneDNN
oneapi-src.github.io · Source:
oneDNN Primitive System
Website
oneDNN Performance Profiling and Inspection – Intel
oneapi-src.github.io · Source:
oneDNN GPU Kernel Optimization
Website
oneAPI GPU Optimization Guide – GEMM – Intel
intel.com · Source:
oneDNN GPU Kernel Optimization
Website
XMX and XVE Architecture – Intel
intel.com · Source:
oneDNN GPU Kernel Optimization
Website
OpenVINO Architecture – Intel
docs.openvino.ai · Source:
OpenVINO Graph Optimization Pipeline
Website
OpenVINO GPU Plugin – Intel
docs.openvino.ai · Source:
OpenVINO Graph Optimization Pipeline
Code
OpenVINO Toolkit – GitHub
github.com · Source:
OpenVINO Graph Optimization Pipeline
Website
Optimum Intel Documentation
huggingface.co · Source:
Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO, Hands-On: HF → GGUF / ONNX / OpenVINO – Three End-to-End Paths
Website
NNCF GitHub Repository
github.com · Source:
Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
Website
NNCF API Documentation
openvinotoolkit.github.io · Source:
Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
Website
OpenVINO Model Conversion
docs.openvino.ai · Source:
Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO, Hands-On: HF → GGUF / ONNX / OpenVINO – Three End-to-End Paths
Website
Optimum Intel Source – Quantization
github.com · Source:
Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
Website
Intel VTune Profiler – GPU Analysis – Intel
intel.com · Source:
Performance Analysis and Bottleneck Diagnosis
Website
OpenVINO Benchmark Tool – Intel
docs.openvino.ai · Source:
Performance Analysis and Bottleneck Diagnosis
Website
Intel GPU Top – intel_gpu_top man page
manpages.ubuntu.com · Source:
Performance Analysis and Bottleneck Diagnosis
Website
OpenVINO Multi-Device Execution – Intel
docs.openvino.ai · Source:
NPU Architecture and GPU+NPU Co-Inference
Website
OpenVINO AUTO Device – Intel
docs.openvino.ai · Source:
NPU Architecture and GPU+NPU Co-Inference
Website
Intel NPU Device – OpenVINO Documentation
docs.openvino.ai · Source:
NPU Architecture and GPU+NPU Co-Inference, LLM Inference on NPU: KV Cache and the Software Stack
Website
Heterogeneous Execution – OpenVINO Docs
docs.openvino.ai · Source:
NPU Architecture and GPU+NPU Co-Inference
Website
OpenVINO GenAI – Stateful LLM Pipeline
docs.openvino.ai · Source:
LLM Inference on NPU: KV Cache and the Software Stack
Website
openvinotoolkit/npu_compiler – GitHub
github.com · Source:
LLM Inference on NPU: KV Cache and the Software Stack, NPU Execution Model and the Boundaries of Its Programming Model
Paper
FlashAttention – Tri Dao et al.
arxiv.org · Source:
NPU Execution Model and the Boundaries of Its Programming Model
Website
CUTLASS 3.0 & CuTe – NVIDIA
github.com · Source:
NPU Execution Model and the Boundaries of Its Programming Model
Website
llama.cpp GitHub Repository
github.com · Source:
Hands-On: HF → GGUF / ONNX / OpenVINO – Three End-to-End Paths
Website
ONNX Runtime Documentation
onnxruntime.ai · Source:
Hands-On: HF → GGUF / ONNX / OpenVINO – Three End-to-End Paths
Website
lm-evaluation-harness
github.com · Source:
Hands-On: HF → GGUF / ONNX / OpenVINO – Three End-to-End Paths