
Resources

Reference materials organized by learning path, auto-aggregated from article citations.

πŸ“„ Paper

Attention Is All You Need

arxiv.org Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

Language Models are Unsupervised Multitask Learners (GPT-2)

cdn.openai.com Β· Source: Transformer Architecture Overview, Sampling & Decoding β€” From Probabilities to Text
🌐 Website

The Illustrated Transformer

jalammar.github.io Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, KV Cache Fundamentals
🌐 Website

LLM Visualization β€” Brendan Bycroft

bbycroft.net Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, MQA and GQA, KV Cache Fundamentals, Prefill vs Decode Phases
🌐 Website

Transformer Explainer β€” Georgia Tech / Polo Club

poloclub.github.io Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, MQA and GQA
πŸ“„ Paper

GLU Variants Improve Transformer

arxiv.org Β· Source: Transformer Architecture Overview
πŸ“„ Paper

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

arxiv.org Β· Source: MQA and GQA
πŸ“„ Paper

Fast Transformer Decoding: One Write-Head is All You Need

arxiv.org Β· Source: MQA and GQA
πŸ“„ Paper

Mistral 7B

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Gemma 2 Technical Report

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Jamba: A Hybrid Transformer-Mamba Language Model

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Flamingo: a Visual Language Model for Few-Shot Learning

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Retentive Network: A Successor to Transformer for Large Language Models

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ“„ Paper

Gated Delta Networks: Improving Mamba2 with Delta Rule

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ“„ Paper

Efficient Memory Management for Large Language Model Serving with PagedAttention

arxiv.org Β· Source: KV Cache Fundamentals
πŸ“„ Paper

Efficiently Scaling Transformer Inference

arxiv.org Β· Source: Prefill vs Decode Phases
πŸ“„ Paper

LLM Inference Unveiled: Survey and Roofline Model Insights

arxiv.org Β· Source: Prefill vs Decode Phases
πŸ“„ Paper

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arxiv.org Β· Source: Flash Attention Tiling Principles
πŸ“„ Paper

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

arxiv.org Β· Source: Flash Attention Tiling Principles
πŸ“„ Paper

Self-Attention with Relative Position Representations

arxiv.org Β· Source: Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

RoFormer: Enhanced Transformer with Rotary Position Embedding

arxiv.org Β· Source: Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

arxiv.org Β· Source: Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

The Curious Case of Neural Text Degeneration

arxiv.org Β· Source: Sampling & Decoding β€” From Probabilities to Text
πŸ“„ Paper

Hierarchical Neural Story Generation

arxiv.org Β· Source: Sampling & Decoding β€” From Probabilities to Text
πŸ“„ Paper

Perplexity β€” a Measure of the Difficulty of Speech Recognition Tasks

ieeexplore.ieee.org Β· Source: Sampling & Decoding β€” From Probabilities to Text
πŸ“„ Paper

Fast Inference from Transformers via Speculative Decoding

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Accelerating Large Language Model Decoding with Speculative Sampling

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Better & Faster Large Language Models via Multi-Token Prediction

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

DeepSeek-V3 Technical Report

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing, Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

arxiv.org Β· Source: Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

Mixtral of Experts

arxiv.org Β· Source: Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

arxiv.org Β· Source: Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

Efficiently Modeling Long Sequences with Structured State Spaces (S4)

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

HiPPO: Recurrent Memory with Optimal Polynomial Projections

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

Hungry Hungry Hippos: Towards Language Modeling with State Space Models (H3)

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

On the Parameterization and Initialization of Diagonal State Space Models (S4D)

arxiv.org Β· Source: State Space Models and Mamba
🌐 Website

Zamba2-Small: A Hybrid SSM-Transformer Model

zyphra.com Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Hymba: A Hybrid-head Architecture for Small Language Models

arxiv.org Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

An Empirical Study of Mamba-based Language Models

arxiv.org Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Repeat After Me: Transformers are Better than State Space Models at Copying

arxiv.org Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Qwen3 Technical Report

arxiv.org Β· Source: Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ’» Code

Ollama - Qwen3-Next Model Implementation

github.com Β· Source: Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ“„ Paper

Efficient Estimation of Word Representations in Vector Space

arxiv.org Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

Neural Machine Translation of Rare Words with Subword Units

arxiv.org Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

GloVe: Global Vectors for Word Representation

nlp.stanford.edu Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

arxiv.org Β· Source: From Text to Vectors: Tokenization and Word Embeddings
🌐 Website

The Illustrated Word2Vec

jalammar.github.io Β· Source: From Text to Vectors: Tokenization and Word Embeddings
🌐 Website

Hugging Face Tokenizer Summary

huggingface.co Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Improving Language Understanding by Generative Pre-Training (GPT-1)

cdn.openai.com Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Language Models are Unsupervised Multitask Learners (GPT-2)

cdn.openai.com Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Language Models are Few-Shot Learners

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

BERT for Joint Intent Classification and Slot Filling

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Scaling Laws for Neural Language Models

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

Text Embeddings by Weakly-Supervised Contrastive Pre-training

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

C-Pack: Packaged Resources To Advance General Chinese Embedding

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

arxiv.org Β· Source: Vision Transformer: When Images Become Token Sequences
πŸ“„ Paper

Training data-efficient image transformers & distillation through attention

arxiv.org Β· Source: Vision Transformer: When Images Become Token Sequences
πŸ“„ Paper

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

arxiv.org Β· Source: Vision Transformer: When Images Become Token Sequences
πŸ“„ Paper

Learning Transferable Visual Models From Natural Language Supervision

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Sigmoid Loss for Language Image Pre-Training

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Visual Instruction Tuning

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Denoising Diffusion Probabilistic Models

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

Denoising Diffusion Implicit Models

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

High-Resolution Image Synthesis with Latent Diffusion Models

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

Classifier-Free Diffusion Guidance

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

Scalable Diffusion Models with Transformers

arxiv.org Β· Source: Diffusion Transformer: Image Generation with Transformers
πŸ“„ Paper

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

arxiv.org Β· Source: Diffusion Transformer: Image Generation with Transformers
πŸ“„ Paper

Video generation models as world simulators

openai.com Β· Source: Video Generation: Spatiotemporal Attention and the Sora Architecture
πŸ“„ Paper

Make-A-Video: Text-to-Video Generation without Text-Video Data

arxiv.org Β· Source: Video Generation: Spatiotemporal Attention and the Sora Architecture
πŸ“„ Paper

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

arxiv.org Β· Source: Video Generation: Spatiotemporal Attention and the Sora Architecture
πŸ“„ Paper

Robust Speech Recognition via Large-Scale Weak Supervision

arxiv.org Β· Source: Speech and Transformers: From Whisper to VALL-E
πŸ“„ Paper

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

arxiv.org Β· Source: Speech and Transformers: From Whisper to VALL-E
πŸ“„ Paper

High Fidelity Neural Audio Compression

arxiv.org Β· Source: Speech and Transformers: From Whisper to VALL-E
πŸ“„ Paper

Simple and Controllable Music Generation

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

Jukebox: A Generative Model for Music

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

MusicLM: Generating Music From Text

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

Fast Timing-Conditioned Latent Audio Diffusion

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

A Survey of Quantization Methods for Efficient Neural Network Inference

arxiv.org Β· Source: Quantization Fundamentals
πŸ“„ Paper

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

arxiv.org Β· Source: Quantization Fundamentals
πŸ“„ Paper

FP8 Formats for Deep Learning

arxiv.org Β· Source: Quantization Fundamentals, Inference-Time Quantization: KV Cache and Activation Quantization
πŸ“„ Paper

GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers

arxiv.org Β· Source: PTQ Weight Quantization: From GPTQ to AWQ, llama.cpp Quantization Methods
πŸ“„ Paper

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

arxiv.org Β· Source: PTQ Weight Quantization: From GPTQ to AWQ, llama.cpp Quantization Methods
πŸ“„ Paper

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

arxiv.org Β· Source: PTQ Weight Quantization: From GPTQ to AWQ
πŸ“„ Paper

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

BitNet: Scaling 1-bit Transformers for Large Language Models

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

QLoRA: Efficient Finetuning of Quantized LLMs

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

arxiv.org Β· Source: Inference-Time Quantization: KV Cache and Activation Quantization
πŸ“„ Paper

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

arxiv.org Β· Source: Inference-Time Quantization: KV Cache and Activation Quantization
πŸ’» Code

llama.cpp Quantization Types

github.com Β· Source: llama.cpp Quantization Methods
πŸ’» Code

K-quant PR

github.com Β· Source: llama.cpp Quantization Methods
🌐 Website

NVIDIA Model Optimizer GitHub

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

vLLM Quantization - LLM Compressor

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Microsoft Olive Documentation - Why Olive

microsoft.github.io Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Apple coremltools Optimization Overview

apple.github.io Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

AMD Quark Documentation

quark.docs.amd.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Google AI Edge Torch GitHub

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

NNCF GitHub Repository

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Optimum Intel Documentation

huggingface.co Β· Source: Quantization and Model Conversion Toolchain Landscape, Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

llama.cpp GitHub Repository

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

ONNX Runtime Documentation

onnxruntime.ai Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

OpenVINO Documentation

docs.openvino.ai Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

lm-evaluation-harness

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
πŸ“„ Paper

Efficient Memory Management for Large Language Model Serving with PagedAttention

arxiv.org Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM, PagedAttention and Continuous Batching, Scheduling and Preemption: The Inference Engine Scheduler
πŸ“„ Paper

SGLang: Efficient Execution of Structured Language Model Programs

arxiv.org Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM, Prefix Caching and RadixAttention, SGLang Programming Model and Structured Output
🌐 Website

NVIDIA TensorRT-LLM Documentation

nvidia.github.io Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM
🌐 Website

Ollama GitHub Repository

github.com Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM
πŸ“„ Paper

Orca: A Distributed Serving System for Transformer-Based Generative Models

arxiv.org Β· Source: PagedAttention and Continuous Batching
🌐 Website

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

blog.vllm.ai Β· Source: PagedAttention and Continuous Batching
πŸ“„ Paper

Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

arxiv.org Β· Source: Scheduling and Preemption: The Inference Engine Scheduler
πŸ“„ Paper

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

arxiv.org Β· Source: Scheduling and Preemption: The Inference Engine Scheduler
🌐 Website

vLLM Automatic Prefix Caching

docs.vllm.ai Β· Source: Prefix Caching and RadixAttention
πŸ“„ Paper

Trie Memory

dl.acm.org Β· Source: Prefix Caching and RadixAttention
πŸ“„ Paper

Efficient Guided Generation for Large Language Models

arxiv.org Β· Source: SGLang Programming Model and Structured Output
🌐 Website

Fast JSON Decoding for Local LLMs with Compressed Finite State Machine

lmsys.org Β· Source: SGLang Programming Model and Structured Output
🌐 Website

SGLang Documentation β€” Structured Outputs

docs.sglang.io Β· Source: SGLang Programming Model and Structured Output
πŸ“„ Paper

RouteLLM: Learning to Route LLMs with Preference Data

arxiv.org Β· Source: Model Routing Landscape: Why One Model Isn't Enough, Routing Classifiers: Letting Small Models Decide Who Answers, RouteLLM in Practice: From Preference Data to Production Routing, Factorization Machines and LLM Routing: From FM Theory to MF Router, Online Learning and Cost Optimization: Routers Need to Evolve Too
πŸ“„ Paper

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

arxiv.org Β· Source: Model Routing Landscape: Why One Model Isn't Enough, Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
πŸ“„ Paper

AutoMix: Automatically Mixing Language Models

arxiv.org Β· Source: Model Routing Landscape: Why One Model Isn't Enough, Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
🌐 Website

RouteLLM GitHub Repository

github.com Β· Source: Model Routing Landscape: Why One Model Isn't Enough, RouteLLM in Practice: From Preference Data to Production Routing
πŸ“„ Paper

Evaluating Small Language Models for Front-Door Routing

arxiv.org Β· Source: Routing Classifiers: Letting Small Models Decide Who Answers
🌐 Website

semantic-router: Superfast Decision-Making Layer

github.com Β· Source: Routing Classifiers: Letting Small Models Decide Who Answers
πŸ“„ Paper

Factorization Machines

csie.ntu.edu.tw Β· Source: Factorization Machines and LLM Routing: From FM Theory to MF Router
πŸ“„ Paper

Factorization Machines with libFM

dl.acm.org Β· Source: Factorization Machines and LLM Routing: From FM Theory to MF Router
πŸ“„ Paper

Confidence-Driven LLM Router

arxiv.org Β· Source: Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
πŸ“„ Paper

ConsRoute: Consistency-Driven LLM Routing for Cloud-Edge-Device

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

HybridFlow: Subtask-level DAG Routing

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

PRISM: Privacy-Sensitive Entity-Level LLM Routing

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

Bridging On-Device and Cloud LLMs for Collaborative Reasoning

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

Robust Batch-Level LLM Routing

arxiv.org Β· Source: Online Learning and Cost Optimization: Routers Need to Evolve Too
πŸ“„ Paper

Council Mode: Multi-LLM Collaboration for Hallucination Reduction

arxiv.org Β· Source: Multi-Model Collaboration: From Picking One to Using Many
🌐 Website

Mixture of Agents - Together AI

together.ai Β· Source: Multi-Model Collaboration: From Picking One to Using Many
πŸ“„ Paper

Measuring Massive Multitask Language Understanding (MMLU)

arxiv.org Β· Source: Benchmark Landscape and Evaluation Methodology
🌐 Website

lm-evaluation-harness

github.com Β· Source: Benchmark Landscape and Evaluation Methodology, Impact of Optimization on Accuracy, lm-eval-harness Practical Guide
πŸ“„ Paper

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

arxiv.org Β· Source: Benchmark Landscape and Evaluation Methodology
🌐 Website

LiveBench

livebench.ai Β· Source: Benchmark Landscape and Evaluation Methodology, Interpreting Leaderboards and Model Selection
πŸ“„ Paper

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

arxiv.org Β· Source: Knowledge & Reasoning Benchmarks
πŸ“„ Paper

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

arxiv.org Β· Source: Knowledge & Reasoning Benchmarks
πŸ“„ Paper

Measuring Mathematical Problem Solving With the MATH Dataset

arxiv.org Β· Source: Knowledge & Reasoning Benchmarks
πŸ“„ Paper

Evaluating Large Language Models Trained on Code

arxiv.org Β· Source: Code Benchmarks
πŸ“„ Paper

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

arxiv.org Β· Source: Code Benchmarks, SWE-bench Practical Guide
πŸ“„ Paper

Is Your Code Generated by ChatGPT Really Correct? (EvalPlus)

arxiv.org Β· Source: Code Benchmarks
🌐 Website

Berkeley Function Calling Leaderboard (BFCL)

gorilla.cs.berkeley.edu Β· Source: Agent & Tool Use Benchmarks, BFCL Practical Guide
πŸ“„ Paper

GAIA: A Benchmark for General AI Assistants

arxiv.org Β· Source: Agent & Tool Use Benchmarks
πŸ“„ Paper

WebArena: A Realistic Web Environment for Building Autonomous Agents

arxiv.org Β· Source: Agent & Tool Use Benchmarks
🌐 Website

Google Gemma 2 Technical Report

ai.google.dev Β· Source: Anatomy of Model Release Benchmark Standard Sets
πŸ“„ Paper

Microsoft Phi-3 Technical Report

arxiv.org Β· Source: Anatomy of Model Release Benchmark Standard Sets
πŸ“„ Paper

Qwen2.5 Technical Report

arxiv.org Β· Source: Anatomy of Model Release Benchmark Standard Sets
🌐 Website

Meta Llama 3.1 Model Card

huggingface.co Β· Source: Anatomy of Model Release Benchmark Standard Sets
🌐 Website

Open LLM Leaderboard

huggingface.co Β· Source: Anatomy of Model Release Benchmark Standard Sets, Interpreting Leaderboards and Model Selection
🌐 Website

OpenVINO Neural Network Compression Framework (NNCF)

github.com Β· Source: Impact of Optimization on Accuracy
🌐 Website

Optimum Intel

huggingface.co Β· Source: Impact of Optimization on Accuracy
🌐 Website

llama.cpp

github.com Β· Source: Impact of Optimization on Accuracy
🌐 Website

Chatbot Arena (LMSYS)

lmarena.ai Β· Source: Interpreting Leaderboards and Model Selection
🌐 Website

Artificial Analysis LLM Leaderboard

artificialanalysis.ai Β· Source: Interpreting Leaderboards and Model Selection
🌐 Website

lm-eval Documentation

lm-evaluation-harness.readthedocs.io Β· Source: lm-eval-harness Practical Guide
🌐 Website

SWE-bench GitHub

github.com Β· Source: SWE-bench Practical Guide
🌐 Website

SWE-agent GitHub

github.com Β· Source: SWE-bench Practical Guide
🌐 Website

Gorilla / BFCL GitHub

github.com Β· Source: BFCL Practical Guide
πŸ’» Code

Ollama GitHub

github.com Β· Source: Ollama + llama.cpp Architecture Overview, The Complete Journey of a Single Inference, KV Cache and Batch Scheduling, Server Layer and Scheduling
πŸ’» Code

llama.cpp GitHub

github.com Β· Source: Ollama + llama.cpp Architecture Overview, The Complete Journey of a Single Inference, Compute Graphs and Inference Engines
πŸ’» Code

GGML GitHub

github.com Β· Source: Ollama + llama.cpp Architecture Overview, Compute Graphs and Inference Engines, Hardware Backends
πŸ“„ Paper

Qwen3 Technical Report

arxiv.org Β· Source: The Complete Journey of a Single Inference
πŸ’» Code

GGUF Specification

github.com Β· Source: The GGUF Model Format
🌐 Website

Safetensors Documentation

huggingface.co Β· Source: The GGUF Model Format
🌐 Website

ONNX

onnx.ai Β· Source: The GGUF Model Format
πŸ“„ Paper

GPTQ: Accurate Post-Training Quantization

arxiv.org Β· Source: llama.cpp Quantization Methods
πŸ“„ Paper

AWQ: Activation-aware Weight Quantization

arxiv.org Β· Source: llama.cpp Quantization Methods
πŸ“„ Paper

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arxiv.org Β· Source: Compute Graphs and Inference Engines
πŸ“„ Paper

Efficient Memory Management for LLM Serving with PagedAttention

arxiv.org Β· Source: KV Cache and Batch Scheduling
🌐 Website

CUDA Programming Guide

docs.nvidia.com Β· Source: Hardware Backends
🌐 Website

Metal Shading Language

developer.apple.com Β· Source: Hardware Backends
🌐 Website

Vulkan Compute

khronos.org Β· Source: Hardware Backends
πŸ’» Code

Ollama FAQ

github.com Β· Source: Server Layer and Scheduling
πŸ’» Code

Ollama Modelfile

github.com Β· Source: Model Ecosystem
πŸ’» Code

Ollama API

github.com Β· Source: Model Ecosystem
πŸ“„ Paper

LLaVA: Visual Instruction Tuning

arxiv.org Β· Source: Model Ecosystem
AI Compute Stack (21 resources)
🌐 Website

NVIDIA CUDA C++ Programming Guide

docs.nvidia.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA, GPU Architecture β€” From Transistors to Threads, CUDA Programming Model β€” From Code to Hardware
🌐 Website

Khronos OpenCL Specification

khronos.org Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

Khronos SYCL Specification

khronos.org Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

Intel oneAPI Level Zero Specification

spec.oneapi.io Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

AMD ROCm HIP Programming Guide

rocm.docs.amd.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

Apple Metal Shading Language Specification

developer.apple.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
πŸ’» Code

ggml / llama.cpp

github.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

ONNX Runtime Documentation

onnxruntime.ai Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

NVIDIA H100 Tensor Core GPU Architecture Whitepaper

resources.nvidia.com Β· Source: GPU Architecture β€” From Transistors to Threads, Matrix Acceleration Units β€” Tensor Core and XMX
πŸ“„ Paper

Why Systolic Architectures? β€” H.T. Kung

cs.virginia.edu Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX
🌐 Website

NVIDIA PTX ISA β€” Matrix Multiply-Accumulate

docs.nvidia.com Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX
🌐 Website

Intel Xe2 Architecture β€” Xe-Core and XMX

intel.com Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX, CUDA Programming Model β€” From Code to Hardware
πŸ“„ Paper

DeepSeek-V3 Technical Report

arxiv.org Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX
🌐 Website

NVIDIA Kernel Profiling Guide β€” Memory Coalescing

docs.nvidia.com Β· Source: CUDA Programming Model β€” From Code to Hardware
🌐 Website

CUDA Occupancy Calculator

docs.nvidia.com Β· Source: CUDA Programming Model β€” From Code to Hardware
🌐 Website

SYCL 2020 Specification

registry.khronos.org Β· Source: CUDA Programming Model β€” From Code to Hardware
🌐 Website

CUTLASS: Fast Linear Algebra in CUDA C++

github.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
🌐 Website

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance

siboehm.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
🌐 Website

CUDA C++ Programming Guide β€” Warp Matrix Functions

docs.nvidia.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
🌐 Website

Intel oneAPI DPC++ β€” joint_matrix Extension

github.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
πŸ“„ Paper

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

arxiv.org Β· Source: GEMM Optimization β€” From Naive to Peak Performance
πŸ“ Blog

PyTorch 2.0: Our next generation release

pytorch.org Β· Source: Panorama: The World of ML Compilers, Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

MLIR: Multi-Level Intermediate Representation

mlir.llvm.org Β· Source: Panorama: The World of ML Compilers
🌐 Website

Triton Language and Compiler

triton-lang.org Β· Source: Panorama: The World of ML Compilers
πŸ“„ Paper

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

arxiv.org Β· Source: Panorama: The World of ML Compilers
πŸ“„ Paper

MLIR: A Compiler Infrastructure for the End of Moore's Law

arxiv.org Β· Source: Panorama: The World of ML Compilers, IR Design (Part 1): SSA, FX IR & MLIR Dialects, IR Design (Part 2): Progressive Lowering and Multi-Level IR
πŸ“ Blog

TorchDynamo: An Experiment in Dynamic Python Bytecode Transformation

dev-discuss.pytorch.org Β· Source: Panorama: The World of ML Compilers , Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

PEP 523 – Adding a frame evaluation API to CPython

peps.python.org Β· Source: Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

torch.compiler β€” PyTorch Documentation

pytorch.org Β· Source: Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

AOT Autograd β€” How to use and optimize?

pytorch.org Β· Source: Graph Capture: TorchDynamo, AOTAutograd & Functionalization
πŸ“„ Paper

Efficiently Computing Static Single Assignment Form and the Control Dependence Graph

dl.acm.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

torch.fx β€” PyTorch Documentation

pytorch.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

MLIR Language Reference

mlir.llvm.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects
🌐 Website

MLIR Dialects

mlir.llvm.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects
🌐 Website

MLIR Dialect Conversion

mlir.llvm.org Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR
🌐 Website

MLIR Bufferization

mlir.llvm.org Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR
🌐 Website

MLIR Pass Infrastructure

mlir.llvm.org Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
πŸ’» Code

torch-mlir: PyTorch to MLIR compiler

github.com Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR
πŸ“„ Paper

A Unified Approach to Global Program Optimization

dl.acm.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

MLIR Canonicalization

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
πŸ“„ Paper

Constant Propagation with Conditional Branches

dl.acm.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

PyTorch FX Subgraph Rewriter

pytorch.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

MLIR Declarative Rewrite Rules (DRR)

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
🌐 Website

MLIR PDL β€” Pattern Description Language

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
🌐 Website

torch.fx β€” Subgraph Rewriting

pytorch.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
πŸ“„ Paper

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

arxiv.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
🌐 Website

NVIDIA Tensor Core Programming

docs.nvidia.com Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
πŸ“„ Paper

A Practical Automatic Polyhedral Parallelizer and Locality Optimizer

dl.acm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

MLIR Affine Dialect

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
πŸ“„ Paper

Polyhedral Compilation as a Design Pattern for Compiler Construction

link.springer.com Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

MLIR Transform Dialect

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations , Autotuning and End-to-End Practice
πŸ“„ Paper

Optimizing Compilers for Modern Architectures

elsevier.com Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

Polly - Polyhedral optimizations for LLVM

polly.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

Integer Set Library: A Library for Manipulating Integer Sets

libisl.sourceforge.io Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

MLIR Linalg Dialect

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations , Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arxiv.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms , Operator Fusion (Part II): Cost Models & Fusion in Practice , Tiling Strategies & Memory Hierarchy Optimization
πŸ“„ Paper

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

dl.acm.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms , Operator Fusion (Part II): Cost Models & Fusion in Practice , Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

TorchInductor: a PyTorch-native Compiler

dev-discuss.pytorch.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms
🌐 Website

XLA: Optimizing Compiler for Machine Learning

tensorflow.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms
🌐 Website

Roofline Model

docs.nersc.gov Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms , Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

arxiv.org Β· Source: Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

arxiv.org Β· Source: Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

Roofline: An Insightful Visual Performance Model for Multicore Architectures

www2.eecs.berkeley.edu Β· Source: Tiling Strategies & Memory Hierarchy Optimization
🌐 Website

NVIDIA CUDA C++ Programming Guide β€” Shared Memory

docs.nvidia.com Β· Source: Tiling Strategies & Memory Hierarchy Optimization
🌐 Website

CUTLASS: CUDA Templates for Linear Algebra Subroutines

github.com Β· Source: Tiling Strategies & Memory Hierarchy Optimization , Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
🌐 Website

Triton Language Documentation

triton-lang.org Β· Source: Tiling Strategies & Memory Hierarchy Optimization , Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

NVIDIA A100 GPU Architecture Whitepaper

images.nvidia.com Β· Source: Tiling Strategies & Memory Hierarchy Optimization
🌐 Website

torch.compile Dynamic Shapes Documentation

pytorch.org Β· Source: Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

TorchDynamo Deep Dive

pytorch.org Β· Source: Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

MLIR Tensor Type β€” Dynamic Dimensions

mlir.llvm.org Β· Source: Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

NVIDIA CUDA C++ Programming Guide β€” PTX ISA

docs.nvidia.com Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
🌐 Website

LLVM Code Generator Documentation

llvm.org Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
🌐 Website

NVIDIA GPU Architecture β€” Execution Units

docs.nvidia.com Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
πŸ“„ Paper

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

eecs.harvard.edu Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation , Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness , Autotuning and End-to-End Practice
🌐 Website

MLIR GPU Dialect

mlir.llvm.org Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

IREE Compiler and Runtime

iree.dev Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

TensorRT Developer Guide

docs.nvidia.com Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

What Every Computer Scientist Should Know About Floating-Point Arithmetic

docs.oracle.com Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
πŸ“„ Paper

A Survey of Quantization Methods for Efficient Neural Network Inference

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

FP8 Formats for Deep Learning

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
🌐 Website

PyTorch Quantization Documentation

pytorch.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
🌐 Website

TensorRT Quantization Toolkit

docs.nvidia.com Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

GSPMD: General and Scalable Parallelization for ML Computation Graphs

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
πŸ“„ Paper

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
πŸ“„ Paper

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
πŸ“„ Paper

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
🌐 Website

PyTorch Distributed Overview

pytorch.org Β· Source: Distributed Compilation and Graph Partitioning
🌐 Website

XLA SPMD Partitioner

openxla.org Β· Source: Distributed Compilation and Graph Partitioning
🌐 Website

CUDA C++ Programming Guide β€” Streams

docs.nvidia.com Β· Source: Scheduling and Execution Optimization
🌐 Website

CUDA Graphs

docs.nvidia.com Β· Source: Scheduling and Execution Optimization
πŸ“„ Paper

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

arxiv.org Β· Source: Scheduling and Execution Optimization
πŸ“„ Paper

Dynamic Tensor Rematerialization

arxiv.org Β· Source: Scheduling and Execution Optimization
🌐 Website

TorchInductor: a PyTorch-native Compiler

dev-discuss.pytorch.org Β· Source: Scheduling and Execution Optimization
🌐 Website

PyTorch Activation Checkpointing

pytorch.org Β· Source: Scheduling and Execution Optimization
πŸ“„ Paper

Ansor: Generating High-Performance Tensor Programs for Deep Learning

arxiv.org Β· Source: Autotuning and End-to-End Practice
πŸ“„ Paper

Learning to Optimize Tensor Programs

arxiv.org Β· Source: Autotuning and End-to-End Practice
🌐 Website

Triton Autotune Documentation

triton-lang.org Β· Source: Autotuning and End-to-End Practice
🌐 Website

torch.compile Troubleshooting

pytorch.org Β· Source: Autotuning and End-to-End Practice
🌐 Website

Reinforcement Learning: An Introduction (Sutton & Barto, 2nd Edition)

incompleteideas.net Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

David Silver UCL Reinforcement Learning Course

davidsilver.uk Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

OpenAI Spinning Up: Introduction to RL

spinningup.openai.com Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

Hugging Face Deep RL Course

huggingface.co Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation , Test-Time Scaling and Reasoning Enhancement
🌐 Website

A (Long) Peek into Reinforcement Learning β€” Lilian Weng

lilianweng.github.io Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

UC Berkeley CS285: Deep Reinforcement Learning

rail.eecs.berkeley.edu Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation , Policy Gradient: Directly Optimizing the Policy , Actor-Critic and PPO: Stable Policy Optimization
🌐 Website

Deep Reinforcement Learning: Pong from Pixels β€” Andrej Karpathy

karpathy.github.io Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation , Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992)

link.springer.com Β· Source: Policy Gradient: Directly Optimizing the Policy
πŸ“„ Paper

Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton et al., 1999)

proceedings.neurips.cc Β· Source: Policy Gradient: Directly Optimizing the Policy
🌐 Website

Policy Gradient Algorithms β€” Lilian Weng

lilianweng.github.io Β· Source: Policy Gradient: Directly Optimizing the Policy , Actor-Critic and PPO: Stable Policy Optimization , When RL Meets LLM: From Language Generation to Policy Optimization
🌐 Website

OpenAI Spinning Up: Vanilla Policy Gradient

spinningup.openai.com Β· Source: Policy Gradient: Directly Optimizing the Policy
πŸ“„ Paper

Proximal Policy Optimization Algorithms (Schulman et al., 2017)

arxiv.org Β· Source: Actor-Critic and PPO: Stable Policy Optimization
πŸ“„ Paper

High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2016)

arxiv.org Β· Source: Actor-Critic and PPO: Stable Policy Optimization
πŸ“„ Paper

Trust Region Policy Optimization (Schulman et al., 2015)

arxiv.org Β· Source: Actor-Critic and PPO: Stable Policy Optimization
🌐 Website

Hugging Face Deep RL Course: PPO

huggingface.co Β· Source: Actor-Critic and PPO: Stable Policy Optimization
πŸ“„ Paper

Training language models to follow instructions with human feedback (Ouyang et al., 2022)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , RLHF: Learning from Human Feedback
πŸ“„ Paper

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (Ross et al., 2011)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization
πŸ“„ Paper

Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , RLHF: Learning from Human Feedback
πŸ“„ Paper

Learning to summarize from human feedback (Stiennon et al., 2020)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization
🌐 Website

RLHF: Reinforcement Learning from Human Feedback β€” Chip Huyen

huyenchip.com Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , RLHF: Learning from Human Feedback
πŸ“„ Paper

Let's Verify Step by Step (Lightman et al., 2023)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , Reward Design and Scaling , Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)

arxiv.org Β· Source: RLHF: Learning from Human Feedback
🌐 Website

RLHF Series β€” Nathan Lambert (interconnects.ai)

interconnects.ai Β· Source: RLHF: Learning from Human Feedback , Reward Design and Scaling
🌐 Website

Reward Hacking in Reinforcement Learning β€” Lilian Weng

lilianweng.github.io Β· Source: RLHF: Learning from Human Feedback , From DPO to GRPO: Direct Preference Optimization , Reward Design and Scaling
πŸ“„ Paper

A General Theoretical Paradigm to Understand Learning from Human Feedback (Azar et al., 2023)

arxiv.org Β· Source: From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

KTO: Model Alignment as Prospect Theoretic Optimization (Ethayarajh et al., 2024)

arxiv.org Β· Source: From DPO to GRPO: Direct Preference Optimization
🌐 Website

Hugging Face TRL Documentation: DPO Trainer

huggingface.co Β· Source: From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021)

arxiv.org Β· Source: Reward Design and Scaling
πŸ“„ Paper

Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)

arxiv.org Β· Source: Reward Design and Scaling
πŸ“„ Paper

Scaling Laws for Reward Model Overoptimization (Gao et al., 2022)

arxiv.org Β· Source: Reward Design and Scaling
πŸ“„ Paper

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al., 2024)

arxiv.org Β· Source: Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

AlphaZero-like Tree-Search can Guide Large Language Model Decoding and Training (Feng et al., 2024)

arxiv.org Β· Source: Test-Time Scaling and Reasoning Enhancement
🌐 Website

Intel Xe2 Architecture β€” Intel

intel.com Β· Source: Xe2 GPU Architecture
🌐 Website

Intel Data Center GPU Max Series Architecture β€” Intel

intel.com Β· Source: Xe2 GPU Architecture
🌐 Website

oneAPI GPU Optimization Guide β€” Intel

intel.com Β· Source: Xe2 GPU Architecture
🌐 Website

oneAPI GPU Optimization Guide β€” Thread Hierarchy β€” Intel

intel.com Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

SYCL 2020 Specification β€” Khronos Group

registry.khronos.org Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

Intel GPU Occupancy Calculator β€” Intel

intel.com Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

DPC++ Language Extensions for SYCL β€” Intel

github.com Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

SPIR-V Specification β€” Khronos Group

registry.khronos.org Β· Source: SPIR-V Compilation and Level Zero Runtime
🌐 Website

oneAPI Level Zero Specification β€” Intel

spec.oneapi.io Β· Source: SPIR-V Compilation and Level Zero Runtime , LLM Inference on NPU: KV Cache and the Software Stack , NPU Execution Model and the Boundaries of Its Programming Model
πŸ’» Code

Intel Graphics Compiler (IGC) β€” GitHub

github.com Β· Source: SPIR-V Compilation and Level Zero Runtime
🌐 Website

SPIR-V Guide β€” Khronos

github.com Β· Source: SPIR-V Compilation and Level Zero Runtime
🌐 Website

oneDNN Developer Guide β€” Intel

oneapi-src.github.io Β· Source: oneDNN Primitive System
πŸ’» Code

oneAPI Deep Neural Network Library (oneDNN) β€” GitHub

github.com Β· Source: oneDNN Primitive System
🌐 Website

oneDNN Programming Model β€” Intel

oneapi-src.github.io Β· Source: oneDNN Primitive System
🌐 Website

Memory Format Propagation β€” oneDNN

oneapi-src.github.io Β· Source: oneDNN Primitive System
🌐 Website

oneDNN Performance Profiling and Inspection β€” Intel

oneapi-src.github.io Β· Source: oneDNN GPU Kernel Optimization
🌐 Website

oneAPI GPU Optimization Guide β€” GEMM β€” Intel

intel.com Β· Source: oneDNN GPU Kernel Optimization
🌐 Website

XMX and XVE Architecture β€” Intel

intel.com Β· Source: oneDNN GPU Kernel Optimization
🌐 Website

OpenVINO Architecture β€” Intel

docs.openvino.ai Β· Source: OpenVINO Graph Optimization Pipeline
🌐 Website

OpenVINO GPU Plugin β€” Intel

docs.openvino.ai Β· Source: OpenVINO Graph Optimization Pipeline
πŸ’» Code

OpenVINO Toolkit β€” GitHub

github.com Β· Source: OpenVINO Graph Optimization Pipeline
🌐 Website

Optimum Intel Documentation

huggingface.co Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO , Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

NNCF GitHub Repository

github.com Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
🌐 Website

NNCF API Documentation

openvinotoolkit.github.io Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
🌐 Website

OpenVINO Model Conversion

docs.openvino.ai Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO , Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

Optimum Intel Source - Quantization

github.com Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
🌐 Website

Intel VTune Profiler β€” GPU Analysis β€” Intel

intel.com Β· Source: Performance Analysis and Bottleneck Diagnosis
🌐 Website

OpenVINO Benchmark Tool β€” Intel

docs.openvino.ai Β· Source: Performance Analysis and Bottleneck Diagnosis
🌐 Website

Intel GPU Top β€” intel_gpu_top man page

manpages.ubuntu.com Β· Source: Performance Analysis and Bottleneck Diagnosis
🌐 Website

OpenVINO Multi-Device Execution β€” Intel

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference
🌐 Website

OpenVINO AUTO Device β€” Intel

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference
🌐 Website

Intel NPU Device β€” OpenVINO Documentation

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference , LLM Inference on NPU: KV Cache and the Software Stack
🌐 Website

Heterogeneous Execution β€” OpenVINO Docs

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference
🌐 Website

OpenVINO GenAI β€” Stateful LLM Pipeline

docs.openvino.ai Β· Source: LLM Inference on NPU: KV Cache and the Software Stack
🌐 Website

openvinotoolkit/npu_compiler β€” GitHub

github.com Β· Source: LLM Inference on NPU: KV Cache and the Software Stack , NPU Execution Model and the Boundaries of Its Programming Model
πŸ“„ Paper

FlashAttention β€” Tri Dao et al.

arxiv.org Β· Source: NPU Execution Model and the Boundaries of Its Programming Model
🌐 Website

CUTLASS 3.0 & CuTe β€” NVIDIA

github.com Β· Source: NPU Execution Model and the Boundaries of Its Programming Model
🌐 Website

llama.cpp GitHub Repository

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

ONNX Runtime Documentation

onnxruntime.ai Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

lm-evaluation-harness

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths