
Resources

Reference materials organized by learning path, auto-aggregated from article citations.

πŸ“„ Paper

Attention Is All You Need

arxiv.org Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

Language Models are Unsupervised Multitask Learners (GPT-2)

cdn.openai.com Β· Source: Transformer Architecture Overview, Sampling & Decoding β€” From Probabilities to Text
🌐 Website

The Illustrated Transformer

jalammar.github.io Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, KV Cache Fundamentals
🌐 Website

LLM Visualization β€” Brendan Bycroft

bbycroft.net Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, MQA and GQA, KV Cache Fundamentals, Prefill vs Decode Phases
🌐 Website

Transformer Explainer β€” Georgia Tech / Polo Club

poloclub.github.io Β· Source: Transformer Architecture Overview, QKV Data Structures and Intuition, Attention Computation in Detail, Multi-Head Attention, MQA and GQA
πŸ“„ Paper

GLU Variants Improve Transformer

arxiv.org Β· Source: Transformer Architecture Overview
πŸ“„ Paper

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

arxiv.org Β· Source: MQA and GQA
πŸ“„ Paper

Fast Transformer Decoding: One Write-Head is All You Need

arxiv.org Β· Source: MQA and GQA
πŸ“„ Paper

Mistral 7B

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Gemma 2 Technical Report

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Jamba: A Hybrid Transformer-Mamba Language Model

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Flamingo: a Visual Language Model for Few-Shot Learning

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Retentive Network: A Successor to Transformer for Large Language Models

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA
πŸ“„ Paper

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ“„ Paper

Gated Delta Networks: Improving Mamba2 with Delta Rule

arxiv.org Β· Source: Attention Variants: From Sliding Window to MLA, Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ“„ Paper

Efficient Memory Management for Large Language Model Serving with PagedAttention

arxiv.org Β· Source: KV Cache Fundamentals
πŸ“„ Paper

Efficiently Scaling Transformer Inference

arxiv.org Β· Source: Prefill vs Decode Phases
πŸ“„ Paper

LLM Inference Unveiled: Survey and Roofline Model Insights

arxiv.org Β· Source: Prefill vs Decode Phases
πŸ“„ Paper

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arxiv.org Β· Source: Flash Attention Tiling Principles
πŸ“„ Paper

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

arxiv.org Β· Source: Flash Attention Tiling Principles
πŸ“„ Paper

Self-Attention with Relative Position Representations

arxiv.org Β· Source: Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

RoFormer: Enhanced Transformer with Rotary Position Embedding

arxiv.org Β· Source: Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

arxiv.org Β· Source: Positional Encoding β€” Giving Transformers a Sense of Order
πŸ“„ Paper

The Curious Case of Neural Text Degeneration

arxiv.org Β· Source: Sampling & Decoding β€” From Probabilities to Text
πŸ“„ Paper

Hierarchical Neural Story Generation

arxiv.org Β· Source: Sampling & Decoding β€” From Probabilities to Text
πŸ“„ Paper

Perplexity β€” a Measure of the Difficulty of Speech Recognition Tasks

ieeexplore.ieee.org Β· Source: Sampling & Decoding β€” From Probabilities to Text
πŸ“„ Paper

Fast Inference from Transformers via Speculative Decoding

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Accelerating Large Language Model Decoding with Speculative Sampling

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Better & Faster Large Language Models via Multi-Token Prediction

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

DeepSeek-V3 Technical Report

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing, Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

arxiv.org Β· Source: Speculative Decoding β€” Accelerating LLM Inference via Guessing
πŸ“„ Paper

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

arxiv.org Β· Source: Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

Mixtral of Experts

arxiv.org Β· Source: Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

arxiv.org Β· Source: Mixture of Experts: Sparsely Activated Large Model Architecture
πŸ“„ Paper

Efficiently Modeling Long Sequences with Structured State Spaces (S4)

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

HiPPO: Recurrent Memory with Optimal Polynomial Projections

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

Hungry Hungry Hippos: Towards Language Modeling with State Space Models (H3)

arxiv.org Β· Source: State Space Models and Mamba
πŸ“„ Paper

On the Parameterization and Initialization of Diagonal State Space Models (S4D)

arxiv.org Β· Source: State Space Models and Mamba
🌐 Website

Zamba2-Small: A Hybrid SSM-Transformer Model

zyphra.com Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Hymba: A Hybrid-head Architecture for Small Language Models

arxiv.org Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

An Empirical Study of Mamba-based Language Models

arxiv.org Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Repeat After Me: Transformers are Better than State Space Models at Copying

arxiv.org Β· Source: Hybrid Architectures: Fusing Mamba with Attention
πŸ“„ Paper

Qwen3 Technical Report

arxiv.org Β· Source: Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ’» Code

Ollama - Qwen3-Next Model Implementation

github.com Β· Source: Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
πŸ“„ Paper

Efficient Estimation of Word Representations in Vector Space

arxiv.org Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

Neural Machine Translation of Rare Words with Subword Units

arxiv.org Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

GloVe: Global Vectors for Word Representation

nlp.stanford.edu Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

arxiv.org Β· Source: From Text to Vectors: Tokenization and Word Embeddings
🌐 Website

The Illustrated Word2Vec

jalammar.github.io Β· Source: From Text to Vectors: Tokenization and Word Embeddings
🌐 Website

Hugging Face Tokenizer Summary

huggingface.co Β· Source: From Text to Vectors: Tokenization and Word Embeddings
πŸ“„ Paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Improving Language Understanding by Generative Pre-Training (GPT-1)

cdn.openai.com Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Language Models are Unsupervised Multitask Learners (GPT-2)

cdn.openai.com Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Language Models are Few-Shot Learners

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

BERT for Joint Intent Classification and Slot Filling

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Scaling Laws for Neural Language Models

arxiv.org Β· Source: BERT and GPT: Two Paths β€” Understanding vs Generation
πŸ“„ Paper

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

Text Embeddings by Weakly-Supervised Contrastive Pre-training

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

C-Pack: Packaged Resources To Advance General Chinese Embedding

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

arxiv.org Β· Source: Sentence Embeddings: From Token-Level to Semantic Retrieval
πŸ“„ Paper

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

arxiv.org Β· Source: Vision Transformer: When Images Become Token Sequences
πŸ“„ Paper

Training data-efficient image transformers & distillation through attention

arxiv.org Β· Source: Vision Transformer: When Images Become Token Sequences
πŸ“„ Paper

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

arxiv.org Β· Source: Vision Transformer: When Images Become Token Sequences
πŸ“„ Paper

Learning Transferable Visual Models From Natural Language Supervision

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Sigmoid Loss for Language Image Pre-Training

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Visual Instruction Tuning

arxiv.org Β· Source: Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
πŸ“„ Paper

Denoising Diffusion Probabilistic Models

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

Denoising Diffusion Implicit Models

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

High-Resolution Image Synthesis with Latent Diffusion Models

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

Classifier-Free Diffusion Guidance

arxiv.org Β· Source: Diffusion Model Fundamentals: Generating from Noise
πŸ“„ Paper

Scalable Diffusion Models with Transformers

arxiv.org Β· Source: Diffusion Transformer: Image Generation with Transformers
πŸ“„ Paper

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

arxiv.org Β· Source: Diffusion Transformer: Image Generation with Transformers
πŸ“„ Paper

Video generation models as world simulators

openai.com Β· Source: Video Generation: Spatiotemporal Attention and the Sora Architecture
πŸ“„ Paper

Make-A-Video: Text-to-Video Generation without Text-Video Data

arxiv.org Β· Source: Video Generation: Spatiotemporal Attention and the Sora Architecture
πŸ“„ Paper

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

arxiv.org Β· Source: Video Generation: Spatiotemporal Attention and the Sora Architecture
πŸ“„ Paper

Robust Speech Recognition via Large-Scale Weak Supervision

arxiv.org Β· Source: Speech and Transformers: From Whisper to VALL-E
πŸ“„ Paper

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

arxiv.org Β· Source: Speech and Transformers: From Whisper to VALL-E
πŸ“„ Paper

High Fidelity Neural Audio Compression

arxiv.org Β· Source: Speech and Transformers: From Whisper to VALL-E
πŸ“„ Paper

Simple and Controllable Music Generation

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

Jukebox: A Generative Model for Music

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

MusicLM: Generating Music From Text

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

Fast Timing-Conditioned Latent Audio Diffusion

arxiv.org Β· Source: Music Generation: When Transformers Learn to Compose
πŸ“„ Paper

A Survey of Quantization Methods for Efficient Neural Network Inference

arxiv.org Β· Source: Quantization Fundamentals
πŸ“„ Paper

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

arxiv.org Β· Source: Quantization Fundamentals
πŸ“„ Paper

FP8 Formats for Deep Learning

arxiv.org Β· Source: Quantization Fundamentals, Inference-Time Quantization: KV Cache and Activation Quantization
πŸ“„ Paper

GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers

arxiv.org Β· Source: PTQ Weight Quantization: From GPTQ to AWQ, llama.cpp Quantization Methods
πŸ“„ Paper

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

arxiv.org Β· Source: PTQ Weight Quantization: From GPTQ to AWQ, llama.cpp Quantization Methods
πŸ“„ Paper

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

arxiv.org Β· Source: PTQ Weight Quantization: From GPTQ to AWQ
πŸ“„ Paper

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

BitNet: Scaling 1-bit Transformers for Large Language Models

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

QLoRA: Efficient Finetuning of Quantized LLMs

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

arxiv.org Β· Source: Quantization-Aware Training (QAT)
πŸ“„ Paper

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

arxiv.org Β· Source: Inference-Time Quantization: KV Cache and Activation Quantization
πŸ“„ Paper

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

arxiv.org Β· Source: Inference-Time Quantization: KV Cache and Activation Quantization
πŸ’» Code

llama.cpp Quantization Types

github.com Β· Source: llama.cpp Quantization Methods
πŸ’» Code

K-quant PR

github.com Β· Source: llama.cpp Quantization Methods
🌐 Website

NVIDIA Model Optimizer GitHub

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

vLLM Quantization - LLM Compressor

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Microsoft Olive Documentation - Why Olive

microsoft.github.io Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Apple coremltools Optimization Overview

apple.github.io Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

AMD Quark Documentation

quark.docs.amd.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Google AI Edge Torch GitHub

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

NNCF GitHub Repository

github.com Β· Source: Quantization and Model Conversion Toolchain Landscape
🌐 Website

Optimum Intel Documentation

huggingface.co Β· Source: Quantization and Model Conversion Toolchain Landscape, Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

llama.cpp GitHub Repository

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

ONNX Runtime Documentation

onnxruntime.ai Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

OpenVINO Documentation

docs.openvino.ai Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

lm-evaluation-harness

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
πŸ“„ Paper

Efficient Memory Management for Large Language Model Serving with PagedAttention

arxiv.org Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM, PagedAttention and Continuous Batching, Scheduling and Preemption: The Inference Engine Scheduler
πŸ“„ Paper

SGLang: Efficient Execution of Structured Language Model Programs

arxiv.org Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM, Prefix Caching and RadixAttention, SGLang Programming Model and Structured Output
🌐 Website

NVIDIA TensorRT-LLM Documentation

nvidia.github.io Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM
🌐 Website

Ollama GitHub Repository

github.com Β· Source: LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM
πŸ“„ Paper

Orca: A Distributed Serving System for Transformer-Based Generative Models

arxiv.org Β· Source: PagedAttention and Continuous Batching
🌐 Website

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

blog.vllm.ai Β· Source: PagedAttention and Continuous Batching
πŸ“„ Paper

Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

arxiv.org Β· Source: Scheduling and Preemption: The Inference Engine Scheduler
πŸ“„ Paper

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

arxiv.org Β· Source: Scheduling and Preemption: The Inference Engine Scheduler
🌐 Website

vLLM Automatic Prefix Caching

docs.vllm.ai Β· Source: Prefix Caching and RadixAttention
πŸ“„ Paper

Trie Memory

dl.acm.org Β· Source: Prefix Caching and RadixAttention
πŸ“„ Paper

Efficient Guided Generation for Large Language Models

arxiv.org Β· Source: SGLang Programming Model and Structured Output
🌐 Website

Fast JSON Decoding for Local LLMs with Compressed Finite State Machine

lmsys.org Β· Source: SGLang Programming Model and Structured Output
🌐 Website

SGLang Documentation β€” Structured Outputs

docs.sglang.io Β· Source: SGLang Programming Model and Structured Output
πŸ“„ Paper

RouteLLM: Learning to Route LLMs with Preference Data

arxiv.org Β· Source: Model Routing Landscape: Why One Model Isn't Enough, Routing Classifiers: Letting Small Models Decide Who Answers, RouteLLM in Practice: From Preference Data to Production Routing, Factorization Machines and LLM Routing: From FM Theory to MF Router, Online Learning and Cost Optimization: Routers Need to Evolve Too
πŸ“„ Paper

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

arxiv.org Β· Source: Model Routing Landscape: Why One Model Isn't Enough, Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
πŸ“„ Paper

AutoMix: Automatically Mixing Language Models

arxiv.org Β· Source: Model Routing Landscape: Why One Model Isn't Enough, Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
🌐 Website

RouteLLM GitHub Repository

github.com Β· Source: Model Routing Landscape: Why One Model Isn't Enough, RouteLLM in Practice: From Preference Data to Production Routing
πŸ“„ Paper

Evaluating Small Language Models for Front-Door Routing

arxiv.org Β· Source: Routing Classifiers: Letting Small Models Decide Who Answers
🌐 Website

semantic-router: Superfast Decision-Making Layer

github.com Β· Source: Routing Classifiers: Letting Small Models Decide Who Answers
πŸ“„ Paper

Factorization Machines

csie.ntu.edu.tw Β· Source: Factorization Machines and LLM Routing: From FM Theory to MF Router
πŸ“„ Paper

Factorization Machines with libFM

dl.acm.org Β· Source: Factorization Machines and LLM Routing: From FM Theory to MF Router
πŸ“„ Paper

Confidence-Driven LLM Router

arxiv.org Β· Source: Cascade and Self-Verification: Try the Cheap Model First, Upgrade If Needed
πŸ“„ Paper

ConsRoute: Consistency-Driven LLM Routing for Cloud-Edge-Device

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

HybridFlow: Subtask-level DAG Routing

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

PRISM: Privacy-Sensitive Entity-Level LLM Routing

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

Bridging On-Device and Cloud LLMs for Collaborative Reasoning

arxiv.org Β· Source: Hybrid LLM: Intelligent Routing Between Local and Cloud
πŸ“„ Paper

Robust Batch-Level LLM Routing

arxiv.org Β· Source: Online Learning and Cost Optimization: Routers Need to Evolve Too
πŸ“„ Paper

Council Mode: Multi-LLM Collaboration for Hallucination Reduction

arxiv.org Β· Source: Multi-Model Collaboration: From Picking One to Using Many
🌐 Website

Mixture of Agents - Together AI

together.ai Β· Source: Multi-Model Collaboration: From Picking One to Using Many
πŸ“„ Paper

Measuring Massive Multitask Language Understanding (MMLU)

arxiv.org Β· Source: Benchmark Landscape and Evaluation Methodology
🌐 Website

lm-evaluation-harness

github.com Β· Source: Benchmark Landscape and Evaluation Methodology, Impact of Optimization on Accuracy, lm-eval-harness Practical Guide
πŸ“„ Paper

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

arxiv.org Β· Source: Benchmark Landscape and Evaluation Methodology
🌐 Website

LiveBench

livebench.ai Β· Source: Benchmark Landscape and Evaluation Methodology, Interpreting Leaderboards and Model Selection
πŸ“„ Paper

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

arxiv.org Β· Source: Knowledge & Reasoning Benchmarks
πŸ“„ Paper

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

arxiv.org Β· Source: Knowledge & Reasoning Benchmarks
πŸ“„ Paper

Measuring Mathematical Problem Solving With the MATH Dataset

arxiv.org Β· Source: Knowledge & Reasoning Benchmarks
πŸ“„ Paper

Evaluating Large Language Models Trained on Code

arxiv.org Β· Source: Code Benchmarks
πŸ“„ Paper

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

arxiv.org Β· Source: Code Benchmarks, SWE-bench Practical Guide
πŸ“„ Paper

Is Your Code Generated by ChatGPT Really Correct? (EvalPlus)

arxiv.org Β· Source: Code Benchmarks
🌐 Website

Berkeley Function Calling Leaderboard (BFCL)

gorilla.cs.berkeley.edu Β· Source: Agent & Tool Use Benchmarks, BFCL Practical Guide
πŸ“„ Paper

GAIA: A Benchmark for General AI Assistants

arxiv.org Β· Source: Agent & Tool Use Benchmarks
πŸ“„ Paper

WebArena: A Realistic Web Environment for Building Autonomous Agents

arxiv.org Β· Source: Agent & Tool Use Benchmarks
🌐 Website

Google Gemma 2 Technical Report

ai.google.dev Β· Source: Anatomy of Model Release Benchmark Standard Sets
πŸ“„ Paper

Microsoft Phi-3 Technical Report

arxiv.org Β· Source: Anatomy of Model Release Benchmark Standard Sets
πŸ“„ Paper

Qwen2.5 Technical Report

arxiv.org Β· Source: Anatomy of Model Release Benchmark Standard Sets
🌐 Website

Meta Llama 3.1 Model Card

huggingface.co Β· Source: Anatomy of Model Release Benchmark Standard Sets
🌐 Website

Open LLM Leaderboard

huggingface.co Β· Source: Anatomy of Model Release Benchmark Standard Sets, Interpreting Leaderboards and Model Selection
🌐 Website

OpenVINO Neural Network Compression Framework (NNCF)

github.com Β· Source: Impact of Optimization on Accuracy
🌐 Website

Optimum Intel

huggingface.co Β· Source: Impact of Optimization on Accuracy
🌐 Website

llama.cpp

github.com Β· Source: Impact of Optimization on Accuracy
🌐 Website

Chatbot Arena (LMSYS)

lmarena.ai Β· Source: Interpreting Leaderboards and Model Selection
🌐 Website

Artificial Analysis LLM Leaderboard

artificialanalysis.ai Β· Source: Interpreting Leaderboards and Model Selection
🌐 Website

lm-eval Documentation

lm-evaluation-harness.readthedocs.io Β· Source: lm-eval-harness Practical Guide
🌐 Website

SWE-bench GitHub

github.com Β· Source: SWE-bench Practical Guide
🌐 Website

SWE-agent GitHub

github.com Β· Source: SWE-bench Practical Guide
🌐 Website

Gorilla / BFCL GitHub

github.com Β· Source: BFCL Practical Guide
πŸ’» Code

Ollama GitHub

github.com Β· Source: Ollama + llama.cpp Architecture Overview, The Complete Journey of a Single Inference, KV Cache and Batch Scheduling, Server Layer and Scheduling
πŸ’» Code

llama.cpp GitHub

github.com Β· Source: Ollama + llama.cpp Architecture Overview, The Complete Journey of a Single Inference, Compute Graphs and Inference Engines
πŸ’» Code

GGML GitHub

github.com Β· Source: Ollama + llama.cpp Architecture Overview, Compute Graphs and Inference Engines, Hardware Backends
πŸ“„ Paper

Qwen3 Technical Report

arxiv.org Β· Source: The Complete Journey of a Single Inference
πŸ’» Code

GGUF Specification

github.com Β· Source: The GGUF Model Format
🌐 Website

Safetensors Documentation

huggingface.co Β· Source: The GGUF Model Format
🌐 Website

ONNX

onnx.ai Β· Source: The GGUF Model Format
πŸ“„ Paper

GPTQ: Accurate Post-Training Quantization

arxiv.org Β· Source: llama.cpp Quantization Methods
πŸ“„ Paper

AWQ: Activation-aware Weight Quantization

arxiv.org Β· Source: llama.cpp Quantization Methods
πŸ“„ Paper

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arxiv.org Β· Source: Compute Graphs and Inference Engines
πŸ“„ Paper

Efficient Memory Management for LLM Serving with PagedAttention

arxiv.org Β· Source: KV Cache and Batch Scheduling
🌐 Website

CUDA Programming Guide

docs.nvidia.com Β· Source: Hardware Backends
🌐 Website

Metal Shading Language

developer.apple.com Β· Source: Hardware Backends
🌐 Website

Vulkan Compute

khronos.org Β· Source: Hardware Backends
πŸ’» Code

Ollama FAQ

github.com Β· Source: Server Layer and Scheduling
πŸ’» Code

Ollama Modelfile

github.com Β· Source: Model Ecosystem
πŸ’» Code

Ollama API

github.com Β· Source: Model Ecosystem
πŸ“„ Paper

LLaVA: Visual Instruction Tuning

arxiv.org Β· Source: Model Ecosystem
AI Compute Stack (21 resources)
🌐 Website

NVIDIA CUDA C++ Programming Guide

docs.nvidia.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA, GPU Architecture β€” From Transistors to Threads, CUDA Programming Model β€” From Code to Hardware
🌐 Website

Khronos OpenCL Specification

khronos.org Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

Khronos SYCL Specification

khronos.org Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

Intel oneAPI Level Zero Specification

spec.oneapi.io Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

AMD ROCm HIP Programming Guide

rocm.docs.amd.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

Apple Metal Shading Language Specification

developer.apple.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
πŸ’» Code

ggml / llama.cpp

github.com Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

ONNX Runtime Documentation

onnxruntime.ai Β· Source: AI Compute Stack Overview β€” From Inference Frameworks to Hardware ISA
🌐 Website

NVIDIA H100 Tensor Core GPU Architecture Whitepaper

resources.nvidia.com Β· Source: GPU Architecture β€” From Transistors to Threads, Matrix Acceleration Units β€” Tensor Core and XMX
πŸ“„ Paper

Why Systolic Architectures? β€” H.T. Kung

cs.virginia.edu Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX
🌐 Website

NVIDIA PTX ISA β€” Matrix Multiply-Accumulate

docs.nvidia.com Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX
🌐 Website

Intel Xe2 Architecture β€” Xe-Core and XMX

intel.com Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX, CUDA Programming Model β€” From Code to Hardware
πŸ“„ Paper

DeepSeek-V3 Technical Report

arxiv.org Β· Source: Matrix Acceleration Units β€” Tensor Core and XMX
🌐 Website

NVIDIA Kernel Profiling Guide β€” Memory Coalescing

docs.nvidia.com Β· Source: CUDA Programming Model β€” From Code to Hardware
🌐 Website

CUDA Occupancy Calculator

docs.nvidia.com Β· Source: CUDA Programming Model β€” From Code to Hardware
🌐 Website

SYCL 2020 Specification

registry.khronos.org Β· Source: CUDA Programming Model β€” From Code to Hardware
🌐 Website

CUTLASS: Fast Linear Algebra in CUDA C++

github.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
🌐 Website

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance

siboehm.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
🌐 Website

CUDA C++ Programming Guide β€” Warp Matrix Functions

docs.nvidia.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
🌐 Website

Intel oneAPI DPC++ β€” joint_matrix Extension

github.com Β· Source: GEMM Optimization β€” From Naive to Peak Performance
πŸ“„ Paper

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

arxiv.org Β· Source: GEMM Optimization β€” From Naive to Peak Performance
πŸ“ Blog

PyTorch 2.0: Our next generation release

pytorch.org Β· Source: Panorama: The World of ML Compilers, Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

MLIR: Multi-Level Intermediate Representation

mlir.llvm.org Β· Source: Panorama: The World of ML Compilers
🌐 Website

Triton Language and Compiler

triton-lang.org Β· Source: Panorama: The World of ML Compilers
πŸ“„ Paper

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

arxiv.org Β· Source: Panorama: The World of ML Compilers
πŸ“„ Paper

MLIR: A Compiler Infrastructure for the End of Moore's Law

arxiv.org Β· Source: Panorama: The World of ML Compilers, IR Design (Part 1): SSA, FX IR & MLIR Dialects, IR Design (Part 2): Progressive Lowering and Multi-Level IR
πŸ“ Blog

TorchDynamo: An Experiment in Dynamic Python Bytecode Transformation

dev-discuss.pytorch.org Β· Source: Panorama: The World of ML Compilers , Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

PEP 523 – Adding a frame evaluation API to CPython

peps.python.org Β· Source: Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

torch.compiler β€” PyTorch Documentation

pytorch.org Β· Source: Graph Capture: TorchDynamo, AOTAutograd & Functionalization
🌐 Website

AOT Autograd β€” How to use and optimize?

pytorch.org Β· Source: Graph Capture: TorchDynamo, AOTAutograd & Functionalization
πŸ“„ Paper

Efficiently Computing Static Single Assignment Form and the Control Dependence Graph

dl.acm.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

torch.fx β€” PyTorch Documentation

pytorch.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

MLIR Language Reference

mlir.llvm.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects
🌐 Website

MLIR Dialects

mlir.llvm.org Β· Source: IR Design (Part 1): SSA, FX IR & MLIR Dialects
🌐 Website

MLIR Dialect Conversion

mlir.llvm.org Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR
🌐 Website

MLIR Bufferization

mlir.llvm.org Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR
🌐 Website

MLIR Pass Infrastructure

mlir.llvm.org Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR , Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
πŸ’» Code

torch-mlir: PyTorch to MLIR compiler

github.com Β· Source: IR Design (Part 2): Progressive Lowering and Multi-Level IR
πŸ“„ Paper

A Unified Approach to Global Program Optimization

dl.acm.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

MLIR Canonicalization

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
πŸ“„ Paper

Constant Propagation with Conditional Branches

dl.acm.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

PyTorch FX Subgraph Rewriter

pytorch.org Β· Source: Graph Optimization Passes (Part 1): Data Flow Analysis & Pass Fundamentals
🌐 Website

MLIR Declarative Rewrite Rules (DRR)

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
🌐 Website

MLIR PDL β€” Pattern Description Language

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
🌐 Website

torch.fx β€” Subgraph Rewriting

pytorch.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
πŸ“„ Paper

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

arxiv.org Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
🌐 Website

NVIDIA Tensor Core Programming

docs.nvidia.com Β· Source: Graph Optimization Passes (Part 2): Advanced Optimizations & Pattern Matching
πŸ“„ Paper

A Practical Automatic Polyhedral Parallelizer and Locality Optimizer

dl.acm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

MLIR Affine Dialect

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
πŸ“„ Paper

Polyhedral Compilation as a Design Pattern for Compiler Construction

link.springer.com Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

MLIR Transform Dialect

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations , Autotuning and End-to-End Practice
πŸ“„ Paper

Optimizing Compilers for Modern Architectures

elsevier.com Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

Polly - Polyhedral optimizations for LLVM

polly.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

Integer Set Library: A Library for Manipulating Integer Sets

libisl.sourceforge.io Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
🌐 Website

MLIR Linalg Dialect

mlir.llvm.org Β· Source: Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations , Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arxiv.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms , Operator Fusion (Part II): Cost Models & Fusion in Practice , Tiling Strategies & Memory Hierarchy Optimization
πŸ“„ Paper

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

dl.acm.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms , Operator Fusion (Part II): Cost Models & Fusion in Practice , Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

TorchInductor: a PyTorch-native Compiler

dev-discuss.pytorch.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms
🌐 Website

XLA: Optimizing Compiler for Machine Learning

tensorflow.org Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms
🌐 Website

Roofline Model

docs.nersc.gov Β· Source: Operator Fusion (Part I): Taxonomy & Decision Algorithms , Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

arxiv.org Β· Source: Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

arxiv.org Β· Source: Operator Fusion (Part II): Cost Models & Fusion in Practice
πŸ“„ Paper

Roofline: An Insightful Visual Performance Model for Multicore Architectures

www2.eecs.berkeley.edu Β· Source: Tiling Strategies & Memory Hierarchy Optimization
🌐 Website

NVIDIA CUDA C++ Programming Guide β€” Shared Memory

docs.nvidia.com Β· Source: Tiling Strategies & Memory Hierarchy Optimization
🌐 Website

CUTLASS: CUDA Templates for Linear Algebra Subroutines

github.com Β· Source: Tiling Strategies & Memory Hierarchy Optimization , Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
🌐 Website

Triton Language Documentation

triton-lang.org Β· Source: Tiling Strategies & Memory Hierarchy Optimization , Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

NVIDIA A100 GPU Architecture Whitepaper

images.nvidia.com Β· Source: Tiling Strategies & Memory Hierarchy Optimization
🌐 Website

torch.compile Dynamic Shapes Documentation

pytorch.org Β· Source: Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

TorchDynamo Deep Dive

pytorch.org Β· Source: Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

MLIR Tensor Type β€” Dynamic Dimensions

mlir.llvm.org Β· Source: Dynamic Shapes: The Full-Pipeline Challenge from Capture to Execution
🌐 Website

NVIDIA CUDA C++ Programming Guide β€” PTX ISA

docs.nvidia.com Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
🌐 Website

LLVM Code Generator Documentation

llvm.org Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
🌐 Website

NVIDIA GPU Architecture β€” Execution Units

docs.nvidia.com Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation
πŸ“„ Paper

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

eecs.harvard.edu Β· Source: Code Generation (Part I): Instruction Selection, Vectorization & Register Allocation , Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness , Autotuning and End-to-End Practice
🌐 Website

MLIR GPU Dialect

mlir.llvm.org Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

IREE Compiler and Runtime

iree.dev Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

TensorRT Developer Guide

docs.nvidia.com Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
🌐 Website

What Every Computer Scientist Should Know About Floating-Point Arithmetic

docs.oracle.com Β· Source: Code Generation (Part II): Triton Pipeline, Compiler Backends & Numerical Correctness
πŸ“„ Paper

A Survey of Quantization Methods for Efficient Neural Network Inference

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

FP8 Formats for Deep Learning

arxiv.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
🌐 Website

PyTorch Quantization Documentation

pytorch.org Β· Source: Quantization Compilation and Mixed-Precision Optimization
🌐 Website

TensorRT Quantization Toolkit

docs.nvidia.com Β· Source: Quantization Compilation and Mixed-Precision Optimization
πŸ“„ Paper

GSPMD: General and Scalable Parallelization for ML Computation Graphs

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
πŸ“„ Paper

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
πŸ“„ Paper

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
πŸ“„ Paper

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

arxiv.org Β· Source: Distributed Compilation and Graph Partitioning
🌐 Website

PyTorch Distributed Overview

pytorch.org Β· Source: Distributed Compilation and Graph Partitioning
🌐 Website

XLA SPMD Partitioner

openxla.org Β· Source: Distributed Compilation and Graph Partitioning
🌐 Website

CUDA C++ Programming Guide β€” Streams

docs.nvidia.com Β· Source: Scheduling and Execution Optimization
🌐 Website

CUDA Graphs

docs.nvidia.com Β· Source: Scheduling and Execution Optimization
πŸ“„ Paper

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

arxiv.org Β· Source: Scheduling and Execution Optimization
πŸ“„ Paper

Dynamic Tensor Rematerialization

arxiv.org Β· Source: Scheduling and Execution Optimization
🌐 Website

TorchInductor: a PyTorch-native Compiler

dev-discuss.pytorch.org Β· Source: Scheduling and Execution Optimization
🌐 Website

PyTorch Activation Checkpointing

pytorch.org Β· Source: Scheduling and Execution Optimization
πŸ“„ Paper

Ansor: Generating High-Performance Tensor Programs for Deep Learning

arxiv.org Β· Source: Autotuning and End-to-End Practice
πŸ“„ Paper

Learning to Optimize Tensor Programs

arxiv.org Β· Source: Autotuning and End-to-End Practice
🌐 Website

Triton Autotune Documentation

triton-lang.org Β· Source: Autotuning and End-to-End Practice
🌐 Website

torch.compile Troubleshooting

pytorch.org Β· Source: Autotuning and End-to-End Practice
🌐 Website

Reinforcement Learning: An Introduction (Sutton & Barto, 2nd Edition)

incompleteideas.net Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

David Silver UCL Reinforcement Learning Course

davidsilver.uk Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

OpenAI Spinning Up: Introduction to RL

spinningup.openai.com Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

Hugging Face Deep RL Course

huggingface.co Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation , Test-Time Scaling and Reasoning Enhancement
🌐 Website

A (Long) Peek into Reinforcement Learning β€” Lilian Weng

lilianweng.github.io Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation
🌐 Website

UC Berkeley CS285: Deep Reinforcement Learning

rail.eecs.berkeley.edu Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation , Policy Gradient: Directly Optimizing the Policy , Actor-Critic and PPO: Stable Policy Optimization
🌐 Website

Deep Reinforcement Learning: Pong from Pixels β€” Andrej Karpathy

karpathy.github.io Β· Source: Reinforcement Learning Foundations: From Agent to Bellman Equation , Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992)

link.springer.com Β· Source: Policy Gradient: Directly Optimizing the Policy
πŸ“„ Paper

Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton et al., 1999)

proceedings.neurips.cc Β· Source: Policy Gradient: Directly Optimizing the Policy
🌐 Website

Policy Gradient Algorithms β€” Lilian Weng

lilianweng.github.io Β· Source: Policy Gradient: Directly Optimizing the Policy , Actor-Critic and PPO: Stable Policy Optimization , When RL Meets LLM: From Language Generation to Policy Optimization
🌐 Website

OpenAI Spinning Up: Vanilla Policy Gradient

spinningup.openai.com Β· Source: Policy Gradient: Directly Optimizing the Policy
πŸ“„ Paper

Proximal Policy Optimization Algorithms (Schulman et al., 2017)

arxiv.org Β· Source: Actor-Critic and PPO: Stable Policy Optimization
πŸ“„ Paper

High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2016)

arxiv.org Β· Source: Actor-Critic and PPO: Stable Policy Optimization
πŸ“„ Paper

Trust Region Policy Optimization (Schulman et al., 2015)

arxiv.org Β· Source: Actor-Critic and PPO: Stable Policy Optimization
🌐 Website

Hugging Face Deep RL Course: PPO

huggingface.co Β· Source: Actor-Critic and PPO: Stable Policy Optimization
πŸ“„ Paper

Training language models to follow instructions with human feedback (Ouyang et al., 2022)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , RLHF: Learning from Human Feedback
πŸ“„ Paper

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (Ross et al., 2011)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization
πŸ“„ Paper

Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , RLHF: Learning from Human Feedback
πŸ“„ Paper

Learning to summarize from human feedback (Stiennon et al., 2020)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization
🌐 Website

RLHF: Reinforcement Learning from Human Feedback β€” Chip Huyen

huyenchip.com Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , RLHF: Learning from Human Feedback
πŸ“„ Paper

Let's Verify Step by Step (Lightman et al., 2023)

arxiv.org Β· Source: When RL Meets LLM: From Language Generation to Policy Optimization , Reward Design and Scaling , Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)

arxiv.org Β· Source: RLHF: Learning from Human Feedback
🌐 Website

RLHF Series β€” Nathan Lambert (interconnects.ai)

interconnects.ai Β· Source: RLHF: Learning from Human Feedback , Reward Design and Scaling
🌐 Website

Reward Hacking in Reinforcement Learning β€” Lilian Weng

lilianweng.github.io Β· Source: RLHF: Learning from Human Feedback , From DPO to GRPO: Direct Preference Optimization , Reward Design and Scaling
πŸ“„ Paper

A General Theoretical Paradigm to Understand Learning from Human Feedback (Azar et al., 2023)

arxiv.org Β· Source: From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

KTO: Model Alignment as Prospect Theoretic Optimization (Ethayarajh et al., 2024)

arxiv.org Β· Source: From DPO to GRPO: Direct Preference Optimization
🌐 Website

Hugging Face TRL Documentation: DPO Trainer

huggingface.co Β· Source: From DPO to GRPO: Direct Preference Optimization
πŸ“„ Paper

Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021)

arxiv.org Β· Source: Reward Design and Scaling
πŸ“„ Paper

Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)

arxiv.org Β· Source: Reward Design and Scaling
πŸ“„ Paper

Scaling Laws for Reward Model Overoptimization (Gao et al., 2022)

arxiv.org Β· Source: Reward Design and Scaling
πŸ“„ Paper

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al., 2024)

arxiv.org Β· Source: Test-Time Scaling and Reasoning Enhancement
πŸ“„ Paper

AlphaZero-like Tree-Search can Guide Large Language Model Decoding and Training (Feng et al., 2024)

arxiv.org Β· Source: Test-Time Scaling and Reasoning Enhancement
🌐 Website

Intel Xe2 Architecture β€” Intel

intel.com Β· Source: Xe2 GPU Architecture
🌐 Website

Intel Data Center GPU Max Series Architecture β€” Intel

intel.com Β· Source: Xe2 GPU Architecture
🌐 Website

oneAPI GPU Optimization Guide β€” Intel

intel.com Β· Source: Xe2 GPU Architecture
🌐 Website

oneAPI GPU Optimization Guide β€” Thread Hierarchy β€” Intel

intel.com Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

SYCL 2020 Specification β€” Khronos Group

registry.khronos.org Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

Intel GPU Occupancy Calculator β€” Intel

intel.com Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

DPC++ Language Extensions for SYCL β€” Intel

github.com Β· Source: Xe2 Execution Model and Programming Abstractions
🌐 Website

SPIR-V Specification β€” Khronos Group

registry.khronos.org Β· Source: SPIR-V Compilation and Level Zero Runtime
🌐 Website

oneAPI Level Zero Specification β€” Intel

spec.oneapi.io Β· Source: SPIR-V Compilation and Level Zero Runtime , LLM Inference on NPU: KV Cache and the Software Stack , NPU Execution Model and the Boundaries of Its Programming Model
πŸ’» Code

Intel Graphics Compiler (IGC) β€” GitHub

github.com Β· Source: SPIR-V Compilation and Level Zero Runtime
🌐 Website

SPIR-V Guide β€” Khronos

github.com Β· Source: SPIR-V Compilation and Level Zero Runtime
🌐 Website

oneDNN Developer Guide β€” Intel

oneapi-src.github.io Β· Source: oneDNN Primitive System
πŸ’» Code

oneAPI Deep Neural Network Library (oneDNN) β€” GitHub

github.com Β· Source: oneDNN Primitive System
🌐 Website

oneDNN Programming Model β€” Intel

oneapi-src.github.io Β· Source: oneDNN Primitive System
🌐 Website

Memory Format Propagation β€” oneDNN

oneapi-src.github.io Β· Source: oneDNN Primitive System
🌐 Website

oneDNN Performance Profiling and Inspection β€” Intel

oneapi-src.github.io Β· Source: oneDNN GPU Kernel Optimization
🌐 Website

oneAPI GPU Optimization Guide β€” GEMM β€” Intel

intel.com Β· Source: oneDNN GPU Kernel Optimization
🌐 Website

XMX and XVE Architecture β€” Intel

intel.com Β· Source: oneDNN GPU Kernel Optimization
🌐 Website

OpenVINO Architecture β€” Intel

docs.openvino.ai Β· Source: OpenVINO Graph Optimization Pipeline
🌐 Website

OpenVINO GPU Plugin β€” Intel

docs.openvino.ai Β· Source: OpenVINO Graph Optimization Pipeline
πŸ’» Code

OpenVINO Toolkit β€” GitHub

github.com Β· Source: OpenVINO Graph Optimization Pipeline
🌐 Website

Optimum Intel Documentation

huggingface.co Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO , Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

NNCF GitHub Repository

github.com Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
🌐 Website

NNCF API Documentation

openvinotoolkit.github.io Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
🌐 Website

OpenVINO Model Conversion

docs.openvino.ai Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO , Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

Optimum Intel Source - Quantization

github.com Β· Source: Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO
🌐 Website

Intel VTune Profiler β€” GPU Analysis β€” Intel

intel.com Β· Source: Performance Analysis and Bottleneck Diagnosis
🌐 Website

OpenVINO Benchmark Tool β€” Intel

docs.openvino.ai Β· Source: Performance Analysis and Bottleneck Diagnosis
🌐 Website

Intel GPU Top β€” intel_gpu_top man page

manpages.ubuntu.com Β· Source: Performance Analysis and Bottleneck Diagnosis
🌐 Website

OpenVINO Multi-Device Execution β€” Intel

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference
🌐 Website

OpenVINO AUTO Device β€” Intel

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference
🌐 Website

Intel NPU Device β€” OpenVINO Documentation

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference , LLM Inference on NPU: KV Cache and the Software Stack
🌐 Website

Heterogeneous Execution β€” OpenVINO Docs

docs.openvino.ai Β· Source: NPU Architecture and GPU+NPU Co-Inference
🌐 Website

OpenVINO GenAI β€” Stateful LLM Pipeline

docs.openvino.ai Β· Source: LLM Inference on NPU: KV Cache and the Software Stack
🌐 Website

openvinotoolkit/npu_compiler β€” GitHub

github.com Β· Source: LLM Inference on NPU: KV Cache and the Software Stack , NPU Execution Model and the Boundaries of Its Programming Model
πŸ“„ Paper

FlashAttention β€” Tri Dao et al.

arxiv.org Β· Source: NPU Execution Model and the Boundaries of Its Programming Model
🌐 Website

CUTLASS 3.0 & CuTe β€” NVIDIA

github.com Β· Source: NPU Execution Model and the Boundaries of Its Programming Model
🌐 Website

llama.cpp GitHub Repository

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

ONNX Runtime Documentation

onnxruntime.ai Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths
🌐 Website

lm-evaluation-harness

github.com Β· Source: Hands-On: HF β†’ GGUF / ONNX / OpenVINO β€” Three End-to-End Paths