Ollama + llama.cpp Deep Dive
Deep dive into Ollama and llama.cpp internals: architecture, quantization, compute graphs, hardware backends, and serving infrastructure.
1. Ollama + llama.cpp Architecture Overview
   Intermediate | #ollama #llama-cpp #architecture #inference
2. The Complete Journey of a Single Inference
   Intermediate | #ollama #llama-cpp #inference #pipeline
3. The GGUF Model Format
   Intermediate | #gguf #llama-cpp #model-format #serialization
4. llama.cpp Quantization Methods
   Advanced | #quantization #llama-cpp #gguf #inference-optimization
5. Compute Graphs and Inference Engines
   Advanced | #ggml #compute-graph #inference-engine #operator-fusion
6. KV Cache and Batch Scheduling
   Advanced | #kv-cache #batch-scheduling #continuous-batching #prefix-cache
7. Hardware Backends
   Advanced | #ggml #cuda #metal #vulkan #hardware-backend
8. Server Layer and Scheduling
   Advanced | #ollama #scheduler #runner #model-management
9. Model Ecosystem
   Intermediate | #ollama #registry #modelfile #lora #multimodal