
Ollama + llama.cpp Deep Dive

Deep dive into Ollama and llama.cpp internals: architecture, quantization, compute graphs, hardware backends, and serving infrastructure.

  1. Ollama + llama.cpp Architecture Overview

     Intermediate
     #ollama #llama-cpp #architecture #inference
  2. The Complete Journey of a Single Inference

     Intermediate
     #ollama #llama-cpp #inference #pipeline
  3. The GGUF Model Format

     Intermediate
     #gguf #llama-cpp #model-format #serialization
  4. llama.cpp Quantization Methods

     Advanced
     #quantization #llama-cpp #gguf #inference-optimization
  5. Compute Graphs and Inference Engines

     Advanced
     #ggml #compute-graph #inference-engine #operator-fusion
  6. KV Cache and Batch Scheduling

     Advanced
     #kv-cache #batch-scheduling #continuous-batching #prefix-cache
  7. Hardware Backends

     Advanced
     #ggml #cuda #metal #vulkan #hardware-backend
  8. Server Layer and Scheduling

     Advanced
     #ollama #scheduler #runner #model-management
  9. Model Ecosystem

     Intermediate
     #ollama #registry #modelfile #lora #multimodal