
Ollama + llama.cpp Deep Dive

Deep dive into Ollama and llama.cpp internals: architecture, quantization, compute graphs, hardware backends, and serving infrastructure.

  1. Ollama + llama.cpp Architecture Overview

     Intermediate
     #ollama #llama-cpp #architecture #inference
  2. The Complete Journey of a Single Inference

     Intermediate
     #ollama #llama-cpp #inference #pipeline
  3. The GGUF Model Format

     Intermediate
     #gguf #llama-cpp #model-format #serialization
  4. llama.cpp Quantization Methods

     Advanced
     #quantization #llama-cpp #gguf #inference-optimization
  5. Compute Graphs and Inference Engines

     Advanced
     #ggml #compute-graph #inference-engine #operator-fusion
  6. KV Cache and Batch Scheduling

     Advanced
     #kv-cache #batch-scheduling #continuous-batching #prefix-cache
  7. Hardware Backends

     Advanced
     #ggml #cuda #metal #vulkan #hardware-backend
  8. Server Layer and Scheduling

     Advanced
     #ollama #scheduler #runner #model-management
  9. Model Ecosystem

     Intermediate
     #ollama #registry #modelfile #lora #multimodal