Hands-On: HF → GGUF / ONNX / OpenVINO — Three End-to-End Paths
Updated 2026-04-20
Three Paths, One Starting Point
In the previous two articles, we built a big-picture understanding of the layered structure of the quantization toolchain, then in the Intel deep dive we dissected the internal call relationships among Optimum Intel, NNCF, and OpenVINO. Theory is done — time to get our hands dirty.
The starting point for this article is one model: meta-llama/Llama-3.1-8B-Instruct — an 8B-parameter instruction-tuned model published on the Hugging Face Hub in FP16 safetensors format. From this single checkpoint, we will walk through three complete conversion + quantization + inference paths:
| Path | Conversion Pipeline | Inference Engine |
|---|---|---|
| Path A | HF → GGUF (convert_hf_to_gguf.py + llama-quantize) | llama.cpp CPU inference |
| Path B | HF → ONNX (optimum-cli export onnx) | ONNX Runtime inference |
| Path C | HF → OpenVINO IR (optimum-cli export openvino + NNCF) | OpenVINO iGPU inference |
After completing each path, we compare three dimensions: conversion complexity (number of steps, time, extra dependencies), model size (file size before and after quantization), and inference speed (time to first token, TTFT, plus decode throughput, TPS).
Disclaimer: The commands in this article are based on official tool documentation. Since tool versions iterate rapidly, we recommend checking the latest docs for your specific version before running commands. Accuracy and speed figures are representative reference values — actual results vary by hardware and environment.
Environment Setup
Software Versions
Below are the software versions used in this article. These were the latest stable or recommended versions of each tool at the time of writing (April 2026):
| Component | Version | Purpose |
|---|---|---|
| Python | 3.11 | Runtime environment |
| llama.cpp | b5200 (latest stable tag) | Path A: GGUF conversion and inference |
| Optimum Intel | 1.22.0 | Path B/C: ONNX/OpenVINO export |
| OpenVINO | 2025.1 | Path C: OpenVINO IR inference |
| ONNX Runtime | 1.21.0 | Path B: ONNX inference |
| NNCF | 3.1.0 | Path C: accuracy-aware quantization |
| transformers | 4.48+ | Model and tokenizer loading |
Installing Dependencies
Python dependencies for Path B and Path C can be installed in a single command:
pip install "optimum-intel[openvino]" openvino==2025.1 \
    onnxruntime==1.21.0 nncf==3.1.0
optimum-intel[openvino] automatically pulls in optimum, transformers, openvino, and other dependencies. If you only need Path B (ONNX), you can install just optimum[onnxruntime].
Building llama.cpp (Path A)
llama.cpp must be compiled from source. On Intel CPUs, enabling BLAS acceleration is recommended:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp && git checkout b5200
# Intel CPU: enable OpenBLAS acceleration
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j$(nproc)
If you want to try llama.cpp’s SYCL backend (Intel iGPU acceleration, experimental):
# Experimental: Intel iGPU via SYCL (requires Intel oneAPI 2025.1+)
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j$(nproc)
Note: The llama.cpp SYCL backend is still under active development, and both stability and performance lag behind OpenVINO on iGPU. If your target is Intel iGPU inference, go with Path C.
Intel iGPU Drivers (Path C)
On Ubuntu, make sure the OpenCL and Level Zero runtimes are installed:
# Ubuntu 22.04 / 24.04
sudo apt install intel-opencl-icd intel-level-zero-gpu level-zero
On Windows, Intel graphics drivers typically include OpenCL and Level Zero support out of the box. WSL2 users should verify that intel-opencl-icd is installed and the host driver version is >= 31.0.101.5592.
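Before moving on, you can confirm that OpenVINO actually sees the iGPU. A minimal check from Python, assuming the openvino package installed above:

import openvino as ov

core = ov.Core()
print(core.available_devices)  # expect something like ['CPU', 'GPU']
for dev in core.available_devices:
    print(dev, core.get_property(dev, 'FULL_DEVICE_NAME'))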
Downloading the Model
# You must first accept the Llama 3.1 Community License Agreement at
# https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
--local-dir ./llama-3.1-8b-fp16
Llama 3.1 is a gated model — you need to accept Meta’s Community License Agreement on the Hugging Face Hub page before downloading. After download, you will have a ~15 GB directory containing FP16 safetensors weight files and the tokenizer.
Path A: HF → GGUF
GGUF is llama.cpp’s native format and the most popular path in the community for running models locally. The full workflow has three steps: format conversion → quantization → inference.
Step 1: Format Conversion (HF safetensors → GGUF FP16)
python llama.cpp/convert_hf_to_gguf.py ./llama-3.1-8b-fp16 \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
convert_hf_to_gguf.py reads the HF checkpoint’s config.json to identify the model architecture (LlamaForCausalLM), parses the safetensors weight files, and repacks them according to GGUF’s tensor layout. --outtype f16 means no quantization is applied — just a format conversion, keeping weights at FP16 precision.
The output llama-3.1-8b-f16.gguf is approximately 15 GB, roughly the same size as the original safetensors.
Step 2: Quantization (GGUF FP16 → Q4_K_M)
./llama.cpp/build/bin/llama-quantize \
llama-3.1-8b-f16.gguf \
llama-3.1-8b-q4km.gguf \
Q4_K_M
llama-quantize is llama.cpp’s built-in quantization tool (Layer 4: inference-engine-native quantization). Q4_K_M is the classic balanced choice in the K-quant family: 4-bit quantization + multi-level super-block scales + mixed precision (parts of the most sensitive tensors, such as the attention value and feed-forward down projections, are kept at 6-bit).
The quantized llama-3.1-8b-q4km.gguf is approximately 4.7 GB — about 3.2x compression compared to FP16.
Step 3: Inference Test
# Quick inference test
./llama.cpp/build/bin/llama-cli \
-m llama-3.1-8b-q4km.gguf \
-p "Hello, world!" \
-n 50
If everything works correctly, you will see the model generate a 50-token continuation, one token at a time.
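If you would rather drive the GGUF file from Python than from the CLI, the community llama-cpp-python bindings wrap the same engine. A minimal sketch, assuming pip install llama-cpp-python:

from llama_cpp import Llama

llm = Llama(model_path='llama-3.1-8b-q4km.gguf', n_ctx=4096, n_threads=8)
out = llm('Hello, world!', max_tokens=50)  # returns an OpenAI-style completion dict
print(out['choices'][0]['text'])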
Step 4: Perplexity Measurement
# Download the WikiText-2 test set (llama.cpp uses raw text format)
# Source: https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
./llama.cpp/build/bin/llama-perplexity \
-m llama-3.1-8b-q4km.gguf \
-f wikitext-2-raw/wiki.test.raw \
--chunks 100
llama-perplexity computes perplexity chunk by chunk. --chunks 100 limits the run to the first 100 chunks to save time. The full test set runs to a few hundred chunks (the exact count depends on the context length used for chunking); running it all produces more accurate results but takes longer.
Step 5: Speed Benchmark
./llama.cpp/build/bin/llama-bench \
-m llama-3.1-8b-q4km.gguf \
-t 8 \
-ngl 0
-t 8 specifies 8 CPU threads, and -ngl 0 means no layers are offloaded to the GPU (pure CPU inference). llama-bench reports prompt processing (prefill) and token generation (decode) speeds.
Common Pitfalls
Tokenizer handling differences. convert_hf_to_gguf.py rebuilds the tokenizer from the HF checkpoint’s tokenizer.json / tokenizer_config.json during conversion and embeds it into the GGUF file. Llama 3 uses a tiktoken-based BPE tokenizer (not Llama 2’s SentencePiece), and the conversion script needs to identify this correctly. If you encounter tokenizer-related errors, first check that your convert_hf_to_gguf.py matches your llama.cpp version.
Importance Matrix (imatrix) impact. Step 2 above uses “plain” quantization — uniform quantization applied to all layers. llama.cpp also supports importance-matrix-assisted quantization: first compute an importance score for each weight on calibration data, then allocate more bits to important weights during quantization. For Q4_K_M, which already includes mixed precision, imatrix gains are around 0.02–0.05 PPL, but the impact is significant for more aggressive quantization schemes like IQ2_XS.
# Optional: generate importance matrix
./llama.cpp/build/bin/llama-imatrix \
-m llama-3.1-8b-f16.gguf \
-f calibration_data.txt \
-o imatrix.dat
# Quantize with imatrix
./llama.cpp/build/bin/llama-quantize \
--imatrix imatrix.dat \
llama-3.1-8b-f16.gguf \
llama-3.1-8b-q4km-imat.gguf \
Q4_K_M
Why not run GGUF on iGPU? The llama.cpp SYCL backend (for Intel iGPU) is still at an early stage, with limited supported quantization types and kernel optimization far behind the CUDA backend. By contrast, OpenVINO is Intel’s official inference engine with deep optimization for Intel hardware, delivering much better iGPU performance than llama.cpp SYCL. If your target hardware is Intel iGPU, go directly to Path C.
Path B: HF → ONNX
ONNX (Open Neural Network Exchange) is the “lingua franca” for cross-platform model exchange. If you are unsure what hardware the model will ultimately run on, or need the flexibility to switch between multiple backends, ONNX is a safe middle-ground choice.
Step 1: Export to ONNX
optimum-cli export onnx \
--model ./llama-3.1-8b-fp16 \
--task text-generation-with-past \
./llama-3.1-8b-onnx
optimum-cli export onnx is the export command provided by Hugging Face Optimum. --task text-generation-with-past is critical — it tells the exporter this is an autoregressive language model that needs to handle the KV cache (past key values). Without the -with-past suffix, the exported model would recompute attention over the entire sequence for every token, making inference extremely slow.
After export, the ./llama-3.1-8b-onnx directory contains the .onnx model files and corresponding tokenizer files. The FP16 ONNX model is approximately 15 GB.
Step 2: INT8 Dynamic Quantization
optimum-cli onnxruntime quantize \
--onnx_model ./llama-3.1-8b-onnx \
--avx512_vnni \
-o ./llama-3.1-8b-onnx-int8
--avx512_vnni enables AVX-512 VNNI instruction set optimization — this is a vectorized integer operation instruction set supported by Intel 10th-gen Core and newer CPUs, providing significant INT8 inference acceleration. If your CPU does not support AVX-512 VNNI, simply drop this flag (it will fall back to the generic INT8 quantization path).
The quantized model is approximately 7.8 GB. This uses dynamic quantization: weights are pre-quantized to INT8 for storage, while activations are dynamically quantized at inference time. Compared to static quantization (which requires a calibration dataset), dynamic quantization needs no extra data but typically incurs slightly higher accuracy loss.
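The same dynamic quantization can also be driven from Python through Optimum’s ORTQuantizer. A sketch, assuming the export produced a single decoder ONNX file (multi-file exports may require an explicit file_name argument):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic INT8 configuration targeting AVX-512 VNNI CPUs
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer = ORTQuantizer.from_pretrained('./llama-3.1-8b-onnx')
quantizer.quantize(save_dir='./llama-3.1-8b-onnx-int8', quantization_config=qconfig)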
Step 3: Python Inference
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
model = ORTModelForCausalLM.from_pretrained(
'./llama-3.1-8b-onnx-int8'
)
tok = AutoTokenizer.from_pretrained('./llama-3.1-8b-fp16')
inputs = tok('Hello, how are you?', return_tensors='pt')
output = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(output[0], skip_special_tokens=True))
ORTModelForCausalLM is Optimum’s wrapper around ONNX Runtime. Its API is highly compatible with Hugging Face transformers’ AutoModelForCausalLM, allowing seamless drop-in replacement.
Common Pitfalls
KV cache handling. As mentioned above, --task text-generation-with-past is crucial. Optimum splits the model into two subgraphs during export: initial prompt processing (no past) and incremental token generation (with past key values). If you use the wrong task type (e.g., just text-generation), the model will export successfully but inference efficiency will be very poor.
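A quick way to verify that the KV cache made it into the export is to list the graph inputs of the exported ONNX file and look for past_key_values entries. The file name below assumes a recent Optimum export; older versions may produce decoder_model_merged.onnx instead:

import onnx

# load_external_data=False skips the multi-GB weight file; we only need the graph
m = onnx.load('./llama-3.1-8b-onnx/model.onnx', load_external_data=False)
past_inputs = [i.name for i in m.graph.input if 'past_key_values' in i.name]
print(f'{len(past_inputs)} past_key_values inputs')  # expect 2 per decoder layer (key + value)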
Execution Provider selection. ONNX Runtime supports different hardware backends through Execution Providers (EPs):
| EP | Hardware | Notes |
|---|---|---|
| CPUExecutionProvider | Generic CPU | Default, no extra installation needed |
| OpenVINOExecutionProvider | Intel CPU/GPU/NPU | Requires onnxruntime-openvino |
| CUDAExecutionProvider | NVIDIA GPU | Requires onnxruntime-gpu |
| DirectMLExecutionProvider | Windows GPU (generic) | Requires onnxruntime-directml |
If you are running ONNX Runtime on Intel hardware, you can select the OpenVINO EP for better performance — but at that point you are effectively using OpenVINO as the backend, so why not go straight to Path C? The core value of ONNX is not peak performance on any single hardware target, but cross-platform portability: the same ONNX model can run on Intel CPUs, NVIDIA GPUs, and ARM phones — just switch the EP.
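With Optimum, switching the EP is a one-argument change at load time. A sketch, assuming the matching runtime package (e.g., onnxruntime-openvino) is installed:

from optimum.onnxruntime import ORTModelForCausalLM

# Same model directory, different backend: the EP is chosen when the session is created
model = ORTModelForCausalLM.from_pretrained(
    './llama-3.1-8b-onnx-int8',
    provider='OpenVINOExecutionProvider',
)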
Larger ONNX model sizes. The ONNX path shown here stores weights at 8-bit, so its compression ratio is inherently lower than GGUF’s 4-bit K-quants (Q4_K_M is 4-bit + mixed precision). If size is the primary concern, GGUF is almost always smaller.
Path C: HF → OpenVINO IR
OpenVINO is Intel’s official inference engine, deeply optimized for Intel CPUs, iGPUs, and NPUs. Path C offers the most options of the three — from a simple one-command export to sophisticated accuracy-aware quantization — allowing you to choose different levels of depth based on your needs.
Config 1: FP16 Baseline (No Quantization)
optimum-cli export openvino \
--model ./llama-3.1-8b-fp16 \
--task text-generation-with-past \
--weight-format fp16 \
./llama-3.1-8b-ov-fp16
This is the simplest export — just a format conversion (HF safetensors → OpenVINO IR), keeping weights at FP16. The output is openvino_model.xml (graph structure) + openvino_model.bin (weights), totaling approximately 15 GB.
Config 2: INT8 Weight-Only (One Command)
optimum-cli export openvino \
--model ./llama-3.1-8b-fp16 \
--weight-format int8 \
./llama-3.1-8b-ov-int8
Changing --weight-format from fp16 to int8 causes Optimum Intel to automatically invoke NNCF for INT8 weight-only quantization (symmetric, per-channel) during export. The model size drops to approximately 7.5 GB.
Note that no calibration dataset is needed here — 8-bit quantization has enough representational range to cover the vast majority of weight distributions, so simple min-max statistics suffice.
Config 3: INT4 Weight-Only (Advanced)
optimum-cli export openvino \
--model ./llama-3.1-8b-fp16 \
--weight-format int4 \
--group-size 128 \
--ratio 0.8 \
./llama-3.1-8b-ov-int4
INT4 is more aggressive quantization. Two key parameters:
- --group-size 128: Every 128 weights share one set of scale and zero-point values. Smaller group sizes yield higher accuracy but greater overhead. 128 is the recommended default; 64 is more precise but slightly slower.
- --ratio 0.8: 80% of layers are quantized to INT4, while the remaining 20% of sensitive layers (e.g., attention output projection) use INT8 (backup_mode). A higher ratio means more aggressive compression but greater accuracy loss.
After INT4 quantization, the model is approximately 4.2 GB — only slightly smaller than GGUF Q4_K_M’s 4.7 GB.
Plain INT4 weight-only compression is data-free; a calibration dataset is only used by the data-aware refinements (such as AWQ or scale estimation) that Optimum Intel can optionally enable, for which it supports a built-in wikitext subset. If you have domain-specific data (e.g., legal documents or medical conversations), you can supply your own calibration set through the export options or the NNCF API (see “Advanced Usage” below).
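The same INT4 settings can be expressed through Optimum Intel’s Python API instead of the CLI. A sketch using OVWeightQuantizationConfig, assuming the Optimum Intel version installed above:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

qconfig = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=0.8)
model = OVModelForCausalLM.from_pretrained(
    './llama-3.1-8b-fp16',
    export=True,                   # convert from HF safetensors on the fly
    quantization_config=qconfig,
)
model.save_pretrained('./llama-3.1-8b-ov-int4')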
Inference: OpenVINO GenAI Pipeline
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(
'./llama-3.1-8b-ov-int4',
device='GPU' # 'CPU' for CPU inference
)
print(pipe.generate('Hello, how are you?', max_new_tokens=50))
openvino_genai is OpenVINO’s high-level API for LLM inference (C++ implementation with Python bindings), which internally handles tokenization, KV cache management, beam search / sampling, and more. device='GPU' specifies Intel iGPU inference.
You can also use Optimum Intel’s Python API (consistent with the transformers style):
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model = OVModelForCausalLM.from_pretrained(
'./llama-3.1-8b-ov-int4',
device='GPU'
)
tok = AutoTokenizer.from_pretrained('./llama-3.1-8b-fp16')
inputs = tok('Hello, how are you?', return_tensors='pt')
output = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(output[0], skip_special_tokens=True))
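For a rough TTFT / decode-throughput number without llama-bench, you can time generate() on the model and tok objects above. A crude sketch; a real benchmark should warm the device up first and average several runs:

import time

prompt = 'Explain the difference between INT8 and INT4 quantization.'
inputs = tok(prompt, return_tensors='pt')

t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)          # prefill + first token ~ TTFT
ttft = time.perf_counter() - t0

t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)  # prefill + 128 decode steps
total = time.perf_counter() - t0

decoded = out.shape[1] - inputs['input_ids'].shape[1]
print(f'TTFT ~ {ttft:.2f} s, decode ~ {decoded / (total - ttft):.1f} tokens/s')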
Advanced Usage: NNCF Accuracy-Aware Quantization (AAQ)
If you have strict accuracy requirements (e.g., medical dialogue, code generation), you can bypass the Optimum CLI and use NNCF’s quantize_with_accuracy_control API directly. This is covered in detail in the Intel deep dive’s section on NNCF internals. Here is a skeleton script; the preprocessing and validation functions are deliberately simplified and must be adapted to the exported model’s actual inputs and to a metric you care about:
import nncf
import openvino as ov
from functools import partial
from datasets import load_dataset
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM
# ── 1. Load the FP16 OpenVINO model ──
model_path = './llama-3.1-8b-ov-fp16'
ov_model = ov.Core().read_model(f'{model_path}/openvino_model.xml')
tokenizer = AutoTokenizer.from_pretrained(model_path)
# ── 2. Prepare calibration dataset ──
calibration_data = load_dataset(
'wikitext', 'wikitext-2-raw-v1', split='train[:1000]'
)
def preprocess_fn(example, tokenizer):
return tokenizer(
example['text'], truncation=True, max_length=512
)
calibration_dataset = nncf.Dataset(
calibration_data,
partial(preprocess_fn, tokenizer=tokenizer)
)
# ── 3. Prepare validation dataset and validation function ──
val_data = load_dataset(
'wikitext', 'wikitext-2-raw-v1', split='validation[:200]'
)
validation_dataset = nncf.Dataset(
val_data,
partial(preprocess_fn, tokenizer=tokenizer)
)
def validation_fn(compiled_model, validation_data):
    # Simplified placeholder. NNCF treats the returned float as "higher is better",
    # so return an accuracy-style metric (or a negated loss), not the raw loss.
    # In practice, replace with task-specific metrics like MMLU / HumanEval.
    total_loss = 0.0
    count = 0
    for batch in validation_data:
        # ... forward pass to compute loss ...
        count += 1
    return -total_loss / max(count, 1)
# ── 4. Run accuracy-aware quantization ──
quantized_model = nncf.quantize_with_accuracy_control(
ov_model,
calibration_dataset=calibration_dataset,
validation_dataset=validation_dataset,
validation_fn=validation_fn,
max_drop=0.01, # accuracy drop must not exceed 1%
drop_type=nncf.DropType.ABSOLUTE,
preset=nncf.QuantizationPreset.MIXED,
advanced_parameters=nncf.AdvancedQuantizationParameters(
overflow_fix=nncf.OverflowFix.DISABLE # LLMs typically don't need overflow fix
)
)
# ── 5. Save the quantized model ──
import os
os.makedirs('./llama-3.1-8b-ov-int4-aaq', exist_ok=True)  # make sure the output directory exists
ov.save_model(quantized_model, './llama-3.1-8b-ov-int4-aaq/openvino_model.xml')
# Don't forget to copy the tokenizer files to the output directory
The key differences from the one-line Optimum CLI command:
- Accuracy constraint: max_drop=0.01 sets an upper bound on accuracy degradation. AAQ quantizes aggressively first; if the measured accuracy drop exceeds the threshold, it automatically reverts the most sensitive layers to higher precision until the constraint is met.
- Custom validation function: You can define arbitrary evaluation logic (e.g., pass@1, BLEU, domain-specific accuracy), not just perplexity.
- Full control: You can adjust preset, overflow fix, subset size, and other advanced parameters.
The trade-off: AAQ requires multiple iterations over the validation set (typically 5–20 rounds), increasing quantization time from minutes to hours.
Common Pitfalls
Device specification. device='GPU' refers to the Intel iGPU (integrated graphics), not a discrete GPU. If the system lacks OpenCL/Level Zero drivers or the iGPU is unavailable, you will get an error. Use device='CPU' to fall back to CPU inference.
Impact of --ratio. The ratio controls the mix between INT4 and INT8. On Llama-3.1-8B, ratio=0.8 (default) increases PPL by about 0.45; ratio=1.0 (all INT4) increases PPL by about 0.8. If accuracy is important, lower the ratio or use AAQ.
--group-size trade-offs. Group-size 128 is the standard choice for INT4. Group-size 64 means only 64 weights share each scale, reducing quantization error, but it doubles the number of stored scale/zero-point values (adding a few percent to total model size), and dequantization computation at inference time also grows. For accuracy-sensitive scenarios, try 64 but benchmark to confirm that speed does not drop significantly.
WSL2 permission issues. When running OpenVINO GPU inference in WSL2, you may encounter insufficient permissions for /dev/dri/renderD128. Ensure the current user is in the render and video groups:
sudo usermod -a -G render,video $USER
# Log out and back into the WSL2 session for changes to take effect
End-to-End Pipeline Pitfall Checklist
After walking all three paths, you will notice that many pitfalls are not specific to one path but are cross-cutting concerns. Here are the five most common traps:
1. Tokenizer Consistency
Model conversion is not just about changing the weight format — the tokenizer gets repackaged too. GGUF files embed an independent tokenizer, and the OpenVINO IR directory also contains tokenizer_config.json. The problem is:
- llama.cpp’s GGUF tokenizer is reverse-engineered from the HF tokenizer.json. Certain edge cases (such as special token ID mappings and the order of added_tokens) may not be perfectly consistent with the original tokenizer
- If you measure perplexity using Path A’s tokenizer but run inference with Path C’s tokenizer, the token sequences may differ — even with identical input text
Recommendation: When comparing across paths, ensure all paths use the same tokenizer to encode input text. You can encode with the HF AutoTokenizer first, then pass the token IDs to each inference engine.
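A quick way to run that check, assuming the llama-cpp-python bindings from Path A are installed, is to tokenize the same string with both tokenizers and compare the IDs:

from llama_cpp import Llama
from transformers import AutoTokenizer

text = 'The quick brown fox jumps over the lazy dog.'

hf_tok = AutoTokenizer.from_pretrained('./llama-3.1-8b-fp16')
hf_ids = hf_tok(text, add_special_tokens=False)['input_ids']

# vocab_only=True loads only the tokenizer embedded in the GGUF, not the weights
gguf = Llama(model_path='llama-3.1-8b-q4km.gguf', vocab_only=True)
gguf_ids = gguf.tokenize(text.encode('utf-8'), add_bos=False)

print('HF  :', hf_ids)
print('GGUF:', gguf_ids)
print('identical:', hf_ids == gguf_ids)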
2. Chat Template Loss
The tokenizer_config.json in a Hugging Face checkpoint typically includes a chat_template field (Jinja2 format) that defines the formatting rules for system/user/assistant messages. However:
- Whether the chat template survives GGUF conversion depends on tool versions: recent convert_hf_to_gguf.py embeds it as GGUF metadata, but llama.cpp natively renders only a limited set of templates, so you may still need to specify the template via the --chat-template CLI argument or apply it manually in code
- When exporting to OpenVINO IR, Optimum Intel copies tokenizer_config.json, but openvino_genai.LLMPipeline may not correctly parse all Jinja2 templates
Recommendation: When using instruction-tuned models (e.g., *-Instruct), always verify that the chat template is being applied correctly. The simplest approach is to manually construct input in the <|begin_of_text|><|start_header_id|>user<|end_header_id|>...<|eot_id|> format and compare outputs between the original HF model and the converted model.
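The reference rendering always comes from the HF tokenizer itself; printing it gives you the exact string each engine should be producing:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('./llama-3.1-8b-fp16')
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Hello, how are you?'},
]
# Renders the checkpoint's chat_template (Jinja2) into the final prompt string
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)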
3. Dynamic Shape vs Static Shape
An ONNX or OpenVINO graph can be exported with static shapes, which pins the input sequence length at export time. But LLM input lengths are dynamic (different prompts have different lengths), so you need to make sure the export keeps the relevant dimensions dynamic:
- optimum-cli export onnx handles dynamic axes by default (batch_size and sequence_length are marked as dynamic), so manual configuration is typically unnecessary
- optimum-cli export openvino behaves similarly, but if you manually convert with ov.convert_model, you need to specify dynamic dimensions for the input parameters yourself
If you see “shape mismatch” or “input size doesn’t match” errors at inference time, check the dynamic shape configuration first.
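On the OpenVINO side, you can confirm which dimensions are dynamic by printing the partial shapes of the IR’s inputs (dynamic dimensions show up as ?):

import openvino as ov

model = ov.Core().read_model('./llama-3.1-8b-ov-int4/openvino_model.xml')
for inp in model.inputs:
    # e.g. input_ids with shape [?,?] means batch and sequence length are dynamic
    print(inp.any_name, inp.get_partial_shape())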
4. Non-Comparable Accuracy Measurements
Quantization accuracy is typically measured with perplexity (PPL) and benchmark accuracy (MMLU, HumanEval, etc.). But measurement conditions can differ across tools:
- Batch size: PPL is sensitive to batch size. llama.cpp’s llama-perplexity defaults to batch=512, while lm-evaluation-harness can be configured with different batch sizes
- Sequence length: Context window length affects PPL values. PPL measured with a 2048-token context differs from one measured with 4096
- Few-shot count: MMLU 0-shot and 5-shot results can differ by 5–10 percentage points
Recommendation: When comparing accuracy across paths, use the same evaluation framework (we recommend lm-evaluation-harness) with identical configuration (batch size, context length, few-shot count).
5. WikiText-2 Version Differences
A subtle but important issue: the wiki.test.raw used by llama.cpp’s llama-perplexity and wikitext-2-raw-v1 from Hugging Face datasets contain the same data, but tokenization differs:
- llama.cpp uses the tokenizer embedded in the GGUF file to directly tokenize the raw text
lm-evaluation-harnessuses the Hugging FaceAutoTokenizer
The two tokenizers may produce slightly different tokenization results for the same text (see Pitfall 1), causing PPL values that are not perfectly comparable. The difference is typically in the 0.01–0.05 PPL range — it does not affect qualitative conclusions but should not be treated as a basis for precise quantitative comparison.
Recommendation: For strict comparisons, use lm-evaluation-harness uniformly to run the same benchmarks across all quantized models from every path.
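The harness can be driven from Python as well as the CLI. A sketch of the FP16 HF baseline (evaluating the GGUF/ONNX/OpenVINO variants requires the corresponding harness backends, whose options vary by version, so check the harness docs):

import lm_eval

results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=./llama-3.1-8b-fp16,dtype=float16',
    tasks=['mmlu'],
    num_fewshot=5,
    batch_size=8,
)
print(results['results'])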
Summary and Path Selection
Each of the three paths has its strengths. The right choice depends on your hardware and use case:
Just want to run a model locally, not sensitive to accuracy → GGUF Q4_K_M + llama.cpp. The simplest path: two commands (convert + quantize), llama.cpp’s CPU inference performance is excellent, and GGUF resources are abundant in the community (Hugging Face hosts many pre-quantized GGUF files available for direct download).
Cross-platform deployment, uncertain final hardware → ONNX + ONNX Runtime. ONNX is the only truly cross-hardware format — the same model file can run on Intel CPUs, NVIDIA GPUs, ARM phones, and Qualcomm NPUs, just by switching the Execution Provider. The trade-off: performance is not optimal on any single hardware target, but it is good enough.
Intel hardware + best performance + accuracy-sensitive → OpenVINO IR + NNCF AAQ. Intel’s official full-stack solution, with the deepest optimization for Intel CPU/iGPU/NPU. INT4 + iGPU delivers the highest inference speed (on Intel platforms), and AAQ finds the best balance between accuracy and performance.
These three paths are not mutually exclusive — many real-world scenarios combine them. For example, during development and debugging you might use GGUF + llama.cpp for quick model quality validation, then package the model with OpenVINO for production deployment on Intel hardware.
This article is the third (and final) installment of the quantization and model conversion toolchain series. Together with the first two articles, it forms a complete “theory → selection → practice” chain:
- The Landscape article: understand the layered structure and ecosystem of the toolchain
- The Intel Deep Dive: dissect the Optimum Intel / NNCF / OpenVINO stack
- The Hands-On Guide (this article): walk through three conversion paths end to end
This article is also the second-to-last stop on the Intel iGPU Inference Deep Dive learning path — after understanding the Xe2 architecture, oneDNN primitives, OpenVINO graph optimization, and the Intel optimization stack, this is where all that knowledge comes together in complete end-to-end practice. Next up: iGPU performance analysis and NPU co-inference.