Quantization and Model Conversion Toolchain Landscape
Updated 2026-04-17
A Common Confusion
“I want to run Llama-3.1-8B on my laptop with an Intel iGPU (Iris Xe). I found the model checkpoint on Hugging Face, but what do I do next?”
This is one of the most common questions in the community. You start searching and run into a pile of tool names: Optimum, NNCF, OpenVINO, llama.cpp, ONNX Runtime, TensorRT-LLM. Each claims to “quantize” and “accelerate” models. Some docs tell you to “convert to ONNX first,” others say “convert directly to OpenVINO IR,” and still others insist that “GGUF is the most universal format.” The more you read, the more confused you get — how do these tools relate to each other? Are they competitors or upstream/downstream dependencies? Which one should I use?
If this sounds familiar, this article will give you a clear framework for navigating the landscape. We won’t just list tool names. Instead, we’ll start from a pipeline layering perspective to understand each tool’s role in the quantization and model conversion pipeline, then use a comparison matrix and decision tree to help you pick the best tool combination for your specific scenario.
The Pipeline Layer Map
The key to understanding the toolchain is recognizing that quantization and model conversion are not a single step — they form a four-layer pipeline. Different tools operate at different layers; some handle only one layer, while others orchestrate across multiple layers. Here is the four-layer framework:
Layer 1: Algorithm Libraries
Tools at this layer do one thing: modify the numerical values of weights (quantization, pruning, knowledge distillation, etc.) while keeping the input and output format identical. For example, you feed in a Hugging Face safetensors checkpoint and get back safetensors — except the weights have changed from FP16 to INT4.
Typical tools:
- AutoGPTQ / AutoAWQ: Implement the GPTQ and AWQ quantization algorithms, outputting Hugging Face-compatible quantized models
- bitsandbytes: NF4 (4-bit NormalFloat) quantization, primarily used during training (QLoRA) and for inference memory optimization
- llm-compressor: The officially recommended quantization tool for vLLM (more on this below)
- NNCF (Neural Network Compression Framework): Intel’s open-source compression algorithm library, supporting PTQ, QAT, pruning, and knowledge distillation
- SmoothQuant: Smooths activation outliers to enable W8A8 (weight + activation INT8) quantization
Key characteristic: Pure algorithms, format-agnostic. These tools don’t care what inference engine you’ll use — you can quantize a model with NNCF and then export it to ONNX, OpenVINO, or even TensorRT.
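To make the "same format in, same format out" contract concrete, here is a minimal sketch of a Layer 1 tool in action, using AutoAWQ. The model ID and quant_config values are assumptions borrowed from AutoAWQ's documented examples; verify exact argument names against the current AutoAWQ docs.

```python
# A minimal Layer 1 sketch: Hugging Face safetensors in, AWQ-quantized safetensors out.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized("llama-3.1-8b-awq")              # still a Hugging Face-compatible checkpoint
tokenizer.save_pretrained("llama-3.1-8b-awq")
```

Note that the output directory can then be fed to a Layer 2 converter or loaded directly by engines that understand AWQ checkpoints.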
Layer 2: Format Converters
Tools at this layer do one thing: change the model format (from training framework format to inference engine format), optionally performing lightweight quantization (e.g., FP32 to FP16 or simple symmetric INT8 quantization).
Typical tools:
- ov.convert_model: OpenVINO’s official conversion API, from PyTorch / TensorFlow / ONNX to OpenVINO IR (.xml + .bin)
- convert_hf_to_gguf.py: llama.cpp’s official conversion script, from Hugging Face checkpoint to GGUF format
- trtllm-build: TensorRT-LLM’s build tool, from checkpoint to TensorRT engine (.plan files). When used with quantization options, it also belongs to Layer 4 (see “Runtime-Integrated” below)
- optimum-cli export onnx: Hugging Face Optimum’s export command, from Transformers to ONNX
Key characteristic: Format conversion first, quantization second. Many converters only perform format mapping (e.g., PyTorch Module to ONNX graph); quantization is an optional post-processing step.
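Here is a minimal sketch of a pure Layer 2 conversion with ov.convert_model; the torchvision model is just a small stand-in for illustration, not part of the original discussion.

```python
# A minimal Layer 2 sketch: PyTorch module in, OpenVINO IR (.xml + .bin) out, no quantization.
import torch
import openvino as ov
from torchvision.models import resnet18  # small stand-in model, purely for illustration

model = resnet18(weights=None).eval()
ov_model = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
ov.save_model(ov_model, "resnet18.xml")  # writes resnet18.xml + resnet18.bin; weights stored as FP16 by default
```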
Layer 3: Orchestrators
Tools at this layer don’t implement specific algorithms or format conversion logic — instead, they chain Layer 1 and Layer 2 tools into a pipeline and provide higher-level features like search, evaluation, and configuration management.
Typical tools:
- Microsoft Olive: A cross-backend model optimization orchestrator whose core abstractions are Pass + Workflow + Evaluator + Search Strategy. A Pass is a single optimization step (e.g., quantization, graph optimization, Graph Capture). A Workflow chains Passes in sequence. An Evaluator assesses results (accuracy / latency). A Search Strategy automates hyperparameter tuning. Olive doesn’t implement quantization algorithms itself — it calls underlying tools like ONNX Runtime Quantization, TensorRT, and OpenVINO. Target hardware covers Qualcomm, AMD, Nvidia, and Intel NPU/GPU/CPU. Latest version: olive-ai 0.11.0 (2026-01-29).
- Optimum Intel: A semi-orchestrator — it wraps ov.convert_model (Layer 2) + NNCF (Layer 1) to provide a one-click Hugging Face to OpenVINO IR quantization pipeline. Unlike Olive, it doesn’t support multiple backends; it focuses exclusively on the Intel hardware stack.
Key characteristic: Compose and reuse, don’t reinvent the wheel. The value of orchestrators lies in reducing the complexity of assembling toolchains, especially when you need multiple optimization techniques in combination (e.g., quantization + pruning + knowledge distillation).
Layer 4: Runtime-Integrated
Tools at this layer tightly couple algorithm + format + execution — quantization and the inference engine are different components of the same system and cannot be used independently.
Typical tools:
- llama.cpp’s llama-quantize: Directly applies Q4_0, Q4_K_M, IQ2_XS, and other quantization schemes to GGUF files; the resulting models can only be used with llama.cpp for inference
- TensorRT-LLM built-in quantization: Embeds quantization options in the trtllm-build pipeline (--quant_mode int8_kv_cache, etc.); the resulting .plan files can only be used with TensorRT for inference
- SGLang quantization: Built-in AWQ/GPTQ/FP8 support, but the highly optimized kernels are tightly bound to SGLang’s RadixAttention scheduler
Key characteristic: Maximum performance, minimum portability. Tight coupling between format and engine means you can’t migrate to another inference system, but it also enables extreme co-design optimization (e.g., llama.cpp’s K-quant multi-level scales and deep SIMD kernel integration).
The interactive component below shows the complete four-layer structure. Click each layer to see the tools at that layer and their typical input/output formats:
[Interactive component: Layered Map of the Quantization and Conversion Toolchain. Four-layer architecture from algorithm libraries to engine built-in schemes; each layer lists its definition, generic I/O contract, and included tools.]
Tour of the Four Major Ecosystems
With the layered framework in hand, let’s tour the major tool ecosystems organized by hardware camp. Each camp has its own “full-stack solution,” forming a closed loop from algorithm library to inference engine.
The Intel Camp
Intel’s model optimization stack is a three-piece suite: Optimum Intel (orchestrator) + NNCF (algorithm library) + OpenVINO Converter (ov.convert_model, format converter). Optimum Intel wraps the latter two, providing a one-click Hugging Face to OpenVINO IR quantization pipeline — notably integrating NNCF’s accuracy-aware quantization (via the quantize_with_accuracy_control API).
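As an illustration of what this one-click pipeline looks like from the user's side, here is a hedged sketch using Optimum Intel's Python API; class and argument names follow optimum-intel's documented examples, but treat them as assumptions to verify against the current docs.

```python
# A hedged sketch of the Optimum Intel one-click path: export + NNCF weight-only INT4 in one call.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

q_config = OVWeightQuantizationConfig(bits=4)  # weight-only quantization, performed by NNCF under the hood
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,                  # runs ov.convert_model (Layer 2)
    quantization_config=q_config, # runs NNCF weight compression (Layer 1)
)
model.save_pretrained("llama-3.1-8b-ov-int4")  # writes OpenVINO IR (.xml + .bin)
```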
An important distinction: Intel Neural Compressor (INC) and NNCF are different product lines, not competitors. INC v3.7 (2025-12-25) is still actively maintained and targets Gaudi AI accelerators + Xeon/Core Ultra native stacks — it does not target OpenVINO as a backend and is suited for server and Gaudi cluster deployments. NNCF is positioned as an OpenVINO-first compression library but is also integrated into ecosystems like PyTorch (torch.compile), ExecuTorch, and Microsoft Olive.
A deep dive into the Intel ecosystem (NNCF’s algorithm matrix, Optimum Intel’s specific usage, OpenVINO IR format details) will be covered in the next article, Deep Dive into the Intel Model Optimization Stack.
The NVIDIA Camp
NVIDIA’s toolchain centers on TensorRT-LLM (inference engine) + NVIDIA Model Optimizer (ModelOpt, quantization tool). ModelOpt was originally an internal project codenamed AMMO (Algorithmic Model Optimizer) and was officially renamed to the modelopt package in May 2024 with v0.11. In December 2025, NVIDIA rebranded it again from “TensorRT Model Optimizer” to “NVIDIA Model Optimizer,” dropping the “TensorRT” prefix and migrating the repository to NVIDIA/Model-Optimizer (the old URL still works). Latest version: ModelOpt 0.43.0 (2026-04-16).
The call chain is: ModelOpt produces a quantized model -> TensorRT-LLM compiles it into an engine. ModelOpt supports multiple quantization methods (FP8, INT8, INT4 weight-only, AWQ, SmoothQuant), and its output is still a PyTorch checkpoint or Hugging Face safetensors — but with embedded quantization metadata (scale / zero-point / quantization scheme). Next, trtllm-build reads this metadata and compiles a TensorRT engine (.plan file) — a typical Layer 4 flow where the final artifact can only run on TensorRT-LLM.
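A hedged sketch of the first half of that call chain (the ModelOpt quantization step) is shown below; function and config names follow NVIDIA Model Optimizer's published examples, but treat the exact names as assumptions.

```python
# A hedged sketch of the ModelOpt step: quantize in PyTorch, then hand the checkpoint to trtllm-build.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").cuda()

def forward_loop(m):
    # run a few hundred calibration samples through the model here to collect activation statistics
    ...

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
# The resulting checkpoint carries scale / zero-point metadata; trtllm-build then compiles it
# into a TensorRT engine (.plan), which only runs on TensorRT-LLM.
```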
It’s worth emphasizing that while NVIDIA’s stack delivers exceptional performance on H100/A100 (especially with FP8 Tensor Core acceleration), portability is low — TensorRT engines can’t be migrated across hardware (even different cards within the same GPU architecture require recompilation). This contrasts with open formats like ONNX and GGUF.
The Hugging Face Universal Layer
The Hugging Face ecosystem is the cross-hardware “neutral layer,” centered on the Optimum umbrella project, which branches into multiple sub-packages:
- Optimum Intel (covered above)
- Optimum NVIDIA (TensorRT integration)
- Optimum AMD (ROCm integration)
- Optimum Habana (Gaudi integration)
Beyond Optimum, Hugging Face also maintains or integrates several quantization tools:
- bitsandbytes: NF4 (4-bit NormalFloat) quantization, primarily used for QLoRA training and the transformers library’s load_in_4bit loading option
- AutoGPTQ / AutoAWQ: Community-maintained GPTQ and AWQ implementations, outputting Hugging Face-compatible quantized models
- llm-compressor: A tool that rose to prominence in the second half of 2024, worth a closer look
llm-compressor’s official positioning is “a Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM.” It is co-maintained by the vLLM Project and Red Hat AI, hosted under the vllm-project organization. vLLM’s official documentation explicitly states in its quantization chapter: “To get started with quantization, see LLM Compressor” — this is vLLM’s first-party quantization path.
Supported quantization methods include:
- Activation quantization: W8A8 (int8 & fp8), MXFP8 (experimental)
- Mixed precision: W4A16, W8A16, MXFP8A16 (experimental), NVFP4
- Algorithms: simple PTQ, GPTQ, AWQ, SmoothQuant, AutoRound
The output format is compressed-tensors, which vLLM loads natively with no additional conversion needed. Latest version: v0.10.0.1 (2026-03-13), 3.1k GitHub stars.
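Here is a hedged sketch of that first-party path; the oneshot API and GPTQModifier arguments follow llm-compressor's published examples, but import paths and argument names change between releases, so verify against the current docs.

```python
# A hedged sketch of the vLLM first-party quantization path with llm-compressor (W4A16 via GPTQ).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",           # calibration dataset
    recipe=recipe,
    output_dir="llama-3.1-8b-w4a16",   # saved in compressed-tensors format
    max_seq_length=2048,
    num_calibration_samples=512,
)
# vLLM can then load the output directory directly, e.g. LLM(model="llama-3.1-8b-w4a16").
```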
One caveat to note: vLLM also supports loading quantized models in other formats (AutoAWQ, GPTQModel, bitsandbytes, GGUF, FP8, ModelOpt, Quark, TorchAO, etc.) — llm-compressor is the officially recommended first choice, but it’s not the only option.
The ggml Camp
ggml (Georgi Gerganov’s ML library) is the underlying tensor library for llama.cpp and has spawned its own ecosystem. The core tools are:
- convert_hf_to_gguf.py: Converts Hugging Face checkpoints to GGUF format (Layer 2 converter)
- llama-quantize: Applies quantization to GGUF files (Layer 4 runtime-integrated)
The GGUF format and the llama.cpp inference engine are a textbook case of tight coupling — GGUF’s design revolves entirely around llama.cpp’s kernel optimizations (e.g., Q4_K_M’s super-block structure, IQ2_XS’s vector quantization), and other inference engines (like ONNX Runtime or TensorRT) cannot load GGUF directly. But this tight coupling also brings exceptional flexibility: llama.cpp supports 30+ quantization types (Q4_0, Q4_K_M, Q5_K_S, Q6_K, IQ2_XXS, IQ3_S, etc.), allowing fine-grained tuning for different VRAM and accuracy requirements.
For an in-depth analysis of llama.cpp’s quantization schemes (K-quant’s super-block design, I-quant’s codebook, mixed-precision strategies), see our earlier article llama.cpp Quantization Schemes.
Cross-Backend Orchestrators
Microsoft Olive is the quintessential Layer 3 orchestrator. Its official positioning is “AI Model Optimization Toolkit for the ONNX Runtime,” but the backends it actually supports extend well beyond ONNX Runtime — including TensorRT, OpenVINO, DirectML, Qualcomm QNN, and AMD ROCm.
Olive’s core abstractions are:
- Pass: A single model optimization step, such as model compression, Graph Capture, quantization, or graph optimization
- Workflow: A sequence of Passes arranged in order (e.g., INT8 quantization -> redundant operator removal -> constant folding)
- Evaluator: Assesses optimization results, supporting accuracy, latency, throughput, and memory metrics
- Search Strategy: Automated hyperparameter search, trying different Pass combinations and hyperparameters to find the Pareto-optimal solution
Olive includes 40+ built-in optimization components covering quantization (ONNX Runtime PTQ, TensorRT INT8, OpenVINO INT8), pruning, knowledge distillation, operator fusion, memory planning, and more. Its core value isn’t implementing new algorithms — it’s composing and reusing existing tools while providing an automated search framework. For example, it can automatically try “FP16 baseline vs INT8 PTQ vs INT8 QAT” and find the fastest option under a constraint like “accuracy loss < 1%.”
Latest version: olive-ai 0.11.0 (2026-01-29). Target hardware covers Qualcomm (Snapdragon), AMD (ROCm), Nvidia (TensorRT), and Intel NPU/GPU/CPU.
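To make the Pass / Workflow idea concrete, here is a hedged sketch of a two-pass Olive workflow expressed as a Python dict (Olive configs are typically JSON files). The model type, pass names, and config schema are illustrative assumptions based on Olive's documented examples; consult the Olive docs for the exact schema of the version you install.

```python
# A hedged sketch of a minimal Olive workflow: ONNX export (Layer 2) followed by INT8 PTQ (Layer 1).
from olive.workflows import run as olive_run

config = {
    "input_model": {"type": "HfModel", "model_path": "meta-llama/Llama-3.1-8B-Instruct"},
    "passes": {
        "conversion": {"type": "OnnxConversion"},      # format-conversion pass
        "quantization": {"type": "OnnxQuantization"},  # ONNX Runtime INT8 PTQ pass
    },
    "output_dir": "olive-output",
}

olive_run(config)
```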
Mobile and Edge (Brief Overview)
Mobile and edge devices have their own tool ecosystems. Here’s a brief overview (a detailed treatment will come in a future “Edge Deployment” topic):
- Apple coremltools 9.0 (2025-11-10): Supports INT4/INT8 linear quantization, 1-8 bit palettization (lookup table), sparsity, and various joint compression combinations (sparsity + quantization, sparsity + palettization). W8A8 mode can leverage Apple Silicon’s INT8 compute acceleration. Note that coremltools does not have a dedicated LLM quantization API — it provides general-purpose weight-only quantization + palettization + sparsity capabilities, and transformer/LLM workflows can use these general features (with OPT and BERT as examples). Output format is CoreML (.mlmodel), executed on iOS/macOS via the Core ML inference engine.
- AMD Quark 0.11.1 (2026-02-19): AMD’s official cross-platform quantization tool, supporting PyTorch (Eager / FX Graph / QAT) and ONNX (via ONNX Runtime Quantization) paths. Target hardware covers Ryzen AI NPU (NPU_CNN / NPU_Transformer modes), AMD Instinct GPU (ROCm, including MI300X), and CPU; it can also calibrate on NVIDIA hardware via CUDA/HIP. Algorithm support is comprehensive: PTQ, QAT, SmoothQuant, AWQ, GPTQ, QuaRot, Qronos, CLE, BiasCorrection, AdaQuant, AdaRound; calibration methods include MinMax, Percentile, MSE, and Entropy. The v0.11.1 highlight is “file-to-file quantization for ultra-large models” (weight-only + dynamic activation).
- Google AI Edge Torch 0.8.0 (2026-01-26): The official PyTorch to LiteRT (TFLite) conversion bridge, targeting Android / iOS / IoT devices. Output format is a .tflite flatbuffer, executed by the LiteRT runtime (LiteRT is TensorFlow Lite’s rebrand). Target hardware covers a wide range of CPUs, plus preliminary GPU and NPU support (the docs refer generically to NPU rather than specifically naming Edge TPU). It has two components: PyTorch Converter (Beta) and Generative API (Alpha, for LLM model rewriting and quantization). Importantly, the Generative API is still in Alpha — its stability and feature completeness have not reached production grade.
Six-Dimension Comparison Matrix
Now that we’ve surveyed the major ecosystems, let’s compare tool characteristics across six dimensions to help you filter candidates for your specific needs.
The Six Dimensions
- Role: Algorithm library / Format converter / Orchestrator / Runtime-integrated (corresponding to Layers 1-4 above)
- Input format: Hugging Face checkpoint / PyTorch state_dict / ONNX / TensorFlow SavedModel / custom format
- Output format: This is the most critical dimension, as it determines your downstream inference engine choices
- Supported quantization methods: GPTQ / AWQ / SmoothQuant / NF4 / K-quant / FP8 / INT4 weight-only / W8A8 / etc.
- Target hardware: CPU / NVIDIA GPU / Intel GPU/NPU / AMD GPU/NPU / Apple Silicon / ARM Mobile / Gaudi / Qualcomm
- Pipeline position: Used standalone / wrapped and called by higher-level tools
The Key Distinction in Output Formats
It’s important to understand the difference between inference-ready formats and intermediate formats:
Inference-ready formats (can be loaded and executed directly):
- OpenVINO IR (.xml + .bin): OpenVINO inference engine exclusive
- TensorRT engine (.plan): TensorRT / TensorRT-LLM exclusive
- GGUF: llama.cpp exclusive
- ONNX (.onnx): ONNX Runtime / ORT-derived engines (e.g., ONNX Runtime Mobile)
- CoreML (.mlmodel): Apple Core ML engine exclusive
- LiteRT (TFLite) (.tflite): Google LiteRT runtime exclusive
Intermediate formats (require one more conversion step before inference):
- GPTQ / AWQ-produced Hugging Face safetensors: Although quantized, these are still in PyTorch format and require the Hugging Face transformers library to load, or need further conversion to another format (e.g., ONNX, OpenVINO) for deployment
- PyTorch state_dict / checkpoint: Training format, requires conversion
The interactive component below shows 17 mainstream tools compared across the six dimensions. Use the filter bar at the top to quickly find tools matching your needs (e.g., “show only GGUF output,” “show only AWQ support,” “show only Intel GPU”):
Quantization Tool Matrix Browser: 17 tools × 6 dimensions, with interactive filtering
| Tool | Role | Input format | Output format | Quantization methods | Target hardware | Pipeline position |
|---|---|---|---|---|---|---|
| AutoGPTQ | algorithm-lib | HF-safetensors, PyTorch | HF-safetensors | GPTQ | NVIDIA-GPU, Intel-CPU, Intel-GPU | standalone |
| AutoAWQ | algorithm-lib | HF-safetensors | HF-safetensors | AWQ | NVIDIA-GPU, Intel-CPU | standalone |
| bitsandbytes | algorithm-lib | PyTorch (runtime) | PyTorch (runtime) | NF4, FP4, INT8 | NVIDIA-GPU | standalone (runtime) |
| NNCF | algorithm-lib | PyTorch, TorchFX, ONNX, OpenVINO-IR | same as input (compressed) | PTQ, WOQ, QAT, Sparsity, WO-QAT+LoRA | Intel-CPU, Intel-GPU, Intel-NPU | standalone or wrapped by Optimum Intel |
| llm-compressor | algorithm-lib | HF-safetensors | HF-safetensors (compressed-tensors) | W8A8, W4A16, AWQ, GPTQ, SmoothQuant, AutoRound, INT8, FP8 | NVIDIA-GPU, AMD-GPU | standalone (vLLM first-party) |
| ModelOpt (NVIDIA) | algorithm-lib | PyTorch | PyTorch (annotated) | FP8, INT8, INT4, AWQ, SmoothQuant | NVIDIA-GPU | wrapped by TensorRT-LLM |
| AMD Quark | algorithm-lib | PyTorch, ONNX | same as input (compressed) | PTQ, QAT, AWQ, GPTQ, SmoothQuant, QuaRot | AMD-Ryzen-AI-NPU, AMD-Instinct-GPU, CPU | standalone |
| coremltools (Apple) | converter + algorithm-lib | PyTorch, TensorFlow | CoreML | INT4, INT8, palettization | Apple-Silicon | standalone |
| convert_hf_to_gguf.py | converter | HF-safetensors | GGUF (FP16/BF16/F32) | none | any | standalone, feeds llama-quantize |
| ov.convert_model | converter | PyTorch, ONNX, TensorFlow | OpenVINO-IR (FP32/FP16) | none | any | standalone |
| torch.onnx.export | converter | PyTorch | ONNX | none | any | standalone |
| optimum-cli export onnx | converter | HF-checkpoint | ONNX (+ optional INT8) | INT8 | any | wraps torch.onnx.export |
| Google AI Edge Torch | converter | PyTorch | LiteRT/TFLite | INT8, INT4 | mobile-CPU, mobile-GPU, mobile-NPU | standalone |
| Microsoft Olive | orchestrator | HF-checkpoint, PyTorch, ONNX | ONNX, OpenVINO-IR, TensorRT-engine | delegates to backend (40+ passes) | Intel-CPU, Intel-GPU, AMD-GPU, NVIDIA-GPU | top-level orchestrator |
| Optimum Intel | orchestrator | HF-checkpoint | OpenVINO-IR (FP16/INT8/INT4) | WOQ, PTQ, AAQ | Intel-CPU, Intel-GPU, Intel-NPU | wraps NNCF + ov.convert_model |
| llama.cpp llama-quantize | engine-builtin | GGUF (FP16) | GGUF (Q2_K…Q8_0) | Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0, IQ1_S, IQ2_XXS | CPU, NVIDIA-GPU, Apple-Silicon, Intel-GPU, AMD-GPU | engine-builtin |
| TensorRT-LLM | engine-builtin | HF-checkpoint, ModelOpt output | TensorRT-engine | FP8, INT8, INT4, AWQ, SmoothQuant | NVIDIA-GPU | engine-builtin |
Tip: within a single dimension, filters combine with OR logic; across dimensions they combine with AND logic. Matching cells are highlighted.
Related but Often Confused: Quantization During Fine-Tuning
When discussing quantization toolchains, you’ll frequently encounter the terms LoRA and QLoRA. How do they relate to the inference quantization toolchains covered in this article? Here are three clarifications:
LoRA Is Not Quantization
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method, not a quantization technique. Its core idea: freeze the pretrained model’s original weights $W \in \mathbb{R}^{d \times k}$ and train only two low-rank matrices $B$ and $A$ (of dimensions $d \times r$ and $r \times k$, where $r \ll \min(d, k)$). At inference time, they are merged: $W' = W + BA$. The number of trainable parameters is only 0.1-1% of the original model, enabling fine-tuning of large models on consumer GPUs.
LoRA’s goal is “fewer parameters to update,” not “lower precision” — both the original weights and the LoRA adapters remain in FP16 or BF16.
QLoRA = Quantized Base Weights + LoRA Training
QLoRA (Quantized LoRA) is an advanced variant of LoRA designed specifically for memory-constrained scenarios. Key technical points:
- Frozen base weights are stored in NF4 quantization (4-bit NormalFloat, implemented by bitsandbytes); they are dequantized to FP16 only when needed for the forward pass
- LoRA adapters are trained in FP16, and gradients are also in FP16
- The sole purpose of quantization is to save training memory (since base weights are the dominant cost), not to speed up inference
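A minimal sketch of that setup using transformers + peft + bitsandbytes follows; the model ID and LoRA hyperparameters are illustrative, not prescriptions.

```python
# A minimal sketch of the QLoRA setup: NF4-quantized frozen base weights + FP16 LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat storage for the frozen base weights
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to FP16 for each forward pass
)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=bnb_config)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)  # only the low-rank A/B matrices are trainable
```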
After QLoRA training completes, you get a LoRA adapter checkpoint (typically tens to hundreds of MB). There are then two inference paths:
- Path A: Continue using the NF4-quantized base model + FP16 adapter at inference time — saves memory but inference speed is mediocre
- Path B: Use PEFT’s merge_and_unload() to merge the adapter back into the base model, producing a full FP16 checkpoint, then apply the PTQ tools discussed in this article (e.g., llm-compressor, NNCF, llama-quantize) for inference quantization
Path B is the standard approach for connecting QLoRA training with inference quantization.
QLoRA-Trained Models Can Be Converted to GGUF
A typical end-to-end workflow:
- Fine-tune Llama-3.1-8B on a custom dataset with QLoRA (base model in NF4 quantization)
- After training, call PEFT’s merge_and_unload() to get a full FP16 checkpoint
- Convert to GGUF format with convert_hf_to_gguf.py
- Apply Q4_K_M or Q5_K_M quantization with llama-quantize
- Deploy to local llama.cpp for inference
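Here is a hedged sketch of steps 2-4 of this workflow; the adapter and output paths, and the locations of llama.cpp's conversion script and llama-quantize binary, are assumptions about a local checkout and build.

```python
# A hedged sketch of steps 2-4: merge the adapter, convert to GGUF, then quantize to Q4_K_M.
import subprocess
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Step 2: merge the LoRA adapter into the base model, producing a full FP16 checkpoint
model = AutoPeftModelForCausalLM.from_pretrained("qlora-adapter-dir", torch_dtype=torch.float16)
model.merge_and_unload().save_pretrained("merged-fp16")
# the converter expects tokenizer files next to the weights (assumes they were saved with the adapter)
AutoTokenizer.from_pretrained("qlora-adapter-dir").save_pretrained("merged-fp16")

# Step 3: convert the merged checkpoint to GGUF (llama.cpp's Layer 2 converter)
subprocess.run(["python", "convert_hf_to_gguf.py", "merged-fp16",
                "--outfile", "model-f16.gguf", "--outtype", "f16"], check=True)

# Step 4: apply Q4_K_M quantization (llama.cpp's Layer 4 tool)
subprocess.run(["./llama-quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"], check=True)
```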
This shows that QLoRA and inference quantization toolchains are not mutually exclusive — they can be chained together. QLoRA solves “how to fine-tune large models at low cost,” while inference quantization toolchains solve “how to deploy large models at low cost.”
To be clear: this section does not cover LoRA / QLoRA training details (that belongs to the fine-tuning topic). We’re only clarifying “does it count as a quantization tool” — the answer is QLoRA is a training-time memory optimization technique, not part of the inference quantization toolchain, but the two can be connected end-to-end.
Tool Selection Decision Tree
With the layered framework, ecosystem map, and comparison matrix in hand, we now need a decision process to quickly narrow down the candidate tools. The decision is based on three key questions:
Q1: What Is Your Target Hardware?
This is the top-priority question, because hardware determines the ecosystem:
- Intel CPU / iGPU / Arc GPU: Intel camp (Optimum Intel + NNCF + OpenVINO)
- NVIDIA GPU (consumer / datacenter): NVIDIA camp (TensorRT-LLM + ModelOpt) or Hugging Face universal layer (vLLM + llm-compressor)
- Apple Silicon (M1/M2/M3/M4): Apple coremltools (mobile/edge) or llama.cpp (general purpose)
- AMD GPU (ROCm): AMD Quark + ONNX Runtime, or vLLM (ROCm backend)
- CPU only (x86 / ARM): llama.cpp (first choice) or ONNX Runtime
- Mobile devices (Android / iOS): Google AI Edge Torch (Android) / Apple coremltools (iOS)
- Edge NPU (Qualcomm / Rockchip / etc.): Vendor-specific tools + ONNX / TFLite
Q2: How Sensitive Are You to Accuracy?
Quantization introduces accuracy loss, and different tasks have different tolerances:
- Lenient (UI copy generation, casual chatbots): Aggressive quantization like Q4_K_S or IQ2_XS is fine; a perplexity increase of 0.1-0.2 is acceptable
- Moderate (general conversation, document QA): Q4_K_M or Q5_K_M recommended; perplexity increase < 0.1
- Strict (code generation, math reasoning, medical consultation): Q6_K, Q8_0, or FP8 recommended, or use accuracy-aware quantization
Q3: Do You Need Accuracy-Aware Constraints?
Accuracy-aware quantization (AAQ) means setting an accuracy constraint during quantization (e.g., “inference accuracy must not drop below 99% of baseline”), and the tool automatically adjusts the quantization strategy (e.g., preserving higher precision for sensitive layers). Tools that support AAQ:
- NNCF: nncf.quantize_with_accuracy_control() API
- Microsoft Olive: Evaluator + Search Strategy combination
- Intel Neural Compressor (INC): PostTrainingQuantConfig(accuracy_criterion=...) API
If your scenario is extremely accuracy-sensitive and you’re willing to accept longer quantization times (AAQ requires multiple iterations on a validation set), prioritize tools with AAQ support.
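Here is a hedged sketch of what the NNCF path looks like; the function signature follows NNCF's documented accuracy-aware API, but the exact argument names and the validation-function contract should be verified against the current NNCF docs, and the dataset and metric pieces are placeholders.

```python
# A hedged sketch of accuracy-aware quantization with NNCF on an OpenVINO IR model.
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model_fp16.xml")

def transform_fn(item):
    return item  # map a dataset item to the model's input format (placeholder)

calibration_items = []  # fill with real calibration samples
validation_items = []   # fill with labeled validation samples
calibration_dataset = nncf.Dataset(calibration_items, transform_fn)
validation_dataset = nncf.Dataset(validation_items, transform_fn)

def validate(compiled_model, dataset):
    # compute and return a scalar accuracy metric over `dataset` (placeholder)
    ...

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validate,
    max_drop=0.01,  # tolerate at most a 1% accuracy drop versus the FP16 baseline
)
ov.save_model(quantized_model, "model_int8_aaq.xml")
```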
Decision Tree Example
Here is a concrete leaf-node example:
Scenario: Intel iGPU + strict accuracy requirements + accuracy-aware needed
-> Recommended solution: Optimum Intel + NNCF quantize_with_accuracy_control
-> Rationale: NNCF is Intel’s official compression library with the deepest OpenVINO integration; quantize_with_accuracy_control automatically iterates on a validation set to find the most aggressive quantization strategy that meets the accuracy constraint (potentially mixed precision with some layers in INT8 and others in FP16)
-> Output format: OpenVINO IR (.xml + .bin)
-> Inference engine: OpenVINO Runtime
The interactive component below implements the complete decision tree. Click through each question’s options to progressively narrow down to a recommended solution (including tool combination, rationale, and jump links):
[Interactive component: Tool Selection Decision Tree, with three questions leading to a recommended solution.]
Transition to the Next Article
If you selected the Intel camp in the decision tree, your next question will be how to choose among Optimum Intel, NNCF, and OpenVINO Converter specifically — what are the calling relationships between them? When should you use NNCF directly versus going through the Optimum Intel wrapper? How do you navigate NNCF’s 20+ algorithms across 4 backends (OpenVINO / PyTorch / TorchFX / ONNX)? What does the OpenVINO IR format look like internally?
These questions will be answered in detail in the next article, Deep Dive into the Intel Model Optimization Stack. The article after that (the third in the series) will be a hands-on practice guide: starting from a Hugging Face checkpoint, performing INT8 quantization + accuracy-aware tuning with Optimum Intel + NNCF, exporting to OpenVINO IR, and deploying inference on an Intel iGPU — walking through the complete end-to-end flow.
If you chose a different camp (e.g., NVIDIA, llama.cpp, mobile), we’ll publish corresponding deep dives and hands-on guides in the future. The overarching goal of the Quantization and Model Conversion Toolchain series is to take you beyond just knowing “what tools exist” to understanding “why they’re designed this way,” “what to use when,” and “how to combine them” — from confusion to clarity, from tool selection to production deployment.