Quantization and Model Conversion Toolchain Landscape
Updated 2026-04-17
A Common Confusion
“I want to run Llama-3.1-8B on my laptop with an Intel iGPU (Iris Xe). I found the model checkpoint on Hugging Face, but what do I do next?”
This is one of the most common questions in the community. You start searching and run into a pile of tool names: Optimum, NNCF, OpenVINO, llama.cpp, ONNX Runtime, TensorRT-LLM. Each claims to “quantize” and “accelerate” models. Some docs tell you to “convert to ONNX first,” others say “convert directly to OpenVINO IR,” and still others insist that “GGUF is the most universal format.” The more you read, the more confused you get — how do these tools relate to each other? Are they competitors or upstream/downstream dependencies? Which one should I use?
If this sounds familiar, this article will give you a clear framework for navigating the landscape. We won’t just list tool names. Instead, we’ll start from a pipeline layering perspective to understand each tool’s role in the quantization and model conversion pipeline, then use a comparison matrix and decision tree to help you pick the best tool combination for your specific scenario.
The Pipeline Layer Map
The key to understanding the toolchain is recognizing that quantization and model conversion are not a single step — they form a four-layer pipeline. Different tools operate at different layers; some handle only one layer, while others orchestrate across multiple layers. Here is the four-layer framework:
Layer 1: Algorithm Libraries
Tools at this layer do one thing: modify the numerical values of weights (quantization, pruning, knowledge distillation, etc.) while keeping the input and output format identical. For example, you feed in a Hugging Face safetensors checkpoint and get back safetensors — except the weights have changed from FP16 to INT4.
Typical tools:
- AutoGPTQ / AutoAWQ: Implement the GPTQ and AWQ quantization algorithms, outputting Hugging Face-compatible quantized models
- bitsandbytes: NF4 (4-bit NormalFloat) quantization, primarily used during training (QLoRA) and for inference memory optimization
- llm-compressor: The officially recommended quantization tool for vLLM (more on this below)
- NNCF (Neural Network Compression Framework): Intel’s open-source compression algorithm library, supporting PTQ, QAT, pruning, and knowledge distillation
- SmoothQuant: Smooths activation outliers to enable W8A8 (weight + activation INT8) quantization
Key characteristic: Pure algorithms, format-agnostic. These tools don’t care what inference engine you’ll use — you can quantize a model with NNCF and then export it to ONNX, OpenVINO, or even TensorRT.
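To make the "same format in, same format out" contract concrete, here is a minimal sketch of a Layer 1 tool in action, using AutoAWQ. The model ID and quant_config values are assumptions borrowed from AutoAWQ's documented examples; verify exact argument names against the current AutoAWQ docs.

```python
# A minimal Layer 1 sketch: Hugging Face safetensors in, AWQ-quantized safetensors out.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized("llama-3.1-8b-awq")              # still a Hugging Face-compatible checkpoint
tokenizer.save_pretrained("llama-3.1-8b-awq")
```

Note that the output directory can then be fed to a Layer 2 converter or loaded directly by engines that understand AWQ checkpoints.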
Layer 2: Format Converters
Tools at this layer do one thing: change the model format (from training framework format to inference engine format), optionally performing lightweight quantization (e.g., FP32 to FP16 or simple symmetric INT8 quantization).
Typical tools:
- ov.convert_model: OpenVINO’s official conversion API, from PyTorch / TensorFlow / ONNX to OpenVINO IR (.xml + .bin)
- convert_hf_to_gguf.py: llama.cpp’s official conversion script, from Hugging Face checkpoint to GGUF format
- trtllm-build: TensorRT-LLM’s build tool, from checkpoint to TensorRT engine (.plan files). When used with quantization options, it also belongs to Layer 4 (see “Runtime-Integrated” below)
- optimum-cli export onnx: Hugging Face Optimum’s export command, from Transformers to ONNX
Key characteristic: Format conversion first, quantization second. Many converters only perform format mapping (e.g., PyTorch Module to ONNX graph); quantization is an optional post-processing step.
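Here is a minimal sketch of a pure Layer 2 conversion with ov.convert_model; the torchvision model is just a small stand-in for illustration, not part of the original discussion.

```python
# A minimal Layer 2 sketch: PyTorch module in, OpenVINO IR (.xml + .bin) out, no quantization.
import torch
import openvino as ov
from torchvision.models import resnet18  # small stand-in model, purely for illustration

model = resnet18(weights=None).eval()
ov_model = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
ov.save_model(ov_model, "resnet18.xml")  # writes resnet18.xml + resnet18.bin; weights stored as FP16 by default
```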
Layer 3: Orchestrators
Tools at this layer don’t implement specific algorithms or format conversion logic — instead, they chain Layer 1 and Layer 2 tools into a pipeline and provide higher-level features like search, evaluation, and configuration management.
Typical tools:
- Microsoft Olive: A cross-backend model optimization orchestrator whose core abstractions are Pass + Workflow + Evaluator + Search Strategy. A Pass is a single optimization step (e.g., quantization, graph optimization, Graph Capture). A Workflow chains Passes in sequence. An Evaluator assesses results (accuracy / latency). A Search Strategy automates hyperparameter tuning. Olive doesn’t implement quantization algorithms itself — it calls underlying tools like ONNX Runtime Quantization, TensorRT, and OpenVINO. Target hardware covers Qualcomm, AMD, Nvidia, and Intel NPU/GPU/CPU. Latest version: olive-ai 0.11.0 (2026-01-29).
- Optimum Intel: A semi-orchestrator — it wraps ov.convert_model (Layer 2) + NNCF (Layer 1) to provide a one-click Hugging Face to OpenVINO IR quantization pipeline. Unlike Olive, it doesn’t support multiple backends; it focuses exclusively on the Intel hardware stack.
Key characteristic: Compose and reuse, don’t reinvent the wheel. The value of orchestrators lies in reducing the complexity of assembling toolchains, especially when you need multiple optimization techniques in combination (e.g., quantization + pruning + knowledge distillation).
Layer 4: Runtime-Integrated
Tools at this layer tightly couple algorithm + format + execution — quantization and the inference engine are different components of the same system and cannot be used independently.
Typical tools:
- llama.cpp’s llama-quantize: Directly applies Q4_0, Q4_K_M, IQ2_XS, and other quantization schemes to GGUF files; the resulting models can only be used with llama.cpp for inference
- TensorRT-LLM built-in quantization: Embeds quantization options in the trtllm-build pipeline (--quant_mode int8_kv_cache, etc.); the resulting .plan files can only be used with TensorRT for inference
- SGLang quantization: Built-in AWQ/GPTQ/FP8 support, but the highly optimized kernels are tightly bound to SGLang’s RadixAttention scheduler
Key characteristic: Maximum performance, minimum portability. Tight coupling between format and engine means you can’t migrate to another inference system, but it also enables extreme co-design optimization (e.g., llama.cpp’s K-quant multi-level scales and deep SIMD kernel integration).
The interactive component below shows the complete four-layer structure. Click each layer to see the tools at that layer and their typical input/output formats:
[Interactive component: Layered Map of the Quantization and Conversion Toolchain. Four-layer architecture from algorithm libraries to engine built-in schemes; each layer lists its definition, generic I/O contract, and included tools.]
Tour of the Four Major Ecosystems
With the layered framework in hand, let’s tour the major tool ecosystems organized by hardware camp. Each camp has its own “full-stack solution,” forming a closed loop from algorithm library to inference engine.
The Intel Camp
Intel’s model optimization stack is a three-piece suite: Optimum Intel (orchestrator) + NNCF (algorithm library) + OpenVINO Converter (ov.convert_model, format converter). Optimum Intel wraps the latter two, providing a one-click Hugging Face to OpenVINO IR quantization pipeline — notably integrating NNCF’s accuracy-aware quantization (via the quantize_with_accuracy_control API).
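As an illustration of what this one-click pipeline looks like from the user's side, here is a hedged sketch using Optimum Intel's Python API; class and argument names follow optimum-intel's documented examples, but treat them as assumptions to verify against the current docs.

```python
# A hedged sketch of the Optimum Intel one-click path: export + NNCF weight-only INT4 in one call.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

q_config = OVWeightQuantizationConfig(bits=4)  # weight-only quantization, performed by NNCF under the hood
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,                  # runs ov.convert_model (Layer 2)
    quantization_config=q_config, # runs NNCF weight compression (Layer 1)
)
model.save_pretrained("llama-3.1-8b-ov-int4")  # writes OpenVINO IR (.xml + .bin)
```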
An important distinction: Intel Neural Compressor (INC) and NNCF are different product lines, not competitors. INC v3.7 (2025-12-25) is still actively maintained and targets Gaudi AI accelerators + Xeon/Core Ultra native stacks — it does not target OpenVINO as a backend and is suited for server and Gaudi cluster deployments. NNCF is positioned as an OpenVINO-first compression library but is also integrated into ecosystems like PyTorch (torch.compile), ExecuTorch, and Microsoft Olive.
A deep dive into the Intel ecosystem (NNCF’s algorithm matrix, Optimum Intel’s specific usage, OpenVINO IR format details) will be covered in the next article, Deep Dive into the Intel Model Optimization Stack.
The NVIDIA Camp
NVIDIA’s toolchain centers on TensorRT-LLM (inference engine) + NVIDIA Model Optimizer (ModelOpt, quantization tool). ModelOpt was originally an internal project codenamed AMMO (Algorithmic Model Optimizer) and was officially renamed to the modelopt package in May 2024 with v0.11. In December 2025, NVIDIA rebranded it again from “TensorRT Model Optimizer” to “NVIDIA Model Optimizer,” dropping the “TensorRT” prefix and migrating the repository to NVIDIA/Model-Optimizer (the old URL still works). Latest version: ModelOpt 0.43.0 (2026-04-16).
The call chain is: ModelOpt produces a quantized model -> TensorRT-LLM compiles it into an engine. ModelOpt supports multiple quantization methods (FP8, INT8, INT4 weight-only, AWQ, SmoothQuant), and its output is still a PyTorch checkpoint or Hugging Face safetensors — but with embedded quantization metadata (scale / zero-point / quantization scheme). Next, trtllm-build reads this metadata and compiles a TensorRT engine (.plan file) — a typical Layer 4 flow where the final artifact can only run on TensorRT-LLM.
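A hedged sketch of the first half of that call chain (the ModelOpt quantization step) is shown below; function and config names follow NVIDIA Model Optimizer's published examples, but treat the exact names as assumptions.

```python
# A hedged sketch of the ModelOpt step: quantize in PyTorch, then hand the checkpoint to trtllm-build.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").cuda()

def forward_loop(m):
    # run a few hundred calibration samples through the model here to collect activation statistics
    ...

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
# The resulting checkpoint carries scale / zero-point metadata; trtllm-build then compiles it
# into a TensorRT engine (.plan), which only runs on TensorRT-LLM.
```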
It’s worth emphasizing that while NVIDIA’s stack delivers exceptional performance on H100/A100 (especially with FP8 Tensor Core acceleration), portability is low — TensorRT engines can’t be migrated across hardware (even different cards within the same GPU architecture require recompilation). This contrasts with open formats like ONNX and GGUF.
The Hugging Face Universal Layer
The Hugging Face ecosystem is the cross-hardware “neutral layer,” centered on the Optimum umbrella project, which branches into multiple sub-packages:
- Optimum Intel (covered above)
- Optimum NVIDIA (TensorRT integration)
- Optimum AMD (ROCm integration)
- Optimum Habana (Gaudi integration)
Beyond Optimum, Hugging Face also maintains or integrates several quantization tools:
- bitsandbytes: NF4 (4-bit NormalFloat) quantization, primarily used for QLoRA training and the transformers library’s load_in_4bit loading option
- AutoGPTQ / AutoAWQ: Community-maintained GPTQ and AWQ implementations, outputting Hugging Face-compatible quantized models
- llm-compressor: A tool that rose to prominence in the second half of 2024, worth a closer look
llm-compressor’s official positioning is “a Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM.” It is co-maintained by the vLLM Project and Red Hat AI, hosted under the vllm-project organization. vLLM’s official documentation explicitly states in its quantization chapter: “To get started with quantization, see LLM Compressor” — this is vLLM’s first-party quantization path.
Supported quantization methods include:
- Activation quantization: W8A8 (int8 & fp8), MXFP8 (experimental)
- Mixed precision: W4A16, W8A16, MXFP8A16 (experimental), NVFP4
- Algorithms: simple PTQ, GPTQ, AWQ, SmoothQuant, AutoRound
The output format is compressed-tensors, which vLLM loads natively with no additional conversion needed. Latest version: v0.10.0.1 (2026-03-13), 3.1k GitHub stars.
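Here is a hedged sketch of that first-party path; the oneshot API and GPTQModifier arguments follow llm-compressor's published examples, but import paths and argument names change between releases, so verify against the current docs.

```python
# A hedged sketch of the vLLM first-party quantization path with llm-compressor (W4A16 via GPTQ).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",           # calibration dataset
    recipe=recipe,
    output_dir="llama-3.1-8b-w4a16",   # saved in compressed-tensors format
    max_seq_length=2048,
    num_calibration_samples=512,
)
# vLLM can then load the output directory directly, e.g. LLM(model="llama-3.1-8b-w4a16").
```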
One caveat to note: vLLM also supports loading quantized models in other formats (AutoAWQ, GPTQModel, bitsandbytes, GGUF, FP8, ModelOpt, Quark, TorchAO, etc.) — llm-compressor is the officially recommended first choice, but it’s not the only option.
The ggml Camp
ggml (Georgi Gerganov’s ML library) is the underlying tensor library for llama.cpp and has spawned its own ecosystem. The core tools are:
- convert_hf_to_gguf.py: Converts Hugging Face checkpoints to GGUF format (Layer 2 converter)
- llama-quantize: Applies quantization to GGUF files (Layer 4 runtime-integrated)
The GGUF format and the llama.cpp inference engine are a textbook case of tight coupling — GGUF’s design revolves entirely around llama.cpp’s kernel optimizations (e.g., Q4_K_M’s super-block structure, IQ2_XS’s vector quantization), and other inference engines (like ONNX Runtime or TensorRT) cannot load GGUF directly. But this tight coupling also brings exceptional flexibility: llama.cpp supports 30+ quantization types (Q4_0, Q4_K_M, Q5_K_S, Q6_K, IQ2_XXS, IQ3_S, etc.), allowing fine-grained tuning for different VRAM and accuracy requirements.
For an in-depth analysis of llama.cpp’s quantization schemes (K-quant’s super-block design, I-quant’s codebook, mixed-precision strategies), see our earlier article llama.cpp Quantization Schemes.
Cross-Backend Orchestrators
Microsoft Olive is the quintessential Layer 3 orchestrator. Its official positioning is “AI Model Optimization Toolkit for the ONNX Runtime,” but the backends it actually supports extend well beyond ONNX Runtime — including TensorRT, OpenVINO, DirectML, Qualcomm QNN, and AMD ROCm.
Olive’s core abstractions are:
- Pass: A single model optimization step, such as model compression, Graph Capture, quantization, or graph optimization
- Workflow: A sequence of Passes arranged in order (e.g., INT8 quantization -> redundant operator removal -> constant folding)
- Evaluator: Assesses optimization results, supporting accuracy, latency, throughput, and memory metrics
- Search Strategy: Automated hyperparameter search, trying different Pass combinations and hyperparameters to find the Pareto-optimal solution
Olive includes 40+ built-in optimization components covering quantization (ONNX Runtime PTQ, TensorRT INT8, OpenVINO INT8), pruning, knowledge distillation, operator fusion, memory planning, and more. Its core value isn’t implementing new algorithms — it’s composing and reusing existing tools while providing an automated search framework. For example, it can automatically try “FP16 baseline vs INT8 PTQ vs INT8 QAT” and find the fastest option under a constraint like “accuracy loss < 1%.”
Latest version: olive-ai 0.11.0 (2026-01-29). Target hardware covers Qualcomm (Snapdragon), AMD (ROCm), Nvidia (TensorRT), and Intel NPU/GPU/CPU.
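To make the Pass / Workflow idea concrete, here is a hedged sketch of a two-pass Olive workflow expressed as a Python dict (Olive configs are typically JSON files). The model type, pass names, and config schema are illustrative assumptions based on Olive's documented examples; consult the Olive docs for the exact schema of the version you install.

```python
# A hedged sketch of a minimal Olive workflow: ONNX export (Layer 2) followed by INT8 PTQ (Layer 1).
from olive.workflows import run as olive_run

config = {
    "input_model": {"type": "HfModel", "model_path": "meta-llama/Llama-3.1-8B-Instruct"},
    "passes": {
        "conversion": {"type": "OnnxConversion"},      # format-conversion pass
        "quantization": {"type": "OnnxQuantization"},  # ONNX Runtime INT8 PTQ pass
    },
    "output_dir": "olive-output",
}

olive_run(config)
```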
Mobile and Edge (Brief Overview)
Mobile and edge devices have their own tool ecosystems. Here’s a brief overview (a detailed treatment will come in a future “Edge Deployment” topic):
- Apple coremltools 9.0 (2025-11-10): Supports INT4/INT8 linear quantization, 1-8 bit palettization (lookup table), sparsity, and various joint compression combinations (sparsity + quantization, sparsity + palettization). W8A8 mode can leverage Apple Silicon’s INT8 compute acceleration. Note that coremltools does not have a dedicated LLM quantization API — it provides general-purpose weight-only quantization + palettization + sparsity capabilities, and transformer/LLM workflows can use these general features (with OPT and BERT as examples). Output format is CoreML (.mlmodel), executed on iOS/macOS via the Core ML inference engine.
- AMD Quark 0.11.1 (2026-02-19): AMD’s official cross-platform quantization tool, supporting PyTorch (Eager / FX Graph / QAT) and ONNX (via ONNX Runtime Quantization) paths. Target hardware covers Ryzen AI NPU (NPU_CNN / NPU_Transformer modes), AMD Instinct GPU (ROCm, including MI300X), and CPU; it can also calibrate on NVIDIA hardware via CUDA/HIP. Algorithm support is comprehensive: PTQ, QAT, SmoothQuant, AWQ, GPTQ, QuaRot, Qronos, CLE, BiasCorrection, AdaQuant, AdaRound; calibration methods include MinMax, Percentile, MSE, and Entropy. The v0.11.1 highlight is “file-to-file quantization for ultra-large models” (weight-only + dynamic activation).
- Google AI Edge Torch 0.8.0 (2026-01-26): The official PyTorch to LiteRT (TFLite) conversion bridge, targeting Android / iOS / IoT devices. Output format is a .tflite flatbuffer, executed by the LiteRT runtime (LiteRT is TensorFlow Lite’s rebrand). Target hardware covers a wide range of CPUs, plus preliminary GPU and NPU support (the docs refer generically to NPU rather than specifically naming Edge TPU). It has two components: PyTorch Converter (Beta) and Generative API (Alpha, for LLM model rewriting and quantization). Importantly, the Generative API is still in Alpha — its stability and feature completeness have not reached production grade.
Six-Dimension Comparison Matrix
Now that we’ve surveyed the major ecosystems, let’s compare tool characteristics across six dimensions to help you filter candidates for your specific needs.
The Six Dimensions
- Role: Algorithm library / Format converter / Orchestrator / Runtime-integrated (corresponding to Layers 1-4 above)
- Input format: Hugging Face checkpoint / PyTorch state_dict / ONNX / TensorFlow SavedModel / custom format
- Output format: This is the most critical dimension, as it determines your downstream inference engine choices
- Supported quantization methods: GPTQ / AWQ / SmoothQuant / NF4 / K-quant / FP8 / INT4 weight-only / W8A8 / etc.
- Target hardware: CPU / NVIDIA GPU / Intel GPU/NPU / AMD GPU/NPU / Apple Silicon / ARM Mobile / Gaudi / Qualcomm
- Pipeline position: Used standalone / wrapped and called by higher-level tools
The Key Distinction in Output Formats
It’s important to understand the difference between inference-ready formats and intermediate formats:
Inference-ready formats (can be loaded and executed directly):
- OpenVINO IR (.xml + .bin): OpenVINO inference engine exclusive
- TensorRT engine (.plan): TensorRT / TensorRT-LLM exclusive
- GGUF: llama.cpp exclusive
- ONNX (.onnx): ONNX Runtime / ORT-derived engines (e.g., ONNX Runtime Mobile)
- CoreML (.mlmodel): Apple Core ML engine exclusive
- LiteRT (TFLite) (.tflite): Google LiteRT runtime exclusive
Intermediate formats (require one more conversion step before inference):
- GPTQ / AWQ-produced Hugging Face safetensors: Although quantized, these are still in PyTorch format and require the Hugging Face transformers library to load, or need further conversion to another format (e.g., ONNX, OpenVINO) for deployment
- PyTorch state_dict / checkpoint: Training format, requires conversion
The interactive component below shows 17 mainstream tools compared across the six dimensions. Use the filter bar at the top to quickly find tools matching your needs (e.g., “show only GGUF output,” “show only AWQ support,” “show only Intel GPU”):
Quantization Tool Matrix Browser: 17 tools × 6 dimensions, with interactive filtering
| Tool | Role | Input format | Output format | Quantization methods | Target hardware | Pipeline position |
|---|---|---|---|---|---|---|
| AutoGPTQ | algorithm-lib | HF-safetensors, PyTorch | HF-safetensors | GPTQ | NVIDIA-GPU, Intel-CPU, Intel-GPU | standalone |
| AutoAWQ | algorithm-lib | HF-safetensors | HF-safetensors | AWQ | NVIDIA-GPU, Intel-CPU | standalone |
| bitsandbytes | algorithm-lib | PyTorch (runtime) | PyTorch (runtime) | NF4, FP4, INT8 | NVIDIA-GPU | standalone (runtime) |
| NNCF | algorithm-lib | PyTorch, TorchFX, ONNX, OpenVINO-IR | same as input (compressed) | PTQ, WOQ, QAT, Sparsity, WO-QAT+LoRA | Intel-CPU, Intel-GPU, Intel-NPU | standalone or wrapped by Optimum Intel |
| llm-compressor | algorithm-lib | HF-safetensors | HF-safetensors (compressed-tensors) | W8A8, W4A16, AWQ, GPTQ, SmoothQuant, AutoRound, INT8, FP8 | NVIDIA-GPU, AMD-GPU | standalone (vLLM first-party) |
| ModelOpt (NVIDIA) | algorithm-lib | PyTorch | PyTorch (annotated) | FP8, INT8, INT4, AWQ, SmoothQuant | NVIDIA-GPU | wrapped by TensorRT-LLM |
| AMD Quark | algorithm-lib | PyTorch, ONNX | same as input (compressed) | PTQ, QAT, AWQ, GPTQ, SmoothQuant, QuaRot | AMD-Ryzen-AI-NPU, AMD-Instinct-GPU, CPU | standalone |
| coremltools (Apple) | converter + algorithm-lib | PyTorch, TensorFlow | CoreML | INT4, INT8, palettization | Apple-Silicon | standalone |
| convert_hf_to_gguf.py | converter | HF-safetensors | GGUF (FP16/BF16/F32) | none | any | standalone, feeds llama-quantize |
| ov.convert_model | converter | PyTorch, ONNX, TensorFlow | OpenVINO-IR (FP32/FP16) | none | any | standalone |
| torch.onnx.export | converter | PyTorch | ONNX | none | any | standalone |
| optimum-cli export onnx | converter | HF-checkpoint | ONNX (+ optional INT8) | INT8 | any | wraps torch.onnx.export |
| Google AI Edge Torch | converter | PyTorch | LiteRT/TFLite | INT8, INT4 | mobile-CPU, mobile-GPU, mobile-NPU | standalone |
| Microsoft Olive | orchestrator | HF-checkpoint, PyTorch, ONNX | ONNX, OpenVINO-IR, TensorRT-engine | delegates to backend (40+ passes) | Intel-CPU, Intel-GPU, AMD-GPU, NVIDIA-GPU | top-level orchestrator |
| Optimum Intel | orchestrator | HF-checkpoint | OpenVINO-IR (FP16/INT8/INT4) | WOQ, PTQ, AAQ | Intel-CPU, Intel-GPU, Intel-NPU | wraps NNCF + ov.convert_model |
| llama.cpp llama-quantize | engine-builtin | GGUF (FP16) | GGUF (Q2_K…Q8_0) | Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0, IQ1_S, IQ2_XXS | CPU, NVIDIA-GPU, Apple-Silicon, Intel-GPU, AMD-GPU | engine-builtin |
| TensorRT-LLM | engine-builtin | HF-checkpoint, ModelOpt output | TensorRT-engine | FP8, INT8, INT4, AWQ, SmoothQuant | NVIDIA-GPU | engine-builtin |
Tip: within a single dimension, filters combine with OR logic; across dimensions they combine with AND logic. Matching cells are highlighted.
Related but Often Confused: Quantization During Fine-Tuning
When discussing quantization toolchains, you’ll frequently encounter the terms LoRA and QLoRA. How do they relate to the inference quantization toolchains covered in this article? Here are three clarifications:
LoRA Is Not Quantization
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method, not a quantization technique. Its core idea: freeze the pretrained model’s original weights $W \in \mathbb{R}^{d \times k}$ and train only two low-rank matrices $B$ and $A$ (of dimensions $d \times r$ and $r \times k$, where $r \ll \min(d, k)$). At inference time, they are merged: $W' = W + BA$. The number of trainable parameters is only 0.1-1% of the original model, enabling fine-tuning of large models on consumer GPUs.
LoRA’s goal is “fewer parameters to update,” not “lower precision” — both the original weights and the LoRA adapters remain in FP16 or BF16.
QLoRA = Quantized Base Weights + LoRA Training
QLoRA (Quantized LoRA) is an advanced variant of LoRA designed specifically for memory-constrained scenarios. Key technical points:
- Frozen base weights are stored in NF4 quantization (4-bit NormalFloat, implemented by bitsandbytes); they are dequantized to FP16 only when needed for the forward pass
- LoRA adapters are trained in FP16, and gradients are also in FP16
- The sole purpose of quantization is to save training memory (since base weights are the dominant cost), not to speed up inference
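A minimal sketch of that setup using transformers + peft + bitsandbytes follows; the model ID and LoRA hyperparameters are illustrative, not prescriptions.

```python
# A minimal sketch of the QLoRA setup: NF4-quantized frozen base weights + FP16 LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat storage for the frozen base weights
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to FP16 for each forward pass
)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=bnb_config)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)  # only the low-rank A/B matrices are trainable
```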
After QLoRA training completes, you get a LoRA adapter checkpoint (typically tens to hundreds of MB). There are then two inference paths:
- Path A: Continue using the NF4-quantized base model + FP16 adapter at inference time — saves memory but inference speed is mediocre
- Path B: Use PEFT’s merge_and_unload() to merge the adapter back into the base model, producing a full FP16 checkpoint, then apply the PTQ tools discussed in this article (e.g., llm-compressor, NNCF, llama-quantize) for inference quantization
Path B is the standard approach for connecting QLoRA training with inference quantization.
QLoRA-Trained Models Can Be Converted to GGUF
A typical end-to-end workflow:
- Fine-tune Llama-3.1-8B on a custom dataset with QLoRA (base model in NF4 quantization)
- After training, call PEFT’s merge_and_unload() to get a full FP16 checkpoint
- Convert to GGUF format with convert_hf_to_gguf.py
- Apply Q4_K_M or Q5_K_M quantization with llama-quantize
- Deploy to local llama.cpp for inference
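Here is a hedged sketch of steps 2-4 of this workflow; the adapter and output paths, and the locations of llama.cpp's conversion script and llama-quantize binary, are assumptions about a local checkout and build.

```python
# A hedged sketch of steps 2-4: merge the adapter, convert to GGUF, then quantize to Q4_K_M.
import subprocess
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Step 2: merge the LoRA adapter into the base model, producing a full FP16 checkpoint
model = AutoPeftModelForCausalLM.from_pretrained("qlora-adapter-dir", torch_dtype=torch.float16)
model.merge_and_unload().save_pretrained("merged-fp16")
# the converter expects tokenizer files next to the weights (assumes they were saved with the adapter)
AutoTokenizer.from_pretrained("qlora-adapter-dir").save_pretrained("merged-fp16")

# Step 3: convert the merged checkpoint to GGUF (llama.cpp's Layer 2 converter)
subprocess.run(["python", "convert_hf_to_gguf.py", "merged-fp16",
                "--outfile", "model-f16.gguf", "--outtype", "f16"], check=True)

# Step 4: apply Q4_K_M quantization (llama.cpp's Layer 4 tool)
subprocess.run(["./llama-quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"], check=True)
```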
This shows that QLoRA and inference quantization toolchains are not mutually exclusive — they can be chained together. QLoRA solves “how to fine-tune large models at low cost,” while inference quantization toolchains solve “how to deploy large models at low cost.”
To be clear: this section does not cover LoRA / QLoRA training details (that belongs to the fine-tuning topic). We’re only clarifying “does it count as a quantization tool” — the answer is QLoRA is a training-time memory optimization technique, not part of the inference quantization toolchain, but the two can be connected end-to-end.
Tool Selection Decision Tree
With the layered framework, ecosystem map, and comparison matrix in hand, we now need a decision process to quickly narrow down the candidate tools. The decision is based on three key questions:
Q1: What Is Your Target Hardware?
This is the top-priority question, because hardware determines the ecosystem:
- Intel CPU / iGPU / Arc GPU: Intel camp (Optimum Intel + NNCF + OpenVINO)
- NVIDIA GPU (consumer / datacenter): NVIDIA camp (TensorRT-LLM + ModelOpt) or Hugging Face universal layer (vLLM + llm-compressor)
- Apple Silicon (M1/M2/M3/M4): Apple coremltools (mobile/edge) or llama.cpp (general purpose)
- AMD GPU (ROCm): AMD Quark + ONNX Runtime, or vLLM (ROCm backend)
- CPU only (x86 / ARM): llama.cpp (first choice) or ONNX Runtime
- Mobile devices (Android / iOS): Google AI Edge Torch (Android) / Apple coremltools (iOS)
- Edge NPU (Qualcomm / Rockchip / etc.): Vendor-specific tools + ONNX / TFLite
Q2: How Sensitive Are You to Accuracy?
Quantization introduces accuracy loss, and different tasks have different tolerances:
- Lenient (UI copy generation, casual chatbots): Aggressive quantization like Q4_K_S or IQ2_XS is fine; a perplexity increase of 0.1-0.2 is acceptable
- Moderate (general conversation, document QA): Q4_K_M or Q5_K_M recommended; perplexity increase < 0.1
- Strict (code generation, math reasoning, medical consultation): Q6_K, Q8_0, or FP8 recommended, or use accuracy-aware quantization
Q3: Do You Need Accuracy-Aware Constraints?
Accuracy-aware quantization (AAQ) means setting an accuracy constraint during quantization (e.g., “inference accuracy must not drop below 99% of baseline”), and the tool automatically adjusts the quantization strategy (e.g., preserving higher precision for sensitive layers). Tools that support AAQ:
- NNCF: nncf.quantize_with_accuracy_control() API
- Microsoft Olive: Evaluator + Search Strategy combination
- Intel Neural Compressor (INC): PostTrainingQuantConfig(accuracy_criterion=...) API
If your scenario is extremely accuracy-sensitive and you’re willing to accept longer quantization times (AAQ requires multiple iterations on a validation set), prioritize tools with AAQ support.
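Here is a hedged sketch of what the NNCF path looks like; the function signature follows NNCF's documented accuracy-aware API, but the exact argument names and the validation-function contract should be verified against the current NNCF docs, and the dataset and metric pieces are placeholders.

```python
# A hedged sketch of accuracy-aware quantization with NNCF on an OpenVINO IR model.
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model_fp16.xml")

def transform_fn(item):
    return item  # map a dataset item to the model's input format (placeholder)

calibration_items = []  # fill with real calibration samples
validation_items = []   # fill with labeled validation samples
calibration_dataset = nncf.Dataset(calibration_items, transform_fn)
validation_dataset = nncf.Dataset(validation_items, transform_fn)

def validate(compiled_model, dataset):
    # compute and return a scalar accuracy metric over `dataset` (placeholder)
    ...

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validate,
    max_drop=0.01,  # tolerate at most a 1% accuracy drop versus the FP16 baseline
)
ov.save_model(quantized_model, "model_int8_aaq.xml")
```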
Decision Tree Example
Here is a concrete leaf-node example:
Scenario: Intel iGPU + strict accuracy requirements + accuracy-aware needed
-> Recommended solution: Optimum Intel + NNCF quantize_with_accuracy_control
-> Rationale: NNCF is Intel’s official compression library with the deepest OpenVINO integration; quantize_with_accuracy_control automatically iterates on a validation set to find the most aggressive quantization strategy that meets the accuracy constraint (potentially mixed precision with some layers in INT8 and others in FP16)
-> Output format: OpenVINO IR (.xml + .bin)
-> Inference engine: OpenVINO Runtime
The interactive component below implements the complete decision tree. Click through each question’s options to progressively narrow down to a recommended solution (including tool combination, rationale, and jump links):
[Interactive component: Tool Selection Decision Tree, with three questions leading to a recommended solution.]
Transition to the Next Article
If you selected the Intel camp in the decision tree, your next question will be how to choose among Optimum Intel, NNCF, and OpenVINO Converter specifically — what are the calling relationships between them? When should you use NNCF directly versus going through the Optimum Intel wrapper? How do you navigate NNCF’s 20+ algorithms across 4 backends (OpenVINO / PyTorch / TorchFX / ONNX)? What does the OpenVINO IR format look like internally?
These questions will be answered in detail in the next article, Deep Dive into the Intel Model Optimization Stack. The article after that (the third in the series) will be a hands-on practice guide: starting from a Hugging Face checkpoint, performing INT8 quantization + accuracy-aware tuning with Optimum Intel + NNCF, exporting to OpenVINO IR, and deploying inference on an Intel iGPU — walking through the complete end-to-end flow.
If you chose a different camp (e.g., NVIDIA, llama.cpp, mobile), we’ll publish corresponding deep dives and hands-on guides in the future. The overarching goal of the Quantization and Model Conversion Toolchain series is to take you beyond just knowing “what tools exist” to understanding “why they’re designed this way,” “what to use when,” and “how to combine them” — from confusion to clarity, from tool selection to production deployment.