
Intel Model Optimization Stack: Choosing Between Optimum Intel, NNCF, and OpenVINO


Updated 2026-04-17

Introduction: Three Tools, One Pain Point

“I want to quantize Llama-3.1-8B and run it on an Arc GPU.”

You open the Optimum Intel docs and see code like this:

model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)

Then you flip to the NNCF docs and find:

quantized_model = nncf.quantize(model, calibration_dataset)

And then the OpenVINO docs:

ov_model = ov.convert_model(torch_model)
quantized = nncf.compress_weights(ov_model)

Three tools, all claiming to do quantization. Which one should you use? Are they competing products? Or do you need all three?

The answer: they’re not competitors — they form a three-layer stack. Which one you choose depends on your use case. This article gives you the decision framework.

Call Chain Analysis: Three Layers, Not Three Choices

The core insight: Optimum Intel, NNCF, and OpenVINO Core aren’t competing tools — they’re a three-layer technology stack.

  • Optimum Intel (top layer) = Hugging Face API compatibility layer + one-liner wrappers
  • NNCF (middle layer) = compression algorithm library (PTQ / QAT / WOQ / AAQ / Sparsity)
  • OpenVINO Core (bottom layer) = format conversion (to IR) + inference engine

Call Chain 1: load_in_8bit=True

When you write this line of code:

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", 
    load_in_8bit=True
)

What happens under the hood? Tracing the call chain from Optimum Intel’s source code:

  1. OVBaseModel.from_pretrained() detects the load_in_8bit=True parameter
  2. Internally creates an OVWeightQuantizationConfig(bits=8) object
  3. Maps bits=8 to CompressWeightsMode.INT8_ASYM
  4. Calls _apply_quantization() then OVQuantizer.quantize()
  5. Routes to the _weight_only_quantization() method
  6. Finally calls nncf.compress_weights(model, dataset=None, mode=INT8_ASYM, ...)

Key detail: 8-bit weight-only quantization doesn’t require a calibration dataset (dataset=None), because INT8’s quantization range is wide enough to cover most weight distributions — simple min-max statistics are sufficient.
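
Stripped of the Optimum wrapper, the 8-bit path therefore boils down to a single NNCF call. A minimal sketch, assuming ov_model is the already-converted IR model:

import nncf

compressed = nncf.compress_weights(ov_model, mode=nncf.CompressWeightsMode.INT8_ASYM)  # dataset=None: min-max stats only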

Call Chain 2: load_in_4bit=True

The 4-bit path is similar, but with one important difference:

model = OVModelForCausalLM.from_pretrained(
    model_id, 
    load_in_4bit=True
)

Call chain:

  1. OVWeightQuantizationConfig(bits=4) maps to CompressWeightsMode.INT4_ASYM
  2. Routes to the same _weight_only_quantization() method
  3. Calls nncf.compress_weights(model, mode=INT4_ASYM, dataset=..., ...)
  4. But this time dataset is not None — Optimum Intel automatically downloads the wikitext dataset for calibration

Why does INT4 need calibration while INT8 doesn’t? Because 4-bit quantization has much less precision headroom. You need real data activation distributions to determine optimal scale and zero-point values, otherwise accuracy degrades sharply.
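
The underlying NNCF call differs from the 8-bit case only in the mode and the presence of a dataset. A sketch, assuming calib_ds is the wikitext calibration set wrapped in an nncf.Dataset:

compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    dataset=calib_ds  # calibration data is needed for acceptable 4-bit accuracy
)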

Call Chain 3: Full Quantization (Weight + Activation)

If you need to quantize activations too (W8A8), Optimum Intel uses a different code path:

from optimum.intel import OVQuantizationConfig

config = OVQuantizationConfig(bits=8)
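# NB: full (W8A8) quantization requires a calibration dataset (see the call chain below)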
model = OVModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=config
)

Call chain:

  1. OVQuantizationConfig is recognized as “full quantization” rather than weight-only
  2. Routes to the _full_quantization() method
  3. Calls nncf.quantize(model, calibration_dataset, ...)
  4. A calibration dataset is mandatory — activation quantization is impossible without data

Where Does Format Conversion Happen?

Every call chain starts with: convert to OpenVINO IR first, then quantize.

  • If OpenVINO IR model files (openvino_model.xml + openvino_model.bin) already exist on the Hugging Face Hub, from_pretrained() downloads them directly
  • If only a PyTorch checkpoint is available, Optimum Intel calls the main_export() function (from optimum.exporters.openvino), which internally wraps ov.convert_model()
  • If the input is an ONNX file, it calls ov.convert_model(file_name) directly

Ultimately, all paths converge to: OpenVINO IR -> NNCF quantization -> save back to OpenVINO IR.
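
In code, that convergence point looks roughly like this (a condensed sketch; torch_model stands for any already-loaded PyTorch model):

import openvino as ov
import nncf

ov_model = ov.convert_model(torch_model)         # any source -> OpenVINO IR (in memory)
compressed = nncf.compress_weights(ov_model)     # NNCF compression (INT8_ASYM by default)
ov.save_model(compressed, "openvino_model.xml")  # back to IR on disk (.xml + .bin)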

The interactive component below shows the complete call chains for four typical scenarios. Click each scenario to see the specific function call stack:

Intel Model Optimization Stack Call Relationships

Three-layer architecture: Optimum Intel → NNCF → OpenVINO Core

Click a layer on the left or pick a scenario at the top to see details

Summary

The three-layer stack relationship is: Optimum Intel calls NNCF, and NNCF operates on models in OpenVINO IR format. You can choose to:

  • Use Optimum Intel: one line of code gets the job done — ideal for standard scenarios
  • Bypass Optimum, use NNCF directly: when you need custom calibration, accuracy constraints, or unsupported model architectures
  • Bypass NNCF, use OpenVINO Converter directly: when your model source is ONNX, you need graph rewriting, or you want to avoid Python dependencies

The following sections will tell you how to decide when to pick each layer.

NNCF Deep Dive: The Compression Algorithm System

NNCF (Neural Network Compression Framework) is Intel’s open-source model compression library, currently at version v3.1.0 (2026-04-08). Its design philosophy: decouple from training frameworks, support multiple backends, focus exclusively on compression algorithms.

NNCF’s core API falls into three categories:

Core Function 1: nncf.quantize()

This is the standard entry point for full quantization (Weight + Activation PTQ).

nncf.quantize(
    model: TModel,
    calibration_dataset: nncf.Dataset,
    *,
    mode: Optional[QuantizationMode] = None,           # FP8_E4M3 | FP8_E5M2
    preset: Optional[QuantizationPreset] = None,
    target_device: TargetDevice = TargetDevice.ANY,
    subset_size: int = 300,
    fast_bias_correction: bool = True,
    model_type: Optional[ModelType] = None,
    ignored_scope: Optional[IgnoredScope] = None,
    advanced_parameters: Optional[AdvancedQuantizationParameters] = None
) -> TModel

Key parameters:

  • calibration_dataset: Required. NNCF runs model forward passes on this dataset to collect per-layer activation distributions (min/max/percentile) for determining quantization parameters
  • mode: Optional FP8_E4M3 (NVIDIA Hopper+) or FP8_E5M2 (experimental). Defaults to INT8 when unspecified
  • preset: PERFORMANCE (aggressive quantization) or MIXED (accuracy-first)
  • target_device: CPU / GPU / NPU / ANY. Different devices have different quantization strategies (e.g., NPU may be more aggressive)
  • subset_size: How many samples from the dataset to use for calibration; defaults to 300 (balancing accuracy and speed)
  • fast_bias_correction: Whether to enable fast bias correction (compensating for systematic bias introduced by quantization); enabled by default
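
A minimal usage sketch, assuming preprocessed_samples is a list of already-tokenized model inputs and ov_model is an openvino.Model:

import nncf

calibration_dataset = nncf.Dataset(preprocessed_samples)
quantized_model = nncf.quantize(
    ov_model,
    calibration_dataset,
    subset_size=300  # how many samples to use for calibration (the default)
)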

Core Function 2: nncf.compress_weights()

This is the entry point for weight-only quantization, supporting a rich set of compression modes.

nncf.compress_weights(
    model: TModel,
    *,
    mode: CompressWeightsMode = CompressWeightsMode.INT8_ASYM,
    ratio: Optional[float] = None,
    group_size: Optional[int] = None,
    ignored_scope: Optional[IgnoredScope] = None,
    all_layers: Optional[bool] = None,
    dataset: Optional[nncf.Dataset] = None,
    sensitivity_metric: Optional[SensitivityMetric] = None,
    subset_size: int = 128,
    awq: Optional[bool] = None,
    scale_estimation: Optional[bool] = None,
    gptq: Optional[bool] = None,
    lora_correction: Optional[bool] = None,
    backup_mode: Optional[BackupMode] = None,
    compression_format: CompressionFormat = CompressionFormat.DQ,
    advanced_parameters: Optional[AdvancedCompressionParameters] = None
) -> TModel

CompressWeightsMode enum values (partial):

  • INT8_SYM, INT8_ASYM — 8-bit symmetric/asymmetric quantization
  • INT4_SYM, INT4_ASYM — 4-bit symmetric/asymmetric quantization (the mainstream choice for LLM weights)
  • NF4 — NormalFloat 4-bit, no zero-point (same as QLoRA)
  • FP8_E4M3, FP8_E5M2 — FP8 variants
  • MXFP4, MXFP8_E4M3 — Microscaling Floating Point
  • NVFP4 — NVIDIA FP4 (E4M3 group scale)
  • CODEBOOK, ADAPTIVE_CODEBOOK, CB4 — lookup table (LUT) quantization

Key parameters:

  • ratio: Mixed precision ratio. For example, ratio=0.8 means 80% of layers use the precision specified by mode (e.g., INT4), while the remaining 20% use backup_mode (e.g., INT8 or FP16)
  • group_size: Group size for group quantization. INT4 quantization typically uses 128 or 64 — smaller groups yield higher accuracy but more overhead
  • dataset: Optional calibration dataset. Not needed for INT8 (None); strongly recommended for INT4/NF4
  • sensitivity_metric: Determines which layers get higher precision. Options:
    • WEIGHT_QUANTIZATION_ERROR — minimize weight quantization error
    • HESSIAN_INPUT_ACTIVATION — Hessian-based sensitivity (highest accuracy, most expensive)
    • MEAN_ACTIVATION_VARIANCE / MAX_ACTIVATION_VARIANCE
    • MEAN_ACTIVATION_MAGNITUDE
  • Preprocessing algorithm switches (require dataset):
    • awq=True — Activation-Aware Quantization: adjusts weight scales based on activation magnitudes
    • scale_estimation=True — L2 error-minimizing scale estimation
    • gptq=True — classic GPTQ algorithm (layer-wise optimization)
    • lora_correction=True — uses low-rank matrices to compensate for quantization error (similar to LoRA)

Why doesn’t INT8 need a dataset?

8-bit quantization’s representable range (-128 to 127) is wide enough to cover the vast majority of weight distributions — simple min-max statistics suffice. But 4-bit has only 16 discrete values. Without real activation distributions to optimize scales, important weights get mapped to the same discrete value, causing accuracy collapse.
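
A toy numpy sketch of that gap (purely illustrative; this is not how NNCF computes its scales, which are per-channel or per-group):

import numpy as np

w = np.random.randn(4096).astype(np.float32)

def quantize_dequantize(w, num_levels):
    # Symmetric min-max quantization: round to num_levels integer bins, then map back to float
    scale = np.abs(w).max() / (num_levels / 2 - 1)
    return np.clip(np.round(w / scale), -num_levels / 2, num_levels / 2 - 1) * scale

err_int8 = np.abs(w - quantize_dequantize(w, 256)).mean()  # 256 representable values
err_int4 = np.abs(w - quantize_dequantize(w, 16)).mean()   # only 16 representable values
print(err_int8, err_int4)  # the INT4 error is roughly an order of magnitude larger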

Core Function 3: nncf.quantize_with_accuracy_control()

This is the entry point for Accuracy-Aware Quantization (AAQ).

nncf.quantize_with_accuracy_control(
    model: TModel,
    calibration_dataset: nncf.Dataset,
    validation_dataset: nncf.Dataset,
    validation_fn: Callable[[Any, Iterable[Any]], tuple[float, ...]],
    *,
    max_drop: float = 0.01,
    drop_type: DropType = DropType.ABSOLUTE,
    preset: Optional[QuantizationPreset] = None,
    target_device: TargetDevice = TargetDevice.ANY,
    subset_size: int = 300,
    fast_bias_correction: bool = True,
    model_type: Optional[ModelType] = None,
    ignored_scope: Optional[IgnoredScope] = None,
    advanced_quantization_parameters: Optional[AdvancedQuantizationParameters] = None,
    advanced_accuracy_restorer_parameters: Optional[AdvancedAccuracyRestorerParameters] = None
) -> TModel

The key difference: you must provide a validation_fn.

Here’s how this function works:

  1. First performs standard PTQ quantization (calls quantize())
  2. Runs validation_fn on validation_dataset to compute accuracy metrics (e.g., accuracy, perplexity)
  3. If accuracy drops beyond the max_drop threshold (e.g., 0.01 = 1%), enters the recovery phase:
    • Tries “rolling back” quantization layer by layer (using FP16 instead of INT8 for that layer)
    • Re-runs validation_fn after each rollback
    • Finds the minimal rollback set that satisfies the accuracy constraint
  4. Returns a mixed-precision model (some layers INT8, some layers FP16)
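
A conceptual sketch of that recovery loop (illustrative only, not NNCF's actual implementation; revert_to_fp16 and ranked_layers are hypothetical):

def accuracy_aware_rollback(model, ranked_layers, validation_fn, val_data, baseline, max_drop):
    # ranked_layers: quantized layers ordered from most to least quantization-sensitive (hypothetical input)
    metric, = validation_fn(model, val_data)
    for layer in ranked_layers:
        if baseline - metric <= max_drop:  # accuracy constraint satisfied, stop rolling back
            break
        revert_to_fp16(model, layer)       # hypothetical helper: undo quantization for this layer
        metric, = validation_fn(model, val_data)
    return model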

validation_fn signature example:

def validate_fn(model, validation_dataset):
    # Return one or more metrics (higher is better)
    # NNCF compares pre- and post-quantization metric differences
    accuracy = compute_accuracy(model, validation_dataset)
    return (accuracy,)

How this differs from AutoGPTQ/bitsandbytes:

  • AutoGPTQ/bitsandbytes perform “one-shot quantization” — specify INT4 and everything goes to INT4, with no automatic rollback mechanism
  • NNCF AAQ uses “iterative rollback” — you just specify the accuracy constraint, and the algorithm automatically finds the optimal mixed-precision strategy

Why doesn’t Optimum Intel expose this API?

Because validation_fn is highly task-specific — different tasks need different evaluation metrics (code generation uses pass@k, QA uses F1, conversational models use perplexity). Optimum Intel is designed for “out-of-the-box” use and isn’t suited for exposing APIs that require user-defined functions. If you need AAQ, you must bypass Optimum and use NNCF directly.

QAT: Quantization-Aware Training

NNCF supports Quantization-Aware Training (QAT), but only on the PyTorch backend:

import nncf
from nncf.torch import create_compressed_model

# 1. Wrap the PyTorch model with NNCF; create_compressed_model inserts fake-quantization
#    nodes (entry-point names vary across NNCF versions)
nncf_config = nncf.NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"}
})
compression_ctrl, nncf_model = create_compressed_model(torch_model, nncf_config)

# 2. Train normally (forward, backward, optimizer step)
for inputs, targets in train_loader:
    optimizer.zero_grad()
    outputs = nncf_model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

# 3. Export the quantized model
compression_ctrl.export_model("quantized_model.onnx")

QAT’s core mechanism: inserting Fake Quantization nodes during training so that backpropagation simulates quantization error, allowing the model to learn robustness to quantization. This yields higher accuracy than PTQ but requires training resources.
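
A conceptual sketch of what a fake-quantization node does in the forward pass (illustrative; NNCF's real nodes track per-channel scales and zero-points):

import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Forward: simulate integer rounding; backward: straight-through estimator (gradient passes through unchanged)
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()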

Support Matrix

The interactive component below shows NNCF’s algorithm x backend support matrix:

NNCF Algorithm Support Matrix

6 algorithms × 4 backends (OpenVINO, PyTorch, TorchFX, ONNX):

  • Post-Training Quantization (PTQ)
  • Weight Compression (WOQ)
  • Quantization-Aware Training (QAT)
  • Weight-Only QAT + LoRA/NLS
  • Pruning
  • Activation Sparsity

* PTQ also has an accuracy-aware variant, nncf.quantize_with_accuracy_control(), supported on the OpenVINO backend only. Support levels in the matrix: Supported / Experimental / Not supported.

A few important notes:

  1. OpenVINO backend is the top choice for PTQ/WOQ — most mature with the best performance
  2. TorchFX is marked as Experimental — PTQ and WOQ work but may have edge-case bugs
  3. Activation Sparsity is PyTorch Experimental only — not production-ready
  4. AAQ is not a separate row — it’s an API variant of quantize() (quantize_with_accuracy_control), only supported on the OpenVINO backend
  5. Weight-Only QAT with LoRA/NLS — new feature (v3.1.0), reflecting the LLM era

The Boundaries of Optimum Intel

Now back to the opening question: when is Optimum Intel sufficient, and when must you bypass it?

Scenarios Where Optimum Intel Is Enough

  1. Standard Hugging Face models: architectures supported by the transformers library (BERT, GPT, LLaMA, Mistral, etc.)
  2. Standard quantization needs: INT8 or INT4 weight-only, no custom calibration required
  3. No AAQ needed: accuracy requirements aren’t strict, or you can accept “one-shot quantization”
  4. Rapid prototyping: experimental phase, quickly verifying whether OpenVINO runs on your hardware

Example code (Optimum one-liner):

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=OVWeightQuantizationConfig(
        bits=4,
        sym=True,
        group_size=128,
        ratio=0.8
    )
)
model.save_pretrained("llama-3.1-8b-int4-ov")

Behind this short snippet, Optimum Intel automatically:

  1. Downloads the PyTorch checkpoint (if not cached locally)
  2. Converts it to OpenVINO IR format
  3. Downloads a calibration dataset (wikitext)
  4. Calls NNCF to perform INT4 symmetric quantization (CompressWeightsMode.INT4_SYM)
  5. Uses INT4 for 80% of layers, backup_mode (default INT8) for the remaining 20%
  6. Saves the quantized OpenVINO IR files (.xml + .bin)

Scenarios Where You Must Bypass Optimum Intel

  1. AAQ (Accuracy-Aware Quantization): quantize_with_accuracy_control is not exposed in the Optimum API
  2. Custom calibration datasets: you have domain-specific data (e.g., medical dialogues, legal documents) and don’t want generic wikitext
  3. QAT (Quantization-Aware Training): Optimum Intel only supports PTQ
  4. Unsupported model architectures: if the transformers library doesn’t support it, neither does Optimum Intel
  5. Preprocessing algorithm control: flags like awq=True, scale_estimation=True, gptq=True, etc.

Example code (NNCF with full control):

import nncf
import openvino as ov
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Load PyTorch model and tokenizer
torch_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# 2. Convert to OpenVINO IR
ov_model = ov.convert_model(torch_model)

# 3. Prepare calibration dataset (custom data; medical_corpus is assumed to be a list of domain texts)
calibration_samples = [
    tokenizer(text, return_tensors="pt") for text in medical_corpus
]
calibration_dataset = nncf.Dataset(calibration_samples)  # nncf.Dataset expects an iterable of model inputs

# 4. NNCF quantization (full parameter control)
quantized = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
    sensitivity_metric=nncf.SensitivityMetric.HESSIAN_INPUT_ACTIVATION,
    dataset=calibration_dataset,
    subset_size=256,
    awq=True,
    scale_estimation=True
)

# 5. Save
ov.save_model(quantized, "model.xml")

How this differs from the Optimum Intel version:

  • You can use custom calibration data (medical_corpus)
  • You can specify sensitivity_metric=HESSIAN_INPUT_ACTIVATION (Optimum uses the default)
  • You can enable awq=True and scale_estimation=True (not exposed by Optimum)
  • You write noticeably more code (the full script above versus Optimum's single from_pretrained call)

The trade-off: Optimum Intel prioritizes ease-of-use; NNCF prioritizes control. Your choice depends on your scenario and how much you care about quantization quality.

Standalone Scenarios for OpenVINO Converter

When should you use ov.convert_model() directly, bypassing both Optimum and NNCF?

Scenario 1: Model Source Is ONNX

If your model is already in ONNX format (not a Hugging Face checkpoint), use OpenVINO Converter directly:

import openvino as ov

# ONNX -> OpenVINO IR
ov_model = ov.convert_model("model.onnx")
ov.save_model(ov_model, "model.xml")

Optimum Intel assumes Hugging Face models as input and doesn’t support ONNX input.

Scenario 2: Custom example_input Needed

For models with dynamic shapes (like LLMs with variable sequence lengths), ov.convert_model() lets you specify example_input to set input shapes:

ov_model = ov.convert_model(
    torch_model,
    example_input={
        "input_ids": torch.zeros(1, 512, dtype=torch.long),
        "attention_mask": torch.ones(1, 512, dtype=torch.long)
    }
)

This is useful when dealing with non-standard input structures (e.g., multimodal models, encoder-decoder architectures).

Scenario 3: Graph Rewriting Needed

OpenVINO IR is a computation graph format. If you need to modify the graph structure before quantization (e.g., removing dropout, freezing normalization parameters), you can use OpenVINO’s graph rewriting API:

import openvino as ov

ov_model = ov.convert_model(torch_model)

# Remove dropout nodes (not needed during inference)
# Modify the IR's XML representation (requires manual parsing/editing)
# ...

ov.save_model(ov_model, "model.xml")

Optimum Intel doesn’t expose a graph rewriting interface — it assumes you’re quantizing the original model as-is.

Scenario 4: Avoiding Hugging Face Hub Dependencies

In production deployments, you may not want to depend on the Hugging Face Hub (network restrictions, offline environments, enterprise security policies). Using ov.convert_model() + ov.compile_model() directly requires only the OpenVINO Runtime, without transformers / optimum / safetensors dependencies.

Minimal-dependency deployment script:

import openvino as ov
import numpy as np

# Only depends on OpenVINO Runtime (C++ library + Python binding)
core = ov.Core()
model = core.read_model("model.xml")
compiled = core.compile_model(model, "GPU")

# Inference
input_ids = np.zeros((1, 512), dtype=np.int64)
output = compiled(input_ids)

This pattern is common on edge devices (industrial controllers, embedded Linux, etc.).

Clarification: INC vs. NNCF Divergence

In Intel’s documentation and community discussions, another tool frequently comes up: Intel Neural Compressor (INC). How does it relate to NNCF?

Different Product Positioning

  • INC: Targets the Intel Gaudi AI accelerator + Xeon/Core Ultra CPU native stack, not using OpenVINO as the backend. Primary scenarios are data centers (Gaudi clusters) and servers (Xeon). The latest version v3.7 (2025-12-25) is still actively maintained.

  • NNCF: The official compression library for the OpenVINO ecosystem, also supporting PyTorch (torch.compile), TorchFX, and ONNX backends. Primary scenarios are client hardware (iGPU, Arc GPU, NPU) and edge devices.

When to Choose INC?

If your deployment target is:

  • Intel Gaudi 2/3 AI accelerators (data center LLM inference)
  • Xeon server native stack (no OpenVINO, using PyTorch/IPEX directly)
  • Intel Core Ultra laptop CPUs (using the CPU's vector instruction sets such as VNNI rather than the NPU)

Then INC is the better choice — it has specialized optimizations for these hardware targets without introducing OpenVINO Runtime overhead.

When to Choose NNCF?

If your deployment target is:

  • Intel iGPU (Iris Xe)
  • Intel Arc GPU (discrete graphics)
  • Intel NPU (AI Boost)
  • Cross-platform models (inference on Intel + ARM + NVIDIA simultaneously, using ONNX or OpenVINO IR as the interchange format)

Then NNCF + OpenVINO is the way to go — they’re the official supported path for iGPU/Arc/NPU.

Bottom line: INC and NNCF aren’t competitors — they serve different product lines. INC = Gaudi/Xeon native stack; NNCF = OpenVINO ecosystem. The choice depends on your hardware target.

Decision Tree for Tool Selection

Now that we have the complete knowledge map, how do you choose the right tool combination for a specific scenario?

The decision tree below narrows down your options through 3 core questions:

Q1: What Is Your Model Source?

  • Checkpoint on Hugging Face Hub: you can use Optimum Intel, or use NNCF (after manually converting to IR)
  • Local PyTorch state_dict: use NNCF (first ov.convert_model, then quantize)
  • ONNX file: use ov.convert_model + NNCF
  • Non-standard architecture (e.g., custom model): can only use ov.convert_model + NNCF (Optimum Intel won’t support it)

Q2: What Are Your Quantization Requirements?

  • Standard INT8/INT4 weight-only: Optimum Intel is sufficient
  • AAQ needed (accuracy constraints): must use NNCF quantize_with_accuracy_control
  • Custom calibration data: use NNCF (construct your own nncf.Dataset)
  • QAT needed: must use NNCF (Optimum Intel doesn’t support training-time quantization)
  • Preprocessing algorithms needed (AWQ/GPTQ/scale_estimation): use NNCF (Optimum doesn’t expose these switches)

Q3: What Is Your Deployment Scenario?

  • Rapid prototyping/experimentation: use Optimum Intel (one line of code, quick feasibility check)
  • Production deployment (accuracy-sensitive): use NNCF AAQ (automatically finds optimal mixed precision)
  • Edge devices (minimal dependencies): use ov.convert_model + NNCF + OpenVINO Runtime (no transformers/optimum dependency)
  • Non-OpenVINO scenarios (e.g., PyTorch Mobile): use NNCF’s PyTorch backend (outputs TorchScript)

The interactive component below implements the complete decision tree. Click through the options step by step to reach a recommendation:

Intel Tool Selection Decision Tree

3 questions guide the tool choice

Q1: What is the model source?

Example Decision Path

Scenario: Intel Arc GPU + LLaMA-3.1-8B + medical code generation (accuracy-sensitive)

Decision path:

  1. Q1 Model Source: Hugging Face Hub -> can use Optimum or NNCF
  2. Q2 Quantization Requirements: accuracy-sensitive -> needs AAQ -> must use NNCF
  3. Q3 Deployment Scenario: production deployment -> use AAQ

Recommended approach:

import nncf
import openvino as ov
from transformers import AutoModelForCausalLM

# 1. Load model and convert to OpenVINO IR
torch_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ov_model = ov.convert_model(torch_model)

# 2. Prepare medical code dataset
calibration_dataset = nncf.Dataset(medical_code_generator)
validation_dataset = nncf.Dataset(medical_code_eval_generator)

# 3. AAQ quantization
def validation_fn(model, val_data):
    # Evaluate code generation quality (e.g., pass@1)
    accuracy = evaluate_code_generation(model, val_data)
    return (accuracy,)

quantized = nncf.quantize_with_accuracy_control(
    ov_model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validation_fn,
    max_drop=0.01,  # Accuracy drop no more than 1%
    drop_type=nncf.DropType.ABSOLUTE
)

# 4. Save and compile
ov.save_model(quantized, "llama-3.1-8b-medical-aaq.xml")
core = ov.Core()
compiled = core.compile_model(quantized, "GPU")

Key points:

  • You can’t use Optimum Intel because it doesn’t expose the validation_fn parameter
  • AAQ automatically preserves higher precision for sensitive layers (e.g., attention layers) and uses INT8 for less sensitive ones (e.g., FFN layers)
  • max_drop=0.01 guarantees code generation quality won’t degrade by more than 1% (a strict constraint for medical scenarios)

Transition to the Hands-On Guide

The selection framework is clear. Next step: hands-on practice.

In the next article, “Intel Optimization Stack Hands-On Guide: Three Conversion Paths Compared”, we’ll take the same model (LLaMA-3.1-8B) through three conversion paths:

  1. Hugging Face -> GGUF (llama.cpp stack)
  2. Hugging Face -> ONNX (ONNX Runtime stack)
  3. Hugging Face -> OpenVINO IR (Optimum Intel + NNCF stack)

Then run inference on Intel iGPU and compare the three paths on:

  • Conversion workflow complexity (how many commands? how complex are the configuration parameters?)
  • Post-quantization model size (15GB -> how many GB?)
  • Inference speed (tokens/s)
  • Accuracy loss (perplexity, MMLU, HumanEval)

Through hands-on comparison, you’ll develop “muscle memory” for tool selection — knowing not just what to choose in theory, but the real-world pitfalls and advantages of each path.

Preview: The hands-on guide will include fully reproducible scripts (from model download to inference deployment), plus a troubleshooting section for common errors (e.g., “Why can’t OpenVINO detect my iGPU?”, “What to do when NNCF calibration dataset causes OOM?”).