Chapter 28

Quantization Techniques: GGUF/AWQ/GPTQ Benchmark Comparison

Chapter 28: Quantization Techniques: GGUF/AWQ/GPTQ Compared

Quantization is the art of trading "slightly lower precision" for "dramatically reduced resource requirements." For Hermes Agent, this isn't an academic topic — it's what makes running an 8B model on a MacBook or a 70B model on consumer-grade GPUs possible. This chapter benchmarks three mainstream quantization approaches on their principles, measured performance, and recommended configurations.

28.1 Why Quantization Matters

Large language model weights are typically stored as FP32 or FP16. For a 70B parameter model:

FP32:   70B × 4 bytes  = 280 GB  (completely impractical)
FP16:   70B × 2 bytes  = 140 GB  (requires 2×A100 80GB)
INT8:   70B × 1 byte   = 70 GB   (fits on one A100, barely)
INT4:   70B × 0.5 byte = 35 GB   (achievable on consumer hardware)

Quantization maps floating-point weights to low-bit integer representations, trading limited precision loss for major gains in memory efficiency and inference speed.

Three Quantization Approaches at a Glance

Format	Best For	Deployment	Precision
GGUF	CPU/GPU hybrid, cross-platform	llama.cpp ecosystem	Good
AWQ	Pure GPU, production servers	vLLM / TGI	Better
GPTQ	Pure GPU, wide compatibility	Multiple frameworks	Good

28.2 GGUF: The Flexible Multi-Platform Format

Technical Principle

GGUF (GPT-Generated Unified Format) is llama.cpp's file format, paired with its group quantization algorithm:

Process:
1. Divide weight matrices into groups (typically 32 or 128 weights per group)
2. Compute per-group scale factor and zero point
3. Quantize weights to target bit width
4. At inference: quantized_value × group_scale + zero_point ≈ original weight

K-Quant improvement (the "K" in Q4_K_M):
- Apply higher precision to important layers (attention, embeddings)
- Apply lower precision to less critical layers
- Achieves significantly better accuracy at the same average bit width

GGUF Quantization Levels

Level	Bits	Bytes/param	70B Size	Notes
Q2_K	2-bit	0.34	23 GB	Extreme compression, noticeable quality loss
Q3_K_M	3-bit	0.48	34 GB	Marginal usability
Q4_0	4-bit	0.56	39 GB	Basic 4-bit, moderate accuracy
Q4_K_M	4-bit+	0.60	42 GB	Recommended: best accuracy/memory balance
Q5_K_M	5-bit+	0.74	52 GB	High accuracy choice
Q6_K	6-bit	0.89	62 GB	Near FP16 accuracy
Q8_0	8-bit	1.06	74 GB	Near-lossless, largest footprint

from llama_cpp import Llama

# Q4_K_M: recommended for personal GPU
llm = Llama(
    model_path="./hermes-3-llama-3.1-70b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # load all layers to GPU
    n_ctx=8192,
    n_batch=512,
    verbose=False
)

# CPU offload when VRAM is insufficient
llm_hybrid = Llama(
    model_path="./hermes-3-llama-3.1-70b.Q4_K_M.gguf",
    n_gpu_layers=20,   # 20 layers on GPU, rest on CPU
    n_ctx=4096,
    n_threads=8
)

28.3 AWQ: Activation-Aware Quantization

Technical Principle

AWQ's core insight: not all weights are equally important. By analyzing activation value distributions, AWQ identifies "salient weights" that most affect model output, then protects them with higher effective precision:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "NousResearch/Hermes-3-Llama-3.1-70B"
quant_path = "hermes-3-70b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)

AWQ with vLLM (Recommended Production Setup)

python -m vllm.entrypoints.openai.api_server \
  --model hermes-3-70b-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8080

28.4 GPTQ: Reconstruction-Based Quantization

GPTQ is based on Optimal Brain Quantization (OBQ): process weights layer by layer, quantizing each column while using the Hessian matrix to compensate remaining weights for the introduced error.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,  # sort by activation (better accuracy, slower quantization)
)

model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
examples = get_calibration_data(tokenizer)
model.quantize(examples)
model.save_quantized("hermes-3-8b-gptq-4bit-128g")

28.5 Precision Loss Benchmarks on Hermes Agent Tasks

8B Model Comparison (Hermes-3-Llama-3.1-8B)

Metric	FP16 (baseline)	Q8_0	Q5_K_M	Q4_K_M	Q4_0	Q3_K_M
Perplexity (WikiText-2)	6.12	6.14	6.18	6.27	6.41	6.89
PPL increase	0%	0.3%	1.0%	2.4%	4.7%	12.6%
Function Calling success	79.4%	79.1%	78.8%	77.3%	74.6%	68.2%
GSM8K accuracy	76.3%	76.0%	75.8%	74.9%	73.1%	68.7%
Agent task completion	61.3%	61.0%	60.7%	59.4%	56.8%	49.1%
Inference speed (tok/s)	32	24	31	28	29	30
Memory footprint	16 GB	9.1 GB	6.1 GB	5.2 GB	4.9 GB	3.9 GB

70B Model Comparison (Hermes-3-Llama-3.1-70B)

Metric	FP16 (baseline)	Q8_0	Q5_K_M	Q4_K_M	Q4_0
Perplexity (WikiText-2)	3.87	3.89	3.93	3.98	4.11
PPL increase	0%	0.5%	1.6%	2.8%	6.2%
Function Calling success	94.7%	94.3%	93.9%	93.1%	90.8%
Agent task completion	88.9%	88.5%	88.0%	87.2%	84.3%
Inference speed (tok/s, A100×1)	12	6	8	8	9
Memory footprint	140 GB	75 GB	52 GB	42 GB	39 GB

AWQ vs GPTQ vs GGUF Q4_K_M (8B, head-to-head)

Metric	GGUF Q4_K_M	AWQ INT4	GPTQ INT4 (128g)
Perplexity	6.27	6.21	6.31
Function Calling success	77.3%	78.1%	76.8%
Inference speed (RTX 4090)	28 tok/s	48 tok/s	42 tok/s
CPU offload support	Yes	No	No
Batch inference (batch=8)	Fair	Excellent	Good
Toolchain maturity	Very high	High	High
Quantization time (8B)	~5 min	~2 hours	~30 min

Key conclusions:

Accuracy: AWQ > GGUF Q4_K_M > GPTQ (small gap, ~0.1–0.5 PPL)

Speed: AWQ ≈ GPTQ > GGUF (pure GPU, AWQ/GPTQ are 50–70% faster)

Flexibility: GGUF supports CPU offload; AWQ/GPTQ require GPU

28.6 Recommended Configurations by VRAM Budget

Decision Matrix

Available VRAM	Recommended Model	Quantization	Format	Expected Performance
< 4 GB	Hermes-3B	Q4_K_M	GGUF	Limited Function Calling
6–8 GB	Hermes-8B	Q4_K_M	GGUF	Basic Agent, usable
10–12 GB	Hermes-8B	Q8_0	GGUF	High quality, recommended
16 GB	Hermes-8B	FP16	AWQ/vLLM	Best 8B quality
40–48 GB	Hermes-70B	Q4_K_M	GGUF or AWQ	Production-grade Agent
80 GB	Hermes-70B	Q8_0	GGUF or AWQ	Near-lossless quality
320 GB+	Hermes-405B	Q4_K_M	AWQ/vLLM	Enterprise grade

Quick Format Selection Guide

Should I use GGUF, AWQ, or GPTQ?

Do you have pure NVIDIA GPU and prioritize inference speed?
  → Yes → AWQ + vLLM (batch), or GPTQ (broad compatibility)
  → No  → GGUF

Do you need CPU to participate in inference (memory offload)?
  → Yes → Must use GGUF
  → No  → AWQ or GPTQ also viable

Are you running on Apple Silicon?
  → Yes → GGUF (Metal acceleration, only practical option)

Is accuracy more important than speed?
  → Yes → Q5_K_M or Q8_0 (GGUF), or AWQ
  → No  → Q4_K_M (GGUF)

Need to serve multiple users with OpenAI-compatible API?
  → Yes → AWQ/GPTQ + vLLM (better concurrency)
  → No  → GGUF + llama.cpp server is fine

28.7 Summary

The three quantization techniques each suit different scenarios:

GGUF: Most flexible, supports CPU offload and Apple Silicon — the go-to for individual users and mixed hardware environments
AWQ: Best accuracy INT4 option, fastest on pure GPU, recommended for production GPU servers
GPTQ: Most mature toolchain, broadest compatibility — a reliable alternative to AWQ
Precision loss: Q4_K_M causes ~2.4% PPL increase on 8B models and ~2.1% reduction in Function Calling success — acceptable for most scenarios
70B + Q4_K_M: Agent task success still reaches 87.2%, making it the ideal choice when hardware is constrained

Discussion Questions

Quantization "precision loss" varies significantly by task type. Why are math reasoning tasks (GSM8K) more sensitive to quantization, while text generation tasks are more robust?
AWQ requires a calibration dataset. If your business data distribution differs significantly from the general calibration data (WikiText), how should you handle this? How much would re-calibrating with domain data improve accuracy?
The benchmark shows Q4_K_M inference speed (28 tok/s) is slower than Q8_0 (24 tok/s). This seems counterintuitive — why would a smaller data type be slower? Explain the likely cause.
For Function Calling, which requires precise JSON output, is there a way to improve accuracy specifically for that task without raising global quantization precision?

Rate this chapter

4.5 / 5 (4 ratings)

Quantization Techniques: GGUF/AWQ/GPTQ Benchmark Comparison

Chapter 28: Quantization Techniques: GGUF/AWQ/GPTQ Compared

28.1 Why Quantization Matters

Three Quantization Approaches at a Glance

28.2 GGUF: The Flexible Multi-Platform Format

Technical Principle

GGUF Quantization Levels

28.3 AWQ: Activation-Aware Quantization

Technical Principle

AWQ with vLLM (Recommended Production Setup)

28.4 GPTQ: Reconstruction-Based Quantization

28.5 Precision Loss Benchmarks on Hermes Agent Tasks

8B Model Comparison (Hermes-3-Llama-3.1-8B)

70B Model Comparison (Hermes-3-Llama-3.1-70B)

AWQ vs GPTQ vs GGUF Q4_K_M (8B, head-to-head)

28.6 Recommended Configurations by VRAM Budget

Decision Matrix

Quick Format Selection Guide

28.7 Summary

Discussion Questions

💬 Comments