llama.cpp: Pushing CPU Inference to Its Limits
Chapter 46: llama.cpp — Pushing CPU Inference to the Limit
Introduction
No GPU? No problem. llama.cpp is both the last resort for CPU-only deployments and a hidden gem for Apple Silicon users. Written in pure C/C++ by Georgi Gerganov, it requires no CUDA and no Python runtime. Running a 70B model on a MacBook Pro is not a fantasy — it is a documented reality. This chapter covers compilation flags for maximum performance, optimal GGUF format selection, multi-thread tuning, memory mapping strategies, and how to unlock near-GPU inference speed on Apple M-series chips.
46.1 Compiling llama.cpp
Build from Source (Recommended — Best Performance)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Optional: pin to a known-stable build
git checkout b3900
macOS — Metal GPU Acceleration
brew install cmake
cmake -B build \
-DLLAMA_METAL=ON \ # Enable Apple Metal GPU
-DLLAMA_NATIVE=ON \ # Optimize for current CPU
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j $(sysctl -n hw.logicalcpu)
# Verify Metal support
./build/bin/llama-cli --list-devices
# Output should include "Metal: Apple M..."
Linux — AVX2 / AVX-512
# Check what your CPU supports
grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E "avx|sse4"
# AVX2 (most CPUs since 2013)
cmake -B build \
-DLLAMA_NATIVE=ON \
-DLLAMA_AVX=ON \
-DLLAMA_AVX2=ON \
-DLLAMA_FMA=ON \
-DCMAKE_BUILD_TYPE=Release
# AVX-512 (Intel Skylake-X / Ice Lake and later)
cmake -B build \
-DLLAMA_NATIVE=ON \
-DLLAMA_AVX512=ON \
-DLLAMA_AVX512_VBMI=ON \
-DLLAMA_AVX512_VNNI=ON \
-DCMAKE_BUILD_TYPE=Release
# With CUDA (hybrid CPU+GPU)
cmake -B build \
-DLLAMA_CUDA=ON \
-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
-DLLAMA_NATIVE=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j $(nproc)
Pre-built Binaries (Quick Start)
# macOS ARM64 (Metal)
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-b3900-bin-macos-arm64.zip
# Linux x64 (AVX2)
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-b3900-bin-ubuntu-x64.zip
unzip llama-b3900-bin-*.zip && chmod +x llama-*
46.2 Choosing the Right GGUF Format
Quantization Format Reference
| Format | Bits | Method | Size (70B) | Quality | Speed | Recommendation |
|---|---|---|---|---|---|---|
| F16 | 16 | Half precision | ~130 GB | Baseline | 1.0× | Enterprise only |
| Q8_0 | 8 | Absolute | ~70 GB | 99.5% | 1.5× | Near-lossless |
| Q6_K | 6 | K-quant | ~58 GB | 99.1% | 1.8× | High quality |
| Q5_K_M | 5 | K-quant Medium | ~48 GB | 98.6% | 2.0× | High quality recommended |
| Q4_K_M | 4 | K-quant Medium | ~40 GB | 97.8% | 2.5× | Best balance (recommended) |
| Q4_K_S | 4 | K-quant Small | ~38 GB | 97.1% | 2.7× | Speed priority |
| Q4_0 | 4 | Absolute | ~37 GB | 96.5% | 2.8× | Fast, slightly worse quality |
| Q3_K_M | 3 | K-quant Medium | ~31 GB | 95.0% | 3.2× | Extreme memory constraint |
| Q2_K | 2 | K-quant | ~25 GB | 88.0% | 4.0× | Not recommended |
K-quantization note: K-quants dynamically assign higher precision to weights that are more sensitive to quantization error, and lower precision to less critical weights. This delivers substantially better quality than naive fixed-bit quantization at the same bit count.
Selection Decision Tree
Available system RAM (excluding OS overhead)?
│
├─► >= 80 GB → Q5_K_M or Q6_K (high quality)
│
├─► 40–80 GB → Q4_K_M (best balance, strongly recommended)
│
├─► 30–40 GB → Q3_K_M for 70B (quality loss)
│ OR switch to Hermes 13B Q4_K_M (better quality)
│
└─► < 30 GB → Hermes 7B Q4_K_M only
Downloading GGUF Files
pip install huggingface_hub
# Download a specific quantization
huggingface-cli download \
NousResearch/Hermes-3-Llama-3.1-70B-GGUF \
--include "hermes-3-llama3.1-70b-q4_k_m.gguf" \
--local-dir ./models/
# Verify file integrity
python3 -c "
with open('models/hermes-3-llama3.1-70b-q4_k_m.gguf', 'rb') as f:
magic = f.read(4)
print('Valid GGUF' if magic == b'GGUF' else 'ERROR: invalid format')
"
46.3 Thread Count Optimization
Why More Threads ≠ More Speed
CPU inference is memory-bandwidth bound, not compute bound. After a certain point, adding threads only creates contention for the same memory bus.
#!/bin/bash
# thread_benchmark.sh
MODEL="./models/hermes-3-llama3.1-70b-q4_k_m.gguf"
echo "=== Thread Count vs Speed ==="
for T in 1 2 4 6 8 12 16 20 24 32; do
printf "Threads %3d: " $T
./build/bin/llama-bench \
--model "$MODEL" --n-gen 100 --threads "$T" --output json 2>/dev/null | \
python3 -c "
import json,sys
d=json.load(sys.stdin)
pp=[x for x in d if 'pp' in x['test']]
tg=[x for x in d if 'tg' in x['test']]
print(f\"PP={pp[0]['avg_ts']:5.1f} t/s, TG={tg[0]['avg_ts']:4.1f} t/s\" if pp else 'N/A')
"
done
Typical results on i9-13900K / DDR5-6000:
Threads 1: PP= 3.2 t/s, TG= 1.8 t/s
Threads 4: PP= 11.2 t/s, TG= 4.1 t/s
Threads 8: PP= 19.3 t/s, TG= 5.6 t/s
Threads 12: PP= 21.7 t/s, TG= 5.8 t/s ← diminishing returns start
Threads 16: PP= 22.4 t/s, TG= 5.7 t/s ← memory bandwidth ceiling
Threads 24: PP= 21.2 t/s, TG= 5.1 t/s ← hyperthreading contention
Rule: 50–75% of physical cores is typically optimal.
Finding Your Optimal Thread Count
# Get physical core count (excluding hyperthreading)
PHYSICAL_CORES=$(lscpu | grep "Core(s) per socket" | awk '{print $NF}')
SOCKETS=$(lscpu | grep "Socket(s)" | awk '{print $NF}')
TOTAL_PHYSICAL=$((PHYSICAL_CORES * SOCKETS))
RECOMMENDED=$((TOTAL_PHYSICAL * 3 / 4))
echo "Physical cores: $TOTAL_PHYSICAL"
echo "Recommended threads: $RECOMMENDED"
./build/bin/llama-server \
--model ./models/hermes-70b-q4_k_m.gguf \
--threads $RECOMMENDED \
--threads-batch $RECOMMENDED \
--ctx-size 8192 \
--port 8080
46.4 Memory Mapping (mmap) Configuration
mmap vs Direct Load Comparison
| Config | Startup Time | Inference Speed | Memory Use | Best For |
|---|---|---|---|---|
--mmap (default) |
Fast (seconds) | Normal | Shared (can be evicted) | Multi-process, RAM-constrained |
--no-mmap |
Slow (minutes) | Slightly faster | Exclusive | Single process with ample RAM |
--mmap + --mlock |
Slow (locking) | Fastest | Exclusive + locked | Production (no page faults) |
# Default: mmap — shared filesystem cache
./build/bin/llama-server \
--model ./models/hermes-70b-q4.gguf \
--use-mmap --threads 12 --ctx-size 8192
# No mmap — entire model loaded into RAM
# Use when RAM >> model size × 1.5
./build/bin/llama-server \
--model ./models/hermes-70b-q4.gguf \
--no-mmap --threads 12 --ctx-size 8192
# mmap + mlock — prevent any page swapping (requires root or ulimit)
sudo ulimit -l unlimited
./build/bin/llama-server \
--model ./models/hermes-70b-q4.gguf \
--use-mmap --use-mlock \
--threads 12 --ctx-size 8192
Detecting Swap Usage During Inference
# Monitor swap (should stay at 0 during inference)
watch -n 1 'free -h && echo "---" && vmstat 1 1 | tail -1'
# If swap increases: model exceeds available RAM
# Solutions:
# 1. Use smaller quantization (Q4 → Q3)
# 2. Reduce --ctx-size
# 3. Switch to a smaller model
46.5 Metal GPU Acceleration — Apple Silicon
Why Apple Silicon Is Exceptional for LLM Inference
Apple Silicon's unified memory architecture means CPU and GPU share one physical memory pool (LPDDR5X):
- No CPU→GPU data copies (a major bottleneck in discrete GPU setups)
- GPU has direct access to model weights at full memory bandwidth
- 400–800 GB/s memory bandwidth (vs 50–90 GB/s for x86 DDR5)
# Verify Metal device
./build/bin/llama-cli --list-devices
# Should show: "GPU Metal: Apple M3 Ultra [...]"
# All layers to Metal GPU
./build/bin/llama-server \
--model ./models/hermes-70b-q4_k_m.gguf \
-ngl 99 \ # All layers to GPU
--threads 4 \ # Minimal CPU threads (GPU handles the work)
--ctx-size 65536 \ # M3 Ultra 192GB can handle large context
--flash-attn \ # Flash Attention (Metal-supported)
--port 8080 \
--host 127.0.0.1
# Monitor GPU usage
sudo powermetrics --samplers gpu_power -i 1000 -n 5
Apple Silicon Performance Benchmarks
| Chip | Unified Memory | Bandwidth | Hermes 70B Q4 Speed |
|---|---|---|---|
| M1 Max | 32 GB | 400 GB/s | Cannot run (insufficient RAM) |
| M2 Max | 96 GB | 400 GB/s | ~8 t/s |
| M3 Max | 128 GB | 400 GB/s | ~10 t/s |
| M2 Ultra | 192 GB | 800 GB/s | ~20 t/s |
| M3 Ultra | 192 GB | 800 GB/s | ~30 t/s |
Adaptive Launch Script
#!/bin/bash
# start_hermes_apple.sh
MEMORY_GB=$(sysctl -n hw.memsize | awk '{print int($1/1024/1024/1024)}')
echo "Unified memory: ${MEMORY_GB}GB"
if [ "$MEMORY_GB" -ge 128 ]; then
CTX=65536; THREADS=4; MODEL="hermes-70b-q4_k_m.gguf"
elif [ "$MEMORY_GB" -ge 64 ]; then
CTX=32768; THREADS=4; MODEL="hermes-13b-q4_k_m.gguf"
else
CTX=8192; THREADS=4; MODEL="hermes-7b-q4_k_m.gguf"
fi
echo "Config: ctx=${CTX}, model=${MODEL}"
./build/bin/llama-server \
--model "./models/$MODEL" \
-ngl 99 --threads "$THREADS" \
--ctx-size "$CTX" --flash-attn \
--port 8080 --host 127.0.0.1 --log-disable
46.6 Speed Benchmarks Across Configurations
Benchmark Script
#!/bin/bash
# comprehensive_benchmark.sh
MODEL="${1:-./models/hermes-70b-q4_k_m.gguf}"
run_bench() {
local desc="$1"; shift
printf "%-25s " "$desc:"
./build/bin/llama-bench \
--model "$MODEL" --n-prompt 512 --n-gen 128 "$@" --output json 2>/dev/null | \
python3 -c "
import json,sys
d=json.load(sys.stdin)
pp=[x for x in d if 'pp' in x['test']]
tg=[x for x in d if 'tg' in x['test']]
print(f\"PP={pp[0]['avg_ts']:6.1f} t/s TG={tg[0]['avg_ts']:5.1f} t/s\" if pp else 'FAILED')
"
}
run_bench "CPU 4 threads" --threads 4 --n-gpu-layers 0
run_bench "CPU 8 threads" --threads 8 --n-gpu-layers 0
run_bench "CPU 12 threads" --threads 12 --n-gpu-layers 0
[[ "$OSTYPE" == "darwin"* ]] && {
run_bench "Metal all layers" --n-gpu-layers 99 --threads 4
run_bench "Metal + Flash Attention" --n-gpu-layers 99 --threads 4 --flash-attn
}
command -v nvidia-smi &>/dev/null && {
run_bench "CUDA 40 layers" --n-gpu-layers 40 --threads 8
run_bench "CUDA all layers" --n-gpu-layers 99 --threads 4
}
Reference Results
| Configuration | Hardware | PP (t/s) | TG (t/s) | Practical Use |
|---|---|---|---|---|
| CPU 8 threads | i9-13900K | 19.3 | 5.6 | Dev / debugging |
| CPU 12 threads | Ryzen 9 7950X | 23.1 | 6.2 | Batch processing |
| Metal all layers | M2 Max 96GB | 45.2 | 8.3 | Mac users |
| Metal + FlashAttn | M3 Ultra 192GB | 98.7 | 32.1 | High-end Mac |
| CUDA all layers | RTX 3090 24GB | 78.3 | 15.2 | Personal GPU |
| CUDA all layers | A100 80GB | 156.8 | 28.7 | Enterprise |
| CUDA all layers | H100 80GB | 245.3 | 45.8 | High-performance |
PP = Prefill speed (prompt processing); TG = Token Generation. For user experience, TG is the critical metric — it determines how fast text appears.
46.7 Production Server Configuration
Complete llama-server Launch Command
./build/bin/llama-server \
--model ./models/hermes-70b-q4_k_m.gguf \
\
# GPU/CPU
-ngl 99 \ # Metal/CUDA: all layers to GPU
--threads 4 \
--threads-batch 4 \
\
# Context
--ctx-size 65536 \
--n-predict 4096 \
\
# Performance
--flash-attn \
--use-mmap \
--cache-type-k q8_0 \ # Quantize KV cache (saves memory)
--cache-type-v q8_0 \
\
# Server
--host 0.0.0.0 --port 8080 \
--api-key "your-secret-key" \
\
# Batching
--parallel 4 \ # Concurrent request slots
--cont-batching \ # Continuous batching (higher throughput)
\
--log-disable
Python Client (OpenAI-Compatible)
from openai import AsyncOpenAI
import asyncio
client = AsyncOpenAI(
base_url="http://localhost:8080/v1",
api_key="your-secret-key"
)
async def chat(messages: list[dict]) -> str:
response = await client.chat.completions.create(
model="hermes", # model name is arbitrary for llama-server
messages=messages,
max_tokens=2048,
temperature=0.1
)
return response.choices[0].message.content
Chapter Summary
Key optimization principles for llama.cpp:
- Compile flags:
LLAMA_NATIVE+ platform accelerator (METAL/AVX512/CUDA) - Quantization: Q4_K_M is the best-balanced choice for most scenarios
- Thread count: 50–75% of physical cores; memory bandwidth is the ceiling
- mmap strategy:
mmap + mlockin production (no swapping);mmapalone under RAM pressure - Apple Silicon: Metal +
-ngl 99is the killer combination; M3 Ultra achieves 30+ t/s
llama.cpp's greatest value proposition: runs without any GPU, and runs beautifully on Apple Silicon.
Review Questions
-
K-quantization achieves better quality than naive Q4_0 by dynamically assigning precision based on weight sensitivity. Explain the mathematical principle — how does the algorithm determine which weights need higher precision?
-
Apple Silicon's unified memory architecture theoretically eliminates CPU→GPU copy overhead, yet measured inference speeds are still below a same-priced NVIDIA GPU. Identify and explain the likely bottlenecks.
-
llama-server's
--cont-batching(continuous batching) and vLLM's PagedAttention both improve throughput under concurrent load. What is the fundamental difference between these approaches? In what specific scenario could llama.cpp's continuous batching compete with vLLM?