Chapter 46

llama.cpp: Pushing CPU Inference to Its Limits

Chapter 46: llama.cpp โ€” Pushing CPU Inference to the Limit

Introduction

No GPU? No problem. llama.cpp is both the last resort for CPU-only deployments and a hidden gem for Apple Silicon users. Written in pure C/C++ by Georgi Gerganov, it requires no CUDA and no Python runtime. Running a 70B model on a MacBook Pro is not a fantasy โ€” it is a documented reality. This chapter covers compilation flags for maximum performance, optimal GGUF format selection, multi-thread tuning, memory mapping strategies, and how to unlock near-GPU inference speed on Apple M-series chips.


46.1 Compiling llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Optional: pin to a known-stable build
git checkout b3900

macOS โ€” Metal GPU Acceleration

brew install cmake

cmake -B build \
    -DLLAMA_METAL=ON \        # Enable Apple Metal GPU
    -DLLAMA_NATIVE=ON \       # Optimize for current CPU
    -DCMAKE_BUILD_TYPE=Release

cmake --build build -j $(sysctl -n hw.logicalcpu)

# Verify Metal support
./build/bin/llama-cli --list-devices
# Output should include "Metal: Apple M..."

Linux โ€” AVX2 / AVX-512

# Check what your CPU supports
grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E "avx|sse4"

# AVX2 (most CPUs since 2013)
cmake -B build \
    -DLLAMA_NATIVE=ON \
    -DLLAMA_AVX=ON \
    -DLLAMA_AVX2=ON \
    -DLLAMA_FMA=ON \
    -DCMAKE_BUILD_TYPE=Release

# AVX-512 (Intel Skylake-X / Ice Lake and later)
cmake -B build \
    -DLLAMA_NATIVE=ON \
    -DLLAMA_AVX512=ON \
    -DLLAMA_AVX512_VBMI=ON \
    -DLLAMA_AVX512_VNNI=ON \
    -DCMAKE_BUILD_TYPE=Release

# With CUDA (hybrid CPU+GPU)
cmake -B build \
    -DLLAMA_CUDA=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
    -DLLAMA_NATIVE=ON \
    -DCMAKE_BUILD_TYPE=Release

cmake --build build -j $(nproc)

Pre-built Binaries (Quick Start)

# macOS ARM64 (Metal)
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-b3900-bin-macos-arm64.zip

# Linux x64 (AVX2)
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-b3900-bin-ubuntu-x64.zip

unzip llama-b3900-bin-*.zip && chmod +x llama-*

46.2 Choosing the Right GGUF Format

Quantization Format Reference

Format Bits Method Size (70B) Quality Speed Recommendation
F16 16 Half precision ~130 GB Baseline 1.0ร— Enterprise only
Q8_0 8 Absolute ~70 GB 99.5% 1.5ร— Near-lossless
Q6_K 6 K-quant ~58 GB 99.1% 1.8ร— High quality
Q5_K_M 5 K-quant Medium ~48 GB 98.6% 2.0ร— High quality recommended
Q4_K_M 4 K-quant Medium ~40 GB 97.8% 2.5ร— Best balance (recommended)
Q4_K_S 4 K-quant Small ~38 GB 97.1% 2.7ร— Speed priority
Q4_0 4 Absolute ~37 GB 96.5% 2.8ร— Fast, slightly worse quality
Q3_K_M 3 K-quant Medium ~31 GB 95.0% 3.2ร— Extreme memory constraint
Q2_K 2 K-quant ~25 GB 88.0% 4.0ร— Not recommended

K-quantization note: K-quants dynamically assign higher precision to weights that are more sensitive to quantization error, and lower precision to less critical weights. This delivers substantially better quality than naive fixed-bit quantization at the same bit count.

Selection Decision Tree

Available system RAM (excluding OS overhead)?
โ”‚
โ”œโ”€โ–บ >= 80 GB โ†’ Q5_K_M or Q6_K (high quality)
โ”‚
โ”œโ”€โ–บ 40โ€“80 GB โ†’ Q4_K_M (best balance, strongly recommended)
โ”‚
โ”œโ”€โ–บ 30โ€“40 GB โ†’ Q3_K_M for 70B (quality loss)
โ”‚              OR switch to Hermes 13B Q4_K_M (better quality)
โ”‚
โ””โ”€โ–บ < 30 GB โ†’ Hermes 7B Q4_K_M only

Downloading GGUF Files

pip install huggingface_hub

# Download a specific quantization
huggingface-cli download \
    NousResearch/Hermes-3-Llama-3.1-70B-GGUF \
    --include "hermes-3-llama3.1-70b-q4_k_m.gguf" \
    --local-dir ./models/

# Verify file integrity
python3 -c "
with open('models/hermes-3-llama3.1-70b-q4_k_m.gguf', 'rb') as f:
    magic = f.read(4)
    print('Valid GGUF' if magic == b'GGUF' else 'ERROR: invalid format')
"

46.3 Thread Count Optimization

Why More Threads โ‰  More Speed

CPU inference is memory-bandwidth bound, not compute bound. After a certain point, adding threads only creates contention for the same memory bus.

#!/bin/bash
# thread_benchmark.sh
MODEL="./models/hermes-3-llama3.1-70b-q4_k_m.gguf"

echo "=== Thread Count vs Speed ==="
for T in 1 2 4 6 8 12 16 20 24 32; do
    printf "Threads %3d: " $T
    ./build/bin/llama-bench \
        --model "$MODEL" --n-gen 100 --threads "$T" --output json 2>/dev/null | \
    python3 -c "
import json,sys
d=json.load(sys.stdin)
pp=[x for x in d if 'pp' in x['test']]
tg=[x for x in d if 'tg' in x['test']]
print(f\"PP={pp[0]['avg_ts']:5.1f} t/s, TG={tg[0]['avg_ts']:4.1f} t/s\" if pp else 'N/A')
"
done

Typical results on i9-13900K / DDR5-6000:

Threads   1: PP=  3.2 t/s, TG= 1.8 t/s
Threads   4: PP= 11.2 t/s, TG= 4.1 t/s
Threads   8: PP= 19.3 t/s, TG= 5.6 t/s
Threads  12: PP= 21.7 t/s, TG= 5.8 t/s  โ† diminishing returns start
Threads  16: PP= 22.4 t/s, TG= 5.7 t/s  โ† memory bandwidth ceiling
Threads  24: PP= 21.2 t/s, TG= 5.1 t/s  โ† hyperthreading contention

Rule: 50โ€“75% of physical cores is typically optimal.

Finding Your Optimal Thread Count

# Get physical core count (excluding hyperthreading)
PHYSICAL_CORES=$(lscpu | grep "Core(s) per socket" | awk '{print $NF}')
SOCKETS=$(lscpu | grep "Socket(s)" | awk '{print $NF}')
TOTAL_PHYSICAL=$((PHYSICAL_CORES * SOCKETS))
RECOMMENDED=$((TOTAL_PHYSICAL * 3 / 4))

echo "Physical cores: $TOTAL_PHYSICAL"
echo "Recommended threads: $RECOMMENDED"

./build/bin/llama-server \
    --model ./models/hermes-70b-q4_k_m.gguf \
    --threads $RECOMMENDED \
    --threads-batch $RECOMMENDED \
    --ctx-size 8192 \
    --port 8080

46.4 Memory Mapping (mmap) Configuration

mmap vs Direct Load Comparison

Config Startup Time Inference Speed Memory Use Best For
--mmap (default) Fast (seconds) Normal Shared (can be evicted) Multi-process, RAM-constrained
--no-mmap Slow (minutes) Slightly faster Exclusive Single process with ample RAM
--mmap + --mlock Slow (locking) Fastest Exclusive + locked Production (no page faults)
# Default: mmap โ€” shared filesystem cache
./build/bin/llama-server \
    --model ./models/hermes-70b-q4.gguf \
    --use-mmap --threads 12 --ctx-size 8192

# No mmap โ€” entire model loaded into RAM
# Use when RAM >> model size ร— 1.5
./build/bin/llama-server \
    --model ./models/hermes-70b-q4.gguf \
    --no-mmap --threads 12 --ctx-size 8192

# mmap + mlock โ€” prevent any page swapping (requires root or ulimit)
sudo ulimit -l unlimited
./build/bin/llama-server \
    --model ./models/hermes-70b-q4.gguf \
    --use-mmap --use-mlock \
    --threads 12 --ctx-size 8192

Detecting Swap Usage During Inference

# Monitor swap (should stay at 0 during inference)
watch -n 1 'free -h && echo "---" && vmstat 1 1 | tail -1'

# If swap increases: model exceeds available RAM
# Solutions:
# 1. Use smaller quantization (Q4 โ†’ Q3)
# 2. Reduce --ctx-size
# 3. Switch to a smaller model

46.5 Metal GPU Acceleration โ€” Apple Silicon

Why Apple Silicon Is Exceptional for LLM Inference

Apple Silicon's unified memory architecture means CPU and GPU share one physical memory pool (LPDDR5X):

# Verify Metal device
./build/bin/llama-cli --list-devices
# Should show: "GPU Metal: Apple M3 Ultra [...]"

# All layers to Metal GPU
./build/bin/llama-server \
    --model ./models/hermes-70b-q4_k_m.gguf \
    -ngl 99 \                   # All layers to GPU
    --threads 4 \               # Minimal CPU threads (GPU handles the work)
    --ctx-size 65536 \          # M3 Ultra 192GB can handle large context
    --flash-attn \              # Flash Attention (Metal-supported)
    --port 8080 \
    --host 127.0.0.1

# Monitor GPU usage
sudo powermetrics --samplers gpu_power -i 1000 -n 5

Apple Silicon Performance Benchmarks

Chip Unified Memory Bandwidth Hermes 70B Q4 Speed
M1 Max 32 GB 400 GB/s Cannot run (insufficient RAM)
M2 Max 96 GB 400 GB/s ~8 t/s
M3 Max 128 GB 400 GB/s ~10 t/s
M2 Ultra 192 GB 800 GB/s ~20 t/s
M3 Ultra 192 GB 800 GB/s ~30 t/s

Adaptive Launch Script

#!/bin/bash
# start_hermes_apple.sh

MEMORY_GB=$(sysctl -n hw.memsize | awk '{print int($1/1024/1024/1024)}')
echo "Unified memory: ${MEMORY_GB}GB"

if [ "$MEMORY_GB" -ge 128 ]; then
    CTX=65536; THREADS=4; MODEL="hermes-70b-q4_k_m.gguf"
elif [ "$MEMORY_GB" -ge 64 ]; then
    CTX=32768; THREADS=4; MODEL="hermes-13b-q4_k_m.gguf"
else
    CTX=8192; THREADS=4; MODEL="hermes-7b-q4_k_m.gguf"
fi

echo "Config: ctx=${CTX}, model=${MODEL}"

./build/bin/llama-server \
    --model "./models/$MODEL" \
    -ngl 99 --threads "$THREADS" \
    --ctx-size "$CTX" --flash-attn \
    --port 8080 --host 127.0.0.1 --log-disable

46.6 Speed Benchmarks Across Configurations

Benchmark Script

#!/bin/bash
# comprehensive_benchmark.sh
MODEL="${1:-./models/hermes-70b-q4_k_m.gguf}"

run_bench() {
    local desc="$1"; shift
    printf "%-25s " "$desc:"
    ./build/bin/llama-bench \
        --model "$MODEL" --n-prompt 512 --n-gen 128 "$@" --output json 2>/dev/null | \
    python3 -c "
import json,sys
d=json.load(sys.stdin)
pp=[x for x in d if 'pp' in x['test']]
tg=[x for x in d if 'tg' in x['test']]
print(f\"PP={pp[0]['avg_ts']:6.1f} t/s  TG={tg[0]['avg_ts']:5.1f} t/s\" if pp else 'FAILED')
"
}

run_bench "CPU  4 threads"   --threads 4  --n-gpu-layers 0
run_bench "CPU  8 threads"   --threads 8  --n-gpu-layers 0
run_bench "CPU 12 threads"   --threads 12 --n-gpu-layers 0

[[ "$OSTYPE" == "darwin"* ]] && {
    run_bench "Metal all layers"         --n-gpu-layers 99 --threads 4
    run_bench "Metal + Flash Attention"  --n-gpu-layers 99 --threads 4 --flash-attn
}

command -v nvidia-smi &>/dev/null && {
    run_bench "CUDA 40 layers"  --n-gpu-layers 40 --threads 8
    run_bench "CUDA all layers" --n-gpu-layers 99 --threads 4
}

Reference Results

Configuration Hardware PP (t/s) TG (t/s) Practical Use
CPU 8 threads i9-13900K 19.3 5.6 Dev / debugging
CPU 12 threads Ryzen 9 7950X 23.1 6.2 Batch processing
Metal all layers M2 Max 96GB 45.2 8.3 Mac users
Metal + FlashAttn M3 Ultra 192GB 98.7 32.1 High-end Mac
CUDA all layers RTX 3090 24GB 78.3 15.2 Personal GPU
CUDA all layers A100 80GB 156.8 28.7 Enterprise
CUDA all layers H100 80GB 245.3 45.8 High-performance

PP = Prefill speed (prompt processing); TG = Token Generation. For user experience, TG is the critical metric โ€” it determines how fast text appears.


46.7 Production Server Configuration

Complete llama-server Launch Command

./build/bin/llama-server \
    --model ./models/hermes-70b-q4_k_m.gguf \
    \
    # GPU/CPU
    -ngl 99 \                       # Metal/CUDA: all layers to GPU
    --threads 4 \
    --threads-batch 4 \
    \
    # Context
    --ctx-size 65536 \
    --n-predict 4096 \
    \
    # Performance
    --flash-attn \
    --use-mmap \
    --cache-type-k q8_0 \          # Quantize KV cache (saves memory)
    --cache-type-v q8_0 \
    \
    # Server
    --host 0.0.0.0 --port 8080 \
    --api-key "your-secret-key" \
    \
    # Batching
    --parallel 4 \                  # Concurrent request slots
    --cont-batching \               # Continuous batching (higher throughput)
    \
    --log-disable

Python Client (OpenAI-Compatible)

from openai import AsyncOpenAI
import asyncio

client = AsyncOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-secret-key"
)

async def chat(messages: list[dict]) -> str:
    response = await client.chat.completions.create(
        model="hermes",   # model name is arbitrary for llama-server
        messages=messages,
        max_tokens=2048,
        temperature=0.1
    )
    return response.choices[0].message.content

Chapter Summary

Key optimization principles for llama.cpp:

  1. Compile flags: LLAMA_NATIVE + platform accelerator (METAL / AVX512 / CUDA)
  2. Quantization: Q4_K_M is the best-balanced choice for most scenarios
  3. Thread count: 50โ€“75% of physical cores; memory bandwidth is the ceiling
  4. mmap strategy: mmap + mlock in production (no swapping); mmap alone under RAM pressure
  5. Apple Silicon: Metal + -ngl 99 is the killer combination; M3 Ultra achieves 30+ t/s

llama.cpp's greatest value proposition: runs without any GPU, and runs beautifully on Apple Silicon.

Review Questions

  1. K-quantization achieves better quality than naive Q4_0 by dynamically assigning precision based on weight sensitivity. Explain the mathematical principle โ€” how does the algorithm determine which weights need higher precision?

  2. Apple Silicon's unified memory architecture theoretically eliminates CPUโ†’GPU copy overhead, yet measured inference speeds are still below a same-priced NVIDIA GPU. Identify and explain the likely bottlenecks.

  3. llama-server's --cont-batching (continuous batching) and vLLM's PagedAttention both improve throughput under concurrent load. What is the fundamental difference between these approaches? In what specific scenario could llama.cpp's continuous batching compete with vLLM?

Rate this chapter
4.5  / 5  (3 ratings)

๐Ÿ’ฌ Comments