Chapter 43

Hardware Selection: GPU Memory Requirements Calculator

Introduction

"How much VRAM do I need?" is the first question anyone asks before deploying Hermes locally. Wrong hardware selection either wastes money (buying far more than needed) or leads to out-of-memory failures discovered only at deployment time. This chapter provides rigorous calculation formulas, comprehensive model comparison tables, and recommended configurations from consumer to enterprise grade.

43.1 Memory Requirement Calculation Formula

The Master Formula

GPU VRAM required for inference is composed of three parts:

Total VRAM = Model Weights Memory + KV Cache Memory + Runtime Overhead

Model Weights Memory is the largest component:

Model Weights (GB) = Parameters (B) × 10⁹ × Bytes per Parameter / 1024³ × Overhead Factor

Where:
  Parameters (B) = parameter count in billions (the 10⁹ factor converts it to a raw count)
  Bytes per parameter:
    FP32   = 4.0 bytes
    FP16 / BF16 = 2.0 bytes
    INT8 / Q8   = 1.0 byte
    Q5          ≈ 0.625 bytes
    Q4          ≈ 0.5 bytes
  Overhead Factor ≈ 1.15 (framework, activations, buffers)

KV Cache Memory (critical — determines batching capacity):

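KV Cache (GB) = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes / 1024³

Where the leading 2 counts the key and value tensors, kv_heads is the number of key/value heads (smaller than the attention head count when the model uses grouped-query attention), and bytes is the size of one KV cache element (2 for FP16).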

Simplified formula (calibrated for Hermes 4 series):
KV Cache (GB) ≈ context_length_K × batch_size × precision_factor

Where precision_factor:
  FP16: ~0.25 GB per 1K tokens per batch item
  INT8: ~0.125 GB per 1K tokens per batch item
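
When the model architecture is known, the exact formula above can be evaluated directly instead of relying on the calibrated factor. The following is a minimal sketch; the function name is hypothetical, and the layer count, KV head count, and head dimension in the example are illustrative assumptions for a 70B-class model with grouped-query attention, not confirmed Hermes 4 values. Read the real values from the model's configuration file before relying on the result.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GB; the leading 2 counts the key and value tensors."""
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total_bytes / 1024 ** 3


# Assumed architecture for illustration only (not taken from Hermes documentation)
per_1k = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=1024)
print(f"FP16 KV cache per 1K tokens: {per_1k:.4f} GB")
```

With these assumed values the result is 0.3125 GB per 1,024 tokens, the same order of magnitude as the calibrated FP16 factor.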

Python Calculator

def calculate_vram_requirement(
    params_b: float,              # parameters in billions
    precision_bits: int,          # weight quantization bits: 4/5/8/16/32
    context_length_k: int,        # context window in K tokens
    batch_size: int = 1,
    overhead_ratio: float = 1.15,
    kv_precision_bits: int = 16,  # KV cache precision: 8 or 16
) -> dict:
    """Estimate inference VRAM in GB from the formulas in this section."""
    precision_bytes = {4: 0.5, 5: 0.625, 8: 1.0, 16: 2.0, 32: 4.0}
    bytes_per_param = precision_bytes[precision_bits]

    # Weights: parameter count × bytes per parameter, in GB, times the overhead factor
    model_weights_gb = (params_b * 1e9 * bytes_per_param) / (1024 ** 3) * overhead_ratio

    # KV cache: 0.25 GB per 1K tokens per batch item at FP16, 0.125 GB at INT8.
    # Weight quantization does not shrink the KV cache, so it has its own precision.
    kv_precision_factor = 0.25 * precision_bytes[kv_precision_bits] / 2.0
    kv_cache_gb = context_length_k * batch_size * kv_precision_factor

    runtime_overhead_gb = 1.5  # fixed CUDA/framework overhead

    total_gb = model_weights_gb + kv_cache_gb + runtime_overhead_gb

    return {
        "model_weights_gb": round(model_weights_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "runtime_overhead_gb": runtime_overhead_gb,
        "total_gb": round(total_gb, 1),
        "recommended_vram_gb": round(total_gb * 1.1)  # 10% safety margin
    }

# Example calculations: (name, params in B, weight bits, context in K tokens)
configs = [
    ("Hermes 4 70B  FP16", 70, 16, 32),
    ("Hermes 4 70B  Q8",   70,  8, 64),
    ("Hermes 4 70B  Q5",   70,  5, 64),
    ("Hermes 4 70B  Q4",   70,  4, 64),
    ("Hermes 4 13B  FP16", 13, 16, 64),
    ("Hermes 4 13B  Q4",   13,  4, 64),
    ("Hermes 4  7B  FP16",  7, 16, 64),
    ("Hermes 4  7B  Q4",    7,  4, 64),
]

print(f"{'Config':<28} {'Weights':>8} {'KVCache':>8} {'Total':>7} {'Recommended':>12}")
print("-" * 68)
for name, p, q, ctx in configs:
    r = calculate_vram_requirement(p, q, ctx)
    print(f"{name:<28} {r['model_weights_gb']:>7.1f}G {r['kv_cache_gb']:>7.1f}G "
          f"{r['total_gb']:>6.1f}G {r['recommended_vram_gb']:>11.0f}G")

Output (FP16 KV cache; the first row uses a 32K context, the rest 64K):

Config                        Weights  KVCache   Total  Recommended
--------------------------------------------------------------------
Hermes 4 70B  FP16             149.9G     8.0G  159.4G         175G
Hermes 4 70B  Q8                75.0G    16.0G   92.5G         102G
Hermes 4 70B  Q5                46.9G    16.0G   64.4G          71G
Hermes 4 70B  Q4                37.5G    16.0G   55.0G          60G
Hermes 4 13B  FP16              27.8G    16.0G   45.3G          50G
Hermes 4 13B  Q4                 7.0G    16.0G   24.5G          27G
Hermes 4  7B  FP16              15.0G    16.0G   32.5G          36G
Hermes 4  7B  Q4                 3.7G    16.0G   21.2G          23G

43.2 VRAM Requirements Reference Table

Hermes Series — VRAM at 64K Context, Single-User Inference

Model	Params	FP16 VRAM	Q8 VRAM	Q5 VRAM	Q4 VRAM	Minimum Viable
Hermes 4 7B	7B	~16 GB	~10 GB	~7 GB	~6 GB	RTX 3060 12GB (Q4)
Hermes 4 13B	13B	~28 GB	~16 GB	~11 GB	~9 GB	RTX 3090 24GB (Q4)
Hermes 4 34B	34B	~68 GB	~36 GB	~24 GB	~20 GB	A100 40GB (Q4)
Hermes 4 70B	70B	~140 GB	~82 GB	~58 GB	~50 GB	2× A100 80GB (Q4)

Context Length Impact on VRAM (Hermes 4 70B Q4)

Context	batch=1	batch=4	batch=8	Typical Use Case
8K tokens	~36 GB	~40 GB	~48 GB	Simple chat
32K tokens	~44 GB	~60 GB	~88 GB	Code analysis
64K tokens	~50 GB	~80 GB	~140 GB	Long document
128K tokens	~66 GB	~136 GB	exceeds single GPU	Extended tasks
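
The same relationship can be inverted to ask how much context fits on a given card. The sketch below does this with the chapter's simplified formula; the function name is hypothetical, and the defaults (0.25 GB per 1K tokens for an FP16 KV cache, a 1.15 overhead factor on weights, 1.5 GB of runtime overhead) are the chapter's calibration values rather than measurements.

```python
def max_context_k(vram_gb: float, params_b: float, bytes_per_param: float,
                  batch_size: int = 1, kv_gb_per_1k: float = 0.25,
                  overhead_ratio: float = 1.15,
                  runtime_overhead_gb: float = 1.5) -> float:
    """Largest context (in K tokens) that fits under the simplified formula."""
    weights_gb = params_b * 1e9 * bytes_per_param / 1024 ** 3 * overhead_ratio
    free_gb = vram_gb - weights_gb - runtime_overhead_gb
    if free_gb <= 0:
        return 0.0  # the weights alone do not fit
    return free_gb / (batch_size * kv_gb_per_1k)


# 70B at Q4 (about 0.5 bytes per parameter) on a single 80 GB card
for batch in (1, 4, 8):
    ctx = max_context_k(vram_gb=80, params_b=70, bytes_per_param=0.5, batch_size=batch)
    print(f"batch={batch}: about {ctx:.0f}K tokens")
```

Treat the result as an estimate: measured usage depends on the runtime, the attention implementation, and whether the KV cache is quantized.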

43.3 Consumer GPU Recommendations

GPU Comparison Table

GPU	VRAM	Memory BW	Price (2026)	Best Hermes Model	Rating
RTX 3060	12 GB	360 GB/s	~$250	7B Q4 (tight)	★★☆☆☆
RTX 3090	24 GB	936 GB/s	~$750	13B Q4, 7B Q8	★★★★☆
RTX 4090	24 GB	1008 GB/s	~$1,800	13B Q4, fastest	★★★★★
RTX 4080 Super	16 GB	736 GB/s	~$1,000	7B FP16, 13B Q4	★★★☆☆
RX 7900 XTX	24 GB	960 GB/s	~$900	13B Q4 (ROCm)	★★★☆☆
M3 Ultra 128GB	128 GB	800 GB/s	~$6,000	70B Q4	★★★★★
M3 Ultra 192GB	192 GB	800 GB/s	~$8,000	70B Q8	★★★★★

Inference Speed Benchmarks — Hermes 4 70B Q4

GPU Configuration	tokens/sec	First Token	Practical Use
Single RTX 4090 (24GB)	Not viable	—	—
Dual RTX 3090 (48GB)	8–12 t/s	~3s	Personal dev
Dual RTX 4090 (48GB)	15–20 t/s	~2s	Small team
Quad RTX 4090 (96GB)	25–35 t/s	~1.5s	Medium deployment
Single A100 80GB	20–28 t/s	~2s	Enterprise
Single H100 80GB	35–50 t/s	~1s	High performance
M3 Ultra 192GB	25–40 t/s	~1.5s	Efficient personal
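
Single-stream decoding on a dense model is usually limited by memory bandwidth, because generating each token requires reading roughly all of the weights once. Under that assumption, bandwidth divided by weight size gives a rough ceiling on tokens per second. The sketch below applies this rule of thumb to bandwidth figures from the comparison table; the function name is hypothetical, and the estimate ignores KV cache reads, interconnect transfers, and kernel efficiency, so measured throughput should fall below it.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_70B_Q4_GB = 70 * 0.5  # 70B parameters at ~0.5 bytes each (decimal GB)

for name, bandwidth in [("RTX 3090", 936), ("RTX 4090", 1008), ("M3 Ultra", 800)]:
    ceiling = decode_ceiling_tps(bandwidth, WEIGHTS_70B_Q4_GB)
    print(f"{name}: at most ~{ceiling:.0f} t/s for 70B Q4")
```

When layers are split across GPUs, the devices work in turn, so the ceiling stays close to that of one GPU's bandwidth; tensor parallelism reads the shards concurrently and can raise it.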

43.4 Multi-GPU Parallel Strategies

Tensor Parallelism vs Pipeline Parallelism

Tensor Parallelism (preferred with NVLink):
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  GPU 0   │  │  GPU 1   │  │  GPU 2   │  │  GPU 3   │
│ Heads 0-7│  │Heads 8-15│  │Heads16-23│  │Heads24-31│
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┴─────────────┴──────────────┘
                    All-Reduce (each layer)
Pros: Low latency    Cons: Requires NVLink for efficiency

Pipeline Parallelism (PCIe-connected GPUs):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  GPU 0       │ →  │  GPU 1       │ →  │  GPU 2       │
│ Layers 0–26  │    │ Layers 27–53 │    │ Layers 54–79 │
└──────────────┘    └──────────────┘    └──────────────┘
Pros: Works over PCIe    Cons: Higher latency (mitigated by micro-batching)
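
The pipeline split in the diagram can be computed as contiguous, near-equal stages. This is a minimal sketch with a hypothetical helper name, assuming an 80-layer model (the layer count passed to llama.cpp later in this section); real frameworks may balance stages differently, for example to account for the embedding and output layers.

```python
def partition_layers(n_layers: int, n_gpus: int) -> list[tuple[int, int]]:
    """Split layer indices 0..n_layers-1 into contiguous, near-equal stages."""
    base, extra = divmod(n_layers, n_gpus)
    stages, start = [], 0
    for gpu in range(n_gpus):
        size = base + (1 if gpu < extra else 0)  # first `extra` stages get one more layer
        stages.append((start, start + size - 1))  # inclusive (first, last) layer index
        start += size
    return stages


print(partition_layers(80, 3))  # [(0, 26), (27, 53), (54, 79)]
```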

vLLM Multi-GPU Launch Commands

# Tensor parallelism (NVLink recommended)
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-4-70B \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90

# Pipeline parallelism combined with tensor parallelism (2 × 2 = 4 GPUs over PCIe)
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-4-70B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --max-model-len 32768
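
Before launching, it helps to check whether the sharded model fits within each GPU's budget. The sketch below is a rough pre-flight check with a hypothetical function name. It assumes weights and KV cache are sharded evenly across tensor-parallel ranks, treats the `--gpu-memory-utilization` value as the usable fraction of each card, and borrows the chapter's 1.5 GB runtime overhead as a per-GPU assumption.

```python
def fits_tensor_parallel(weights_gb: float, kv_cache_gb: float, n_gpus: int,
                         gpu_vram_gb: float, utilization: float = 0.90,
                         per_gpu_overhead_gb: float = 1.5) -> bool:
    """Rough check: do evenly sharded weights + KV cache fit on each GPU?"""
    budget_gb = gpu_vram_gb * utilization  # share of VRAM the server may use
    per_gpu_gb = (weights_gb + kv_cache_gb) / n_gpus + per_gpu_overhead_gb
    return per_gpu_gb <= budget_gb


# 70B in BF16 (about 130.4 GB of weights) with a 16 GB KV cache
print(fits_tensor_parallel(130.4, 16.0, n_gpus=4, gpu_vram_gb=80))  # True
print(fits_tensor_parallel(130.4, 16.0, n_gpus=4, gpu_vram_gb=24))  # False
```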

llama.cpp Layer Splitting

# Split layers proportionally across GPUs
# --tensor-split "0.5,0.5" places 50% on GPU 0 and 50% on GPU 1
./llama-server \
    --model hermes-4-70b-q4_0.gguf \
    --n-gpu-layers 80 \
    --tensor-split "0.5,0.5" \
    --ctx-size 65536 \
    --port 8080
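
For GPUs of unequal size, the split proportions can be derived from each card's VRAM. The helper below is a hypothetical sketch that builds the comma-separated value for `--tensor-split`; splitting in proportion to VRAM is a starting heuristic, not a llama.cpp requirement, and may need adjusting if one GPU also drives a display.

```python
def tensor_split_arg(vram_gb: list[float]) -> str:
    """Build a --tensor-split value proportional to each GPU's VRAM."""
    total = sum(vram_gb)
    return ",".join(f"{v / total:.2f}" for v in vram_gb)


print(tensor_split_arg([24, 24]))  # 0.50,0.50
print(tensor_split_arg([24, 12]))  # 0.67,0.33
```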

43.5 CPU-Only Inference: Viability Assessment

Speed Comparison

CPU	RAM Capacity	tokens/sec (70B Q4)	Verdict
Intel i9-13900K	64 GB DDR5	1.5–3 t/s	Barely usable (patience needed)
AMD Ryzen 9 7950X	128 GB DDR5	2–4 t/s	OK for non-interactive batch work
Apple M2 Max (96GB)	96 GB	8–15 t/s	Recommended (hybrid CPU/GPU)
Apple M3 Ultra (192GB)	192 GB	25–40 t/s	Near consumer GPU performance
Dual EPYC 9654 (768GB)	768 GB	3–6 t/s	Can run FP16, poor value

When CPU Inference Is Viable

CPU inference is practical only for:

Batch processing (no user waiting): even 1–3 t/s works for overnight jobs
Apple Silicon with Metal acceleration: M3 Ultra reaches 25–40 t/s
Smaller models (7B Q4 on x86 reaches ~10 t/s)

For interactive use on x86, CPU-only 70B is not recommended.
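
To judge whether a batch job is feasible at CPU speeds, convert the decode rate into total output. The sketch below uses hypothetical helper names and assumes a steady generation rate with no prompt-processing time, which makes it optimistic for long prompts.

```python
def tokens_generated(tokens_per_sec: float, hours: float) -> int:
    """Total tokens produced at a steady decode rate."""
    return int(tokens_per_sec * hours * 3600)


def hours_needed(total_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock hours to generate total_tokens at a steady rate."""
    return total_tokens / tokens_per_sec / 3600


print(tokens_generated(2, 8))                  # 57600 tokens in an 8-hour night at 2 t/s
print(f"{hours_needed(200_000, 2):.1f} hours")  # 27.8 hours for 200,000 tokens at 2 t/s
```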

Optimized CPU Launch Commands

# x86 with AVX-512
#   --threads 16     match the physical core count
#   --ctx-size 8192  reduce context for CPU
#   --mlock          lock pages in RAM
#   --no-mmap        disable mmap on large-RAM systems
./llama-server \
    --model hermes-4-70b-q4_k_m.gguf \
    --threads 16 \
    --ctx-size 8192 \
    --mlock \
    --no-mmap

# Apple Silicon (Metal acceleration, all layers on the GPU)
#   -ngl 99           load all layers to the Metal GPU
#   --ctx-size 65536  M3 Ultra handles large context
./llama-server \
    --model hermes-4-70b-q4_k_m.gguf \
    -ngl 99 \
    --threads 8 \
    --ctx-size 65536

43.6 Hardware Selection Decision Guide

Budget	Recommended Config	Model	Use Case
<$700	CPU + 64GB DDR5	Hermes 7B Q4 (slow)	Learning, experiments
$700–2,000	RTX 3090 (24GB)	13B Q4, 7B FP16	Personal development
$2,000–5,000	RTX 4090 × 2 (48GB)	34B Q4, 13B FP16	Small team
$5,000–9,000	M3 Ultra 192GB	70B Q4 (25 t/s)	Efficient personal/team
$9,000–25,000	A100 80GB × 2	70B Q8 (high quality)	Enterprise small-scale
$25,000+	H100 80GB × 4+	70B FP16 (maximum)	Enterprise production
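
The decision table can be expressed as a simple lookup. This is a sketch with hypothetical names; the tiers are copied from the table above, and the choice to assign a boundary value such as $700 to the higher tier is an assumption, since the table's ranges share endpoints.

```python
# Budget tiers from the decision table: (upper bound in USD, config, model)
TIERS = [
    (700, "CPU + 64GB DDR5", "Hermes 7B Q4 (slow)"),
    (2_000, "RTX 3090 (24GB)", "13B Q4, 7B FP16"),
    (5_000, "RTX 4090 × 2 (48GB)", "34B Q4, 13B FP16"),
    (9_000, "M3 Ultra 192GB", "70B Q4"),
    (25_000, "A100 80GB × 2", "70B Q8"),
    (float("inf"), "H100 80GB × 4+", "70B FP16"),
]


def recommend(budget_usd: float) -> tuple[str, str]:
    """Return (config, model) for the first tier whose upper bound exceeds the budget."""
    for upper, config, model in TIERS:
        if budget_usd < upper:
            return config, model
    return TIERS[-1][1], TIERS[-1][2]  # unbounded budget falls into the top tier


print(recommend(4_000))
```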

Chapter Summary

The master formula for VRAM sizing:

Total VRAM (GB) = params(B) × 10⁹ × precision_bytes / 1024³ × 1.15
                + context_K × batch_size × 0.25   (FP16 KV cache)
                + 1.5   (fixed overhead)

Key conclusions:

Hermes 4 70B sweet spot: Q4 quantization with 64K context requires ~50–55 GB VRAM
Best consumer option: Apple M3 Ultra (unified memory) or dual RTX 4090
CPU inference is only practical on Apple Silicon; x86 CPUs are suitable only for batch workloads
Multi-GPU: prefer tensor parallelism with NVLink; fall back to pipeline parallelism over PCIe

Review Questions

Why does KV Cache memory scale linearly with context length rather than with parameter count? What does this imply when choosing between "large model, short context" and "small model, long context"?
Apple M3 Ultra uses a unified memory architecture where CPU and GPU share the same memory pool. Compare this against discrete GDDR6X in terms of bandwidth, latency, and practical inference performance for a 70B model.
With a $4,000 budget and a requirement to serve 10 concurrent users on Hermes 70B, would you choose one high-memory machine or multiple smaller machines? Justify your architecture decision with capacity calculations.
