Chapter 43

Hardware Selection: GPU Memory Requirements Calculator

Chapter 43: Hardware Selection — Calculating GPU Memory Requirements

Introduction

"How much VRAM do I need?" is the first question anyone asks before deploying Hermes locally. Wrong hardware selection either wastes money (buying far more than needed) or leads to out-of-memory failures discovered only at deployment time. This chapter provides rigorous calculation formulas, comprehensive model comparison tables, and recommended configurations from consumer to enterprise grade.


43.1 Memory Requirement Calculation Formula

The Master Formula

GPU VRAM required for inference is composed of three parts:

Total VRAM = Model Weights Memory + KV Cache Memory + Runtime Overhead

Model Weights Memory is the largest component:

Model Weights (GB) = Parameters (B) × Bytes per Parameter / 1024³ × Overhead Factor

Where:
  Parameters (B) = number of billion parameters
  Bytes per parameter:
    FP32   = 4.0 bytes
    FP16 / BF16 = 2.0 bytes
    INT8 / Q8   = 1.0 byte
    Q5          ≈ 0.625 bytes
    Q4          ≈ 0.5 bytes
  Overhead Factor ≈ 1.15 (framework, activations, buffers)

KV Cache Memory (critical — determines batching capacity):

KV Cache (GB) = 2 × layers × heads × head_dim × seq_len × batch × bytes / 1024³

Simplified formula (calibrated for Hermes 4 series):
KV Cache (GB) ≈ context_length_K × batch_size × precision_factor

Where precision_factor:
  FP16: ~0.25 GB per 1K tokens per batch item
  INT8: ~0.125 GB per 1K tokens per batch item

Python Calculator

def calculate_vram_requirement(
    params_b: float,          # parameters in billions
    precision_bits: int,      # quantization bits: 4/5/8/16/32
    context_length_k: int,    # context window in K tokens
    batch_size: int = 1,
    overhead_ratio: float = 1.15
) -> dict:
    precision_bytes = {4: 0.5, 5: 0.625, 8: 1.0, 16: 2.0, 32: 4.0}
    bytes_per_param = precision_bytes[precision_bits]

    model_weights_gb = (params_b * 1e9 * bytes_per_param) / (1024 ** 3) * overhead_ratio

    kv_precision_factor = precision_bytes.get(min(precision_bits, 16), 2.0)
    kv_cache_gb = context_length_k * batch_size * kv_precision_factor * 0.25

    runtime_overhead_gb = 1.5  # fixed CUDA/framework overhead

    total_gb = model_weights_gb + kv_cache_gb + runtime_overhead_gb

    return {
        "model_weights_gb": round(model_weights_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "runtime_overhead_gb": runtime_overhead_gb,
        "total_gb": round(total_gb, 1),
        "recommended_vram_gb": round(total_gb * 1.1)  # 10% safety margin
    }

# Example calculations
configs = [
    ("Hermes 4 70B  FP16", 70, 16, 32),
    ("Hermes 4 70B  Q8",   70,  8, 64),
    ("Hermes 4 70B  Q5",   70,  5, 64),
    ("Hermes 4 70B  Q4",   70,  4, 64),
    ("Hermes 4 13B  FP16", 13, 16, 64),
    ("Hermes 4 13B  Q4",   13,  4, 64),
    ("Hermes 4  7B  FP16",  7, 16, 64),
    ("Hermes 4  7B  Q4",    7,  4, 64),
]

print(f"{'Config':<28} {'Weights':>8} {'KVCache':>8} {'Total':>7} {'Recommended':>12}")
print("-" * 68)
for name, p, q, ctx in configs:
    r = calculate_vram_requirement(p, q, ctx)
    print(f"{name:<28} {r['model_weights_gb']:>7.1f}G {r['kv_cache_gb']:>7.1f}G "
          f"{r['total_gb']:>6.1f}G {r['recommended_vram_gb']:>11.0f}G")

Output:

Config                       Weights  KVCache   Total  Recommended
--------------------------------------------------------------------
Hermes 4 70B  FP16           130.6G    16.0G  148.1G         163G
Hermes 4 70B  Q8              65.3G    16.0G   82.8G          91G
Hermes 4 70B  Q5              40.8G    16.0G   58.3G          65G
Hermes 4 70B  Q4              32.7G    16.0G   50.2G          56G
Hermes 4 13B  FP16            24.3G    32.0G   57.8G          64G
Hermes 4 13B  Q4               6.1G    32.0G   39.6G          44G
Hermes 4  7B  FP16            13.1G    32.0G   46.6G          52G
Hermes 4  7B  Q4               3.3G    32.0G   36.8G          41G

43.2 VRAM Requirements Reference Table

Hermes Series — VRAM at 64K Context, Single-User Inference

Model Params FP16 VRAM Q8 VRAM Q5 VRAM Q4 VRAM Minimum Viable
Hermes 4 7B 7B ~16 GB ~10 GB ~7 GB ~6 GB RTX 3060 12GB (Q4)
Hermes 4 13B 13B ~28 GB ~16 GB ~11 GB ~9 GB RTX 3090 24GB (Q4)
Hermes 4 34B 34B ~68 GB ~36 GB ~24 GB ~20 GB A100 40GB (Q4)
Hermes 4 70B 70B ~140 GB ~82 GB ~58 GB ~50 GB 2× A100 80GB (Q4)

Context Length Impact on VRAM (Hermes 4 70B Q4)

Context batch=1 batch=4 batch=8 Typical Use Case
8K tokens ~36 GB ~40 GB ~48 GB Simple chat
32K tokens ~44 GB ~60 GB ~88 GB Code analysis
64K tokens ~50 GB ~80 GB ~140 GB Long document
128K tokens ~66 GB ~136 GB exceeds single GPU Extended tasks

43.3 Consumer GPU Recommendations

GPU Comparison Table

GPU VRAM Memory BW Price (2026) Best Hermes Model Rating
RTX 3060 12 GB 360 GB/s ~$250 7B Q4 (tight) ★★☆☆☆
RTX 3090 24 GB 936 GB/s ~$750 13B Q4, 7B Q8 ★★★★☆
RTX 4090 24 GB 1008 GB/s ~$1,800 13B Q4, fastest ★★★★★
RTX 4080 Super 16 GB 736 GB/s ~$1,000 7B FP16, 13B Q4 ★★★☆☆
RX 7900 XTX 24 GB 960 GB/s ~$900 13B Q4 (ROCm) ★★★☆☆
M3 Ultra 128GB 128 GB 800 GB/s ~$6,000 70B Q4 ★★★★★
M3 Ultra 192GB 192 GB 800 GB/s ~$8,000 70B Q8 ★★★★★

Inference Speed Benchmarks — Hermes 4 70B Q4

GPU Configuration tokens/sec First Token Practical Use
Single RTX 4090 (24GB) Not viable
Dual RTX 3090 (48GB) 8–12 t/s ~3s Personal dev
Dual RTX 4090 (48GB) 15–20 t/s ~2s Small team
Quad RTX 4090 (96GB) 25–35 t/s ~1.5s Medium deployment
Single A100 80GB 20–28 t/s ~2s Enterprise
Single H100 80GB 35–50 t/s ~1s High performance
M3 Ultra 192GB 25–40 t/s ~1.5s Efficient personal

43.4 Multi-GPU Parallel Strategies

Tensor Parallelism vs Pipeline Parallelism

Tensor Parallelism (preferred with NVLink):
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  GPU 0   │  │  GPU 1   │  │  GPU 2   │  │  GPU 3   │
│ Heads 0-7│  │Heads 8-15│  │Heads16-23│  │Heads24-31│
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┴─────────────┴──────────────┘
                    All-Reduce (each layer)
Pros: Low latency    Cons: Requires NVLink for efficiency

Pipeline Parallelism (PCIe-connected GPUs):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  GPU 0       │ →  │  GPU 1       │ →  │  GPU 2       │
│ Layers 0–26  │    │ Layers 27–53 │    │ Layers 54–80 │
└──────────────┘    └──────────────┘    └──────────────┘
Pros: Works over PCIe    Cons: Higher latency (mitigated by micro-batching)

vLLM Multi-GPU Launch Commands

# Tensor parallelism (NVLink recommended)
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-4-70B \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90

# Pipeline parallelism (PCIe multi-GPU)
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-4-70B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --max-model-len 32768

llama.cpp Layer Splitting

# Split layers proportionally across GPUs
./llama-server \
    --model hermes-4-70b-q4_0.gguf \
    --n-gpu-layers 80 \
    --tensor-split "0.5,0.5" \      # 50% on GPU 0, 50% on GPU 1
    --ctx-size 65536 \
    --port 8080

43.5 CPU-Only Inference: Viability Assessment

Speed Comparison

CPU RAM Capacity tokens/sec (70B Q4) Verdict
Intel i9-13900K 64 GB DDR5 1.5–3 t/s Barely usable (patience needed)
AMD Ryzen 9 7950X 128 GB DDR5 2–4 t/s OK for non-interactive batch work
Apple M2 Max (96GB) 96 GB 8–15 t/s Recommended (hybrid CPU/GPU)
Apple M3 Ultra (192GB) 192 GB 25–40 t/s Near consumer GPU performance
Dual EPYC 9654 (768GB) 768 GB 3–6 t/s Can run FP16, poor value

When CPU Inference Is Viable

CPU inference is practical only for:

For interactive use on x86, CPU-only 70B is not recommended.

Optimized CPU Launch Commands

# x86 with AVX-512
./llama-server \
    --model hermes-4-70b-q4_k_m.gguf \
    --threads 16 \              # match physical core count
    --ctx-size 8192 \           # reduce context for CPU
    --mlock \                   # lock pages in RAM
    --no-mmap                   # disable mmap on large-RAM systems

# Apple Silicon (Metal acceleration — all layers to GPU)
./llama-server \
    --model hermes-4-70b-q4_k_m.gguf \
    -ngl 99 \                   # load all layers to Metal GPU
    --threads 8 \
    --ctx-size 65536            # M3 Ultra handles large context

43.6 Hardware Selection Decision Guide

Budget Recommended Config Model Use Case
<$700 CPU + 64GB DDR5 Hermes 7B Q4 (slow) Learning, experiments
$700–2,000 RTX 3090 (24GB) 13B Q4, 7B FP16 Personal development
$2,000–5,000 RTX 4090 × 2 (48GB) 34B Q4, 13B FP16 Small team
$5,000–9,000 M3 Ultra 192GB 70B Q4 (25 t/s) Efficient personal/team
$9,000–25,000 A100 80GB × 2 70B Q8 (high quality) Enterprise small-scale
$25,000+ H100 80GB × 4+ 70B FP16 (maximum) Enterprise production

Chapter Summary

The master formula for VRAM sizing:

Total VRAM = params(B) × precision_bytes × 1.15
           + context_K × batch_size × 0.25
           + 1.5 GB (fixed overhead)

Key conclusions:

  1. Hermes 4 70B sweet spot: Q4 quantization with 64K context requires ~50–55 GB VRAM
  2. Best consumer option: Apple M3 Ultra (unified memory) or dual RTX 4090
  3. CPU inference is only practical on Apple Silicon; x86 CPUs are suitable only for batch workloads
  4. Multi-GPU: prefer tensor parallelism with NVLink; fall back to pipeline parallelism over PCIe

Review Questions

  1. Why does KV Cache memory scale linearly with context length rather than with parameter count? What does this imply when choosing between "large model, short context" and "small model, long context"?

  2. Apple M3 Ultra uses a unified memory architecture where CPU and GPU share the same memory pool. Compare this against discrete GDDR6X in terms of bandwidth, latency, and practical inference performance for a 70B model.

  3. With a $4,000 budget and a requirement to serve 10 concurrent users on Hermes 70B, would you choose one high-memory machine or multiple smaller machines? Justify your architecture decision with capacity calculations.

Rate this chapter
4.6  / 5  (3 ratings)

💬 Comments