Chapter 43: Hardware Selection — Calculating GPU Memory Requirements
Introduction
"How much VRAM do I need?" is the first question anyone asks before deploying Hermes locally. Wrong hardware selection either wastes money (buying far more than needed) or leads to out-of-memory failures discovered only at deployment time. This chapter provides estimation formulas, VRAM reference tables by model size and quantization level, and recommended configurations from consumer to enterprise grade.
43.1 Memory Requirement Calculation Formula
The Master Formula
GPU VRAM required for inference is composed of three parts:
Total VRAM = Model Weights Memory + KV Cache Memory + Runtime Overhead
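Model Weights Memory is usually the largest component for single-user inference:
Model Weights (GB) = Parameters (B) × 10⁹ × Bytes per Parameter / 1024³ × Overhead Factor
Where:
Parameters (B) = parameter count in billions (for example, 70 for a 70B model); the × 10⁹ converts it to a raw parameter count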
Bytes per parameter:
FP32 = 4.0 bytes
FP16 / BF16 = 2.0 bytes
INT8 / Q8 = 1.0 byte
Q5 ≈ 0.625 bytes
Q4 ≈ 0.5 bytes
Overhead Factor ≈ 1.15 (framework, activations, buffers)
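KV Cache Memory (critical, because it determines batching capacity):
KV Cache (GB) = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes / 1024³
Here the factor 2 covers keys and values, kv_heads is the number of key/value heads (smaller than the attention head count when the model uses grouped-query attention), and bytes is the size of one cached element.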
Simplified formula (calibrated for Hermes 4 series):
KV Cache (GB) ≈ context_length_K × batch_size × precision_factor
Where precision_factor:
FP16: ~0.25 GB per 1K tokens per batch item
INT8: ~0.125 GB per 1K tokens per batch item
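As a rough cross-check, the exact formula can be evaluated for a concrete architecture and compared with the simplified factor. The sketch below is illustrative only: `kv_cache_gb` is a hypothetical helper, and the architecture values (80 layers, 8 key/value heads, head dimension 128) are assumptions typical of 70B-class models with grouped-query attention, not figures given in this chapter. The head count in the formula is read as the number of key/value heads.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_elem: float = 2.0) -> float:
    """Exact KV cache size in GB (1024**3 bytes).

    The leading factor 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total_bytes / 1024 ** 3


# Assumed 70B-class architecture (hypothetical values, not from this chapter).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

fp16_per_1k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1024)
int8_per_1k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1024, bytes_per_elem=1.0)

print(f"FP16 KV cache per 1K tokens: {fp16_per_1k} GB")
print(f"INT8 KV cache per 1K tokens: {int8_per_1k} GB")
```

Under these assumptions the FP16 figure comes to 0.3125 GB per 1K tokens, the same order of magnitude as the simplified 0.25 factor. A model with more key/value heads or layers would need proportionally more.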
Python Calculator
def calculate_vram_requirement(
    params_b: float,          # parameters in billions
    precision_bits: int,      # weight quantization bits: 4/5/8/16/32
    context_length_k: int,    # context window in K tokens
    batch_size: int = 1,
    overhead_ratio: float = 1.15,
    kv_cache_bits: int = 16,  # KV cache precision: 16 (FP16) or 8 (INT8)
) -> dict:
    precision_bytes = {4: 0.5, 5: 0.625, 8: 1.0, 16: 2.0, 32: 4.0}
    bytes_per_param = precision_bytes[precision_bits]
    model_weights_gb = (params_b * 1e9 * bytes_per_param) / (1024 ** 3) * overhead_ratio
    # Simplified KV factor from above: 0.25 GB per 1K tokens at FP16, 0.125 GB at INT8.
    # Weight quantization does not shrink the KV cache by itself, so it is set separately.
    kv_gb_per_1k = {16: 0.25, 8: 0.125}[kv_cache_bits]
    kv_cache_gb = context_length_k * batch_size * kv_gb_per_1k
    runtime_overhead_gb = 1.5  # fixed CUDA/framework overhead
    total_gb = model_weights_gb + kv_cache_gb + runtime_overhead_gb
    return {
        "model_weights_gb": round(model_weights_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "runtime_overhead_gb": runtime_overhead_gb,
        "total_gb": round(total_gb, 1),
        "recommended_vram_gb": round(total_gb * 1.1),  # 10% safety margin
    }

# Example calculations: (label, params in billions, weight bits, context in K tokens)
configs = [
    ("Hermes 4 70B FP16", 70, 16, 32),
    ("Hermes 4 70B Q8", 70, 8, 64),
    ("Hermes 4 70B Q5", 70, 5, 64),
    ("Hermes 4 70B Q4", 70, 4, 64),
    ("Hermes 4 13B FP16", 13, 16, 64),
    ("Hermes 4 13B Q4", 13, 4, 64),
    ("Hermes 4 7B FP16", 7, 16, 64),
    ("Hermes 4 7B Q4", 7, 4, 64),
]
print(f"{'Config':<28} {'Weights':>8} {'KVCache':>8} {'Total':>7} {'Recommended':>12}")
print("-" * 68)
for name, p, q, ctx in configs:
    r = calculate_vram_requirement(p, q, ctx)
    print(f"{name:<28} {r['model_weights_gb']:>7.1f}G {r['kv_cache_gb']:>7.1f}G "
          f"{r['total_gb']:>6.1f}G {r['recommended_vram_gb']:>11.0f}G")
Output:
Config                        Weights  KVCache   Total  Recommended
--------------------------------------------------------------------
Hermes 4 70B FP16              149.9G     8.0G  159.4G         175G
Hermes 4 70B Q8                 75.0G    16.0G   92.5G         102G
Hermes 4 70B Q5                 46.9G    16.0G   64.4G          71G
Hermes 4 70B Q4                 37.5G    16.0G   55.0G          60G
Hermes 4 13B FP16               27.8G    16.0G   45.3G          50G
Hermes 4 13B Q4                  7.0G    16.0G   24.5G          27G
Hermes 4 7B FP16                15.0G    16.0G   32.5G          36G
Hermes 4 7B Q4                   3.7G    16.0G   21.2G          23G
These totals apply the same KV factor to every model size. Smaller models typically have fewer layers and therefore a smaller KV cache per token, so the 7B and 13B rows are likely overestimates.
43.2 VRAM Requirements Reference Table
Hermes Series — VRAM at 64K Context, Single-User Inference
| Model | Params | FP16 VRAM | Q8 VRAM | Q5 VRAM | Q4 VRAM | Minimum Viable |
|---|---|---|---|---|---|---|
| Hermes 4 7B | 7B | ~16 GB | ~10 GB | ~7 GB | ~6 GB | RTX 3060 12GB (Q4) |
| Hermes 4 13B | 13B | ~28 GB | ~16 GB | ~11 GB | ~9 GB | RTX 3090 24GB (Q4) |
| Hermes 4 34B | 34B | ~68 GB | ~36 GB | ~24 GB | ~20 GB | A100 40GB (Q4) |
| Hermes 4 70B | 70B | ~140 GB | ~82 GB | ~58 GB | ~50 GB | A100 80GB (Q4) |
Context Length Impact on VRAM (Hermes 4 70B Q4)
| Context | batch=1 | batch=4 | batch=8 | Typical Use Case |
|---|---|---|---|---|
| 8K tokens | ~36 GB | ~40 GB | ~48 GB | Simple chat |
| 32K tokens | ~44 GB | ~60 GB | ~88 GB | Code analysis |
| 64K tokens | ~50 GB | ~80 GB | ~140 GB | Long document |
| 128K tokens | ~66 GB | ~136 GB | exceeds single GPU | Extended tasks |
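The same relationship can be read in reverse: given a fixed VRAM budget, how much context fits? The following is a minimal sketch that inverts the chapter's simplified formula. `max_context_k` and its default arguments are hypothetical names introduced here, and the 0.25 GB per 1K tokens factor assumes an FP16 KV cache. Because it applies the 1.15 weights overhead factor, its estimates may be somewhat more conservative than the rounded table values.

```python
def max_context_k(vram_gb: float, params_b: float, bytes_per_param: float,
                  batch_size: int = 1, kv_gb_per_1k: float = 0.25,
                  overhead_ratio: float = 1.15, runtime_gb: float = 1.5) -> int:
    """Largest context (in K tokens) that fits in vram_gb under the simplified formula."""
    weights_gb = params_b * 1e9 * bytes_per_param / 1024 ** 3 * overhead_ratio
    free_gb = vram_gb - weights_gb - runtime_gb  # what is left for the KV cache
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb / (kv_gb_per_1k * batch_size))


# 70B at Q4 (0.5 bytes per parameter) on an 80 GB card, for several batch sizes
for batch in (1, 4, 8):
    print(f"batch={batch}: up to ~{max_context_k(80, 70, 0.5, batch)}K tokens")

# The same model on a single 24 GB card: the weights do not fit at all
print(max_context_k(24, 70, 0.5))
```

Doubling the batch size halves the context each request can use, which is why concurrency targets matter as much as context length when sizing a card.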
43.3 Consumer GPU Recommendations
GPU Comparison Table
| GPU | VRAM | Memory BW | Price (2026) | Best Hermes Model | Rating |
|---|---|---|---|---|---|
| RTX 3060 | 12 GB | 360 GB/s | ~$250 | 7B Q4 (tight) | ★★☆☆☆ |
| RTX 3090 | 24 GB | 936 GB/s | ~$750 | 13B Q4, 7B Q8 | ★★★★☆ |
| RTX 4090 | 24 GB | 1008 GB/s | ~$1,800 | 13B Q4, fastest | ★★★★★ |
| RTX 4080 Super | 16 GB | 736 GB/s | ~$1,000 | 7B FP16, 13B Q4 | ★★★☆☆ |
| RX 7900 XTX | 24 GB | 960 GB/s | ~$900 | 13B Q4 (ROCm) | ★★★☆☆ |
| M3 Ultra 128GB | 128 GB | 800 GB/s | ~$6,000 | 70B Q4 | ★★★★★ |
| M3 Ultra 192GB | 192 GB | 800 GB/s | ~$8,000 | 70B Q8 | ★★★★★ |
Inference Speed Benchmarks — Hermes 4 70B Q4
| GPU Configuration | tokens/sec | First Token | Practical Use |
|---|---|---|---|
| Single RTX 4090 (24GB) | Not viable | — | — |
| Dual RTX 3090 (48GB) | 8–12 t/s | ~3s | Personal dev |
| Dual RTX 4090 (48GB) | 15–20 t/s | ~2s | Small team |
| Quad RTX 4090 (96GB) | 25–35 t/s | ~1.5s | Medium deployment |
| Single A100 80GB | 20–28 t/s | ~2s | Enterprise |
| Single H100 80GB | 35–50 t/s | ~1s | High performance |
| M3 Ultra 192GB | 25–40 t/s | ~1.5s | Efficient personal |
43.4 Multi-GPU Parallel Strategies
Tensor Parallelism vs Pipeline Parallelism
Tensor Parallelism (preferred with NVLink):
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │
│ Heads 0-7│ │Heads 8-15│ │Heads16-23│ │Heads24-31│
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
└─────────────┴─────────────┴──────────────┘
All-Reduce (each layer)
Pros: Low latency Cons: Requires NVLink for efficiency
Pipeline Parallelism (PCIe-connected GPUs):
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ GPU 0 │ → │ GPU 1 │ → │ GPU 2 │
│ Layers 0–26 │ │ Layers 27–53 │ │ Layers 54–79 │
└──────────────┘ └──────────────┘ └──────────────┘
Pros: Works over PCIe Cons: Higher latency (mitigated by micro-batching)
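The two partitioning schemes in the diagrams can be sketched as index arithmetic. This is only an illustration of how the ranges come about, assuming an 80-layer model (indices 0 to 79) and 32 attention heads. `pipeline_stages` and `tensor_shards` are hypothetical helpers, not functions from vLLM or llama.cpp, and real frameworks perform this assignment internally.

```python
from typing import List, Tuple


def pipeline_stages(num_layers: int, num_stages: int) -> List[Tuple[int, int]]:
    """Contiguous, near-equal layer ranges (inclusive) for pipeline parallelism."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # earlier stages absorb the remainder
        stages.append((start, start + size - 1))
        start += size
    return stages


def tensor_shards(num_heads: int, num_gpus: int) -> List[Tuple[int, int]]:
    """Equal attention-head ranges (inclusive) for tensor parallelism."""
    if num_heads % num_gpus != 0:
        # Assumption of this sketch: heads must divide evenly across GPUs.
        raise ValueError("head count must be divisible by the number of GPUs")
    per_gpu = num_heads // num_gpus
    return [(g * per_gpu, (g + 1) * per_gpu - 1) for g in range(num_gpus)]


print(pipeline_stages(80, 3))  # [(0, 26), (27, 53), (54, 79)]
print(tensor_shards(32, 4))    # [(0, 7), (8, 15), (16, 23), (24, 31)]
```

Pipeline stages hold whole layers, so only activations cross GPU boundaries between stages. Tensor shards split every layer, which is why an all-reduce is needed at each layer.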
vLLM Multi-GPU Launch Commands
# Tensor parallelism (NVLink recommended)
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-4-70B \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 65536 \
--gpu-memory-utilization 0.90
# Pipeline parallelism combined with tensor parallelism (2 × 2 = 4 GPUs, PCIe)
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-4-70B \
--pipeline-parallel-size 2 \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 32768
llama.cpp Layer Splitting
# Split layers proportionally across GPUs (50% on GPU 0, 50% on GPU 1)
./llama-server \
    --model hermes-4-70b-q4_0.gguf \
    --n-gpu-layers 80 \
    --tensor-split "0.5,0.5" \
    --ctx-size 65536 \
    --port 8080
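When the GPUs have unequal VRAM, the split proportions can be derived from each card's memory. The helper below is a minimal sketch: `tensor_split_arg` is a hypothetical function, not part of llama.cpp, and it assumes that splitting in proportion to total VRAM is acceptable, ignoring any per-GPU headroom for other buffers.

```python
from typing import Sequence


def tensor_split_arg(vram_gb_per_gpu: Sequence[float]) -> str:
    """Build a comma-separated proportion string from per-GPU VRAM sizes."""
    total = sum(vram_gb_per_gpu)
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    return ",".join(f"{vram / total:.2f}" for vram in vram_gb_per_gpu)


print(tensor_split_arg([24, 24]))  # 0.50,0.50
print(tensor_split_arg([24, 12]))  # 0.67,0.33
```

The resulting string can be passed as the value of `--tensor-split`. Treat it as a starting point and adjust if one GPU runs out of memory first.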
43.5 CPU-Only Inference: Viability Assessment
Speed Comparison
| CPU | RAM Capacity | tokens/sec (70B Q4) | Verdict |
|---|---|---|---|
| Intel i9-13900K | 64 GB DDR5 | 1.5–3 t/s | Barely usable (patience needed) |
| AMD Ryzen 9 7950X | 128 GB DDR5 | 2–4 t/s | OK for non-interactive batch work |
| Apple M2 Max (96GB) | 96 GB | 8–15 t/s | Recommended (hybrid CPU/GPU) |
| Apple M3 Ultra (192GB) | 192 GB | 25–40 t/s | Near consumer GPU performance |
| Dual EPYC 9654 (768GB) | 768 GB | 3–6 t/s | Can run FP16, poor value |
When CPU Inference Is Viable
CPU inference is practical only for:
- Batch processing (no user waiting): even 1–3 t/s works for overnight jobs
- Apple Silicon with Metal acceleration: M3 Ultra reaches 25–40 t/s
- Smaller models (7B Q4 on x86 reaches ~10 t/s)
For interactive use on x86, CPU-only 70B is not recommended.
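The overnight-job claim is simple arithmetic, sketched below. `tokens_in_window` and `hours_for_tokens` are hypothetical helpers, and the 8-hour window is an assumption chosen for illustration.

```python
def tokens_in_window(hours: float, tokens_per_sec: float) -> int:
    """Tokens generated in a time window at a steady decode rate."""
    return int(hours * 3600 * tokens_per_sec)


def hours_for_tokens(total_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock hours needed to generate total_tokens at a steady decode rate."""
    return total_tokens / tokens_per_sec / 3600


# Assumed 8-hour overnight window at the low and high end of the x86 range
for tps in (1.0, 3.0):
    print(f"{tps} t/s over 8 h: {tokens_in_window(8, tps):,} tokens")
```

At 1 to 3 t/s, an 8-hour run yields roughly 29,000 to 86,000 generated tokens. That is enough for a modest batch of summaries, but far too slow for a user waiting on each reply.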
Optimized CPU Launch Commands
# x86 with AVX-512
# --threads: match the physical core count
# --ctx-size: reduced context for CPU
# --mlock: lock pages in RAM; --no-mmap: disable mmap on large-RAM systems
./llama-server \
    --model hermes-4-70b-q4_k_m.gguf \
    --threads 16 \
    --ctx-size 8192 \
    --mlock \
    --no-mmap
# Apple Silicon (Metal acceleration, all layers on the GPU)
# -ngl 99: load all layers to the Metal GPU
# --ctx-size 65536: M3 Ultra handles large context
./llama-server \
    --model hermes-4-70b-q4_k_m.gguf \
    -ngl 99 \
    --threads 8 \
    --ctx-size 65536
43.6 Hardware Selection Decision Guide
| Budget | Recommended Config | Model | Use Case |
|---|---|---|---|
| <$700 | CPU + 64GB DDR5 | Hermes 7B Q4 (slow) | Learning, experiments |
| $700–2,000 | RTX 3090 (24GB) | 13B Q4, 7B FP16 | Personal development |
| $2,000–5,000 | RTX 4090 × 2 (48GB) | 34B Q4, 13B FP16 | Small team |
| $5,000–9,000 | M3 Ultra 192GB | 70B Q4 (25 t/s) | Efficient personal/team |
| $9,000–25,000 | A100 80GB × 2 | 70B Q8 (high quality) | Enterprise small-scale |
| $25,000+ | H100 80GB × 4+ | 70B FP16 (maximum) | Enterprise production |
Chapter Summary
The master formula for VRAM sizing:
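Total VRAM (GB) ≈ params(B) × precision_bytes × 0.931 × 1.15
+ context_K × batch_size × 0.25 (FP16 KV cache; 0.125 for INT8)
+ 1.5 GB (fixed overhead)
Here 0.931 ≈ 10⁹ / 1024³, matching the / 1024³ conversion in the weights formula of 43.1.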
Key conclusions:
- Hermes 4 70B sweet spot: Q4 quantization with 64K context requires ~50–55 GB VRAM
- Best consumer option: Apple M3 Ultra (unified memory) or dual RTX 4090
- Interactive 70B inference without a discrete GPU is practical only on Apple Silicon (with Metal acceleration); x86 CPUs are suitable only for batch workloads or smaller models
- Multi-GPU: prefer tensor parallelism with NVLink; fall back to pipeline parallelism over PCIe
Review Questions
- Why does KV Cache memory scale linearly with context length rather than with parameter count? What does this imply when choosing between "large model, short context" and "small model, long context"?
- Apple M3 Ultra uses a unified memory architecture where CPU and GPU share the same memory pool. Compare this against discrete GDDR6X in terms of bandwidth, latency, and practical inference performance for a 70B model.
- With a $4,000 budget and a requirement to serve 10 concurrent users on Hermes 70B, would you choose one high-memory machine or multiple smaller machines? Justify your architecture decision with capacity calculations.