Chapter 45: vLLM High-Concurrency Inference Service
Introduction
Ollama answers "can it run?"; vLLM answers "can it hold up?" When your Hermes Agent needs to serve 50, 500, or 5,000 concurrent requests, Ollama will force users into long queues. vLLM's PagedAttention mechanism can raise throughput by 10-30x. This chapter takes you from first principles to production configuration for high-concurrency Hermes inference.
45.1 Installation and GPU Environment Setup
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CUDA | 11.8 | 12.1+ |
| Python | 3.8 | 3.10-3.11 |
| GPU Architecture | Ampere (30-series) | Hopper (H100) |
| Driver | 520.xx | 550.xx+ |
Installation
# Create a virtual environment
python3.11 -m venv venv-vllm
source venv-vllm/bin/activate
# Install vLLM (auto-detects CUDA version)
pip install vllm
# CUDA 12.1 optimized build
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
# Verify installation
python -c "import vllm; print(vllm.__version__)"
# Verify GPU availability
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f' GPU {i}: {p.name}, {p.total_memory // 1024**3} GB')
"
Docker (Recommended for Production)
docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model NousResearch/Hermes-3-Llama-3.1-70B \
--dtype bfloat16 \
--tensor-parallel-size 2 \
--max-model-len 65536
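Loading a 70B model can take several minutes, so clients should not be pointed at the container until it reports ready. The following is a minimal sketch of a readiness poll, assuming the server's `/health` endpoint is reachable on the published port; the helper names `http_ok` and `wait_until_ready` are illustrative and not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, interval=5.0, sleep=time.sleep):
    """Poll `url` until `probe` succeeds and return the number of probes used.

    Raises TimeoutError if the server never reports healthy.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(interval)
    raise TimeoutError(f"{url} not healthy after {attempts} attempts")


if __name__ == "__main__":
    # Assumption: the container above publishes port 8000 on localhost.
    used = wait_until_ready("http://localhost:8000/health")
    print(f"vLLM ready after {used} probe(s)")
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without a running server.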
45.2 Launching Hermes 4 with vLLM
Basic Single-GPU Launch
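# From HuggingFace Hub (auto-download)
# Note: 70B weights in bfloat16 occupy roughly 140 GB, so a true single-GPU launch needs a quantized or smaller checkpoint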
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-3-Llama-3.1-70B \
--dtype bfloat16 \
--max-model-len 65536 \
--gpu-memory-utilization 0.90 \
--port 8000 \
--host 0.0.0.0
# From local model directory
python -m vllm.entrypoints.openai.api_server \
--model /models/hermes-4-70b \
--dtype bfloat16 \
--max-model-len 65536 \
--gpu-memory-utilization 0.92 \
--port 8000
Production Launch Script
#!/bin/bash
# start_vllm_hermes.sh
set -euo pipefail
MODEL_PATH="${MODEL_PATH:-NousResearch/Hermes-3-Llama-3.1-70B}"
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
echo "Detected ${GPU_COUNT} GPUs"
# Build the arguments in an array: a comment placed between
# backslash-continued lines would end the command early.
ARGS=(
  --model "$MODEL_PATH"
  # Precision
  --dtype bfloat16
  # Context and memory
  --max-model-len 65536
  --gpu-memory-utilization 0.92
  --max-num-seqs 256
  --max-num-batched-tokens 65536
  # Multi-GPU parallelism (the GPU count must evenly divide the model's attention heads)
  --tensor-parallel-size "$GPU_COUNT"
  # Scheduling (these flags are version-dependent; confirm with --help on your build)
  --scheduler-delay-factor 0.1
  --use-v2-block-manager
  --enable-prefix-caching
  # API
  --host 0.0.0.0
  --port 8000
  # Logging
  --uvicorn-log-level info
)
# Pass --api-key only when a key is configured
if [ -n "${VLLM_API_KEY:-}" ]; then
  ARGS+=(--api-key "$VLLM_API_KEY")
fi
python -m vllm.entrypoints.openai.api_server "${ARGS[@]}" \
  2>&1 | tee /var/log/vllm-hermes.log
GGUF Loading (vLLM 0.5+)
python -m vllm.entrypoints.openai.api_server \
--model /models/hermes-4-70b-q4_k_m.gguf \
--tokenizer NousResearch/Hermes-3-Llama-3.1-70B \
--quantization gguf \
--dtype float16 \
--max-model-len 32768 \
--port 8000
45.3 PagedAttention: How It Works
The Problem with Traditional KV Cache
Traditional inference frameworks pre-allocate contiguous memory blocks for each sequence:
Traditional (pre-allocated):
Sequence A: [████████████░░░░░░░░░░░░░░░░░░░░]
800 tokens used, 1248 tokens wasted (61%)
Sequence B: [██████████████████████░░░░░░░░░░]
1400 tokens used, 648 tokens wasted (32%)
Sequence C: WAITING, even though fragmented free memory exists
Fragmentation wastes 30-60% of VRAM through a combination of internal fragmentation (allocated but unused space) and external fragmentation (small free blocks unusable for new allocations).
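The internal-fragmentation figures in the diagram follow from simple arithmetic. A short sketch, assuming each sequence reserves a 2048-token slot (the sum of the used and wasted counts shown above); `wasted_fraction` is an illustrative helper name.

```python
def wasted_fraction(used_tokens: int, reserved_tokens: int) -> float:
    """Share of a pre-allocated KV slot that holds no tokens."""
    if not 0 <= used_tokens <= reserved_tokens:
        raise ValueError("used_tokens must be between 0 and reserved_tokens")
    return (reserved_tokens - used_tokens) / reserved_tokens


SLOT = 2048  # assumed tokens reserved per sequence (800 + 1248 = 1400 + 648)

for name, used in (("A", 800), ("B", 1400)):
    print(f"Sequence {name}: {wasted_fraction(used, SLOT):.1%} of the slot is unused")
# Sequence A: 60.9% of the slot is unused
# Sequence B: 31.6% of the slot is unused
```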
PagedAttention: Virtual Paging for KV Cache
PagedAttention borrows the operating system concept of virtual memory paging:
Physical KV Blocks (fixed size, e.g. 16 tokens/block):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │ Physical memory
└────┴────┴────┴────┴────┴────┴────┴────┘
Logical view per sequence (non-contiguous physical blocks):
Seq A: [B0] → [B3] → [B7] (3 blocks)
Seq B: [B1] → [B2] → [B4] → [B6] (4 blocks)
Seq C: [B5] (1 block, allocated on demand)
Core benefits:
- Near-zero internal fragmentation: blocks allocated in precise 16-token increments
- Zero external fragmentation: non-contiguous blocks appear contiguous through page table mapping
- Memory sharing: multiple requests sharing the same prompt prefix share their KV Cache blocks
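The block-table idea can be illustrated with a toy allocator. This is a simplified sketch of the concept, not vLLM's implementation: the class `BlockAllocator` and its methods are hypothetical, and it tracks only block ownership, not the KV tensors themselves.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: a free list plus per-sequence block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))       # physical block ids
        self.tables: dict[str, list[int]] = {}    # seq id -> blocks in logical order
        self.lengths: dict[str, int] = {}         # seq id -> tokens stored

    def append_tokens(self, seq: str, n: int) -> None:
        """Grow `seq` by n tokens, taking new physical blocks only when needed."""
        table = self.tables.get(seq, [])
        new_len = self.lengths.get(seq, 0) + n
        needed = -(-new_len // self.block_size) - len(table)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("out of KV blocks")
        table.extend(self.free.pop(0) for _ in range(needed))
        self.tables[seq] = table
        self.lengths[seq] = new_len

    def release(self, seq: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq))
        del self.lengths[seq]

    def wasted_tokens(self, seq: str) -> int:
        """Unused token slots, which can only occur in the last block."""
        return len(self.tables[seq]) * self.block_size - self.lengths[seq]


alloc = BlockAllocator(num_blocks=8)
alloc.append_tokens("A", 20)   # A takes blocks 0 and 1
alloc.append_tokens("B", 40)   # B takes blocks 2, 3, 4
alloc.append_tokens("A", 20)   # A grows into block 5: no longer contiguous
print(alloc.tables["A"], alloc.wasted_tokens("A"))
```

Because waste is confined to each sequence's final block, it is bounded by `block_size - 1` tokens per sequence regardless of the configured maximum length.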
Throughput Impact
def throughput_comparison():
    gpu_vram_gb = 80 # A100 80GB
    model_size_gb = 40 # Hermes 70B Q4
    available_for_kv = gpu_vram_gb - model_size_gb # 40 GB
    # Both estimates assume roughly 0.25 GB of KV cache per 1K tokens
    # Traditional: must pre-allocate max_len per sequence
    traditional_seq_gb = 65536 / 1024 * 0.25 # ~16 GB per sequence
    traditional_max = int(available_for_kv / traditional_seq_gb) # ~2
    # PagedAttention: allocate only what's actually used
    avg_actual_seq_len_k = 2.0 # typical average 2K tokens
    paged_seq_gb = avg_actual_seq_len_k * 0.25 # ~0.5 GB per sequence
    paged_max = int(available_for_kv / paged_seq_gb) # ~80
    print(f"Traditional max concurrent sequences: {traditional_max}")
    print(f"PagedAttention max concurrent sequences: {paged_max}")
    print(f"Throughput multiplier: {paged_max / traditional_max:.0f}x")
throughput_comparison()
# Traditional: 2 sequences
# PagedAttention: ~80 sequences
# Throughput multiplier: 40x
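The per-sequence figure of about 0.25 GB per 1K tokens can be sanity-checked from the model architecture. The sketch below assumes the published Llama-3.1-70B configuration (80 layers, 8 key/value heads through grouped-query attention, head dimension 128) and a 16-bit KV cache; a different model or cache dtype changes the result.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes stored for one token across all layers."""
    # Factor 2: each layer keeps a separate key tensor and value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed architecture values for Llama-3.1-70B with a 16-bit cache.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_1k_gib = per_token * 1024 / 1024**3

print(f"{per_token} bytes per token, {per_1k_gib:.4f} GiB per 1K tokens")
# 327680 bytes per token, 0.3125 GiB per 1K tokens
```

Under these assumptions the cost is about 0.31 GiB per 1K tokens, the same order of magnitude as the 0.25 GB used in the estimate above.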
Prefix Caching
vLLM's Prefix Caching allows multiple requests to share KV Cache for identical prompt prefixes:
# Three requests that share one long system prompt:
SYSTEM = "You are Hermes, an expert assistant..." * 50 # long system prompt
prompts = [SYSTEM + q for q in ("Question 1", "Question 2", "Question 3")]
# Without Prefix Caching: the system prompt's KV entries are computed 3 times
# With Prefix Caching: computed once, then shared by all three requests
Enable it:
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-3-Llama-3.1-70B \
--enable-prefix-caching \
--max-model-len 65536
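One way to picture block-level prefix sharing is a cache keyed by a chained hash of each full block, so that a key identifies both the block's tokens and everything before them. The sketch below is a conceptual model only: `block_keys` and `PrefixCache` are hypothetical names, token ids are made up, and eviction is ignored.

```python
import hashlib

BLOCK = 16  # tokens per block, matching the block size used earlier


def block_keys(tokens, block=BLOCK):
    """Chain-hash each full block; a trailing partial block is not cacheable."""
    keys, prev = [], b""
    full = len(tokens) - len(tokens) % block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        keys.append(prev)
    return keys


class PrefixCache:
    """Counts how many blocks are computed versus reused across requests."""

    def __init__(self):
        self.blocks = set()
        self.computed = 0
        self.reused = 0

    def admit(self, tokens):
        for key in block_keys(tokens):
            if key in self.blocks:
                self.reused += 1
            else:
                self.blocks.add(key)
                self.computed += 1


system_prompt = list(range(160))            # 10 full blocks of shared prefix
cache = PrefixCache()
for request_id in range(3):
    user_part = [1000 + request_id] * 20    # differs per request
    cache.admit(system_prompt + user_part)

print(f"computed={cache.computed}, reused={cache.reused}")
# computed=13, reused=20
```

Without sharing, the three requests would compute 33 blocks; here the 10 system-prompt blocks are computed once and reused twice.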
45.4 Throughput vs Latency Configuration
Configuration Profiles
| Profile | Throughput | P50 Latency | P99 Latency | Best For |
|---|---|---|---|---|
| Throughput-first | Very High | High | Very High | Batch APIs, async processing |
| Latency-first | Low | Very Low | Low | Interactive chat, IDE plugins |
| Balanced | High | Medium | Medium-High | General API service |
Throughput-First Configuration
# --scheduler-delay-factor 0.5: wait briefly so more requests join each batch
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-3-Llama-3.1-70B \
--max-num-seqs 512 \
--max-num-batched-tokens 131072 \
--scheduler-delay-factor 0.5 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95
Latency-First Configuration
# --scheduler-delay-factor 0.0: schedule new prompts immediately, no waiting
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-3-Llama-3.1-70B \
--max-num-seqs 32 \
--max-num-batched-tokens 8192 \
--scheduler-delay-factor 0.0 \
--gpu-memory-utilization 0.85
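The effect of the delay factor in the two profiles can be reasoned about with a simplified model: new prompts are held back until the oldest waiting request has waited longer than the delay factor multiplied by the latency of the previous prompt step, unless nothing is running. The function below is a paraphrase of that rule for illustration, with hypothetical parameter names; it is not the scheduler's source code, and behavior may differ between vLLM versions.

```python
def prefill_may_start(now: float, earliest_arrival: float,
                      last_prompt_latency: float, delay_factor: float,
                      has_running: bool) -> bool:
    """Simplified model of delayed prompt scheduling.

    Returns True when waiting prompts may be scheduled at time `now`.
    """
    if delay_factor <= 0:
        return True                      # latency-first: never hold prompts back
    if not has_running:
        return True                      # do not leave the GPU idle
    waited = now - earliest_arrival
    return waited > delay_factor * last_prompt_latency


# With delay_factor=0.5 and a previous prompt step of 200 ms, the threshold is 100 ms.
print(prefill_may_start(1.05, 1.00, 0.2, 0.5, has_running=True))   # False
print(prefill_may_start(1.15, 1.00, 0.2, 0.5, has_running=True))   # True
```

A larger factor lets more requests accumulate before each prompt batch, which helps throughput at the expense of time to first token.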
45.5 Load Testing with wrk and ab
wrk Benchmark
# Install
apt-get install wrk # Ubuntu
brew install wrk # macOS
# Create POST request script
cat > vllm_wrk.lua << 'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.headers["Authorization"] = "Bearer your-api-key"
wrk.body = [[{
"model": "NousResearch/Hermes-3-Llama-3.1-70B",
"prompt": "Explain the core principle of quantum computing in one sentence.",
"max_tokens": 100,
"temperature": 0.1
}]]
EOF
# Run: 30 seconds, 10 threads, 100 concurrent connections
wrk -t 10 -c 100 -d 30s -s vllm_wrk.lua http://localhost:8000/v1/completions
Python Async Benchmark
import asyncio, time, httpx, statistics
async def send_request(client, prompt):
    start = time.time()
    first_token = None
    async with client.stream(
        "POST", "http://localhost:8000/v1/completions",
        json={"model": "NousResearch/Hermes-3-Llama-3.1-70B",
              "prompt": prompt, "max_tokens": 200, "stream": True},
        headers={"Authorization": "Bearer your-key"}
    ) as r:
        async for line in r.aiter_lines():
            if line.startswith("data: ") and line != "data: [DONE]":
                if first_token is None:
                    first_token = time.time() - start
    return first_token or (time.time() - start), time.time() - start
async def benchmark(concurrency=10, total=50):
    async with httpx.AsyncClient(timeout=180) as client:
        sem = asyncio.Semaphore(concurrency)
        async def bounded():
            async with sem:
                return await send_request(client, "Write a Python sorting function.")
        start = time.time()
        results = await asyncio.gather(*[bounded() for _ in range(total)])
        elapsed = time.time() - start
    ttfts = [r[0] for r in results]
    totals = [r[1] for r in results]
    print(f"Concurrency={concurrency}: "
          f"Throughput={total/elapsed:.1f} RPS, "
          f"TTFT p50={statistics.median(ttfts)*1000:.0f}ms, "
          f"TTFT p99={sorted(ttfts)[int(len(ttfts)*0.99)]*1000:.0f}ms")
asyncio.run(benchmark())
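With only 50 samples, the index-based p99 in the benchmark simply selects the slowest request. If percentiles are reported for several metrics, a small helper keeps the definition explicit. This sketch uses the nearest-rank method, which is one common convention among several; the name `percentile` is illustrative.

```python
import math


def percentile(values, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


latencies_ms = [120, 95, 300, 110, 105, 98, 102, 2500, 115, 101]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
# 105 2500
```

For stable tail estimates, the sample count should be well above `100 / (100 - pct)`; a p99 needs hundreds of requests rather than 50.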
GPU Monitoring
# Real-time monitoring
watch -n 1 nvidia-smi
# Detailed metrics (p: power/temperature, u: utilization, m: memory, c: clocks, e: ECC/PCIe replay errors, t: PCIe throughput)
nvidia-smi dmon -s pumcet -d 1
# Python monitor
python3 -c "
import subprocess, time
while True:
    r = subprocess.run(['nvidia-smi',
        '--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu',
        '--format=csv,noheader,nounits'], capture_output=True, text=True)
    for i, line in enumerate(r.stdout.strip().split('\n')):
        u, mu, mt, t = line.split(', ')
        print(f'GPU {i}: util={u}%, mem={mu}/{mt}MB, temp={t}C')
    print(); time.sleep(2)
"
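GPU counters show hardware load but not queue depth. The server also exposes Prometheus metrics in text format, which can be parsed without extra dependencies. The sketch below is hedged: the metric names in `SAMPLE` follow the `vllm:` prefix used by recent builds but should be checked against your own `/metrics` output, the helper names are illustrative, and the parser assumes lines carry no trailing timestamp.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {sample name with labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blanks, HELP and TYPE lines
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue                          # ignore lines that are not samples
    return samples


def total(samples: dict[str, float], metric: str) -> float:
    """Sum a metric across all label sets."""
    return sum(v for k, v in samples.items()
               if k == metric or k.startswith(metric + "{"))


# Assumed example payload; real output contains many more series.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="hermes"} 12.0
vllm:num_requests_waiting{model_name="hermes"} 3.0
"""

metrics = parse_metrics(SAMPLE)
print(total(metrics, "vllm:num_requests_running"),
      total(metrics, "vllm:num_requests_waiting"))
```

In practice the text would come from an HTTP GET to the server's `/metrics` path; a growing waiting count while the running count stays flat suggests the concurrency ceiling has been reached.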
45.6 Integrating with Hermes Agent
OpenAI-Compatible Client
vLLM's OpenAI-compatible API lets you swap backends by changing only the client's base URL and model name:
from openai import AsyncOpenAI
import asyncio
# Point the OpenAI SDK at vLLM
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)
async def chat(messages: list[dict]) -> str:
    response = await client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-70B",
        messages=messages,
        max_tokens=2048,
        temperature=0.1,
    )
    return response.choices[0].message.content
async def streaming_chat(messages: list[dict]) -> str:
    full = ""
    stream = await client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-70B",
        messages=messages,
        max_tokens=2048,
        temperature=0.1,
        stream=True
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            print(token, end="", flush=True)
            full += token
    print()
    return full
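When an agent fans out many calls at once, capping in-flight requests on the client side keeps it from exceeding the server's concurrency ceiling. A minimal sketch using only `asyncio`; `gather_bounded` is a hypothetical helper, and `fake_chat` stands in for a real call such as `chat` so the example runs without a server.

```python
import asyncio


async def gather_bounded(coro_fn, items, limit: int):
    """Run coro_fn(item) for every item with at most `limit` calls in flight.

    Results are returned in the same order as `items`.
    """
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await coro_fn(item)

    return await asyncio.gather(*(run(item) for item in items))


async def fake_chat(prompt: str) -> str:
    """Stand-in for a real request; sleeps briefly instead of calling the API."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    answers = asyncio.run(gather_bounded(fake_chat, ["a", "b", "c", "d"], limit=2))
    print(answers)
```

A reasonable starting point is to keep `limit` at or below the server's `--max-num-seqs`, so excess requests wait in the client rather than in the server queue.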
Multi-Instance Load Balancing (Nginx)
upstream vllm_hermes {
    least_conn; # Best for long-running LLM requests
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    keepalive 32;
}
server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_hermes;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off; # Required for streaming
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}
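The reason `least_conn` suits LLM traffic is that request durations vary widely, so counting open connections tracks load better than taking turns. The selection rule can be sketched in a few lines; this is an illustration of the policy with a hypothetical `LeastConnBalancer` class, not Nginx's implementation, and it ignores server weights.

```python
class LeastConnBalancer:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self) -> str:
        # Ties go to the backend listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        self.active[backend] -= 1


lb = LeastConnBalancer(["127.0.0.1:8000", "127.0.0.1:8001", "127.0.0.1:8002"])
first_three = [lb.acquire() for _ in range(3)]   # one request lands on each backend
lb.release("127.0.0.1:8001")                     # the short request finishes first
next_pick = lb.acquire()                         # so the next request goes there
print(first_three, next_pick)
```

Round-robin would have sent the fourth request to port 8000 even though it is still busy with a long generation.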
Chapter Summary
vLLM is the production inference engine of choice for Hermes deployments:
| Feature | vLLM Advantage |
|---|---|
| Concurrent throughput | PagedAttention delivers near-zero fragmentation; 10-30x vs traditional |
| API compatibility | OpenAI-compatible; zero migration cost |
| Built-in features | Prefix caching, quantization, tensor parallelism out of the box |
| Observability | Built-in Prometheus metrics |
Three critical knobs: --max-num-seqs (concurrency ceiling), --scheduler-delay-factor (batch aggregation wait), --gpu-memory-utilization (VRAM aggressiveness).
Review Questions
- PagedAttention uses a fixed block size (default: 16 tokens per block). What are the trade-offs of choosing a smaller versus larger block size? What workload characteristics should guide this choice?
- In which Hermes Agent scenario would Prefix Caching deliver the greatest benefit? Describe a specific business use case and estimate a realistic cache hit rate.
- Under the "throughput-first" configuration, what happens to service behavior when GPU utilization reaches 95%? Design a circuit-breaker mechanism that degrades gracefully under overload rather than failing catastrophically.