Chapter 45

vLLM High-Concurrency Inference Service

Introduction

Ollama answers "can it run?"; vLLM answers "can it hold up?" When your Hermes Agent needs to serve 50, 500, or 5,000 concurrent requests, Ollama will force users into long queues. vLLM's PagedAttention mechanism can multiply throughput by 10–30x. This chapter takes you from first principles to production configuration for high-concurrency Hermes inference.


45.1 Installation and GPU Environment Setup

System Requirements

Component         Minimum             Recommended
CUDA              11.8                12.1+
Python            3.8                 3.10–3.11
GPU Architecture  Ampere (30-series)  Hopper (H100)
Driver            520.xx              550.xx+

Installation

# Create a virtual environment
python3.11 -m venv venv-vllm
source venv-vllm/bin/activate

# Install vLLM (auto-detects CUDA version)
pip install vllm

# CUDA 12.1 optimized build
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Verify installation
python -c "import vllm; print(vllm.__version__)"

# Verify GPU availability
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f'  GPU {i}: {p.name}, {p.total_memory // 1024**3} GB')
"

Docker Deployment

# Pull the official OpenAI-compatible server image
docker pull vllm/vllm-openai:latest

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model NousResearch/Hermes-3-Llama-3.1-70B \
    --dtype bfloat16 \
    --tensor-parallel-size 2 \
    --max-model-len 65536

45.2 Launching Hermes 4 with vLLM

Basic Single-GPU Launch

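# From HuggingFace Hub (auto-download)
# Note: 70B weights in bfloat16 occupy roughly 140 GB (70B parameters x 2 bytes),
# so this exact command only fits on one GPU with a smaller or quantized model.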
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B \
    --dtype bfloat16 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90 \
    --port 8000 \
    --host 0.0.0.0

# From local model directory
python -m vllm.entrypoints.openai.api_server \
    --model /models/hermes-4-70b \
    --dtype bfloat16 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.92 \
    --port 8000

Production Launch Script

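#!/bin/bash
# start_vllm_hermes.sh

set -euo pipefail

MODEL_PATH="${MODEL_PATH:-NousResearch/Hermes-3-Llama-3.1-70B}"
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
echo "Detected ${GPU_COUNT} GPUs"

# Arguments live in an array because a comment placed between
# backslash-continued lines would cut the command short.
ARGS=(
    --model "$MODEL_PATH"

    # Precision
    --dtype bfloat16

    # Context and memory
    --max-model-len 65536
    --gpu-memory-utilization 0.92
    --max-num-seqs 256
    --max-num-batched-tokens 65536

    # Multi-GPU parallelism
    --tensor-parallel-size "$GPU_COUNT"

    # Scheduling (--scheduler-delay-factor and --use-v2-block-manager come from
    # older vLLM releases; check --help on your version before relying on them)
    --scheduler-delay-factor 0.1
    --use-v2-block-manager
    --enable-prefix-caching

    # API
    --host 0.0.0.0
    --port 8000

    # Logging (HTTP server log level)
    --uvicorn-log-level info
)

# Pass --api-key only when VLLM_API_KEY is set and non-empty
if [[ -n "${VLLM_API_KEY:-}" ]]; then
    ARGS+=(--api-key "$VLLM_API_KEY")
fi

python -m vllm.entrypoints.openai.api_server "${ARGS[@]}" \
    2>&1 | tee /var/log/vllm-hermes.log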

GGUF Loading (vLLM 0.5+)

python -m vllm.entrypoints.openai.api_server \
    --model /models/hermes-4-70b-q4_k_m.gguf \
    --tokenizer NousResearch/Hermes-3-Llama-3.1-70B \
    --quantization gguf \
    --dtype float16 \
    --max-model-len 32768 \
    --port 8000

45.3 PagedAttention: How It Works

The Problem with Traditional KV Cache

Traditional inference frameworks pre-allocate contiguous memory blocks for each sequence:

Traditional (pre-allocated):

Sequence A: [████████████████████░░░░░░░░░░░░]
             800 tokens used      1248 tokens wasted (61%)

Sequence B: [████████████████████████████░░░░]
             1400 tokens used     648 tokens wasted (32%)

Sequence C: WAITING, even though fragmented free memory exists

Fragmentation wastes 30–60% of VRAM through a combination of internal fragmentation (allocated but unused space) and external fragmentation (small free blocks unusable for new allocations).

PagedAttention: Virtual Paging for KV Cache

PagedAttention borrows the operating system concept of virtual memory paging:

Physical KV Blocks (fixed size, e.g. 16 tokens/block):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │  Physical memory
└────┴────┴────┴────┴────┴────┴────┴────┘

Logical view per sequence (non-contiguous physical blocks):
Seq A: [B0] → [B3] → [B7]              (3 blocks)
Seq B: [B1] → [B2] → [B4] → [B6]      (4 blocks)
Seq C: [B5]                             (1 block, allocated on demand)

Core benefits:

  1. Near-zero internal fragmentation: blocks are allocated in 16-token increments, so each sequence wastes at most one partially filled block
  2. Zero external fragmentation: non-contiguous blocks appear contiguous through page table mapping
  3. Memory sharing: multiple requests sharing the same prompt prefix share their KV Cache blocks
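
The block-table idea above can be sketched in a few lines of Python. This is a toy model for illustration only, not vLLM's implementation: the names BlockAllocator, Sequence, and wasted_slots are hypothetical, and the 16-token block size is taken from the example above.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block, matching the example above


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop(0)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self) -> int:
        # Unused token slots in the final, partially filled block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()

# Interleave two sequences so their physical blocks end up non-contiguous.
for _ in range(20):
    seq_a.append_token(allocator)
    seq_b.append_token(allocator)

print(seq_a.block_table, seq_a.wasted_slots())
print(seq_b.block_table, seq_b.wasted_slots())
```

After 20 tokens each, sequence A holds blocks 0 and 2 while sequence B holds blocks 1 and 3. Neither mapping is contiguous, and each sequence wastes fewer slots than one block.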

Throughput Impact

def throughput_comparison():
    gpu_vram_gb = 80       # A100 80GB
    model_size_gb = 40     # Hermes 70B Q4
    kv_gb_per_1k_tokens = 0.25  # assumed KV cache cost per 1K tokens

    available_for_kv = gpu_vram_gb - model_size_gb  # 40 GB

    # Traditional: must pre-allocate max_len per sequence
    max_len_k = 65536 / 1024  # 64K-token context window
    traditional_seq_gb = max_len_k * kv_gb_per_1k_tokens  # 16 GB per sequence
    traditional_max = int(available_for_kv / traditional_seq_gb)  # 2

    # PagedAttention: allocate only what's actually used
    avg_actual_seq_len_k = 2.0  # typical average 2K tokens
    paged_seq_gb = avg_actual_seq_len_k * kv_gb_per_1k_tokens  # 0.5 GB per sequence
    paged_max = int(available_for_kv / paged_seq_gb)  # 80

    print(f"Traditional max concurrent sequences: {traditional_max}")
    print(f"PagedAttention max concurrent sequences: {paged_max}")
    print(f"Throughput multiplier: {paged_max / traditional_max:.0f}x")

throughput_comparison()
# Traditional max concurrent sequences: 2
# PagedAttention max concurrent sequences: 80
# Throughput multiplier: 40x
# This is an idealized concurrency ceiling; real throughput gains are typically smaller.
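
The per-token KV cache cost used in the estimate above can be sanity-checked from the model architecture. The sketch below assumes the published Llama 3.1 70B configuration (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and a 16-bit cache. The helper name kv_cache_bytes_per_token is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed architecture values for a Llama 3.1 70B based model.
per_token = kv_cache_bytes_per_token(
    num_layers=80, num_kv_heads=8, head_dim=128,
    dtype_bytes=2,  # fp16/bf16 KV cache
)
gib_per_1k_tokens = per_token * 1024 / 1024**3

print(f"{per_token} bytes per token")
print(f"{gib_per_1k_tokens:.4f} GiB per 1K tokens")
```

Under these assumptions the result is 327,680 bytes per token, or about 0.31 GiB per 1K tokens, which is in the same range as the rough 0.25 GB figure used in the estimate.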

Prefix Caching

vLLM's Prefix Caching allows multiple requests to share KV Cache for identical prompt prefixes:

# Three requests that share the same long system prompt:
SYSTEM = "You are Hermes, an expert assistant..." * 50  # long system prompt
prompts = [SYSTEM + q for q in ("Question A", "Question B", "Question C")]

# Without Prefix Caching: the system prompt's KV cache is computed 3 times
# With Prefix Caching: computed once, then shared across all three requests

Enable it:

python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B \
    --enable-prefix-caching \
    --max-model-len 65536
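
One way to picture how identical prefixes end up sharing blocks is to key each full block by a hash of its tokens plus everything before it. The sketch below is a simplified illustration under that assumption, not vLLM's actual code. The names block_hashes, blocks_reused, and cache are hypothetical, and the block size is shrunk to 4 tokens for readability.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with the hash of all preceding blocks."""
    hashes = []
    prev = ""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes


cache: dict[str, int] = {}  # block hash -> physical block id (toy stand-in)


def blocks_reused(token_ids: list[int]) -> int:
    """Count cached blocks for this request and cache the ones that are new."""
    reused = 0
    for h in block_hashes(token_ids):
        if h in cache:
            reused += 1  # chained hashes mean hits always form a leading run
        else:
            cache[h] = len(cache)
    return reused


system_prompt = list(range(12))  # 3 full blocks of shared prefix
req1 = system_prompt + [100, 101, 102, 103]
req2 = system_prompt + [200, 201, 202, 203]

print(blocks_reused(req1))  # first request: nothing cached yet
print(blocks_reused(req2))  # second request: the shared prefix blocks hit
```

The first request reuses 0 blocks and the second reuses 3, the three blocks that make up the shared system prompt. Because each hash includes its predecessor, two requests can only share a block when everything before it also matches.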

45.4 Throughput vs Latency Configuration

Configuration Profiles

Profile           Throughput  P50 Latency  P99 Latency  Best For
Throughput-first  Very High   High         Very High    Batch APIs, async processing
Latency-first     Low         Very Low     Low          Interactive chat, IDE plugins
Balanced          High        Medium       Medium-High  General API service

Throughput-First Configuration

# --scheduler-delay-factor 0.5: wait briefly so more requests join each batch
# (a comment after a trailing backslash would break the line continuation)
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B \
    --max-num-seqs 512 \
    --max-num-batched-tokens 131072 \
    --scheduler-delay-factor 0.5 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.95

Latency-First Configuration

# --scheduler-delay-factor 0.0: process immediately, no waiting
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --scheduler-delay-factor 0.0 \
    --gpu-memory-utilization 0.85

45.5 Load Testing with wrk and ab

wrk Benchmark

# Install
apt-get install wrk    # Ubuntu
brew install wrk       # macOS

# Create POST request script
cat > vllm_wrk.lua << 'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.headers["Authorization"] = "Bearer your-api-key"
wrk.body = [[{
  "model": "NousResearch/Hermes-3-Llama-3.1-70B",
  "prompt": "Explain the core principle of quantum computing in one sentence.",
  "max_tokens": 100,
  "temperature": 0.1
}]]
EOF

# Run: 30 seconds, 10 threads, 100 concurrent connections
wrk -t 10 -c 100 -d 30s -s vllm_wrk.lua http://localhost:8000/v1/completions

Python Async Benchmark

import asyncio, time, httpx, statistics

async def send_request(client, prompt):
    """Stream one completion; return (time to first token, total latency) in seconds."""
    start = time.perf_counter()
    first_token = None
    async with client.stream(
        "POST", "http://localhost:8000/v1/completions",
        json={"model": "NousResearch/Hermes-3-Llama-3.1-70B",
              "prompt": prompt, "max_tokens": 200, "stream": True},
        headers={"Authorization": "Bearer your-key"}
    ) as r:
        r.raise_for_status()  # fail fast instead of timing an error response
        async for line in r.aiter_lines():
            if line.startswith("data: ") and line != "data: [DONE]":
                if first_token is None:
                    first_token = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_token or total, total

async def benchmark(concurrency=10, total=50):
    async with httpx.AsyncClient(timeout=180) as client:
        sem = asyncio.Semaphore(concurrency)  # caps in-flight requests
        async def bounded():
            async with sem:
                return await send_request(client, "Write a Python sorting function.")

        start = time.perf_counter()
        results = await asyncio.gather(*[bounded() for _ in range(total)])
        elapsed = time.perf_counter() - start

        ttfts = sorted(r[0] for r in results)
        totals = [r[1] for r in results]
        # With only 50 samples, the p99 index lands on the slowest request
        p99 = ttfts[int(len(ttfts) * 0.99)]
        print(f"Concurrency={concurrency}: "
              f"Throughput={total/elapsed:.1f} RPS, "
              f"TTFT p50={statistics.median(ttfts)*1000:.0f}ms, "
              f"TTFT p99={p99*1000:.0f}ms, "
              f"Latency p50={statistics.median(totals)*1000:.0f}ms")

asyncio.run(benchmark())

GPU Monitoring

# Real-time monitoring
watch -n 1 nvidia-smi

# Detailed metrics (power/temperature, utilization, memory, clocks, ECC errors, PCIe throughput)
nvidia-smi dmon -s pumcet -d 1

# Python monitor
python3 -c "
import subprocess, time
while True:
    r = subprocess.run(['nvidia-smi',
        '--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu',
        '--format=csv,noheader,nounits'], capture_output=True, text=True)
    for i, line in enumerate(r.stdout.strip().split('\n')):
        u, mu, mt, t = line.split(', ')
        print(f'GPU {i}: util={u}%, mem={mu}/{mt}MB, temp={t}C')
    print(); time.sleep(2)
"

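GPU counters alone do not show how many requests are running or queued. The vLLM server also publishes Prometheus metrics over HTTP, and a small parser is enough to read them from a script. The sketch below is a minimal example under stated assumptions: the /metrics URL matches the launch commands above, the sample metric names and labels are illustrative and may differ between vLLM versions, and the parser ignores optional timestamps. The function names parse_prometheus and fetch_metrics are hypothetical.

```python
import urllib.request


def parse_prometheus(text: str) -> dict[str, float]:
    """Parse Prometheus text-format lines into {metric_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    """Scrape a running server (assumes the default port used in this chapter)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_prometheus(resp.read().decode())


# Illustrative sample so the parser can be tried without a running server.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="hermes"} 12.0
vllm:num_requests_waiting{model_name="hermes"} 3.0
"""

parsed = parse_prometheus(SAMPLE)
for name, value in parsed.items():
    print(f"{name} = {value}")
```

Against a live server, calling fetch_metrics() in place of the sample string returns the same kind of dictionary. A steadily growing waiting count is a sign that --max-num-seqs or the number of instances is too low for the load.
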
45.6 Integrating with Hermes Agent

OpenAI-Compatible Client

vLLM's OpenAI-compatible API lets you swap backends by changing only the base URL, API key, and model name; the rest of the client code stays the same:

from openai import AsyncOpenAI
import asyncio

# Point the OpenAI SDK at vLLM
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)

async def chat(messages: list[dict]) -> str:
    response = await client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-70B",
        messages=messages,
        max_tokens=2048,
        temperature=0.1,
    )
    return response.choices[0].message.content

async def streaming_chat(messages: list[dict]) -> str:
    full = ""
    stream = await client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-70B",
        messages=messages,
        max_tokens=2048,
        temperature=0.1,
        stream=True
    )
    async for chunk in stream:
        # Some chunks (for example a final usage chunk) carry no choices
        if chunk.choices and chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            print(token, end="", flush=True)
            full += token
    print()
    return full

Multi-Instance Load Balancing (Nginx)

upstream vllm_hermes {
    least_conn;   # Best for long-running LLM requests
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    keepalive 32;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_hermes;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;         # Required for streaming
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}

Chapter Summary

vLLM is the production inference engine of choice for Hermes deployments:

Feature                vLLM Advantage
Concurrent throughput  PagedAttention delivers near-zero fragmentation; 10–30x vs traditional
API compatibility      OpenAI-compatible; zero migration cost
Built-in features      Prefix caching, quantization, tensor parallelism out of the box
Observability          Built-in Prometheus metrics

Three critical knobs: --max-num-seqs (concurrency ceiling), --scheduler-delay-factor (batch aggregation wait), --gpu-memory-utilization (VRAM aggressiveness).

Review Questions

  1. PagedAttention uses a fixed block size (default: 16 tokens per block). What are the trade-offs of choosing a smaller versus larger block size? What workload characteristics should guide this choice?

  2. In which Hermes Agent scenario would Prefix Caching deliver the greatest benefit? Describe a specific business use case and estimate a realistic cache hit rate.

  3. Under the "throughput-first" configuration, what happens to service behavior when GPU utilization reaches 95%? Design a circuit-breaker mechanism that degrades gracefully under overload rather than failing catastrophically.
