Chapter 44: Ollama Local Deployment and API Integration
Introduction
Ollama is one of the most widely used tools for local LLM deployment in 2025-2026. It hides most of the complexity of model downloads, quantization formats, and GPU runtime setup, so a model such as Hermes 70B can be started with a single command. This chapter covers installation on macOS, Linux, and Windows, pulling and configuring Hermes models, the core REST API endpoints, and an integration with Hermes Agent.
44.1 Installing Ollama
macOS
# Option 1: Download the installer (recommended)
# Visit https://ollama.com/download and download the .pkg file
# Option 2: Homebrew
brew install ollama
# Verify installation
ollama --version
# Start the server manually if it is not already running (the macOS app starts it automatically)
ollama serve
# Confirm the API is responsive
curl http://localhost:11434/api/version
Linux
# One-line install script (Ubuntu 20.04+, Debian 11+, CentOS 8+)
curl -fsSL https://ollama.com/install.sh | sh
# The script automatically:
# 1. Detects your CUDA version
# 2. Installs compatible CUDA libraries if missing
# 3. Configures a systemd service
# 4. Creates an 'ollama' system user
# Enable auto-start
sudo systemctl enable ollama
sudo systemctl start ollama
# Monitor logs
journalctl -u ollama -f
# Confirm GPU is detected
nvidia-smi
# You should see an 'ollama' process using VRAM after the first model load
Windows
# Option 1: Download OllamaSetup.exe (recommended)
# https://ollama.com/download/windows
# Option 2: winget
winget install Ollama.Ollama
# Start the service
ollama serve
# Verify
Invoke-WebRequest -Uri "http://localhost:11434/api/version"
CUDA Compatibility
| CUDA Version | Min Driver | Ollama Version | Status |
|---|---|---|---|
| CUDA 12.4 | 550.xx | 0.4.x+ | Full support |
| CUDA 12.1 | 530.xx | 0.3.x+ | Supported |
| CUDA 11.8 | 520.xx | 0.2.x+ | Supported (slightly lower performance) |
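The table can be turned into a quick lookup. The sketch below is illustrative only: `MIN_DRIVER` copies the major driver versions from the table above, and `newest_supported_cuda` is a hypothetical helper name, not part of Ollama or the NVIDIA tooling.

```python
# Minimum driver major version per CUDA release, copied from the table above.
MIN_DRIVER = {"12.4": 550, "12.1": 530, "11.8": 520}


def newest_supported_cuda(driver_version: str):
    """Return the newest CUDA release in the table that the driver satisfies, or None."""
    major = int(driver_version.split(".")[0])
    candidates = [cuda for cuda, minimum in MIN_DRIVER.items() if major >= minimum]
    # Compare releases numerically, e.g. "12.4" -> (12, 4)
    return max(candidates, key=lambda v: tuple(map(int, v.split("."))), default=None)


print(newest_supported_cuda("535.104.05"))
```

The driver version string can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.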
44.2 Pulling Hermes Models
Pull Commands
# Hermes 2 Pro (7B: fastest, good for everyday tasks)
ollama pull nous-hermes2-pro:7b
# Hermes 3 (recommended: latest architecture)
ollama pull nous-hermes3:8b
ollama pull nous-hermes3:70b
# List locally available models
ollama list
# Show model details
ollama show nous-hermes3:70b
# Remove a model
ollama rm nous-hermes2-pro:7b
Importing Hermes 4 from GGUF
When the latest Hermes version is not yet in the Ollama Hub, import it manually:
# 1. Download GGUF from HuggingFace
wget https://huggingface.co/NousResearch/Hermes-4-70B-GGUF/resolve/main/hermes-4-70b-q4_k_m.gguf
# 2. Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./hermes-4-70b-q4_k_m.gguf
SYSTEM """You are Hermes, an AI assistant. Be helpful, harmless, and honest."""
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER num_ctx 65536
PARAMETER num_gpu -1
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
EOF
# 3. Import and name the model
ollama create hermes-4-70b:q4 -f Modelfile
# 4. Test it
ollama run hermes-4-70b:q4 "Briefly explain transformer attention."
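To see what the TEMPLATE in the Modelfile produces, the following sketch approximates it in Python. It is an illustration under assumptions: `render_chatml` is a hypothetical helper, and the real Go template may differ in whitespace details.

```python
def render_chatml(messages, system=None):
    """Build a ChatML prompt mirroring the Modelfile TEMPLATE above."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # The trailing assistant header cues the model to generate its reply
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Hi"}], system="You are Hermes.")
print(prompt)
```

Because generation continues from the open assistant header, `<|im_end|>` is the natural stop token, which is why the Modelfile declares it with `PARAMETER stop`.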
Quantization Format Guide
| Format | VRAM (70B) | Quality | Speed | Recommendation |
|---|---|---|---|---|
| FP16 | ~140 GB | Reference | Baseline | Enterprise only |
| Q8_0 | ~80 GB | Excellent | 1.5× | Best quality/VRAM |
| Q5_K_M | ~58 GB | Very Good | 2× | Good balance |
| Q4_K_M | ~50 GB | Good | 2.5× | Recommended default |
| Q4_0 | ~48 GB | Acceptable | 2.7× | Speed priority |
Q4_K_M uses the K-quantization method, which preserves quality significantly better than naive Q4_0 at similar size.
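A rough way to reason about these sizes is parameters × bits per weight ÷ 8. The sketch below is a weights-only estimate; the bits-per-weight values are approximate assumptions, and `weight_size_gb` is a hypothetical helper. KV cache and runtime buffers come on top, so actual VRAM use is higher than these figures.

```python
# Approximate bits per weight; ballpark assumptions, not exact for every model.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q4_0": 4.5}


def weight_size_gb(n_params: float, fmt: str) -> float:
    """Weights-only size in GB (10^9 bytes): parameters x bits / 8."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:7s} ~{weight_size_gb(70e9, fmt):.0f} GB")
```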
44.3 Configuration Reference
Environment Variables (Production)
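# /etc/systemd/system/ollama.service.d/override.conf
# systemd does not support trailing comments, so each comment sits on its own line.
[Service]
# Use GPUs 0 and 1
Environment="CUDA_VISIBLE_DEVICES=0,1"
# Listen on all interfaces (the API has no built-in authentication; restrict access with a firewall or reverse proxy)
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Handle 4 concurrent requests
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep up to 2 models loaded at once
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Unload idle models after 10 minutes
Environment="OLLAMA_KEEP_ALIVE=10m"
# Store models on a large volume
Environment="OLLAMA_MODELS=/data/ollama/models"
# Apply the override (run in a shell; not part of the unit file)
sudo systemctl daemon-reload && sudo systemctl restart ollama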
Key Modelfile Parameters Explained
# GPU configuration
PARAMETER num_gpu -1 # -1 = let Ollama decide how many layers to offload
# 0 = CPU-only inference
# N = offload N layers to GPU
# Context window
PARAMETER num_ctx 65536 # Maximum token context (more = more VRAM)
PARAMETER num_batch 512 # Prefill batch size (larger = faster first token)
# Generation
PARAMETER temperature 0.1 # 0 = deterministic, 1 = creative
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1 # Penalize repeated phrases
# Performance
PARAMETER num_thread 8 # CPU threads (set to physical core count)
PARAMETER use_mmap true # Memory-map model file (saves RAM)
PARAMETER use_mlock false # Lock pages in RAM (prevents swapping, needs root)
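The "more = more VRAM" note on num_ctx can be made concrete with a back-of-the-envelope KV cache estimate. This is a sketch under stated assumptions: the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-3-style 70B layout, the cache is assumed to be stored in 16-bit values, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    """Keys and values: 2 tensors per layer, each num_ctx x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed Llama-3-style 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128
for ctx in (4096, 16384, 65536):
    gib = kv_cache_bytes(80, 8, 128, ctx) / 2**30
    print(f"num_ctx={ctx:6d}: ~{gib:.2f} GiB KV cache")
```

Under these assumptions the cache grows linearly with num_ctx, so a 16× larger window needs 16× the cache memory on top of the model weights.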
44.4 REST API Usage
Endpoint Overview
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/chat` | POST | Multi-turn conversation |
| `/api/generate` | POST | Single-prompt generation |
| `/api/embeddings` | POST | Text embedding vectors |
| `/api/tags` | GET | List local models |
| `/api/show` | POST | Model details |
| `/api/pull` | POST | Download a model |
| `/api/ps` | GET | Currently loaded models |
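As a small illustration of the GET endpoints, the sketch below lists local models via `/api/tags` using only the standard library. It assumes the response has a top-level `models` array whose entries carry `name` and `size` (in bytes); `summarize_models` and `list_local_models` are hypothetical helper names.

```python
import json
import urllib.request


def summarize_models(tags_response: dict) -> list:
    """Turn an /api/tags response into (name, size in GB) pairs."""
    return [
        (m["name"], round(m.get("size", 0) / 1e9, 1))
        for m in tags_response.get("models", [])
    ]


def list_local_models(base_url="http://localhost:11434"):
    """Fetch /api/tags from a running Ollama server (performs a network call)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return summarize_models(json.load(resp))


# Offline demonstration with a hand-written sample payload
sample = {"models": [{"name": "nous-hermes3:8b", "size": 4700000000}]}
print(summarize_models(sample))
```

With a server running, `list_local_models()` returns the same structure for the models actually installed.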
Chat API Example
# chat_example.py
import asyncio
import json

import httpx


async def streaming_chat():
    """Interactive multi-turn chat that streams tokens from /api/chat."""
    messages = []
    async with httpx.AsyncClient(timeout=120) as client:
        while True:
            user_input = input("\nYou: ").strip()
            if user_input.lower() in ["exit", "quit"]:
                break
            messages.append({"role": "user", "content": user_input})
            print("\nHermes: ", end="", flush=True)
            full_response = ""
            async with client.stream(
                "POST",
                "http://localhost:11434/api/chat",
                json={
                    "model": "nous-hermes3:70b",
                    "messages": messages,
                    "stream": True,
                    "options": {
                        "num_ctx": 65536,
                        "temperature": 0.1,
                        "num_predict": 2048,
                    },
                },
            ) as response:
                response.raise_for_status()
                # The stream is newline-delimited JSON: one object per line
                async for line in response.aiter_lines():
                    if not line:
                        continue
                    data = json.loads(line)
                    content = data.get("message", {}).get("content", "")
                    if content:
                        print(content, end="", flush=True)
                        full_response += content
                    if data.get("done"):
                        # Durations are reported in nanoseconds
                        count = data.get("eval_count", 0)
                        duration = data.get("eval_duration", 0)
                        if duration:
                            tps = count / (duration / 1e9)
                            print(f"\n[{count} tokens, {tps:.1f} t/s]")
            # Keep the assistant reply so the next turn has full context
            messages.append({"role": "assistant", "content": full_response})


asyncio.run(streaming_chat())
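The stream handling in the example can be isolated into a pure function, which makes it easy to test without a running server. The sketch below assumes the same chunk shape used above (`message.content`, `done`, `eval_count`, `eval_duration` in nanoseconds); `collect_stream` is a hypothetical helper and the sample lines are hand-written.

```python
import json


def collect_stream(lines):
    """Accumulate streamed /api/chat chunks; return (text, tokens_per_second or None)."""
    text, tps = "", None
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        data = json.loads(line)
        text += data.get("message", {}).get("content", "")
        if data.get("done") and data.get("eval_duration"):
            tps = data["eval_count"] / (data["eval_duration"] / 1e9)
    return text, tps


sample_lines = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    "",
    '{"message": {"role": "assistant", "content": ""}, "done": true, '
    '"eval_count": 2, "eval_duration": 100000000}',
]
text, tps = collect_stream(sample_lines)
print(text, tps)
```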
44.5 Hermes Agent Integration
Agent Configuration (YAML)
# hermes_agent_config.yaml
model:
  provider: ollama
  base_url: "http://localhost:11434"
  model_name: "nous-hermes3:70b"
inference:
  temperature: 0.1
  max_tokens: 4096
  context_window: 65536
  stream: true
connection:
  timeout: 120
  max_retries: 3
agent:
  max_iterations: 20
  tools:
    enabled: true
    format: "chatml" # Hermes uses ChatML tool format
mcp_servers:
  filesystem:
    command: "uvx"
    args: ["mcp-server-filesystem", "/workspace"]
Python Integration
# hermes_ollama_agent.py
import asyncio
import json
from dataclasses import dataclass
from typing import Callable, Optional

import httpx


@dataclass
class OllamaConfig:
    base_url: str = "http://localhost:11434"
    model: str = "nous-hermes3:70b"
    temperature: float = 0.1
    max_tokens: int = 4096
    context_window: int = 65536
    timeout: int = 120


class HermesOllamaAgent:
    def __init__(self, config: OllamaConfig, tools: Optional[list] = None):
        self.config = config
        self.tools = tools or []
        self.conversation_history = []
        self.client = httpx.AsyncClient(
            base_url=config.base_url,
            timeout=httpx.Timeout(config.timeout),
        )

    async def chat(
        self,
        user_message: str,
        system_prompt: Optional[str] = None,
        on_token: Optional[Callable[[str], None]] = None,
    ) -> str:
        # Rebuild the message list each turn: system prompt, history, new message
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.extend(self.conversation_history)
        messages.append({"role": "user", "content": user_message})
        payload = {
            "model": self.config.model,
            "messages": messages,
            # Stream only when the caller wants per-token callbacks
            "stream": on_token is not None,
            "options": {
                "temperature": self.config.temperature,
                "num_predict": self.config.max_tokens,
                "num_ctx": self.config.context_window,
            },
        }
        if self.tools:
            payload["tools"] = self.tools
        response_content = ""
        if on_token:
            async with self.client.stream("POST", "/api/chat", json=payload) as response:
                response.raise_for_status()
                async for line in response.aiter_lines():
                    if line:
                        data = json.loads(line)
                        content = data.get("message", {}).get("content", "")
                        if content:
                            response_content += content
                            on_token(content)
        else:
            response = await self.client.post("/api/chat", json=payload)
            response.raise_for_status()
            response_content = response.json().get("message", {}).get("content", "")
        # Record the exchange only after the request has succeeded
        self.conversation_history.append({"role": "user", "content": user_message})
        self.conversation_history.append({"role": "assistant", "content": response_content})
        return response_content

    def clear_history(self):
        self.conversation_history = []

    async def close(self):
        await self.client.aclose()


# Usage
async def main():
    config = OllamaConfig(model="nous-hermes3:70b")
    agent = HermesOllamaAgent(config)
    system = "You are a professional code analysis assistant. Be concise and precise."
    print("Hermes Agent started (type 'exit' to quit)\n")
    try:
        while True:
            user_input = input("You: ").strip()
            if user_input.lower() == "exit":
                break
            print("Hermes: ", end="", flush=True)
            await agent.chat(user_input, system, on_token=lambda t: print(t, end="", flush=True))
            print()
    finally:
        # Always release the HTTP connection pool
        await agent.close()


asyncio.run(main())
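The agent above forwards `tools` in the payload but does not act on tool calls in the reply. The sketch below shows one way that step might look. It assumes a function-style tool schema and that tool calls arrive under `message.tool_calls` with `arguments` already parsed into an object; `get_weather`, `TOOLS`, `REGISTRY`, and `run_tool_calls` are hypothetical names for illustration.

```python
def get_weather(city: str) -> str:
    """Hypothetical tool used only for illustration."""
    return f"Sunny in {city}"


# Tool schema in the function-calling style (assumed format)
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message: dict) -> list:
    """Execute each tool call in an assistant message; return 'tool' role messages."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        handler = REGISTRY.get(fn["name"])
        output = handler(**fn["arguments"]) if handler else f"unknown tool: {fn['name']}"
        results.append({"role": "tool", "content": output})
    return results


# Hand-written example of an assistant message that requests a tool call
fake = {"role": "assistant", "content": "",
        "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Oslo"}}}]}
print(run_tool_calls(fake))
```

In an agent loop, the returned messages would be appended to the history and the conversation sent back to the model so it can produce its final answer.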
44.6 Performance Tuning
Key Parameter Effects
| Parameter | Default | Direction | Effect |
|---|---|---|---|
| `num_ctx` | 2048 | Increase as needed | Longer context understanding |
| `num_gpu` | -1 | Keep default | Auto-maximize GPU usage |
| `num_batch` | 512 | Increase (256 to 2048) | Faster prefill / first token |
| `num_thread` | physical cores | Match physical cores | CPU portion speed |
| `use_mmap` | true | Keep true | Reduces RAM footprint |
| `use_mlock` | false | true on high-RAM servers | Prevents page swapping |
Context Window vs Speed Benchmark
# ctx_benchmark.py
import asyncio
import time

import httpx


async def benchmark_ctx(ctx_sizes):
    """Measure generation speed and prompt-evaluation time for each num_ctx value."""
    async with httpx.AsyncClient(timeout=300) as client:
        for ctx in ctx_sizes:
            start = time.time()
            r = await client.post(
                "http://localhost:11434/api/generate",
                json={
                    "model": "nous-hermes3:70b",
                    "prompt": "Explain neural networks in one paragraph.",
                    "stream": False,
                    "options": {"num_ctx": ctx, "num_predict": 100},
                },
            )
            r.raise_for_status()
            d = r.json()
            # Durations are in nanoseconds: convert to tokens/s and milliseconds
            tps = d["eval_count"] / (d["eval_duration"] / 1e9)
            ttft = d["prompt_eval_duration"] / 1e6
            print(f"num_ctx={ctx:6d}: {tps:.1f} t/s, TTFT={ttft:.0f}ms, total={time.time()-start:.1f}s")


asyncio.run(benchmark_ctx([4096, 8192, 16384, 32768, 65536]))
Typical results on A100 80GB with Hermes 70B Q4:
num_ctx= 4096: 28.3 t/s, TTFT=450ms, total=4.2s
num_ctx= 8192: 26.1 t/s, TTFT=890ms, total=4.6s
num_ctx= 16384: 22.8 t/s, TTFT=1780ms, total=5.3s
num_ctx= 32768: 18.5 t/s, TTFT=3560ms, total=6.8s
num_ctx= 65536: 12.1 t/s, TTFT=7120ms, total=10.3s
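TTFT grows with context because the prefill (prompt evaluation) phase must process every prompt token, so its cost scales at least linearly with the number of tokens in the prompt. Note that num_ctx only sets the window size: to reproduce this trend, the prompt itself has to grow with num_ctx rather than stay at one short sentence.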
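The linear relationship can be checked directly against the figures listed above. The sketch below only re-uses those reported numbers and assumes each TTFT value corresponds to prefilling a full window of num_ctx tokens.

```python
# (num_ctx, TTFT in ms) pairs copied from the results listed above
results = [(4096, 450), (8192, 890), (16384, 1780), (32768, 3560), (65536, 7120)]

for ctx, ttft_ms in results:
    ms_per_token = ttft_ms / ctx
    prefill_tps = ctx / (ttft_ms / 1000)
    print(f"num_ctx={ctx:6d}: {ms_per_token:.3f} ms/token, ~{prefill_tps:.0f} tokens/s prefill")
```

The per-token cost stays close to 0.11 ms across all five rows, which is what linear scaling predicts.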
Chapter Summary
Ollama is the lowest-friction path to running Hermes locally:
- Installation: One command on all three platforms; automatic GPU detection
- Model management: `ollama pull` for Hub models; Modelfile for custom GGUF imports
- Core APIs: `/api/chat` and `/api/generate` cover most use cases
- Critical parameters: `num_ctx` (context window) and `num_gpu` (GPU layers) have the biggest performance impact
- Agent integration: Set `provider: ollama` in your Hermes Agent config for seamless connection
Best practice: Use Ollama in development for its simplicity. Switch to vLLM (Chapter 45) for high-concurrency production workloads.
Review Questions
- Ollama's `KEEP_ALIVE` parameter controls how long a model stays loaded in VRAM after its last request. If your server has 80 GB VRAM and you need to alternate between Hermes 70B and a 7B model, what `KEEP_ALIVE` value minimizes user wait time while maximizing GPU utilization?
- The benchmark shows that TTFT nearly doubles with each doubling of `num_ctx`. What is the underlying computational reason for this? Can this overhead be reduced without sacrificing context capacity?
- The `HermesOllamaAgent` implementation grows its `conversation_history` indefinitely. Design a context management strategy that automatically trims history when it approaches the `num_ctx` limit, while preserving the most relevant parts of the conversation.