Chapter 44: Ollama Local Deployment and API Integration

Introduction

Ollama is one of the most widely used tools for running LLMs locally. It hides much of the complexity of model downloads, quantized model formats, and GPU runtime setup, so a model such as Hermes 70B can be started with a single command. This chapter covers installation on macOS, Linux, and Windows, pulling and configuring Hermes models, the core REST API endpoints, and an integration with Hermes Agent.

44.1 Installing Ollama

macOS

# Option 1: Download the installer (recommended)
# Visit https://ollama.com/download and download the .pkg file

# Option 2: Homebrew
brew install ollama

# Verify installation
ollama --version

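# Start the server manually if it is not already running.
# The desktop app starts it automatically; with Homebrew, `brew services start ollama` runs it in the background.
ollama serve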

# Confirm the API is responsive
curl http://localhost:11434/api/version

Linux

# One-line install script (Ubuntu 20.04+, Debian 11+, CentOS 8+)
curl -fsSL https://ollama.com/install.sh | sh

# The script automatically:
# 1. Detects your CUDA version
# 2. Installs compatible CUDA libraries if missing
# 3. Configures a systemd service
# 4. Creates an 'ollama' system user

# Enable auto-start
sudo systemctl enable ollama
sudo systemctl start ollama

# Monitor logs
journalctl -u ollama -f

# Confirm GPU is detected
nvidia-smi
# You should see an 'ollama' process using VRAM after the first model load

Windows

# Option 1: Download OllamaSetup.exe (recommended)
# https://ollama.com/download/windows

# Option 2: winget
winget install Ollama.Ollama

# Start the server if it is not already running (the installer normally launches it in the background)
ollama serve

# Verify
Invoke-WebRequest -Uri "http://localhost:11434/api/version"

CUDA Compatibility

CUDA Version	Min Driver	Ollama Version	Status
CUDA 12.4	550.xx	0.4.x+	Full support
CUDA 12.1	530.xx	0.3.x+	Supported
CUDA 11.8	520.xx	0.2.x+	Supported (slightly lower performance)
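
The verification step used on each platform above (a request to /api/version) can also be scripted. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and it assumes the endpoint returns a JSON object with a `version` field.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def parse_version(body: bytes) -> Optional[str]:
    """Extract the version string from a /api/version response body."""
    try:
        return json.loads(body).get("version")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object
        return None


def get_ollama_version(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> Optional[str]:
    """Return the server version, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    version = get_ollama_version()
    print(f"Ollama {version} is running" if version else "Ollama is not reachable")
```

Returning None instead of raising keeps the check easy to use in setup scripts that need to decide whether to start the server.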

44.2 Pulling Hermes Models

Pull Commands

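# Published tag names can differ from the ones used in this chapter (for example, `hermes3`);
# confirm the exact tag at https://ollama.com/library before pulling.

# Hermes 2 Pro (7B: fastest, good for everyday tasks)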
ollama pull nous-hermes2-pro:7b

# Hermes 3 (recommended — latest architecture)
ollama pull nous-hermes3:8b
ollama pull nous-hermes3:70b

# List locally available models
ollama list

# Show model details
ollama show nous-hermes3:70b

# Remove a model
ollama rm nous-hermes2-pro:7b

Importing Hermes 4 from GGUF

When the latest Hermes version is not yet in the Ollama Hub, import it manually:

# 1. Download GGUF from HuggingFace
wget https://huggingface.co/NousResearch/Hermes-4-70B-GGUF/resolve/main/hermes-4-70b-q4_k_m.gguf

# 2. Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./hermes-4-70b-q4_k_m.gguf

SYSTEM """You are Hermes, an AI assistant. Be helpful, harmless, and honest."""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER num_ctx 65536
PARAMETER num_gpu -1
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
EOF

# 3. Import and name the model
ollama create hermes-4-70b:q4 -f Modelfile

# 4. Test it
ollama run hermes-4-70b:q4 "Briefly explain transformer attention."
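
To make the TEMPLATE block easier to read, here is a small Python sketch of the prompt string it is meant to produce. It mirrors the template logic by hand and does not call Ollama; `render_chatml` is a hypothetical helper name, and the real Go template rendering may differ in edge cases.

```python
from typing import Dict, List, Optional


def render_chatml(messages: List[Dict[str, str]], system: Optional[str] = None) -> str:
    """Build a ChatML prompt the way the Modelfile TEMPLATE describes."""
    parts = []
    if system:
        # Corresponds to the {{ if .System }} ... {{ end }} block
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    for msg in messages:
        # Corresponds to the {{ range .Messages }} ... {{ end }} block
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # The prompt ends with an open assistant turn for the model to complete
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([{"role": "user", "content": "Hi"}], system="Be brief."))
```

The two `stop` parameters in the Modelfile match the delimiters used here, so generation ends when the model closes its own turn.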

Quantization Format Guide

Format	VRAM (70B)	Quality	Speed	Recommendation
FP16	~140 GB	Reference	Baseline	Enterprise only
Q8_0	~80 GB	Excellent	1.5×	Best quality/VRAM
Q5_K_M	~58 GB	Very Good	2×	Good balance
Q4_K_M	~50 GB	Good	2.5×	Recommended default
Q4_0	~48 GB	Acceptable	2.7×	Speed priority

Q4_K_M uses the K-quantization method, which preserves quality significantly better than naive Q4_0 at similar size.
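
As a rough cross-check on the table, the size of the weights alone can be estimated from the parameter count and the bits stored per weight. The sketch below assumes approximate bits-per-weight values for llama.cpp formats (the K-quant figures are averages and vary by model). It ignores the KV cache and runtime buffers, which is presumably why its results come out lower than the VRAM column above.

```python
# Approximate bits per weight (assumption; K-quant values are averages).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 32 weights x 8 bits + one 16-bit scale per block
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,     # 32 weights x 4 bits + one 16-bit scale per block
}


def weight_size_gb(n_params: float, fmt: str) -> float:
    """Estimate the size of the model weights in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:7s} ~{weight_size_gb(70e9, fmt):6.1f} GB (weights only)")
```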

44.3 Configuration Reference

Environment Variables (Production)

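# /etc/systemd/system/ollama.service.d/override.conf
# systemd does not support trailing comments on Environment= lines, so each note sits on its own line.
[Service]
# Use GPUs 0 and 1
Environment="CUDA_VISIBLE_DEVICES=0,1"
# Listen on all interfaces. The API has no built-in authentication, so restrict access with a firewall or reverse proxy.
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Handle up to 4 concurrent requests
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep up to 2 models loaded at once
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Unload idle models after 10 minutes
Environment="OLLAMA_KEEP_ALIVE=10m"
# Store models on a large volume
Environment="OLLAMA_MODELS=/data/ollama/models"

# Apply the changes (run in a shell; not part of the file)
sudo systemctl daemon-reload && sudo systemctl restart ollama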
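
After restarting the service, it can help to confirm that the keep-alive and loaded-model settings behave as intended. A minimal sketch follows, assuming the /api/ps response contains a `models` list whose entries carry `name`, `size_vram` (bytes), and `expires_at`. The helper names are illustrative.

```python
import json
import urllib.request
from typing import Dict, List


def summarize_loaded(ps_response: Dict) -> List[Dict]:
    """Reduce a /api/ps response to name, VRAM use in GB, and unload time."""
    rows = []
    for model in ps_response.get("models", []):
        rows.append({
            "name": model.get("name", "?"),
            "vram_gb": round(model.get("size_vram", 0) / 1e9, 1),
            "expires_at": model.get("expires_at"),
        })
    return rows


def fetch_loaded(base_url: str = "http://localhost:11434") -> List[Dict]:
    """Query a running Ollama server and summarize its loaded models."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return summarize_loaded(json.load(resp))
```

Calling `fetch_loaded()` a few minutes apart should show `expires_at` moving forward after each request and the entry disappearing once the keep-alive window has passed.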

Key Modelfile Parameters Explained

# GPU configuration
PARAMETER num_gpu -1          # -1 = let Ollama decide how many layers to offload
                              #  0 = CPU-only inference
                              #  N = offload N layers to the GPU(s)

# Context window
PARAMETER num_ctx 65536       # Maximum token context (more = more VRAM)
PARAMETER num_batch 512       # Prefill batch size (larger = faster first token)

# Generation
PARAMETER temperature 0.1     # 0 = deterministic, 1 = creative
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1  # Penalize repeated phrases

# Performance
PARAMETER num_thread 8        # CPU threads (set to physical core count)
PARAMETER use_mmap true       # Memory-map model file (saves RAM)
PARAMETER use_mlock false     # Lock pages in RAM (prevents swapping, needs root)
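
The note that a larger num_ctx costs more VRAM can be made concrete with a back-of-the-envelope KV cache estimate. The sketch below assumes a Llama-3.1-70B-style architecture (80 layers, 8 key/value heads, head dimension 128) and an unquantized 16-bit cache. These defaults are assumptions, and actual usage depends on the model and on how the runtime stores the cache.

```python
def kv_cache_bytes(num_ctx: int,
                   n_layers: int = 80,       # assumed: 70B-class Llama architecture
                   n_kv_heads: int = 8,      # assumed: grouped-query attention
                   head_dim: int = 128,      # assumed
                   bytes_per_value: int = 2  # assumed: fp16 cache
                   ) -> int:
    """Estimate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


if __name__ == "__main__":
    for ctx in (4096, 16384, 65536):
        print(f"num_ctx={ctx:6d}: ~{kv_cache_bytes(ctx) / 2**30:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with num_ctx, so a 65536-token window needs sixteen times the cache memory of a 4096-token one, on top of the weights.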

44.4 REST API Usage

Endpoint Overview

Endpoint	Method	Purpose
`/api/chat`	POST	Multi-turn conversation
`/api/generate`	POST	Single-prompt generation
`/api/embeddings`	POST	Text embedding vectors
`/api/tags`	GET	List local models
`/api/show`	POST	Model details
`/api/pull`	POST	Download a model
`/api/ps`	GET	Currently loaded models
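
As a small illustration of the read-only endpoints, the sketch below checks whether a model is available locally before sending chat requests. It assumes /api/tags returns a JSON object with a `models` list whose entries have a `name` field. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List


def local_model_names(tags_response: Dict) -> List[str]:
    """Collect model names from a /api/tags response."""
    return [m["name"] for m in tags_response.get("models", []) if "name" in m]


def has_model(tags_response: Dict, wanted: str) -> bool:
    """Check for a model, treating a bare name as its ':latest' tag."""
    if ":" not in wanted:
        wanted += ":latest"
    return wanted in local_model_names(tags_response)


def fetch_tags(base_url: str = "http://localhost:11434") -> Dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

A client could call `has_model(fetch_tags(), "nous-hermes3:70b")` at startup and fall back to /api/pull when it returns False.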

Chat API Example

# chat_example.py
import httpx
import json
import asyncio

async def streaming_chat():
    messages = []

    async with httpx.AsyncClient(timeout=120) as client:
        while True:
            user_input = input("\nYou: ").strip()
            if user_input.lower() in ["exit", "quit"]:
                break

            messages.append({"role": "user", "content": user_input})
            print("\nHermes: ", end="", flush=True)
            full_response = ""

            async with client.stream(
                "POST",
                "http://localhost:11434/api/chat",
                json={
                    "model": "nous-hermes3:70b",
                    "messages": messages,
                    "stream": True,
                    "options": {
                        "num_ctx": 65536,
                        "temperature": 0.1,
                        "num_predict": 2048,
                    }
                }
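            ) as response:
                response.raise_for_status()  # surface HTTP errors such as an unknown model
                async for line in response.aiter_lines():
                    if line:
                        data = json.loads(line)  # the stream carries one JSON object per line
                        content = data.get("message", {}).get("content", "")
                        if content:
                            print(content, end="", flush=True)
                            full_response += content
                        if data.get("done") and data.get("eval_duration"):
                            # Durations are reported in nanoseconds
                            tps = data.get("eval_count", 0) / (data["eval_duration"] / 1e9)
                            print(f"\n[{data.get('eval_count', 0)} tokens, {tps:.1f} t/s]")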

            messages.append({"role": "assistant", "content": full_response})

asyncio.run(streaming_chat())

44.5 Hermes Agent Integration

Agent Configuration (YAML)

# hermes_agent_config.yaml
model:
  provider: ollama
  base_url: "http://localhost:11434"
  model_name: "nous-hermes3:70b"
  inference:
    temperature: 0.1
    max_tokens: 4096
    context_window: 65536
    stream: true
  connection:
    timeout: 120
    max_retries: 3

agent:
  max_iterations: 20
  tools:
    enabled: true
    format: "chatml"    # Hermes uses ChatML tool format

mcp_servers:
  filesystem:
    command: "uvx"
    args: ["mcp-server-filesystem", "/workspace"]

Python Integration

# hermes_ollama_agent.py
import httpx
import json
import asyncio
from typing import Optional, Callable
from dataclasses import dataclass

@dataclass
class OllamaConfig:
    base_url: str = "http://localhost:11434"
    model: str = "nous-hermes3:70b"
    temperature: float = 0.1
    max_tokens: int = 4096
    context_window: int = 65536
    timeout: int = 120

class HermesOllamaAgent:
    def __init__(self, config: OllamaConfig, tools: Optional[list] = None):
        self.config = config
        self.tools = tools or []
        self.conversation_history = []
        self.client = httpx.AsyncClient(
            base_url=config.base_url,
            timeout=httpx.Timeout(config.timeout)
        )

    async def chat(
        self,
        user_message: str,
        system_prompt: Optional[str] = None,
        on_token: Optional[Callable] = None
    ) -> str:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.extend(self.conversation_history)
        messages.append({"role": "user", "content": user_message})

        payload = {
            "model": self.config.model,
            "messages": messages,
            "stream": on_token is not None,
            "options": {
                "temperature": self.config.temperature,
                "num_predict": self.config.max_tokens,
                "num_ctx": self.config.context_window,
            }
        }
        if self.tools:
            payload["tools"] = self.tools

        response_content = ""

        if on_token:
            async with self.client.stream("POST", "/api/chat", json=payload) as response:
                response.raise_for_status()  # same error handling as the non-streaming branch
                async for line in response.aiter_lines():
                    if line:
                        data = json.loads(line)
                        content = data.get("message", {}).get("content", "")
                        if content:
                            response_content += content
                            on_token(content)
        else:
            response = await self.client.post("/api/chat", json=payload)
            response.raise_for_status()
            response_content = response.json().get("message", {}).get("content", "")

        self.conversation_history.append({"role": "user", "content": user_message})
        self.conversation_history.append({"role": "assistant", "content": response_content})
        return response_content

    def clear_history(self):
        self.conversation_history = []

    async def close(self):
        await self.client.aclose()


# Usage
async def main():
    config = OllamaConfig(model="nous-hermes3:70b")
    agent = HermesOllamaAgent(config)

    system = "You are a professional code analysis assistant. Be concise and precise."

    print("Hermes Agent started (type 'exit' to quit)\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            break
        print("Hermes: ", end="", flush=True)
        await agent.chat(user_input, system, on_token=lambda t: print(t, end="", flush=True))
        print()

    await agent.close()

asyncio.run(main())
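
The agent above forwards `tools` to /api/chat but does not act on any tool calls the model returns. One way to handle them is sketched below, assuming the response message carries a `tool_calls` list of `{"function": {"name": ..., "arguments": {...}}}` entries. The `registry` mapping and `dispatch_tool_calls` name are hypothetical, not part of Hermes Agent.

```python
import json
from typing import Callable, Dict, List


def dispatch_tool_calls(message: dict, registry: Dict[str, Callable]) -> List[dict]:
    """Run each requested tool and return 'tool' role messages with the results."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        name = fn.get("name", "")
        args = fn.get("arguments") or {}
        if isinstance(args, str):
            # Tolerate arguments delivered as a JSON string
            args = json.loads(args)
        handler = registry.get(name)
        if handler is None:
            content = f"error: unknown tool '{name}'"
        else:
            try:
                content = json.dumps(handler(**args))
            except Exception as exc:  # report failures back to the model
                content = f"error: {exc}"
        results.append({"role": "tool", "content": content})
    return results


if __name__ == "__main__":
    reply = {"role": "assistant", "content": "",
             "tool_calls": [{"function": {"name": "add", "arguments": {"a": 2, "b": 3}}}]}
    print(dispatch_tool_calls(reply, {"add": lambda a, b: a + b}))
```

In an agent loop, the returned messages would be appended to the conversation after the assistant message, and the request repeated until the model answers without requesting further tools.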

44.6 Performance Tuning

Key Parameter Effects

Parameter	Default	Direction	Effect
`num_ctx`	2048	Increase as needed	Longer context understanding
`num_gpu`	-1	Keep default	Auto-maximize GPU usage
`num_batch`	512	Increase (256–2048)	Faster prefill / first token
`num_thread`	physical cores	Match physical cores	CPU portion speed
`use_mmap`	true	Keep true	Reduces RAM footprint
`use_mlock`	false	true on high-RAM servers	Prevents page swapping

Context Window vs Speed Benchmark

# ctx_benchmark.py
import httpx, time, asyncio

async def benchmark_ctx(ctx_sizes):
    async with httpx.AsyncClient(timeout=300) as client:
        for ctx in ctx_sizes:
            start = time.time()
            r = await client.post(
                "http://localhost:11434/api/generate",
                json={
                    "model": "nous-hermes3:70b",
                    "prompt": "Explain neural networks in one paragraph.",
                    "stream": False,
                    "options": {"num_ctx": ctx, "num_predict": 100}
                }
            )
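            r.raise_for_status()
            d = r.json()
            # Durations are reported in nanoseconds
            tps = d["eval_count"] / (d["eval_duration"] / 1e9)
            # Fall back to 0 if the prompt evaluation time is not reported
            ttft = d.get("prompt_eval_duration", 0) / 1e6
            print(f"num_ctx={ctx:6d}: {tps:.1f} t/s, TTFT={ttft:.0f}ms, total={time.time()-start:.1f}s")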

asyncio.run(benchmark_ctx([4096, 8192, 16384, 32768, 65536]))

Typical results on A100 80GB with Hermes 70B Q4:

num_ctx=  4096: 28.3 t/s, TTFT=450ms,  total=4.2s
num_ctx=  8192: 26.1 t/s, TTFT=890ms,  total=4.6s
num_ctx= 16384: 22.8 t/s, TTFT=1780ms, total=5.3s
num_ctx= 32768: 18.5 t/s, TTFT=3560ms, total=6.8s
num_ctx= 65536: 12.1 t/s, TTFT=7120ms, total=10.3s

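TTFT is dominated by the prefill (prompt evaluation) phase, whose cost grows at least linearly with the number of prompt tokens actually processed. Note that num_ctx only sets the size of the context window and the KV cache allocated for it, so to observe prefill scaling the prompt length has to grow with the window. Changing num_ctx between requests also forces a model reload, which is reported separately as load_duration.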
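
To separate model loading, prefill, and decoding when reading benchmark output, the timing fields of a non-streaming /api/generate response can be summarized as in the sketch below. It assumes the response includes `load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration`, with durations in nanoseconds. `summarize_timings` is an illustrative name.

```python
def summarize_timings(d: dict) -> dict:
    """Split an Ollama response's timing fields into load, prefill, and decode figures."""
    ns = 1e9

    def rate(count_key: str, duration_key: str) -> float:
        count, duration = d.get(count_key, 0), d.get(duration_key, 0)
        # Tokens per second; 0.0 if the duration is missing or zero
        return count / (duration / ns) if duration else 0.0

    return {
        "load_s": d.get("load_duration", 0) / ns,
        "prefill_ms": d.get("prompt_eval_duration", 0) / 1e6,
        "prefill_tps": rate("prompt_eval_count", "prompt_eval_duration"),
        "decode_tps": rate("eval_count", "eval_duration"),
    }


if __name__ == "__main__":
    sample = {"load_duration": 2_000_000_000,
              "prompt_eval_count": 1000, "prompt_eval_duration": 500_000_000,
              "eval_count": 100, "eval_duration": 4_000_000_000}
    print(summarize_timings(sample))
```

Comparing `prefill_tps` across runs shows whether a change in TTFT comes from a longer prompt or from slower prompt processing.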

Chapter Summary

Ollama is the lowest-friction path to running Hermes locally:

Installation: One command or installer on each of the three platforms; automatic GPU detection
Model management: ollama pull for Hub models; Modelfile for custom GGUF imports
Core APIs: /api/chat and /api/generate cover most use cases
Critical parameters: num_ctx (context window) and num_gpu (GPU layers) have the biggest performance impact
Agent integration: Set provider: ollama in your Hermes Agent config to connect the agent to the local server

Best practice: Use Ollama in development for its simplicity. Switch to vLLM (Chapter 45) for high-concurrency production workloads.

Review Questions

1. Ollama's keep-alive setting (the OLLAMA_KEEP_ALIVE environment variable, or keep_alive on an individual request) controls how long a model stays loaded in VRAM after its last request. If your server has 80 GB of VRAM and you need to alternate between Hermes 70B and a 7B model, what keep-alive value minimizes user wait time while maximizing GPU utilization?
2. The benchmark shows that TTFT nearly doubles with each doubling of num_ctx. What is the underlying computational reason for this? Can this overhead be reduced without sacrificing context capacity?
3. The HermesOllamaAgent implementation grows its conversation_history indefinitely. Design a context management strategy that automatically trims history when it approaches the num_ctx limit, while preserving the most relevant parts of the conversation.
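
As one possible starting point for the history-trimming question, the sketch below keeps the most recent messages that fit a token budget. It assumes a crude estimate of about four characters per token and a hypothetical `reserve` for the system prompt and the reply. A real implementation would use the model's tokenizer and could summarize dropped turns instead of discarding them.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(history: List[Dict[str, str]], num_ctx: int,
                 reserve: int = 1024) -> List[Dict[str, str]]:
    """Keep the newest messages whose estimated tokens fit in num_ctx - reserve."""
    budget = num_ctx - reserve
    kept, used = [], 0
    for msg in reversed(history):          # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()                         # restore chronological order
    # Avoid starting on an assistant reply whose user turn was dropped
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

In the agent class, this could be applied to `conversation_history` before each request, with `num_ctx` taken from `OllamaConfig.context_window`.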
