Chapter 44

Ollama Local Deployment and API Wrapping

Introduction

Ollama is one of the most widely used tools for local LLM deployment in 2025–2026. It abstracts away model downloads, quantization format handling, and GPU runtime setup, letting you run Hermes 70B with a single command. This chapter covers installation on macOS, Linux, and Windows, pulling and configuring Hermes models, the core REST API endpoints, and an integration with Hermes Agent.


44.1 Installing Ollama

macOS

# Option 1: Download the installer (recommended)
# Visit https://ollama.com/download and download the .pkg file

# Option 2: Homebrew
brew install ollama

# Verify installation
ollama --version

# Start the server manually if it is not already running
# (the desktop app starts it automatically; a Homebrew install may not)
ollama serve

# Confirm the API is responsive
curl http://localhost:11434/api/version

Linux

# One-line install script (Ubuntu 20.04+, Debian 11+, CentOS 8+)
curl -fsSL https://ollama.com/install.sh | sh

# The script automatically:
# 1. Detects your CUDA version
# 2. Installs compatible CUDA libraries if missing
# 3. Configures a systemd service
# 4. Creates an 'ollama' system user

# Enable auto-start
sudo systemctl enable ollama
sudo systemctl start ollama

# Monitor logs
journalctl -u ollama -f

# Confirm GPU is detected
nvidia-smi
# You should see an 'ollama' process using VRAM after the first model load

Windows

# Option 1: Download OllamaSetup.exe (recommended)
# https://ollama.com/download/windows

# Option 2: winget
winget install Ollama.Ollama

# Start the service
ollama serve

# Verify
Invoke-WebRequest -Uri "http://localhost:11434/api/version"
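
The same check can be scripted identically on all three platforms. The following is a minimal sketch using only the Python standard library. The helper names parse_version and ollama_version are illustrative, not part of Ollama, and the default address http://localhost:11434 is assumed.

```python
import json
import urllib.request


def parse_version(body: str) -> str:
    """Extract the version string from the JSON body of /api/version."""
    return json.loads(body)["version"]


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> str:
    """Return the server version; raises URLError (an OSError) if unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
        return parse_version(resp.read().decode("utf-8"))


# Example usage (requires a running server):
#   try:
#       print("Ollama version:", ollama_version())
#   except OSError as exc:
#       print("Ollama is not reachable:", exc)
```

Catching OSError covers both a refused connection and a timeout, which are the two failures most likely to occur right after installation.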

CUDA Compatibility

CUDA Version   Min Driver   Ollama Version   Status
CUDA 12.4      550.xx       0.4.x+           Full support
CUDA 12.1      530.xx       0.3.x+           Supported
CUDA 11.8      520.xx       0.2.x+           Supported (slightly lower performance)

44.2 Pulling Hermes Models

Pull Commands

# Tag names change over time; confirm them in the Ollama model library before pulling

# Hermes 2 Pro (7B: fastest, good for everyday tasks)
ollama pull nous-hermes2-pro:7b

# Hermes 3 (recommended โ€” latest architecture)
ollama pull nous-hermes3:8b
ollama pull nous-hermes3:70b

# List locally available models
ollama list

# Show model details
ollama show nous-hermes3:70b

# Remove a model
ollama rm nous-hermes2-pro:7b

Importing Hermes 4 from GGUF

When the latest Hermes version is not yet available in the Ollama model library, import it manually from a GGUF file:

# 1. Download GGUF from HuggingFace
wget https://huggingface.co/NousResearch/Hermes-4-70B-GGUF/resolve/main/hermes-4-70b-q4_k_m.gguf

# 2. Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./hermes-4-70b-q4_k_m.gguf

SYSTEM """You are Hermes, an AI assistant. Be helpful, harmless, and honest."""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER num_ctx 65536
PARAMETER num_gpu -1
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
EOF

# 3. Import and name the model
ollama create hermes-4-70b:q4 -f Modelfile

# 4. Test it
ollama run hermes-4-70b:q4 "Briefly explain transformer attention."
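
To make the TEMPLATE in the Modelfile easier to read, the sketch below reproduces the same ChatML layout in plain Python. It is only an illustration of the prompt text the template is meant to produce; render_chatml is a hypothetical helper, and Ollama itself renders the template with Go's templating engine, not with this code.

```python
def render_chatml(messages, system=None):
    """Build a ChatML prompt: optional system turn, each message, then an open assistant turn."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # The prompt ends with an unterminated assistant turn for the model to complete
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml(
        [{"role": "user", "content": "Briefly explain transformer attention."}],
        system="You are Hermes, an AI assistant.",
    ))
```

The model closes its own turn by emitting <|im_end|>, which is why that token is listed as a stop parameter in the Modelfile.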

Quantization Format Guide

Format   VRAM (70B)   Quality      Speed      Recommendation
FP16     ~140 GB      Reference    Baseline   Enterprise only
Q8_0     ~80 GB       Excellent    1.5×       Best quality/VRAM
Q5_K_M   ~58 GB       Very Good    2×         Good balance
Q4_K_M   ~50 GB       Good         2.5×       Recommended default
Q4_0     ~48 GB       Acceptable   2.7×       Speed priority

Q4_K_M uses the K-quantization method, which preserves quality significantly better than naive Q4_0 at similar size.
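
The size of the weights alone follows from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below applies that formula. The Q4_0 and Q8_0 values follow from their block layout (32 weights plus one 16-bit scale per block), while the K-quant values are rough averages and should be treated as assumptions. The results cover weights only, so they come out lower than the VRAM column above, which presumably also budgets for the KV cache and runtime overhead.

```python
# Approximate bits per weight, including per-block scale overhead.
# The K-quant entries are rough averages (an assumption of this sketch).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,
}


def weight_size_gb(n_params: float, fmt: str) -> float:
    """Estimate the size of the model weights in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:7s} ~{weight_size_gb(70e9, fmt):6.1f} GB of weights")
```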


44.3 Configuration Reference

Environment Variables (Production)

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
# Use GPUs 0 and 1
Environment="CUDA_VISIBLE_DEVICES=0,1"
# Listen on all interfaces (the API has no built-in authentication, so firewall it)
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Handle 4 concurrent requests
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep up to 2 models loaded at once
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Unload idle models after 10 min
Environment="OLLAMA_KEEP_ALIVE=10m"
# Store models on a large volume
Environment="OLLAMA_MODELS=/data/ollama/models"

# Then, in a shell, apply the change:
sudo systemctl daemon-reload && sudo systemctl restart ollama
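
When the same settings have to be rolled out to several hosts, it can help to generate the drop-in file from data. The following is a minimal sketch; render_override is a hypothetical helper, and it deliberately emits one Environment= assignment per line with nothing after the closing quote, since systemd only recognizes comments that occupy a whole line.

```python
def render_override(env: dict) -> str:
    """Render a systemd drop-in that sets the given environment variables."""
    lines = ["[Service]"]
    for key, value in env.items():
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_override({
        "OLLAMA_NUM_PARALLEL": "4",
        "OLLAMA_KEEP_ALIVE": "10m",
    }), end="")
```

The output can be written to the override.conf path shown above and applied with the same daemon-reload and restart commands.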

Key Modelfile Parameters Explained

# GPU configuration
PARAMETER num_gpu -1          # -1 = let Ollama decide how many layers to offload
                              #  0 = CPU-only inference
                              #  N = offload N layers to GPU

# Context window
PARAMETER num_ctx 65536       # Maximum token context (more = more VRAM)
PARAMETER num_batch 512       # Prefill batch size (larger = faster first token)

# Generation
PARAMETER temperature 0.1     # 0 = deterministic, 1 = creative
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1  # Penalize repeated phrases

# Performance
PARAMETER num_thread 8        # CPU threads (set to physical core count)
PARAMETER use_mmap true       # Memory-map model file (saves RAM)
PARAMETER use_mlock false     # Lock pages in RAM (prevents swapping, needs root)

44.4 REST API Usage

Endpoint Overview

Endpoint          Method   Purpose
/api/chat         POST     Multi-turn conversation
/api/generate     POST     Single-prompt generation
/api/embeddings   POST     Text embedding vectors
/api/tags         GET      List local models
/api/show         POST     Model details
/api/pull         POST     Download a model
/api/ps           GET      Currently loaded models
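
The two GET endpoints are the easiest to try first. The sketch below, using only the standard library, fetches /api/tags or /api/ps and formats the "models" array both of them return. The helper names get_json and summarize_models are illustrative, and only the name and size fields are read.

```python
import json
import urllib.request


def get_json(path: str, base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """GET a JSON document from the Ollama server."""
    with urllib.request.urlopen(f"{base_url}{path}", timeout=timeout) as resp:
        return json.load(resp)


def summarize_models(payload: dict) -> list:
    """Format each entry of the 'models' array as 'name: size in GB'."""
    rows = []
    for model in payload.get("models", []):
        size_gb = model.get("size", 0) / 1e9
        rows.append(f"{model['name']}: {size_gb:.1f} GB")
    return rows


# Example usage (requires a running server):
#   print(summarize_models(get_json("/api/tags")))  # models on disk
#   print(summarize_models(get_json("/api/ps")))    # models currently loaded
```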

Chat API Example

# chat_example.py
import httpx
import json
import asyncio

async def streaming_chat():
    messages = []

    async with httpx.AsyncClient(timeout=120) as client:
        while True:
            user_input = input("\nYou: ").strip()
            if user_input.lower() in ["exit", "quit"]:
                break

            messages.append({"role": "user", "content": user_input})
            print("\nHermes: ", end="", flush=True)
            full_response = ""

            async with client.stream(
                "POST",
                "http://localhost:11434/api/chat",
                json={
                    "model": "nous-hermes3:70b",
                    "messages": messages,
                    "stream": True,
                    "options": {
                        "num_ctx": 65536,
                        "temperature": 0.1,
                        "num_predict": 2048,
                    }
                }
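            ) as response:
                response.raise_for_status()  # surface HTTP errors such as an unknown model
                async for line in response.aiter_lines():
                    if line:
                        # Each non-empty line is one JSON object (newline-delimited JSON)
                        data = json.loads(line)
                        content = data.get("message", {}).get("content", "")
                        if content:
                            print(content, end="", flush=True)
                            full_response += content
                        if data.get("done"):
                            # The final chunk carries timing stats; durations are in nanoseconds
                            count = data.get("eval_count", 0)
                            duration = data.get("eval_duration", 0)
                            if duration:
                                tps = count / (duration / 1e9)
                                print(f"\n[{count} tokens, {tps:.1f} t/s]")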

            messages.append({"role": "assistant", "content": full_response})

asyncio.run(streaming_chat())
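
The streaming response is newline-delimited JSON: one object per line, with the last one marked "done" and carrying the timing fields. Separating that parsing from the network code makes it easy to test offline. The following is a minimal sketch; accumulate_stream is a hypothetical helper and the sample lines are hand-written, not captured server output.

```python
import json


def accumulate_stream(lines):
    """Join the content fragments of a streamed chat reply and collect the final stats."""
    fragments = []
    stats = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        fragments.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            stats = {k: chunk[k] for k in ("eval_count", "eval_duration") if k in chunk}
    return "".join(fragments), stats


if __name__ == "__main__":
    sample = [
        '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
        '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
        '{"message": {"role": "assistant", "content": ""}, "done": true, '
        '"eval_count": 2, "eval_duration": 100000000}',
    ]
    print(accumulate_stream(sample))
```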

44.5 Hermes Agent Integration

Agent Configuration (YAML)

# hermes_agent_config.yaml
model:
  provider: ollama
  base_url: "http://localhost:11434"
  model_name: "nous-hermes3:70b"
  inference:
    temperature: 0.1
    max_tokens: 4096
    context_window: 65536
    stream: true
  connection:
    timeout: 120
    max_retries: 3

agent:
  max_iterations: 20
  tools:
    enabled: true
    format: "chatml"    # Hermes uses ChatML tool format

mcp_servers:
  filesystem:
    command: "uvx"
    args: ["mcp-server-filesystem", "/workspace"]

Python Integration

# hermes_ollama_agent.py
import httpx
import json
import asyncio
from typing import Optional, Callable
from dataclasses import dataclass

@dataclass
class OllamaConfig:
    base_url: str = "http://localhost:11434"
    model: str = "nous-hermes3:70b"
    temperature: float = 0.1
    max_tokens: int = 4096
    context_window: int = 65536
    timeout: int = 120

class HermesOllamaAgent:
    def __init__(self, config: OllamaConfig, tools: Optional[list] = None):
        self.config = config
        self.tools = tools or []
        self.conversation_history = []
        self.client = httpx.AsyncClient(
            base_url=config.base_url,
            timeout=httpx.Timeout(config.timeout)
        )

    async def chat(
        self,
        user_message: str,
        system_prompt: Optional[str] = None,
        on_token: Optional[Callable] = None
    ) -> str:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.extend(self.conversation_history)
        messages.append({"role": "user", "content": user_message})

        payload = {
            "model": self.config.model,
            "messages": messages,
            "stream": on_token is not None,
            "options": {
                "temperature": self.config.temperature,
                "num_predict": self.config.max_tokens,
                "num_ctx": self.config.context_window,
            }
        }
        if self.tools:
            payload["tools"] = self.tools

        response_content = ""

        if on_token:
            async with self.client.stream("POST", "/api/chat", json=payload) as response:
                response.raise_for_status()  # fail on HTTP errors, as the non-streaming branch does
                async for line in response.aiter_lines():
                    if line:
                        data = json.loads(line)
                        content = data.get("message", {}).get("content", "")
                        if content:
                            response_content += content
                            on_token(content)
        else:
            response = await self.client.post("/api/chat", json=payload)
            response.raise_for_status()
            response_content = response.json().get("message", {}).get("content", "")

        self.conversation_history.append({"role": "user", "content": user_message})
        self.conversation_history.append({"role": "assistant", "content": response_content})
        return response_content

    def clear_history(self):
        self.conversation_history = []

    async def close(self):
        await self.client.aclose()


# Usage
async def main():
    config = OllamaConfig(model="nous-hermes3:70b")
    agent = HermesOllamaAgent(config)

    system = "You are a professional code analysis assistant. Be concise and precise."

    print("Hermes Agent started (type 'exit' to quit)\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            break
        print("Hermes: ", end="")
        await agent.chat(user_input, system, on_token=lambda t: print(t, end="", flush=True))
        print()

    await agent.close()

asyncio.run(main())
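
The agent above forwards self.tools in the payload but reads only the text content of the reply, so any tool call the model requests is silently dropped. The sketch below shows one way the missing piece might look: a tool definition in the JSON-schema style accepted by /api/chat, and a dispatcher for the tool_calls list on the assistant message. The get_file_size tool, the registry, and dispatch_tool_calls are all made up for illustration and are not part of Hermes Agent.

```python
import json

# Example tool definition; the tool itself is hypothetical.
GET_FILE_SIZE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_file_size",
        "description": "Return the size of a file in bytes",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "File path"}},
            "required": ["path"],
        },
    },
}


def dispatch_tool_calls(message: dict, registry: dict) -> list:
    """Run each requested tool and return 'tool' role messages holding the results."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        name, args = fn["name"], fn.get("arguments", {})
        if isinstance(args, str):  # tolerate arguments sent as a JSON string
            args = json.loads(args)
        handler = registry.get(name)
        output = handler(**args) if handler else f"unknown tool: {name}"
        results.append({"role": "tool", "content": str(output)})
    return results


if __name__ == "__main__":
    reply = {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "get_file_size", "arguments": {"path": "/tmp/x"}}}
    ]}
    print(dispatch_tool_calls(reply, {"get_file_size": lambda path: 42}))
```

In a full loop, the assistant message and the returned tool messages would be appended to the conversation and sent back to /api/chat until the model replies without further tool calls.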

44.6 Performance Tuning

Key Parameter Effects

Parameter    Default          Direction                  Effect
num_ctx      2048             Increase as needed         Longer context understanding
num_gpu      -1               Keep default               Auto-maximize GPU usage
num_batch    512              Increase (256–2048)        Faster prefill / first token
num_thread   physical cores   Match physical cores       CPU portion speed
use_mmap     true             Keep true                  Reduces RAM footprint
use_mlock    false            true on high-RAM servers   Prevents page swapping
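
To see why raising num_ctx costs VRAM, the size of the key/value cache can be estimated directly. The sketch below assumes the architecture of a Llama-3.1-70B-class model (80 layers, 8 key/value heads, head dimension 128) and a 16-bit cache. Those figures are assumptions about the base model, not values taken from this chapter, so check them against the output of ollama show for the model you run.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384, 32768, 65536):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"num_ctx={ctx:6d}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly, from about 1.25 GiB at 4096 tokens to about 20 GiB at 65536 tokens, on top of the model weights.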

Context Window vs Speed Benchmark

# ctx_benchmark.py
import httpx, time, asyncio

async def benchmark_ctx(ctx_sizes):
    async with httpx.AsyncClient(timeout=300) as client:
        for ctx in ctx_sizes:
            start = time.time()
            r = await client.post(
                "http://localhost:11434/api/generate",
                json={
                    "model": "nous-hermes3:70b",
                    "prompt": "Explain neural networks in one paragraph.",
                    "stream": False,
                    "options": {"num_ctx": ctx, "num_predict": 100}
                }
            )
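            r.raise_for_status()  # stop on HTTP errors instead of failing on missing keys
            d = r.json()
            # All durations in the response are reported in nanoseconds
            tps = d["eval_count"] / (d["eval_duration"] / 1e9)
            # prompt_eval_duration (prefill only) is used as an approximation of TTFT, in ms
            ttft = d["prompt_eval_duration"] / 1e6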
            print(f"num_ctx={ctx:6d}: {tps:.1f} t/s, TTFT={ttft:.0f}ms, total={time.time()-start:.1f}s")

asyncio.run(benchmark_ctx([4096, 8192, 16384, 32768, 65536]))

Typical results on A100 80GB with Hermes 70B Q4:

num_ctx=  4096: 28.3 t/s, TTFT=450ms,  total=4.2s
num_ctx=  8192: 26.1 t/s, TTFT=890ms,  total=4.6s
num_ctx= 16384: 22.8 t/s, TTFT=1780ms, total=5.3s
num_ctx= 32768: 18.5 t/s, TTFT=3560ms, total=6.8s
num_ctx= 65536: 12.1 t/s, TTFT=7120ms, total=10.3s

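The TTFT increase at larger contexts comes from the prefill (prompt evaluation) phase, whose cost grows roughly linearly with the number of tokens it has to process. Two caveats apply when reproducing these numbers: prefill time depends on the prompt tokens actually sent rather than on num_ctx alone, so the prompt should be scaled to fill each context size, and changing num_ctx between requests causes Ollama to reload the model, which adds to the total time.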
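
The linear trend can be checked directly from the figures listed above. The short sketch below computes the TTFT ratio for each doubling of num_ctx; a ratio near 2 indicates linear growth, whereas a ratio near 4 would indicate quadratic growth. The numbers are copied from the results above, and doubling_ratios is an illustrative helper.

```python
# TTFT in milliseconds per num_ctx, copied from the results listed above
TTFT_MS = {4096: 450, 8192: 890, 16384: 1780, 32768: 3560, 65536: 7120}


def doubling_ratios(ttft_ms: dict) -> list:
    """Return TTFT(next size) / TTFT(size) for consecutive context sizes."""
    sizes = sorted(ttft_ms)
    return [ttft_ms[b] / ttft_ms[a] for a, b in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    for ratio in doubling_ratios(TTFT_MS):
        print(f"{ratio:.2f}")
```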


Chapter Summary

Ollama is the lowest-friction path to running Hermes locally:

  1. Installation: One command on all three platforms; automatic GPU driver detection
  2. Model management: ollama pull for Hub models; Modelfile for custom GGUF imports
  3. Core APIs: /api/chat and /api/generate cover most everyday use cases
  4. Critical parameters: num_ctx (context window) and num_gpu (GPU layers) have the biggest performance impact
  5. Agent integration: Set provider: ollama in your Hermes Agent config for seamless connection

Best practice: Use Ollama in development for its simplicity. Switch to vLLM (Chapter 45) for high-concurrency production workloads.

Review Questions

  1. Ollama's OLLAMA_KEEP_ALIVE setting (or the per-request keep_alive field) controls how long a model stays loaded in VRAM after its last request. If your server has 80 GB VRAM and you need to alternate between Hermes 70B and a 7B model, what keep-alive value minimizes user wait time while maximizing GPU utilization?

  2. The benchmark shows that TTFT nearly doubles with each doubling of num_ctx. What is the underlying computational reason for this? Can this overhead be reduced without sacrificing context capacity?

  3. The HermesOllamaAgent implementation grows its conversation_history indefinitely. Design a context management strategy that automatically trims history when it approaches the num_ctx limit, while preserving the most relevant parts of the conversation.
