Chapter 13

Local Model Deployment: Ollama, LM Studio and vLLM Integration


Overview

Local model deployment is one of OpenClaw's key advantages: data never leaves your machine, per-token inference cost is zero, and you are immune to network outages and latency spikes. This chapter covers integration with four mainstream local inference solutions, from principles to hands-on practice, along with model selection recommendations.


13.1 The Principle Behind OpenAI API-Compatible Endpoints

All of the local inference frameworks covered here (Ollama / LM Studio / vLLM / SGLang) expose HTTP endpoints in the OpenAI API-compatible format. OpenClaw leverages this: point base_url at the local service and you can reuse the same tool-calling and parameter interface.

Local inference framework
    |
    | Exposes OpenAI-compatible API
    | GET /v1/models
    | POST /v1/chat/completions
    | POST /v1/completions
    v
OpenClaw  <--- base_url: http://localhost:PORT/v1
    |
    | Reuses the same Tool Calling / Streaming interface
    v
Agent / Tool System

Key Configuration Fields

{
  "providers": {
    "local_inference": {
      "base_url": "http://localhost:11434/v1",
      "api_key": "not-required",
      "default_model": "ollama/llama3.2",
      "timeout_seconds": 120,
      "stream": true
    }
  }
}

Note: Local services typically don't validate the API key, but some frameworks still require a non-empty string in the field (e.g., "not-required" or "ollama").
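
Before wiring a local endpoint into OpenClaw, it is worth confirming that it actually speaks the OpenAI-compatible protocol. A minimal smoke test with curl, assuming Ollama on its default port 11434 with llama3.2 already pulled:

# List the models the local server exposes
curl http://localhost:11434/v1/models

# Minimal chat completion; the Bearer token can be any non-empty string
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer not-required" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": false
  }'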


13.2 Ollama: The Simplest Local Model Solution

Installation

macOS / Linux (recommended)

# One-line install
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version
# ollama version 0.4.x

Windows

Download the official installer: https://ollama.com/download/windows

Docker (isolated environment)

# CPU version
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# NVIDIA GPU version
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
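
When Ollama runs inside Docker, model management goes through the container; for example, to pull and smoke-test a model in the container named ollama above:

docker exec -it ollama ollama pull llama3.2
docker exec -it ollama ollama run llama3.2 "Say hello"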

Pulling Models

# General-purpose conversation models
ollama pull llama3.2          # 3B params, 2GB
ollama pull llama3.1:8b       # 8B params, ~4.9GB
ollama pull llama3.1:70b      # 70B params, requires large memory

# Code-specialized models
ollama pull qwen2.5-coder:7b  # Alibaba Qwen2.5 coding model
ollama pull deepseek-coder-v2 # DeepSeek coding model
ollama pull codellama:13b     # Meta CodeLlama

# Multimodal models (image support)
ollama pull llava:13b         # Supports image input
ollama pull moondream         # Lightweight vision model

# Reasoning models
ollama pull deepseek-r1:8b    # DeepSeek R1 reasoning model (local)
ollama pull qwq               # QwQ reasoning model

# List installed models
ollama list

Starting and Managing Ollama

# Start Ollama service (usually auto-starts with system)
ollama serve

# Run model test
ollama run llama3.2 "Tell me a joke"

# Stop service
pkill ollama

# View running models
ollama ps

# Remove model
ollama rm llama3.2

OpenClaw Configuration Integration

{
  "providers": {
    "ollama": {
      "base_url": "http://localhost:11434/v1",
      "api_key": "ollama",
      "models": {
        "ollama/llama3.2": {
          "context_length": 128000,
          "supports_tools": true
        },
        "ollama/deepseek-r1:8b": {
          "context_length": 32768,
          "think_mode": "high"
        }
      }
    }
  }
}
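
Because the configuration above marks llama3.2 with supports_tools, it is worth sanity-checking tool calling against Ollama's OpenAI-compatible endpoint before routing agent traffic to it. A rough sketch (get_weather is a made-up example function); if the model handles tools, the response contains a tool_calls entry rather than plain text:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'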

Ollama Environment Variable Tuning

# Set maximum parallel requests
export OLLAMA_NUM_PARALLEL=4

# Reserve GPU VRAM for the system and other processes (in bytes)
export OLLAMA_GPU_OVERHEAD=512000000  # ~512MB reserved for the system

# Cross-machine access (defaults to localhost only)
export OLLAMA_HOST=0.0.0.0:11434

# Model storage directory
export OLLAMA_MODELS=/data/ollama/models
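
Note that on Linux, where Ollama usually runs as a systemd service, plain export statements do not reach the server process. The common approach is a systemd override (a sketch, assuming the standard ollama.service installation):

# Open an override file for the service
sudo systemctl edit ollama

# Add the variables under a [Service] section, for example:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
#   Environment="OLLAMA_NUM_PARALLEL=4"

# Reload and restart to apply
sudo systemctl daemon-reload
sudo systemctl restart ollama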

13.3 LM Studio: GUI Management Interface

LM Studio provides an intuitive graphical interface, making it a good fit for users who want to manage and test local models without touching the command line.

Installation

Download: https://lmstudio.ai/

Supported platforms: macOS (M-chip optimized) / Windows / Linux

Configuring the Local Server

  1. Open LM Studio
  2. Select Local Server from the left menu
  3. Select the model to load
  4. Click Start Server
  5. Default port: 1234
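
The same server can also be driven from the command line via the lms CLI that ships with recent LM Studio versions; a rough sketch, assuming lms is installed and on your PATH (subcommands may vary between versions):

# Start the local server (default port 1234)
lms server start

# List downloaded models and load one
lms ls
lms load <model-name>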

Server Settings Recommendations

Port: 1234 (or custom)
CORS: enabled (allow cross-origin requests from OpenClaw)
Context Length: set according to model limit
GPU Layers: set to -1 (offload all to GPU)

OpenClaw Configuration Integration

{
  "providers": {
    "lmstudio": {
      "base_url": "http://localhost:1234/v1",
      "api_key": "lm-studio",
      "default_model": "lmstudio/current",
      "timeout_seconds": 180
    }
  }
}

Tip: LM Studio serves one loaded model at a time. Set the model id to the name of the currently loaded model, or use lmstudio/current to let OpenClaw query the server for whichever model is loaded.
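
To see exactly which model id the server is reporting, query the models endpoint (assuming the default port 1234):

curl http://localhost:1234/v1/models
# Returns a JSON list; the "id" of the loaded model is what lmstudio/current resolves to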

Model File Formats

LM Studio uses quantized models in GGUF format:

Quantization Level | File Size | Quality Loss | Recommended For
Q2_K               | Smallest  | Highest      | Extreme memory constraints
Q4_K_M             | Small     | Medium       | 8GB RAM, recommended for beginners
Q5_K_M             | Medium    | Low          | 16GB RAM, recommended for daily use
Q8_0               | Large     | Minimal      | 32GB RAM, near full precision
F16                | Largest   | None         | Professional training/evaluation
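
As a rough rule of thumb, a GGUF file's size is parameter count times bits per weight divided by 8. For example, an 8B model at the roughly 5 bits per weight that Q4_K_M averages comes out near 5GB on disk, which matches the sizes you typically see for popular 8B GGUF files:

# ~8e9 params x 4.9 bits / 8 bits per byte ≈ 4.9 GB on disk
awk 'BEGIN { printf "%.1f GB\n", 8e9 * 4.9 / 8 / 1e9 }'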

13.4 vLLM: High-Performance Production Inference Service

vLLM is designed for high-concurrency production scenarios. It supports continuous batching and PagedAttention, giving it throughput far beyond Ollama's under concurrent load.

Installation

# Recommended: install via pip (requires CUDA environment)
pip install vllm

# Verify installation
python -c "import vllm; print(vllm.__version__)"

Starting the vLLM Service

# Basic start: single GPU
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000 \
  --host 0.0.0.0

# Multi-GPU tensor parallelism (4x A100)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000

# Quantized inference (reduce VRAM requirements)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization awq \
  --port 8000

# Using GPTQ quantized models
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq \
  --port 8000

Running via Docker

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct

OpenClaw Configuration Integration

{
  "providers": {
    "vllm": {
      "base_url": "http://localhost:8000/v1",
      "api_key": "vllm",
      "default_model": "vllm/meta-llama/Meta-Llama-3-8B-Instruct",
      "timeout_seconds": 60,
      "max_concurrent_requests": 50
    }
  }
}

vLLM Performance Parameter Tuning

# Key startup parameters:
#   --max-model-len            maximum context length
#   --max-num-batched-tokens   batch token limit
#   --max-num-seqs             maximum concurrent sequences
#   --gpu-memory-utilization   fraction of GPU VRAM to use
#   --swap-space               CPU swap space in GB
#   --disable-log-requests     disable per-request logging in production
python -m vllm.entrypoints.openai.api_server \
  --model YOUR_MODEL \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90 \
  --swap-space 4 \
  --disable-log-requests
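
Continuous batching only shows its value under concurrent load, so after tuning it is worth firing parallel requests at the server rather than a single one. A rough smoke test with curl and xargs (adjust the model name to whatever you launched):

# 20 requests, 10 in parallel, against the local vLLM server
seq 20 | xargs -P 10 -I{} curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct",
       "messages": [{"role": "user", "content": "Reply with the word ok"}],
       "max_tokens": 8}' \
  -o /dev/null -w "request {} -> HTTP %{http_code}\n"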

13.5 SGLang: Structured Generation Inference Service

SGLang is particularly suitable for scenarios requiring structured output (JSON Schema-constrained generation).

Installation and Start

pip install "sglang[all]"

# Start service
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --host 0.0.0.0

OpenClaw Configuration Integration

{
  "providers": {
    "sglang": {
      "base_url": "http://localhost:30000/v1",
      "api_key": "sglang",
      "default_model": "sglang/llama-3-8b-instruct"
    }
  }
}
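
To exercise the JSON-Schema-constrained generation mentioned above, you can pass an OpenAI-style response_format to the SGLang server. A hedged sketch, assuming the SGLang build in use supports structured output through this field (the schema itself is a made-up example):

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Extract the city and country from: Paris is the capital of France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "country": {"type": "string"}
          },
          "required": ["city", "country"]
        }
      }
    }
  }'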

13.6 Privacy Advantages of Local Models

The Data-Stays-Local Principle

Cloud model data flow:
User input --> Local --> [Internet] --> Provider server --> Processing --> Return

Local model data flow:
User input --> Local --> [Local inference engine] --> Processing --> Return
                              (Data never leaves the machine)

Scenarios Well-Suited for Local Models

Scenario                            | Reason
Medical record analysis             | HIPAA compliance requires data to stay local
Legal document processing           | Attorney-client privilege protection
Source code review                  | Prevents code leakage to third parties
Financial data processing           | Financial regulatory compliance
Enterprise internal knowledge bases | Trade secret protection
Government/military scenarios       | Classified data processing

Compliance Comparison

Standard/Regulation     | Cloud Model                        | Local Model
GDPR (EU)               | Requires data processing agreement | Naturally compliant
HIPAA (US medical)      | Requires BAA agreement             | Naturally compliant
SOC 2 Type II           | Provider must be certified         | Self-controlled
China Data Security Law | Cross-border transfer risk         | No transfer risk

13.7 Performance Trade-offs: Local vs. Cloud

Latency Comparison

Metric                   | Local (RTX 4090) | Local (M3 Max)   | Cloud Haiku        | Cloud Sonnet
First token latency      | 50~200ms         | 100~500ms        | 300~600ms          | 500~1200ms
Generation speed (tok/s) | 60~120           | 30~80            | Provider-dependent | Provider-dependent
Concurrency              | Hardware-limited | Hardware-limited | Near unlimited     | Near unlimited
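
First-token latency on your own hardware is easy to measure: with streaming enabled, curl's time_starttransfer is roughly the time to the first token. A quick sketch against a local Ollama instance (the same pattern works for any endpoint in this chapter):

curl -s -o /dev/null \
  -w "first byte: %{time_starttransfer}s  total: %{time_total}s\n" \
  http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2",
       "messages": [{"role": "user", "content": "Write one sentence about the sea."}],
       "stream": true}'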

Comprehensive Comparison Table

Dimension    | Local Model                         | Cloud Model
Cost         | One-time hardware investment        | Pay-per-token
Privacy      | Full control                        | Depends on provider guarantees
Latency      | Hardware-dependent                  | Network latency added
Quality      | Usually below top-tier cloud models | Top-tier quality
Availability | 99.9% (no network dependency)       | Depends on Provider SLA
Maintenance  | High (self-managed)                 | Low
Scalability  | Limited by local hardware           | Elastic scaling

13.8 Model Selection Recommendations

Coding Task Recommendations

Model             | Parameters | VRAM Required | Recommended Framework
qwen2.5-coder:32b | 32B        | 24GB          | Ollama/vLLM
deepseek-coder-v2 | 16B        | 12GB          | Ollama/vLLM
codellama:13b     | 13B        | 10GB          | Ollama
starcoder2:15b    | 15B        | 11GB          | vLLM

General Conversation Recommendations

Model       | Parameters | VRAM Required | Characteristics
llama3.1:8b | 8B         | 6GB           | Meta open model, well-rounded
mistral:7b  | 7B         | 5GB           | European open source, strong multilingual
gemma2:9b   | 9B         | 7GB           | Google open source
phi3:14b    | 14B        | 11GB          | Microsoft small model, strong performance

Reasoning Task Recommendations

Model           | Parameters | VRAM Required | Characteristics
deepseek-r1:8b  | 8B         | 6GB           | Complete reasoning chain
deepseek-r1:32b | 32B        | 24GB          | Reasoning quality close to cloud
qwq:32b         | 32B        | 24GB          | Strong math reasoning

Multimodal Recommendations (Image Understanding)

Model           | Parameters | VRAM Required | Vision Capability
llava:13b       | 13B        | 10GB          | Basic image Q&A
llava-llama3:8b | 8B         | 6GB           | Llama 3 base
moondream2      | 2B         | 2GB           | Ultra-lightweight, embedded devices
minicpm-v       | 8B         | 6GB           | Optimized for Chinese-language scenarios
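
Before settling on one of these models, check how much VRAM is actually free on the machine; on NVIDIA hardware a quick query is:

# Total and free VRAM per GPU, in MiB
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

On Apple Silicon, memory is unified with the system, so plan for the model size plus a few GB of headroom.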

13.9 Hybrid Deployment Strategy

In real production environments, the best practice is combining local and cloud models:

{
  "hybrid_routing": {
    "rules": [
      {
        "condition": "data_sensitivity == 'high'",
        "route_to": "ollama/llama3.2",
        "reason": "Sensitive data must be processed locally"
      },
      {
        "condition": "complexity == 'high' AND data_sensitivity == 'low'",
        "route_to": "anthropic/claude-opus-4-6",
        "reason": "Complex tasks use the strongest cloud model"
      },
      {
        "condition": "default",
        "route_to": "anthropic/claude-sonnet-4-6",
        "reason": "Default to cloud balanced model"
      }
    ]
  }
}

Chapter Summary

This chapter covered wiring OpenClaw to four OpenAI-compatible local inference stacks: Ollama for the quickest setup, LM Studio for GUI-driven management, vLLM for high-throughput production serving, and SGLang for structured generation. It also examined the privacy and compliance case for keeping data local, the performance trade-offs versus cloud models, model recommendations by task, and a hybrid routing strategy that keeps sensitive data local while sending complex tasks to the cloud.

The next chapter covers the capability boundaries and applicable scenarios of OpenClaw's 16 built-in tools.
