Chapter 13: Local Model Deployment — Complete Integration Guide for Ollama / LM Studio / vLLM
Overview
Local model deployment is one of OpenClaw's distinctive advantages: data never leaves your machine, there is no per-token inference cost, and you are immune to network fluctuations. This chapter covers integration for four mainstream local inference solutions (Ollama, LM Studio, vLLM, and SGLang), from principles to hands-on practice, along with model selection recommendations.
13.1 The Principle Behind OpenAI API-Compatible Endpoints
All local inference frameworks (Ollama / LM Studio / vLLM / SGLang) expose HTTP endpoints in the OpenAI API-compatible format. OpenClaw leverages this — you only need to set base_url to point at the local service to reuse the same tool-calling and parameter interface.
Local inference framework
        |
        |  exposes an OpenAI-compatible API:
        |    GET  /v1/models
        |    POST /v1/chat/completions
        |    POST /v1/completions
        v
OpenClaw  <---  base_url: http://localhost:PORT/v1
        |
        |  reuses the same tool-calling / streaming interface
        v
Agent / Tool System
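Because the wire format is identical, any OpenAI-compatible client can talk to these servers directly. Here is a minimal sketch using the official openai Python package, assuming Ollama's default port and an installed llama3.2 model (substitute your own endpoint and model):

```python
from openai import OpenAI

# Same client, different base_url: point it at the local server instead of api.openai.com.
# The api_key is a placeholder; local servers generally ignore its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(resp.choices[0].message.content)
```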
Key Configuration Fields
{
  "providers": {
    "local_inference": {
      "base_url": "http://localhost:11434/v1",
      "api_key": "not-required",
      "default_model": "ollama/llama3.2",
      "timeout_seconds": 120,
      "stream": true
    }
  }
}
Note: Local services typically don't require an API key, but some frameworks expect any non-empty string (e.g., "not-required" or "ollama").
13.2 Ollama: The Simplest Local Model Solution
Installation
macOS / Linux (recommended)
# One-line install
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# ollama version 0.4.x
Windows
Download the official installer: https://ollama.com/download/windows
Docker (isolated environment)
# CPU version
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# NVIDIA GPU version
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Pulling Models
# General-purpose conversation models
ollama pull llama3.2 # 3B params, 2GB
ollama pull llama3.1:8b # 8B params, 4.7GB
ollama pull llama3.1:70b # 70B params, requires large memory
# Code-specialized models
ollama pull qwen2.5-coder:7b # Alibaba Qwen2.5 coding model
ollama pull deepseek-coder-v2 # DeepSeek coding model
ollama pull codellama:13b # Meta CodeLlama
# Multimodal models (image support)
ollama pull llava:13b # Supports image input
ollama pull moondream # Lightweight vision model
# Reasoning models
ollama pull deepseek-r1:8b # DeepSeek R1 reasoning model (local)
ollama pull qwq # QwQ reasoning model
# List installed models
ollama list
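Besides `ollama list`, the same inventory is available programmatically through Ollama's native REST API (note that /api/tags is Ollama's own endpoint, not part of the OpenAI-compatible shim):

```python
import requests

# Ollama's native API reports installed models along with their on-disk sizes.
r = requests.get("http://localhost:11434/api/tags", timeout=5)
for model in r.json().get("models", []):
    print(f"{model['name']}: {model['size'] / 1e9:.1f} GB")
```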
Starting and Managing Ollama
# Start Ollama service (usually auto-starts with system)
ollama serve
# Run model test
ollama run llama3.2 "Tell me a joke"
# Stop service
pkill ollama
# View running models
ollama ps
# Remove model
ollama rm llama3.2
OpenClaw Configuration Integration
{
  "providers": {
    "ollama": {
      "base_url": "http://localhost:11434/v1",
      "api_key": "ollama",
      "models": {
        "ollama/llama3.2": {
          "context_length": 128000,
          "supports_tools": true
        },
        "ollama/deepseek-r1:8b": {
          "context_length": 32768,
          "think_mode": "high"
        }
      }
    }
  }
}
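Since the config above marks llama3.2 with supports_tools, here is a sketch of what that tool-calling path looks like over the OpenAI-compatible shim. The get_weather tool is hypothetical, and whether structured tool_calls come back depends on the model:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# A hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# Models that support tools return structured tool_calls instead of prose.
print(resp.choices[0].message.tool_calls)
```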
Ollama Environment Variable Tuning
# Set maximum parallel requests
export OLLAMA_NUM_PARALLEL=4
# Reserve GPU VRAM for system overhead (bytes)
export OLLAMA_GPU_OVERHEAD=512000000 # 512MB reserved for system
# Cross-machine access (defaults to localhost only)
export OLLAMA_HOST=0.0.0.0:11434
# Model storage directory
export OLLAMA_MODELS=/data/ollama/models
13.3 LM Studio: GUI Management Interface
LM Studio provides an intuitive graphical interface, suitable for non-technical users managing and testing local models.
Installation
Download: https://lmstudio.ai/
Supported platforms: macOS (M-chip optimized) / Windows / Linux
Configuring the Local Server
- Open LM Studio
- Select Local Server from the left menu
- Select the model to load
- Click Start Server
- Default port: 1234
Server Settings Recommendations
- Port: 1234 (or custom)
- CORS: enabled (allow cross-origin requests from OpenClaw)
- Context Length: set according to the model's limit
- GPU Layers: set to -1 (offload all layers to GPU)
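Once the server is running, you can confirm streaming works end to end with the same OpenAI-style client. A minimal sketch, assuming LM Studio's default port; LM Studio serves whichever model is currently loaded, so the model id below is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Stream tokens from the currently loaded model.
stream = client.chat.completions.create(
    model="local-model",  # placeholder id; the loaded model answers regardless
    messages=[{"role": "user", "content": "Explain GGUF quantization in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```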
OpenClaw Configuration Integration
{
"providers": {
"lmstudio": {
"base_url": "http://localhost:1234/v1",
"api_key": "lm-studio",
"default_model": "lmstudio/current",
"timeout_seconds": 180
}
}
}
Tip: LM Studio loads only one model at a time. The model id is typically set to the name of the currently loaded model, or use lmstudio/current to let OpenClaw automatically query which model is loaded.
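Resolving that alias comes down to asking the server which model it has loaded. A sketch of the lookup (lmstudio/current itself is OpenClaw's alias; the /v1/models call is standard):

```python
import requests

# LM Studio serves one model at a time, so the first /v1/models entry is the loaded one.
r = requests.get("http://localhost:1234/v1/models", timeout=5)
models = r.json().get("data", [])
print(models[0]["id"] if models else "no model loaded")
```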
Model File Formats
LM Studio uses quantized models in GGUF format:
| Quantization Level | File Size | Quality Loss | Recommended For |
|---|---|---|---|
| Q2_K | Smallest | Highest | Extreme memory constraints |
| Q4_K_M | Small | Medium | 8GB RAM, recommended for beginners |
| Q5_K_M | Medium | Low | 16GB RAM, recommended for daily use |
| Q8_0 | Large | Minimal | 32GB RAM, near full precision |
| F16 | Largest | None | Professional training/evaluation |
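As a rule of thumb, GGUF file size is roughly parameters × bits-per-weight ÷ 8. A back-of-envelope sketch; the bits-per-weight figures below are rough averages I'm assuming, since K-quants mix precisions across tensors:

```python
# Approximate effective bits per weight for common GGUF quant levels (assumption:
# real K-quants mix precisions per tensor, so treat these as rough averages).
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Estimate GGUF file size in GB for a given parameter count and quant level."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"8B model @ {quant}: ~{gguf_size_gb(8, quant):.1f} GB")
```

For example, an 8B model at Q4_K_M comes out near 4.8 GB, which matches the "small" tier in the table above.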
13.4 vLLM: High-Performance Production Inference Service
vLLM is designed for high-concurrency production scenarios; continuous batching and PagedAttention give it throughput far beyond Ollama's.
Installation
# Recommended: install via pip (requires CUDA environment)
pip install vllm
# Verify installation
python -c "import vllm; print(vllm.__version__)"
Starting the vLLM Service
# Basic start: single GPU
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--port 8000 \
--host 0.0.0.0
# Multi-GPU tensor parallelism (4x A100)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000
# Quantized inference (reduce VRAM requirements)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--quantization awq \
--port 8000
# Using GPTQ quantized models
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-13B-GPTQ \
--quantization gptq \
--port 8000
Running via Docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct
OpenClaw Configuration Integration
{
  "providers": {
    "vllm": {
      "base_url": "http://localhost:8000/v1",
      "api_key": "vllm",
      "default_model": "vllm/meta-llama/Meta-Llama-3-8B-Instruct",
      "timeout_seconds": 60,
      "max_concurrent_requests": 50
    }
  }
}
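The max_concurrent_requests setting pays off because vLLM folds in-flight requests into shared batches. A sketch of firing parallel requests with the async OpenAI client; the model name matches the server start command above, so adjust it to yours:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Eight requests in flight at once; continuous batching merges them into shared GPU steps.
    prompts = [f"One-sentence fact about planet #{i}." for i in range(8)]
    for answer in await asyncio.gather(*(ask(p) for p in prompts)):
        print(answer)

asyncio.run(main())
```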
vLLM Performance Parameter Tuning
# Key startup parameters:
#   --max-model-len            maximum context length
#   --max-num-batched-tokens   per-batch token limit
#   --max-num-seqs             maximum concurrent sequences
#   --gpu-memory-utilization   fraction of GPU VRAM to use
#   --swap-space               CPU swap space (GB)
#   --disable-log-requests     disable request logging in production
python -m vllm.entrypoints.openai.api_server \
  --model YOUR_MODEL \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90 \
  --swap-space 4 \
  --disable-log-requests
13.5 SGLang: Structured Generation Inference Service
SGLang is particularly suitable for scenarios requiring structured output (JSON Schema-constrained generation).
Installation and Start
pip install "sglang[all]"
# Start service
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000 \
--host 0.0.0.0
OpenClaw Configuration Integration
{
  "providers": {
    "sglang": {
      "base_url": "http://localhost:30000/v1",
      "api_key": "sglang",
      "default_model": "sglang/llama-3-8b-instruct"
    }
  }
}
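To exercise the structured-output strength, you can pass a JSON Schema through the OpenAI-style response_format field. A sketch; support for this field varies across SGLang versions, so treat it as an assumption and check your server's documentation:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="sglang")

# Constrain generation to this schema (assumes the server honors json_schema response_format).
schema = {
    "type": "object",
    "properties": {"language": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["language", "year"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a programming language and its release year."}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "lang_info", "schema": schema}},
)
print(resp.choices[0].message.content)  # should parse as the schema above
```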
13.6 Privacy Advantages of Local Models
The Data-Stays-Local Principle
Cloud model data flow:
User input --> Local --> [Internet] --> Provider server --> Processing --> Return
Local model data flow:
User input --> Local --> [Local inference engine] --> Processing --> Return
(Data never leaves the machine)
Scenarios Well-Suited for Local Models
| Scenario | Reason |
|---|---|
| Medical record analysis | HIPAA compliance requires data to stay local |
| Legal document processing | Attorney-client privilege protection |
| Source code review | Prevents code leakage to third parties |
| Financial data processing | Financial regulatory compliance |
| Enterprise internal knowledge bases | Trade secret protection |
| Government/military scenarios | Classified data processing |
Compliance Comparison
| Standard/Regulation | Cloud Model | Local Model |
|---|---|---|
| GDPR (EU) | Requires a data processing agreement | Data stays local, greatly simplifying compliance |
| HIPAA (US medical) | Requires a BAA with the provider | Data stays local, greatly simplifying compliance |
| SOC 2 Type II | Provider must be certified | Self-controlled |
| China Data Security Law | Cross-border transfer risk | No transfer risk |
13.7 Performance Trade-offs: Local vs. Cloud
Latency Comparison
| Metric | Local (RTX 4090) | Local (M3 Max) | Cloud Haiku | Cloud Sonnet |
|---|---|---|---|---|
| First token latency | 50~200ms | 100~500ms | 300~600ms | 500~1200ms |
| Generation speed (tok/s) | 60~120 | 30~80 | Provider-dependent | Provider-dependent |
| Concurrency | Hardware-limited | Hardware-limited | Near unlimited | Near unlimited |
Comprehensive Comparison Table
| Dimension | Local Model | Cloud Model |
|---|---|---|
| Cost | One-time hardware investment | Pay-per-token |
| Privacy | Full control | Depends on Provider promises |
| Latency | Hardware-dependent | Network latency added |
| Quality | Usually below top-tier cloud | Top-tier quality |
| Availability | Self-controlled (no network dependency) | Depends on provider SLA |
| Maintenance | High (self-managed) | Low |
| Scalability | Limited by local hardware | Elastic scaling |
13.8 Recommended Model Selection
Coding Task Recommendations
| Model | Parameters | VRAM Required | Recommended Framework |
|---|---|---|---|
| qwen2.5-coder:32b | 32B | 24GB | Ollama/vLLM |
| deepseek-coder-v2 | 16B | 12GB | Ollama/vLLM |
| codellama:13b | 13B | 10GB | Ollama |
| starcoder2:15b | 15B | 11GB | vLLM |
General Conversation Recommendations
| Model | Parameters | VRAM Required | Characteristics |
|---|---|---|---|
| llama3.1:8b | 8B | 6GB | Meta's well-rounded workhorse |
| mistral:7b | 7B | 5GB | European open source, great multilingual |
| gemma2:9b | 9B | 7GB | Google open source |
| phi3:14b | 14B | 11GB | Microsoft small model, strong performance |
Reasoning Task Recommendations
| Model | Parameters | VRAM Required | Characteristics |
|---|---|---|---|
| deepseek-r1:8b | 8B | 6GB | Complete reasoning chain |
| deepseek-r1:32b | 32B | 24GB | Reasoning quality close to cloud |
| qwq:32b | 32B | 24GB | Strong math reasoning |
Multimodal Recommendations (Image Understanding)
| Model | Parameters | VRAM Required | Vision Capability |
|---|---|---|---|
| llava:13b | 13B | 10GB | Basic image Q&A |
| llava-llama3:8b | 8B | 6GB | Llama 3 base |
| moondream2 | 2B | 2GB | Ultra-lightweight, embedded devices |
| minicpm-v | 8B | 6GB | Optimized for Chinese-language scenarios |
13.9 Hybrid Deployment Strategy
In real production environments, the best practice is combining local and cloud models:
{
  "hybrid_routing": {
    "rules": [
      {
        "condition": "data_sensitivity == 'high'",
        "route_to": "ollama/llama3.2",
        "reason": "Sensitive data must be processed locally"
      },
      {
        "condition": "complexity == 'high' AND data_sensitivity == 'low'",
        "route_to": "anthropic/claude-opus-4-6",
        "reason": "Complex tasks use the strongest cloud model"
      },
      {
        "condition": "default",
        "route_to": "anthropic/claude-sonnet-4-6",
        "reason": "Default to the balanced cloud model"
      }
    ]
  }
}
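The rules above are declarative; in code, the dispatch reduces to a small cascade. A sketch with illustrative field names, not OpenClaw's actual internals:

```python
# Hypothetical router mirroring the JSON rules above; field names are illustrative.
def route(data_sensitivity: str, complexity: str) -> str:
    if data_sensitivity == "high":
        return "ollama/llama3.2"            # sensitive data never leaves the machine
    if complexity == "high" and data_sensitivity == "low":
        return "anthropic/claude-opus-4-6"  # hard problems go to the strongest cloud model
    return "anthropic/claude-sonnet-4-6"    # balanced cloud default

assert route("high", "high") == "ollama/llama3.2"
assert route("low", "high") == "anthropic/claude-opus-4-6"
assert route("low", "low") == "anthropic/claude-sonnet-4-6"
```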
Chapter Summary
- All local inference frameworks expose OpenAI API-compatible endpoints; OpenClaw integrates via base_url
- Ollama is easy to install, suitable for individual developers and small teams
- LM Studio provides a GUI, suitable for non-technical users
- vLLM is a high-performance production inference engine supporting multi-GPU parallelism and continuous batching
- SGLang is particularly suitable for structured output scenarios
- The core value of local models: data privacy + zero inference cost + no network dependency
- Hybrid cloud and local deployment is the best practice for production environments
The next chapter covers the capability boundaries and applicable scenarios of OpenClaw's 16 built-in tools.