Chapter 13: Local Model Deployment — Complete Integration Guide for Ollama / LM Studio / vLLM
Overview
Local model deployment is one of OpenClaw's distinctive advantages: data never leaves your machine, there is no per-token inference cost, and you are immune to network fluctuations. This chapter covers integration for four mainstream local inference solutions (Ollama, LM Studio, vLLM, and SGLang), from principles to hands-on practice, along with model selection recommendations.
13.1 The Principle Behind OpenAI API-Compatible Endpoints
All local inference frameworks (Ollama / LM Studio / vLLM / SGLang) expose HTTP endpoints in the OpenAI API-compatible format. OpenClaw leverages this — you only need to set base_url to point at the local service to reuse the same tool-calling and parameter interface.
Local inference framework
        |
        |  exposes an OpenAI-compatible API:
        |    GET  /v1/models
        |    POST /v1/chat/completions
        |    POST /v1/completions
        v
OpenClaw  <---  base_url: http://localhost:PORT/v1
        |
        |  reuses the same tool-calling / streaming interface
        v
Agent / Tool System
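Because the wire format is identical, any OpenAI-compatible client can talk to these servers directly. Here is a minimal sketch using the official openai Python package, assuming Ollama's default port and an installed llama3.2 model (substitute your own endpoint and model):

```python
from openai import OpenAI

# Same client, different base_url: point it at the local server instead of api.openai.com.
# The api_key is a placeholder; local servers generally ignore its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(resp.choices[0].message.content)
```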
Key Configuration Fields
{
  "providers": {
    "local_inference": {
      "base_url": "http://localhost:11434/v1",
      "api_key": "not-required",
      "default_model": "ollama/llama3.2",
      "timeout_seconds": 120,
      "stream": true
    }
  }
}
Note: Local services typically don't require an API key, but some frameworks expect any non-empty string (e.g., "not-required" or "ollama").
13.2 Ollama: The Simplest Local Model Solution
Installation
macOS / Linux (recommended)
# One-line install
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# ollama version 0.4.x
Windows
Download the official installer: https://ollama.com/download/windows
Docker (isolated environment)
# CPU version
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# NVIDIA GPU version
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Pulling Models
# General-purpose conversation models
ollama pull llama3.2 # 3B params, 2GB
ollama pull llama3.1:8b # 8B params, 4.7GB
ollama pull llama3.1:70b # 70B params, requires large memory
# Code-specialized models
ollama pull qwen2.5-coder:7b # Alibaba Qwen2.5 coding model
ollama pull deepseek-coder-v2 # DeepSeek coding model
ollama pull codellama:13b # Meta CodeLlama
# Multimodal models (image support)
ollama pull llava:13b # Supports image input
ollama pull moondream # Lightweight vision model
# Reasoning models
ollama pull deepseek-r1:8b # DeepSeek R1 reasoning model (local)
ollama pull qwq # QwQ reasoning model
# List installed models
ollama list
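Besides `ollama list`, the same inventory is available programmatically through Ollama's native REST API (note that /api/tags is Ollama's own endpoint, not part of the OpenAI-compatible shim):

```python
import requests

# Ollama's native API reports installed models along with their on-disk sizes.
r = requests.get("http://localhost:11434/api/tags", timeout=5)
for model in r.json().get("models", []):
    print(f"{model['name']}: {model['size'] / 1e9:.1f} GB")
```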
Starting and Managing Ollama
# Start Ollama service (usually auto-starts with system)
ollama serve
# Run model test
ollama run llama3.2 "Tell me a joke"
# Stop service
pkill ollama
# View running models
ollama ps
# Remove model
ollama rm llama3.2
OpenClaw Configuration Integration
{
  "providers": {
    "ollama": {
      "base_url": "http://localhost:11434/v1",
      "api_key": "ollama",
      "models": {
        "ollama/llama3.2": {
          "context_length": 128000,
          "supports_tools": true
        },
        "ollama/deepseek-r1:8b": {
          "context_length": 32768,
          "think_mode": "high"
        }
      }
    }
  }
}
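Since the config above marks llama3.2 with supports_tools, here is a sketch of what that tool-calling path looks like over the OpenAI-compatible shim. The get_weather tool is hypothetical, and whether structured tool_calls come back depends on the model:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# A hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# Models that support tools return structured tool_calls instead of prose.
print(resp.choices[0].message.tool_calls)
```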
Ollama Environment Variable Tuning
# Set maximum parallel requests
export OLLAMA_NUM_PARALLEL=4
# Reserve GPU VRAM for system overhead (bytes)
export OLLAMA_GPU_OVERHEAD=512000000 # 512MB reserved for system
# Cross-machine access (defaults to localhost only)
export OLLAMA_HOST=0.0.0.0:11434
# Model storage directory
export OLLAMA_MODELS=/data/ollama/models
13.3 LM Studio: GUI Management Interface
LM Studio provides an intuitive graphical interface, suitable for non-technical users managing and testing local models.
Installation
Download: https://lmstudio.ai/
Supported platforms: macOS (M-chip optimized) / Windows / Linux
Configuring the Local Server
- Open LM Studio
- Select Local Server from the left menu
- Select the model to load
- Click Start Server
- Default port: 1234
Server Settings Recommendations
- Port: 1234 (or custom)
- CORS: enabled (allow cross-origin requests from OpenClaw)
- Context Length: set according to the model's limit
- GPU Layers: set to -1 (offload all layers to GPU)
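Once the server is running, you can confirm streaming works end to end with the same OpenAI-style client. A minimal sketch, assuming LM Studio's default port; LM Studio serves whichever model is currently loaded, so the model id below is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Stream tokens from the currently loaded model.
stream = client.chat.completions.create(
    model="local-model",  # placeholder id; the loaded model answers regardless
    messages=[{"role": "user", "content": "Explain GGUF quantization in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```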
OpenClaw Configuration Integration
{
"providers": {
"lmstudio": {
"base_url": "http://localhost:1234/v1",
"api_key": "lm-studio",
"default_model": "lmstudio/current",
"timeout_seconds": 180
}
}
}
Tip: LM Studio loads only one model at a time. The model id is typically set to the name of the currently loaded model, or use lmstudio/current to let OpenClaw automatically query which model is loaded.
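Resolving that alias comes down to asking the server which model it has loaded. A sketch of the lookup (lmstudio/current itself is OpenClaw's alias; the /v1/models call is standard):

```python
import requests

# LM Studio serves one model at a time, so the first /v1/models entry is the loaded one.
r = requests.get("http://localhost:1234/v1/models", timeout=5)
models = r.json().get("data", [])
print(models[0]["id"] if models else "no model loaded")
```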
Model File Formats
LM Studio uses quantized models in GGUF format:
| Quantization Level | File Size | Quality Loss | Recommended For |
|---|---|---|---|
| Q2_K | Smallest | Highest | Extreme memory constraints |
| Q4_K_M | Small | Medium | 8GB RAM, recommended for beginners |
| Q5_K_M | Medium | Low | 16GB RAM, recommended for daily use |
| Q8_0 | Large | Minimal | 32GB RAM, near full precision |
| F16 | Largest | None | Professional training/evaluation |
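As a rule of thumb, GGUF file size is roughly parameters × bits-per-weight ÷ 8. A back-of-envelope sketch; the bits-per-weight figures below are rough averages I'm assuming, since K-quants mix precisions across tensors:

```python
# Approximate effective bits per weight for common GGUF quant levels (assumption:
# real K-quants mix precisions per tensor, so treat these as rough averages).
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Estimate GGUF file size in GB for a given parameter count and quant level."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"8B model @ {quant}: ~{gguf_size_gb(8, quant):.1f} GB")
```

For example, an 8B model at Q4_K_M comes out near 4.8 GB, which matches the "small" tier in the table above.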
13.4 vLLM: High-Performance Production Inference Service
vLLM is designed for high-concurrency production scenarios; continuous batching and PagedAttention give it throughput far beyond Ollama's.
Installation
# Recommended: install via pip (requires CUDA environment)
pip install vllm
# Verify installation
python -c "import vllm; print(vllm.__version__)"
Starting the vLLM Service
# Basic start: single GPU
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--port 8000 \
--host 0.0.0.0
# Multi-GPU tensor parallelism (4x A100)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000
# Quantized inference (reduce VRAM requirements)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--quantization awq \
--port 8000
# Using GPTQ quantized models
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-13B-GPTQ \
--quantization gptq \
--port 8000
Running via Docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct
OpenClaw Configuration Integration
{
  "providers": {
    "vllm": {
      "base_url": "http://localhost:8000/v1",
      "api_key": "vllm",
      "default_model": "vllm/meta-llama/Meta-Llama-3-8B-Instruct",
      "timeout_seconds": 60,
      "max_concurrent_requests": 50
    }
  }
}
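The max_concurrent_requests setting pays off because vLLM folds in-flight requests into shared batches. A sketch of firing parallel requests with the async OpenAI client; the model name matches the server start command above, so adjust it to yours:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Eight requests in flight at once; continuous batching merges them into shared GPU steps.
    prompts = [f"One-sentence fact about planet #{i}." for i in range(8)]
    for answer in await asyncio.gather(*(ask(p) for p in prompts)):
        print(answer)

asyncio.run(main())
```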
vLLM Performance Parameter Tuning
# Key startup parameters:
#   --max-model-len            maximum context length
#   --max-num-batched-tokens   per-batch token limit
#   --max-num-seqs             maximum concurrent sequences
#   --gpu-memory-utilization   fraction of GPU VRAM to use
#   --swap-space               CPU swap space (GB)
#   --disable-log-requests     disable request logging in production
python -m vllm.entrypoints.openai.api_server \
  --model YOUR_MODEL \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90 \
  --swap-space 4 \
  --disable-log-requests
13.5 SGLang: Structured Generation Inference Service
SGLang is particularly suitable for scenarios requiring structured output (JSON Schema-constrained generation).
Installation and Start
pip install "sglang[all]"
# Start service
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000 \
--host 0.0.0.0
OpenClaw Configuration Integration
{
  "providers": {
    "sglang": {
      "base_url": "http://localhost:30000/v1",
      "api_key": "sglang",
      "default_model": "sglang/llama-3-8b-instruct"
    }
  }
}
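To exercise the structured-output strength, you can pass a JSON Schema through the OpenAI-style response_format field. A sketch; support for this field varies across SGLang versions, so treat it as an assumption and check your server's documentation:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="sglang")

# Constrain generation to this schema (assumes the server honors json_schema response_format).
schema = {
    "type": "object",
    "properties": {"language": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["language", "year"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a programming language and its release year."}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "lang_info", "schema": schema}},
)
print(resp.choices[0].message.content)  # should parse as the schema above
```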
13.6 Privacy Advantages of Local Models
The Data-Stays-Local Principle
Cloud model data flow:
User input --> Local --> [Internet] --> Provider server --> Processing --> Return
Local model data flow:
User input --> Local --> [Local inference engine] --> Processing --> Return
(Data never leaves the machine)
Scenarios Well-Suited for Local Models
| Scenario | Reason |
|---|---|
| Medical record analysis | HIPAA compliance requires data to stay local |
| Legal document processing | Attorney-client privilege protection |
| Source code review | Prevents code leakage to third parties |
| Financial data processing | Financial regulatory compliance |
| Enterprise internal knowledge bases | Trade secret protection |
| Government/military scenarios | Classified data processing |
Compliance Comparison
| Standard/Regulation | Cloud Model | Local Model |
|---|---|---|
| GDPR (EU) | Requires a data processing agreement | Data stays local, greatly simplifying compliance |
| HIPAA (US medical) | Requires a BAA with the provider | Data stays local, greatly simplifying compliance |
| SOC 2 Type II | Provider must be certified | Self-controlled |
| China Data Security Law | Cross-border transfer risk | No transfer risk |
13.7 Performance Trade-offs: Local vs. Cloud
Latency Comparison
| Metric | Local (RTX 4090) | Local (M3 Max) | Cloud Haiku | Cloud Sonnet |
|---|---|---|---|---|
| First token latency | 50~200ms | 100~500ms | 300~600ms | 500~1200ms |
| Generation speed (tok/s) | 60~120 | 30~80 | Provider-dependent | Provider-dependent |
| Concurrency | Hardware-limited | Hardware-limited | Near unlimited | Near unlimited |
Comprehensive Comparison Table
| Dimension | Local Model | Cloud Model |
|---|---|---|
| Cost | One-time hardware investment | Pay-per-token |
| Privacy | Full control | Depends on Provider promises |
| Latency | Hardware-dependent | Network latency added |
| Quality | Usually below top-tier cloud | Top-tier quality |
| Availability | Self-controlled (no network dependency) | Depends on provider SLA |
| Maintenance | High (self-managed) | Low |
| Scalability | Limited by local hardware | Elastic scaling |
13.8 Recommended Model Selection
Coding Task Recommendations
| Model | Parameters | VRAM Required | Recommended Framework |
|---|---|---|---|
| qwen2.5-coder:32b | 32B | 24GB | Ollama/vLLM |
| deepseek-coder-v2 | 16B | 12GB | Ollama/vLLM |
| codellama:13b | 13B | 10GB | Ollama |
| starcoder2:15b | 15B | 11GB | vLLM |
General Conversation Recommendations
| Model | Parameters | VRAM Required | Characteristics |
|---|---|---|---|
| llama3.1:8b | 8B | 6GB | Meta's well-rounded workhorse |
| mistral:7b | 7B | 5GB | European open source, great multilingual |
| gemma2:9b | 9B | 7GB | Google open source |
| phi3:14b | 14B | 11GB | Microsoft small model, strong performance |
Reasoning Task Recommendations
| Model | Parameters | VRAM Required | Characteristics |
|---|---|---|---|
| deepseek-r1:8b | 8B | 6GB | Complete reasoning chain |
| deepseek-r1:32b | 32B | 24GB | Reasoning quality close to cloud |
| qwq:32b | 32B | 24GB | Strong math reasoning |
Multimodal Recommendations (Image Understanding)
| Model | Parameters | VRAM Required | Vision Capability |
|---|---|---|---|
| llava:13b | 13B | 10GB | Basic image Q&A |
| llava-llama3:8b | 8B | 6GB | Llama 3 base |
| moondream2 | 2B | 2GB | Ultra-lightweight, embedded devices |
| minicpm-v | 8B | 6GB | Optimized for Chinese-language scenarios |
13.9 Hybrid Deployment Strategy
In real production environments, the best practice is combining local and cloud models:
{
  "hybrid_routing": {
    "rules": [
      {
        "condition": "data_sensitivity == 'high'",
        "route_to": "ollama/llama3.2",
        "reason": "Sensitive data must be processed locally"
      },
      {
        "condition": "complexity == 'high' AND data_sensitivity == 'low'",
        "route_to": "anthropic/claude-opus-4-6",
        "reason": "Complex tasks use the strongest cloud model"
      },
      {
        "condition": "default",
        "route_to": "anthropic/claude-sonnet-4-6",
        "reason": "Default to the balanced cloud model"
      }
    ]
  }
}
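The rules above are declarative; in code, the dispatch reduces to a small cascade. A sketch with illustrative field names, not OpenClaw's actual internals:

```python
# Hypothetical router mirroring the JSON rules above; field names are illustrative.
def route(data_sensitivity: str, complexity: str) -> str:
    if data_sensitivity == "high":
        return "ollama/llama3.2"            # sensitive data never leaves the machine
    if complexity == "high" and data_sensitivity == "low":
        return "anthropic/claude-opus-4-6"  # hard problems go to the strongest cloud model
    return "anthropic/claude-sonnet-4-6"    # balanced cloud default

assert route("high", "high") == "ollama/llama3.2"
assert route("low", "high") == "anthropic/claude-opus-4-6"
assert route("low", "low") == "anthropic/claude-sonnet-4-6"
```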
Chapter Summary
- All local inference frameworks expose OpenAI API-compatible endpoints; OpenClaw integrates via base_url
- Ollama is easy to install, suitable for individual developers and small teams
- LM Studio provides a GUI, suitable for non-technical users
- vLLM is a high-performance production inference engine supporting multi-GPU parallelism and continuous batching
- SGLang is particularly suitable for structured output scenarios
- The core value of local models: data privacy + zero inference cost + no network dependency
- Hybrid cloud and local deployment is the best practice for production environments
The next chapter covers the capability boundaries and applicable scenarios of OpenClaw's 16 built-in tools.