Description

Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted local LLM inference routing on macOS, Linux, a...

Usage Instructions (SKILL.md)

Local LLM Router

Name: Local Llm Router
Author: twinsgeeks

You are managing a local LLM inference router that distributes local LLM requests across multiple Ollama instances using a 7-signal local LLM scoring engine.

What this local LLM router solves

You have multiple machines with GPUs but your local LLM inference scripts only talk to one. Switching local LLM models between machines means editing configs and restarting. There's no way to compare local LLM latency across nodes, no automatic local LLM failover, and no visibility into which machine handles which local LLM requests.

This local LLM router sits in front of your Ollama instances and picks the optimal device for every local LLM request — based on what local LLM models are hot in memory, how much headroom each machine has, how deep the local LLM queues are, and historical local LLM latency data. Drop-in compatible with the OpenAI SDK and Ollama API.

Setup Local LLM Router

pip install ollama-herd           # install the local LLM router
herd                              # launch the local LLM router (scores and routes)
herd-node                         # launch a local LLM node agent on each device

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Local LLM Router Endpoint

The local LLM router runs at http://localhost:11435 by default. Point any OpenAI-compatible client at http://localhost:11435/v1 for local LLM inference.

# local_llm_client — connect to the local LLM router
from openai import OpenAI
local_llm_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
local_llm_response = local_llm_client.chat.completions.create(
    model="llama3.3:70b",  # local LLM model
    messages=[{"role": "user", "content": "Hello from local LLM"}],
    stream=True,
)
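Because the request above sets stream=True, the call returns an iterator of chunks rather than a finished message. A minimal sketch of consuming it follows. `collect_stream` and `fake_chunk` are illustrative helpers, not part of ollama-herd or the OpenAI SDK; the stand-in chunks only mimic the `choices[0].delta.content` shape that the SDK's streaming objects expose.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's content is typically None
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in shaped like a streaming chunk, for offline use."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


print(collect_stream([fake_chunk("Hel"), fake_chunk("lo"), fake_chunk(None)]))
```

Against a running router, the same helper would be called as collect_stream(local_llm_response).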

Local LLM Scoring Engine

Every local LLM request is scored across 7 signals:

Thermal state (+50 pts) — local LLM models already loaded in GPU memory ("hot") score highest
Memory fit (+20 pts) — local LLM nodes with more available headroom score higher
Queue depth (-30 pts) — busy local LLM nodes get penalized
Latency history (-25 pts) — past p75 local LLM latency from SQLite informs expected wait
Role affinity (+15 pts) — large local LLM models prefer big machines
Availability trend (+10 pts) — local LLM nodes with stable availability patterns score higher
Context fit (+15 pts) — local LLM nodes with loaded context windows that fit the estimated token count
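The point values above can be read as weights in a per-node score. The sketch below is illustrative only: the source lists the seven signals and their point values but not how each signal is measured or normalized, so the `NodeSignals` fields, the 0.0 to 1.0 scaling, the linear sum, and the node name "mac-mini" are all assumptions rather than ollama-herd's actual implementation.

```python
from dataclasses import dataclass

# Maximum contribution of each signal, taken from the list above.
# Negative weights are penalties.
WEIGHTS = {
    "thermal": 50,        # model already hot in memory
    "memory_fit": 20,     # available headroom
    "queue_depth": -30,   # busy nodes are penalized
    "latency": -25,       # historical p75 latency
    "role_affinity": 15,  # large models prefer big machines
    "availability": 10,   # stable availability trend
    "context_fit": 15,    # loaded context fits the estimated tokens
}


@dataclass
class NodeSignals:
    """Hypothetical per-node inputs, each normalized to 0.0-1.0."""
    name: str
    thermal: float = 0.0
    memory_fit: float = 0.0
    queue_depth: float = 0.0
    latency: float = 0.0
    role_affinity: float = 0.0
    availability: float = 0.0
    context_fit: float = 0.0


def score_node(node: NodeSignals) -> float:
    """Weighted sum of the seven signals (assumed linear combination)."""
    return sum(weight * getattr(node, key) for key, weight in WEIGHTS.items())


def pick_node(nodes: list[NodeSignals]) -> NodeSignals:
    """Return the highest-scoring node."""
    return max(nodes, key=score_node)


hot_but_busy = NodeSignals("mac-studio", thermal=1.0, memory_fit=0.5, queue_depth=0.5)
cold_and_idle = NodeSignals("mac-mini", memory_fit=1.0, availability=1.0)
best = pick_node([hot_but_busy, cold_and_idle])
print(best.name, score_node(best))
```

With these made-up inputs the hot node wins despite its queue, which matches the intent that an already-loaded model outweighs moderate load.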

Local LLM Context-size Protection

When clients send num_ctx in local LLM requests, the local LLM router intercepts it to prevent Ollama from reloading models unnecessarily:

num_ctx <= loaded context: stripped (local LLM model already supports it)
num_ctx > loaded context: auto-upgrades to a larger loaded local LLM model with sufficient context
Configurable via FLEET_CONTEXT_PROTECTION (strip/warn/passthrough)
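The decision described above can be sketched as a small function. This is an assumption-laden illustration, not the router's code: `protect_num_ctx` is a hypothetical name, the behavior of the warn mode (leave the value in place and flag it) is a guess, and the upgrade step is reduced to a "needs_upgrade" signal because the source does not say how the larger model is chosen.

```python
def protect_num_ctx(options: dict, loaded_ctx: int, mode: str = "strip") -> tuple[dict, str]:
    """Return (possibly modified options, action taken).

    loaded_ctx is the context size of the model already loaded on the node.
    """
    requested = options.get("num_ctx")
    if requested is None or mode == "passthrough":
        return options, "unchanged"
    if requested <= loaded_ctx:
        if mode == "warn":
            # Assumed meaning of "warn": keep the value but flag it.
            return options, "warned"
        # The loaded model already covers the request, so drop num_ctx
        # to avoid triggering a reload.
        cleaned = {k: v for k, v in options.items() if k != "num_ctx"}
        return cleaned, "stripped"
    # Request exceeds the loaded context: the caller would look for a
    # larger already-loaded model with sufficient context.
    return options, "needs_upgrade"


print(protect_num_ctx({"num_ctx": 4096, "temperature": 0.2}, loaded_ctx=8192))
print(protect_num_ctx({"num_ctx": 32768}, loaded_ctx=8192))
```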

Local LLM API Endpoints

Local LLM Fleet Status

# local_llm_fleet_status — all local LLM nodes and queues
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

List all local LLM models across the fleet

# local_llm_model_list — every local LLM model on every node
curl -s http://localhost:11435/api/tags | python3 -m json.tool

Local LLM models currently loaded in memory (hot)

# local_llm_hot_models — local LLM models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool

OpenAI-compatible local LLM model list

curl -s http://localhost:11435/v1/models | python3 -m json.tool

Local LLM Request Traces (routing decisions)

# local_llm_traces — recent local LLM routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool

Returns: local LLM model requested, node selected, score breakdown, latency, tokens, retry/fallback status.

Local LLM Model Performance

curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool

Local LLM Usage Statistics

curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

Local LLM Fleet Health

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Local LLM Model Recommendations

curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

Local LLM Settings

curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

# Toggle local LLM auto-pull
curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'
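For scripts that prefer Python over curl, the same settings call can be built with the standard library. This is a minimal sketch: `build_settings_request` and `ROUTER_URL` are illustrative names, and since the response format is not documented in this section, the sending step is left as a comment.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # default router address from this document


def build_settings_request(settings: dict) -> urllib.request.Request:
    """Build the POST request that the curl example above sends."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        f"{ROUTER_URL}/dashboard/api/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_settings_request({"auto_pull": False})
print(request.get_method(), request.full_url)

# To send it against a running router:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```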

Local LLM Model Management

# local_llm_model_inventory — per-node local LLM model details
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull a local LLM model onto a specific node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'

# Delete a local LLM model from a specific node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "mac-studio"}'

Per-app local LLM analytics

curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

Local LLM Dashboard

Web dashboard at http://localhost:11435/dashboard with eight tabs: Local LLM Fleet Overview, Trends, Local LLM Model Insights, Apps, Benchmarks, Local LLM Health, Recommendations, Settings.

Optimizing Local LLM Latency

Find the slowest local LLM model/node combinations

sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, AVG(latency_ms)/1000.0 as avg_secs, COUNT(*) as n FROM request_traces WHERE status='completed' GROUP BY node_id, model HAVING n > 10 ORDER BY avg_secs DESC LIMIT 10"

Check local LLM time-to-first-token

sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, model, AVG(time_to_first_token_ms) as avg_ttft FROM request_traces WHERE time_to_first_token_ms IS NOT NULL GROUP BY node_id, model"

Compare hot vs cold local LLM load latency

sqlite3 ~/.fleet-manager/latency.db "SELECT model, CASE WHEN time_to_first_token_ms < 1000 THEN 'hot' ELSE 'cold' END as load_type, AVG(latency_ms)/1000.0 as avg_secs, COUNT(*) as n FROM request_traces WHERE status='completed' AND time_to_first_token_ms IS NOT NULL GROUP BY model, load_type ORDER BY model"
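The hot-versus-cold comparison can also be run from Python with the standard sqlite3 module. The sketch below exercises the query on an in-memory table with made-up rows; only the table and column names come from the queries above, while the column types, the sample data, and the extra `load_type` sort key (added so the output order is deterministic) are assumptions.

```python
import sqlite3

HOT_COLD_QUERY = """
SELECT model,
       CASE WHEN time_to_first_token_ms < 1000 THEN 'hot' ELSE 'cold' END AS load_type,
       AVG(latency_ms) / 1000.0 AS avg_secs,
       COUNT(*) AS n
FROM request_traces
WHERE status = 'completed' AND time_to_first_token_ms IS NOT NULL
GROUP BY model, load_type
ORDER BY model, load_type
"""


def hot_cold_summary(conn: sqlite3.Connection) -> list[tuple]:
    """Return (model, load_type, avg_secs, n) rows."""
    return conn.execute(HOT_COLD_QUERY).fetchall()


# Demo data: illustrative rows only, not real measurements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE request_traces ("
    "model TEXT, node_id TEXT, latency_ms REAL, "
    "time_to_first_token_ms REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO request_traces VALUES (?, ?, ?, ?, ?)",
    [
        ("llama3.3:70b", "mac-studio", 2000, 400, "completed"),
        ("llama3.3:70b", "mac-studio", 4000, 600, "completed"),
        ("llama3.3:70b", "mac-studio", 30000, 20000, "completed"),
        ("llama3.3:70b", "mac-studio", 9000, None, "failed"),
    ],
)
rows = hot_cold_summary(conn)
for row in rows:
    print(row)
```

To run it against the real database without risking writes, the connection could be opened read-only, for example sqlite3.connect("file:/path/to/latency.db?mode=ro", uri=True), with the path replaced by the expanded location of ~/.fleet-manager/latency.db.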

Test local LLM inference

# local LLM via OpenAI format
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from local LLM"}],"stream":false}'

# local LLM via Ollama format
curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from local LLM"}],"stream":false}'

Local LLM Resilience

Auto-retry — re-scores and retries on the next-best local LLM node if failure occurs before the first chunk
Local LLM model fallbacks — specify backup local LLM models; tries alternatives when the primary is unavailable
Local LLM context protection — strips dangerous num_ctx values, auto-upgrades to larger local LLM models
VRAM-aware local LLM fallback — routes to an already-loaded local LLM model in the same category
Zombie reaper — detects and cleans up stuck in-flight local LLM requests
Local LLM auto-pull — pulls missing local LLM models onto the best available node
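The auto-retry rule (retry only if the failure happens before the first chunk) can be sketched as a generator. This is a simplified illustration under stated assumptions: `stream_with_retry` and `fake_send` are hypothetical names, the nodes are passed in as an already-ranked list rather than being re-scored on each attempt, and ConnectionError stands in for whatever failures the router actually detects.

```python
from typing import Callable, Iterable, Iterator


def stream_with_retry(
    ranked_nodes: Iterable[str],
    send: Callable[[str], Iterator[str]],
) -> Iterator[str]:
    """Yield chunks from the first node that produces one.

    A failure before the first chunk moves on to the next-best node. A failure
    after streaming has started is re-raised, because the client has already
    received partial output.
    """
    last_error = None
    for node in ranked_nodes:
        started = False
        try:
            for chunk in send(node):
                started = True
                yield chunk
            return
        except ConnectionError as error:
            if started:
                raise
            last_error = error
    raise RuntimeError("all nodes failed before the first chunk") from last_error


def fake_send(node: str) -> Iterator[str]:
    """Stand-in transport: the first node is offline, the second answers."""
    if node == "mac-studio":
        raise ConnectionError("node offline")
    yield f"hello from {node}"


print(list(stream_with_retry(["mac-studio", "mac-mini"], fake_send)))
```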

Local LLM Guardrails

Never restart or stop the local LLM router or node agents without explicit user confirmation.
Never delete or modify files in ~/.fleet-manager/ (contains local LLM latency data, traces, and logs).
Do not pull or delete local LLM models without user confirmation — downloads can be 10-100+ GB.
If a local LLM node shows as offline, report it rather than attempting to SSH into the machine.

Local LLM Failure Handling

Connection refused → local LLM router may not be running, suggest herd or uv run herd
0 local LLM nodes online → suggest starting herd-node on devices
mDNS discovery fails → use --router-url http://router-ip:11435
Local LLM requests hang → check for num_ctx in client requests; verify context protection
Local LLM API errors → check ~/.fleet-manager/logs/herd.jsonl
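When checking the log file for API errors, a small read-only helper is often enough. The sketch below assumes only what the .jsonl extension implies (one JSON object per line); it makes no assumption about field names, and `tail_jsonl` is an illustrative helper rather than an ollama-herd tool. It opens the file for reading only, in keeping with the guardrail against modifying ~/.fleet-manager/.

```python
import json
from collections import deque
from pathlib import Path


def tail_jsonl(path: Path, limit: int = 20) -> list[dict]:
    """Return the last `limit` parseable JSON objects from a JSON Lines file."""
    entries = deque(maxlen=limit)
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partially written or corrupt lines
            if isinstance(record, dict):
                entries.append(record)
    return list(entries)


# Example call against the log path named above:
# for entry in tail_jsonl(Path.home() / ".fleet-manager" / "logs" / "herd.jsonl"):
#     print(entry)
```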

Security Recommendations

Before installing or running this skill:
1) Verify the pip package and GitHub repo (check maintainer, recent activity, and package files). Do not blindly run 'pip install' from unknown sources.
2) Inspect the package code (or run it in an isolated VM/container), because pip installs and the herd/herd-node binaries can execute arbitrary code and open a local HTTP server.
3) Expect files under ~/.fleet-manager (SQLite DB and logs) and the router to open localhost:11435; review and back up any data you care about.
4) Confirm how FLEET_CONTEXT_PROTECTION works and set it explicitly if sensitive context data might be routed.
5) Be aware that the router can trigger model pulls/downloads (large network/disk usage) on your nodes. Ensure you want that behavior and run on machines with appropriate quotas or in a sandbox.
If you want a lower-risk path, review the project's source code on GitHub and install only after auditing, or run in an isolated environment.

Capability Assessment

ℹ Purpose & Capability

Name/description (local LLM router) aligns with the actions described: installing a Python package (ollama-herd), running a router and node agent, and calling local Ollama/OpenAI-compatible endpoints. Required binaries (curl/wget, optional python/pip/sqlite3) are reasonable for this purpose. However, SKILL.md metadata includes configPaths (~/.fleet-manager/latency.db and logs) while the registry's top-level 'Required config paths' lists none — an internal mismatch.

⚠ Instruction Scope

SKILL.md instructs pip install and running herd/herd-node, calling many local endpoints and using python3 -m json.tool. It references an environment variable FLEET_CONTEXT_PROTECTION and writes/reads ~/.fleet-manager artifacts, but these env/config paths were not declared in the registry metadata fields shown earlier. The instructions will cause network activity (model pulls, contacting Ollama nodes) and create local files — all expected for this tool but not fully declared.

ℹ Install Mechanism

There is no formal install spec in the skill bundle (instruction-only). The instructions tell users/agents to run 'pip install ollama-herd' — installing from PyPI (a common but non‑vetted third‑party code source). That is proportionate for a Python router package but carries the usual risk that pip packages execute arbitrary code during installation/runtime.

⚠ Credentials

The skill does not request API keys or secrets (good), but SKILL.md references FLEET_CONTEXT_PROTECTION and data paths under ~/.fleet-manager without declaring them in requires.env/configPaths at the top level. The router will also open a local HTTP endpoint and can instruct nodes to pull models (which may download large binaries) — these are privileges the user should explicitly accept.

ℹ Persistence & Privilege

The skill is not 'always: true' and is user-invocable (normal). Running it will start long-lived processes (herd, herd-node) and open localhost:11435, persist SQLite logs under ~/.fleet-manager, and potentially auto-pull models. This runtime persistence is expected for a router but is a meaningful system presence and should be run in a trusted or isolated environment.

Version History

v1.0.4

Cross-platform support: macOS, Linux, and Windows. Updated OS metadata, descriptions, and hardware recommendations.

v1.0.3

Version 1.0.3: "Local LLM" terminology update
- Updated all documentation to explicitly use "local LLM" in features, usage, API, and examples for clarity.
- Expanded multilingual keywords in the description (Chinese, Spanish).
- Adjusted example code and command comments to emphasize the "local LLM router" and "local LLM requests."
- No changes to logic or endpoints, documentation only.

v1.3.0

local-llm-router 1.3.0
- Updated documentation in SKILL.md to clarify skill purpose and platform compatibility.
- Enhanced description of supported devices (Mac Studio, Mac Mini, MacBook Pro, Linux).
- No functional or API changes included; documentation only.
- Maintains all previous features and usage instructions.

v1.2.0

- Expanded the skill's description to clarify supported LLM families (Llama, Qwen, DeepSeek, Phi, Mistral, Gemma) and highlight context protection, VRAM fallback, and auto-retry.
- Improved summary for easier understanding and decision making about use-cases.
- No changes to code or functionality; documentation only.

v1.1.0

No file changes detected for version 1.1.0; this is a version bump only.
- No new features, bug fixes, or documentation changes in this release.
- Functionality and interface remain unchanged from the previous version.

v1.0.2

- Version updated to 1.0.2.
- Minor metadata field order adjustment in SKILL.md.
- No functional or behavioral changes; documentation update only.

v1.0.1

- Added optionalBins field to metadata with python3, sqlite3, and pip as optional dependencies.
- Added configPaths field to metadata to specify locations of configuration and log files.
- No functional changes to core skill logic or API.
- Documentation (SKILL.md) updated to reflect new metadata fields.

v1.0.0

Initial release: Smart local LLM router for optimized inference across multiple devices.
- Routes OpenAI/Ollama-compatible API requests across multiple local inference nodes using a 7-signal scoring engine (thermal state, memory fit, queue depth, latency, role affinity, availability, context fit).
- Real-time web dashboard with detailed fleet status, analytics, and request traceability.
- Automatic context-size protection prevents unnecessary model reloads and supports VRAM-aware fallback.
- Built-in resilience: auto-retries, model fallback, zombie request cleanup, and automatic model pull.
- CLI tools and API endpoints for fleet management, per-model stats, node health, and usage analytics.
- Drop-in compatible with OpenAI SDK and Ollama, designed to minimize latency and manual model management.

Metadata

Slug local-llm-router

Version 1.0.4

License MIT-0

Total installs 2

Current installs 2

Version count 8

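FAQ

What is Local Llm Router?

Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted local LLM inference routing on macOS, Linux, a... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 243 total downloads to date.

How do I install Local Llm Router?

Run the command "/install local-llm-router" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Local Llm Router free?

Yes. Local Llm Router is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Local Llm Router support?

Local Llm Router is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (darwin, linux, windows).

Who develops Local Llm Router?

It is developed and maintained by Twin Geeks (@twinsgeeks); the current version is v1.0.4.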
