Description

Distributed inference for Llama, Qwen, DeepSeek across heterogeneous hardware. Self-hosted distributed inference — scatter requests across macOS, Linux, Wind...

Usage (SKILL.md)

Distributed Inference

Name: Distributed Inference
Author: twinsgeeks

A coordination layer for distributed inference across heterogeneous machines. Each node is autonomous: it runs its own Ollama, manages its own models, and works fine standalone. The coordinator routes each request to the best-scoring node using a multi-signal scoring function and records every routing decision for analysis.

Install Distributed Inference

pip install ollama-herd
herd              # start the distributed inference coordinator
herd-node         # start a distributed inference agent on each node

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Distributed Inference Architecture

Distributed Inference Coordinator (:11435)    Node Agents
┌──────────────────────┐     ┌──────────────────┐
│ Distributed Scoring  │◄────│ Heartbeat + Metrics│  (mDNS or explicit URL)
│ Inference Queue Mgr  │     │ Capacity Learner   │
│ Streaming Proxy      │     └──────────────────┘
│ Trace Store          │     ┌──────────────────┐
│ Latency Store        │     │ Heartbeat + Metrics│  (N nodes)
└──────────────────────┘     └──────────────────┘
        │
        ▼
   Ollama instances (one per distributed inference node)

Nodes discover the coordinator via mDNS (_fleet-manager._tcp.local.) or connect explicitly with --router-url. Each node sends a heartbeat every 5 seconds containing CPU utilization, memory usage and pressure classification, disk metrics, loaded models with their context lengths, available models, and an optional capacity score from the behavioral model.
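To make the heartbeat contents concrete, here is a minimal sketch of a payload carrying the fields listed above. The class names, field names, units, and JSON encoding are all assumptions for illustration; the actual wire format used by ollama-herd may differ.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional

HEARTBEAT_INTERVAL_S = 5  # interval stated in the text


@dataclass
class LoadedModel:
    name: str
    context_length: int


@dataclass
class Heartbeat:
    # Hypothetical field names; only the categories come from the text.
    node_id: str
    cpu_percent: float
    memory_used_gb: float
    memory_total_gb: float
    memory_pressure: str                  # e.g. "normal", "warning", "critical"
    disk_free_gb: float
    loaded_models: List[LoadedModel] = field(default_factory=list)
    available_models: List[str] = field(default_factory=list)
    capacity_score: Optional[float] = None  # optional, from the behavioral model
    sent_at: float = field(default_factory=time.time)


def encode(hb: Heartbeat) -> str:
    """Serialize a heartbeat to JSON; asdict() recurses into nested dataclasses."""
    return json.dumps(asdict(hb))
```

A node agent would build one such object per interval and send it to the coordinator; leaving capacity_score as None models a node with capacity learning disabled.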

Distributed Inference Scoring Function

The coordinator scores every online node for every request using 7 weighted signals. Positive weights are bonuses and negative weights are penalties:

Signal	Max weight	What it measures
Thermal state	+50	Whether the model is already loaded in GPU memory: hot (+50), warm (+30), cold (+10)
Memory fit	+20	Available memory headroom relative to model size
Queue depth	-30	Pending plus in-flight requests on this node:model pair
Wait time	-25	Estimated wait, from p75 historical latency × queue depth
Role affinity	+15	Large models prefer high-memory nodes
Availability trend	+10	Capacity learner's prediction of node availability
Context fit	+15	Whether the loaded model's context window fits the estimated token count

Nodes with insufficient memory, critical memory pressure, or missing models are eliminated before scoring. The highest-scoring remaining node wins.
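The filter-then-score flow can be sketched as follows. Only the seven signals, their maximum weights, and the elimination rules come from the text; the Candidate type, the helper names, and the way each signal scales within its weight (for example, 6 points per queued request) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

THERMAL_POINTS = {"hot": 50, "warm": 30, "cold": 10}


@dataclass
class Candidate:  # hypothetical view of one node for one model
    node_id: str
    has_model: bool
    thermal: str            # "hot", "warm", or "cold"
    free_memory_gb: float
    model_size_gb: float
    memory_pressure: str    # "normal", "warning", or "critical"
    queue_depth: int        # pending + in-flight on this node:model pair
    p75_latency_s: float
    is_high_memory: bool
    availability: float     # capacity learner prediction, 0.0 to 1.0
    loaded_ctx: int
    needed_ctx: int


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def eligible(c: Candidate) -> bool:
    """Elimination step: missing model, critical pressure, or too little memory."""
    return (c.has_model and c.memory_pressure != "critical"
            and c.free_memory_gb >= c.model_size_gb)


def score(c: Candidate, large_model: bool = False) -> float:
    headroom = (c.free_memory_gb - c.model_size_gb) / max(c.free_memory_gb, 1e-9)
    memory_fit = 20 * clamp(headroom)
    queue_penalty = -min(30, 6 * c.queue_depth)     # assumed: 6 points per request
    wait_s = c.p75_latency_s * c.queue_depth        # p75 latency x queue depth
    wait_penalty = -min(25, wait_s / 2)             # assumed: 1 point per 2 s
    role = 15 if (large_model and c.is_high_memory) else 0
    trend = 10 * clamp(c.availability)
    context = 15 if c.loaded_ctx >= c.needed_ctx else 0
    return (THERMAL_POINTS[c.thermal] + memory_fit + queue_penalty
            + wait_penalty + role + trend + context)


def pick_node(cands: Iterable[Candidate], large_model: bool = False) -> Optional[Candidate]:
    """Return the highest-scoring eligible candidate, or None if none qualify."""
    pool = [c for c in cands if eligible(c)]
    return max(pool, key=lambda c: score(c, large_model), default=None)
```

With these assumed scalings, a hot node with a short queue still outranks an idle cold node, which matches the table's emphasis on avoiding cold loads.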

Adaptive Distributed Inference Capacity

Nodes can optionally learn usage patterns and constrain their own availability:

168-slot behavioral model: one slot per hour of the week, learning when the machine is typically free
Dynamic memory ceiling: maps the availability score to how much RAM the coordinator may use on that node

Enable with FLEET_NODE_ENABLE_CAPACITY_LEARNING=true on the node agent.
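One way to picture the 168-slot model is a table indexed by hour of the week, updated as observations arrive. The sketch below uses an exponential moving average and a linear availability-to-memory mapping; both are assumptions chosen for illustration, since the text does not specify the learning rule or the mapping, and all names are hypothetical.

```python
from datetime import datetime

SLOTS = 7 * 24  # 168: one slot per hour of the week


def slot_index(ts: datetime) -> int:
    """Monday 00:00 maps to slot 0, Sunday 23:00 to slot 167."""
    return ts.weekday() * 24 + ts.hour


class AvailabilityModel:
    """Per-slot estimate of how often the machine is free (0.0 to 1.0)."""

    def __init__(self, alpha: float = 0.2, prior: float = 0.5):
        self.alpha = alpha            # assumed EMA smoothing factor
        self.slots = [prior] * SLOTS

    def observe(self, ts: datetime, was_free: bool) -> None:
        i = slot_index(ts)
        target = 1.0 if was_free else 0.0
        self.slots[i] += self.alpha * (target - self.slots[i])

    def availability(self, ts: datetime) -> float:
        return self.slots[slot_index(ts)]


def memory_ceiling_gb(availability: float, total_gb: float,
                      floor_fraction: float = 0.25) -> float:
    """Assumed linear mapping: floor_fraction of RAM at 0.0, all RAM at 1.0."""
    fraction = floor_fraction + (1.0 - floor_fraction) * availability
    return total_gb * fraction
```

Keying on the hour of the week lets the model distinguish, say, a weekday afternoon from the same hour on a weekend without storing raw history.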

Context-aware Distributed Inference Model Placement

The coordinator protects against a known Ollama behavior in which changing num_ctx at runtime triggers a full model reload. For an 89 GB model, such a reload causes multi-minute hangs.

num_ctx ≤ loaded context → num_ctx is stripped from the request, so no reload is triggered
num_ctx > loaded context → the coordinator searches loaded models across all nodes for one with sufficient context
Configurable: FLEET_CONTEXT_PROTECTION=strip|warn|passthrough
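The two rules above can be sketched as a small decision function. The function names and return shape are hypothetical, and the behavior of the warn and passthrough modes (log and forward, and forward untouched, respectively) is an assumption, since the text names the modes without defining them. The sketch reads num_ctx from the options object, which is where Ollama's native API expects it.

```python
import copy
import logging
from typing import Dict, Optional, Tuple

log = logging.getLogger("context_protection")


def protect_num_ctx(request: dict, loaded_ctx: int,
                    mode: str = "strip") -> Tuple[dict, bool]:
    """Return (request_to_forward, needs_other_node)."""
    num_ctx = request.get("options", {}).get("num_ctx")
    if num_ctx is None or mode == "passthrough":
        return request, False          # assumed: passthrough forwards untouched
    if num_ctx > loaded_ctx:
        return request, True           # caller should look for a better-fitting node
    if mode == "warn":                 # assumed: log but keep the option
        log.warning("num_ctx=%d would reload a model loaded at %d", num_ctx, loaded_ctx)
        return request, False
    cleaned = copy.deepcopy(request)   # strip mode: drop num_ctx, leave caller's dict alone
    del cleaned["options"]["num_ctx"]
    return cleaned, False


def find_node_with_context(loaded: Dict[str, Dict[str, int]],
                           model: str, num_ctx: int) -> Optional[str]:
    """loaded maps node_id -> {model_name: loaded_context_length}."""
    for node_id, models in loaded.items():
        if models.get(model, 0) >= num_ctx:
            return node_id
    return None
```

Copying the request before stripping keeps the original available for tracing, which fits the document's emphasis on recording every routing decision.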

Distributed Inference API

Distributed Inference Coordinator State

# distributed_inference_fleet_state — full distributed inference topology
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

# distributed_inference_models — models across all distributed inference nodes
curl -s http://localhost:11435/api/tags | python3 -m json.tool

# distributed_inference_hot_models — models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool

Distributed Inference (OpenAI-compatible)

# distributed_inference_chat — route via distributed inference scoring
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello via distributed inference"}]}'

Distributed Inference (Ollama-native)

curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello via distributed inference"}]}'

Distributed Inference Model Fallback Chains

curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","fallback_models":["qwen2.5:32b","qwen2.5:7b"],"messages":[{"role":"user","content":"Hello with distributed inference fallback"}]}'
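The same fallback request can be issued from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, while the URL, port, and the fallback_models field are taken from the curl example above. Building the request is kept separate from sending it so the payload can be inspected without a running coordinator.

```python
import json
import urllib.request
from typing import Optional, Sequence

COORDINATOR = "http://localhost:11435"  # coordinator address used in the examples


def build_chat_request(model: str, prompt: str,
                       fallback_models: Optional[Sequence[str]] = None,
                       base_url: str = COORDINATOR) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if fallback_models:
        body["fallback_models"] = list(fallback_models)  # tried in the order given
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON response (needs a running coordinator)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

For example, send(build_chat_request("llama3.3:70b", "Hello", ["qwen2.5:32b", "qwen2.5:7b"])) mirrors the curl call above.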

Distributed Inference Trace Analysis

# distributed_inference_traces — recent routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool

# distributed_inference_score_breakdown
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, node_id, score, scores_breakdown FROM request_traces ORDER BY timestamp DESC LIMIT 1"

Distributed Inference Node Performance

sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, model, COUNT(*) as n, ROUND(AVG(latency_ms)/1000.0, 1) as avg_s, ROUND(AVG(COALESCE(completion_tokens,0) * 1000.0 / NULLIF(latency_ms,0)), 1) as tok_per_s FROM request_traces WHERE status='completed' GROUP BY node_id, model HAVING n > 10 ORDER BY tok_per_s DESC"
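The performance query can also be run from Python with the built-in sqlite3 module. The sketch below runs it against a small in-memory table so it works without a real latency.db; the column types and the sample rows are assumptions, while the column names and the query itself come from the command above.

```python
import sqlite3

QUERY = """
SELECT node_id, model, COUNT(*) AS n,
       ROUND(AVG(latency_ms) / 1000.0, 1) AS avg_s,
       ROUND(AVG(COALESCE(completion_tokens, 0) * 1000.0
                 / NULLIF(latency_ms, 0)), 1) AS tok_per_s
FROM request_traces
WHERE status = 'completed'
GROUP BY node_id, model
HAVING n > 10
ORDER BY tok_per_s DESC
"""


def node_performance(conn: sqlite3.Connection) -> list:
    """Per node:model averages; pairs with 10 or fewer requests are dropped."""
    return conn.execute(QUERY).fetchall()


def demo_connection() -> sqlite3.Connection:
    """In-memory stand-in for latency.db with assumed column types."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE request_traces (node_id TEXT, model TEXT, "
        "latency_ms REAL, completion_tokens INTEGER, status TEXT)"
    )
    rows = [("mac-studio", "llama3.3:70b", 2000, 100, "completed")] * 11
    rows += [("mac-mini", "qwen2.5:7b", 1000, 80, "completed")] * 5  # below threshold
    conn.executemany("INSERT INTO request_traces VALUES (?, ?, ?, ?, ?)", rows)
    return conn
```

Note that tok_per_s is the mean of per-request rates rather than total tokens divided by total time, and NULLIF keeps zero-latency rows from causing a division by zero.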

Distributed Inference Health and Capacity

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

Distributed Inference Model Lifecycle

# distributed_inference_model_inventory
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull model to a distributed inference node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'

# Remove model from a distributed inference node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "mac-studio"}'

Distributed Inference Fault Tolerance

Mechanism	Behavior
Auto-retry	If a node fails before the first chunk, the coordinator re-scores and retries on the next-best node
Holding queue	When all nodes are saturated, requests queue for up to 30 seconds
Zombie reaper	A background task reclaims stuck in-flight slots
VRAM fallback	Routes to an already-loaded model in the same category rather than cold-loading
Auto-pull	Pulls missing models onto the node with the most available memory
Graceful drain	SIGTERM triggers a drain: in-flight requests finish and pending requests are redistributed
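The auto-retry row can be illustrated with a generator that only fails over while nothing has been sent to the client yet. This is a sketch under assumptions: NodeError, pick, and open_stream are hypothetical names, and max_attempts is an invented limit. The pick callable stands in for re-scoring, receiving the set of nodes already tried.

```python
from typing import Callable, Iterable, Iterator, Optional, Set


class NodeError(Exception):
    """Raised when a node cannot serve a request."""


def stream_with_retry(pick: Callable[[Set[str]], Optional[str]],
                      open_stream: Callable[[str], Iterable[str]],
                      max_attempts: int = 3) -> Iterator[str]:
    """Yield chunks from the first node that produces a first chunk."""
    tried: Set[str] = set()
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        node = pick(tried)             # re-score, excluding nodes that already failed
        if node is None:
            break
        tried.add(node)
        try:
            stream = iter(open_stream(node))
            first = next(stream)       # failure here is still safe to retry
        except StopIteration:
            return                     # node answered with an empty stream
        except NodeError as exc:
            last_error = exc
            continue
        yield first
        yield from stream              # errors after the first chunk propagate
        return
    raise NodeError("no node produced a first chunk") from last_error
```

Retrying only before the first chunk avoids sending a client duplicated or interleaved output from two different nodes.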

Distributed Inference Data Model

All distributed inference state is in SQLite at ~/.fleet-manager/latency.db:

-- Distributed inference request traces (every routing decision)
SELECT * FROM request_traces LIMIT 1;

Structured distributed inference logs at ~/.fleet-manager/logs/herd.jsonl — daily rotation, 30-day retention.

Distributed Inference Dashboard

http://localhost:11435/dashboard — eight tabs covering distributed inference fleet overview, trends, model insights, per-app analytics, benchmarks, health checks, model recommendations, and settings.

Distributed Inference Constraints

Never restart distributed inference services or modify ~/.fleet-manager/ without explicit user confirmation.
Distributed inference model pull/delete operations require user confirmation (10-100+ GB transfers).
If the coordinator is unreachable, suggest starting it with herd or uv run herd.
If no nodes are online, suggest running herd-node on the target machines.

Security recommendations

What to check before installing or running this skill:
- Review the upstream package and source (https://github.com/geeks-accelerator/ollama-herd and the PyPI project) to confirm that the code matches the documentation and contains no unexpected network calls or telemetry.
- Expect the coordinator and node agents to collect and store system metrics and per-request traces in ~/.fleet-manager (latency.db and JSONL logs); if that data is sensitive, review, rotate, and secure those files.
- The system uses mDNS (_fleet-manager._tcp.local.) for discovery and may broadcast and listen on the local network; on an untrusted network, restrict that behavior or use an explicit --router-url instead.
- The SKILL.md installs via pip and uses python3 in its examples; make sure you have the intended Python and pip versions, and do not install packages as root without verifying them.
- Because there is no registry install spec, the skill itself will not auto-install code, but following its instructions installs a third-party PyPI package; treat that as a separate action and audit it.
- If you need higher assurance, request the exact PyPI version hash or a signed release and compare it with the repository before installing.
These checks reduce risk and help confirm that the skill does what its documentation claims.

Capability assessment

✓ Purpose & Capability

The name/description (distributed inference for Ollama across local machines) matches the instructions: discover nodes via mDNS or explicit URL, route requests, score nodes, and record metrics. Declared metadata (curl/sqlite3, config paths for latency DB and logs) is coherent with this purpose. Minor inconsistency: the runtime instructions show installing via pip (pip install ollama-herd) and use python3 in examples, but pip/python3 are only listed as 'optionalBins' rather than required — in practice Python and pip will be needed to follow the provided install instructions.

ℹ Instruction Scope

SKILL.md instructs running a coordinator and node agents, collecting heartbeat data every 5s (CPU, memory, disk, loaded models, optional capacity scores), writing a latency sqlite DB and JSONL logs, and doing local network discovery via mDNS. Those actions are expected for this functionality but do involve collecting system metrics and using local network discovery (mDNS) — both are legitimate for a fleet manager but are sensitive operations the user should expect.

ℹ Install Mechanism

The skill is instruction-only (no install spec in the registry). The docs instruct users to 'pip install ollama-herd' (PyPI) and run herd/herd-node; installing an external pip package is a moderate-risk action if done automatically, but here the registry does not auto-install. Recommend reviewing the PyPI package and GitHub repo before installing.

✓ Credentials

No credentials or secret env vars are requested. The only env variables referenced are feature/config flags (e.g., FLEET_NODE_ENABLE_CAPACITY_LEARNING, FLEET_CONTEXT_PROTECTION). The skill reads/writes local config paths (~/.fleet-manager/latency.db, ~/.fleet-manager/logs/herd.jsonl) which is proportional to the stated purpose but worth noting for privacy.

✓ Persistence & Privilege

The skill is not force-enabled (always: false) and is user-invocable. It stores local state (DB and logs) and listens/sends on the local network, which is appropriate for this service. It does not request elevated platform privileges or modify other skills' configs.

Version history

v1.0.4

Cross-platform support: macOS, Linux, and Windows. Updated OS metadata, descriptions, and hardware recommendations.

v1.0.3

- Updated documentation to consistently use the term "distributed inference" throughout all sections.
- Expanded usage examples and code snippets to emphasize distributed inference operations and clarify terminology.
- Added brief multilingual descriptions (Chinese and Spanish) to the package description.
- Refined architecture diagrams and section headings for clarity and focus on distributed inference concepts.
- No API, functional, or behavioral changes; documentation update only.

v1.2.0

distributed-inference 1.2.0
- Updated language and examples to better highlight self-hosted, local AI across Mac Studio, Mac Mini, MacBook Pro, and Linux machines.
- Clarified hardware diversity, emphasizing support for all machines running Ollama, including various Apple Silicon models.
- No code or API changes; documentation improvements only.

v1.1.0

- Shortened and clarified the project description for easier understanding.
- Highlighted specific supported model families (Llama, Qwen, DeepSeek).
- Simplified language around features like "no orchestration layer" and "just HTTP and mDNS."
- Reduced default "zombie reaper" timeout from 15 minutes to 10 minutes in fault tolerance.
- Improved focus in feature summaries to highlight key differentiators and supported scenarios.

v1.0.2

- Bumped version to 1.0.2.
- Updated metadata: adjusted configPaths positioning and formatting.
- Documentation cleanup: removed references to meeting detection and app fingerprinting from adaptive capacity section.
- Simplified architecture diagram for node agents.
- No changes to core functionality, API, or usage examples.

v1.0.1

- Added optional dependencies (python3, pip) and config file paths (latency.db, herd.jsonl) to metadata.
- No functional or API changes; documentation and metadata update only.

v1.0.0

Initial release of distributed-inference: coordinate LLM inference across diverse hardware with adaptive scheduling and no orchestration.
- Run distributed LLM inference seamlessly across Apple Silicon, Linux, and Ollama nodes using HTTP and mDNS.
- Intelligent request routing using a multi-signal, thermal-aware scoring system for optimal node selection.
- Adaptive node capacity learning with behavioral modeling, meeting detection, and app fingerprinting for resource-aware scheduling.
- Automatic context window handling and model placement to prevent unnecessary model reloads.
- Integrated fault tolerance features: auto-retry, holding queues, zombie reaping, VRAM fallback, and graceful draining.
- Simple API for querying fleet status, routing history, usage stats, and model management.

Metadata

Slug distributed-inference

Version 1.0.4

License MIT-0

Total installs 2

Current installs 2

Number of versions 7

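FAQ

What is Distributed Inference?

Distributed inference for Llama, Qwen, DeepSeek across heterogeneous hardware. Self-hosted distributed inference: scatter requests across macOS, Linux, Wind... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 220 cumulative downloads to date.

How do I install Distributed Inference?

Run the command "/install distributed-inference" in an OpenClaw or Claude Code conversation to install it in one step; no additional configuration is needed.

Is Distributed Inference free?

Yes. Distributed Inference is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Distributed Inference support?

Distributed Inference is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (darwin, linux, windows).

Who develops Distributed Inference?

It is developed and maintained by Twin Geeks (@twinsgeeks); the current version is v1.0.4.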
