Description

Turn your spare GPUs into one inference endpoint. Auto-discovers machines on your network, routes requests to the best available device, learns when your mac...

Usage instructions (SKILL.md)

GPU Cluster Manager

Name: Gpu Cluster Manager
Author: twinsgeeks

You are managing a GPU cluster that combines multiple machines into one inference endpoint for running local LLMs via Ollama. The GPU cluster routes every request to the best available device automatically.

What this GPU cluster solves

Your desktop, laptop, and maybe an old Linux box all have GPUs sitting idle most of the time. You want one GPU cluster URL that uses all of them — without Kubernetes, without Docker, without editing config files. Just point your AI apps at the GPU cluster endpoint and let the cluster figure out which machine should handle each request.

This GPU cluster manager does exactly that. Install it, run two commands, and your GPU cluster machines discover each other automatically. The GPU cluster learns when your devices are free, pauses during video calls, and picks the best GPU cluster node for every request based on real-time conditions.

Getting started with the GPU cluster

pip install ollama-herd    # GPU cluster manager from PyPI

On your main GPU cluster machine (the router):

herd    # starts GPU cluster router

On each other GPU cluster machine:

herd-node    # joins the GPU cluster automatically

That's it. The GPU cluster nodes find the router via mDNS. No config files. Your GPU cluster is running.

If mDNS doesn't work on your GPU cluster network: herd-node --router-url http://router-ip:11435

GPU Cluster Endpoint

Your GPU cluster runs at http://localhost:11435. Point any AI app at the GPU cluster:

from openai import OpenAI
# GPU cluster client
gpu_cluster_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
gpu_cluster_response = gpu_cluster_client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain GPU cluster routing for AI inference"}]
)

Works with: LangChain, CrewAI, AutoGen, LlamaIndex, Aider, Cline, Continue.dev, and any OpenAI-compatible client pointing at the GPU cluster.

GPU Cluster Smart Features

GPU cluster auto-discovery — machines find each other via mDNS, no config
7-signal GPU cluster scoring — picks the best machine based on loaded models, memory, queue depth, latency, and more
GPU cluster meeting detection — pauses inference when your camera/mic is active (macOS)
GPU cluster capacity learning — learns your weekly patterns (168-hour behavioral model)
GPU cluster context protection — prevents models from reloading when apps send different context sizes
GPU cluster auto-pull — if you request a model that doesn't exist, it downloads to the best GPU cluster node
GPU cluster auto-retry — if a machine hiccups, retries on the next-best GPU cluster node
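
The 168-hour figure in the capacity-learning item matches one slot per hour of a week (7 × 24). The sketch below shows that indexing with a simple per-slot average. It is an illustration only: `hour_of_week` and `WeeklyAvailability` are hypothetical names, and nothing here is taken from ollama-herd's actual implementation.

```python
from datetime import datetime

HOURS_PER_WEEK = 7 * 24  # 168 slots, one per hour of the week


def hour_of_week(moment: datetime) -> int:
    """Map a timestamp to a slot in 0..167 (Monday 00:00 is slot 0)."""
    return moment.weekday() * 24 + moment.hour


class WeeklyAvailability:
    """Hypothetical per-slot average of how often a machine was free."""

    def __init__(self) -> None:
        self.samples = [0] * HOURS_PER_WEEK
        self.free = [0] * HOURS_PER_WEEK

    def observe(self, moment: datetime, was_free: bool) -> None:
        """Record one observation for the slot containing `moment`."""
        slot = hour_of_week(moment)
        self.samples[slot] += 1
        self.free[slot] += int(was_free)

    def expected_free(self, moment: datetime, default: float = 0.5) -> float:
        """Return the observed free fraction for this slot, or `default`."""
        slot = hour_of_week(moment)
        if self.samples[slot] == 0:
            return default  # no history yet for this hour of the week
        return self.free[slot] / self.samples[slot]
```

A router could consult `expected_free` as one input when ranking nodes, although how the real scoring combines its signals is not documented here.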

Check your GPU cluster

GPU cluster status — all machines

curl -s http://localhost:11435/fleet/status | python3 -m json.tool

What models are available on the GPU cluster?

curl -s http://localhost:11435/api/tags | python3 -m json.tool

What's loaded in GPU cluster memory right now?

curl -s http://localhost:11435/api/ps | python3 -m json.tool

How healthy is the GPU cluster?

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

GPU cluster model recommendations

curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

Returns GPU cluster recommendations based on your hardware — which models fit, which are too big, and the optimal GPU cluster mix.

GPU cluster recent activity

curl -s "http://localhost:11435/dashboard/api/traces?limit=10" | python3 -m json.tool

GPU cluster usage stats

curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

GPU cluster settings

curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'
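
The curl checks above can also be issued from Python using only the standard library. This is a minimal sketch, assuming the router listens on the default http://localhost:11435 and that each endpoint returns a JSON body; `build_request` and `call` are hypothetical helper names, not part of ollama-herd.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # default endpoint used in this document


def build_request(path, payload=None, base_url=ROUTER_URL):
    """Build a GET request, or a JSON POST when a payload is given."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def call(path, payload=None, timeout=10.0):
    """Send the request and decode the response body as JSON (assumed)."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `call("/fleet/status")` mirrors the status check, and `call("/dashboard/api/settings", {"auto_pull": False})` mirrors the settings update.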

Manage GPU cluster models

# What's on each GPU cluster node
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Download a model to a specific GPU cluster node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "gpu-cluster-studio"}'

# Remove a model from a GPU cluster node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "gpu-cluster-studio"}'

GPU cluster per-app tracking

curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

Tag your GPU cluster requests to see which apps use the most time:

curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Summarize GPU cluster utilization"}],"metadata":{"tags":["gpu-cluster-app"]}}'
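
The same tagging can be done from the OpenAI Python client. The sketch below passes the `metadata.tags` field shown in the curl example through the client's `extra_body` argument, which adds extra fields to the JSON request body. The helper names are hypothetical, and whether the router accepts any fields beyond `tags` is not documented here.

```python
def tagged_chat_payload(model, prompt, tags):
    """Build a chat body carrying the metadata.tags field from the curl example."""
    if not tags:
        raise ValueError("at least one tag is needed to attribute the request")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "metadata": {"tags": list(tags)},
    }


def send_tagged(client, model, prompt, tags):
    """Send a tagged request through an OpenAI-compatible client object."""
    payload = tagged_chat_payload(model, prompt, tags)
    return client.chat.completions.create(
        model=payload["model"],
        messages=payload["messages"],
        # extra_body adds fields to the request JSON alongside model/messages.
        extra_body={"metadata": payload["metadata"]},
    )
```

With the `openai` package installed, `client` would be `OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")`, as in the endpoint example earlier in this document.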

GPU Cluster Dashboard

Open http://localhost:11435/dashboard for a visual GPU cluster overview. Eight tabs: Fleet Overview (live GPU cluster node cards), Trends (charts), Model Insights (performance comparison), Apps (per-app usage), Benchmarks, Health (automated GPU cluster checks), Recommendations (what models to run), Settings.

Try the GPU cluster

# Quick GPU cluster test
curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hello from the GPU cluster!"}],"stream":false}'

GPU Cluster Troubleshooting

Check what's slow in the GPU cluster

sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, AVG(latency_ms)/1000.0 as avg_secs, COUNT(*) as n FROM request_traces WHERE status='completed' GROUP BY node_id, model HAVING n > 5 ORDER BY avg_secs DESC LIMIT 10"

See GPU cluster failures

sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message, latency_ms/1000.0 as secs FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"
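
The slow-request query can be run from Python as well. This sketch assumes only the table and column names that appear in the queries above (`request_traces`, `model`, `node_id`, `latency_ms`, `status`). It opens the database read-only, in keeping with the guardrail against modifying files in ~/.fleet-manager/. The function names are hypothetical.

```python
import sqlite3
from pathlib import Path

DEFAULT_DB = Path.home() / ".fleet-manager" / "latency.db"

SLOW_QUERY = """
SELECT model, node_id, AVG(latency_ms) / 1000.0 AS avg_secs, COUNT(*) AS n
FROM request_traces
WHERE status = 'completed'
GROUP BY node_id, model
HAVING n > ?
ORDER BY avg_secs DESC
LIMIT ?
"""


def open_read_only(db_path=DEFAULT_DB):
    """Open the trace database without write access (SQLite URI mode=ro)."""
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def slowest_pairs(conn, min_requests=5, limit=10):
    """Return (model, node_id, avg_secs, n) rows, slowest first."""
    return conn.execute(SLOW_QUERY, (min_requests, limit)).fetchall()
```

Calling `slowest_pairs(open_read_only())` reproduces the first sqlite3 command; the failure query could be wrapped the same way.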

GPU Cluster Guardrails

Never restart or stop the GPU cluster without explicit user confirmation.
Never delete or modify files in ~/.fleet-manager/ (contains all your GPU cluster data and logs).
Do not pull or delete models on the GPU cluster without user confirmation — downloads can be 10-100+ GB.
If a GPU cluster machine shows as offline, report it rather than attempting to SSH into it.

GPU Cluster Failure Handling

Connection refused → GPU cluster router may not be running, suggest herd or uv run herd
0 nodes online → suggest starting herd-node on GPU cluster devices
mDNS discovery fails → use --router-url http://router-ip:11435
GPU cluster requests hang → check for num_ctx in client requests; context protection handles it
GPU cluster errors → check ~/.fleet-manager/logs/herd.jsonl
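
The first entry in the list above (connection refused) can be checked before sending any inference request. A minimal sketch using a plain TCP connect follows; `diagnose_router` is a hypothetical helper, and the returned messages simply restate the suggestions listed above.

```python
import socket
from urllib.parse import urlparse


def diagnose_router(url="http://localhost:11435", timeout=2.0):
    """Try a TCP connection to the router and return a short diagnosis."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or 80
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "router reachable"
    except ConnectionRefusedError:
        # Nothing is listening on the port: the router is probably not running.
        return "connection refused: start the router with `herd` or `uv run herd`"
    except OSError as exc:
        # Covers timeouts, DNS failures, and unreachable hosts.
        return f"router unreachable ({exc}): check the address or pass --router-url"
```

A successful TCP connect only shows that something is listening on the port; it does not confirm that any nodes are online.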

Security recommendations

Before installing, verify the upstream project and the PyPI package: check the GitHub repo, the package maintainers, and the release history, and inspect the package source if possible, because pip install can run arbitrary code at install time. Be aware that the tool auto-downloads models (heavy disk and network use) and can read system sensor state (macOS camera/mic) for meeting detection; decide whether you want those capabilities. If you proceed, run the package in an isolated environment (VM/container) or on a machine whose disk and network access you can afford to put at risk, restrict network egress if you need to control where models are downloaded from, and review or grep the installed package files for unexpected behavior. Ask the publisher to resolve the inconsistencies between the registry and SKILL.md (declared requirements, homepage) and to document where models are pulled from and what permissions meeting detection uses.

Capability assessment

ℹ Purpose & Capability

The skill's name and instructions (install ollama-herd, run herd/herd-node, provide a local OpenAI-compatible endpoint) match the stated purpose. However, the registry metadata declares no requirements, while SKILL.md includes its own metadata requiring network tools (curl/wget), optional python/pip, and config paths (~/.fleet-manager/...). The two sources are inconsistent, and the publisher should explain the mismatch.

⚠ Instruction Scope

Runtime instructions tell the user to pip install a PyPI package and run herd/herd-node; the service auto-discovers via mDNS, auto-pulls models to nodes, and implements 'meeting detection' that pauses inference when camera/mic are active on macOS. These behaviors imply access to network, disk, and system sensors (camera/microphone). SKILL.md does not detail where pulled models come from, what permissions are required/used for meeting detection, or any safeguards for auto-downloads or sensitive sensor access.

ℹ Install Mechanism

The skill is instruction-only (no install spec in registry) but tells users to run `pip install ollama-herd` (PyPI). Installing a third-party pip package is a moderate-risk install mechanism because install-time code can execute arbitrary actions. SKILL.md's declared homepage (a GitHub repo) exists in the file but the registry entry lists no homepage — another inconsistency to resolve.

⚠ Credentials

No environment variables or external credentials are requested, which is proportionate. However, requested config paths (~/.fleet-manager/latency.db, logs/herd.jsonl) indicate the skill will write local telemetry. The meeting-detection feature implies access to camera/mic state on macOS, which is sensitive; auto-pull will download large models over the network and write them to disk. These resource and privacy implications are not made explicit in the registry metadata.

✓ Persistence & Privilege

The skill is not set to always:true and does not request system-wide config changes in the registry. It will create local files under the user's home (~/.fleet-manager) and requires installing a pip package, which is typical. Autonomous invocation is allowed by default (expected) but combined with network/model-download and sensor access increases blast radius—worth caution but not an immediate privilege misconfiguration in the manifest.

Version history

v1.4.1

Cross-platform support: macOS, Linux, and Windows. Updated OS metadata, descriptions, and hardware recommendations.

v1.4.0

- Reworded all documentation to emphasize the term “GPU cluster” throughout, improving clarity for users seeking cluster-based Apple Silicon AI inference.
- Updated descriptions, usage instructions, and example commands for consistency with the "GPU cluster" terminology.
- Added short non-English phrases ("GPU集群管理本地AI推理", "Clúster GPU para inferencia IA local") to the description for expanded accessibility.
- Minor version in documentation updated to 1.0.1 to reflect these clarifications and enhancements.

v1.3.0

- Updated description to clarify Apple Silicon support and list specific Mac models (Mac Studio, Mac Mini, MacBook Pro, Mac Pro).
- Emphasized self-hosted local AI cluster capability in the description.
- No changes to cluster usage, features, or technical instructions.

v1.2.0

- Updated description to highlight Apple Silicon support and main model families (Llama, Qwen, DeepSeek, Phi).
- Streamlined language to emphasize zero-config setup and Apple-device clustering.
- No functional or interface changes; documentation only.
- Version number in manifest not updated; remains at 1.0.0 for this documentation refresh.

v1.1.0

- Fixed a metadata formatting issue in SKILL.md.
- No feature or behavior changes to the skill itself.
- No user action needed; documentation metadata is now consistent.

v1.0.1

- Added "optionalBins" and "configPaths" to the metadata section in SKILL.md for improved requirements and configuration clarity.
- No changes to functionality or user instructions.

v1.0.0

Initial release of gpu-cluster-manager: combine multiple machines and GPUs into a single local inference endpoint with zero config.
- Auto-discovers machines on your local network via mDNS with no configuration required.
- Routes inference requests to the best available device based on real-time status, loaded models, and system capacity.
- Learns your machines’ availability patterns and can pause when you’re in a video call (on macOS).
- Dashboard and CLI tools for monitoring cluster status, loaded models, usage, and health.
- Supports automatic model downloading, retries, per-app usage tracking, and robust safety guardrails.
- Simple installation: pip install and two commands to set up a cluster; no Docker or Kubernetes needed.

Metadata

Slug gpu-cluster-manager

Version 1.4.1

License MIT-0

Total installs 2

Current installs 2

Number of versions 7

FAQ