Description

Manage multi-tier AI inference clusters for homelabs. Health monitoring, expert MoE routing, automatic node recovery, and model deployment across Ollama and llama.cpp nodes. Covers GPU memory planning, Docker volume strategies for large models, sequential startup patterns to avoid CUDA deadlocks, and unified API gateways via LiteLLM.

README (SKILL.md)

\r \r

Homelab Cluster Management\r

Name: Homelab Cluster Management
Author: mlesnews

\r Manage a compound AI compute cluster spanning multiple tiers of GPU and CPU inference nodes.\r Built and battle-tested by Lumina Homelab.\r \r

When to Use\r

\r Use this skill when your agent needs to:\r

Monitor health of distributed model endpoints\r
Route inference requests to the best available model\r
Recover downed nodes automatically\r
Plan GPU memory allocation across models\r
Deploy models across heterogeneous hardware\r \r

Architecture Pattern\r

\r A homelab cluster typically spans 2-3 tiers:\r \r | Tier | Typical Hardware | Runtime | Role |\r |------|-----------------|---------|------|\r | Local | Primary GPU (RTX 4090/5090) | Ollama | Fast inference, embeddings |\r | Remote | Secondary GPU (RTX 3090/4090) | llama.cpp or Ollama | Distributed inference |\r | NAS/CPU | Synology, RPi, any CPU node | Ollama | Lightweight models, fallback |\r \r A LiteLLM proxy sits in front, providing a unified OpenAI-compatible API across all tiers.\r \r

Health Monitoring\r

\r Check all endpoints with configurable per-endpoint timeouts:\r \r

# Define endpoints with tier labels\r
ENDPOINTS = {\r
    "local/ollama": {"url": "http://localhost:11434/api/tags", "tier": "LOCAL"},\r
    "remote/mark-i": {"url": "http://REMOTE_IP:3009/v1/models", "tier": "REMOTE", "timeout": 8},\r
    "gateway/litellm": {"url": "http://localhost:8080/health/liveliness", "tier": "GATEWAY"},\r
}\r
\r
# For each endpoint: GET with timeout, check HTTP 200\r
# Classify: HEALTHY / DEGRADED / DOWN per tier\r
# Overall prognosis based on tier health\r
```\r
\r
**Key lesson:** Use `/health/liveliness` for LiteLLM, not `/health` — the latter probes all model routes and hangs if any are unreachable.\r
\r
## Expert MoE Routing\r
\r
Route requests to the optimal model based on task classification:\r
\r
```\r
Task Categories:\r
  code     → Coder model (Qwen2.5-Coder-7B or similar)\r
  reason   → Reasoning model (DeepSeek-R1-Distill or similar)\r
  chat     → General model (Qwen2.5-14B or similar)\r
  vision   → Vision model (Qwen2.5-VL or similar)\r
  fast     → Smallest available model for quick responses\r
  embed    → Embedding model (nomic-embed-text or similar)\r
\r
Router logic:\r
  1. Classify task from prompt\r
  2. Check health of preferred model\r
  3. Fallback to next-best if unavailable\r
  4. Return model endpoint + metadata\r
```\r
\r
## Docker Deployment (llama.cpp on Remote Nodes)\r
\r
### Critical: Use Docker Volumes, Not Bind Mounts\r
\r
For models larger than ~1.5GB on Windows Docker hosts:\r
\r
```bash\r
# Create a Docker volume for model storage\r
docker volume create models-vol\r
\r
# Copy models INTO the volume\r
docker run --rm -v models-vol:/models -v /host/path:/src alpine cp /src/model.gguf /models/\r
\r
# Run container FROM volume (not bind mount)\r
docker run -d --gpus all -v models-vol:/models -p 3009:8000 \\r
  -e MODEL_PATH=/models/model.gguf your-llamacpp-image\r
```\r
\r
**Why:** Windows bind mounts use gRPC-FUSE/9P bridge which hangs during GPU tensor loading for large files. Docker volumes use native Linux ext4 and bypass this entirely.\r
\r
### Sequential Container Startup\r
\r
Never start multiple GPU containers simultaneously:\r
\r
```bash\r
# WRONG — causes CUDA initialization deadlock\r
docker start mark-i mark-iii mark-iv mark-vi &\r
\r
# RIGHT — sequential with health check between each\r
for container in mark-v mark-iii mark-iv mark-vi mark-i; do\r
  docker restart $container\r
  sleep 5\r
  # Verify health before starting next\r
  curl -s http://localhost:PORT/v1/models || echo "Warning: $container slow to start"\r
done\r
```\r
\r
### GPU Memory Planning\r
\r
Plan your model lineup to fit within VRAM:\r
\r
```\r
Example for 24GB GPU:\r
  14B model (Q4_K_M)  →  9.0 GB, 28 GPU layers\r
  7B coder            →  4.4 GB, full GPU\r
  8B reasoning        →  4.6 GB, full GPU\r
  1.5B fast coder     →  1.1 GB, full GPU\r
  1.7B fast chat      →  1.0 GB, full GPU\r
  ─────────────────────────────\r
  Total:               20.1 GB (~84% utilized)\r
\r
  Remaining: CPU-only containers for 32B+ models\r
```\r
\r
## Automatic Node Recovery\r
\r
When a remote node goes down (Docker Desktop crash, reboot, etc.):\r
\r
```\r
Recovery sequence:\r
  1. Health check fails for remote tier\r
  2. Check if SSH is responsive (node is up but Docker is down)\r
  3. If SSH works: restart Docker Desktop via SSH\r
  4. If SSH fails: create RDP session to wake the machine\r
  5. Wait for Docker + sequential container restart\r
  6. Re-check health\r
```\r
\r
**Important:** Never store recovery credentials in plaintext. Use a vault (Azure Key Vault, HashiCorp Vault, etc.) and pipe secrets through stdin, never as CLI arguments.\r
\r
## LiteLLM Gateway Configuration\r
\r
Unified API across all tiers:\r
\r
```yaml\r
model_list:\r
  # Local Ollama models\r
  - model_name: local/chat\r
    litellm_params:\r
      model: ollama/qwen2.5:32b\r
      api_base: http://localhost:11434\r
\r
  # Remote llama.cpp models (need openai/ prefix)\r
  - model_name: remote/mark-i\r
    litellm_params:\r
      model: openai/qwen2.5-14b-instruct\r
      api_base: http://REMOTE_IP:3009/v1\r
      api_key: "not-needed"\r
\r
  # NAS Ollama models\r
  - model_name: nas/coder\r
    litellm_params:\r
      model: ollama/qwen2.5-coder:7b\r
      api_base: http://NAS_IP:11434\r
```\r
\r
**Key:** llama.cpp endpoints need the `openai/` prefix in model name and `/v1` in api_base for LiteLLM compatibility.\r
\r
## Links\r
\r
- **Lumina Homelab:** [luminahomelab.ai](https://luminahomelab.ai)\r
- **X/Twitter:** [@HK47LUMINA](https://x.com/HK47LUMINA)\r
- **GitHub:** [mlesnews](https://github.com/mlesnews)\r

Usage Guidance

This skill appears to be coherent for homelab cluster management, but it expects the agent (or the operator) to run network and system commands (docker, curl, ssh, RDP) and to supply remote credentials or vault access at runtime even though none are declared. Before installing or enabling it: - Verify provenance (source is 'unknown' and there's no homepage). Consider running in an isolated test environment first. - Do not provide long-lived credentials directly to the skill; use a secrets vault as recommended and prefer short-lived credentials. - Require explicit human confirmation before the agent performs SSH/RDP, restarts Docker, or copies model files — those actions can be disruptive. - Audit the full, untruncated SKILL.md to confirm there are no instructions that run arbitrary downloaded code or call unknown external endpoints. - If you allow autonomous runs, restrict the agent's network and credential scope (least privilege) and log all actions so you can review recovery operations and container restarts. If you want, provide the full SKILL.md (it was truncated in the package) and any provenance or author contact so I can re-check for missing or risky instructions.

Capability Analysis

Type: OpenClaw Skill Name: homelab-cluster Version: 1.0.0 The skill instructs the OpenClaw agent to perform high-privilege remote system management actions, specifically 'restart Docker Desktop via SSH' and 'create RDP session to wake the machine' for automatic node recovery, as detailed in SKILL.md. These instructions imply the agent will handle credentials and execute commands on remote systems, which are inherently risky capabilities. While presented as part of a legitimate 'homelab-cluster' management function, granting an AI agent such remote execution and credential management powers creates a significant attack surface, making it suspicious due to the potential for abuse if the agent or its environment were compromised.

Capability Assessment

ℹ Purpose & Capability

The SKILL.md content (health checks, routing, Docker advice, SSH/RDP recovery, LiteLLM config) matches the stated 'Homelab Cluster Management' purpose. However, the skill declares no required binaries or environment variables while its instructions explicitly use docker, curl, ssh, RDP, and external vaults — a mild inconsistency: the runtime expects system/network tools and credentials even though none are listed in metadata.

✓ Instruction Scope

The instructions remain within cluster-management scope: endpoint health checks, model routing logic, Docker volume strategies, sequential container startups, GPU memory planning, and recovery procedures. They instruct connecting to remote hosts (SSH/RDP) and operating Docker and HTTP endpoints, which is expected for this purpose. There is no obvious instruction to collect unrelated system data or to exfiltrate secrets, though the agent will be asked to handle sensitive credentials if used.

✓ Install Mechanism

This is an instruction-only skill with no install spec and no code files, which is low-risk from an installation perspective (nothing is written to disk by the skill package itself).

ℹ Credentials

The skill requests no declared environment variables or credentials, but the guidance assumes use of SSH/RDP credentials and external vaults (Azure/HashiCorp) and references API keys in LiteLLM config snippets. It's reasonable for a management skill to require such credentials at runtime, but the metadata does not document them — users should not supply secrets implicitly without clear prompts and should prefer a vault-backed workflow as the doc suggests.

✓ Persistence & Privilege

always:false (default) and autonomous invocation enabled (also default). The skill does not request permanent 'always' presence or attempt to modify other skills/config. No persistence or escalation behaviors are declared.

Version History

v1.0.0

Initial release: health monitoring, MoE routing, Docker volume patterns, GPU memory planning, sequential startup, node recovery, LiteLLM gateway config

Metadata

Slug homelab-cluster

Version 1.0.0

License —

All-time Installs 4

Active Installs 4

Total Versions 1

Frequently Asked Questions

What is Homelab Cluster Management?

Manage multi-tier AI inference clusters for homelabs. Health monitoring, expert MoE routing, automatic node recovery, and model deployment across Ollama and llama.cpp nodes. Covers GPU memory planning, Docker volume strategies for large models, sequential startup patterns to avoid CUDA deadlocks, and unified API gateways via LiteLLM. It is an AI Agent Skill for Claude Code / OpenClaw, with 976 downloads so far.

How do I install Homelab Cluster Management?

Run "/install homelab-cluster" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Homelab Cluster Management free?

Yes, Homelab Cluster Management is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Homelab Cluster Management support?

Homelab Cluster Management is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Homelab Cluster Management?

It is built and maintained by mlesnews (@mlesnews); the current version is v1.0.0.

More Skills

Homelab Cluster Management