/install homelab-cluster
\r \r
Homelab Cluster Management\r
\r Manage a compound AI compute cluster spanning multiple tiers of GPU and CPU inference nodes.\r Built and battle-tested by Lumina Homelab.\r \r
When to Use\r
\r Use this skill when your agent needs to:\r
- Monitor health of distributed model endpoints\r
- Route inference requests to the best available model\r
- Recover downed nodes automatically\r
- Plan GPU memory allocation across models\r
- Deploy models across heterogeneous hardware\r \r
Architecture Pattern\r
\r A homelab cluster typically spans 2-3 tiers:\r \r | Tier | Typical Hardware | Runtime | Role |\r |------|-----------------|---------|------|\r | Local | Primary GPU (RTX 4090/5090) | Ollama | Fast inference, embeddings |\r | Remote | Secondary GPU (RTX 3090/4090) | llama.cpp or Ollama | Distributed inference |\r | NAS/CPU | Synology, RPi, any CPU node | Ollama | Lightweight models, fallback |\r \r A LiteLLM proxy sits in front, providing a unified OpenAI-compatible API across all tiers.\r \r
Health Monitoring\r
\r Check all endpoints with configurable per-endpoint timeouts:\r \r
# Define endpoints with tier labels\r
ENDPOINTS = {\r
"local/ollama": {"url": "http://localhost:11434/api/tags", "tier": "LOCAL"},\r
"remote/mark-i": {"url": "http://REMOTE_IP:3009/v1/models", "tier": "REMOTE", "timeout": 8},\r
"gateway/litellm": {"url": "http://localhost:8080/health/liveliness", "tier": "GATEWAY"},\r
}\r
\r
# For each endpoint: GET with timeout, check HTTP 200\r
# Classify: HEALTHY / DEGRADED / DOWN per tier\r
# Overall prognosis based on tier health\r
```\r
\r
**Key lesson:** Use `/health/liveliness` for LiteLLM, not `/health` — the latter probes all model routes and hangs if any are unreachable.\r
\r
## Expert MoE Routing\r
\r
Route requests to the optimal model based on task classification:\r
\r
```\r
Task Categories:\r
code → Coder model (Qwen2.5-Coder-7B or similar)\r
reason → Reasoning model (DeepSeek-R1-Distill or similar)\r
chat → General model (Qwen2.5-14B or similar)\r
vision → Vision model (Qwen2.5-VL or similar)\r
fast → Smallest available model for quick responses\r
embed → Embedding model (nomic-embed-text or similar)\r
\r
Router logic:\r
1. Classify task from prompt\r
2. Check health of preferred model\r
3. Fallback to next-best if unavailable\r
4. Return model endpoint + metadata\r
```\r
\r
## Docker Deployment (llama.cpp on Remote Nodes)\r
\r
### Critical: Use Docker Volumes, Not Bind Mounts\r
\r
For models larger than ~1.5GB on Windows Docker hosts:\r
\r
```bash\r
# Create a Docker volume for model storage\r
docker volume create models-vol\r
\r
# Copy models INTO the volume\r
docker run --rm -v models-vol:/models -v /host/path:/src alpine cp /src/model.gguf /models/\r
\r
# Run container FROM volume (not bind mount)\r
docker run -d --gpus all -v models-vol:/models -p 3009:8000 \\r
-e MODEL_PATH=/models/model.gguf your-llamacpp-image\r
```\r
\r
**Why:** Windows bind mounts use gRPC-FUSE/9P bridge which hangs during GPU tensor loading for large files. Docker volumes use native Linux ext4 and bypass this entirely.\r
\r
### Sequential Container Startup\r
\r
Never start multiple GPU containers simultaneously:\r
\r
```bash\r
# WRONG — causes CUDA initialization deadlock\r
docker start mark-i mark-iii mark-iv mark-vi &\r
\r
# RIGHT — sequential with health check between each\r
for container in mark-v mark-iii mark-iv mark-vi mark-i; do\r
docker restart $container\r
sleep 5\r
# Verify health before starting next\r
curl -s http://localhost:PORT/v1/models || echo "Warning: $container slow to start"\r
done\r
```\r
\r
### GPU Memory Planning\r
\r
Plan your model lineup to fit within VRAM:\r
\r
```\r
Example for 24GB GPU:\r
14B model (Q4_K_M) → 9.0 GB, 28 GPU layers\r
7B coder → 4.4 GB, full GPU\r
8B reasoning → 4.6 GB, full GPU\r
1.5B fast coder → 1.1 GB, full GPU\r
1.7B fast chat → 1.0 GB, full GPU\r
─────────────────────────────\r
Total: 20.1 GB (~84% utilized)\r
\r
Remaining: CPU-only containers for 32B+ models\r
```\r
\r
## Automatic Node Recovery\r
\r
When a remote node goes down (Docker Desktop crash, reboot, etc.):\r
\r
```\r
Recovery sequence:\r
1. Health check fails for remote tier\r
2. Check if SSH is responsive (node is up but Docker is down)\r
3. If SSH works: restart Docker Desktop via SSH\r
4. If SSH fails: create RDP session to wake the machine\r
5. Wait for Docker + sequential container restart\r
6. Re-check health\r
```\r
\r
**Important:** Never store recovery credentials in plaintext. Use a vault (Azure Key Vault, HashiCorp Vault, etc.) and pipe secrets through stdin, never as CLI arguments.\r
\r
## LiteLLM Gateway Configuration\r
\r
Unified API across all tiers:\r
\r
```yaml\r
model_list:\r
# Local Ollama models\r
- model_name: local/chat\r
litellm_params:\r
model: ollama/qwen2.5:32b\r
api_base: http://localhost:11434\r
\r
# Remote llama.cpp models (need openai/ prefix)\r
- model_name: remote/mark-i\r
litellm_params:\r
model: openai/qwen2.5-14b-instruct\r
api_base: http://REMOTE_IP:3009/v1\r
api_key: "not-needed"\r
\r
# NAS Ollama models\r
- model_name: nas/coder\r
litellm_params:\r
model: ollama/qwen2.5-coder:7b\r
api_base: http://NAS_IP:11434\r
```\r
\r
**Key:** llama.cpp endpoints need the `openai/` prefix in model name and `/v1` in api_base for LiteLLM compatibility.\r
\r
## Links\r
\r
- **Lumina Homelab:** [luminahomelab.ai](https://luminahomelab.ai)\r
- **X/Twitter:** [@HK47LUMINA](https://x.com/HK47LUMINA)\r
- **GitHub:** [mlesnews](https://github.com/mlesnews)\r
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install homelab-cluster - 安装完成后,直接呼叫该 Skill 的名称或使用
/homelab-cluster触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
Homelab Cluster Management 是什么?
Manage multi-tier AI inference clusters for homelabs. Health monitoring, expert MoE routing, automatic node recovery, and model deployment across Ollama and llama.cpp nodes. Covers GPU memory planning, Docker volume strategies for large models, sequential startup patterns to avoid CUDA deadlocks, and unified API gateways via LiteLLM. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 976 次。
如何安装 Homelab Cluster Management?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install homelab-cluster」即可一键安装,无需额外配置。
Homelab Cluster Management 是免费的吗?
是的,Homelab Cluster Management 完全免费(开源免费),可自由下载、安装和使用。
Homelab Cluster Management 支持哪些平台?
Homelab Cluster Management 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 Homelab Cluster Management?
由 mlesnews(@mlesnews)开发并维护,当前版本 v1.0.0。