Description

CUDA Ollama — route Ollama LLM inference across NVIDIA GPUs with automatic CUDA load balancing. CUDA Ollama cluster for RTX 4090, RTX 4080, A100, L40S, H100....

Usage (SKILL.md)

CUDA Ollama — Route LLMs Across NVIDIA GPUs

Name: Cuda Ollama
Author: twinsgeeks

Turn your NVIDIA GPUs into a unified CUDA Ollama inference cluster. Ollama already uses CUDA for GPU acceleration — Ollama Herd routes requests across multiple CUDA-enabled machines automatically. One CUDA Ollama endpoint, many NVIDIA GPUs.

Why CUDA Ollama fleet routing

You have NVIDIA GPUs across multiple machines — a workstation with an RTX 4090, a server with dual A100s, maybe an old machine with an RTX 3080. Each runs Ollama with CUDA. But without routing, you're manually picking which CUDA GPU handles each request.

CUDA Ollama Herd fixes this: one endpoint routes every request to the best available NVIDIA GPU based on 7 signals including vRAM fit, thermal state, and queue depth.

NVIDIA CUDA GPU recommendations

NVIDIA GPU	vRAM	Best CUDA Ollama models	Notes
RTX 4090	24GB	`llama3.3:70b` (Q4, with partial CPU offload), `qwen3.5:32b`, `deepseek-r1:32b`	Consumer CUDA king
RTX 4080	16GB	`qwen3.5:14b`, `phi4`, `codestral`	Great CUDA mid-range
RTX 4070	12GB	`llama3.2:3b`, `phi4-mini`, `gemma3:4b`	Budget CUDA option
RTX 3090	24GB	Same as RTX 4090	Older CUDA, still excellent
A100	40/80GB	`llama3.3:70b` (full), `deepseek-v3`	Data center CUDA
H100	80GB	`deepseek-v3`, `qwen3.5:72b`	Frontier CUDA performance
L40S	48GB	`llama3.3:70b`, `qwen3.5:32b`	Inference-optimized CUDA

Cross-platform: Any NVIDIA CUDA GPU works; the configurations above are only examples. The fleet router runs on Linux and Windows.
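To sanity-check which models suit which card, weight memory can be approximated as parameter count times bits per weight. The sketch below is a back-of-envelope estimate, not how Ollama or Ollama Herd decides fit. The 20% overhead for KV cache and runtime buffers is an assumed figure, and real usage varies with context length and quantization format.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.20) -> float:
    """Rough VRAM need in GB: weight bytes plus an assumed fractional overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params * bytes per param
    return weights_gb * (1 + overhead)


def fits(params_billion: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the estimate fits entirely in the given VRAM."""
    return estimate_vram_gb(params_billion, bits_per_weight) <= vram_gb


for label, params, bits in [("32B at 4-bit", 32, 4), ("70B at 4-bit", 70, 4)]:
    need = estimate_vram_gb(params, bits)
    print(f"{label}: ~{need:.0f} GB, fits in 24 GB: {fits(params, bits, 24)}")
```

Under these assumptions a 70B model at 4 bits needs roughly 42 GB, so on a 24GB card part of it would have to spill into system RAM.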

Quick start

pip install ollama-herd    # PyPI: https://pypi.org/project/ollama-herd/

On your CUDA Ollama router machine:

herd    # start the CUDA Ollama router (port 11435)

On every NVIDIA CUDA machine:

herd-node    # auto-discovers the CUDA Ollama router via mDNS

Verify CUDA is available on each NVIDIA node:

nvidia-smi    # confirm NVIDIA CUDA driver is loaded
ollama ps     # confirm Ollama is using CUDA GPU

No mDNS? Connect CUDA nodes directly: herd-node --router-url http://router-ip:11435

Use the CUDA Ollama cluster

OpenAI SDK (drop-in replacement)

from openai import OpenAI

# Point at your CUDA Ollama fleet
cuda_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

# Request routes to the best NVIDIA CUDA GPU automatically
response = cuda_client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain CUDA parallel computing"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

curl (Ollama format)

# Routes to best available NVIDIA CUDA GPU
curl http://localhost:11435/api/chat -d '{
  "model": "qwen3.5:32b",
  "messages": [{"role": "user", "content": "Optimize this CUDA kernel"}],
  "stream": false
}'

curl (OpenAI format)

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1:32b", "messages": [{"role": "user", "content": "Hello"}]}'
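For scripts that cannot depend on the `openai` package, the Ollama-format call can also be made with the Python standard library. This is a minimal sketch: it assumes the router listens on `localhost:11435` as in the quick start and that `/api/chat` returns Ollama's usual non-streaming response shape with the reply under `message.content`. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # router address from the quick start


def build_chat_request(model: str, prompt: str,
                       base_url: str = ROUTER_URL) -> urllib.request.Request:
    """Build a non-streaming Ollama-format chat request (nothing is sent yet)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the reply text (assumed response shape)."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

Calling `chat("qwen3.5:32b", "Hello")` would then go through the router like the curl examples above.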

CUDA Ollama fleet features

7-signal CUDA scoring — thermal state, vRAM fit, queue depth, latency history, role affinity, availability trend, context fit
vRAM-aware CUDA fallback — if a CUDA GPU is full, routes to the next best NVIDIA GPU
CUDA auto-retry — transparent failover between NVIDIA CUDA nodes
Context protection — prevents expensive CUDA model reloads from num_ctx changes
Thinking model support — auto-inflates num_predict 4x for reasoning models on CUDA
Request tagging — track per-project usage across your CUDA Ollama cluster
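To make the scoring and fallback ideas concrete, here is a minimal sketch of weighted node selection with a hard vRAM filter. It covers only four of the seven signals, and the field names, weights, and formulas are all hypothetical; it illustrates the general technique, not Ollama Herd's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical per-node snapshot; field names are illustrative only.
    name: str
    free_vram_gb: float
    queue_depth: int
    temperature_c: float
    avg_latency_s: float


def score(node: Node, model_vram_gb: float) -> Optional[float]:
    """Return a score (higher is better), or None if the model cannot fit."""
    if node.free_vram_gb <= 0 or node.free_vram_gb < model_vram_gb:
        return None  # hard filter: a full GPU is never a candidate
    headroom = (node.free_vram_gb - model_vram_gb) / node.free_vram_gb
    queue = 1.0 / (1 + node.queue_depth)
    thermal = max(0.0, min(1.0, (90.0 - node.temperature_c) / 40.0))
    latency = 1.0 / (1.0 + node.avg_latency_s)
    # Assumed weights, chosen only for illustration.
    return 0.3 * headroom + 0.3 * queue + 0.2 * thermal + 0.2 * latency


def pick_node(nodes: list[Node], model_vram_gb: float) -> Optional[Node]:
    """Pick the best-scoring node that can hold the model, or None."""
    scored = [(s, n) for n in nodes if (s := score(n, model_vram_gb)) is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

Because full nodes are filtered out before ranking, the fallback to the "next best" GPU needs no special code path: the next-highest score wins.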

Monitor your CUDA Ollama cluster

# NVIDIA CUDA fleet status
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

# CUDA GPU health — 15 automated checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

# Which CUDA models are loaded
curl -s http://localhost:11435/api/ps | python3 -m json.tool

Web dashboard at http://localhost:11435/dashboard — live view of all NVIDIA CUDA nodes, queues, and models.
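The same endpoints can be polled from Python, for example in a cron job. The response schemas are not described here, so this sketch returns the parsed JSON without assuming any field names. `MONITOR_PATHS`, `monitor_url`, `fetch_json`, and `snapshot` are illustrative names.

```python
import json
import urllib.request

# Endpoint paths taken from the curl commands above.
MONITOR_PATHS = {
    "fleet": "/fleet/status",
    "health": "/dashboard/api/health",
    "loaded": "/api/ps",
}


def monitor_url(name: str, base_url: str = "http://localhost:11435") -> str:
    """Join the router base URL with one of the known monitoring paths."""
    return base_url.rstrip("/") + MONITOR_PATHS[name]


def fetch_json(url: str, timeout: float = 5.0):
    """GET a URL and parse the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def snapshot(base_url: str = "http://localhost:11435") -> dict:
    """Fetch all three endpoints; a failed endpoint maps to an error string."""
    result = {}
    for name in MONITOR_PATHS:
        try:
            result[name] = fetch_json(monitor_url(name, base_url))
        except (OSError, ValueError) as exc:  # network errors and bad JSON
            result[name] = f"error: {exc}"
    return result
```

Catching errors per endpoint keeps one unreachable route from hiding the others.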

Optimize Ollama for NVIDIA CUDA

# Linux (systemd)
sudo systemctl edit ollama
# Add under [Service]:
#   Environment="OLLAMA_KEEP_ALIVE=-1"
#   Environment="OLLAMA_MAX_LOADED_MODELS=-1"
#   Environment="OLLAMA_NUM_PARALLEL=2"
sudo systemctl restart ollama

# Windows (PowerShell)
[System.Environment]::SetEnvironmentVariable("OLLAMA_KEEP_ALIVE", "-1", "User")
[System.Environment]::SetEnvironmentVariable("OLLAMA_MAX_LOADED_MODELS", "-1", "User")
# Restart Ollama afterwards so it picks up the new values
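When the same Linux settings have to be applied on several nodes, generating the systemd drop-in text from one dictionary avoids typos. The sketch below only renders the text, using the values from the Linux example above. `render_dropin` is an illustrative helper, and placing the result where `systemctl edit` would normally save it is left to the operator.

```python
# Values mirror the Linux (systemd) example above.
OLLAMA_ENV = {
    "OLLAMA_KEEP_ALIVE": "-1",
    "OLLAMA_MAX_LOADED_MODELS": "-1",
    "OLLAMA_NUM_PARALLEL": "2",
}


def render_dropin(env: dict[str, str]) -> str:
    """Render a [Service] drop-in with one Environment= line per variable."""
    lines = ["[Service]"]
    for key, value in env.items():
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


print(render_dropin(OLLAMA_ENV), end="")
```

As in the manual steps, Ollama must be restarted before the new values take effect.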

Also available on this CUDA Ollama fleet

Image generation

curl http://localhost:11435/api/generate-image \
  -d '{"model": "z-image-turbo", "prompt": "NVIDIA GPU rendering abstract art", "width": 1024, "height": 1024}'

Embeddings

curl http://localhost:11435/api/embed \
  -d '{"model": "nomic-embed-text", "input": "NVIDIA CUDA GPU inference routing"}'
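Embeddings are usually compared with cosine similarity. The sketch below assumes the router's `/api/embed` mirrors Ollama's response shape, with the vectors under an `embeddings` key, and that `input` may be a list of strings. The helper names `embed` and `cosine` are illustrative.

```python
import json
import math
import urllib.request


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(texts: list[str], model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11435") -> list[list[float]]:
    """Request embeddings for several texts (assumed Ollama-style response)."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]
```

With a running fleet, `a, b = embed(["GPU routing", "load balancing"])` followed by `cosine(a, b)` gives a similarity between -1 and 1.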

Full documentation

Contribute

Ollama Herd is open source (MIT). NVIDIA CUDA users, PRs welcome:

Star on GitHub — help CUDA Ollama users find local inference
Open an issue

Guardrails

CUDA Ollama model downloads require explicit user confirmation — models range from 1GB to 400GB+.
CUDA Ollama model deletion requires explicit user confirmation.
Never delete or modify files in ~/.fleet-manager/.
No models are downloaded automatically — all pulls are user-initiated or require opt-in via auto_pull.
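An agent or wrapper script that follows these guardrails needs an explicit confirmation gate before any pull or delete. A minimal sketch, assuming a simple interactive prompt; `confirm_action` is a hypothetical helper, not part of ollama-herd.

```python
from typing import Callable


def confirm_action(action: str, model: str,
                   ask: Callable[[str], str] = input) -> bool:
    """Return True only on an explicit 'yes'; anything else is a refusal."""
    answer = ask(f"{action} model '{model}'? Type 'yes' to confirm: ")
    return answer.strip().lower() == "yes"
```

The `ask` parameter is injectable so the gate can be tested without a terminal. Treating every answer other than "yes" as a refusal keeps the default safe.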

Security recommendations

This skill appears internally consistent for routing Ollama inference across NVIDIA GPUs, but before installing: (1) verify the upstream project and PyPI package (check the GitHub repo and PyPI page, confirm maintainers and release signatures), (2) review the package source code if possible because pip install executes code on your machine, (3) be aware herd-node uses mDNS/auto-discovery and opens a local HTTP API (default port 11435) — consider firewalling or binding to localhost if you do not want cluster info broadcast on your LAN, (4) editing systemd or environment variables requires sudo/administrative rights so only perform those steps on trusted machines, and (5) ensure Ollama and NVIDIA drivers are up to date and run nvidia-smi to confirm GPU state. If you want higher assurance, request a repository commit hash or signed release and inspect the package before running pip install.

Capability assessment

✓ Purpose & Capability

Name/description (CUDA Ollama routing across NVIDIA GPUs) match the instructions: pip install ollama-herd, run herd / herd-node, use nvidia-smi/ollama and local HTTP endpoints. Declared optional binaries (python3, pip, nvidia-smi) and anyBins (curl|wget) are appropriate for this purpose. The declared config paths under ~/.fleet-manager are plausible for a fleet manager.

✓ Instruction Scope

SKILL.md limits actions to installing the herd package, starting router and node agents, querying local HTTP endpoints, optional systemd/PowerShell environment changes, and mDNS auto-discovery. It does not instruct reading unrelated system files or exfiltrating secrets. Note: mDNS/auto-discovery and fleet dashboard imply local-network broadcasting and discovery, which is expected for a cluster manager but increases network exposure.

ℹ Install Mechanism

No install spec is embedded in the skill bundle (instruction-only), but SKILL.md instructs users to pip install ollama-herd from PyPI. Installing a PyPI package runs arbitrary code on the machine — this is expected for a Python-based fleet tool but is a moderate-risk operation that requires verifying the package and release source before installation.

✓ Credentials

The skill does not request credentials, secrets, or system-wide config paths beyond its own ~/.fleet-manager files. Suggested changes to Ollama environment variables and use of systemd/PowerShell are consistent with configuring a local service. No unrelated environment variables or external service keys are required.

✓ Persistence & Privilege

always is false and autonomous invocation is allowed by default — appropriate for an invocable skill. The skill does not request to modify other skills or system-wide agent configs beyond its own service configuration. Running nodes and router is expected behavior for a cluster manager and may require elevated privileges (e.g., systemd edits).

Version history

v1.0.0

Initial release of CUDA Ollama fleet router.
Route Ollama LLM inference across multiple NVIDIA CUDA GPUs with automatic load balancing.
Supports GPU fleets including RTX 4090, 4080, 4070, 3090, A100, L40S, H100 on Linux and Windows.
Features 7-signal scoring, vRAM-aware fallback, and CUDA auto-retry for robust routing.
Provides cluster health monitoring, web dashboard, and OpenAI-compatible API endpoints.
Manual control for model downloads and deletions for safety.

Metadata

Slug cuda-ollama

Version 1.0.0

License MIT-0

Total installs 2

Current installs 2

Version count 1

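FAQ

What is Cuda Ollama?

CUDA Ollama: route Ollama LLM inference across NVIDIA GPUs with automatic CUDA load balancing. CUDA Ollama cluster for RTX 4090, RTX 4080, A100, L40S, H100.... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 114 downloads to date.

How do I install Cuda Ollama?

Run the command "/install cuda-ollama" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Cuda Ollama free?

Yes. Cuda Ollama is completely free and licensed under MIT-0; it can be freely downloaded, installed, and used.

Which platforms does Cuda Ollama support?

Cuda Ollama is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (linux, windows).

Who develops Cuda Ollama?

It is developed and maintained by Twin Geeks (@twinsgeeks); the current version is v1.0.0.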
