Fleet Embeddings
You're helping someone generate embeddings — converting text into vectors for semantic search, RAG pipelines, duplicate detection, or recommendation systems. Instead of hitting one Ollama instance, the fleet distributes embedding requests across all available nodes automatically.
Why fleet embeddings matter
Building a RAG knowledge base means embedding thousands of document chunks. On a single machine, embedding 10,000 chunks takes significant time and blocks LLM inference. With fleet routing, embedding requests spread across nodes — the machine that's least busy handles each batch, and LLM inference continues uninterrupted on other nodes.
Same Ollama embedding models you already know. Same API. Just faster because the fleet parallelizes it.
Get started
pip install ollama-herd
herd # start the router (port 11435)
herd-node # start on each device
ollama pull nomic-embed-text # pull an embedding model
No feature toggle needed — embeddings route through Ollama automatically.
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
Generate embeddings
Ollama format (curl)
curl http://localhost:11435/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The fleet manages all inference routing"
}'
OpenAI SDK (Python)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.embeddings.create(
    model="nomic-embed-text",
    input="The fleet manages all inference routing",
)
vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
Python (httpx)
import httpx

def embed(text, model="nomic-embed-text"):
    resp = httpx.post(
        "http://localhost:11435/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=30.0,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

vector = embed("search query here")
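Once a query vector comes back from embed(), semantic search reduces to ranking stored vectors by similarity. The following is a minimal sketch, assuming cosine similarity as the metric and a small in-memory list of (text, vector) pairs. The helper names cosine_similarity and top_k are illustrative and not part of ollama-herd, and the 3-dimensional toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the k (score, text) pairs most similar to the query.

    `indexed` is a list of (text, vector) pairs, e.g. built with embed().
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors (hypothetical data) standing in for real embeddings.
index = [
    ("routing", [1.0, 0.0, 0.0]),
    ("scoring", [0.7, 0.7, 0.0]),
    ("unrelated", [0.0, 0.0, 1.0]),
]
results = top_k([1.0, 0.1, 0.0], index, k=2)
print([text for _, text in results])
```

For more than a few thousand chunks, a vector database or an approximate nearest-neighbor index would likely replace the linear scan; the ranking idea stays the same.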
Batch embedding for RAG
import httpx

def embed_batch(texts, model="nomic-embed-text"):
    """Embed a list of texts. Fleet distributes across nodes."""
    vectors = []
    for text in texts:
        resp = httpx.post(
            "http://localhost:11435/api/embeddings",
            json={"model": model, "prompt": text},
            timeout=30.0,
        )
        resp.raise_for_status()
        vectors.append(resp.json()["embedding"])
    return vectors

# Embed document chunks for RAG
chunks = [
    "Introduction to fleet management...",
    "The scoring engine uses 7 signals...",
    "Context protection prevents model reloads...",
]
vectors = embed_batch(chunks)
print(f"Embedded {len(vectors)} chunks, {len(vectors[0])} dimensions each")
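The loop above sends one request at a time, so only one node is working at any moment. Assuming the router accepts concurrent requests (its queue-based routing suggests so, but this is worth verifying on your fleet), a thread pool keeps several requests in flight so the fleet has something to distribute. This is a sketch only: embed_concurrently and max_workers are hypothetical names, and a fake embedder is used so the block runs without a router.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_concurrently(texts, embed_fn, max_workers=4):
    """Embed texts with up to max_workers requests in flight.

    embed_fn takes one string and returns one vector, for example a
    function that POSTs to /api/embeddings. Results keep input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first error.
        return list(pool.map(embed_fn, texts))


def fake_embed(text):
    """Stand-in embedder (assumption) so the sketch runs offline."""
    return [float(len(text)), float(text.count(" "))]


chunks = ["alpha beta", "gamma", "delta epsilon zeta"]
vectors = embed_concurrently(chunks, fake_embed, max_workers=2)
print(vectors)
```

In real use, pass an HTTP-backed function such as the embed() helper from the httpx example in place of fake_embed. A reasonable starting point for max_workers is the number of nodes in the fleet, adjusted after watching the dashboard queues.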
Available embedding models
Check what's available:
curl -s http://localhost:11435/api/tags | python3 -c "
import json, sys
for m in json.load(sys.stdin)['models']:
    if 'embed' in m['name'].lower() or 'nomic' in m['name'].lower():
        print(f'  {m[\"name\"]}')"
Common models: nomic-embed-text, mxbai-embed-large, all-minilm, snowflake-arctic-embed.
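The same name-based filter can be kept as a small Python helper inside an indexing script. This sketch assumes the /api/tags response has the shape the shell one-liner relies on (a top-level "models" list whose entries carry a "name"). The function name embedding_models and the sample payload are made up for illustration. Note that the heuristic only matches names containing "embed" or "nomic", so a model such as all-minilm would be missed.

```python
def embedding_models(tags_response):
    """Return model names that look like embedding models.

    tags_response is the parsed JSON from GET /api/tags. The name-based
    filter is a heuristic, the same one used by the shell one-liner.
    """
    names = [m["name"] for m in tags_response.get("models", [])]
    return [n for n in names if "embed" in n.lower() or "nomic" in n.lower()]


# Hypothetical sample payload; a real one comes from the router.
sample = {
    "models": [
        {"name": "nomic-embed-text:latest"},
        {"name": "gpt-oss:120b"},
        {"name": "mxbai-embed-large:latest"},
    ]
}
print(embedding_models(sample))
```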
Pull a model if needed:
curl -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "node_id": "your-node-id"}'
Usage analytics
Tag embedding requests to track per-project usage:
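import httpx

text = "The fleet manages all inference routing"  # any string to embed
resp = httpx.post(
    "http://localhost:11435/api/embeddings",
    json={
        "model": "nomic-embed-text",
        "prompt": text,
        "metadata": {"tags": ["my-rag-pipeline", "indexing"]},
    },
    timeout=30.0,
)
resp.raise_for_status()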
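When many call sites send tagged requests, building the request body in one place keeps the tags consistent. A minimal sketch, assuming the metadata.tags shape shown in the request above; the helper name tagged_payload is hypothetical, and stripping whitespace and dropping empty tags is a choice made here rather than documented router behavior.

```python
def tagged_payload(model, prompt, tags):
    """Build an /api/embeddings request body with usage tags attached."""
    # Normalize tags: trim whitespace and drop empty entries (our choice).
    cleaned = [t.strip() for t in tags if t and t.strip()]
    payload = {"model": model, "prompt": prompt}
    if cleaned:
        payload["metadata"] = {"tags": cleaned}
    return payload


body = tagged_payload("nomic-embed-text", "hello fleet", ["my-rag-pipeline", " indexing "])
print(body)
```

The returned dictionary can be passed directly as the json= argument of httpx.post.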
Also available on this fleet
LLM inference
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-oss:120b","messages":[{"role":"user","content":"Hello"}]}'
Drop-in OpenAI SDK compatible. 7-signal scoring routes to the optimal node.
Image generation
curl -o image.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model":"z-image-turbo","prompt":"a sunset","width":1024,"height":1024,"steps":4}'
Requires FLEET_IMAGE_GENERATION=true. Uses mflux (MLX-native Flux).
Speech-to-text
curl -s http://localhost:11435/api/transcribe \
-F "[email protected]" | python3 -m json.tool
Requires FLEET_TRANSCRIPTION=true. Uses Qwen3-ASR.
Monitoring
# Fleet health and model recommendations
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
# Per-app usage (see which projects use the most tokens)
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
Dashboard at http://localhost:11435/dashboard — embedding requests flow through the same queues as LLM requests.
Full documentation
Agent Setup Guide — complete reference for all 4 model types.
Request Tagging Guide — tag requests for per-project analytics.
Guardrails
- Never delete or modify files in ~/.fleet-manager/.
- Never pull or delete models without user confirmation.
- If embedding model not available, suggest: ollama pull nomic-embed-text.
- If router not running, suggest: herd or uv run herd.
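The last two guardrails can be checked before sending any embedding request. The sketch below uses only the standard library and the /api/tags endpoint shown earlier. The function name router_status, its parameters, and the hint wording are assumptions, and it only suggests commands rather than running them, in keeping with the confirmation rule.

```python
import json
import urllib.request


def router_status(base_url="http://localhost:11435", model="nomic-embed-text", timeout=3.0):
    """Return (ok, hint): ok is True only if the router answers and lists the model."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            models = json.load(resp).get("models", [])
    except (OSError, ValueError, AttributeError):
        # Connection errors and timeouts are OSError subclasses;
        # ValueError/AttributeError cover a reply that is not the expected JSON.
        return False, "Router not reachable. Suggest: herd (or: uv run herd)"
    names = [m.get("name", "") for m in models]
    # Ollama-style names carry a tag suffix such as ":latest".
    if not any(name.split(":")[0] == model for name in names):
        return False, f"Model missing. Suggest: ollama pull {model}"
    return True, "Router is up and the model is available."


# Port 1 is used here only to demonstrate the unreachable case.
ok, hint = router_status("http://127.0.0.1:1", timeout=1.0)
print(ok, hint)
```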
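- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install fleet-embeddings
- Once installed, call the Skill by name or trigger it with /fleet-embeddings
- Provide the inputs described in the Skill's parameter notes to get structured output.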
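What is Fleet Embeddings?
Embeddings with nomic-embed-text, mxbai-embed, and snowflake-arctic-embed across your device fleet. Fleet-routed via Ollama for RAG, semantic search, and vec... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 142 downloads to date.
How do I install Fleet Embeddings?
Run the command "/install fleet-embeddings" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.
Is Fleet Embeddings free?
Yes. Fleet Embeddings is completely free and released under the MIT-0 license, so it can be freely downloaded, installed, and used.
Which platforms does Fleet Embeddings support?
Fleet Embeddings is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (darwin, linux, windows).
Who develops Fleet Embeddings?
It is developed and maintained by Twin Geeks (@twinsgeeks). The current version is v1.1.1.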