
Local Inference Context

Author: JoeKravelli · GitHub ↗ · v1.0.0 · MIT-0
linux, darwin ✓ Security check passed
Total downloads: 16
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install local-inference-context
Description
Context management for self-hosted LLM backends (llama.cpp, Ollama). Prevents mid-task 503 errors and context overflows caused by VRAM-limited KV caches. Use...
Usage instructions (SKILL.md)

Local Inference Context

Generic context skills assume a reliable, large-context cloud provider. Local backends (llama.cpp, Ollama) have a different failure profile: the KV cache is bounded by VRAM, the server can return 503 before OpenClaw's compaction logic triggers, and the compaction model is the same overloaded local model. This skill addresses that reality.


Why local backends fail differently

| Cloud provider | Local llama.cpp / Ollama |
| --- | --- |
| Context limit is a soft API error; OpenClaw retries after compaction | KV cache fills up; server returns 503 or "context length exceeded" mid-request |
| Compaction uses same model, which is always available | Compaction uses same overloaded local model, which may also fail |
| Context window is exactly what the API reports | Effective context = min(configured --ctx-size, available VRAM for KV cache) |
| No idle slot eviction | Idle slots can be evicted; server returns "Loading model" 503 on next request |

The practical consequence: on a GPU-constrained setup (e.g. a 24 GB card running a 27B Q5 model), the usable KV-cache budget is roughly 5–8 GB. At 32k tokens configured context, that fills up faster than the configured limit suggests. Treat 50 % fill as amber and 70 % as red — not 60/80 %.
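The budget arithmetic above can be sketched with the usual transformer KV-cache size formula (one K and one V vector per layer per token). This is a rough estimate under stated assumptions, not llama.cpp's exact accounting: it assumes an unquantised 16-bit cache and ignores compute buffers. The function names and the model dimensions below are hypothetical, picked only so the numbers are easy to follow.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Approximate KV-cache size: K and V vectors for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def effective_context(ctx_size, kv_budget_bytes, n_layers, n_kv_heads, head_dim,
                      bytes_per_elem=2):
    """min(configured context, number of tokens that fit in the KV budget)."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return min(ctx_size, kv_budget_bytes // per_token)


if __name__ == "__main__":
    # Hypothetical model shape (48 layers, 8 KV heads, head size 128).
    full = kv_cache_bytes(48, 8, 128, n_tokens=32768)
    print(full / GIB)                                      # 6.0 GiB at 32k tokens
    print(effective_context(32768, 4 * GIB, 48, 8, 128))   # 21845 tokens fit in 4 GiB
```

With these illustrative dimensions a full 32k context needs about 6 GiB, so a 4 GiB KV budget caps the usable context at roughly two thirds of the configured limit.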


Calibrating your effective context budget

Before a long session, run this once to understand your actual headroom:

# Check VRAM headroom
nvidia-smi --query-gpu=memory.used,memory.free,memory.total \
  --format=csv,noheader,nounits

# Check llama.cpp slot state
curl -s http://localhost:8081/slots | python3 -m json.tool

If memory.free is less than 4 GB (about 4096 in the MiB figures that nvidia-smi prints with nounits), treat the session as already amber regardless of what /status reports. Log the result to memory:

VRAM free: X MB — effective context budget: reduced
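A minimal sketch of turning the nvidia-smi row into that log line, assuming a single GPU and the `used, free, total` column order from the query above. The helper names and the "normal" label for the non-amber case are assumptions, not part of the skill.

```python
def parse_nvidia_smi(line):
    """Parse one 'used, free, total' CSV row (MiB values, no header, no units)."""
    used, free, total = (int(field.strip()) for field in line.split(","))
    return {"used": used, "free": free, "total": total}


def vram_log_line(line, amber_below_mib=4096):
    """Build the memory log entry; below ~4 GB free the budget counts as reduced."""
    mem = parse_nvidia_smi(line)
    budget = "reduced" if mem["free"] < amber_below_mib else "normal"
    return f"VRAM free: {mem['free']} MB - effective context budget: {budget}"


if __name__ == "__main__":
    # Example row as printed by the nvidia-smi query above (values made up).
    print(vram_log_line("20480, 3500, 24576"))
```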

Thresholds for local backends

| Fill level | State | Action |
| --- | --- | --- |
| < 50 % | Green | Proceed normally |
| 50–69 % | Amber | Trim tool outputs, flush key facts to memory |
| 70–84 % | Red | Checkpoint, offer /compact before continuing |
| ≥ 85 % | Critical | Stop expanding. Compact or /new before next tool call |

Check /status at session start and after any tool call that returns more than ~200 lines of output.
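The threshold table can be expressed as a small lookup. This is an illustrative sketch: `classify_fill` is a hypothetical helper, and it assumes the fill percentage has already been read from /status.

```python
ACTIONS = {
    "green": "Proceed normally",
    "amber": "Trim tool outputs, flush key facts to memory",
    "red": "Checkpoint, offer /compact before continuing",
    "critical": "Stop expanding. Compact or /new before next tool call",
}


def classify_fill(fill_pct):
    """Map a context fill percentage to the local-backend state."""
    if fill_pct >= 85:
        return "critical"
    if fill_pct >= 70:
        return "red"
    if fill_pct >= 50:
        return "amber"
    return "green"


if __name__ == "__main__":
    for pct in (12, 50, 72.5, 85):
        state = classify_fill(pct)
        print(f"{pct} % -> {state}: {ACTIONS[state]}")
```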


Recognising a local backend failure

These are server-side errors, not OpenClaw compaction events. They require a different response than a normal context overflow:

| Signal | Meaning |
| --- | --- |
| HTTP 503 with body "loading model" | Idle slot was evicted; model is reloading. Wait 10–30 s, then retry once. |
| HTTP 503 with body "no slot available" | All slots busy or KV cache full. Do NOT retry immediately; compact first. |
| "context length exceeded" in error | Hard KV-cache overflow. Compact or start /new before any retry. |
| Sudden very slow response then timeout | KV cache thrashing; reduce context before next request. |

Never retry a 503 "no slot available" or context overflow without first reducing context. Retrying makes the problem worse by sending the same oversized payload again.
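As a sketch, the signal table maps onto a small classifier. The function name and the returned action labels are hypothetical; the matched substrings are the ones listed in the table, compared case-insensitively, and real server messages may be worded differently.

```python
def classify_backend_error(status, body):
    """Decide how to react to a failed request from a local backend."""
    text = body.lower()
    if "context length exceeded" in text:
        return "compact_first"            # hard KV-cache overflow
    if status == 503 and "loading model" in text:
        return "wait_then_retry_once"     # evicted slot, model reloading
    if status == 503 and "no slot available" in text:
        return "compact_first"            # slots busy or KV cache full
    return "unknown"


if __name__ == "__main__":
    print(classify_backend_error(503, '{"error": "Loading model"}'))
    print(classify_backend_error(503, "no slot available"))
    print(classify_backend_error(400, "context length exceeded"))
```

Checking the overflow string before the status code keeps the rule "never retry an overflow without reducing context" in force even if the server reports it under a different HTTP status.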


Pre-task checklist for long operations

Before any task you expect to span more than 4 turns (file edits, debugging sessions, multi-step setups):

  1. Run /status — note current fill %.
  2. Check nvidia-smi if fill is already above 40 %.
  3. Estimate token cost of the task:
    • Each file read ≈ 500–3000 tokens depending on file size
    • Each exec result ≈ 200–1500 tokens
    • Each web_fetch ≈ 1000–4000 tokens
  4. If estimated total would push past 70 %, split into phases and tell the user upfront.
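Steps 3 and 4 can be sketched as a quick projection. The per-call ranges are the ones listed above; the helper names, the choice to plan against the upper bound, and the 32768-token context in the example are assumptions.

```python
COST_RANGES = {
    "file_read": (500, 3000),
    "exec": (200, 1500),
    "web_fetch": (1000, 4000),
}


def estimate_tokens(calls):
    """Return (low, high) token estimates for a dict of call kind -> count."""
    low = sum(COST_RANGES[kind][0] * n for kind, n in calls.items())
    high = sum(COST_RANGES[kind][1] * n for kind, n in calls.items())
    return low, high


def should_split(current_fill_pct, ctx_tokens, calls, red_pct=70.0):
    """True if the worst-case estimate would push the fill level past red."""
    _, high = estimate_tokens(calls)
    projected = current_fill_pct + 100.0 * high / ctx_tokens
    return projected > red_pct


if __name__ == "__main__":
    plan = {"file_read": 4, "exec": 6, "web_fetch": 1}
    print(estimate_tokens(plan))            # (4200, 25000)
    print(should_split(40, 32768, plan))    # True: split into phases
```

Planning against the upper bound is deliberately pessimistic, which matches the conservative thresholds this skill recommends for local backends.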

Amber state (50–69 %): lean tool hygiene

Apply these habits to every tool call in amber state:

# Instead of reading entire files:
sed -n '1,50p' /path/to/file          # first 50 lines
grep -n "error\|warn\|fail" logfile   # targeted grep
tail -100 /var/log/syslog             # recent entries only

# Instead of verbose exec output:
some-command 2>&1 | tail -30
systemctl status service --no-pager --lines=20

# Summarise large outputs in one sentence, then discard them:
# "Command succeeded. Key values: port=8081, pid=12345"

Write key values to memory immediately after each tool call — do not rely on them surviving a compaction summary intact.
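When the output is already in hand rather than produced by a shell pipeline, the same trimming habit can be applied in code. A minimal sketch; `trim_output` and its default line counts are hypothetical.

```python
def trim_output(text, head=10, tail=30):
    """Keep the first `head` and last `tail` lines, noting how many were dropped."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


if __name__ == "__main__":
    noisy = "\n".join(f"line {i}" for i in range(200))
    print(trim_output(noisy, head=3, tail=5))
```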


Red state (70–84 %): checkpoint before continuing

  1. Write a checkpoint to memory now:
    ## Checkpoint [timestamp]
    Status: [what is done]
    Pending: [what is next]
    Critical values: [file paths, ports, error codes, config keys]
  2. Tell the user:

⚠️ Context at ~N % (local backend, conservative threshold). I've saved progress to memory. Recommend /compact Focus on [task] before continuing. Or /new for a clean session.

  3. If continuing, use /compact Focus on <current task>, not bare /compact. The local model needs a focused instruction to produce a useful summary under memory pressure.
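The checkpoint template can be filled in programmatically. A small sketch, assuming critical values are kept as key/value pairs; the function name, the `key=value` rendering, and the timestamp format are illustrative choices, not something the skill prescribes.

```python
from datetime import datetime, timezone


def render_checkpoint(status, pending, critical_values, now=None):
    """Render the checkpoint block with a UTC timestamp."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    values = ", ".join(f"{key}={value}" for key, value in critical_values.items())
    return "\n".join([
        f"## Checkpoint {moment.strftime('%Y-%m-%d %H:%M UTC')}",
        f"Status: {status}",
        f"Pending: {pending}",
        f"Critical values: {values}",
    ])


if __name__ == "__main__":
    print(render_checkpoint(
        status="config file patched",
        pending="restart service and verify",
        critical_values={"port": 8081, "path": "/path/to/file"},
    ))
```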

Critical state (≥ 85 %): stop and recover

Do not issue any more tool calls that expand context.

  1. Write the checkpoint (see above).
  2. Send the user a recovery message:
🛑 Context critical (~N %). Stopping to prevent a server error.

Done: [X]
Pending: [Y]
Key info: [Z]

Options:
  /compact Focus on [task]   — summarise and continue
  /new                       — fresh session (I'll reload from memory)
  3. Wait for the user to choose. Do not attempt to continue on your own.

After a 503 or context-overflow error

If the server already returned an error before you could act:

  1. Do not panic and do not retry the same request.
  2. Check the error type:
    • "loading model" → wait 15–30 s, then retry once with a minimal message.
    • "no slot available" or context length exceeded → compact first.
  3. Run /compact Focus on [what you were doing].
  4. After compaction, verify the slot is ready:
    curl -s http://localhost:8081/health
    # expect: {"status":"ok"}
    
  5. Re-read any file paths or config values from memory or disk — do not trust the compaction summary to have preserved them verbatim.
  6. Resume with a short, targeted first message to re-establish the session before loading more context.
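Step 4 can be sketched as a bounded polling loop against the /health endpoint shown above. The URL and the expected `{"status":"ok"}` body come from that example; the function names, attempt count, and delay are assumptions. The fetch function is passed in so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8081/health"  # port as used in the examples above


def fetch_health(url=HEALTH_URL, timeout=5.0):
    """Return the raw /health body; raises OSError on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def wait_until_healthy(fetch=fetch_health, attempts=6, delay_s=5.0, sleep=time.sleep):
    """Poll until the server reports status 'ok'; give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            if json.loads(fetch()).get("status") == "ok":
                return True
        except (OSError, ValueError, AttributeError):
            pass  # unreachable, still reloading (e.g. 503), or unexpected body
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_healthy() else "still not healthy; do not resume yet")
```

Six attempts five seconds apart cover roughly the 15–30 s reload window mentioned in step 2 without retrying indefinitely.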

Compaction model — required, not optional

Without a dedicated compaction model, OpenClaw uses the same local model for summarisation — the identical model whose KV cache just caused the overflow. This means compaction will likely fail or produce a degraded summary. A separate compaction model is a prerequisite for this skill to work reliably, not an optional optimisation.

The compaction model should run on a different machine or a second inference instance with its own memory budget. It does not need to be powerful — it only needs to summarise text faithfully and follow instructions. A 7B–8B model is sufficient.

Recommended model: qwen2.5:7b via Ollama (fits in ~5 GB RAM/VRAM, fast, excellent at summarisation and instruction-following). Fallback if speed is critical: llama3.2:3b (~2 GB).

{
  "agents": {
    "defaults": {
      "compaction": {
        "model": "ollama/qwen2.5:7b",
        "notifyUser": true,
        "memoryFlush": {
          "model": "ollama/qwen2.5:7b"
        }
      }
    }
  },
  "providers": {
    "ollama": {
      "baseUrl": "http://<COMPACTION-SERVER-IP>:11434"
    }
  }
}

Without this configuration, the skill provides partial benefit only: the conservative thresholds and lean tool habits reduce overflow frequency, but cannot recover reliably once an overflow occurs.
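A quick sanity check for the configuration can be sketched as follows. It assumes the key layout from the JSON example above and that the model string has the form `provider/model`; the function name and messages are hypothetical, and this is not an official OpenClaw validator.

```python
def check_compaction_config(config):
    """Return a list of problems found in the compaction-related settings."""
    problems = []
    compaction = config.get("agents", {}).get("defaults", {}).get("compaction", {})
    model = compaction.get("model", "")
    if "/" not in model:
        problems.append("agents.defaults.compaction.model is missing or not 'provider/model'")
        return problems
    provider = model.split("/", 1)[0]
    base_url = config.get("providers", {}).get(provider, {}).get("baseUrl", "")
    if not base_url:
        problems.append(f"providers.{provider}.baseUrl is not set")
    elif "<" in base_url or ">" in base_url:
        problems.append(f"providers.{provider}.baseUrl still contains a placeholder")
    return problems


if __name__ == "__main__":
    example = {
        "agents": {"defaults": {"compaction": {"model": "ollama/qwen2.5:7b"}}},
        "providers": {"ollama": {"baseUrl": "http://<COMPACTION-SERVER-IP>:11434"}},
    }
    print(check_compaction_config(example))
```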


Slash command reference

| Command | When to use |
| --- | --- |
| /status | Check fill %; use at session start and after large tool outputs |
| /context list | See which injected files and skills consume the most tokens |
| /compact Focus on <topic> | Guided compaction; always specify focus on a local backend |
| /new | Clean slate; fastest recovery when context is critical |
| /usage tokens | Per-reply token counter; useful for calibrating estimates |

Relationship to other skills

| Skill | When to use instead |
| --- | --- |
| context-recovery | After compaction on any backend; recovers lost context via channel history |
| context-budgeting | Cloud providers or stable local setups; heartbeat-based GC at >80 % |
| context-clean-up | Diagnosing chronic context bloat; ranked offender audit |
| context-anchor | Post-compaction orientation via memory file scan |

Use local-inference-context before problems occur and context-recovery after compaction if context was lost.

Safe usage advice
This appears safe for users running local LLM backends, especially llama.cpp or Ollama. Before installing, be aware that it may prompt local diagnostic commands and encourage memory checkpoints; keep those checkpoints concise and do not store secrets or sensitive logs.
Functional analysis
Type: OpenClaw Skill
Name: local-inference-context
Version: 1.0.0
The skill provides purely instructional guidance for an AI agent to manage context constraints when using local inference backends like Ollama or llama.cpp. It uses standard diagnostic commands (nvidia-smi, curl) and efficient file-reading practices (sed, grep) to prevent VRAM overflows and 503 errors, with no evidence of malicious intent or data exfiltration.
Capability assessment
Purpose & Capability
The visible artifacts are coherent with the stated purpose: managing context pressure for local llama.cpp/Ollama-style backends.
Instruction Scope
The instructions focus on checking context status, reducing output size, compacting, or waiting for user direction when context is high.
Install Mechanism
There is no install spec and no code files; this is an instruction-only skill.
Credentials
The skill asks for local diagnostic commands and localhost checks, which are proportionate for local inference troubleshooting but should be expected by the user.
Persistence & Privilege
The skill instructs the agent to save key values and checkpoints to memory; this is purpose-aligned but could retain sensitive local/task details if used carelessly.
How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install local-inference-context
  3. Once installed, invoke the Skill by name or trigger it with /local-inference-context
  4. Provide the required inputs as described in the Skill's parameter notes to get structured output
Version history
v1.0.0
- Initial release of local-inference-context skill.
- Adds context management tailored for self-hosted LLM backends (llama.cpp, Ollama), including VRAM-aware thresholds.
- Prevents mid-task 503 errors and context overflows caused by limited KV-cache on local hardware.
- Provides detailed guidance on calibrating effective context budgets, error signals, and pre-task checklists.
- Introduces tailored recommendations for amber, red, and critical fill states to minimize context-related failures.
- Highlights the necessity of a dedicated compaction model for reliable recovery.
Metadata
Slug local-inference-context
Version 1.0.0
License MIT-0
Total installs 0
Current installs 0
Version count 1
FAQ

What is Local Inference Context?

Context management for self-hosted LLM backends (llama.cpp, Ollama). Prevents mid-task 503 errors and context overflows caused by VRAM-limited KV caches. Use... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 16 total downloads so far.

How do I install Local Inference Context?

Run the command "/install local-inference-context" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Local Inference Context free?

Yes. Local Inference Context is completely free, released under the MIT-0 license, and can be freely downloaded, installed, and used.

Which platforms does Local Inference Context support?

Local Inference Context runs cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (linux, darwin).

Who developed Local Inference Context?

It is developed and maintained by JoeKravelli (@joekravelli); the current version is v1.0.0.
