Chapter 40: Performance Tuning — Token Cost Control, Context Budget Management, and Concurrent Lane Configuration
Overview
OpenClaw's operating cost and performance are largely determined by three tunable variables: token consumption (which maps directly to API costs), context window management (which affects session quality and Compaction frequency), and Lane concurrency configuration (which affects multi-task throughput). This chapter starts with a cost analysis and systematically covers every optimizable link in the chain, helping you find the optimal balance between cost, speed, and quality.
40.1 Analyzing the Sources of Token Cost
Every LLM API call generates token consumption composed of the following parts:
The Four Sources of Input Tokens
Total Input Tokens = System Prompt + Skills Injection + Conversation History + Tool Results
Example proportions (typical long session):
System Prompt: ~3,000 tokens (18%)
Skills Injection: ~8,000 tokens (48%) ← The largest optimizable item
Conversation History: ~4,000 tokens (24%)
Tool Results: ~1,600 tokens (10%)
Total: ~16,600 tokens/call
System Prompt: Core instructions injected by OpenClaw on every call — Agent role definition, behavioral guidelines, safety constraints, etc. This portion is relatively fixed at roughly 1,500-4,000 tokens, with limited room for optimization.
Skills Injection: Every activated Skill injects its tool definition (Function Schema) into the System Prompt. If 20 Skills are enabled and each Schema averages 400 tokens, the Skills injection reaches 8,000 tokens — even if 18 of those Skills are completely irrelevant to the current task.
Conversation History: The full record of historical messages (including Agent replies, tool calls, and tool results), which grows linearly as the session progresses. This is what the Compaction mechanism primarily addresses.
Tool Results: Content returned by tool calls from the previous turn. Large tool results (such as reading a long file or scraping a web page) can rapidly consume a large number of tokens.
Output Tokens
Output tokens typically account for 10-30% of total cost, but on certain models (such as o3-mini) they are priced comparably to input tokens and should be appropriately capped with the maxTokens parameter.
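To make these proportions concrete, here is a back-of-envelope per-call cost calculation. It is a minimal Python sketch, not an OpenClaw utility: the token counts are the example figures above, and the $3/$15 per-million prices are Sonnet-class placeholders taken from the cost matrix in 40.7.

# Minimal per-call cost sketch (not an OpenClaw API). Token counts are the
# example figures above; prices are illustrative $/1M-token placeholders.

def per_call_cost(system=3_000, skills=8_000, history=4_000, tool_results=1_600,
                  output=2_000, in_price=3.00, out_price=15.00):
    """Estimate the USD cost of a single API call."""
    input_tokens = system + skills + history + tool_results
    cost = input_tokens / 1e6 * in_price + output / 1e6 * out_price
    return input_tokens, cost

tokens, cost = per_call_cost()
print(f"{tokens:,} input tokens -> ${cost:.4f} per call")
# 16,600 input tokens -> $0.0798 per call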
40.2 Skills Lazy Loading: Reducing Per-Call Token Consumption
Eager Loading (Default Behavior, Not Recommended)
{
"skills": {
"lazy": false,
"enabled": ["web-search", "code-exec", "email", "calendar", "github", ...]
}
}
In eager mode, the complete Function Schema of all enabled Skills is injected on every API call — even when the current conversation doesn't need any of them. 20 Skills × 400 tokens = 8,000 tokens of fixed overhead per call.
Lazy Loading Mode (Recommended)
{
"skills": {
"lazy": true,
"alwaysInject": ["web-search"], // Core Skills always injected
"enabled": ["web-search", "code-exec", "email", "calendar", "github", ...]
}
}
In lazy loading mode:
- Initial injection: Only the metadata of all Skills (name + short description) is injected — approximately 50 tokens each
- Agent judgment: The Agent determines which Skills are needed based on the user's request
- On-demand loading: The Agent issues a skill_load request; the Gateway dynamically injects the full Schema of the target Skill
- Session-scoped cache: Once a Skill is loaded, it remains available for the duration of the current session
Token Savings from Lazy Loading
Scenario: User asks "What's the weather today?"
Eager: Inject all 20 Skills → 8,000 tokens (Skills portion)
Lazy: 20 Skill metadata (1,000t) + load web-search on demand (400t) = 1,400 tokens
Savings: 6,600 tokens/call (approximately 83%)
Scenario: User requests a complex multi-Skill task (uses 8 Skills)
Eager: 8,000 tokens
Lazy: 1,000 + 8×400 = 4,200 tokens
Savings: 3,800 tokens/call (approximately 48%)
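Both scenarios follow one formula: eager injection costs n_skills × schema_tokens per call, while lazy loading costs n_skills × meta_tokens plus schema_tokens per Skill actually loaded. A minimal sketch using the 400-token Schema and 50-token metadata averages from above:

def skills_tokens(n_skills, n_used, schema=400, meta=50, lazy=True):
    """Per-call Skills-injection tokens under eager vs. lazy loading."""
    if not lazy:
        return n_skills * schema                 # every Schema, every call
    return n_skills * meta + n_used * schema     # metadata for all, Schemas on demand

for used in (1, 8):
    eager = skills_tokens(20, used, lazy=False)
    lazy = skills_tokens(20, used)
    print(f"{used} Skill(s) used: eager={eager:,} lazy={lazy:,} saved={1 - lazy / eager:.1%}")
# 1 Skill(s) used: eager=8,000 lazy=1,400 saved=82.5%
# 8 Skill(s) used: eager=8,000 lazy=4,200 saved=47.5%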
40.3 command-dispatch: Zero-Inference-Cost Invocations
How It Works
command-dispatch is a special fast path for handling highly structured, predictable command-style requests — bypassing the LLM reasoning process and routing directly to the corresponding Skill or built-in handler.
Standard flow:
User input → LLM reasoning (consumes tokens) → Tool Call → Execute → Return
command-dispatch flow:
User input (matches command format) → Rule match → Execute directly → Return
(No LLM reasoning; zero token consumption)
Configuring command-dispatch Rules
{
"commandDispatch": {
"enabled": true,
"rules": [
{
"pattern": "^/status$",
"action": "gateway.status",
"description": "Display Gateway status"
},
{
"pattern": "^/nodes$",
"action": "nodes.list",
"description": "List all Nodes"
},
{
"pattern": "^/run (.+)$",
"action": "system.run",
"captureGroup": 1,
"node": "default-headless"
},
{
"pattern": "^/snap$",
"action": "camera.snap",
"node": "default-ios"
}
]
}
}
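Under the hood this amounts to first-match regex routing. The following is a hypothetical Python sketch of that matching logic (not the Gateway's actual implementation), using the rule shapes from the config above:

import re

# Hypothetical mirror of the commandDispatch rules above (not Gateway code).
RULES = [
    {"pattern": r"^/status$", "action": "gateway.status"},
    {"pattern": r"^/nodes$", "action": "nodes.list"},
    {"pattern": r"^/run (.+)$", "action": "system.run", "captureGroup": 1},
]

def dispatch(user_input):
    """Return (action, argument) on the first rule match, or None to fall back to the LLM."""
    for rule in RULES:
        m = re.match(rule["pattern"], user_input)
        if m:
            arg = m.group(rule["captureGroup"]) if "captureGroup" in rule else None
            return rule["action"], arg
    return None  # no match -> normal LLM reasoning path

print(dispatch("/run backup.sh"))     # ('system.run', 'backup.sh')
print(dispatch("how's the weather"))  # None -> goes to the LLM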
When to Use command-dispatch
command-dispatch is well-suited for:
- Status query commands (/status, /nodes, /health)
- Fixed-format script triggers (/run backup.sh)
- Built-in triggers for Cron tasks (no reasoning required; execute directly)
- Automated responses to monitoring alerts
It is not suitable for:
- Open-ended conversations that require Agent judgment and reasoning
- Scenarios requiring flexible, context-dependent tool selection
40.4 Compaction Threshold Tuning
How Compaction Works
When the conversation history token count approaches the context window limit, OpenClaw triggers Compaction (context compression):
Compaction flow:
1. Calculate the total token count of the current conversation history
2. If it exceeds softThreshold (e.g., 85%) → trigger soft compaction
Older messages are summarized (LLM generates a summary, replacing the detailed messages)
3. If it exceeds hardThreshold (e.g., 95%) → trigger forced compaction
Oldest messages are removed, retaining only the summary
4. reserveFloor holds back token space for the Agent's current reply
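The decision itself is a pair of ratio checks. The sketch below is one plausible reading of the flow, assuming usage is measured against the window net of reserveFloor; parameter names match the Key Parameter Reference that follows.

def compaction_action(history_tokens, context_window=200_000,
                      soft=0.85, hard=0.95, reserve_floor=8_000):
    """Decide which compaction step, if any, applies. Illustrative only."""
    budget = context_window - reserve_floor   # space usable by history after the reserve
    usage = history_tokens / budget
    if usage >= hard:
        return "forced compaction"   # remove oldest messages, keep only the summary
    if usage >= soft:
        return "soft compaction"     # summarize older messages
    return "no action"

print(compaction_action(170_000))    # soft compaction (170,000 / 192,000 ≈ 88.5%)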
Key Parameter Reference
{
"context": {
"reserveFloor": 8000, // Minimum token space reserved for the Agent's current reply
"softThreshold": 0.85, // Context usage ratio that triggers soft compaction (85%)
"hardThreshold": 0.95, // Context usage ratio that triggers forced compaction (95%)
"compactionModel": "anthropic/claude-haiku-3-5", // Model used to generate summaries (can be cheaper)
"summaryMaxTokens": 2000 // Maximum length of each summary
}
}
Tuning Scenarios
Scenario 1: Long tasks — reduce Compaction interruptions
If your Agent frequently triggers Compaction during long tasks (causing brief delays):
{
"context": {
"softThreshold": 0.90, // Raise the soft compaction threshold; compress less often
"reserveFloor": 4000, // Reduce reserved space if replies are typically short
"compactionModel": "anthropic/claude-haiku-3-5" // Use a cheap, fast model for summaries
}
}
Scenario 2: High-precision conversations — minimize information loss
{
"context": {
"softThreshold": 0.75, // Trigger compression earlier; preserve more detailed history
"summaryMaxTokens": 3000, // Generate more detailed summaries
"reserveFloor": 12000 // Reserve more space for potentially long replies
}
}
Scenario 3: Cost-sensitive — maximize compression
{
"context": {
"softThreshold": 0.70,
"compactionModel": "openai/gpt-4.1-nano", // Cheapest summarization model
"summaryMaxTokens": 1000 // Concise summaries to save tokens
}
}
Monitoring Compaction Frequency
# View the frequency of Compaction events
grep "Compaction triggered" /var/log/openclaw/gateway.log | \
awk '{print $1, $2}' | \
cut -d: -f1-2 | \
sort | uniq -c | sort -rn | head -10
# If Compaction occurs more than 10 times per hour, consider:
# 1. Lowering softThreshold (compress earlier)
# 2. Or switching to a model with a larger context window
40.5 Tool Result Pruning vs. Compaction
Both strategies control Context size but are suited to different scenarios:
Pruning: Targeting Individual Large Tool Results
When a single tool call returns an oversized payload (e.g., reading a 200KB file), Compaction alone is insufficient. The Pruning strategy directly truncates or summarizes the tool result:
{
"toolResultPruning": {
"enabled": true,
"maxResultTokens": 8000, // Maximum tokens for a single tool result
"strategy": "truncate", // "truncate" or "summarize"
"preserveStructure": true // Preserve JSON structure; only truncate values
}
}
Truncate strategy: Directly cuts off the excess content and appends a [Truncated: original length 52,847 tokens] annotation.
Summarize strategy: Calls a fast model to summarize the tool result. Slightly more expensive but retains information better.
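A minimal sketch of the truncate strategy, assuming a rough 4-characters-per-token estimate (a real implementation would use the model's tokenizer):

CHARS_PER_TOKEN = 4  # rough heuristic; the real pipeline would use the model's tokenizer

def prune_truncate(result, max_result_tokens=8_000):
    """Cut an oversized tool result and annotate the cut, as described above."""
    est_tokens = len(result) // CHARS_PER_TOKEN
    if est_tokens <= max_result_tokens:
        return result
    keep = max_result_tokens * CHARS_PER_TOKEN
    return result[:keep] + f"\n[Truncated: original length {est_tokens:,} tokens]"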
Comparison
| Strategy | Best For | Token Savings | Information Loss |
|---|---|---|---|
| Pruning (truncate) | File reads, log viewing | High | Low (usually only the beginning is needed) |
| Pruning (summarize) | Web content, long documents | Moderate | Low (summary covers the full content) |
| Compaction | Long conversation history | High | Low (key information is retained) |
| Both combined | Large tool results + long sessions | Highest | Lowest |
40.6 Lane Concurrency Tuning
What Is a Lane?
A Lane is OpenClaw's unit of concurrent execution. Every active tool call or Sub-Agent occupies one Lane, so the Lane count caps how many operations can run in parallel.
Global Lanes: How many tools the primary Agent can call in parallel
Sub-Agent Lanes: How many tools each Sub-Agent can call in parallel
Default Configuration
{
"lanes": {
"global": 4, // Primary Agent: up to 4 parallel tool calls at once
"subAgent": 8 // Sub-Agent: up to 8 parallel tool calls at once
}
}
Why Is the Sub-Agent Lane Count Higher?
Sub-Agents are typically used to process independent subtasks in parallel (e.g., collecting data from multiple Nodes simultaneously, or searching multiple keywords in parallel). They need higher concurrency. The primary Agent, as the coordinator, usually needs no more than 4 parallel operations.
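Functionally, a Lane pool behaves like a counting semaphore around tool execution. The asyncio sketch below illustrates the effect of global: 4; it is an analogy for the scheduling behavior, not OpenClaw internals.

import asyncio

GLOBAL_LANES = asyncio.Semaphore(4)   # mirrors "global": 4 in the default config

async def run_tool(name, seconds):
    async with GLOBAL_LANES:          # each tool call occupies one Lane while it runs
        await asyncio.sleep(seconds)  # stand-in for the actual tool work
        return name

async def main():
    # 8 calls but only 4 Lanes: the second batch waits for the first to finish.
    results = await asyncio.gather(*(run_tool(f"tool-{i}", 0.1) for i in range(8)))
    print(results)

asyncio.run(main())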
Tuning Strategies
Scenario 1: IO-intensive tasks (many network requests / Node calls)
{
"lanes": {
"global": 8, // Increase global concurrency (network waits don't consume CPU)
"subAgent": 16
}
}
Scenario 2: Compute-intensive tasks (many LLM calls)
{
"lanes": {
"global": 2, // Lower concurrency to avoid API rate limiting
"subAgent": 4
}
}
Scenario 3: Cost-sensitive (control total concurrency)
{
"lanes": {
"global": 2,
"subAgent": 4,
"maxTotalConcurrent": 6 // Hard cap on total concurrent operations
}
}
Diagnosing Lane Saturation
# Look for Lane wait events in logs
grep "lane_wait" /var/log/openclaw/gateway.log | wc -l
# If wait events are frequent, Lane count is insufficient
# Increase the global/subAgent values
# If API Rate Limit errors are frequent:
grep "rate_limit" /var/log/openclaw/gateway.log | wc -l
# Frequent rate limits mean concurrency is too high; reduce Lane count
40.7 Per-Provider Model Cost Matrix
Choosing the right model is the most direct cost control lever. Below is a three-dimensional evaluation (speed / cost / quality) of mainstream models:
Cost Comparison Table (Reference Prices, April 2026)
| Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) | Speed | Quality | Best For |
|---|---|---|---|---|---|---|
| claude-opus-4-6 | Anthropic | $15.00 | $75.00 | Slow | S | Complex reasoning/writing |
| claude-sonnet-4-6 | Anthropic | $3.00 | $15.00 | Medium | A+ | Daily workhorse (recommended) |
| claude-haiku-3-5 | Anthropic | $0.80 | $4.00 | Fast | B+ | Compaction/summarization |
| gpt-5 | OpenAI | $10.00 | $40.00 | Medium | S | Code/multimodal |
| gpt-4.1-mini | OpenAI | $0.40 | $1.60 | Fast | B+ | Simple tasks/classification |
| gpt-4.1-nano | OpenAI | $0.10 | $0.40 | Very fast | B | Routing/classification/summaries |
| gemini-2.5-pro | Google | $3.50 | $10.50 | Medium | A+ | Long documents (2M ctx) |
| gemini-2.5-flash | Google | $0.15 | $0.60 | Very fast | B+ | Cost-sensitive tasks |
| deepseek-r1 | DeepSeek | $0.55 | $2.19 | Medium | A | Reasoning/math |
| deepseek-v3 | DeepSeek | $0.27 | $1.10 | Fast | B+ | General tasks / high value |
| llama3.3-70b | Ollama (local) | $0 | $0 | Hardware-dependent | B | Privacy-sensitive / offline |
| qwen2.5-72b | Ollama (local) | $0 | $0 | Hardware-dependent | B | Chinese tasks / offline |
Cost Calculation Example
Scenario: 100 sessions per day, averaging 20,000 input tokens + 2,000 output tokens per session
Using claude-opus-4-6:
Input: 100 × 20,000 / 1M × $15.00 = $30.00/day
Output: 100 × 2,000 / 1M × $75.00 = $15.00/day
Daily cost: $45.00 Monthly cost: $1,350
Using claude-sonnet-4-6:
Input: $6.00/day Output: $3.00/day
Daily cost: $9.00 Monthly cost: $270
Using gpt-4.1-nano (simple tasks):
Input: $0.20/day Output: $0.08/day
Daily cost: $0.28 Monthly cost: $8.40
Conclusion: With appropriate model choices, the cost of the same workload can differ by a factor of 160.
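The same arithmetic, scripted so you can plug in your own volumes. The prices are the reference-table figures and will drift; treat them as illustrative.

# Reproduces the worked example above. Prices are $/1M tokens from the
# reference table.
PRICES = {
    "claude-opus-4-6": (15.00, 75.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "gpt-4.1-nano": (0.10, 0.40),
}

def monthly_cost(model, sessions_per_day=100, in_tok=20_000, out_tok=2_000, days=30):
    in_price, out_price = PRICES[model]
    daily = sessions_per_day * (in_tok / 1e6 * in_price + out_tok / 1e6 * out_price)
    return daily * days

for model in PRICES:
    print(f"{model}: ${monthly_cost(model):,.2f}/month")
# claude-opus-4-6: $1,350.00/month
# claude-sonnet-4-6: $270.00/month
# gpt-4.1-nano: $8.40/month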
Tiered Model Strategy (Recommended)
{
"model": "anthropic/claude-sonnet-4-6", // Primary model: complex tasks
"fallbackModel": "openai/gpt-4.1-mini", // Fallback: degrade when rate-limited
"context": {
"compactionModel": "anthropic/claude-haiku-3-5" // Summaries: use a cheap model
},
"routing": {
"simpleQueries": "openai/gpt-4.1-nano" // Simple queries: lowest cost
}
}
40.8 API Key Rotation to Balance Quota
Why Rotation Is Necessary
Each API Key has its own independent RPM (Requests Per Minute) and TPM (Tokens Per Minute) quota limits. A single Key can easily trigger Rate Limiting under high concurrency, causing request delays or failures.
Configuring Multi-Key Rotation
{
"providers": {
"anthropic": {
"keys": [
{
"key": "${ANTHROPIC_KEY_1}",
"weight": 2,
"tier": "claude-sonnet-4-6"
},
{
"key": "${ANTHROPIC_KEY_2}",
"weight": 1,
"tier": "claude-sonnet-4-6"
},
{
"key": "${ANTHROPIC_KEY_3}",
"weight": 1,
"tier": "claude-haiku-3-5"
}
],
"rotation": "weighted-round-robin",
"onRateLimit": "next-key",
"cooldownMs": 60000
}
}
}
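Weighted round-robin with a cooldown is easy to picture: each key appears in the rotation in proportion to its weight, and a rate-limited key is benched for cooldownMs. A hypothetical sketch of the selection logic (not the Gateway's code):

import itertools
import time

# Mirrors the weights above: the first key appears twice as often as the others.
KEYS = [("key1", 2), ("key2", 1), ("key3", 1)]
COOLDOWN_MS = 60_000
cooldown_until = {}  # key -> epoch seconds until which it is benched

# Expand by weight, then cycle: key1, key1, key2, key3, key1, key1, ...
_cycle = itertools.cycle([k for k, w in KEYS for _ in range(w)])

def next_key():
    """Pick the next key by weighted round-robin, skipping cooled-down keys."""
    for _ in range(sum(w for _, w in KEYS)):  # at most one full cycle of attempts
        key = next(_cycle)
        if cooldown_until.get(key, 0) <= time.time():
            return key
    raise RuntimeError("all keys rate-limited; back off and retry")

def on_rate_limit(key):
    """onRateLimit: 'next-key' -- bench this key for cooldownMs."""
    cooldown_until[key] = time.time() + COOLDOWN_MS / 1000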
Rotation Strategy Reference
| Strategy | Description | Best For |
|---|---|---|
| round-robin | Rotate in sequence, evenly distributed | Keys with similar quotas |
| weighted-round-robin | Distribute by weight (the weight field) | Keys with unequal quotas |
| least-used | Prefer the Key with the lowest current usage | Precise quota balancing |
| on-error | Switch only when the current Key returns an error | Minimize switching overhead |
Monitoring Key Usage
# View usage statistics per Key
openclaw security audit --keys
# Output:
# Key Requests Tokens Used Rate Limits Status
# anthropic-key-1 1,234 18.4M 2 active
# anthropic-key-2 617 9.2M 0 active
# anthropic-key-3 301 1.8M 0 active (haiku only)
40.9 Diagnosing and Resolving Long-Session Performance Degradation
Symptoms of Performance Degradation
As a session progresses, you may notice:
- Increasing response latency: Per-call latency grows from 2s to 8s+
- Rising Compaction frequency: More than 10 Compactions per hour
- Increasing tool call error rate: Agent confuses parameters across different tools
- Context confusion: Agent misremembers early conversation content
Diagnostic Tools
# View token usage trends for the current session
openclaw gateway dashboard
# Example output:
# Session #4891 (3h 24m)
# Total calls: 47
# Avg tokens/call: 18,432 → 31,204 (↑69%) ← Significant growth, needs attention
# Compactions: 8 (last: 12min ago)
# Tool errors: 3 (6.4%) ← Exceeds the 5% warning threshold
# Check context usage
openclaw agent --session 4891 --context-stats
# Context window: 200,000 tokens
# Current usage: 156,000 (78%)
# Reserve floor: 8,000
# Available for reply: 36,000
Resolution Strategies
Strategy 1: Proactively compact the current session
# Type in the Control UI chat box
/compact
# Or via CLI
openclaw agent --session 4891 --compact
# Immediately triggers compression, summarizes history, frees context space
Strategy 2: Adjust the Compaction threshold (permanent)
{
"context": {
"softThreshold": 0.70, // Compress earlier; prevent excessive history buildup
"reserveFloor": 6000
}
}
Strategy 3: Switch to a model with a larger context window
{
"model": "google/gemini-2.5-pro" // 2M token context; Compaction is rarely needed
}
Strategy 4: Split long tasks into multiple shorter sessions
For tasks exceeding 4 hours, it is advisable to proactively start a new session at phase boundaries, recording phase results in memory/ so the next session can pick up where the previous one left off.
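One lightweight way to do that handoff, assuming memory/ is a plain directory of notes; the filename convention here is invented for illustration:

from datetime import date
from pathlib import Path

def record_phase(phase, summary, memory_dir="memory"):
    """Write a phase summary the next session can read on startup.
    The memory/YYYY-MM-DD-<phase>.md naming scheme is hypothetical."""
    path = Path(memory_dir) / f"{date.today()}-{phase}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# Phase: {phase}\n\n{summary}\n", encoding="utf-8")
    return path

record_phase("data-collection", "All Node metrics collected; anomalies noted in the session log.")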
40.10 Comprehensive Optimization Impact Assessment
Before and After Comparison (Typical Production Scenario)
| Metric | Before Optimization | After Optimization | Improvement |
|---|---|---|---|
| Average tokens per call | 22,000 | 8,400 | -62% |
| Skills injection tokens | 8,000 | 1,400 | -83% |
| Monthly API cost (100 sessions/day) | $594 | $226 | -62% |
| Average response latency | 4.2s | 2.8s | -33% |
| Compaction frequency (per hour) | 18 | 4 | -78% |
| Tool error rate | 4.8% | 1.2% | -75% |
Optimization Checklist
Token Cost Optimization:
☑ Enable Skills lazy loading (lazy: true)
☑ Configure alwaysInject to include only core Skills
☑ Enable tool result Pruning (maxResultTokens: 8000)
☑ Use a cheap, fast model for Compaction (haiku/nano)
Context Management Optimization:
☑ Adjust softThreshold based on task type (0.75-0.90)
☑ Match reserveFloor to actual typical reply length
☑ Periodically run /compact during long tasks
Concurrency Optimization:
☑ IO-intensive tasks: increase Lane count (global: 8)
☑ Frequent API rate limiting: decrease Lane count (global: 2)
☑ Configure API Key Rotation (2-3 Keys)
Model Selection Optimization:
☑ Primary model: claude-sonnet-4-6 or equivalent tier
☑ Compaction summaries: haiku- or nano-class model
☑ Simple routing/classification: gpt-4.1-nano or gemini-flash
40.11 Summary
Token cost control, context budget management, and Lane concurrency tuning are the three most important performance levers when operating OpenClaw in production. Skills lazy loading typically saves 50-80% of Skills injection tokens; well-chosen Compaction thresholds can cut compression frequency three- to five-fold; and a tiered model strategy can reduce overall API costs by 40-70%. Stacked together, these optimizations make a total cost reduction of more than 60% realistically achievable at the same output quality.
This concludes Chapters 36-40 of the Complete Guide to OpenClaw. These five chapters cover the complete technical path from physical device integration and edge computing, through the control interface and production deployment, to performance tuning — providing a systematic technical reference for operating an OpenClaw Agent in production.
Chapter Complete | The Complete Guide to OpenClaw, Chapters 36-40