Chapter 40: Performance Tuning — Token Cost Control, Context Budget Management, and Concurrent Lane Configuration

Overview

OpenClaw's operating cost and performance are largely determined by three tunable variables: token consumption (which maps directly to API costs), context window management (which affects session quality and Compaction frequency), and Lane concurrency configuration (which affects multi-task throughput). This chapter starts with a cost analysis and systematically covers every optimizable link in the chain, helping you find the optimal balance between cost, speed, and quality.


40.1 Analyzing the Sources of Token Cost

Every LLM API call generates token consumption composed of the following parts:

The Four Sources of Input Tokens

Total Input Tokens = System Prompt + Skills Injection + Conversation History + Tool Results

Example proportions (typical long session):
  System Prompt:         ~3,000 tokens  (18%)
  Skills Injection:      ~8,000 tokens  (48%)  ← The largest optimizable item
  Conversation History:  ~4,000 tokens  (24%)
  Tool Results:          ~1,600 tokens  (10%)
  Total:                 ~16,600 tokens/call
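The arithmetic behind this breakdown is simple but worth keeping handy. The following sketch reproduces the proportions above; the function and numbers are illustrative, not an OpenClaw API:

```python
# Sketch: estimate per-call input token cost from the four sources.
# The numbers mirror the illustrative breakdown above.
def input_token_breakdown(system_prompt, skills, history, tool_results):
    sources = {
        "System Prompt": system_prompt,
        "Skills Injection": skills,
        "Conversation History": history,
        "Tool Results": tool_results,
    }
    total = sum(sources.values())
    return total, {name: round(100 * t / total) for name, t in sources.items()}

total, pct = input_token_breakdown(3000, 8000, 4000, 1600)
# total == 16600; Skills Injection is the largest share at ~48%
```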

System Prompt: Core instructions injected by OpenClaw on every call — Agent role definition, behavioral guidelines, safety constraints, etc. This portion is relatively fixed at roughly 1,500-4,000 tokens, with limited room for optimization.

Skills Injection: Every activated Skill injects its tool definition (Function Schema) into the System Prompt. If 20 Skills are enabled and each Schema averages 400 tokens, the Skills injection reaches 8,000 tokens — even if 18 of those Skills are completely irrelevant to the current task.

Conversation History: The full record of historical messages (including Agent replies, tool calls, and tool results), which grows linearly as the session progresses. This is what the Compaction mechanism primarily addresses.

Tool Results: Content returned by tool calls from the previous turn. Large tool results (such as reading a long file or scraping a web page) can rapidly consume a large number of tokens.

Output Tokens

Output tokens typically account for 10-30% of total cost, but on certain models (such as o3-mini) they are priced comparably to input tokens and should be appropriately capped with the maxTokens parameter.


40.2 Skills Lazy Loading: Reducing Per-Call Token Consumption

The default configuration (eager mode):

{
  "skills": {
    "lazy": false,
    "enabled": ["web-search", "code-exec", "email", "calendar", "github", ...]
  }
}

In eager mode, the complete Function Schema of all enabled Skills is injected on every API call — even when the current conversation doesn't need any of them. 20 Skills × 400 tokens = 8,000 tokens of fixed overhead per call.

Switching to lazy loading:

{
  "skills": {
    "lazy": true,
    "alwaysInject": ["web-search"],  // Core Skills always injected
    "enabled": ["web-search", "code-exec", "email", "calendar", "github", ...]
  }
}

In lazy loading mode:

  1. Initial injection: Only the metadata of all Skills (name + short description) is injected — approximately 50 tokens each
  2. Agent judgment: The Agent determines which Skills are needed based on the user's request
  3. On-demand loading: The Agent issues a skill_load request; the Gateway dynamically injects the full Schema of the target Skill
  4. Session-scoped cache: Once a Skill is loaded, it remains available for the duration of the current session
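The four steps can be sketched as follows. The class name, the shape of `skill_load`, and the token sizes are modeled on the description above, not taken from OpenClaw's source:

```python
# Sketch of lazy Skill loading: inject lightweight metadata up front,
# load a full Schema only when the Agent requests it, then cache it
# for the rest of the session. Names and sizes are illustrative.
class LazySkillRegistry:
    def __init__(self, schemas, always_inject=()):
        self.schemas = schemas              # skill name -> full Function Schema
        self.loaded = set(always_inject)    # session-scoped cache

    def initial_injection(self):
        # Step 1: metadata only (~50 tokens each) plus alwaysInject Skills
        meta = [f"{name}: available (load on demand)" for name in self.schemas]
        full = [self.schemas[name] for name in self.loaded]
        return meta + full

    def skill_load(self, name):
        # Steps 3-4: inject the full Schema once; later calls hit the cache
        self.loaded.add(name)
        return self.schemas[name]

reg = LazySkillRegistry(
    {"web-search": "<web-search schema>", "github": "<github schema>"},
    always_inject=["web-search"],
)
reg.skill_load("github")   # loaded on demand, cached for the session
```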

Token Savings from Lazy Loading

Scenario: User asks "What's the weather today?"
  Eager:   Inject all 20 Skills → 8,000 tokens (Skills portion)
  Lazy:    20 Skill metadata (1,000t) + load web-search on demand (400t) = 1,400 tokens
  Savings: 6,600 tokens/call (approximately 83%)

Scenario: User requests a complex multi-Skill task (uses 8 Skills)
  Eager:   8,000 tokens
  Lazy:    1,000 + 8×400 = 4,200 tokens
  Savings: 3,800 tokens/call (approximately 48%)

40.3 command-dispatch: Zero-Inference-Cost Invocations

How It Works

command-dispatch is a special fast path for handling highly structured, predictable command-style requests — bypassing the LLM reasoning process and routing directly to the corresponding Skill or built-in handler.

Standard flow:
  User input → LLM reasoning (consumes tokens) → Tool Call → Execute → Return

command-dispatch flow:
  User input (matches command format) → Rule match → Execute directly → Return
  (No LLM reasoning; zero token consumption)

Configuring command-dispatch Rules

{
  "commandDispatch": {
    "enabled": true,
    "rules": [
      {
        "pattern": "^/status$",
        "action": "gateway.status",
        "description": "Display Gateway status"
      },
      {
        "pattern": "^/nodes$",
        "action": "nodes.list",
        "description": "List all Nodes"
      },
      {
        "pattern": "^/run (.+)$",
        "action": "system.run",
        "captureGroup": 1,
        "node": "default-headless"
      },
      {
        "pattern": "^/snap$",
        "action": "camera.snap",
        "node": "default-ios"
      }
    ]
  }
}
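The matching half of this flow can be sketched with plain regular expressions. The rule shapes mirror the JSON config above; the `dispatch` function itself is illustrative:

```python
import re

# Sketch of command-dispatch rule matching: try each configured
# pattern in order; on a match, return the action and any captured
# argument without ever calling the LLM.
RULES = [
    {"pattern": r"^/status$", "action": "gateway.status"},
    {"pattern": r"^/nodes$", "action": "nodes.list"},
    {"pattern": r"^/run (.+)$", "action": "system.run", "captureGroup": 1},
]

def dispatch(user_input):
    for rule in RULES:
        m = re.match(rule["pattern"], user_input)
        if m:
            arg = m.group(rule["captureGroup"]) if "captureGroup" in rule else None
            return rule["action"], arg
    return None  # no rule matched: fall through to the normal LLM flow

dispatch("/run uptime")         # ("system.run", "uptime") — zero LLM tokens
dispatch("summarize my inbox")  # None — handled by the LLM as usual
```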

When to Use command-dispatch

command-dispatch is well-suited for:

  - Fixed-format queries and operations: status checks, listing Nodes, triggering a predefined action
  - High-frequency commands where per-call LLM cost would add up
  - Latency-sensitive shortcuts that must return instantly

It is not suitable for:

  - Natural-language requests with variable phrasing or intent
  - Tasks that require reasoning, planning, or parameter interpretation
  - Operations where a wrong rule match would silently do the wrong thing


40.4 Compaction Threshold Tuning

How Compaction Works

When the conversation history token count approaches the context window limit, OpenClaw triggers Compaction (context compression):

Compaction flow:
  1. Calculate the total token count of the current conversation history
  2. If it exceeds softThreshold (e.g., 85%) → trigger soft compaction
     Older messages are summarized (LLM generates a summary, replacing the detailed messages)
  3. If it exceeds hardThreshold (e.g., 95%) → trigger forced compaction
     Oldest messages are removed, retaining only the summary
  4. reserveFloor holds back token space for the Agent's current reply
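The threshold logic in steps 2-4 amounts to a small decision function. Parameter names follow the config keys in this chapter; the exact accounting (subtracting reserveFloor before computing the ratio) is an assumption of this sketch:

```python
# Sketch: given current history size and a context window, decide
# whether to do nothing, soft-compact (summarize older messages),
# or hard-compact (drop oldest messages, keep the summary).
def compaction_action(history_tokens, context_window,
                      soft_threshold=0.85, hard_threshold=0.95,
                      reserve_floor=8000):
    usable = context_window - reserve_floor   # step 4: hold back reply space
    usage = history_tokens / usable
    if usage >= hard_threshold:
        return "hard"   # step 3: forced compaction
    if usage >= soft_threshold:
        return "soft"   # step 2: summarize older messages
    return "none"

compaction_action(100_000, 200_000)   # "none"  (~52% of usable space)
compaction_action(170_000, 200_000)   # "soft"  (~89%)
compaction_action(185_000, 200_000)   # "hard"  (~96%)
```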

Key Parameter Reference

{
  "context": {
    "reserveFloor": 8000,        // Minimum token space reserved for the Agent's current reply
    "softThreshold": 0.85,       // Context usage ratio that triggers soft compaction (85%)
    "hardThreshold": 0.95,       // Context usage ratio that triggers forced compaction (95%)
    "compactionModel": "anthropic/claude-haiku-3-5",  // Model used to generate summaries (can be cheaper)
    "summaryMaxTokens": 2000     // Maximum length of each summary
  }
}

Tuning Scenarios

Scenario 1: Long tasks — reduce Compaction interruptions

If your Agent frequently triggers Compaction during long tasks (causing brief delays):

{
  "context": {
    "softThreshold": 0.90,   // Raise the soft compaction threshold; compress less often
    "reserveFloor": 4000,    // Reduce reserved space if replies are typically short
    "compactionModel": "anthropic/claude-haiku-3-5"  // Use a cheap, fast model for summaries
  }
}

Scenario 2: High-precision conversations — minimize information loss

{
  "context": {
    "softThreshold": 0.75,   // Trigger compression earlier; preserve more detailed history
    "summaryMaxTokens": 3000, // Generate more detailed summaries
    "reserveFloor": 12000     // Reserve more space for potentially long replies
  }
}

Scenario 3: Cost-sensitive — maximize compression

{
  "context": {
    "softThreshold": 0.70,
    "compactionModel": "openai/gpt-4.1-nano",  // Cheapest summarization model
    "summaryMaxTokens": 1000  // Concise summaries to save tokens
  }
}

Monitoring Compaction Frequency

# View the frequency of Compaction events
grep "Compaction triggered" /var/log/openclaw/gateway.log | \
  awk '{print $1, $2}' | \
  cut -d: -f1-2 | \
  sort | uniq -c | sort -rn | head -10

# If Compaction occurs more than 10 times per hour, consider:
# 1. Raising softThreshold (compact later and less often)
# 2. Or switching to a model with a larger context window

40.5 Tool Result Pruning vs. Compaction

Both strategies control Context size but are suited to different scenarios:

Pruning: Targeting Individual Large Tool Results

When a single tool call returns an oversized payload (e.g., reading a 200KB file), Compaction alone is insufficient. The Pruning strategy directly truncates or summarizes the tool result:

{
  "toolResultPruning": {
    "enabled": true,
    "maxResultTokens": 8000,    // Maximum tokens for a single tool result
    "strategy": "truncate",     // "truncate" or "summarize"
    "preserveStructure": true   // Preserve JSON structure; only truncate values
  }
}

Truncate strategy: Directly cuts off the excess content and appends a [Truncated: original length 52,847 tokens] annotation.

Summarize strategy: Calls a fast model to summarize the tool result. Slightly more expensive but retains information better.
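A minimal sketch of the truncate strategy, assuming a rough 4-characters-per-token estimate (OpenClaw's real tokenizer will differ):

```python
# Sketch: cap a tool result at maxResultTokens and append the
# annotation described above. The chars-per-token ratio is a common
# rough heuristic, not a real tokenizer.
def truncate_result(text, max_result_tokens=8000, chars_per_token=4):
    est_tokens = len(text) // chars_per_token
    if est_tokens <= max_result_tokens:
        return text
    keep = max_result_tokens * chars_per_token
    return text[:keep] + f"\n[Truncated: original length {est_tokens:,} tokens]"

pruned = truncate_result("x" * 100_000, max_result_tokens=8000)
# keeps the first ~32,000 characters and appends the annotation
```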

Comparison

Strategy              Best For                             Token Savings  Information Loss
Pruning (truncate)    File reads, log viewing              High           Low (usually only the beginning is needed)
Pruning (summarize)   Web content, long documents          Moderate       Low (summary covers the full content)
Compaction            Long conversation history            High           Low (key information is retained)
Both combined         Large tool results + long sessions   Highest        Lowest

40.6 Lane Concurrency Tuning

What Is a Lane?

A Lane is OpenClaw's concurrent execution unit. Every active tool call or Sub-Agent occupies one Lane. The Lane count limits the number of parallel operations that can proceed simultaneously.

Global Lanes:      How many tools the primary Agent can call in parallel
Sub-Agent Lanes:   How many tools each Sub-Agent can call in parallel

Default Configuration

{
  "lanes": {
    "global": 4,      // Primary Agent: up to 4 parallel tool calls at once
    "subAgent": 8     // Sub-Agent: up to 8 parallel tool calls at once
  }
}

Why Is the Sub-Agent Lane Count Higher?

Sub-Agents are typically used to process independent subtasks in parallel (e.g., collecting data from multiple Nodes simultaneously, or searching multiple keywords in parallel). They need higher concurrency. The primary Agent, as the coordinator, usually needs no more than 4 parallel operations.

Tuning Strategies

Scenario 1: IO-intensive tasks (many network requests / Node calls)

{
  "lanes": {
    "global": 8,       // Increase global concurrency (network waits don't consume CPU)
    "subAgent": 16
  }
}

Scenario 2: Compute-intensive tasks (many LLM calls)

{
  "lanes": {
    "global": 2,       // Lower concurrency to avoid API rate limiting
    "subAgent": 4
  }
}

Scenario 3: Cost-sensitive (control total concurrency)

{
  "lanes": {
    "global": 2,
    "subAgent": 4,
    "maxTotalConcurrent": 6   // Hard cap on total concurrent operations
  }
}

Diagnosing Lane Saturation

# Look for Lane wait events in logs
grep "lane_wait" /var/log/openclaw/gateway.log | wc -l

# If wait events are frequent, Lane count is insufficient
# Increase the global/subAgent values

# If API Rate Limit errors are frequent:
grep "rate_limit" /var/log/openclaw/gateway.log | wc -l
# Frequent rate limits mean concurrency is too high; reduce Lane count

40.7 Per-Provider Model Cost Matrix

Choosing the right model is the most direct cost control lever. Below is a three-dimensional evaluation (speed / cost / quality) of mainstream models:

Cost Comparison Table (Reference Prices, April 2026)

Model               Provider        Input ($/1M)  Output ($/1M)  Speed               Quality  Best For
claude-opus-4-6     Anthropic       $15.00        $75.00         Slow                S        Complex reasoning/writing
claude-sonnet-4-6   Anthropic       $3.00         $15.00         Medium              A+       Daily workhorse (recommended)
claude-haiku-3-5    Anthropic       $0.80         $4.00          Fast                B+       Compaction/summarization
gpt-5               OpenAI          $10.00        $40.00         Medium              S        Code/multimodal
gpt-4.1-mini        OpenAI          $0.40         $1.60          Fast                B+       Simple tasks/classification
gpt-4.1-nano        OpenAI          $0.10         $0.40          Very fast           B        Routing/classification/summaries
gemini-2.5-pro      Google          $3.50         $10.50         Medium              A+       Long documents (2M ctx)
gemini-2.5-flash    Google          $0.15         $0.60          Very fast           B+       Cost-sensitive tasks
deepseek-r1         DeepSeek        $0.55         $2.19          Medium              A        Reasoning/math
deepseek-v3         DeepSeek        $0.27         $1.10          Fast                B+       General tasks / high value
llama3.3-70b        Ollama (local)  $0            $0             Hardware-dependent  B        Privacy-sensitive / offline
qwen2.5-72b         Ollama (local)  $0            $0             Hardware-dependent  B        Chinese tasks / offline

Cost Calculation Example

Scenario: 100 sessions per day, averaging 20,000 input tokens + 2,000 output tokens per session

Using claude-opus-4-6:
  Input:  100 × 20,000 / 1M × $15.00 = $30.00/day
  Output: 100 × 2,000  / 1M × $75.00 = $15.00/day
  Daily cost: $45.00  Monthly cost: $1,350

Using claude-sonnet-4-6:
  Input: $6.00/day  Output: $3.00/day
  Daily cost: $9.00  Monthly cost: $270

Using gpt-4.1-nano (simple tasks):
  Input: $0.20/day  Output: $0.08/day
  Daily cost: $0.28  Monthly cost: $8.40

Conclusion: By choosing models appropriately, costs for the same workload
can differ by a factor of 160.
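The arithmetic can be packed into a small helper for estimating your own workload; prices are the illustrative figures from the table above:

```python
# Sketch: daily API cost from session volume, per-session token
# averages, and per-million-token prices.
def daily_cost(sessions, in_tokens, out_tokens, in_price, out_price):
    input_cost = sessions * in_tokens / 1_000_000 * in_price
    output_cost = sessions * out_tokens / 1_000_000 * out_price
    return input_cost + output_cost

opus = daily_cost(100, 20_000, 2_000, 15.00, 75.00)   # 45.00/day
sonnet = daily_cost(100, 20_000, 2_000, 3.00, 15.00)  # 9.00/day
nano = daily_cost(100, 20_000, 2_000, 0.10, 0.40)     # 0.28/day
# opus / nano ~ 160: same workload, two orders of magnitude apart
```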
A tiered configuration puts each model where it earns its price:

{
  "model": "anthropic/claude-sonnet-4-6",   // Primary model: complex tasks
  "fallbackModel": "openai/gpt-4.1-mini",   // Fallback: degrade when rate-limited
  "context": {
    "compactionModel": "anthropic/claude-haiku-3-5"  // Summaries: use a cheap model
  },
  "routing": {
    "simpleQueries": "openai/gpt-4.1-nano"  // Simple queries: lowest cost
  }
}

40.8 API Key Rotation to Balance Quota

Why Rotation Is Necessary

Each API Key has its own independent RPM (Requests Per Minute) and TPM (Tokens Per Minute) quota limits. A single Key can easily trigger Rate Limiting under high concurrency, causing request delays or failures.

Configuring Multi-Key Rotation

{
  "providers": {
    "anthropic": {
      "keys": [
        {
          "key": "${ANTHROPIC_KEY_1}",
          "weight": 2,
          "tier": "claude-sonnet-4-6"
        },
        {
          "key": "${ANTHROPIC_KEY_2}",
          "weight": 1,
          "tier": "claude-sonnet-4-6"
        },
        {
          "key": "${ANTHROPIC_KEY_3}",
          "weight": 1,
          "tier": "claude-haiku-3-5"
        }
      ],
      "rotation": "weighted-round-robin",
      "onRateLimit": "next-key",
      "cooldownMs": 60000
    }
  }
}
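Weighted round-robin itself is simple to picture: expand each key by its weight and cycle through the result, so a weight-2 key serves twice as many requests as a weight-1 key. This sketch omits cooldown and rate-limit failover; the key names are placeholders:

```python
import itertools

# Sketch of weighted-round-robin key selection: expand keys by weight,
# then cycle through the expanded list.
KEYS = [
    {"key": "ANTHROPIC_KEY_1", "weight": 2},
    {"key": "ANTHROPIC_KEY_2", "weight": 1},
]

def weighted_cycle(keys):
    expanded = [k["key"] for k in keys for _ in range(k["weight"])]
    return itertools.cycle(expanded)

picker = weighted_cycle(KEYS)
first_six = [next(picker) for _ in range(6)]
# KEY_1 appears 4 times, KEY_2 twice: a 2:1 split, matching the weights
```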

Rotation Strategy Reference

Strategy              Description                                    Best For
round-robin           Rotate in sequence, evenly distributed         Keys with similar quotas
weighted-round-robin  Distribute by weight (weight field)            Keys with unequal quotas
least-used            Prefer the Key with the lowest current usage   Precise quota balancing
on-error              Switch only when the current Key errors        Minimize switching overhead

Monitoring Key Usage

# View usage statistics per Key
openclaw security audit --keys

# Output:
# Key                  Requests  Tokens Used  Rate Limits  Status
# anthropic-key-1      1,234     18.4M        2            active
# anthropic-key-2        617      9.2M        0            active
# anthropic-key-3        301      1.8M        0            active (haiku only)

40.9 Diagnosing and Resolving Long-Session Performance Degradation

Symptoms of Performance Degradation

As a session progresses, you may notice:

  1. Increasing response latency: Per-call latency grows from 2s to 8s+
  2. Rising Compaction frequency: More than 10 Compactions per hour
  3. Increasing tool call error rate: Agent confuses parameters across different tools
  4. Context confusion: Agent misremembers early conversation content

Diagnostic Tools

# View token usage trends for the current session
openclaw gateway dashboard

# Example output:
# Session #4891 (3h 24m)
#   Total calls: 47
#   Avg tokens/call: 18,432 → 31,204 (↑69%)  ← Significant growth, needs attention
#   Compactions: 8  (last: 12min ago)
#   Tool errors: 3  (6.4%)                    ← Exceeds the 5% warning threshold

# Check context usage
openclaw agent --session 4891 --context-stats
# Context window: 200,000 tokens
# Current usage: 156,000 (78%)
# Reserve floor: 8,000
# Available for reply: 36,000

Resolution Strategies

Strategy 1: Proactively compact the current session

# Type in the Control UI chat box
/compact

# Or via CLI
openclaw agent --session 4891 --compact

# Immediately triggers compression, summarizes history, frees context space

Strategy 2: Adjust the Compaction threshold (permanent)

{
  "context": {
    "softThreshold": 0.70,   // Compress earlier; prevent excessive history buildup
    "reserveFloor": 6000
  }
}

Strategy 3: Switch to a model with a larger context window

{
  "model": "google/gemini-2.5-pro"  // 2M token context; Compaction is rarely needed
}

Strategy 4: Split long tasks into multiple shorter sessions

For tasks exceeding 4 hours, it is advisable to proactively start a new session at phase boundaries, recording phase results in memory/ so the next session can pick up where the previous one left off.


40.10 Comprehensive Optimization Impact Assessment

Before and After Comparison (Typical Production Scenario)

Metric                                Before Optimization  After Optimization  Improvement
Average tokens per call               22,000               8,400               -62%
Skills injection tokens               8,000                1,400               -83%
Monthly API cost (100 sessions/day)   $594                 $226                -62%
Average response latency              4.2s                 2.8s                -33%
Compaction frequency (per hour)       18                   4                   -78%
Tool error rate                       4.8%                 1.2%                -75%

Optimization Checklist

Token Cost Optimization:
  ☑ Enable Skills lazy loading (lazy: true)
  ☑ Configure alwaysInject to include only core Skills
  ☑ Enable tool result Pruning (maxResultTokens: 8000)
  ☑ Use a cheap, fast model for Compaction (haiku/nano)

Context Management Optimization:
  ☑ Adjust softThreshold based on task type (0.75-0.90)
  ☑ Match reserveFloor to actual typical reply length
  ☑ Periodically run /compact during long tasks

Concurrency Optimization:
  ☑ IO-intensive tasks: increase Lane count (global: 8)
  ☑ Frequent API rate limiting: decrease Lane count (global: 2)
  ☑ Configure API Key Rotation (2-3 Keys)

Model Selection Optimization:
  ☑ Primary model: claude-sonnet-4-6 or equivalent tier
  ☑ Compaction summaries: haiku- or nano-class model
  ☑ Simple routing/classification: gpt-4.1-nano or gemini-flash

40.11 Summary

Token cost control, context budget management, and Lane concurrency tuning are the three most important performance levers in operating OpenClaw in production. Skills lazy loading typically saves 50-80% of Skills injection tokens. Appropriate Compaction thresholds can reduce compression frequency by 3-5x. A well-designed tiered model strategy can cut overall API costs by 40-70%. Stacked together, these three optimizations make it realistically achievable to reduce total costs by more than 60% while maintaining the same output quality.

This concludes Chapters 36-40 of the Complete Guide to OpenClaw. These five chapters cover the complete technical path from physical device integration and edge computing, through the control interface and production deployment, to performance tuning — providing a systematic technical reference for operating an OpenClaw Agent in production.


Chapter Complete | The Complete Guide to OpenClaw, Chapters 36-40
