Chapter 12: Deep Model Configuration - Key Rotation, Failover Mechanisms, and Inference Depth Control
Overview
In production environments, a single API Key and a fixed model configuration are rarely robust enough. OpenClaw provides three advanced mechanisms: Key Rotation, Failover, and inference depth control (Think mode). Together they form the reliability and cost-efficiency foundation of production-grade AI applications.
12.1 API Key Rotation Configuration
Why Key Rotation Is Needed
A single API Key will quickly hit Provider rate limits (TPM/RPM) under high concurrency. Key Rotation multiplies actual throughput by rotating across multiple Keys.
Configuration Format
{
"providers": {
"anthropic": {
"api_keys": [
{
"key": "${ANTHROPIC_API_KEY_1}",
"priority": 1,
"weight": 3,
"label": "primary-team-a"
},
{
"key": "${ANTHROPIC_API_KEY_2}",
"priority": 1,
"weight": 2,
"label": "primary-team-b"
},
{
"key": "${ANTHROPIC_API_KEY_BACKUP}",
"priority": 2,
"weight": 1,
"label": "backup"
}
],
"rotation_strategy": "weighted_round_robin"
}
}
}
Priority Logic
OpenClaw's Key selection follows these rules:
- Lower priority number = higher priority (priority=1 is tried before priority=2)
- Within the same priority group, Keys are rotated via weighted round-robin according to their weight
- When all higher-priority Keys are unavailable, the system automatically falls back to the next priority group

Priority 1: [Key-A weight=3, Key-B weight=2] -> requests distributed in a 3:2 ratio
Priority 2: [Key-Backup weight=1] -> activated only when all of Priority 1 has failed
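The priority-then-weight selection can be sketched as follows. This is an illustrative model of the rules above, not OpenClaw's actual internals; the naive weight expansion is simpler than the smooth weighted round-robin most production schedulers use.

```python
def build_rotation(keys):
    """Group Keys by priority, then expand each group by weight.
    A naive expansion sketch; real schedulers use smooth weighted
    round-robin to interleave picks more evenly."""
    groups = {}
    for k in keys:
        groups.setdefault(k["priority"], []).extend([k["label"]] * k["weight"])
    return [groups[p] for p in sorted(groups)]  # lower number = higher priority

def select_key(rotation, counter, unavailable=frozenset()):
    """Pick from the highest-priority group with an available Key,
    falling back to lower-priority groups when a group is exhausted."""
    for group in rotation:
        live = [label for label in group if label not in unavailable]
        if live:
            return live[counter % len(live)]
    raise RuntimeError("no API Key available")

keys = [
    {"label": "primary-team-a", "priority": 1, "weight": 3},
    {"label": "primary-team-b", "priority": 1, "weight": 2},
    {"label": "backup",         "priority": 2, "weight": 1},
]
rotation = build_rotation(keys)
picks = [select_key(rotation, i) for i in range(10)]  # 3:2 split across the two primary Keys
```

Over ten requests, team-a receives six and team-b four (the configured 3:2 ratio); the backup Key is only returned once both primary Keys are marked unavailable.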
Rotation Strategy Options
| Strategy | Description | Best For |
|---|---|---|
| round_robin | Equal-weight rotation | Keys with identical quotas |
| weighted_round_robin | Weighted rotation | Keys with different quotas |
| random | Random selection | Load spreading |
| least_used | Select the least-used Key | Fine-grained balancing |
12.2 429 / Quota Auto-Retry Mechanism
Error Classification and Response
{
"retry": {
"enabled": true,
"max_attempts": 5,
"backoff_strategy": "exponential",
"base_delay_ms": 1000,
"max_delay_ms": 30000,
"retryable_errors": [429, 500, 502, 503, 504],
"non_retryable_errors": [400, 401, 403, 404]
}
}
Retry Flow Diagram
Request sent
|
v
API response
|
Failed?
|
+-- No --> Return result
|
+-- Yes --> Is error 429?
|
+-- Yes --> Read Retry-After header
| Wait specified time
| Switch to next Key (same priority)
| Retry (decrement max_attempts)
|
+-- No --> Is it a retryable error code?
|
+-- Yes --> Wait with exponential backoff
| Retry
|
+-- No --> Throw Non-retryable Error
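The decision logic of the flow chart above can be condensed into a single classification function. The action names here are illustrative labels, not OpenClaw identifiers:

```python
RETRYABLE = {429, 500, 502, 503, 504}
NON_RETRYABLE = {400, 401, 403, 404}

def next_action(status, attempt, max_attempts=5):
    """Map an HTTP status plus the current attempt number to the
    decision in the flow chart. Action names are illustrative."""
    if attempt >= max_attempts:
        return "give_up"                          # retry budget exhausted
    if status == 429:
        return "wait_retry_after_then_rotate_key" # honor Retry-After, switch Key
    if status in RETRYABLE:
        return "exponential_backoff_retry"
    return "raise_non_retryable"                  # 400/401/403/404 etc.
```

Note that 429 is handled specially: instead of blind exponential backoff, the Retry-After header is honored and the request is moved to the next Key in the same priority group.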
Exponential Backoff Configuration Details
{
"retry": {
"backoff_strategy": "exponential_jitter",
"base_delay_ms": 1000,
"multiplier": 2.0,
"jitter": 0.3,
"max_delay_ms": 30000
}
}
Delay calculation: delay = min(base_delay * multiplier^(attempt - 1) * (1 ± jitter), max_delay), where attempt is counted from 1.
| Attempt | Base Wait | With Jitter (approximate) |
|---|---|---|
| Retry 1 | 1000ms | 700~1300ms |
| Retry 2 | 2000ms | 1400~2600ms |
| Retry 3 | 4000ms | 2800~5200ms |
| Retry 4 | 8000ms | 5600~10400ms |
| Retry 5 | 16000ms | 11200~20800ms |
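A minimal implementation of this delay formula, matching the table above (a sketch; the `rng` parameter is injected only to make the jitter testable):

```python
import random

def backoff_delay_ms(attempt, base=1000, multiplier=2.0,
                     jitter=0.3, max_delay=30000, rng=random.random):
    """delay = min(base * multiplier^(attempt - 1) * (1 +/- jitter), max_delay),
    with attempt counted from 1 as in the table above."""
    raw = base * multiplier ** (attempt - 1)
    jittered = raw * (1 + jitter * (2 * rng() - 1))  # uniform in [raw*(1-j), raw*(1+j)]
    return min(jittered, max_delay)
```

With jitter disabled the sequence is exactly 1000, 2000, 4000, 8000, 16000 ms; a sixth retry would compute 32000 ms and be capped at max_delay. The randomized jitter spreads simultaneous retries apart so that many clients do not hammer the Provider at the same instant.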
12.3 FailoverError Trigger Chain
What Is FailoverError
FailoverError is a special error type defined by OpenClaw. It triggers when a model/Provider cannot complete a request, and the system automatically switches to the next model in the pre-configured Failover chain.
Failover Chain Configuration
{
"failover": {
"enabled": true,
"chain": [
{
"model": "anthropic/claude-opus-4-6",
"timeout_ms": 30000,
"triggers": ["FailoverError", "timeout", "context_length_exceeded"]
},
{
"model": "anthropic/claude-sonnet-4-6",
"timeout_ms": 20000,
"triggers": ["FailoverError", "timeout"]
},
{
"model": "openai/gpt-5.5",
"timeout_ms": 25000,
"triggers": ["FailoverError", "timeout"]
},
{
"model": "ollama/llama3.2",
"timeout_ms": 60000,
"triggers": ["FailoverError"]
}
],
"on_failover_log": true,
"on_failover_notify_webhook": "${ALERT_WEBHOOK_URL}"
}
}
Trigger Condition Types
| Trigger Condition | Description |
|---|---|
| FailoverError | Model returns an explicit failure |
| timeout | No response within timeout_ms |
| context_length_exceeded | Input exceeds the model's context limit |
| rate_limit_exhausted | All Keys are rate-limited |
| content_filtered | Content filtered by the Provider |
| model_overloaded | Model service overloaded |
Real Failover Log Example
[2026-04-26T10:23:41Z] INFO Primary model request started: anthropic/claude-opus-4-6
[2026-04-26T10:24:11Z] WARN Timeout after 30000ms, triggering failover
[2026-04-26T10:24:11Z] INFO Failover to: anthropic/claude-sonnet-4-6 (attempt 2/4)
[2026-04-26T10:24:25Z] INFO Request completed successfully on fallback model
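The chain-walking behavior can be sketched as follows. `FailoverError` and `call` are stand-ins here (the real request function and error hierarchy belong to OpenClaw), but the control flow mirrors the configuration: try each model in order, fall through on a trigger, and fail hard only when the chain is exhausted.

```python
class FailoverError(Exception):
    """Stand-in for OpenClaw's failover trigger error (illustrative)."""

def run_with_failover(chain, call):
    """Walk the chain in order; any FailoverError or TimeoutError
    moves on to the next model. `call(model, timeout_ms)` stands in
    for the actual request function."""
    last_error = None
    for step in chain:
        try:
            return step["model"], call(step["model"], step["timeout_ms"])
        except (FailoverError, TimeoutError) as exc:
            last_error = exc  # a real system would log and fire the webhook here
    raise RuntimeError("all models in the failover chain failed") from last_error

chain = [
    {"model": "anthropic/claude-opus-4-6",   "timeout_ms": 30000},
    {"model": "anthropic/claude-sonnet-4-6", "timeout_ms": 20000},
    {"model": "openai/gpt-5.5",              "timeout_ms": 25000},
    {"model": "ollama/llama3.2",             "timeout_ms": 60000},
]
```

If the opus call times out, the helper returns the sonnet result, which is exactly the sequence shown in the log above.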
12.4 Profile Cooling Tracking
Cooling Mechanism to Prevent Frequent Switching
Frequent switching between Providers introduces unnecessary latency and state inconsistency. OpenClaw's Profile cooling mechanism ensures a Provider/Key that recently failed is not immediately re-selected.
{
"profile_cooling": {
"enabled": true,
"error_threshold": 3,
"cooling_period_seconds": 300,
"recovery_check_interval_seconds": 60,
"metrics": {
"track_per_key": true,
"track_per_model": true,
"track_per_provider": false
}
}
}
Cooling State Machine
Key is normally available (Active)
|
| Errors >= error_threshold
v
Enter cooling state (Cooling)
|
| Wait cooling_period_seconds
v
Attempt recovery check (Recovery Check)
|
+-- Health check passes --> Active
|
+-- Health check fails --> Reset cooling timer --> Cooling
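The state machine above can be modeled with a small per-Key tracker. This is a sketch of the semantics, not OpenClaw's actual internals; the injectable `clock` exists only so the timing is deterministic in tests.

```python
import time

class KeyProfile:
    """Sketch of per-Key cooling state (Active -> Cooling -> Recovery
    Check). `clock` is injectable for deterministic testing."""
    def __init__(self, error_threshold=3, cooling_period_s=300,
                 clock=time.monotonic):
        self.errors = 0
        self.cooling_until = None
        self.error_threshold = error_threshold
        self.cooling_period_s = cooling_period_s
        self.clock = clock

    def record_error(self):
        self.errors += 1
        if self.errors >= self.error_threshold:
            self.cooling_until = self.clock() + self.cooling_period_s

    def record_success(self):
        self.errors = 0
        self.cooling_until = None

    def status(self):
        if self.cooling_until is None:
            return "ACTIVE"
        if self.clock() < self.cooling_until:
            return "COOLING"
        return "RECOVERY_CHECK"  # cooling elapsed; probe before reactivating
```

A Key stays ACTIVE through the first two errors, enters COOLING on the third, becomes eligible for a recovery check once the cooling period elapses, and only a recorded success resets it to ACTIVE.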
View Current Profile Status
# CLI command
openclaw profile status
# Example output
Provider Profile Status:
anthropic/claude-opus-4-6
Key: sk-ant-...xxx1 Status: ACTIVE Errors: 0 Last used: 2s ago
Key: sk-ant-...xxx2 Status: COOLING Errors: 3 Cooling until: 14:28:41
Key: sk-ant-...xxx3 Status: ACTIVE Errors: 1 Last used: 45s ago
openai/gpt-5.5
Key: sk-proj-...yyy1 Status: ACTIVE Errors: 0 Last used: 1m ago
12.5 Inference Depth Control: The /think Command
Three Inference Modes
OpenClaw supports dynamically controlling the model's inference depth via the /think command. This is especially useful for scenarios requiring trade-offs between response speed and reasoning quality.
# Adaptive mode (default): automatically selects inference depth based on problem complexity
/think adaptive
# High-depth reasoning: enables CoT/Extended Thinking
/think high
# Disable reasoning: fastest response speed, direct output
/think off
Applicable Scenarios for Each Mode
| Mode | Token Consumption | Response Latency | Applicable Scenarios |
|---|---|---|---|
| /think off | Lowest | Shortest | Simple Q&A, classification, summarization, formatting |
| /think adaptive | Medium | Medium | General purpose; recommended default |
| /think high | Highest | Longest | Mathematical derivation, code debugging, complex planning |
Setting Default Inference Mode in Config File
{
"inference": {
"default_think_mode": "adaptive",
"per_model_overrides": {
"anthropic/claude-opus-4-6": "high",
"anthropic/claude-haiku-4-5": "off",
"openai/o3": "high",
"openai/gpt-5.4-mini": "off"
}
}
}
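The interaction between the global default, per-model overrides, and the /think command can be sketched as a lookup with assumed precedence (a session-level /think beats the per-model override, which beats the default; this ordering is an assumption of the sketch, not documented behavior):

```python
DEFAULT_THINK_MODE = "adaptive"
PER_MODEL_OVERRIDES = {
    "anthropic/claude-opus-4-6": "high",
    "anthropic/claude-haiku-4-5": "off",
}

def resolve_think_mode(model, session_override=None):
    """Assumed precedence: explicit /think command > per-model
    override > global default."""
    if session_override is not None:
        return session_override
    return PER_MODEL_OVERRIDES.get(model, DEFAULT_THINK_MODE)
```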
Inference Token Budget Control
{
"inference": {
"think_budget": {
"adaptive_min_tokens": 500,
"adaptive_max_tokens": 4000,
"high_max_tokens": 16000
}
}
}
12.6 Binding Different Models per Channel
Business Scenario Design
Different user access channels have different requirements for response speed and quality:
- WhatsApp: Users expect quick replies -> use lightweight, fast models
- Telegram: Medium-complexity conversations -> use balanced models
- Web API: Professional users who can accept longer waits -> use flagship models
- Internal tools: Can use local models to save costs
Channel Model Binding Configuration
{
"channel_model_bindings": {
"whatsapp": {
"model": "anthropic/claude-haiku-4-5",
"think_mode": "off",
"max_tokens": 1024,
"temperature": 0.7
},
"telegram": {
"model": "anthropic/claude-sonnet-4-6",
"think_mode": "adaptive",
"max_tokens": 4096,
"temperature": 0.7
},
"web_api": {
"model": "anthropic/claude-opus-4-6",
"think_mode": "high",
"max_tokens": 8192,
"temperature": 0.5
},
"internal_tool": {
"model": "ollama/llama3.2",
"think_mode": "off",
"max_tokens": 2048
}
}
}
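At runtime the binding resolution reduces to a dictionary lookup with a fallback, which a minimal sketch makes concrete (the helper name `binding_for` is illustrative, not an OpenClaw API):

```python
CHANNEL_MODEL_BINDINGS = {
    "whatsapp": {"model": "anthropic/claude-haiku-4-5",  "think_mode": "off"},
    "telegram": {"model": "anthropic/claude-sonnet-4-6", "think_mode": "adaptive"},
    "web_api":  {"model": "anthropic/claude-opus-4-6",   "think_mode": "high"},
}

def binding_for(channel, fallback_channel="telegram"):
    """Look up the incoming channel's binding; an unknown channel
    falls back to the agent's fallback_channel binding."""
    return CHANNEL_MODEL_BINDINGS.get(channel,
                                      CHANNEL_MODEL_BINDINGS[fallback_channel])
```

An unbound channel (say, a new messenger integration) is silently served by the fallback channel's model rather than failing the request.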
Referencing Channel Bindings in Agent Definitions
{
"agent": {
"name": "customer-support",
"channel": "${INCOMING_CHANNEL}",
"fallback_channel": "telegram"
}
}
12.7 Cost Control: Per-Provider Model Cost Comparison
Cost-Aware Configuration
{
"cost_control": {
"enabled": true,
"budget": {
"daily_usd": 100.0,
"monthly_usd": 2000.0,
"alert_threshold_pct": 80
},
"auto_downgrade": {
"enabled": true,
"trigger_pct": 90,
"downgrade_to": "anthropic/claude-sonnet-4-6"
}
}
}
Mainstream Model Cost Comparison (Reference Prices, Q1 2026)
| Model | Input Price ($/M tokens) | Output Price ($/M tokens) | Cost-Effectiveness |
|---|---|---|---|
| anthropic/claude-haiku-4-5 | 0.25 | 1.25 | Very High |
| anthropic/claude-sonnet-4-6 | 3.00 | 15.00 | High |
| anthropic/claude-opus-4-6 | 15.00 | 75.00 | Low (strongest capability) |
| openai/gpt-5.4-mini | 0.15 | 0.60 | Very High |
| openai/gpt-5.5 | 10.00 | 30.00 | Medium |
| openai/o3 | 10.00 | 40.00 | Low (strongest reasoning) |
| deepseek/deepseek-r1 | 0.55 | 2.19 | Very High (reasoning) |
| google/gemini-3-flash | 0.075 | 0.30 | Highest |
| ollama/any model | 0.00 | 0.00 | Infinite (local) |
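Per-request cost follows directly from the table: scale each token count to millions and multiply by the corresponding price. A small sketch using a subset of the reference prices above:

```python
# USD per million tokens (input_price, output_price), from the table above
PRICES = {
    "anthropic/claude-haiku-4-5":  (0.25, 1.25),
    "anthropic/claude-sonnet-4-6": (3.00, 15.00),
    "anthropic/claude-opus-4-6":   (15.00, 75.00),
    "google/gemini-3-flash":       (0.075, 0.30),
}

def request_cost_usd(model, input_tokens, output_tokens):
    """cost = input_tokens/1e6 * input_price + output_tokens/1e6 * output_price"""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price
```

For example, a sonnet request with 1M input tokens and 100K output tokens costs 3.00 + 1.50 = 4.50 USD, which is what budget tracking accumulates against daily_usd.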
Cost Optimization Strategy
{
"cost_optimization": {
"routing_rules": [
{
"condition": "token_estimate < 500",
"route_to": "anthropic/claude-haiku-4-5"
},
{
"condition": "token_estimate >= 500 AND complexity == 'low'",
"route_to": "openai/gpt-5.4-mini"
},
{
"condition": "complexity == 'high' OR requires_reasoning == true",
"route_to": "anthropic/claude-opus-4-6"
}
]
}
}
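The routing rules above evaluate in order with first-match-wins semantics. The real engine presumably parses the "condition" strings; this sketch simply hardcodes their logic to make the behavior concrete:

```python
def route(token_estimate, complexity, requires_reasoning=False):
    """Evaluate the routing rules in order; first match wins.
    A sketch that hardcodes the rule conditions shown above."""
    if token_estimate < 500:
        return "anthropic/claude-haiku-4-5"
    if token_estimate >= 500 and complexity == "low":
        return "openai/gpt-5.4-mini"
    if complexity == "high" or requires_reasoning:
        return "anthropic/claude-opus-4-6"
    return None  # no rule matched; caller falls back to the default model
```

Note the ordering matters: a short request routes to haiku even when complexity is high, because the first rule matches before the reasoning rule is reached.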
12.8 Comprehensive Example: Full Production-Grade Configuration
{
"providers": {
"anthropic": {
"api_keys": [
{"key": "${ANT_KEY_1}", "priority": 1, "weight": 3},
{"key": "${ANT_KEY_2}", "priority": 1, "weight": 2},
{"key": "${ANT_KEY_BACKUP}", "priority": 2, "weight": 1}
],
"rotation_strategy": "weighted_round_robin"
},
"openai": {
"api_keys": [
{"key": "${OAI_KEY_1}", "priority": 1, "weight": 1}
]
},
"ollama": {
"base_url": "http://localhost:11434"
}
},
"failover": {
"enabled": true,
"chain": [
{"model": "anthropic/claude-sonnet-4-6", "timeout_ms": 30000},
{"model": "openai/gpt-5.4-mini", "timeout_ms": 20000},
{"model": "ollama/llama3.2", "timeout_ms": 60000}
]
},
"profile_cooling": {
"enabled": true,
"error_threshold": 3,
"cooling_period_seconds": 300
},
"retry": {
"enabled": true,
"max_attempts": 3,
"backoff_strategy": "exponential_jitter"
},
"inference": {
"default_think_mode": "adaptive"
},
"cost_control": {
"daily_usd": 50.0,
"alert_threshold_pct": 80
}
}
Chapter Summary
- Key Rotation uses priority + weight strategy, treating multiple API Keys as a unified resource pool
- 429/quota retry uses exponential backoff + Jitter to prevent thundering herd effects
- FailoverError chains automatically switch on model failure, ensuring request completion rates
- Profile cooling prevents constantly retrying failed Keys/models, avoiding wasted effort
- /think modes allow on-demand inference depth control, balancing quality and cost
- Channel binding implements fine-grained strategies using different models for different entry points
- Cost-aware routing automatically downgrades to economical models under budget pressure
Next chapter introduces the complete practical guide for local model deployment.