Chapter 12: Deep Model Configuration - Key Rotation, Failover Mechanisms, and Inference Depth Control
Overview
In production environments, a single API Key and a fixed model configuration are rarely robust enough. OpenClaw provides three advanced mechanisms: Key Rotation, Failover, and inference depth control (Think mode). Together they form the reliability and cost-efficiency foundation of production-grade AI applications.
12.1 API Key Rotation Configuration
Why Key Rotation Is Needed
A single API Key will quickly hit Provider rate limits (TPM/RPM) under high concurrency. Key Rotation multiplies actual throughput by rotating across multiple Keys.
Configuration Format
{
"providers": {
"anthropic": {
"api_keys": [
{
"key": "${ANTHROPIC_API_KEY_1}",
"priority": 1,
"weight": 3,
"label": "primary-team-a"
},
{
"key": "${ANTHROPIC_API_KEY_2}",
"priority": 1,
"weight": 2,
"label": "primary-team-b"
},
{
"key": "${ANTHROPIC_API_KEY_BACKUP}",
"priority": 2,
"weight": 1,
"label": "backup"
}
],
"rotation_strategy": "weighted_round_robin"
}
}
}
Priority Logic
OpenClaw's Key selection follows these rules:
- Lower priority number = higher priority (priority=1 is tried before priority=2)
- Within the same priority group, Keys are rotated via weighted round-robin according to their weight
- When all higher-priority Keys are unavailable, the system automatically falls back to the next priority group

Priority 1: [Key-A weight=3, Key-B weight=2] -> requests distributed in a 3:2 ratio
Priority 2: [Key-Backup weight=1] -> activated only when all of Priority 1 has failed
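The priority-then-weight selection can be sketched as follows. This is an illustrative model of the rules above, not OpenClaw's actual internals; the naive weight expansion is simpler than the smooth weighted round-robin most production schedulers use.

```python
def build_rotation(keys):
    """Group Keys by priority, then expand each group by weight.
    A naive expansion sketch; real schedulers use smooth weighted
    round-robin to interleave picks more evenly."""
    groups = {}
    for k in keys:
        groups.setdefault(k["priority"], []).extend([k["label"]] * k["weight"])
    return [groups[p] for p in sorted(groups)]  # lower number = higher priority

def select_key(rotation, counter, unavailable=frozenset()):
    """Pick from the highest-priority group with an available Key,
    falling back to lower-priority groups when a group is exhausted."""
    for group in rotation:
        live = [label for label in group if label not in unavailable]
        if live:
            return live[counter % len(live)]
    raise RuntimeError("no API Key available")

keys = [
    {"label": "primary-team-a", "priority": 1, "weight": 3},
    {"label": "primary-team-b", "priority": 1, "weight": 2},
    {"label": "backup",         "priority": 2, "weight": 1},
]
rotation = build_rotation(keys)
picks = [select_key(rotation, i) for i in range(10)]  # 3:2 split across the two primary Keys
```

Over ten requests, team-a receives six and team-b four (the configured 3:2 ratio); the backup Key is only returned once both primary Keys are marked unavailable.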
Rotation Strategy Options
| Strategy | Description | Best For |
|---|---|---|
| round_robin | Equal-weight rotation | Keys with identical quotas |
| weighted_round_robin | Weighted rotation | Keys with different quotas |
| random | Random selection | Load spreading |
| least_used | Select the least-used Key | Fine-grained balancing |
12.2 429 / Quota Auto-Retry Mechanism
Error Classification and Response
{
"retry": {
"enabled": true,
"max_attempts": 5,
"backoff_strategy": "exponential",
"base_delay_ms": 1000,
"max_delay_ms": 30000,
"retryable_errors": [429, 500, 502, 503, 504],
"non_retryable_errors": [400, 401, 403, 404]
}
}
Retry Flow Diagram
Request sent
|
v
API response
|
Failed?
|
+-- No --> Return result
|
+-- Yes --> Is error 429?
|
+-- Yes --> Read Retry-After header
| Wait specified time
| Switch to next Key (same priority)
| Retry (decrement max_attempts)
|
+-- No --> Is it a retryable error code?
|
+-- Yes --> Wait with exponential backoff
| Retry
|
+-- No --> Throw Non-retryable Error
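The decision logic of the flow chart above can be condensed into a single classification function. The action names here are illustrative labels, not OpenClaw identifiers:

```python
RETRYABLE = {429, 500, 502, 503, 504}
NON_RETRYABLE = {400, 401, 403, 404}

def next_action(status, attempt, max_attempts=5):
    """Map an HTTP status plus the current attempt number to the
    decision in the flow chart. Action names are illustrative."""
    if attempt >= max_attempts:
        return "give_up"                          # retry budget exhausted
    if status == 429:
        return "wait_retry_after_then_rotate_key" # honor Retry-After, switch Key
    if status in RETRYABLE:
        return "exponential_backoff_retry"
    return "raise_non_retryable"                  # 400/401/403/404 etc.
```

Note that 429 is handled specially: instead of blind exponential backoff, the Retry-After header is honored and the request is moved to the next Key in the same priority group.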
Exponential Backoff Configuration Details
{
"retry": {
"backoff_strategy": "exponential_jitter",
"base_delay_ms": 1000,
"multiplier": 2.0,
"jitter": 0.3,
"max_delay_ms": 30000
}
}
Delay calculation: delay = min(base_delay * multiplier^(attempt - 1) * (1 ± jitter), max_delay), where attempt is counted from 1.
| Attempt | Base Wait | With Jitter (approximate) |
|---|---|---|
| Retry 1 | 1000ms | 700~1300ms |
| Retry 2 | 2000ms | 1400~2600ms |
| Retry 3 | 4000ms | 2800~5200ms |
| Retry 4 | 8000ms | 5600~10400ms |
| Retry 5 | 16000ms | 11200~20800ms |
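A minimal implementation of this delay formula, matching the table above (a sketch; the `rng` parameter is injected only to make the jitter testable):

```python
import random

def backoff_delay_ms(attempt, base=1000, multiplier=2.0,
                     jitter=0.3, max_delay=30000, rng=random.random):
    """delay = min(base * multiplier^(attempt - 1) * (1 +/- jitter), max_delay),
    with attempt counted from 1 as in the table above."""
    raw = base * multiplier ** (attempt - 1)
    jittered = raw * (1 + jitter * (2 * rng() - 1))  # uniform in [raw*(1-j), raw*(1+j)]
    return min(jittered, max_delay)
```

With jitter disabled the sequence is exactly 1000, 2000, 4000, 8000, 16000 ms; a sixth retry would compute 32000 ms and be capped at max_delay. The randomized jitter spreads simultaneous retries apart so that many clients do not hammer the Provider at the same instant.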
12.3 FailoverError Trigger Chain
What Is FailoverError
FailoverError is a special error type defined by OpenClaw. It triggers when a model/Provider cannot complete a request, and the system automatically switches to the next model in the pre-configured Failover chain.
Failover Chain Configuration
{
"failover": {
"enabled": true,
"chain": [
{
"model": "anthropic/claude-opus-4-6",
"timeout_ms": 30000,
"triggers": ["FailoverError", "timeout", "context_length_exceeded"]
},
{
"model": "anthropic/claude-sonnet-4-6",
"timeout_ms": 20000,
"triggers": ["FailoverError", "timeout"]
},
{
"model": "openai/gpt-5.5",
"timeout_ms": 25000,
"triggers": ["FailoverError", "timeout"]
},
{
"model": "ollama/llama3.2",
"timeout_ms": 60000,
"triggers": ["FailoverError"]
}
],
"on_failover_log": true,
"on_failover_notify_webhook": "${ALERT_WEBHOOK_URL}"
}
}
Trigger Condition Types
| Trigger Condition | Description |
|---|---|
| FailoverError | Model returns an explicit failure |
| timeout | No response within timeout_ms |
| context_length_exceeded | Input exceeds the model's context limit |
| rate_limit_exhausted | All Keys are rate-limited |
| content_filtered | Content filtered by the Provider |
| model_overloaded | Model service overloaded |
Real Failover Log Example
[2026-04-26T10:23:41Z] INFO Primary model request started: anthropic/claude-opus-4-6
[2026-04-26T10:24:11Z] WARN Timeout after 30000ms, triggering failover
[2026-04-26T10:24:11Z] INFO Failover to: anthropic/claude-sonnet-4-6 (attempt 2/4)
[2026-04-26T10:24:25Z] INFO Request completed successfully on fallback model
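The chain-walking behavior can be sketched as follows. `FailoverError` and `call` are stand-ins here (the real request function and error hierarchy belong to OpenClaw), but the control flow mirrors the configuration: try each model in order, fall through on a trigger, and fail hard only when the chain is exhausted.

```python
class FailoverError(Exception):
    """Stand-in for OpenClaw's failover trigger error (illustrative)."""

def run_with_failover(chain, call):
    """Walk the chain in order; any FailoverError or TimeoutError
    moves on to the next model. `call(model, timeout_ms)` stands in
    for the actual request function."""
    last_error = None
    for step in chain:
        try:
            return step["model"], call(step["model"], step["timeout_ms"])
        except (FailoverError, TimeoutError) as exc:
            last_error = exc  # a real system would log and fire the webhook here
    raise RuntimeError("all models in the failover chain failed") from last_error

chain = [
    {"model": "anthropic/claude-opus-4-6",   "timeout_ms": 30000},
    {"model": "anthropic/claude-sonnet-4-6", "timeout_ms": 20000},
    {"model": "openai/gpt-5.5",              "timeout_ms": 25000},
    {"model": "ollama/llama3.2",             "timeout_ms": 60000},
]
```

If the opus call times out, the helper returns the sonnet result, which is exactly the sequence shown in the log above.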
12.4 Profile Cooling Tracking
Cooling Mechanism to Prevent Frequent Switching
Frequent switching between Providers introduces unnecessary latency and state inconsistency. OpenClaw's Profile cooling mechanism ensures a Provider/Key that recently failed is not immediately re-selected.
{
"profile_cooling": {
"enabled": true,
"error_threshold": 3,
"cooling_period_seconds": 300,
"recovery_check_interval_seconds": 60,
"metrics": {
"track_per_key": true,
"track_per_model": true,
"track_per_provider": false
}
}
}
Cooling State Machine
Key is normally available (Active)
|
| Errors >= error_threshold
v
Enter cooling state (Cooling)
|
| Wait cooling_period_seconds
v
Attempt recovery check (Recovery Check)
|
+-- Health check passes --> Active
|
+-- Health check fails --> Reset cooling timer --> Cooling
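The state machine above can be modeled with a small per-Key tracker. This is a sketch of the semantics, not OpenClaw's actual internals; the injectable `clock` exists only so the timing is deterministic in tests.

```python
import time

class KeyProfile:
    """Sketch of per-Key cooling state (Active -> Cooling -> Recovery
    Check). `clock` is injectable for deterministic testing."""
    def __init__(self, error_threshold=3, cooling_period_s=300,
                 clock=time.monotonic):
        self.errors = 0
        self.cooling_until = None
        self.error_threshold = error_threshold
        self.cooling_period_s = cooling_period_s
        self.clock = clock

    def record_error(self):
        self.errors += 1
        if self.errors >= self.error_threshold:
            self.cooling_until = self.clock() + self.cooling_period_s

    def record_success(self):
        self.errors = 0
        self.cooling_until = None

    def status(self):
        if self.cooling_until is None:
            return "ACTIVE"
        if self.clock() < self.cooling_until:
            return "COOLING"
        return "RECOVERY_CHECK"  # cooling elapsed; probe before reactivating
```

A Key stays ACTIVE through the first two errors, enters COOLING on the third, becomes eligible for a recovery check once the cooling period elapses, and only a recorded success resets it to ACTIVE.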
View Current Profile Status
# CLI command
openclaw profile status
# Example output
Provider Profile Status:
anthropic/claude-opus-4-6
Key: sk-ant-...xxx1 Status: ACTIVE Errors: 0 Last used: 2s ago
Key: sk-ant-...xxx2 Status: COOLING Errors: 3 Cooling until: 14:28:41
Key: sk-ant-...xxx3 Status: ACTIVE Errors: 1 Last used: 45s ago
openai/gpt-5.5
Key: sk-proj-...yyy1 Status: ACTIVE Errors: 0 Last used: 1m ago
12.5 Inference Depth Control: The /think Command
Three Inference Modes
OpenClaw supports dynamically controlling the model's inference depth via the /think command. This is especially useful for scenarios requiring trade-offs between response speed and reasoning quality.
# Adaptive mode (default): automatically selects inference depth based on problem complexity
/think adaptive
# High-depth reasoning: enables CoT/Extended Thinking
/think high
# Disable reasoning: fastest response speed, direct output
/think off
Applicable Scenarios for Each Mode
| Mode | Token Consumption | Response Latency | Applicable Scenarios |
|---|---|---|---|
| /think off | Lowest | Shortest | Simple Q&A, classification, summarization, formatting |
| /think adaptive | Medium | Medium | General purpose; recommended default |
| /think high | Highest | Longest | Mathematical derivation, code debugging, complex planning |
Setting Default Inference Mode in Config File
{
"inference": {
"default_think_mode": "adaptive",
"per_model_overrides": {
"anthropic/claude-opus-4-6": "high",
"anthropic/claude-haiku-4-5": "off",
"openai/o3": "high",
"openai/gpt-5.4-mini": "off"
}
}
}
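The interaction between the global default, per-model overrides, and the /think command can be sketched as a lookup with assumed precedence (a session-level /think beats the per-model override, which beats the default; this ordering is an assumption of the sketch, not documented behavior):

```python
DEFAULT_THINK_MODE = "adaptive"
PER_MODEL_OVERRIDES = {
    "anthropic/claude-opus-4-6": "high",
    "anthropic/claude-haiku-4-5": "off",
}

def resolve_think_mode(model, session_override=None):
    """Assumed precedence: explicit /think command > per-model
    override > global default."""
    if session_override is not None:
        return session_override
    return PER_MODEL_OVERRIDES.get(model, DEFAULT_THINK_MODE)
```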
Inference Token Budget Control
{
"inference": {
"think_budget": {
"adaptive_min_tokens": 500,
"adaptive_max_tokens": 4000,
"high_max_tokens": 16000
}
}
}
12.6 Binding Different Models per Channel
Business Scenario Design
Different user access channels have different requirements for response speed and quality:
- WhatsApp: Users expect quick replies -> use lightweight, fast models
- Telegram: Medium-complexity conversations -> use balanced models
- Web API: Professional users who can accept longer waits -> use flagship models
- Internal tools: Can use local models to save costs
Channel Model Binding Configuration
{
"channel_model_bindings": {
"whatsapp": {
"model": "anthropic/claude-haiku-4-5",
"think_mode": "off",
"max_tokens": 1024,
"temperature": 0.7
},
"telegram": {
"model": "anthropic/claude-sonnet-4-6",
"think_mode": "adaptive",
"max_tokens": 4096,
"temperature": 0.7
},
"web_api": {
"model": "anthropic/claude-opus-4-6",
"think_mode": "high",
"max_tokens": 8192,
"temperature": 0.5
},
"internal_tool": {
"model": "ollama/llama3.2",
"think_mode": "off",
"max_tokens": 2048
}
}
}
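At runtime the binding resolution reduces to a dictionary lookup with a fallback, which a minimal sketch makes concrete (the helper name `binding_for` is illustrative, not an OpenClaw API):

```python
CHANNEL_MODEL_BINDINGS = {
    "whatsapp": {"model": "anthropic/claude-haiku-4-5",  "think_mode": "off"},
    "telegram": {"model": "anthropic/claude-sonnet-4-6", "think_mode": "adaptive"},
    "web_api":  {"model": "anthropic/claude-opus-4-6",   "think_mode": "high"},
}

def binding_for(channel, fallback_channel="telegram"):
    """Look up the incoming channel's binding; an unknown channel
    falls back to the agent's fallback_channel binding."""
    return CHANNEL_MODEL_BINDINGS.get(channel,
                                      CHANNEL_MODEL_BINDINGS[fallback_channel])
```

An unbound channel (say, a new messenger integration) is silently served by the fallback channel's model rather than failing the request.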
Referencing Channel Bindings in Agent Definitions
{
"agent": {
"name": "customer-support",
"channel": "${INCOMING_CHANNEL}",
"fallback_channel": "telegram"
}
}
12.7 Cost Control: Per-Provider Model Cost Comparison
Cost-Aware Configuration
{
"cost_control": {
"enabled": true,
"budget": {
"daily_usd": 100.0,
"monthly_usd": 2000.0,
"alert_threshold_pct": 80
},
"auto_downgrade": {
"enabled": true,
"trigger_pct": 90,
"downgrade_to": "anthropic/claude-sonnet-4-6"
}
}
}
Mainstream Model Cost Comparison (Reference Prices, Q1 2026)
| Model | Input Price ($/M tokens) | Output Price ($/M tokens) | Cost-Effectiveness |
|---|---|---|---|
| anthropic/claude-haiku-4-5 | 0.25 | 1.25 | Very High |
| anthropic/claude-sonnet-4-6 | 3.00 | 15.00 | High |
| anthropic/claude-opus-4-6 | 15.00 | 75.00 | Low (strongest capability) |
| openai/gpt-5.4-mini | 0.15 | 0.60 | Very High |
| openai/gpt-5.5 | 10.00 | 30.00 | Medium |
| openai/o3 | 10.00 | 40.00 | Low (strongest reasoning) |
| deepseek/deepseek-r1 | 0.55 | 2.19 | Very High (reasoning) |
| google/gemini-3-flash | 0.075 | 0.30 | Highest |
| ollama/any model | 0.00 | 0.00 | Infinite (local) |
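Per-request cost follows directly from the table: scale each token count to millions and multiply by the corresponding price. A small sketch using a subset of the reference prices above:

```python
# USD per million tokens (input_price, output_price), from the table above
PRICES = {
    "anthropic/claude-haiku-4-5":  (0.25, 1.25),
    "anthropic/claude-sonnet-4-6": (3.00, 15.00),
    "anthropic/claude-opus-4-6":   (15.00, 75.00),
    "google/gemini-3-flash":       (0.075, 0.30),
}

def request_cost_usd(model, input_tokens, output_tokens):
    """cost = input_tokens/1e6 * input_price + output_tokens/1e6 * output_price"""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price
```

For example, a sonnet request with 1M input tokens and 100K output tokens costs 3.00 + 1.50 = 4.50 USD, which is what budget tracking accumulates against daily_usd.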
Cost Optimization Strategy
{
"cost_optimization": {
"routing_rules": [
{
"condition": "token_estimate < 500",
"route_to": "anthropic/claude-haiku-4-5"
},
{
"condition": "token_estimate >= 500 AND complexity == 'low'",
"route_to": "openai/gpt-5.4-mini"
},
{
"condition": "complexity == 'high' OR requires_reasoning == true",
"route_to": "anthropic/claude-opus-4-6"
}
]
}
}
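The routing rules above evaluate in order with first-match-wins semantics. The real engine presumably parses the "condition" strings; this sketch simply hardcodes their logic to make the behavior concrete:

```python
def route(token_estimate, complexity, requires_reasoning=False):
    """Evaluate the routing rules in order; first match wins.
    A sketch that hardcodes the rule conditions shown above."""
    if token_estimate < 500:
        return "anthropic/claude-haiku-4-5"
    if token_estimate >= 500 and complexity == "low":
        return "openai/gpt-5.4-mini"
    if complexity == "high" or requires_reasoning:
        return "anthropic/claude-opus-4-6"
    return None  # no rule matched; caller falls back to the default model
```

Note the ordering matters: a short request routes to haiku even when complexity is high, because the first rule matches before the reasoning rule is reached.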
12.8 Comprehensive Example: Full Production-Grade Configuration
{
"providers": {
"anthropic": {
"api_keys": [
{"key": "${ANT_KEY_1}", "priority": 1, "weight": 3},
{"key": "${ANT_KEY_2}", "priority": 1, "weight": 2},
{"key": "${ANT_KEY_BACKUP}", "priority": 2, "weight": 1}
],
"rotation_strategy": "weighted_round_robin"
},
"openai": {
"api_keys": [
{"key": "${OAI_KEY_1}", "priority": 1, "weight": 1}
]
},
"ollama": {
"base_url": "http://localhost:11434"
}
},
"failover": {
"enabled": true,
"chain": [
{"model": "anthropic/claude-sonnet-4-6", "timeout_ms": 30000},
{"model": "openai/gpt-5.4-mini", "timeout_ms": 20000},
{"model": "ollama/llama3.2", "timeout_ms": 60000}
]
},
"profile_cooling": {
"enabled": true,
"error_threshold": 3,
"cooling_period_seconds": 300
},
"retry": {
"enabled": true,
"max_attempts": 3,
"backoff_strategy": "exponential_jitter"
},
"inference": {
"default_think_mode": "adaptive"
},
"cost_control": {
"daily_usd": 50.0,
"alert_threshold_pct": 80
}
}
Chapter Summary
- Key Rotation uses priority + weight strategy, treating multiple API Keys as a unified resource pool
- 429/quota retry uses exponential backoff + Jitter to prevent thundering herd effects
- FailoverError chains automatically switch on model failure, ensuring request completion rates
- Profile cooling prevents constantly retrying failed Keys/models, avoiding wasted effort
- /think modes allow on-demand inference depth control, balancing quality and cost
- Channel binding implements fine-grained strategies using different models for different entry points
- Cost-aware routing automatically downgrades to economical models under budget pressure
Next chapter introduces the complete practical guide for local model deployment.