Chapter 26: Prompt Caching: How It Works and What It Saves

Every time you call Hermes Agent, you re-send the same 13,900 tokens of framework instructions, and the model re-reads them at full price. Anthropic's Prompt Caching cuts the price of those repeated tokens by 90% on every cache hit. This chapter explains how the technology works, how to configure it in Hermes, and how to reach the roughly 2x cost reduction that practitioners report in production.


26.1 How Anthropic Prompt Caching Works

The Core Idea

Prompt Caching, launched by Anthropic in August 2024, is a server-side caching technology. The core idea: for content that remains unchanged across multiple API calls, process it fully only on the first call, then reuse the cached KV (Key-Value) state for subsequent calls.

Without caching (every call):
  [System prompt  4,200 tokens] → full processing, billed at input price
  [Tool defs      6,800 tokens] → full processing, billed at input price
  [Memory         2,100 tokens] → full processing, billed at input price
  [User message    xxx tokens]  → full processing, billed at input price

With Prompt Caching:
  First call (cache write):
    [System prompt + tools + memory] → process + write to cache (125% price)
    [User message]                   → normal processing

  Calls 2–N (cache hit):
    [System prompt + tools + memory] → read from cache (10% price)
    [User message]                   → normal processing

Billing Rules

Token Type      Billing Rate   Notes
Regular input   100%           Standard input price
Cache write     125%           First-time caching, 25% premium
Cache read      10%            Cache hit, only 10% of input price
Output          100%           Unaffected by caching
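
To see how these rates interact, here is a minimal sketch that compares the cost of sending one fixed prefix with and without caching. The function name `relative_prefix_cost` and its parameters are illustrative rather than part of any SDK; the rates come from the table above, and the model assumes every call after the first hits the cache.

```python
def relative_prefix_cost(calls: int, cached: bool,
                         write_rate: float = 1.25,
                         read_rate: float = 0.10) -> float:
    """Cost of sending one fixed prefix `calls` times.

    The unit is "one uncached pass over the prefix", so the result is
    independent of the actual token price.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    if not cached:
        return float(calls)
    # First call writes the cache; every later call is assumed to hit it.
    return write_rate + read_rate * (calls - 1)


for n in (1, 2, 10, 100):
    uncached = relative_prefix_cost(n, cached=False)
    with_cache = relative_prefix_cost(n, cached=True)
    print(f"{n:>3} calls: uncached={uncached:.2f}  cached={with_cache:.2f}")
```

Under these assumptions a single isolated call costs 25% more with caching, while two calls already cost less with caching (1.35 versus 2.00 units).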

Cache Lifecycle
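
The rest of this chapter relies on one lifecycle rule: a cache entry expires after 5 minutes (the expiry listed under Key Factors in 26.3), and, per Anthropic's documentation for the default ephemeral cache, each use of the entry refreshes that window. The sketch below simulates that rule for a list of call timestamps. `simulate_cache` is a hypothetical helper for reasoning about call cadence, not an SDK function, and it assumes the cached prefix is identical on every call.

```python
TTL_SECONDS = 5 * 60  # default ephemeral cache lifetime


def simulate_cache(call_times: list[float], ttl: float = TTL_SECONDS) -> list[str]:
    """Label each call 'write' or 'hit' given call times in seconds.

    Assumes an identical prefix on every call and a TTL that restarts
    whenever the entry is written or read.
    """
    events = []
    expires_at = None  # no cache entry yet
    for t in sorted(call_times):
        if expires_at is not None and t < expires_at:
            events.append("hit")
        else:
            events.append("write")
        expires_at = t + ttl  # both writes and hits restart the clock
    return events


# Calls at 0 s, 4 min, 8 min, then after a 6-minute gap.
print(simulate_cache([0, 240, 480, 840]))
```

The first three calls stay within 5 minutes of each other, so only the first pays the write premium; the 6-minute gap before the last call lets the entry expire and forces a second write.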


26.2 Configuration in Hermes

Basic Configuration: Marking Cache Breakpoints

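import anthropic

# load_hermes_system_prompt(), load_hermes_tools(), and load_memory() are
# project helpers assumed to be defined elsewhere.
client = anthropic.Anthropic()

def create_hermes_request_with_caching(
    user_message: str,
    conversation_history: list[dict] | None = None
) -> dict: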
    """
    Build a Hermes request with Prompt Caching.
    
    Caching strategy:
    - System prompt: fully cached (lowest change frequency)
    - Tool definitions: fully cached (low change frequency)
    - Memory: optionally cached (medium change frequency)
    - User message: not cached (changes every call)
    """
    # System prompt with cache marker
    system_content = [
        {
            "type": "text",
            "text": load_hermes_system_prompt(),
            "cache_control": {"type": "ephemeral"}
        }
    ]
    
    # Tools with cache marker on the last tool
    tools = load_hermes_tools()
    if tools:
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    
    # Build messages
    messages = []
    
    # Inject memory (with cache marker)
    if memory_content := load_memory():
        messages.append({
            "role": "user",
            "content": [{
                "type": "text",
                "text": f"<memory>\n{memory_content}\n</memory>",
                "cache_control": {"type": "ephemeral"}
            }]
        })
        messages.append({
            "role": "assistant",
            "content": "Memory context loaded."
        })
    
    if conversation_history:
        messages.extend(conversation_history)
    
    messages.append({"role": "user", "content": user_message})
    
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 4096,
        "system": system_content,
        "tools": tools,
        "messages": messages
    }
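
Anthropic's API accepts at most four cache breakpoints per request, and the builder above uses three (system prompt, last tool, memory). A minimal sketch of a pre-flight check follows. `count_cache_breakpoints` is a hypothetical helper, and it assumes the request dict has the shape returned by `create_hermes_request_with_caching`.

```python
MAX_BREAKPOINTS = 4  # limit documented by Anthropic at the time of writing


def count_cache_breakpoints(request: dict) -> int:
    """Count blocks carrying a `cache_control` marker in a request dict."""
    count = 0
    # System prompt: a list of content blocks (a plain string has no markers).
    system = request.get("system", [])
    if isinstance(system, list):
        count += sum(1 for block in system if "cache_control" in block)
    # Tool definitions.
    count += sum(1 for tool in request.get("tools", []) if "cache_control" in tool)
    # Message content: either a plain string or a list of content blocks.
    for message in request.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            count += sum(1 for block in content
                         if isinstance(block, dict) and "cache_control" in block)
    return count


example = {
    "system": [{"type": "text", "text": "static", "cache_control": {"type": "ephemeral"}}],
    "tools": [{"name": "a"}, {"name": "b", "cache_control": {"type": "ephemeral"}}],
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "<memory/>",
                                      "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "hello"},
    ],
}
used = count_cache_breakpoints(example)
print(f"{used} of {MAX_BREAKPOINTS} breakpoints used")
```

Running a check like this before sending leaves one breakpoint spare, for example for marking the end of a long conversation history.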

Monitoring Cache Performance

def call_with_cache_monitoring(request: dict) -> tuple[anthropic.types.Message, dict]:
    """Make an API call and monitor cache hit statistics."""
    response = client.messages.create(**request)
    usage = response.usage
    
    cache_stats = {
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        # The cache fields can be missing or None, so fall back to 0
        "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", 0) or 0,
        "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", 0) or 0,
    }
    
    # Claude 3.5 Sonnet pricing (Q4 2024), in USD per million tokens
    cost = (
        cache_stats["input_tokens"] / 1_000_000 * 3.00 +
        cache_stats["cache_creation_input_tokens"] / 1_000_000 * 3.75 +
        cache_stats["cache_read_input_tokens"] / 1_000_000 * 0.30 +
        cache_stats["output_tokens"] / 1_000_000 * 15.00
    )
    
    cache_stats["estimated_cost_usd"] = cost
    cache_stats["cache_hit"] = cache_stats["cache_read_input_tokens"] > 0
    
    return response, cache_stats
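
Per-call statistics become more useful once aggregated. The following sketch summarizes a list of `cache_stats` dicts of the shape returned above. `summarize_cache_stats` and the two ratios it reports are illustrative choices, not a standard metric definition.

```python
def summarize_cache_stats(stats: list[dict]) -> dict:
    """Aggregate per-call cache statistics into hit-rate figures."""
    calls = len(stats)
    hits = sum(1 for s in stats if s["cache_read_input_tokens"] > 0)
    read = sum(s["cache_read_input_tokens"] for s in stats)
    written = sum(s["cache_creation_input_tokens"] for s in stats)
    uncached = sum(s["input_tokens"] for s in stats)
    total_input = read + written + uncached
    return {
        "calls": calls,
        # Share of calls that read anything from cache.
        "call_hit_rate": hits / calls if calls else 0.0,
        # Share of all input tokens that were served from cache.
        "token_hit_rate": read / total_input if total_input else 0.0,
    }


sample = [
    {"input_tokens": 2000, "cache_creation_input_tokens": 13900, "cache_read_input_tokens": 0},
    {"input_tokens": 2000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 13900},
    {"input_tokens": 2000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 13900},
]
summary = summarize_cache_stats(sample)
print(summary)
```

A call hit rate that stays well below 100% in steady traffic usually points to one of the factors discussed in 26.3.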

Cache Warm-Up Strategy

async def warmup_cache(client: anthropic.AsyncAnthropic, system_prompt: str, tools: list[dict]) -> None:
    """Pre-warm the cache at service startup so the first real user request hits cache.

    `client` must be an anthropic.AsyncAnthropic instance because the call is awaited.
    Pass the same system prompt and tool list (including cache markers) that real
    requests use; a prefix that differs in any way will not be reused.
    """
    warmup_request = {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 10,
        "system": [{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}],
        "tools": tools,
        "messages": [{"role": "user", "content": "warmup"}]
    }
    response = await client.messages.create(**warmup_request)
    created = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
    print(f"Cache warmed: {created} tokens written to cache")
26.3 Factors Affecting Cache Hit Rate

Key Factors

Factor                 Impact                               Optimization
Call interval          Cache expires after 5 minutes        Keep active sessions under 5-minute intervals
Content stability      Any change = cache miss              Move dynamic content after cache breakpoints
Breakpoint placement   Poor placement = small cached region Place breakpoints at end of largest stable blocks
Model version          Caches are model-version-specific    Pin model version in production
Concurrent calls       All first calls write cache          Use warm-up requests to pre-initialize
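
Because any change to the cached prefix causes a miss, it can help to fingerprint the static content and log or alert when the fingerprint changes unexpectedly. This is a sketch of one possible approach using only the standard library. `prefix_fingerprint` is a hypothetical helper, and the hash is purely a local diagnostic (it is not how Anthropic keys its cache).

```python
import hashlib
import json


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Return a short hash of the content intended to be cached."""
    # Key order is preserved on purpose: a reordered payload may also
    # serialize differently on the wire, so it is treated as a change here.
    payload = json.dumps({"system": system_prompt, "tools": tools},
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


tools = [{"name": "search", "description": "Search the web"}]
stable_a = prefix_fingerprint("You are Hermes.", tools)
stable_b = prefix_fingerprint("You are Hermes.", tools)
drifted = prefix_fingerprint("You are Hermes. Current time: 12:00.", tools)

print(stable_a == stable_b)  # identical content gives an identical fingerprint
print(stable_a == drifted)   # a timestamp in the prompt changes the fingerprint
```

Logging the fingerprint next to the cache statistics makes it easier to tell a miss caused by expiry from a miss caused by content drift.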

Common Pitfalls

from datetime import datetime

# user_id and actual_message are placeholders supplied by the calling code.

# WRONG: dynamic content inside the cached region, so the hit rate is 0% ❌
system_prompt = f"""
You are Hermes. Current time: {datetime.now()}.
User ID: {user_id}.
...static content...
"""

# CORRECT: keep cached content purely static ✓
system_prompt_static = """
You are Hermes, an AI assistant by NousResearch.
...only static content here...
"""
# Put dynamic info in the user message (not cached)
user_message = f"[Context: {datetime.now():%Y-%m-%d %H:%M} | User: {user_id}]\n\n{actual_message}"

26.4 Measured Cost Savings

Test Results by Message Length (Claude 3.5 Sonnet, 100 consecutive calls)

Scenario                    Fixed Overhead (tokens)   User Message (tokens)   Without Cache (per call)   With Cache (per call)   Savings
Minimal (10 tokens)         13,900                    10                      $0.0418                    $0.0063                 84.9%
Short (500 tokens)          13,900                    500                     $0.0459                    $0.0078                 83.0%
Typical (2,000 tokens)      13,900                    2,000                   $0.0537                    $0.0137                 74.5%
Long (5,000 tokens)         13,900                    5,000                   $0.0657                    $0.0248                 62.2%
Very long (10,000 tokens)   13,900                    10,000                  $0.0837                    $0.0428                 48.9%

Key insight: The shorter the user message (i.e., the higher the fixed overhead ratio), the greater the cache savings. This directly validates the 73% overhead finding: the bulk of the overhead is exactly the content best suited to caching.

Monthly Cost Comparison (1,000 calls/day, typical message length)

Without caching:
  Average input per call: ~15,900 tokens
  Input cost per call: $0.0477
  Monthly cost (30 days): $1,431

With Prompt Caching (hits cache from call 2 onward):
  Non-cached input per call: ~2,000 tokens (user message only)
  Cache read tokens: ~13,900 (billed at 10%)
  Input cost per call: $0.0102
  Monthly cost (30 days): $306

Savings: $1,125/month (78.6% reduction in input cost; output cost is unaffected by caching)
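
The figures above can be reproduced with a few lines of arithmetic. This sketch assumes Claude 3.5 Sonnet's input price of $3.00 per million tokens, counts input tokens only, and treats every call as a cache hit (the one-time write premium is ignored). The function and constant names are illustrative.

```python
INPUT_PRICE_PER_MTOK = 3.00   # USD per million input tokens (assumed)
CACHE_READ_RATE = 0.10        # cache reads are billed at 10% of input price


def input_cost_per_call(fixed_tokens: int, user_tokens: int, cached: bool) -> float:
    """Input cost in USD for one call, ignoring output tokens."""
    fixed_rate = CACHE_READ_RATE if cached else 1.0
    billed = user_tokens + fixed_tokens * fixed_rate
    return billed / 1_000_000 * INPUT_PRICE_PER_MTOK


calls_per_month = 1_000 * 30
# Round to four decimals, matching the per-call figures quoted above.
without = round(input_cost_per_call(13_900, 2_000, cached=False), 4)
with_cache = round(input_cost_per_call(13_900, 2_000, cached=True), 4)

print(f"Per call: ${without:.4f} vs ${with_cache:.4f}")
print(f"Monthly:  ${without * calls_per_month:,.0f} vs ${with_cache * calls_per_month:,.0f}")
print(f"Savings:  {1 - with_cache / without:.1%}")
```

Changing `user_tokens` shows how quickly the savings shrink as the uncached part of the request grows.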

Real-World Project Data (Anonymized)

Project                  Monthly Calls   Before   After    Savings
Code Assistant A         45,000          $2,840   $1,180   $1,660 (58.5%)
Customer Service Bot B   120,000         $6,200   $2,890   $3,310 (53.4%)
Data Analysis C          8,000           $1,120   $620     $500 (44.6%)

26.5 Stacking with Other Optimization Strategies

Combined Optimization Matrix

Strategy Combination         Effective Input (tokens)   Billed Tokens   Cost vs. Baseline
Baseline (no optimization)   15,900                     15,900          100%
Cache only                   15,900                     3,490           22%
Lazy tools only              11,900                     11,900          74.8%
Cache + lazy tools           11,900                     2,090           13.1%
All strategies combined      9,500                      1,750           11.0%
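
The last column is billed tokens divided by the 15,900-token baseline. A short sketch, using the figures from the matrix as given, makes that relationship explicit; the dictionary and variable names are illustrative.

```python
BASELINE_BILLED = 15_900

# Billed-token figures copied from the matrix above.
billed_tokens = {
    "Baseline (no optimization)": 15_900,
    "Cache only": 3_490,
    "Lazy tools only": 11_900,
    "Cache + lazy tools": 2_090,
    "All strategies combined": 1_750,
}

# Cost relative to baseline is the ratio of billed tokens.
relative_cost = {name: tokens / BASELINE_BILLED
                 for name, tokens in billed_tokens.items()}

for name, ratio in relative_cost.items():
    print(f"{name:<28} {ratio:6.1%}")
```

Substituting your own measured billed-token counts gives the same comparison for a different workload.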

Cost Optimization Decision Flow

Current monthly bill?
  │
  ├─► < $50 → Don't optimize yet; revisit when scale grows
  │
  ├─► $50–$500 →
  │     First: Configure Prompt Caching (1 day effort, saves 50%+)
  │     Then: Slim system prompt (half day, saves 10–20%)
  │
  └─► > $500 →
        Full optimization:
        1. Prompt Caching (must do)
        2. Lazy tool loading (high ROI)
        3. Smart memory retrieval (medium ROI)
        4. History compression (high ROI)
        5. Monitoring dashboard (continuous improvement)
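
The flow above translates directly into a small lookup function. This is a sketch that encodes the thresholds as listed. `recommend_optimizations` is a hypothetical name, and the treatment of the boundary values ($50 and $500 both fall in the middle tier) is an assumption, since the flow does not specify it.

```python
def recommend_optimizations(monthly_bill_usd: float) -> list[str]:
    """Return the optimization steps suggested for a given monthly bill."""
    if monthly_bill_usd < 50:
        return []  # not worth optimizing yet; revisit when scale grows
    if monthly_bill_usd <= 500:
        return ["Prompt Caching", "Slim system prompt"]
    return [
        "Prompt Caching",
        "Lazy tool loading",
        "Smart memory retrieval",
        "History compression",
        "Monitoring dashboard",
    ]


for bill in (20, 300, 2_000):
    print(f"${bill}: {recommend_optimizations(bill) or 'no action yet'}")
```

Prompt Caching comes first in every tier that recommends any action, which matches its position in the flow.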

26.6 Summary

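Prompt Caching is the single highest-ROI optimization available for Hermes Agent cost reduction. Cache reads are billed at 10% of the input price, configuration amounts to marking breakpoints on static content, and the projects reported in 26.4 cut their monthly bills by 44.6% to 58.5%.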


Discussion Questions

  1. The 5-minute TTL is a double-edged sword. For batch processing (many tasks at once), how would you design the call cadence to maximize cache hit rate?

  2. If your system prompt needs updating every hour (e.g., injecting fresh business data), does Prompt Caching still make sense? How do you balance the necessity of dynamic content against caching benefits?

  3. In multi-user concurrent scenarios, is the cache shared or per-user? What does this mean for your system design?

  4. Cache writes cost 25% more than regular input. Design a mathematical model: what is the minimum number of cache hits required for Prompt Caching to become net-positive?
