Description

Reduce LLM API token consumption by 20-35% through pre-send estimation, memory extraction, and context compression.

README (SKILL.md)

Token Optimizer

Name: Claude Code API Optimizer Skill
Author: playdadev

Reduce your LLM API costs by 20-35% with three proven mechanisms: pre-send token estimation, structured memory extraction, and context compression. Model-agnostic, zero dependencies.

Mechanism 1 — Pre-Send Token Estimation

Estimate token count before sending a request. If the payload exceeds a threshold, compress or truncate it. Never pay for tokens you could have avoided.

Rules

Estimate before every API call. Use these formulas:
- Plain text: tokens ≈ character_count / 4
- JSON / structured data: tokens ≈ character_count / 2
- Code (mixed): tokens ≈ character_count / 3.5
- Images / PDFs: tokens ≈ 2000 (flat per asset, regardless of size)
Set a token budget per request. Default threshold: 8 000 tokens. Adjust per use case.
If estimated tokens exceed the budget:
- Summarize or truncate the longest sections first.
- Strip intermediate reasoning, keep conclusions only.
- For JSON: remove null/empty fields, shorten keys if feeding to a model that doesn't need human-readable keys.
- For code: send only the relevant function/class, not the full file.
Log the estimate vs. actual usage (from the API response) to calibrate over time.

Example

Input: 24,000 characters of plain text
Estimated tokens: 24000 / 4 = 6,000 → under budget, send as-is.

Input: 40,000 characters of JSON
Estimated tokens: 40000 / 2 = 20,000 → over budget.
Action: strip null fields, remove redundant nested objects → 14,000 chars → 7,000 tokens → send.

Reference

See references/token-formula.md for the full formula breakdown with worked examples.

Mechanism 2 — Memory Extraction

Instead of re-reading the entire conversation history every turn, extract and persist key information into structured memory files. On subsequent turns, load only the memory index — not the raw history.

Rules

Use a lightweight secondary model (Haiku, GPT-4o-mini, Gemini Flash) as the memory extraction agent. Never burn expensive model tokens on bookkeeping.
Maintain a session cursor. Track which messages have already been processed. On each extraction pass, only read new messages since the last cursor position.
Limit extraction to 5 rounds max per session. Each round processes a batch of new messages. Stop early if no new information is found.
Parallelize I/O within rounds:
- Round 1: all reads in parallel (gather raw content).
- Round 2: all writes in parallel (persist extracted memories).
Structure memory as index + detail files:
- MEMORY.md — index file, max 200 lines. Contains only pointers: - [topic-name](memory/topic-name.md) — one-line description.
- memory/topic-name.md — full content for each topic with frontmatter (name, description, type).
Memory types (categorize each entry):
- user — who the user is, their preferences, expertise level.
- feedback — corrections and confirmed approaches (what to do / not do).
- project — current goals, deadlines, decisions, constraints.
- reference — pointers to external resources (URLs, dashboards, issue trackers).
Do not store what can be derived. No code snippets, no git history, no file paths — these are always available from the source. Store only non-obvious context.

Example — Extraction Prompt

You are a memory extraction agent. Read the following new messages (since cursor position {cursor}).

For each piece of non-obvious information, output a JSON object:
{
  "topic": "short-kebab-case-name",
  "type": "user | feedback | project | reference",
  "description": "one-line summary for the index",
  "content": "full memory content, structured with Why and How-to-apply"
}

Rules:
- Max 5 memories per pass.
- Skip anything derivable from code, git, or existing memory.
- Convert relative dates to absolute (today is {date}).
- If a memory already exists for this topic, output an update, not a duplicate.

Reference

See references/memory-extraction-pattern.md for the full pattern with prompt templates.

Mechanism 3 — Context Compression

As conversations grow, compress older exchanges into dense summaries. Keep only the last N messages in full fidelity. This prevents context windows from filling with stale reasoning.

Rules

Keep the last 6 messages uncompressed (3 user + 3 assistant). These are "fresh" — they contain active context.

Summarize everything older into a single \x3Ccompressed-context> block at the top of the conversation. Format:

\x3Ccompressed-context>
## Decisions Made
- Chose PostgreSQL over MongoDB for the user table (reason: relational queries).
- API rate limit set to 100 req/min per user.

## Current State
- Auth module: complete, merged to main.
- Payment integration: in progress, blocked on Stripe webhook config.

## Key Constraints
- Must ship by 2026-04-15.
- No breaking changes to public API v2.
\x3C/compressed-context>

What to keep in summaries:
- Decisions and their rationale.
- Current state of work (done / in-progress / blocked).
- Constraints and deadlines.
- User preferences and corrections.
What to discard:
- Intermediate reasoning ("I considered X but...").
- Exploratory questions that were already answered.
- Tool call details (file reads, grep results, build output).
- Repeated or superseded information.
Trigger compression when the conversation exceeds 60% of the model's context window. Use Mechanism 1's estimation formula to check.
Never compress system prompts or skill instructions. These must remain intact.

Example — Savings Calculation

Before compression:
  42 messages, ~32,000 tokens total.

After compression:
  Compressed block: ~2,000 tokens.
  Last 6 messages: ~4,500 tokens.
  Total: ~6,500 tokens.

  Savings: 32,000 - 6,500 = 25,500 tokens (80% reduction on history).
  Per-request savings (ongoing): ~25,500 tokens × $0.003/1K = $0.077 per request.

Combined Savings Estimate

Mechanism	Typical Savings	When It Hits
Pre-send estimation	10-15%	Every request with large payloads
Memory extraction	5-10%	Multi-session workflows
Context compression	15-25%	Long conversations (>20 messages)
Combined	20-35%	Sustained usage over a session

These are conservative estimates based on real-world agent workflows. Actual savings depend on conversation length, payload sizes, and how aggressively you compress.

Quick Start

Copy this skill into your agent's skill directory (or paste SKILL.md into your system prompt).
Apply Mechanism 1 immediately — add token estimation before your API calls.
Set up Mechanism 2 if you run multi-turn or multi-session workflows.
Enable Mechanism 3 for any conversation that runs beyond 15-20 messages.

No code to install. No dependencies. Just rules your agent follows.

Usage Guidance

This skill's instructions implement sensible token-reduction techniques, but there are several gaps you should resolve before installing: (1) The SKILL.md tells the agent to create and update memory files (MEMORY.md, memory/*.md) and to log token usage, yet the package metadata declares no storage/config paths — ask the author where files will be stored and whether they respect your workspace boundaries. (2) The SKILL.md references local reference files (references/*.md) that are not included; confirm whether the skill requires additional files or templates. (3) The doc is truncated near the end; request the full spec to ensure no hidden steps. (4) Because the skill persists user/project preferences and reference URLs, verify data-retention and privacy rules (avoid storing sensitive secrets or code excerpts). (5) Test the skill in a restricted/sandbox environment first and require explicit configuration options for storage location, logging behavior, and which secondary models or credentials it may use.

Capability Analysis

Type: OpenClaw Skill Name: claude-code-api-optimizer-skill Version: 1.0.0 The skill bundle consists of markdown instructions (SKILL.md) designed to help an AI agent optimize token usage through heuristic estimation, structured memory extraction, and context compression. It contains no executable code, external network requests, or instructions to access sensitive system information, and its logic is entirely consistent with its stated purpose of cost reduction and context management.

Capability Assessment

ℹ Purpose & Capability

The high-level purpose (pre-send estimation, memory extraction, context compression) is coherent with reducing LLM token usage. The mechanisms described are reasonable techniques for this goal. However, the skill text expects the agent to persist structured memory files (MEMORY.md and memory/*.md) and to write logs of estimates vs actual usage — the registry metadata declares no required config paths or storage expectations. Also several internal reference files (references/token-formula.md, references/memory-extraction-pattern.md) are referenced but are not present in the package manifest. This discrepancy between claimed zero-dependency/instruction-only and the expectation of persistent files and reference docs is noteworthy.

⚠ Instruction Scope

SKILL.md instructs the agent to read conversation history, extract non-obvious user/project/reference information, maintain a session cursor, create and update MEMORY.md and memory/topic-name.md files, and log estimate vs actual token usage. Those are file I/O and persistent-state operations even though the skill metadata declared no config paths. The file references and prompt templates referenced (references/*.md) are missing from the provided files. The SKILL.md is also truncated near the end ("Never comp…[truncated]") which leaves behavior unspecified. These gaps could lead an agent to perform unexpected file reads/writes or to ask for unspecified storage locations — both behaviors to surface before install.

✓ Install Mechanism

No install spec and no code files are present, which minimizes supply-chain risk. This instruction-only format is lower risk than downloading and executing remote archives. However, the skill assumes ability to persist files and access a secondary model; those are runtime capabilities rather than install-time artifacts and should be confirmed in the deployment environment.

ℹ Credentials

The skill requests no environment variables or credentials in metadata, which aligns with its stated purpose. But runtime instructions expect use of "a lightweight secondary model (Haiku, GPT-4o-mini, Gemini Flash)" and logging of API usage; the skill does not explain which model endpoints, API keys, or storage backends will be used. If the agent implements this, it may need access to credentials or storage locations not declared here. The lack of declared config/credential requirements is an omission to clarify.

⚠ Persistence & Privilege

The skill describes maintaining a session cursor, creating/updating MEMORY.md and memory/*.md files, and logging estimates vs actual usage — all persistent operations. Yet the skill metadata does not declare config paths or file access requirements. Persistent state combined with autonomous invocation (default model invocation allowed) increases blast radius if misused. The skill's 'always' flag is false (good), but the mismatch between declared capabilities and the instruction's persistence needs is a risk factor.

Version History

v1.0.0

Initial release — intelligently reduce LLM API token consumption for significant cost savings. - Introduces 3 mechanisms: pre-send token estimation, memory extraction, and context compression. - Model-agnostic, no dependencies required. - Provides detailed rules, prompts, and examples for each mechanism. - Achieves estimated 20–35% token cost reduction in typical agent workflows. - Includes quick start instructions for immediate application.

Metadata

Slug claude-code-api-optimizer-skill

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Claude Code API Optimizer Skill?

Reduce LLM API token consumption by 20-35% through pre-send estimation, memory extraction, and context compression. It is an AI Agent Skill for Claude Code / OpenClaw, with 103 downloads so far.

How do I install Claude Code API Optimizer Skill?

Run "/install claude-code-api-optimizer-skill" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Claude Code API Optimizer Skill free?

Yes, Claude Code API Optimizer Skill is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Claude Code API Optimizer Skill support?

Claude Code API Optimizer Skill is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Claude Code API Optimizer Skill?

It is built and maintained by Playda (@playdadev); the current version is v1.0.0.

More Skills

Claude Code API Optimizer Skill

Token Optimizer

Mechanism 1 — Pre-Send Token Estimation

Rules

Example

Reference

Mechanism 2 — Memory Extraction

Rules

Example — Extraction Prompt

Reference

Mechanism 3 — Context Compression

Rules

Example — Savings Calculation

Combined Savings Estimate

Quick Start

What is Claude Code API Optimizer Skill?

How do I install Claude Code API Optimizer Skill?

Is Claude Code API Optimizer Skill free?

Which platforms does Claude Code API Optimizer Skill support?

Who created Claude Code API Optimizer Skill?

💬 Comments