Chapter 4

Service Tiers and Rate Limits: Complete Guide to Priority / Standard / Batch Three-Tier SLA

Chapter 4: Prompt Design Fundamentals: The Golden Structure of Role, Instructions, and Context

4.1 Why Prompt Engineering Is Worth Systematic Study

Prompt design is not about finding magic phrases that make an AI comply. It is a discipline with reproducible patterns, measurable outcomes, and significant economic consequences. A well-designed prompt can bring Haiku's output quality close to Sonnet on many tasks—and a poorly designed prompt can make even Opus unreliable.

Anthropic's internal research has found that prompt quality can affect output results more than model tier selection. On a range of tasks, two hours spent optimizing a prompt yields larger quality gains than upgrading from Sonnet to Opus. This chapter covers the structural patterns that move the needle most reliably.

The core framework is the golden triangle: Role, Instructions, and Context. Every effective system prompt combines all three.

4.2 System Prompts vs User Messages

Claude's input structure has two distinct slots:

System Prompt
  ↓ Higher priority — defines global behavior
User / Assistant Messages
  ↓ Lower priority — task-specific requests

The system prompt establishes behavior that cannot be overridden by the user message. If the system prompt says "respond only in formal English," and the user writes in casual French, Claude will respond formally in English. This asymmetry is what allows developers to build products with reliable behavioral boundaries.

What Belongs in the System Prompt

A well-structured system prompt typically contains:

1. Role definition (who you are)
2. Core behavioral instructions (what to do / not do)
3. Output format requirements
4. Rules for handling edge cases
5. Examples (optional but often worth the token cost)

System Prompt Token Economics

Every system prompt token is billed on every request. At scale this matters:

System prompt size: 2,000 tokens
Daily requests:     10,000
Daily extra cost (Sonnet): 2,000 × 10,000 / 1,000,000 × $3 = $60/day
Annual extra cost:  ~$21,900

Guideline: 500–1,500 tokens covers most well-designed system prompts. Complex enterprise systems may justify up to 3,000. Beyond that, every token needs a clear reason to exist.

4.3 Role Definition: The Most Underestimated Technique

Why Role Setting Works

Giving Claude a specific persona activates relevant knowledge patterns from training. When Claude is told it is "a senior distributed systems engineer," it statistically samples from the portion of its training data associated with how distributed systems engineers think, write, and reason. This is not a trick—it is probabilistic activation of relevant knowledge.

Ineffective vs Effective Role Definitions

Ineffective (too vague to change behavior):

You are a helpful AI assistant.

This is Claude's default. It provides no useful signal.

Effective (specific, behavior-guiding):

You are a senior backend engineer at TechCorp with 10 years of experience
in distributed systems, specializing in the Python and Go ecosystems.

When analyzing technical problems, you:
- Identify root causes, not just symptoms
- Consider scalability implications for production environments
- Provide actionable, specific recommendations rather than general principles
- When multiple solutions exist, explain the trade-offs of each

Role Templates for Common Scenarios

Technical advisor:

You are [Company]'s [Title], specializing in [domain].
Your audience is [audience description].
You always [specific behavior].
You never [prohibited behavior].

Content analyst:

You are a professional [domain] analyst with [experience characteristics].
Your analytical style is [style: e.g., data-driven, concise].
You excel at [specific capability] and in [specific situation] you [specific behavior].

Customer service representative:

You are [Brand]'s customer success specialist [Name].
You are warm, professional, and deeply knowledgeable about [product/service].
You always understand the customer's core problem before offering a solution.
When a problem is outside your ability to resolve, you clearly state this and
provide the escalation path.

4.4 Instructions: Precision Beats Verbosity

Three Ways to Write Instructions (Ranked by Effectiveness)

Tier 3: Descriptive (weakest)

"Answer in a professional tone and provide helpful information."

Too vague. Claude interprets "professional" and "helpful" using its own defaults, which may not match yours.

Tier 2: Prohibition lists (medium)

"Don't use jargon. Don't be too long. Don't go off-topic."

Telling a model what not to do is harder to enforce than telling it what to do.

Tier 1: Positive, specific behavioral instructions (strongest)

"Each response must contain exactly:
1. A one-sentence direct answer to the question
2. An explanation in no more than 3 bullet points
3. One concrete real-world example
Total length: under 300 words."

Explicit Priority Ordering

When instructions can conflict, specify the hierarchy explicitly:

Response rules (in priority order):

HIGHEST PRIORITY — never override:
- Never provide information that could enable physical harm
- Never claim to be a human being if sincerely asked

HIGH PRIORITY:
- Responses must directly address the user's question
- Use the user's language (reply in Chinese if asked in Chinese)

NORMAL PRIORITY:
- Use Markdown formatting for responses over 100 words
- Include code examples where they add clarity

Handling Edge Cases in the Prompt

Great prompts anticipate edge cases rather than leaving Claude to guess:

SYSTEM = """You are a code review assistant.

For each code snippet:
1. Identify bugs and logic errors
2. Flag performance issues
3. Provide concrete improvement suggestions

Edge cases:
- Code under 5 lines with no obvious issues: confirm it looks correct, skip deep analysis
- Security vulnerabilities (SQL injection, XSS, etc.): prefix the response with [SECURITY WARNING]
- Unrecognized language: ask the user what language it is
- Non-code submission: politely explain that you handle code review requests only

Output format:
**Summary**: [one sentence]

**Issues Found**:
1. [description + line reference]

**Revised Code**:
[code block]

**Explanation**:
[what changed and why]
"""

4.5 Context: Give Claude What It Needs to Succeed

The Information Completeness Principle

Claude only knows what you tell it. If your request depends on background information you haven't provided, Claude will either guess (risky) or ask (slow). The rule: if a reasonable human expert would need to ask a clarifying question before starting, include that information in the prompt.

Context-starved request:

"Fix the bug in this function."

Claude doesn't know the function's intended behavior, the failing test cases, or the constraints.

Context-complete request:

"Fix the bug in the following function.

Expected behavior: Flatten a nested dict with dot-separated keys.
Example: {"a": {"b": 1}} → {"a.b": 1}

Code:
[code block]

Failing test case:
flatten({"a": {"b": {"c": 1}}}) returns {"a": {"b.c": 1}} but should return {"a.b.c": 1}
"

Context Placement Strategy

In longer prompts, position matters. Models have stronger attention to the beginning and end of the input:

Best placement pattern:
┌────────────────────────────────────────┐
│ Critical constraints / most important  │  ← Start of message (always attended to)
├────────────────────────────────────────┤
│ Background information / references    │  ← Middle (attended to, but weaker)
├────────────────────────────────────────┤
│ The specific task                      │  ← End of message (recency effect)
└────────────────────────────────────────┘

XML Tags for Organizing Complex Context

XML tags help Claude distinguish different types of information in a long prompt, and they appear naturally in many of Anthropic's own prompt examples:

def build_analysis_prompt(
    document: str,
    requirements: list[str],
    examples: list[dict]
) -> str:
    example_blocks = "\n".join(
        f"<example>\n"
        f"  <input>{e['input']}</input>\n"
        f"  <output>{e['output']}</output>\n"
        f"</example>"
        for e in examples
    )

    req_list = "\n".join(f"- {r}" for r in requirements)

    return f"""Analyze the following document and extract key information.

<requirements>
{req_list}
</requirements>

<examples>
{example_blocks}
</examples>

<document>
{document}
</document>

Based on the requirements and examples above, extract the key information."""

4.6 Few-Shot Examples: Show, Don't Just Tell

Why Examples Outperform Descriptions

Concrete examples convey desired output format and quality more precisely than abstract descriptions. A formatting requirement that takes 200 words to describe can often be communicated with one well-chosen example. The model has seen millions of (input, output) pairs in training; adding explicit examples shifts its distribution toward what you actually want.

Principles for Constructing Good Examples

Representativeness: Cover the typical task range, not just the easiest case
Diversity: Include edge cases and varied input types
Accuracy: Examples must be exactly the output you want—Claude will mirror them
Quantity: 3–5 examples are usually sufficient; more examples consume context without proportional gain

SYSTEM = """You are a sentiment classifier. Analyze text sentiment and return structured results.

Examples:

Input: "The service was incredibly slow. Waited 40 minutes for our food. Very disappointed."
Output: {"sentiment": "negative", "intensity": 0.8, "aspects": ["service", "wait_time"], "confidence": 0.95}

Input: "It was okay, nothing special but no real problems either."
Output: {"sentiment": "neutral", "intensity": 0.3, "aspects": ["overall"], "confidence": 0.85}

Input: "The food was amazing! Especially their signature dish. Highly recommended!"
Output: {"sentiment": "positive", "intensity": 0.9, "aspects": ["food", "recommendation"], "confidence": 0.97}

Schema:
- sentiment: "positive" | "negative" | "neutral" | "mixed"
- intensity: 0.0–1.0 (strength of sentiment)
- aspects: list of topics the text addresses
- confidence: 0.0–1.0 (classifier confidence)

Return only valid JSON. No explanation."""

Dynamic Few-Shot Selection

For complex tasks, selecting examples that are semantically similar to the current input produces better results than static examples:

def select_examples(query: str, pool: list[dict], n: int = 3) -> list[dict]:
    """
    Select the n most relevant examples for a given query.
    Production implementation should use embedding-based similarity.
    This version uses keyword overlap as a lightweight approximation.
    """
    query_tokens = set(query.lower().split())
    scored = [
        (len(query_tokens & set(ex["input"].lower().split())), ex)
        for ex in pool
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [ex for _, ex in scored[:n]]

4.7 Chain-of-Thought Prompting

When CoT Provides the Largest Gains

Chain-of-thought prompting instructs the model to show its reasoning steps before stating a conclusion. This is most impactful for:

Mathematical and logical reasoning
Multi-step problem decomposition
Decisions requiring weighing multiple factors
Code debugging (tracing execution step by step)

For simple, pattern-matching tasks (classification, extraction), CoT adds latency without meaningful quality gain.

Three CoT Implementations

Simple trigger (works well, lowest overhead):

"Think step by step before giving your answer."

Structured reasoning framework:

"Analyze using these steps:
1. Problem decomposition: break the problem into sub-problems
2. Information identification: list known facts and unknowns
3. Method selection: choose an approach and explain why
4. Execution: work through the solution step by step
5. Verification: check your answer makes sense
6. Conclusion: state the final answer clearly"

XML-tagged reasoning (clearest separation):

def cot_prompt(problem: str) -> list[dict]:
    return [{
        "role": "user",
        "content": f"""Analyze the following problem. Show your reasoning inside
<thinking> tags, then give your final answer inside <answer> tags.

<problem>
{problem}
</problem>"""
    }]

CoT vs Extended Thinking

CoT (prompt-level):
  • Instructions in the prompt ask the model to show reasoning
  • Reasoning appears in the output text (billed as output tokens)
  • Works on all models
  • User sees the reasoning inline

Extended Thinking (API-level):
  • Model reasons internally before generating the final response
  • Thinking tokens are billed at output prices, in a separate block
  • Only available on Opus and Sonnet
  • Generally more effective for hard reasoning tasks
  • Thinking content appears in a separate content block

For hard reasoning tasks where you use Opus or Sonnet, Extended Thinking typically outperforms prompt-level CoT. For Haiku, use prompt-level CoT.

4.8 The Complete Golden Structure Template

Combining all the above into a production-grade system prompt builder:

def build_system_prompt(
    role: str,
    expertise: str,
    audience: str,
    behaviors: list[str],
    output_format: str,
    constraints: list[str],
    examples: list[dict] | None = None,
) -> str:
    behavior_list = "\n".join(f"{i+1}. {b}" for i, b in enumerate(behaviors))
    constraint_list = "\n".join(f"- {c}" for c in constraints)

    examples_section = ""
    if examples:
        blocks = []
        for ex in examples:
            blocks.append(f"Input: {ex['input']}\nExpected output: {ex['output']}")
        examples_section = "\n\n## Examples\n\n" + "\n\n".join(blocks)

    return f"""## Role

You are {role}, {expertise}. Your users are {audience}.

## Core Behaviors

{behavior_list}

## Output Format

{output_format}

## Constraints

{constraint_list}{examples_section}
"""

# Usage
system = build_system_prompt(
    role="CodeReviewBot",
    expertise="a code quality analyst specializing in Python and TypeScript",
    audience="mid-to-senior software engineers",
    behaviors=[
        "Identify bugs, performance issues, and security vulnerabilities",
        "Provide specific, actionable improvement suggestions",
        "Explain the potential impact of each issue",
        "Distinguish between must-fix issues and optional improvements",
    ],
    output_format="""Use this structure:
**Critical Issues** [must fix before shipping]
**Suggestions** [best practices, not required]
**Revised Code** [when applicable]""",
    constraints=[
        "Review only the code; do not comment on the developer's skill level",
        "If code has no issues, say so directly before offering optional optimizations",
        "Keep responses under 600 words",
    ],
)

4.9 Anti-Patterns to Avoid

Anti-Pattern 1: Vague Expectations

❌ "Write some marketing copy."
✅ "Write 3 Instagram captions for [Product] targeting urban professionals
   aged 25–35. Each caption: 60–80 words, emphasizes time-saving, 
   tone is conversational not corporate, includes one call to action."

Anti-Pattern 2: Conflicting Instructions

❌ "Be detailed but brief. Be professional but conversational."
✅ "Write for a non-technical audience. Use everyday analogies for
   technical concepts. Maximum 150 words. Define any technical term
   you must include."

Anti-Pattern 3: Implicit Assumptions

❌ "Write me an apology email." (no recipient, no reason, no tone guidance)
✅ "Write a business apology email to a client (formal relationship).
   Reason: project delayed by 2 weeks. Tone: sincere but not groveling.
   Include a specific remediation plan. Length: 200–300 words."

Anti-Pattern 4: Rules Overload

Every instruction consumes attention. A system prompt with 40+ rules tends to produce worse overall compliance than one with 8–10 clear rules. The model's attention is finite; spreading it across too many constraints causes each one to be followed less reliably.

Rule of thumb: if you have more than 15 behavioral rules, consolidate or eliminate until you have the 8–10 that genuinely change outcomes.

Summary

Effective prompts are not longer prompts—they are more precisely structured prompts. The golden triangle:

Role: A specific expert description activates relevant knowledge patterns; vague descriptions produce vague outputs
Instructions: Positive, specific behavioral directives outperform descriptions and prohibition lists; explicit priority ordering prevents conflicts
Context: Provide the background a human expert would need; position important information at the start and end; use XML tags for complex multi-part context
Few-shot examples: Demonstrate the output you want; 3–5 representative examples usually suffice
Chain-of-thought: Explicitly instruct step-by-step reasoning for complex tasks; consider Extended Thinking for Opus/Sonnet on hard problems

The next chapter examines token economics in depth: how billing works, context window constraints, and strategies for maintaining quality on long-document tasks without blowing the budget.

Rate this chapter

4.5 / 5 (109 ratings)