Prompt Caching: Design for Reuse

Anthropic's prompt caching lets you reuse static context across API calls—tool definitions, system instructions, reference documents—reducing both costs and latency for repeated context. The key: structure prompts with reusable content first, variable data last.

TL;DR

Put static content (tools, instructions, examples) in the cacheable prefix. Add cache breakpoints with cache_control. Keep dynamic data (user queries, timestamps) in the suffix. Cache writes cost more initially, reads are significantly cheaper on subsequent calls. Minimum 1024 tokens for most models, 5-minute TTL refreshes on each use.

How It Works

Cache prefixes are created in order: tools, system, then messages. Mark the end of cacheable content with cache_control:

python
from anthropic import Anthropic
 
client = Anthropic()
user_error = "TypeError: 'NoneType' object is not subscriptable"  # example input
 
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert Python debugger with access to these tools..."
        },
        {
            "type": "text",
            "text": "# Tool Documentation\n[5000 lines of tool docs]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"Debug this error: {user_error}"
        }
    ]
)
 
# Check cache usage
print(response.usage.cache_creation_input_tokens)  # First call
print(response.usage.cache_read_input_tokens)      # Subsequent calls

First call writes to cache. Subsequent calls read from cache at a significantly reduced cost.
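Because tools come first in the cached prefix, tool definitions can be cached the same way: a breakpoint on the last tool covers the entire tools array. A minimal sketch (the tool names and schemas are illustrative, not from a real project):

```python
# A breakpoint on the final tool caches the whole tools array,
# since tools precede system and messages in the cache prefix.
tools = [
    {
        "name": "run_tests",
        "description": "Run the project's test suite and return failures.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "read_file",
        "description": "Read a source file by path.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        # Marking only the last tool is enough; everything above it is included.
        "cache_control": {"type": "ephemeral"},
    },
]
```

Only the final tool needs the marker; adding one per tool would waste breakpoints (you get at most four per request).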

Design Patterns

Static Prefix, Dynamic Suffix

python
from datetime import datetime

# Bad: dynamic content rebuilds the prefix every call, so nothing is reused
system = f"""You are a coding assistant.
Current time: {datetime.now()}
User timezone: {user.timezone}
 
Available tools:
{tool_definitions}
"""
 
# Good: static content cached, dynamic in messages
system = [
    {
        "type": "text",
        "text": f"""You are a coding assistant.
 
Available tools:
{tool_definitions}"""
    },
    {
        "type": "text",
        "text": reference_documentation,
        "cache_control": {"type": "ephemeral"}
    }
]
 
messages = [
    {
        "role": "user",
        "content": f"Time: {datetime.now()}, TZ: {user.timezone}\n\nQuery: {query}"
    }
]

Timestamps and user-specific data go in messages, not system prompt.
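One way to enforce this split mechanically is a small request builder that only ever puts per-request data in the user message. The function name and parameters below are illustrative, not part of the SDK:

```python
from datetime import datetime


def build_request(static_system: str, reference_docs: str,
                  query: str, timezone: str) -> dict:
    """Assemble system/messages so only truly static text sits in the
    cacheable system prefix; volatile values go in the user message."""
    return {
        "system": [
            {"type": "text", "text": static_system},
            {
                "type": "text",
                "text": reference_docs,
                # Breakpoint after the last static block.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            {
                "role": "user",
                # Volatile data lives here, so the prefix stays byte-identical.
                "content": f"Time: {datetime.now().isoformat()}, "
                           f"TZ: {timezone}\n\nQuery: {query}",
            }
        ],
    }


req = build_request("You are a coding assistant.", "[reference docs]",
                    "Fix the bug", "UTC")
```

Routing every request through one builder makes it hard to accidentally interpolate a timestamp into the system prompt.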

Multi-Level Caching

Use up to 4 cache breakpoints for different stability levels:

python
system = [
    {
        "type": "text",
        "text": "Core instructions that never change"
    },
    {
        "type": "text",
        "text": "Tool definitions (updated weekly)",
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": "Project context (updated daily)",
        "cache_control": {"type": "ephemeral"}
    }
]

Each breakpoint caches the entire prefix before it: the first covers the stable core plus tool definitions, the second adds the daily project context. A daily update then invalidates only the outer layer while the inner cache keeps hitting.
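To judge whether a breakpoint pays for itself, a back-of-envelope estimator helps. The write/read multipliers below are illustrative defaults, not quoted prices; check current pricing before relying on them:

```python
def caching_savings(prefix_tokens: int, calls: int,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of a cached prefix across `calls` requests, as a fraction of
    resending it uncached every time. One write, then (calls - 1) reads.
    Multipliers are placeholder assumptions."""
    uncached = prefix_tokens * calls
    cached = prefix_tokens * (write_mult + read_mult * (calls - 1))
    return cached / uncached
```

With these assumed multipliers, a single call costs more than not caching at all (the write premium), while ten calls drop the prefix cost to roughly a fifth, which is why caching only makes sense for prefixes that are actually reused.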

Conversation History

Cache conversation history to speed up multi-turn conversations:

python
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Second question"},
    {"role": "assistant", "content": "Second answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Third question",
                # cache_control attaches to content blocks, not the message
                "cache_control": {"type": "ephemeral"}
            }
        ]
    }
]

Everything up to and including the breakpoint is cached, so the next turn reads the full conversation history from cache and only the newest content is processed fresh. This pattern also works well alongside Agent Skills loaded in the system prompt.
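In a real conversation loop the breakpoint has to move forward each turn: strip any stale marker and re-attach it to the latest message. A sketch of such a helper (the function name is illustrative):

```python
def set_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Move the ephemeral cache breakpoint to the final message,
    normalizing string content into text blocks so cache_control
    can attach to a content block."""
    out = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        else:
            # Copy blocks and clear markers left over from earlier turns.
            content = [dict(block) for block in content]
            for block in content:
                block.pop("cache_control", None)
        if i == len(messages) - 1:
            content[-1]["cache_control"] = {"type": "ephemeral"}
        out.append({"role": msg["role"], "content": content})
    return out


history = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Second question"},
]
cached = set_cache_breakpoint(history)
```

Keeping exactly one marker at the tail avoids burning through the four-breakpoint budget as the conversation grows.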

Requirements and Limits

Minimum cache size:

  • Claude Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Sonnet 3.5, Opus 3: 1024 tokens
  • Claude Haiku 4.5, Haiku 3.5, Haiku 3: 2048 tokens

Cache lifetime:

  • Default: 5 minutes (refreshes on each use)
  • Extended: 1 hour (available at additional cost)

Breakpoints: Up to 4 cache breakpoints per request
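Content below the minimum silently won't cache, so a guard that estimates token count before attaching a marker avoids wasting breakpoints. The 4-characters-per-token heuristic is a rough assumption, not an exact tokenizer:

```python
def maybe_cache(block: dict, min_tokens: int = 1024) -> dict:
    """Attach cache_control only if the text plausibly clears the model's
    minimum cacheable size (rough heuristic: ~4 chars per token)."""
    est_tokens = len(block.get("text", "")) / 4
    if est_tokens >= min_tokens:
        return {**block, "cache_control": {"type": "ephemeral"}}
    return block


small = maybe_cache({"type": "text", "text": "short instruction"})
big = maybe_cache({"type": "text", "text": "x" * 5000})
```

For Haiku models, pass `min_tokens=2048` to match their higher minimum.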

When to Use Caching

Use prompt caching when:

  • Tool definitions or documentation exceed 1024 tokens and stay constant
  • Building conversational agents that maintain context across turns
  • Processing multiple files/queries with the same instructions
  • Agent Skills or reference material is loaded on every request
  • Making multiple requests with the same prefix content

Skip caching when:

  • Context changes every request
  • Total cached content under 1024 tokens (won't cache)
  • Single-use prompts
  • Prefix varies across requests