Prompt Caching: Design for Reuse

Anthropic's prompt caching lets you reuse static context across API calls—tool definitions, system instructions, reference documents—reducing both costs and latency for repeated context. The key: structure prompts with reusable content first, variable data last.

TL;DR

Put static content (tools, instructions, examples) in the cacheable prefix. Add cache breakpoints with cache_control. Keep dynamic data (user queries, timestamps) in the suffix. Cache writes cost more initially, reads are significantly cheaper on subsequent calls. Minimum 1024 tokens for most models, 5-minute TTL refreshes on each use.

How It Works

Cache prefixes are created in order: tools, system, then messages. Mark the end of cacheable content with cache_control:

python
from anthropic import Anthropic
 
client = Anthropic()
user_error = "TypeError: 'NoneType' object is not subscriptable"  # example input
 
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert Python debugger with access to these tools..."
        },
        {
            "type": "text",
            "text": "# Tool Documentation\n[5000 lines of tool docs]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"Debug this error: {user_error}"
        }
    ]
)
 
# Check cache usage
print(response.usage.cache_creation_input_tokens)  # First call
print(response.usage.cache_read_input_tokens)      # Subsequent calls

First call writes to cache. Subsequent calls read from cache at a significantly reduced cost.
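Because tools come first in the cached prefix, tool definitions can be cached the same way: a breakpoint on the last tool covers the entire tools array. A minimal sketch (the tool names and schemas are illustrative, not from a real project):

```python
# A breakpoint on the final tool caches the whole tools array,
# since tools precede system and messages in the cache prefix.
tools = [
    {
        "name": "run_tests",
        "description": "Run the project's test suite and return failures.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "read_file",
        "description": "Read a source file by path.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        # Marking only the last tool is enough; everything above it is included.
        "cache_control": {"type": "ephemeral"},
    },
]
```

Only the final tool needs the marker; adding one per tool would waste breakpoints (you get at most four per request).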

Design Patterns

Static Prefix, Dynamic Suffix

python
from datetime import datetime

# Bad: dynamic content rebuilds the prefix every call, so nothing is reused
system = f"""You are a coding assistant.
Current time: {datetime.now()}
User timezone: {user.timezone}
 
Available tools:
{tool_definitions}
"""
 
# Good: static content cached, dynamic in messages
system = [
    {
        "type": "text",
        "text": f"""You are a coding assistant.
 
Available tools:
{tool_definitions}"""
    },
    {
        "type": "text",
        "text": reference_documentation,
        "cache_control": {"type": "ephemeral"}
    }
]
 
messages = [
    {
        "role": "user",
        "content": f"Time: {datetime.now()}, TZ: {user.timezone}\n\nQuery: {query}"
    }
]

Timestamps and user-specific data go in messages, not system prompt.
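One way to enforce this split mechanically is a small request builder that only ever puts per-request data in the user message. The function name and parameters below are illustrative, not part of the SDK:

```python
from datetime import datetime


def build_request(static_system: str, reference_docs: str,
                  query: str, timezone: str) -> dict:
    """Assemble system/messages so only truly static text sits in the
    cacheable system prefix; volatile values go in the user message."""
    return {
        "system": [
            {"type": "text", "text": static_system},
            {
                "type": "text",
                "text": reference_docs,
                # Breakpoint after the last static block.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            {
                "role": "user",
                # Volatile data lives here, so the prefix stays byte-identical.
                "content": f"Time: {datetime.now().isoformat()}, "
                           f"TZ: {timezone}\n\nQuery: {query}",
            }
        ],
    }


req = build_request("You are a coding assistant.", "[reference docs]",
                    "Fix the bug", "UTC")
```

Routing every request through one builder makes it hard to accidentally interpolate a timestamp into the system prompt.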

Multi-Level Caching

Use up to 4 cache breakpoints for different stability levels:

python
system = [
    {
        "type": "text",
        "text": "Core instructions that never change"
    },
    {
        "type": "text",
        "text": "Tool definitions (updated weekly)",
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": "Project context (updated daily)",
        "cache_control": {"type": "ephemeral"}
    }
]

Each breakpoint caches the entire prefix before it: the first covers the stable core plus tool definitions, the second adds the daily project context. A daily update then invalidates only the outer layer while the inner cache keeps hitting.
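To judge whether a breakpoint pays for itself, a back-of-envelope estimator helps. The write/read multipliers below are illustrative defaults, not quoted prices; check current pricing before relying on them:

```python
def caching_savings(prefix_tokens: int, calls: int,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of a cached prefix across `calls` requests, as a fraction of
    resending it uncached every time. One write, then (calls - 1) reads.
    Multipliers are placeholder assumptions."""
    uncached = prefix_tokens * calls
    cached = prefix_tokens * (write_mult + read_mult * (calls - 1))
    return cached / uncached
```

With these assumed multipliers, a single call costs more than not caching at all (the write premium), while ten calls drop the prefix cost to roughly a fifth, which is why caching only makes sense for prefixes that are actually reused.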

Conversation History

Cache conversation history to speed up multi-turn conversations:

python
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Second question"},
    {"role": "assistant", "content": "Second answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Third question",
                # cache_control attaches to content blocks, not the message
                "cache_control": {"type": "ephemeral"}
            }
        ]
    }
]

Everything up to and including the breakpoint is cached, so the next turn reads the full conversation history from cache and only the newest content is processed fresh. This pattern also works well alongside Agent Skills loaded in the system prompt.
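In a real conversation loop the breakpoint has to move forward each turn: strip any stale marker and re-attach it to the latest message. A sketch of such a helper (the function name is illustrative):

```python
def set_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Move the ephemeral cache breakpoint to the final message,
    normalizing string content into text blocks so cache_control
    can attach to a content block."""
    out = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        else:
            # Copy blocks and clear markers left over from earlier turns.
            content = [dict(block) for block in content]
            for block in content:
                block.pop("cache_control", None)
        if i == len(messages) - 1:
            content[-1]["cache_control"] = {"type": "ephemeral"}
        out.append({"role": msg["role"], "content": content})
    return out


history = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Second question"},
]
cached = set_cache_breakpoint(history)
```

Keeping exactly one marker at the tail avoids burning through the four-breakpoint budget as the conversation grows.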

Requirements and Limits

Minimum cache size:

  • Claude Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Sonnet 3.5, Opus 3: 1024 tokens
  • Claude Haiku 4.5, Haiku 3.5, Haiku 3: 2048 tokens

Cache lifetime:

  • Default: 5 minutes (refreshes on each use)
  • Extended: 1 hour (available at additional cost)

Breakpoints: Up to 4 cache breakpoints per request
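Content below the minimum silently won't cache, so a guard that estimates token count before attaching a marker avoids wasting breakpoints. The 4-characters-per-token heuristic is a rough assumption, not an exact tokenizer:

```python
def maybe_cache(block: dict, min_tokens: int = 1024) -> dict:
    """Attach cache_control only if the text plausibly clears the model's
    minimum cacheable size (rough heuristic: ~4 chars per token)."""
    est_tokens = len(block.get("text", "")) / 4
    if est_tokens >= min_tokens:
        return {**block, "cache_control": {"type": "ephemeral"}}
    return block


small = maybe_cache({"type": "text", "text": "short instruction"})
big = maybe_cache({"type": "text", "text": "x" * 5000})
```

For Haiku models, pass `min_tokens=2048` to match their higher minimum.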

When to Use Caching

Use prompt caching when:

  • Tool definitions or documentation exceed 1024 tokens and stay constant
  • Building conversational agents that maintain context across turns
  • Processing multiple files/queries with the same instructions
  • Agent Skills or reference material is loaded on every request
  • Making multiple requests with the same prefix content

Skip caching when:

  • Context changes every request
  • Total cached content under 1024 tokens (won't cache)
  • Single-use prompts
  • Prefix varies across requests