Prompt Caching: Design for Reuse
Anthropic's prompt caching lets you reuse static context across API calls—tool definitions, system instructions, reference documents—reducing both costs and latency for repeated context. The key: structure prompts with reusable content first, variable data last.
TL;DR
Put static content (tools, instructions, examples) in the cacheable prefix. Add cache breakpoints with cache_control. Keep dynamic data (user queries, timestamps) in the suffix. Cache writes cost 25% more than base input tokens; cache reads cost 90% less on subsequent calls. Minimum 1024 tokens for most models; the 5-minute TTL refreshes on each use.
How It Works
Cache prefixes are created in order: tools, system, then messages. Mark the end of cacheable content with cache_control:
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert Python debugger with access to these tools..."
        },
        {
            "type": "text",
            "text": "# Tool Documentation\n[5000 lines of tool docs]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"Debug this error: {user_error}"
        }
    ]
)

# Check cache usage
print(response.usage.cache_creation_input_tokens)  # First call
print(response.usage.cache_read_input_tokens)      # Subsequent calls

The first call writes the prefix to the cache. Subsequent calls read it at a significantly reduced cost.
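Because the cache prefix is built in order (tools, then system, then messages), a breakpoint on the last tool definition caches the entire tools array too. A minimal sketch of that request shape, with hypothetical tool names and schemas:

```python
# Sketch: cache all tool definitions by marking the LAST tool.
# Tool names and schemas here are illustrative, not real tools.
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the project's test suite.",
        "input_schema": {"type": "object", "properties": {}},
        # Breakpoint on the last tool covers everything before it,
        # so the whole tools array lands in the cacheable prefix.
        "cache_control": {"type": "ephemeral"},
    },
]

# Passed as-is: client.messages.create(..., tools=tools, ...)
```

Note the breakpoint goes on the final tool only; everything before a breakpoint is included in that cache entry.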
Design Patterns
Static Prefix, Dynamic Suffix
# Bad: dynamic content breaks cache
system = f"""You are a coding assistant.
Current time: {datetime.now()}
User timezone: {user.timezone}
Available tools:
{tool_definitions}
"""
# Good: static content cached, dynamic in messages
system = [
    {
        "type": "text",
        "text": f"""You are a coding assistant.
Available tools:
{tool_definitions}"""
    },
    {
        "type": "text",
        "text": reference_documentation,
        "cache_control": {"type": "ephemeral"}
    }
]

messages = [
    {
        "role": "user",
        "content": f"Time: {datetime.now()}, TZ: {user.timezone}\n\nQuery: {query}"
    }
]

Timestamps and user-specific data go in the messages, not the system prompt.
Multi-Level Caching
Use up to 4 cache breakpoints for different stability levels:
system = [
    {
        "type": "text",
        "text": "Core instructions that never change"
    },
    {
        "type": "text",
        "text": "Tool definitions (updated weekly)",
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": "Project context (updated daily)",
        "cache_control": {"type": "ephemeral"}
    }
]

Each breakpoint caches everything before it, so when the daily project context changes, only the second cache entry is invalidated; the core instructions and tool definitions still hit the first breakpoint's cache.
Conversation History
Cache conversation history to speed up multi-turn conversations:
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Second question"},
    {"role": "assistant", "content": "Second answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Third question",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    }
]

Note that cache_control belongs on a content block, so the final message uses the block form of content rather than a plain string. The breakpoint caches everything up to and including the latest user message, so the next turn reads the full history from the cache. This works well alongside Agent Skills loaded in the system prompt.
Requirements and Limits
Minimum cache size:
- Claude Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Sonnet 3.5, Opus 3: 1024 tokens
- Claude Haiku 4.5, Haiku 3.5, Haiku 3: 2048 tokens
Cache lifetime:
- Default: 5 minutes (refreshes on each use)
- Extended: 1 hour (available at additional cost)
Breakpoints: Up to 4 cache breakpoints per request
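The 1-hour lifetime is requested per breakpoint via the ttl field on cache_control (the default is the 5-minute cache). A minimal sketch of the block shape:

```python
# Sketch: requesting the extended 1-hour cache lifetime via the
# "ttl" field on cache_control. Omitting ttl gives the 5-minute default.
system = [
    {
        "type": "text",
        "text": "[large, stable reference material]",
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }
]
```

Reserve the 1-hour TTL for prefixes reused at intervals longer than five minutes; for back-to-back requests the default TTL refreshes itself on every cache hit.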
When to Use Caching
Use prompt caching when:
- Tool definitions or documentation exceed 1024 tokens and stay constant
- Building conversational agents that maintain context across turns
- Processing multiple files/queries with the same instructions
- Agent skills or reference material loads in every request
- Making multiple requests with the same prefix content
Skip caching when:
- Context changes every request
- Total cached content under 1024 tokens (won't cache)
- Single-use prompts
- Prefix varies across requests
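The trade-off above can be checked with a back-of-envelope calculation. Assuming the published 5-minute-cache multipliers (writes at 1.25x the base input price, reads at 0.1x) and illustrative prices and token counts:

```python
# Back-of-envelope prefix cost over N calls, assuming 5-minute-cache
# multipliers: writes cost 1.25x base input price, reads 0.1x.
def cost_without_cache(calls: int, prefix_tokens: int, price_per_tok: float) -> float:
    return calls * prefix_tokens * price_per_tok

def cost_with_cache(calls: int, prefix_tokens: int, price_per_tok: float) -> float:
    write = 1.25 * prefix_tokens * price_per_tok            # first call writes
    reads = 0.10 * prefix_tokens * price_per_tok * (calls - 1)
    return write + reads

# A 10,000-token prefix at an illustrative $3 / MTok over 10 calls:
p = 3e-6
print(cost_without_cache(10, 10_000, p))  # ~0.30
print(cost_with_cache(10, 10_000, p))     # ~0.065
```

Under these multipliers caching breaks even on the second call (1.25 + 0.1 < 2.0), which is why a prefix that recurs even twice within the TTL is usually worth caching.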
References
- Prompt Caching - Claude Docs - Official documentation
- Prompt Caching Announcement - Anthropic blog post
- Building Agent Skills - How skills work with caching
- Progressive Disclosure in Agent Skills - Loading skills on demand