Prompt caching allows you to cache parts of your prompts that don’t change between requests, reducing costs and improving response times.

Pricing

Prompt caching uses a tiered pricing structure based on cache duration and usage type:
Model             | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens
Claude Opus 4.5   | $5 / MTok         | $6.25 / MTok    | $10 / MTok      | $0.50 / MTok           | $25 / MTok
Claude Opus 4.1   | $15 / MTok        | $18.75 / MTok   | $30 / MTok      | $1.50 / MTok           | $75 / MTok
Claude Opus 4     | $15 / MTok        | $18.75 / MTok   | $30 / MTok      | $1.50 / MTok           | $75 / MTok
Claude Sonnet 4.5 | $3 / MTok         | $3.75 / MTok    | $6 / MTok       | $0.30 / MTok           | $15 / MTok
Claude Sonnet 4   | $3 / MTok         | $3.75 / MTok    | $6 / MTok       | $0.30 / MTok           | $15 / MTok
Claude Sonnet 3.7 | $3 / MTok         | $3.75 / MTok    | $6 / MTok       | $0.30 / MTok           | $15 / MTok
Claude Haiku 4.5  | $1 / MTok         | $1.25 / MTok    | $2 / MTok       | $0.10 / MTok           | $5 / MTok
Claude Haiku 3.5  | $0.80 / MTok      | $1 / MTok       | $1.60 / MTok    | $0.08 / MTok           | $4 / MTok
Claude Opus 3     | $15 / MTok        | $18.75 / MTok   | $30 / MTok      | $1.50 / MTok           | $75 / MTok
Claude Haiku 3    | $0.25 / MTok      | $0.30 / MTok    | $0.50 / MTok    | $0.03 / MTok           | $1.25 / MTok
The pricing multipliers are:
  • 5-minute cache writes: 1.25x the base input tokens price
  • 1-hour cache writes: 2x the base input tokens price
  • Cache reads: 0.1x the base input tokens price
Cache writes occur when content is first cached. Subsequent requests using the cached content are charged at the cache read rate (10% of base input price). The cache lifetime is refreshed each time the cached content is used.
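To make the multipliers concrete, here is a back-of-the-envelope calculation for a hypothetical 2,000-token cached prefix on Claude Sonnet 4.5, using the rates from the table above:

// Cost of writing vs. reading a 2,000-token prefix on Claude Sonnet 4.5
const PREFIX_TOKENS = 2_000;
const BASE = 3.0;       // $ per MTok, base input
const WRITE_5M = 3.75;  // 1.25x base: 5-minute cache write
const READ = 0.3;       // 0.1x base: cache hit

const writeCost = (PREFIX_TOKENS / 1_000_000) * WRITE_5M; // $0.0075 on the first request
const readCost = (PREFIX_TOKENS / 1_000_000) * READ;      // $0.0006 on each cache hit
const uncached = (PREFIX_TOKENS / 1_000_000) * BASE;      // $0.0060 without caching

console.log({ writeCost, readCost, uncached });
// After the one-time 25% write premium, every hit costs 10% of the uncached price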

Limitations

  • Maximum 4 cache breakpoints per request
  • Caches expire after 5 minutes by default
  • cache_control can only be inserted into text content blocks
Reserve cache breakpoints for large, static content like character cards, CSV data, RAG knowledge bases, book chapters, or extensive reference documentation.
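For example, two of the four breakpoints can cache a static system prompt and the accumulated conversation history independently. This is a sketch with placeholder text, assuming an Anthropic SDK client `anthropic` as instantiated in the examples below; adapt the split to your application:

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  system: [
    {
      type: 'text',
      text: 'LARGE STATIC INSTRUCTIONS AND REFERENCE TEXT',
      // Breakpoint 1: reused by every request
      cache_control: {type: 'ephemeral'}
    }
  ],
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'EARLIER CONVERSATION TURNS',
          // Breakpoint 2: reused by the next turn as the history grows
          cache_control: {type: 'ephemeral'}
        },
        {type: 'text', text: 'Latest question goes here'}
      ]
    }
  ]
});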

Examples

System Prompt Caching

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  // System prompts go in the top-level `system` parameter,
  // not in the `messages` array
  system: [
    {
      type: 'text',
      text: 'You are a historian. You know the following book:'
    },
    {
      type: 'text',
      text: 'HUGE TEXT BODY',
      // Everything up to and including this block is cached
      cache_control: {
        type: 'ephemeral'
      }
    }
  ],
  messages: [
    {
      role: 'user',
      content: 'What triggered the collapse?'
    }
  ]
});

User Message Caching

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'user',
      content: [
        {type: 'text', text: 'Given the book below:'},
        // The breakpoint caches everything up to and including this block
        {type: 'text', text: 'HUGE TEXT BODY', cache_control: {type: 'ephemeral'}},
        // Content after the breakpoint is processed normally on every request
        {type: 'text', text: 'Name all the characters'}
      ]
    }
  ]
});

Cache Control Options

cache_control.type (string, required)
  Cache type. Currently only "ephemeral" is supported.
cache_control.ttl (string, optional)
  Time-to-live for the cache entry. Defaults to 5 minutes. Format: "5m" or "1h".
  Some models, such as claude-3-7-sonnet-20250219, do not support TTL in system messages; TTL is automatically stripped for these models.

Using TTL

Specify cache duration with the ttl parameter:
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  system: [
    {
      type: 'text',
      text: 'You are a helpful assistant. Use this knowledge base:'
    },
    {
      type: 'text',
      text: 'LARGE KNOWLEDGE BASE TEXT',
      cache_control: {
        type: 'ephemeral',
        ttl: '1h'  // Cache for 1 hour instead of the default 5 minutes
      }
    }
  ],
  messages: [
    {
      role: 'user',
      content: 'What is machine learning?'
    }
  ]
});

Usage Tracking

The API response includes cache usage in the usage object:
{
  "usage": {
    "input_tokens": 150,
    "output_tokens": 50,
    "cache_creation_input_tokens": 2000,
    "cache_read_input_tokens": 0,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 2000,
      "ephemeral_1h_input_tokens": 0
    }
  }
}
On requests that hit the cache, cache_read_input_tokens is non-zero and cache_creation_input_tokens is 0 (unless the request also adds new content to the cache).
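Continuing from the `message` response in any of the examples above, a minimal sketch for monitoring cache behavior in application code (the log messages are illustrative):

// Inspect the usage object on each response to confirm the intended
// prefix is actually being reused rather than re-written every time
const { usage } = message;
if (usage.cache_read_input_tokens > 0) {
  console.log(`Cache hit: ${usage.cache_read_input_tokens} tokens read at 10% of base price`);
} else if (usage.cache_creation_input_tokens > 0) {
  console.log(`Cache write: ${usage.cache_creation_input_tokens} tokens cached`);
} else {
  console.log('No caching occurred; check breakpoint placement and prefix length');
}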

Best Practices

  1. Cache large, static content that is reused across many requests
  2. Monitor cache_read_input_tokens to verify cache hits
  3. Respect the 4-breakpoint limit: prioritize the largest, most frequently reused content
  4. Remember cache expiration: entries expire 5 minutes (or your TTL) after their last use
For more details, see the Anthropic documentation.