Prompt caching allows you to cache parts of your prompts that don’t change between requests, reducing costs and improving response times.

API Format

Prompt caching is supported in both endpoint formats: the Anthropic format (`/v1/messages`) uses `cache_control` markers in message content blocks, and the OpenAI format (`/v1/chat/completions`) uses the `prompt_caching` helper, which is converted to `cache_control` markers internally. Both formats support the same caching functionality. Examples below are organized by format for clarity.

Pricing

Prompt caching uses a tiered pricing structure based on cache duration and usage type:
| Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
|---|---|---|---|---|---|
| Claude Opus 4.5 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok |
| Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Opus 4 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Sonnet 4.5 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 4 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 3.7 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Haiku 4.5 | $1 / MTok | $1.25 / MTok | $2 / MTok | $0.10 / MTok | $5 / MTok |
| Claude Haiku 3.5 | $0.80 / MTok | $1 / MTok | $1.60 / MTok | $0.08 / MTok | $4 / MTok |
| Claude Haiku 3 | $0.25 / MTok | $0.30 / MTok | $0.50 / MTok | $0.03 / MTok | $1.25 / MTok |
The pricing multipliers are:
  • 5-minute cache writes: 1.25x the base input tokens price
  • 1-hour cache writes: 2x the base input tokens price
  • Cache reads: 0.1x the base input tokens price
Cache writes occur when content is first cached. Subsequent requests using the cached content are charged at the cache read rate (10% of base input price). The cache lifetime is refreshed each time the cached content is used.
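
As a quick illustration of the multipliers, here is a small arithmetic sketch for a hypothetical 2,000-token cached prefix on Claude Sonnet 4.5, using the rates from the table above:

// Rough cost sketch for a 2,000-token cached prefix on Claude Sonnet 4.5.
// Rates are taken from the pricing table above, in $ per million tokens;
// the token count is purely illustrative.
const MTok = 1_000_000;
const cacheWrite5m = 3.75;   // 5-minute cache write (1.25x base input)
const cacheRead = 0.30;      // cache hit / refresh (0.1x base input)

const cachedTokens = 2000;

// First request: the prefix is written to the cache
const firstRequestCost = (cachedTokens / MTok) * cacheWrite5m;   // ~$0.0075

// Each later request within the TTL: the prefix is read from the cache
const cachedRequestCost = (cachedTokens / MTok) * cacheRead;     // ~$0.0006

console.log({ firstRequestCost, cachedRequestCost });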

Limitations

  • Maximum 4 cache breakpoints per request
  • Caches expire after 5 minutes by default
  • cache_control can only be inserted into text content blocks
Reserve cache breakpoints for large, static content like character cards, CSV data, RAG knowledge bases, book chapters, or extensive reference documentation.
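
As a rough sketch of how breakpoints can be spent, the request body below (Anthropic format, following the examples later on this page) places two of the four allowed breakpoints: one after a large static system prompt and one after a large document, so each segment can be reused independently. The model name and placeholder texts are illustrative only.

// Illustrative only: two cache breakpoints in a single request,
// well under the 4-breakpoint limit. Placeholder texts are hypothetical.
const body = {
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'system',
      content: [
        // Breakpoint 1: large, static instructions
        {type: 'text', text: 'LONG SYSTEM PROMPT', cache_control: {type: 'ephemeral'}}
      ]
    },
    {
      role: 'user',
      content: [
        // Breakpoint 2: large reference document
        {type: 'text', text: 'HUGE REFERENCE DOCUMENT', cache_control: {type: 'ephemeral'}},
        // The short question at the end stays uncached
        {type: 'text', text: 'Summarize chapter 3.'}
      ]
    }
  ]
};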

Examples

System Message Caching

Anthropic Format (/v1/messages)

import Anthropic from '@anthropic-ai/sdk';

// Client setup; the base URL below is an assumption for this provider's
// Anthropic-compatible /v1/messages endpoint.
const anthropic = new Anthropic({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://api.electronhub.ai'
});

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'system',
      content: [
        {
          type: 'text',
          text: 'You are a historian. You know the following book:'
        },
        {
          type: 'text',
          text: 'HUGE TEXT BODY',
          cache_control: {
            type: 'ephemeral'
          }
        }
      ]
    },
    {
      role: 'user',
      content: 'What triggered the collapse?'
    }
  ]
});

OpenAI Format (/v1/chat/completions)

const response = await fetch('https://api.electronhub.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-5-20250929',
    messages: [
      {
        role: 'user',
        content: 'You are a historian. You know the following book: HUGE TEXT BODY. What triggered the collapse?'
      }
    ],
    max_tokens: 1000,
    prompt_caching: {
      enabled: true,
      ttl: '5m',
      cut_after_message_index: 0
    },
    stream_options: {
      include_usage: true
    }
  })
});

User Message Caching

Anthropic Format (/v1/messages)

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'user',
      content: [
        {type: 'text', text: 'Given the book below:'},
        {type: 'text', text: 'HUGE TEXT BODY', cache_control: {type: 'ephemeral'}},
        {type: 'text', text: 'Name all the characters'}
      ]
    }
  ]
});

OpenAI Format (/v1/chat/completions)

const response = await fetch('https://api.electronhub.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-5-20250929',
    messages: [
      {
        role: 'user',
        content: [
          {type: 'text', text: 'Given the book below:'},
          {type: 'text', text: 'HUGE TEXT BODY', cache_control: {type: 'ephemeral'}},
          {type: 'text', text: 'Name all the characters'}
        ]
      }
    ],
    max_tokens: 1000,
    stream_options: {
      include_usage: true
    }
  })
});

Cache Control Options

Anthropic Format (/v1/messages)

cache_control.type (string, required)
Cache type. Currently only "ephemeral" is supported.

cache_control.ttl (string, optional)
Time-to-live. Defaults to 5 minutes. Format: "5m" or "1h".
Some models, such as claude-3-7-sonnet-20250219, do not support TTL in system messages; TTL is automatically stripped for these models.

OpenAI Format (/v1/chat/completions)

prompt_caching.enabled (boolean, required)
Enable prompt caching. Set to true to enable caching.

prompt_caching.ttl (string, optional)
Time-to-live. Defaults to "5m". Format: "5m" or "1h".

prompt_caching.cut_after_message_index (integer, required)
Zero-based index of the last message to cache. All messages up to and including this index will be cached.
The prompt_caching helper automatically converts to cache_control markers in message content. You can also use cache_control markers directly in message content blocks (same as Anthropic format).
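
For a longer conversation in the OpenAI format, cut_after_message_index controls how much of the history is cached. The sketch below is illustrative (model and message texts are placeholders): it caches messages 0 and 1 and leaves the newest user turn uncached.

// Sketch: cache the first two messages (indexes 0 and 1) of an ongoing conversation.
// Everything after cut_after_message_index is sent uncached.
const body = {
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {role: 'user', content: 'Here is the full contract text: HUGE TEXT BODY'},  // index 0 (cached)
    {role: 'assistant', content: 'Understood. I have read the contract.'},      // index 1 (cached)
    {role: 'user', content: 'Which clauses mention termination?'}               // index 2 (not cached)
  ],
  prompt_caching: {
    enabled: true,
    ttl: '5m',
    cut_after_message_index: 1
  }
};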

Using TTL

Specify cache duration with the ttl parameter:

Anthropic Format (/v1/messages)

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'system',
      content: [
        {
          type: 'text',
          text: 'You are a helpful assistant. Use this knowledge base:'
        },
        {
          type: 'text',
          text: 'LARGE KNOWLEDGE BASE TEXT',
          cache_control: {
            type: 'ephemeral',
            ttl: '1h'  // Cache for 1 hour instead of default 5 minutes
          }
        }
      ]
    },
    {
      role: 'user',
      content: 'What is machine learning?'
    }
  ]
});

OpenAI Format (/v1/chat/completions)

const response = await fetch('https://api.electronhub.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-5-20250929',
    messages: [
      {
        role: 'user',
        content: 'You are a helpful assistant. Use this knowledge base: LARGE KNOWLEDGE BASE TEXT. What is machine learning?'
      }
    ],
    max_tokens: 1000,
    prompt_caching: {
      enabled: true,
      ttl: '1h',  // Cache for 1 hour instead of default 5 minutes
      cut_after_message_index: 0
    },
    stream_options: {
      include_usage: true
    }
  })
});

Usage Tracking

The API response includes cache usage in the `usage` object; both endpoint formats return it:

Anthropic Format (/v1/messages)

{
  "usage": {
    "input_tokens": 150,
    "output_tokens": 50,
    "cache_creation_input_tokens": 2000,
    "cache_read_input_tokens": 0,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 2000,
      "ephemeral_1h_input_tokens": 0
    }
  }
}

OpenAI Format (/v1/chat/completions)

{
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200,
    "cache_creation_input_tokens": 2000,
    "cache_read_input_tokens": 0,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 2000,
      "ephemeral_1h_input_tokens": 0
    }
  }
}
For cached requests, cache_read_input_tokens will be non-zero instead of cache_creation_input_tokens.
The OpenAI format uses prompt_tokens and completion_tokens instead of input_tokens and output_tokens, but the cache-related fields are identical across both formats.
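
To confirm caching is working, you can inspect these fields after each request. A minimal sketch, assuming a fetch response like the examples above:

// Minimal sketch: check whether a request wrote to or read from the cache.
const data = await response.json();   // `response` from a fetch call as in the examples above
const usage = data.usage ?? {};

if ((usage.cache_read_input_tokens ?? 0) > 0) {
  console.log(`Cache hit: ${usage.cache_read_input_tokens} tokens read from cache`);
} else if ((usage.cache_creation_input_tokens ?? 0) > 0) {
  console.log(`Cache write: ${usage.cache_creation_input_tokens} tokens written to cache`);
} else {
  console.log('No cache activity for this request');
}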

Best Practices

  1. Cache large, static content that doesn’t change frequently
  2. Monitor cache_read_input_tokens to verify cache hits
  3. Respect the 4 breakpoint limit - prioritize largest, most reused content
  4. Remember cache expiration - caches expire after 5 minutes (or your TTL)
For more details, see the Anthropic documentation.