Prompt caching allows you to cache parts of your prompts that don’t change between requests, reducing costs and improving response times.

API Format

Prompt caching is supported in both endpoint formats: the Anthropic format (`/v1/messages`) uses `cache_control` markers in message content blocks, and the OpenAI format (`/v1/chat/completions`) uses the `prompt_caching` helper, which is converted to `cache_control` markers internally. Both formats support the same caching functionality. Examples below are organized by format for clarity.

Pricing

Prompt caching uses a tiered pricing structure based on cache duration and usage type:
| Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
|---|---|---|---|---|---|
| Claude Opus 4.5 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok |
| Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Opus 4 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Sonnet 4.5 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 4 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 3.7 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Haiku 4.5 | $1 / MTok | $1.25 / MTok | $2 / MTok | $0.10 / MTok | $5 / MTok |
| Claude Haiku 3.5 | $0.80 / MTok | $1 / MTok | $1.60 / MTok | $0.08 / MTok | $4 / MTok |
| Claude Haiku 3 | $0.25 / MTok | $0.30 / MTok | $0.50 / MTok | $0.03 / MTok | $1.25 / MTok |
The pricing multipliers are:
  • 5-minute cache writes: 1.25x the base input tokens price
  • 1-hour cache writes: 2x the base input tokens price
  • Cache reads: 0.1x the base input tokens price
Cache writes occur when content is first cached. Subsequent requests using the cached content are charged at the cache read rate (10% of base input price). The cache lifetime is refreshed each time the cached content is used.
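
As a quick illustration of the multipliers, here is a small arithmetic sketch for a hypothetical 2,000-token cached prefix on Claude Sonnet 4.5, using the rates from the table above:

// Rough cost sketch for a 2,000-token cached prefix on Claude Sonnet 4.5.
// Rates are taken from the pricing table above, in $ per million tokens;
// the token count is purely illustrative.
const MTok = 1_000_000;
const cacheWrite5m = 3.75;   // 5-minute cache write (1.25x base input)
const cacheRead = 0.30;      // cache hit / refresh (0.1x base input)

const cachedTokens = 2000;

// First request: the prefix is written to the cache
const firstRequestCost = (cachedTokens / MTok) * cacheWrite5m;   // ~$0.0075

// Each later request within the TTL: the prefix is read from the cache
const cachedRequestCost = (cachedTokens / MTok) * cacheRead;     // ~$0.0006

console.log({ firstRequestCost, cachedRequestCost });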

Limitations

  • Maximum 4 cache breakpoints per request
  • Caches expire after 5 minutes by default
  • cache_control can only be inserted into text content blocks
Reserve cache breakpoints for large, static content like character cards, CSV data, RAG knowledge bases, book chapters, or extensive reference documentation.
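
As a rough sketch of how breakpoints can be spent, the request body below (Anthropic format, following the examples later on this page) places two of the four allowed breakpoints: one after a large static system prompt and one after a large document, so each segment can be reused independently. The model name and placeholder texts are illustrative only.

// Illustrative only: two cache breakpoints in a single request,
// well under the 4-breakpoint limit. Placeholder texts are hypothetical.
const body = {
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'system',
      content: [
        // Breakpoint 1: large, static instructions
        {type: 'text', text: 'LONG SYSTEM PROMPT', cache_control: {type: 'ephemeral'}}
      ]
    },
    {
      role: 'user',
      content: [
        // Breakpoint 2: large reference document
        {type: 'text', text: 'HUGE REFERENCE DOCUMENT', cache_control: {type: 'ephemeral'}},
        // The short question at the end stays uncached
        {type: 'text', text: 'Summarize chapter 3.'}
      ]
    }
  ]
};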

Examples

System Message Caching

Anthropic Format (/v1/messages)

import Anthropic from '@anthropic-ai/sdk';

// Client setup; the base URL below is an assumption for this provider's
// Anthropic-compatible /v1/messages endpoint.
const anthropic = new Anthropic({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://api.electronhub.ai'
});

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'system',
      content: [
        {
          type: 'text',
          text: 'You are a historian. You know the following book:'
        },
        {
          type: 'text',
          text: 'HUGE TEXT BODY',
          cache_control: {
            type: 'ephemeral'
          }
        }
      ]
    },
    {
      role: 'user',
      content: 'What triggered the collapse?'
    }
  ]
});

OpenAI Format (/v1/chat/completions)

const response = await fetch('https://api.electronhub.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-5-20250929',
    messages: [
      {
        role: 'user',
        content: 'You are a historian. You know the following book: HUGE TEXT BODY. What triggered the collapse?'
      }
    ],
    max_tokens: 1000,
    prompt_caching: {
      enabled: true,
      ttl: '5m',
      cut_after_message_index: 0
    },
    stream_options: {
      include_usage: true
    }
  })
});

User Message Caching

Anthropic Format (/v1/messages)

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'user',
      content: [
        {type: 'text', text: 'Given the book below:'},
        {type: 'text', text: 'HUGE TEXT BODY', cache_control: {type: 'ephemeral'}},
        {type: 'text', text: 'Name all the characters'}
      ]
    }
  ]
});

OpenAI Format (/v1/chat/completions)

const response = await fetch('https://api.electronhub.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-5-20250929',
    messages: [
      {
        role: 'user',
        content: [
          {type: 'text', text: 'Given the book below:'},
          {type: 'text', text: 'HUGE TEXT BODY', cache_control: {type: 'ephemeral'}},
          {type: 'text', text: 'Name all the characters'}
        ]
      }
    ],
    max_tokens: 1000,
    stream_options: {
      include_usage: true
    }
  })
});

Cache Control Options

Anthropic Format (/v1/messages)

cache_control.type (string, required)
Cache type. Currently only "ephemeral" is supported.

cache_control.ttl (string, optional)
Time-to-live. Defaults to 5 minutes. Format: "5m" or "1h".
Some models, such as claude-3-7-sonnet-20250219, do not support TTL in system messages; TTL is automatically stripped for these models.

OpenAI Format (/v1/chat/completions)

prompt_caching.enabled (boolean, required)
Enable prompt caching. Set to true to enable caching.

prompt_caching.ttl (string, optional)
Time-to-live. Defaults to "5m". Format: "5m" or "1h".

prompt_caching.cut_after_message_index (integer, required)
Zero-based index of the last message to cache. All messages up to and including this index will be cached.
The prompt_caching helper automatically converts to cache_control markers in message content. You can also use cache_control markers directly in message content blocks (same as Anthropic format).
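
For a longer conversation in the OpenAI format, cut_after_message_index controls how much of the history is cached. The sketch below is illustrative (model and message texts are placeholders): it caches messages 0 and 1 and leaves the newest user turn uncached.

// Sketch: cache the first two messages (indexes 0 and 1) of an ongoing conversation.
// Everything after cut_after_message_index is sent uncached.
const body = {
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {role: 'user', content: 'Here is the full contract text: HUGE TEXT BODY'},  // index 0 (cached)
    {role: 'assistant', content: 'Understood. I have read the contract.'},      // index 1 (cached)
    {role: 'user', content: 'Which clauses mention termination?'}               // index 2 (not cached)
  ],
  prompt_caching: {
    enabled: true,
    ttl: '5m',
    cut_after_message_index: 1
  }
};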

Using TTL

Specify cache duration with the ttl parameter:

Anthropic Format (/v1/messages)

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1000,
  messages: [
    {
      role: 'system',
      content: [
        {
          type: 'text',
          text: 'You are a helpful assistant. Use this knowledge base:'
        },
        {
          type: 'text',
          text: 'LARGE KNOWLEDGE BASE TEXT',
          cache_control: {
            type: 'ephemeral',
            ttl: '1h'  // Cache for 1 hour instead of default 5 minutes
          }
        }
      ]
    },
    {
      role: 'user',
      content: 'What is machine learning?'
    }
  ]
});

OpenAI Format (/v1/chat/completions)

const response = await fetch('https://api.electronhub.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-5-20250929',
    messages: [
      {
        role: 'user',
        content: 'You are a helpful assistant. Use this knowledge base: LARGE KNOWLEDGE BASE TEXT. What is machine learning?'
      }
    ],
    max_tokens: 1000,
    prompt_caching: {
      enabled: true,
      ttl: '1h',  // Cache for 1 hour instead of default 5 minutes
      cut_after_message_index: 0
    },
    stream_options: {
      include_usage: true
    }
  })
});

Usage Tracking

The API response includes cache usage in the `usage` object; both endpoint formats return it:

Anthropic Format (/v1/messages)

{
  "usage": {
    "input_tokens": 150,
    "output_tokens": 50,
    "cache_creation_input_tokens": 2000,
    "cache_read_input_tokens": 0,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 2000,
      "ephemeral_1h_input_tokens": 0
    }
  }
}

OpenAI Format (/v1/chat/completions)

{
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200,
    "cache_creation_input_tokens": 2000,
    "cache_read_input_tokens": 0,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 2000,
      "ephemeral_1h_input_tokens": 0
    }
  }
}
For cached requests, cache_read_input_tokens will be non-zero instead of cache_creation_input_tokens.
The OpenAI format uses prompt_tokens and completion_tokens instead of input_tokens and output_tokens, but the cache-related fields are identical across both formats.
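
To confirm caching is working, you can inspect these fields after each request. A minimal sketch, assuming a fetch response like the examples above:

// Minimal sketch: check whether a request wrote to or read from the cache.
const data = await response.json();   // `response` from a fetch call as in the examples above
const usage = data.usage ?? {};

if ((usage.cache_read_input_tokens ?? 0) > 0) {
  console.log(`Cache hit: ${usage.cache_read_input_tokens} tokens read from cache`);
} else if ((usage.cache_creation_input_tokens ?? 0) > 0) {
  console.log(`Cache write: ${usage.cache_creation_input_tokens} tokens written to cache`);
} else {
  console.log('No cache activity for this request');
}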

Best Practices

  1. Cache large, static content that doesn’t change frequently
  2. Monitor cache_read_input_tokens to verify cache hits
  3. Respect the 4 breakpoint limit - prioritize largest, most reused content
  4. Remember cache expiration - caches expire after 5 minutes (or your TTL)
For more details, see the Anthropic documentation.