Prompt caching

Reduce latency and cost with prompt caching.

Model prompts often contain repetitive content, like system prompts and common instructions. OpenAI routes API requests to servers that recently processed the same prompt, making it cheaper and faster than processing a prompt from scratch. Prompt Caching can reduce latency by up to 80% and input token costs by up to 90%. Prompt Caching works automatically on all your API requests (no code changes required) and has no additional fees associated with it. Prompt Caching is enabled for all recent models, gpt-4o and newer.

This guide describes how Prompt Caching works in detail, so that you can optimize your prompts for lower latency and cost.

Structuring prompts

Cache hits are only possible for exact prefix matches within a prompt. To realize caching benefits, place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end. This also applies to images and tools, which must be identical between requests.

[Figure: Prompt Caching visualization]
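
For example, here is a minimal sketch using the OpenAI Python SDK (the assistant scenario and instruction text are made up): the long, static instructions sit at the start of the input and the user-specific question is appended at the end, so every request shares the same cacheable prefix.

from openai import OpenAI

client = OpenAI()

# Static content (instructions, examples) is identical across requests,
# so it forms a cacheable prefix.
STATIC_INSTRUCTIONS = (
    "You are a support assistant for a hypothetical retailer.\n"
    "Always answer politely and cite the relevant policy section.\n"
    "... long instructions and few-shot examples ...\n\n"
)

def answer(user_question: str) -> str:
    # User-specific content goes at the very end of the prompt.
    response = client.responses.create(
        model="gpt-5.1",
        input=STATIC_INSTRUCTIONS + "User question: " + user_question,
    )
    return response.output_text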

How it works

Caching is enabled automatically for prompts that are 1024 tokens or longer. When you make an API request, the following steps occur:

  1. Cache Routing:
  • Requests are routed to a machine based on a hash of the initial prefix of the prompt. The hash typically uses the first 256 tokens, though the exact length varies depending on the model.
  • If you provide the prompt_cache_key parameter, it is combined with the prefix hash, allowing you to influence routing and improve cache hit rates (see the sketch after this list). This is especially beneficial when many requests share long, common prefixes.
  • If requests for the same prefix and prompt_cache_key combination exceed a certain rate (approximately 15 requests per minute), some may overflow and get routed to additional machines, reducing cache effectiveness.
  2. Cache Lookup: The system checks if the initial portion (prefix) of your prompt exists in the cache on the selected machine.
  3. Cache Hit: If a matching prefix is found, the system uses the cached result. This significantly decreases latency and reduces costs.
  4. Cache Miss: If no matching prefix is found, the system processes your full prompt, caching the prefix afterward on that machine for future requests.
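
Here is a sketch of the routing step. The prompt_cache_key parameter is the one documented here, while the key value and prompt variables below are illustrative. Requests that share the same long prefix and the same key are routed to the same machine, improving the odds of a cache hit.

from openai import OpenAI

client = OpenAI()

LONG_SHARED_INSTRUCTIONS = "..."  # static prefix shared by all requests (ideally >= 1024 tokens)
user_message = "How do I reset my password?"

response = client.responses.create(
    model="gpt-5.1",
    input=LONG_SHARED_INSTRUCTIONS + user_message,  # static prefix first, dynamic content last
    prompt_cache_key="support-bot-v2",              # stable, workload-level key (example value)
)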

Prompt cache retention

Prompt Caching can use either in-memory or extended retention policies. When available, Extended Prompt Caching aims to retain the cache for longer, so that subsequent requests are more likely to match the cache.

Prompt cache pricing is the same for both retention policies.

To configure the prompt cache retention policy, set the prompt_cache_retention parameter on your Responses.create request (or chat.completions.create if using Chat Completions).

In-memory prompt cache retention

In-memory prompt cache retention is available for all models that support Prompt Caching.

When using the in-memory policy, cached prefixes generally remain active for 5 to 10 minutes of inactivity, up to a maximum of one hour. In-memory cached prefixes are only held within volatile GPU memory.

Extended prompt cache retention

Extended prompt cache retention is available for the following models:

  • gpt-5.2
  • gpt-5.1-codex-max
  • gpt-5.1
  • gpt-5.1-codex
  • gpt-5.1-codex-mini
  • gpt-5.1-chat-latest
  • gpt-5
  • gpt-5-codex
  • gpt-4.1

Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.

Key/value tensors are the intermediate representations produced by the model’s attention layers during prefill. Only the key/value tensors may be persisted to local storage; the original customer content, such as prompt text, is retained only in memory.

Configure per request

If you don’t specify a retention policy, the default is in_memory. Allowed values are in_memory and 24h.

{
  "model": "gpt-5.1",
  "input": "Your prompt goes here...",
  "prompt_cache_retention": "24h"
}
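
The same request made through the OpenAI Python SDK might look like the following sketch (the prompt text is a placeholder):

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",
    input="Your prompt goes here...",
    prompt_cache_retention="24h",  # or "in_memory" (the default)
)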

Requirements

Caching is available for prompts containing 1024 tokens or more.

All requests, including those with fewer than 1024 tokens, include a cached_tokens field within usage.prompt_tokens_details on the Response or Chat Completion object, indicating how many of the prompt tokens were a cache hit. For requests under 1024 tokens, cached_tokens will be zero.

"usage": {
  "prompt_tokens": 2006,
  "completion_tokens": 300,
  "total_tokens": 2306,
  "prompt_tokens_details": {
    "cached_tokens": 1920
  },
  "completion_tokens_details": {
    "reasoning_tokens": 0,
    "accepted_prediction_tokens": 0,
    "rejected_prediction_tokens": 0
  }
}
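
With the Python SDK you can read this field directly from the usage object, for example on a Chat Completions response (a sketch; the system prompt placeholder stands in for a long static prefix):

from openai import OpenAI

client = OpenAI()

LONG_SYSTEM_PROMPT = "..."  # static instructions, ideally >= 1024 tokens

completion = client.chat.completions.create(
    model="gpt-5.1",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},          # static prefix
        {"role": "user", "content": "Summarize my latest order."},  # dynamic suffix
    ],
)

details = completion.usage.prompt_tokens_details
print("prompt tokens:", completion.usage.prompt_tokens)
print("cached tokens:", details.cached_tokens)  # 0 on a cache miss or for prompts under 1024 tokens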

What can be cached

  • Messages: The complete messages array, encompassing system, user, and assistant interactions.
  • Images: Images included in user messages, either as links or as base64-encoded data; multiple images can be sent. Ensure the detail parameter is set identically across requests, as it affects image tokenization (see the sketch after this list).
  • Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024 token requirement.
  • Structured outputs: The structured output schema serves as a prefix to the system message and can be cached.
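
For instance, when images are part of the prompt, both the image reference and the detail setting must be identical between requests for the prefix to match. A sketch via Chat Completions (the URL, instructions, and model choice are illustrative):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Describe diagrams for an engineering wiki."},  # static prefix
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this diagram show?"},
                {
                    "type": "image_url",
                    # Keep the URL and "detail" identical across requests; changing
                    # "detail" changes the image tokens and breaks the prefix match.
                    "image_url": {"url": "https://example.com/diagram.png", "detail": "high"},
                },
            ],
        },
    ],
)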

Best practices

  • Structure prompts with static or repeated content at the beginning and dynamic, user-specific content at the end.
  • Use the prompt_cache_key parameter consistently across requests that share common prefixes. Select a granularity that keeps each unique prefix-prompt_cache_key combination below 15 requests per minute to avoid cache overflow.
  • Monitor your cache performance metrics, including cache hit rates, latency, and the proportion of tokens cached, to refine your strategy. You can monitor your cached token counts by logging the usage field results as shown above, or in the OpenAI Usage dashboard; a small helper for computing the hit ratio is sketched after this list.
  • Maintain a steady stream of requests with identical prompt prefixes to minimize cache evictions and maximize caching benefits.
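
As a sketch of the monitoring point above (the helper function is illustrative, not part of the SDK), you can compute the fraction of prompt tokens served from the cache for each request and log or aggregate it:

def cache_hit_ratio(usage) -> float:
    """Fraction of prompt tokens that were a cache hit for one request."""
    cached = usage.prompt_tokens_details.cached_tokens
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

# Usage with a Chat Completions response object, e.g. from the earlier example:
#   print(f"cache hit ratio: {cache_hit_ratio(completion.usage):.1%}")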

Frequently asked questions

  1. How is data privacy maintained for caches?

    Prompt caches are not shared between organizations. Only members of the same organization can access caches of identical prompts.

  2. Does Prompt Caching affect output token generation or the final response of the API?

    Prompt Caching does not influence the generation of output tokens or the final response provided by the API. Regardless of whether caching is used, the output generated will be identical. This is because only the prompt itself is cached, while the actual response is computed anew each time based on the cached prompt.

  3. Is there a way to manually clear the cache?

    Manual cache clearing is not currently available. Prompts that have not been encountered recently are automatically cleared from the cache. Typical cache evictions occur after 5-10 minutes of inactivity, though caches can occasionally persist for up to one hour during off-peak periods.

  4. Will I be expected to pay extra for writing to Prompt Caching?

    No. Caching happens automatically, with no explicit action needed or extra cost paid to use the caching feature.

  5. Do cached prompts contribute to TPM rate limits?

    Yes. Caching does not affect rate limits, so cached prompt tokens still count toward your TPM limits.

  6. Does Prompt Caching work on Zero Data Retention requests?

    In-memory cache retention is Zero Data Retention eligible. If you specify extended caching in the request, that request is not considered Zero Data Retention eligible, because the key/value tensors may be held in GPU-local storage and are derived from customer content. However, the extended caching request will not be blocked if Zero Data Retention is enabled for your project. The other Zero Data Retention protections still apply, such as excluding customer content from abuse logs and preventing use of store=True. See the Your data guide for more context on Zero Data Retention.

  7. Does Prompt Caching work with Data Residency?

    In-memory Prompt Caching is compatible with all Data Residency regions.

    Extended caching is only compatible with Data Residency regions that include Regional Inference.