Comparing Prompt Caching: OpenAI, Anthropic, and Gemini
In recent years, the rapid development of large language models (LLMs) has led to significant increases in context window sizes. A context window is the amount of information a model can process at one time, and use cases like Retrieval-Augmented Generation (RAG) and video and image inputs have pushed usable context lengths ever higher. This evolution is aimed at handling more complex tasks and a wider range of information.
In response, major providers have introduced "prompt caching" for efficient prompt handling. Prompt caching stores previously processed prompts (or prompt prefixes) so they can be reused, avoiding repeated processing of the same content. This leads to faster response times and cost savings.
Comparing Prompt Caching Features
In this article, we will compare the prompt caching features of the key LLM providers: OpenAI, Anthropic, and Gemini, focusing on their specifications and differences.
Prompt caching is available in each provider's relatively new models, but cache lifetimes differ. For OpenAI, the cache is typically retained for 5–10 minutes of inactivity, though it can persist for up to an hour during off-peak periods. For Anthropic, the cache is stored for 5 minutes by default, with the TTL refreshed each time the cached content is read. For Gemini, the default TTL is 1 hour, but you can specify a custom TTL (additional storage charges apply the longer the cache is kept).
Discounts are as follows:
- OpenAI: cached input tokens are discounted by 50% across all supported models, while output token costs remain the same.
- Anthropic: writing to the cache costs 25% more than the base input rate, while reading from the cache is discounted by 90%.
- Gemini: Gemini has a more complex pricing structure, with costs that include a per-token, per-hour cache storage fee on top of the (discounted) token charges.

Unlike OpenAI and Anthropic, Gemini charges for storing the cache. For details, refer to the official pricing page, and see the documentation for an example cost calculation.
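To make the Gemini storage fee concrete, here is a minimal cost sketch; the rates below are placeholders, not real prices, so substitute the current numbers from the official pricing page:

```python
# Minimal sketch of Gemini caching costs -- all rates are hypothetical
# placeholders; check the official pricing page for real prices.
CACHED_TOKEN_RATE = 0.31 / 1_000_000  # assumed $/cached input token
STORAGE_RATE = 1.00 / 1_000_000       # assumed $/token/hour of cache storage

cached_tokens = 50_000   # size of the cached context
hours_stored = 2         # how long the cache is kept alive
num_requests = 100       # requests that reuse the cached context

storage_cost = cached_tokens * STORAGE_RATE * hours_stored
token_cost = cached_tokens * CACHED_TOKEN_RATE * num_requests

print(f"storage: ${storage_cost:.2f}, cached-token usage: ${token_cost:.2f}")
```

The point of the sketch is that Gemini's total has two components: a storage term that grows with TTL, and a usage term that shrinks per request thanks to the cached-token discount.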
For OpenAI, no code changes are necessary: once a prompt exceeds 1,024 tokens, it is automatically added to the cache, and cache hits occur in 128-token increments beyond that (e.g., 1,024, 1,152, 1,280, ...). Anthropic and Gemini, by contrast, require an explicit opt-in, as shown in the sketches below.
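As a minimal illustration of the automatic behavior, here is a sketch with the official openai Python SDK; the long system prompt is a hypothetical placeholder, and on a cache hit the usage details report how many prompt tokens were served from the cache:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical long, static instructions (>1,024 tokens in practice).
# Placing them first gives every request the same cacheable prefix.
long_system_prompt = "You are a support agent. Policy: ..." * 200

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "Summarize our refund policy."},
    ],
)

# On a cache hit, cached prompt tokens appear in the usage details.
print(response.usage.prompt_tokens_details.cached_tokens)
```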
For Gemini, you must first create a cache using CachedContent.create, and then reference it when defining the model. The minimum size for a Gemini cache is 32,768 tokens. In all cases, static content intended for caching should be placed at the beginning of the prompt to maximize cache hit rates, since cache matching starts from the beginning of the prompt.
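A minimal sketch with the google-generativeai Python SDK might look like the following; the model name, source file, and key handling are assumptions for illustration:

```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")  # assumes a valid API key

# Hypothetical large payload; cached contents must meet the
# 32,768-token minimum.
large_document = open("big_codebase_dump.txt").read()

# Create the cache first.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    system_instruction="You answer questions about this codebase.",
    contents=[large_document],
    ttl=datetime.timedelta(hours=1),  # default TTL; longer storage costs more
)

# Then reference the cache when defining the model.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("What does the main module do?")
print(response.text)
```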
In Anthropic's case, content must be explicitly marked for caching, and with a short default TTL of 5 minutes it is best to cache frequently reused elements like system instructions, tool definitions, and RAG contexts.
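With the anthropic Python SDK, this opt-in is done by attaching a cache_control breakpoint to the block you want cached. A minimal sketch (the instruction text is a placeholder):

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# Hypothetical long, reusable instructions / RAG context.
long_instructions = "You are a support agent. Policy: ..." * 200

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_instructions,
            # Mark this block as cacheable; the 5-minute TTL is
            # refreshed each time the cached prefix is read.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Answer using the cached context."}],
)

# usage reports cache_creation_input_tokens / cache_read_input_tokens,
# so you can verify whether a request wrote to or read from the cache.
print(response.usage)
```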
Gemini offers longer TTLs, but since cache storage incurs costs, it is recommended to cache large-scale content like code repositories, long videos, or extensive documents.

Thank you for reading. I hope this article was helpful. If you notice any inaccuracies, feel free to reach out.