Prompt caching¶
SecondBrain runs many turns per session that share a stable prefix — the system prompt, the tool catalog, and (in agentic loops) the same conversation history with one extra user message tacked on. Re-tokenising that prefix every turn is the dominant cost on Claude and an avoidable one on OpenAI and Bedrock as well. The provider layer attaches cache breakpoints automatically so callers do not have to think about it.
What the provider layer does for you¶
For each request, the Anthropic provider inserts up to three cache breakpoints (within Anthropic's 4-breakpoint per-request cap):
- Tools —
cache_controlon the last tool definition; caches the entire tool array. - System —
cache_controlon the last stable system block (see below); caches the static instructions and any preceding stable blocks. - Last
tool_result—cache_controlon the most recent tool-call result in the conversation; extends the cached prefix turn-over-turn in an agentic loop, so the next request reads the entire reasoning chain from cache.
Bedrock Converse mirrors this with cachePoint blocks (Claude and Nova
model families). OpenAI and Azure-OpenAI receive a deterministic
prompt_cache_key (a SHA-256 of the stable prefix) so requests with
matching prefixes pin to the same backend in OpenAI's load-balanced
fleet.
Gemini and other OpenAI-compatible providers (Groq, Mistral, Fireworks)
do not get cache directives — Gemini requires explicit cachedContent
resource creation that SecondBrain does not yet drive, and the compatible
servers reject unknown payload fields.
The cache_stable convention¶
The single most important thing a caller needs to know: mark static
system content with cache_stable=True.
In _to_anthropic_messages, the cache breakpoint lands on the last
system block flagged cache_stable. If no block is flagged, it falls
back to the last system block (legacy behaviour). The Bedrock provider
applies the same rule.
This matters because the chat turn preparer and several agent paths append per-turn metadata after the base system prompt — today's date, session id, retrieved memory, the context pack, skill prompts. Without the flag, the breakpoint lands on a dynamic block that rotates every turn, and the cache hit rate collapses to zero.
messages = [
# Stable: large, identical across turns → cache breakpoint anchors here
{"role": "system", "content": SYSTEM_PROMPT, "cache_stable": True},
# Dynamic: per-turn metadata, never cached
{"role": "system", "content": f"Today is {date.today().isoformat()}"},
{"role": "system", "content": retrieved_memory_text},
*history,
{"role": "user", "content": user_message},
]
If you build a callsite that injects multiple system blocks, mark the stable portion explicitly. If you only ever send one system block, the fallback handles it.
TTL: 5 minutes by default, 1 hour for workflows¶
The default cache TTL is 5 minutes (Anthropic's ephemeral default). Cache writes cost 1.25× input tokens; reads cost 0.1× — break-even is two reads.
For workflows with multi-minute pauses between steps (e.g. an automation
that waits on an external job), opt into the 1-hour tier by setting
SB_ANTHROPIC_CACHE_TTL=1h. The provider then:
- Adds
"ttl": "1h"to everycache_controlblock it emits. - Appends
extended-cache-ttl-2025-04-11to theanthropic-betaheader. - Costs writes at 2× input (break-even after 10 reads — only use 1h if the prefix will be reused ≥10× before expiry).
brain.providers.cost.estimate_cost_usd takes a matching
cache_write_ttl parameter so cost telemetry stays accurate.
A Bedrock equivalent (SB_BEDROCK_PROMPT_CACHE) controls force-enable /
force-disable per model; per-TTL pricing is not yet exposed because
Bedrock currently only has the default tier.
What gets billed (Anthropic)¶
The response usage object carries three numbers that the
ProviderResult surfaces verbatim:
| Field on response | What it means | Price multiplier |
|---|---|---|
input_tokens |
Uncached input | 1.0× |
cache_creation_input_tokens |
Wrote new cache this turn | 1.25× / 2.0× |
cache_read_input_tokens |
Read existing cache | 0.1× |
To verify caching is working in the wild:
- Turn 1 of a session:
cache_creation_input_tokens > 0,cache_read_input_tokens == 0. The system prompt + tools were written. - Turn 2 onwards:
cache_read_input_tokens > 0(the prefix from turn 1 is being re-read),cache_creation_input_tokensmay be small (writing the new extended prefix that ends in this turn's tool_result). - After 5 minutes of idle: reads drop back to 0 and writes spike again.
sb chat displays cache read / write per turn in its end-of-turn
summary, and the cost tracker accumulates the totals across the session.
Adding a new caller¶
If you're writing a new agent path that calls provider.generate()
directly:
- Build messages with a single
cache_stable=Truesystem block holding your stable instructions. - Put any per-call metadata in additional system or user blocks after it.
- Don't otherwise touch
cache_control— the provider handles it.
If you're using generate_with_prompt() via the PromptStore, the
rendered system prompt is treated as the stable block by callers that
combine it with per-turn metadata via override_messages. See
brain/data_agent/agent.py:3957 for the canonical pattern.
Common mistakes¶
- Embedding a timestamp or session id in the stable system prompt. This rotates the hash and busts the cache every call. Move it to a separate dynamic block placed after the stable one.
- Reordering tools between requests. Tool order affects the cache key; keep registries deterministic.
- Treating "supported" in
provider_diagnosticsas guaranteed caching. Gemini and Bedrock report cache-read tokens but only the Anthropic and OpenAI/Azure paths actually drive cache writes from SecondBrain today.
Provider parity matrix¶
| Provider family | Cache writes driven | Cache reads observed | Notes |
|---|---|---|---|
| Anthropic | ✅ (cache_control) | ✅ | 5m default; SB_ANTHROPIC_CACHE_TTL=1h |
| OpenAI / Azure | ✅ (prompt_cache_key) | ✅ | Automatic; key pins routing |
| Bedrock (Claude) | ✅ (cachePoint) | ✅ | Model-gated; SB_BEDROCK_PROMPT_CACHE |
| Bedrock (Nova) | ✅ (cachePoint) | ✅ | Same as Claude on Bedrock |
| Bedrock (other) | ❌ | ✅ | cachePoint rejected by validation |
| Gemini | ❌ | ✅ (if cached ext.) | Needs cachedContent provisioning |
| OpenAI-compatible | ❌ | partial | Server-dependent |
| Ollama / local | ❌ | n/a | Local inference; cache N/A |