Prompt caching¶

SecondBrain runs many turns per session that share a stable prefix — the system prompt, the tool catalog, and (in agentic loops) the same conversation history with one extra user message tacked on. Re-tokenising that prefix every turn is the dominant cost on Claude and an avoidable one on OpenAI and Bedrock as well. The provider layer attaches cache breakpoints automatically so callers do not have to think about it.

What the provider layer does for you¶

For each request, the Anthropic provider inserts up to three cache breakpoints (within Anthropic's 4-breakpoint per-request cap):

Tools — cache_control on the last tool definition; caches the entire tool array.
System — cache_control on the last stable system block (see below); caches the static instructions and any preceding stable blocks.
Last tool_result — cache_control on the most recent tool-call result in the conversation; extends the cached prefix turn-over-turn in an agentic loop, so the next request reads the entire reasoning chain from cache.

Bedrock Converse mirrors this with cachePoint blocks (Claude and Nova model families). OpenAI and Azure-OpenAI receive a deterministic prompt_cache_key (a SHA-256 of the stable prefix) so requests with matching prefixes pin to the same backend in OpenAI's load-balanced fleet.

Gemini and other OpenAI-compatible providers (Groq, Mistral, Fireworks) do not get cache directives — Gemini requires explicit cachedContent resource creation that SecondBrain does not yet drive, and the compatible servers reject unknown payload fields.

The `cache_stable` convention¶

The single most important thing a caller needs to know: mark static system content with cache_stable=True.

In _to_anthropic_messages, the cache breakpoint lands on the last system block flagged cache_stable. If no block is flagged, it falls back to the last system block (legacy behaviour). The Bedrock provider applies the same rule.

This matters because the chat turn preparer and several agent paths append per-turn metadata after the base system prompt — today's date, session id, retrieved memory, the context pack, skill prompts. Without the flag, the breakpoint lands on a dynamic block that rotates every turn, and the cache hit rate collapses to zero.

messages = [
    # Stable: large, identical across turns → cache breakpoint anchors here
    {"role": "system", "content": SYSTEM_PROMPT, "cache_stable": True},
    # Dynamic: per-turn metadata, never cached
    {"role": "system", "content": f"Today is {date.today().isoformat()}"},
    {"role": "system", "content": retrieved_memory_text},
    *history,
    {"role": "user", "content": user_message},
]

If you build a callsite that injects multiple system blocks, mark the stable portion explicitly. If you only ever send one system block, the fallback handles it.

TTL: 5 minutes by default, 1 hour for workflows¶

The default cache TTL is 5 minutes (Anthropic's ephemeral default). Cache writes cost 1.25× input tokens; reads cost 0.1× — break-even is two reads.

For workflows with multi-minute pauses between steps (e.g. an automation that waits on an external job), opt into the 1-hour tier by setting SB_ANTHROPIC_CACHE_TTL=1h. The provider then:

Adds "ttl": "1h" to every cache_control block it emits.
Appends extended-cache-ttl-2025-04-11 to the anthropic-beta header.
Costs writes at 2× input (break-even after 10 reads — only use 1h if the prefix will be reused ≥10× before expiry).

brain.providers.cost.estimate_cost_usd takes a matching cache_write_ttl parameter so cost telemetry stays accurate.

A Bedrock equivalent (SB_BEDROCK_PROMPT_CACHE) controls force-enable / force-disable per model; per-TTL pricing is not yet exposed because Bedrock currently only has the default tier.

What gets billed (Anthropic)¶

The response usage object carries three numbers that the ProviderResult surfaces verbatim:

Field on response	What it means	Price multiplier
`input_tokens`	Uncached input	1.0×
`cache_creation_input_tokens`	Wrote new cache this turn	1.25× / 2.0×
`cache_read_input_tokens`	Read existing cache	0.1×

To verify caching is working in the wild:

Turn 1 of a session: cache_creation_input_tokens > 0, cache_read_input_tokens == 0. The system prompt + tools were written.
Turn 2 onwards: cache_read_input_tokens > 0 (the prefix from turn 1 is being re-read), cache_creation_input_tokens may be small (writing the new extended prefix that ends in this turn's tool_result).
After 5 minutes of idle: reads drop back to 0 and writes spike again.

sb chat displays cache read / write per turn in its end-of-turn summary, and the cost tracker accumulates the totals across the session.

Adding a new caller¶

If you're writing a new agent path that calls provider.generate() directly:

Build messages with a single cache_stable=True system block holding your stable instructions.
Put any per-call metadata in additional system or user blocks after it.
Don't otherwise touch cache_control — the provider handles it.

If you're using generate_with_prompt() via the PromptStore, the rendered system prompt is treated as the stable block by callers that combine it with per-turn metadata via override_messages. See brain/data_agent/agent.py:3957 for the canonical pattern.

Common mistakes¶

Embedding a timestamp or session id in the stable system prompt. This rotates the hash and busts the cache every call. Move it to a separate dynamic block placed after the stable one.
Reordering tools between requests. Tool order affects the cache key; keep registries deterministic.
Treating "supported" in provider_diagnostics as guaranteed caching. Gemini and Bedrock report cache-read tokens but only the Anthropic and OpenAI/Azure paths actually drive cache writes from SecondBrain today.

Provider parity matrix¶

Provider family	Cache writes driven	Cache reads observed	Notes
Anthropic	✅ (cache_control)	✅	5m default; `SB_ANTHROPIC_CACHE_TTL=1h`
OpenAI / Azure	✅ (prompt_cache_key)	✅	Automatic; key pins routing
Bedrock (Claude)	✅ (cachePoint)	✅	Model-gated; `SB_BEDROCK_PROMPT_CACHE`
Bedrock (Nova)	✅ (cachePoint)	✅	Same as Claude on Bedrock
Bedrock (other)	❌	✅	cachePoint rejected by validation
Gemini	❌	✅ (if cached ext.)	Needs cachedContent provisioning
OpenAI-compatible	❌	partial	Server-dependent
Ollama / local	❌	n/a	Local inference; cache N/A