Memory API v1: Published Quality Scorecard
The Memory API's brand promise is AI you can audit: every memory-bearing
response carries a Citation envelope (defined in contracts/memory_api_v1.yaml),
and every claim traces back to a chunk you can read. That promise is only
credible if we publish quality numbers. This page captures the current
scorecard and the methodology that produced it.
Reproduce locally:
make eval-memory-api. The runner lives at brain/evals/retrieval/runner.py; the goldens at brain/evals/fixtures/retrieval/seed_goldens.jsonl.
Headline numbers
Run as of 2026-05-08, against the seed retrieval goldens (30 cases, 47 corpus items, hash-mode embeddings for reproducibility):
| Metric | Value | What it means |
|---|---|---|
| citation_recall@10 | 1.00 | Every relevant chunk made it into the top 10 |
| citation_precision@10 | 0.17 | Roughly 1.7 of the top 10 results are relevant on average |
| ndcg@10 | 0.94 | Highly-relevant chunks are ranked near the top |
| mrr | 0.94 | The first relevant chunk shows up at rank ~1.07 on average |
| p50_ms | 0.3 ms | Median retrieval latency (hash mode; production embedders are slower) |
| p95_ms | 0.4 ms | 95th-percentile retrieval latency (hash mode) |
| n_cases | 30 | Number of golden queries scored |
Reading the numbers honestly
Recall is high because the seed goldens use substring-relevance, which is generous — any chunk containing one of the expected substrings counts as relevant. Real-world recall on chunk-id-pinned goldens will be lower; building those goldens is the next quality slice.
Precision is low because the goldens have a small relevant set per query
(typically 1–3 chunks) but we return the full top-10. Precision climbs
sharply as we tune top_k to the query type — see the adaptive-k logic in
brain/retrieve/hybrid.py::_adaptive_k.
Latency here is hash-mode synthetic — bge-m3 / Voyage / Cohere embedders add 30–200 ms per query depending on backend. The published number is the Memory API's retrieval-side overhead; the embedder is a separate budget.
nDCG = 0.94 and MRR = 0.94 are the more meaningful numbers. They mean the Memory API is putting the right answers near the top of the result list, which is what callers feel.
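The effect of substring relevance and a fixed top-10 on these numbers can be sketched as follows. This is a minimal illustration, not the code in brain/evals/retrieval/runner.py; the function names and the input shape (lists of chunk texts) are assumptions.

```python
from typing import Optional, Sequence


def is_relevant(chunk_text: str, expected_substrings: Sequence[str]) -> bool:
    """Substring relevance: any expected substring appearing in the chunk counts."""
    return any(s in chunk_text for s in expected_substrings)


def score_query(ranked: Sequence[str], corpus: Sequence[str],
                expected_substrings: Sequence[str], k: int = 10) -> dict:
    """Score one query; `ranked` is the retriever's ordered list of chunk texts."""
    hits = [is_relevant(c, expected_substrings) for c in ranked[:k]]
    total_relevant = sum(is_relevant(c, expected_substrings) for c in corpus)
    first_hit: Optional[int] = next(
        (i for i, h in enumerate(hits, start=1) if h), None
    )
    return {
        "recall": sum(hits) / total_relevant if total_relevant else 0.0,
        # Fixed denominator k: the full top-k is always returned.
        "precision": sum(hits) / k,
        "reciprocal_rank": 1.0 / first_hit if first_hit else 0.0,
    }
```

With only two relevant chunks for a query, precision@10 cannot exceed 0.2 even when recall is perfect, which is the pattern the headline table shows.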
Dogfood scorecard: SecondBrain on SecondBrain
The seed goldens above are synthetic — short queries against a generic corpus. To prove the Memory API works on the kind of corpus adopters actually have, we dogfood it on SecondBrain's own docs and source code. 25 questions a real adopter would ask ("How does multi-tenancy gate behavior?", "What is the chunk_hash format?", "How do I migrate Chroma → LanceDB?"); each query has distinctive substrings (function names, contract terms, decision keys) that uniquely identify the answer chunk.
Reproduce locally:
make eval-dogfood. Goldens at brain/evals/fixtures/dogfood/sb_self_query.jsonl; harness at brain/evals/dogfood/runner.py.
Run as of 2026-05-08 — corpus = 30 files (docs + key source) split into ~566 chunks; 25 queries. Comparison against a ripgrep baseline so the Memory API's improvement over "just grep" is honest:
| Metric | Memory API | Grep baseline | Memory API / Grep |
|---|---|---|---|
| citation_recall@10 | 0.92 | 0.44 | 2.1× |
| citation_precision@10 | 0.33 | 0.10 | 3.5× |
| ndcg@10 | 0.68 | 0.29 | 2.3× |
| mrr | 0.57 | 0.24 | 2.4× |
| p50_ms | 0.9 ms | 14.9 ms | (Memory API 16× faster) |
| p95_ms | 1.3 ms | 19.2 ms | (Memory API 15× faster) |
What the numbers mean
- The Memory API finds 92% of relevant chunks in the top 10, vs 44% for grep. Two-thirds of the gap (44% → 92%) is the API's tokenized scoring picking up partial-match queries that grep's whole-token regex misses (e.g. "near-duplicate detection" matching _find_near_duplicates).
- Precision is 3.5× higher because the Memory API's top-10 is densely relevant (3.3 of 10 on average), where grep's top-10 is mostly noise from common tokens (1.0 of 10 on average).
- MRR = 0.57 means the first relevant chunk shows up at rank ~1.75 on average, so the Memory API puts the right answer in the top two slots more often than not. Grep's MRR = 0.24 means the right answer is typically rank 4–5 if it's in the top 10 at all.
- Latency here is in-memory token-overlap retrieval, not LanceDB + cross-encoder. The production Memory API adds 30–200 ms for the embedder pass, which is more than grep's 14.9 ms p50 on this corpus, so the latency advantage shown here applies to the retrieval step only, not to end-to-end production queries.
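The "rank ~1.75" and "rank 4–5" readings above come from inverting MRR. A small sketch of that arithmetic follows; the helper name is hypothetical. Strictly speaking, 1/MRR is the harmonic mean of the first-hit ranks, so it understates the arithmetic mean rank when ranks vary widely.

```python
def implied_rank(mrr: float) -> float:
    """Invert a mean reciprocal rank into a typical first-hit rank."""
    if not 0.0 < mrr <= 1.0:
        raise ValueError("MRR must be in (0, 1]")
    return 1.0 / mrr


memory_api_rank = implied_rank(0.57)  # Memory API on the dogfood goldens
grep_rank = implied_rank(0.24)        # grep baseline on the same goldens
```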
Honest caveats
- The dogfood harness uses a simplified token-overlap retriever (TF + set-overlap) rather than the live LanceDB + reranker stack. That's a deliberate choice: the dogfood story is about whether SB's retrieval signals (chunking, scoring, semantic boundaries) beat grep on a real corpus, not whether bge-m3 embeddings are good.
- Substring relevance is generous: any chunk containing a relevant substring counts. A chunk-id-pinned variant of the goldens (where the expected chunk_hash is fixed) would tighten precision further; it is on the to-do list once the SB corpus stops churning daily.
- The grep baseline is rg --max-count=3 with a regex of the query's longer tokens. A more aggressive grep (full file content, bigger context windows, manual ranking by line proximity) could close some of the gap.
The headline assertion tests/evals/test_dogfood_eval.py::test_dogfood_runner_returns_memory_api_better_than_grep
runs in CI and fails if the Memory API ever drops below grep on any of the
four quality metrics, or if citation_recall@10 < 0.80. That's the
floor we commit to keeping.
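The floor described above can be expressed as a small check. This is a sketch of the rule as stated, not the body of test_dogfood_runner_returns_memory_api_better_than_grep; the function name and the dict-shaped scorecards are assumptions.

```python
QUALITY_METRICS = ("citation_recall@10", "citation_precision@10", "ndcg@10", "mrr")
RECALL_FLOOR = 0.80


def floor_violations(memory_api: dict, grep: dict) -> list:
    """Return human-readable violations; an empty list means the floor holds."""
    violations = [
        f"{metric}: memory_api {memory_api[metric]} < grep {grep[metric]}"
        for metric in QUALITY_METRICS
        if memory_api[metric] < grep[metric]
    ]
    if memory_api["citation_recall@10"] < RECALL_FLOOR:
        violations.append(f"citation_recall@10 below floor {RECALL_FLOOR}")
    return violations
```

Feeding it the dogfood table values returns an empty list; a recall of 0.75 would trip the floor even while still beating grep.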
Cognitive uplift: what each layer contributes
The Memory API surface is the public face; underneath it run the Antahkarana cognitive stack and Autotune evolutionary prompt optimizer. They are the engines that turn a static RAG layer into a self-improving memory. This section attributes Memory API quality improvements to the specific layer that produced them.
Every layer emits a Prometheus counter on /metrics with a stable
name and label set, so the question "what is the system actually
doing for me this week" has a concrete answer rather than a marketing
one.
| Layer | Role | Prometheus metric | What it measures |
|---|---|---|---|
| Chitta | memory crystallization | secondbrain_chitta_synthesis_total{category} | Compound memory clusters created from access patterns. Each cluster is a synthesized higher-level memory that the system would otherwise have lost in granular noise. |
| Manas | predictive context | secondbrain_manas_preload_total{outcome} | Predicted-context preloads. outcome=hit means Manas surfaced something a query later asked for; miss means it preloaded but nothing matched; skipped means predictive was disabled. The hit/miss ratio is the felt-latency win. |
| Viveka | quality gate | secondbrain_viveka_block_total{reason} | Proposals rejected by the discrimination layer. Each block is a low-quality / overbroad / duplicate / unsafe memory or prompt mutation that didn't pollute the index. |
| Karma | consequence ledger | secondbrain_karma_outcome_total{outcome} | Decisions with measured outcomes attached (positive / negative / neutral / pending). The proportion of decisions with non-pending outcomes is the audit health metric. |
| Autotune | prompt evolution | secondbrain_autotune_mutation_total{outcome} | Lifecycle states of prompt mutations: proposed → landed (passed confirmation) or reverted (regression caught) or rejected (Viveka or operator gated). The landed/proposed ratio is the autotune yield. |
Reading the layers honestly
- Chitta synthesis runs on a schedule; expect single-digit clusters per day on a normal vault, more after a heavy ingest week. A long zero stretch usually means the source memories haven't crossed the importance threshold — not a bug.
- Manas hit rate below 30% is fine — the goal isn't perfect prediction, it's catching the obvious follow-on topics. Above 50% is a strong signal the user has predictable workflows.
- Viveka blocks should be small but non-zero. Zero blocks means the gate isn't firing (suspicious). Many blocks means upstream generation quality dropped.
- Karma outcome ratio rises slowly — most decisions take days to show outcomes. Track the trend, not the absolute number.
- Autotune yield (landed / proposed) is the most direct self-improvement metric. A 7-day zero is fine if no mutations were proposed; persistent landed=0 with proposed>0 means confirmation testing is consistently catching regressions, which is good but worth investigating.
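The ratios discussed in this list can be derived from raw counter values along these lines. The function and key names are hypothetical, and defining the Manas hit rate as hit / (hit + miss), with skipped calls excluded, is an assumption rather than something the counters themselves dictate.

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return None instead of dividing by zero when nothing was counted."""
    return numerator / denominator if denominator else None


def layer_ratios(manas: dict, karma: dict, autotune: dict) -> dict:
    """Each argument maps a label value (e.g. 'hit') to its counter total."""
    hits = manas.get("hit", 0)
    resolved = sum(v for outcome, v in karma.items() if outcome != "pending")
    return {
        "manas_hit_rate": safe_ratio(hits, hits + manas.get("miss", 0)),
        "karma_resolved_share": safe_ratio(resolved, sum(karma.values())),
        "autotune_yield": safe_ratio(
            autotune.get("landed", 0), autotune.get("proposed", 0)
        ),
    }
```

Returning None for an empty denominator keeps "no mutations proposed this week" distinguishable from a genuine zero yield.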
Wiring status
| Layer | Counter wired | Notes |
|---|---|---|
| Chitta | ✅ wired in CompoundMemorySynthesizer.synthesize | Bumps when synthesis pass creates new clusters |
| Manas | ✅ wired in PredictiveContextLoader.preload | hit/miss/skipped per call |
| Viveka | ✅ wired in VivekaGate.evaluate | Bumps once per failed check on a denial |
| Karma | ✅ wired in KarmaLedger.record | Bumps per ledger insert; "pending" when outcome unset |
| Autotune | ✅ wired in PromptMutationStrategy.apply_mutation | Bumps outcome=landed after a prompt write |
All five layer counters now fire from real call sites; per-tenant
attribution is on under SB_MULTI_TENANT=1 (workspace label on every
counter, separate SQLite file per tenant under state/cognitive/<ws>/).
The closed loop: Karma → Autotune
Beyond observation, regrets in the Karma ledger now trigger both
research direction AND measurable fixtures in Autotune. When
karma_autotune_bridge_enabled=True on AntahkaranaConfig, every
KarmaEntry with regret=True and a known action_type fans out
into two artifacts:
- IdeaRecord(source="derived"): what to try. The lesson becomes the hypothesis; autotune's mutation strategies pick it up on the next sb autotune run <lane>.
- BenchmarkCandidate(origin="real_failure", status="pending_triage"): how to measure success. The regret's input becomes a probe; after operator triage + enrichment + promotion to a pack, it becomes an evaluation case the lane runs against on every attempt.
Together: regrets are both direction and fixture. Without the fixture half, autotune has lessons but no failing cases to verify mutation impact — exactly what the production dogfood revealed (2026-05-08): mutations rephrased the regret lessons but reverted because baseline = 1.0 with no failing case in the pack.
| Karma action_type | Autotune lane (spec name) |
|---|---|
| search_more, direct_answer, buddhi_decision, viveka_evaluation | repl_prompt |
| travel_query, synthesize, briefing_build, repl, prompt_response | repl_prompt |
| vault_query, vault_search, travel_search, grounded_answer | vault_retrieval |
| memory_promotion, memory_retrieval_action, memory_search | memory_retrieval |
Lane values are autotune spec names (snake_case, no dots) — the
runner queries IdeaMemory.next_for_lane(lane) by spec name. An
earlier round used the dotted evaluator name and silently orphaned
the ideas; the regression is now pinned by tests.
Unknown action types skip cleanly. Each idea / candidate is
dedupe-tagged karma:<action_id> so a regret fans out at most once
per (action_id, lane) pair.
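The routing and dedupe rules above can be sketched as a lookup plus a seen-set. The mapping is taken from the table; the function name, the in-memory set, and the returned dict shape are assumptions for illustration, not the bridge's actual interface.

```python
from typing import Optional, Set, Tuple

ACTION_TYPE_TO_LANE = {
    **dict.fromkeys(
        ["search_more", "direct_answer", "buddhi_decision", "viveka_evaluation",
         "travel_query", "synthesize", "briefing_build", "repl", "prompt_response"],
        "repl_prompt",
    ),
    **dict.fromkeys(
        ["vault_query", "vault_search", "travel_search", "grounded_answer"],
        "vault_retrieval",
    ),
    **dict.fromkeys(
        ["memory_promotion", "memory_retrieval_action", "memory_search"],
        "memory_retrieval",
    ),
}


def fan_out(action_id: str, action_type: str,
            seen: Set[Tuple[str, str]]) -> Optional[dict]:
    """Return a fan-out descriptor, or None for unknown types and repeats."""
    lane = ACTION_TYPE_TO_LANE.get(action_type)
    if lane is None:
        return None  # unknown action types skip cleanly
    key = (action_id, lane)
    if key in seen:
        return None  # at most one fan-out per (action_id, lane) pair
    seen.add(key)
    return {"lane": lane, "dedupe_tag": f"karma:{action_id}"}
```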
Counter:
secondbrain_karma_to_autotune_total{action_type, lane} exposes the
closed-loop signal directly.
Environment Episodes → Autotune
Replayable environment episodes are another closed-loop signal. Low normalized
reward or incomplete task groups in environment_episodes are surfaced by
SelfImprovementPlanner as environment_score targets. The
EnvironmentReplayBridge turns those targets into derived IdeaRecord entries
for existing lanes such as repl_prompt, and the self-improvement orchestrator
can register matching Sankalpa goals before running bounded autotune attempts.
This is intentionally not a dedicated environment mutation lane yet. The environment subsystem supplies replayable pressure and evidence; lane contracts still decide what files may mutate and how a candidate is scored.
The quality plane also treats environment task groups as scenario suites. A
failed environment:<env_id>:<task_id> suite can block
sb quality gate --surface environments and can be promoted into an
internal_replay:environment_<env_id> suite for future regression pressure.
The self-improvement bridge also queues pending autotune benchmark candidates
for those weak environment groups so lane evaluators can gain measurable
real-failure fixtures after enrichment. Promoting any quality replay case that
maps to an autotune lane also creates or reuses a pending
quality_replay_case benchmark candidate, keeping the quality replay backlog
and autotune triage queue linked by source ids. The candidate id is stored on
the replay case metadata and replay evidence link for later audit. Pending
real-failure candidates become benchmark_pressure targets in
sb autotune self-diagnose, giving the self-improvement planner a deterministic
lane signal from the quality replay queue.
Trace process contracts are another quality input. A trace can declare
metadata.process_contract or metadata.process_contracts with required tools,
events, decisions, evidence labels, forbidden tools, and ordered sequences. The
quality summary scores those contracts as dimensions.tool_correctness; failed
contracts emit process_contract_failed:<id> regression flags, block the
relevant gate, and seed replay-case signatures even when the trace status is
completed. Decision-trace replay cases re-evaluate the same contract and
persist the observed process evidence plus process_contract_score in the eval
case metrics. When promoted, contract-backed replay cases seed autotune
benchmark candidates with executable expectations derived from the contract,
including required tools, forbidden tools, and local-context properties when the
contract requires evidence.
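A process contract check of the kind described above might look like the following sketch. The contract keys (id, required_tools, forbidden_tools, ordered_sequence) and the three-way score are hypothetical; only the process_contract_failed:<id> flag format comes from the text.

```python
def evaluate_contract(contract: dict, observed_tools: list) -> dict:
    """Check required tools, forbidden tools, and ordering against a trace."""
    used = set(observed_tools)
    failures = []

    missing = sorted(set(contract.get("required_tools", [])) - used)
    if missing:
        failures.append("missing_tools:" + ",".join(missing))

    forbidden = sorted(set(contract.get("forbidden_tools", [])) & used)
    if forbidden:
        failures.append("forbidden_tools:" + ",".join(forbidden))

    # Subsequence test: each step must appear after the previous one,
    # though unrelated tool calls may sit in between.
    remaining = iter(observed_tools)
    if not all(step in remaining for step in contract.get("ordered_sequence", [])):
        failures.append("sequence_violated")

    checks = 3  # illustrative scoring: equal weight per check family
    flags = [f"process_contract_failed:{contract['id']}"] if failures else []
    return {"score": (checks - len(failures)) / checks,
            "failures": failures, "flags": flags}
```

Because the check reads only the observed tool calls, it can fail a trace whose status is completed, matching the behaviour described above.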
Replay execution is explicit. sb quality run-replay <case-id> records a fresh
eval result against the replay suite and links it back to the replay case
evidence chain. Each replay execution gets a new eval run id; the proposal or
promotion eval id remains source lineage so repeated executions are auditable as
independent evidence. Replay execution is adapter-based: environment episodes
replay into fresh fixture environments and score terminal status plus normalized
reward; enriched repl/chat cases execute through the repl_prompt runtime
prompt evaluator when expected outputs, properties, tools, or schema assertions
exist; retrieval replays execute against local memory/source-evidence FTS when
expected_refs exist. Prompt replays without executable assertions fall back to
the decision-trace adapter, which loads the persisted trace envelope and records
a failed replay result when the source trace still carries failed, blocked,
error, or failure-signature evidence. Replay families without an executable
adapter are marked not_runnable and recorded as failed evidence instead of
passing implicitly.
Autotune benchmark promotion is strict by default: a candidate must be reviewed,
fully enriched, linked to source evidence, and backed by executable assertions.
Property assertions are accepted only when a lane evaluator has a registered
contract for that property; unknown or replay-only properties fail closed instead
of becoming no-op passes. Operators can use sb autotune benchmark promote
--force for exceptional migrations, and the force reason list is persisted in
case metadata.
Autotune run promotion is also evidence-gated. Each candidate run records a
scientific_evidence summary comparing paired baseline and candidate cases:
sample size, primary metric delta, case wins/losses/ties, mean case delta, and a
95% lower confidence bound. Kept runs whose required paired-case evidence is
underpowered do not create automatic promotion bundles; they remain reviewable
run artifacts instead of being treated as proven self-improvement.
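The paired-case summary can be illustrated with a short sketch. The field names, the one-sided normal approximation (z = 1.645), and the minimum-case threshold are assumptions; the text above names the quantities but does not say how the lower bound is computed.

```python
import math
import statistics


def paired_evidence(baseline: list, candidate: list,
                    z: float = 1.645, min_cases: int = 10) -> dict:
    """Summarize per-case deltas between paired baseline and candidate scores."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be paired case by case")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    mean_delta = statistics.fmean(deltas) if n else 0.0
    if n >= 2:
        standard_error = statistics.stdev(deltas) / math.sqrt(n)
        lower_bound = mean_delta - z * standard_error
    else:
        lower_bound = float("-inf")  # no variance estimate from under 2 cases
    return {
        "sample_size": n,
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
        "mean_case_delta": mean_delta,
        "lower_bound_95": lower_bound,
        # Illustrative rule: too few cases, or a bound that includes zero.
        "underpowered": n < min_cases or lower_bound <= 0,
    }
```

A run with one win and three ties has a positive mean delta but a lower bound below zero, so under this rule it would stay a reviewable artifact rather than produce a promotion bundle.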
When sb autotune improve runs, the self-improvement cycle is also persisted
with target evidence, source episode ids, and seeded benchmark candidate ids,
then projected into quality as a self_improvement run. The persisted result
includes the aggregate Antahkarana outcome for the cycle and the target-to-pack
execution plan for every selected target. sb autotune improve --explain-plan
prints that plan without mutating state, including source benchmark candidate
ids, replay case ids, environment episode ids, required pack, and confirmation
pack. Promotion gates enforce the same plan: a kept run cannot get a
PromotionBundle when required replay cases failed or were not rerun, the wrong
pack was evaluated, source benchmark candidate coverage is missing, unresolved
failed replay suites remain, candidate properties lack executable contracts, or
force-promoted candidates lack explicit review. High-severity target pressure
escalates to the hard pack, and the derived confirmation pack is used by the
confirmation evaluator rather than only being stored as metadata.
Each selected target also becomes a self_improvement_experiments record with
hypothesis, source evidence, control, intervention, trial runs, confirmation
evidence, closure decision, metrics, links, deterministic reward components,
and verdict. Intervention records include exact candidate commits, files
touched, diff summaries, mutation strategies, models, and token usage when run
evidence provides them. The reward record keeps task reward, promotion-gate
reward, Antahkarana outcome reward, pressure-reduction reward, regression
penalty, safety penalty, complexity penalty, weights, and a bounded overall
score together with the experiment.
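A bounded overall score over those components could be combined as in the sketch below. The weights, the key names, and the [0, 1] bound are illustrative assumptions; the text lists the components but not the formula.

```python
REWARD_KEYS = ("task", "promotion_gate", "antahkarana_outcome", "pressure_reduction")
PENALTY_KEYS = ("regression_penalty", "safety_penalty", "complexity_penalty")

# Illustrative weights only; the real values are not given in this document.
DEFAULT_WEIGHTS = {
    "task": 0.4, "promotion_gate": 0.2,
    "antahkarana_outcome": 0.2, "pressure_reduction": 0.2,
    "regression_penalty": 0.5, "safety_penalty": 1.0, "complexity_penalty": 0.1,
}


def overall_score(components: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted rewards minus weighted penalties, clamped to [0, 1]."""
    gained = sum(weights[k] * components.get(k, 0.0) for k in REWARD_KEYS)
    lost = sum(weights[k] * components.get(k, 0.0) for k in PENALTY_KEYS)
    return max(0.0, min(1.0, gained - lost))
```

Clamping keeps the score deterministic and comparable across experiments even when penalties outweigh rewards.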
Cycle statuses are outcome-oriented: success, partial, no_effect,
invalid, blocked, failed, or dry_run. Quality projections include those
experiment ids and payloads, derive the projected self_improvement score from
experiment rewards, and emit explicit regression flags for missing experiments,
malformed rewards, failed or underpowered verdicts, and non-dry-run low
rewards. Those flags carry surface targets, so a mixed cycle can block only the
affected lane surface plus autotune and all. Severe flags such as failed,
missing, malformed, or safety-penalized experiments block the quality gate;
underpowered, inconclusive, or not-run experiments route to HITL. This makes the
self-improvement claim inspectable as a scientific experiment. sb autotune
cycle show <cycle-id>, sb autotune benchmark unresolved --lane <lane>, and
sb quality replay-results expose the loop state for operators. Benchmark
candidates carry resolution_status, resolved_by_run_id, resolved_at, and
resolution_evidence; fixed sources stop contributing active pressure, while
still-failing sources keep recurrence metadata. If the cycle promotes a run for
a karma_regret target,
matching active Karma regrets are marked purified so the loop records that the
pressure was addressed. Prior non-dry-run cycles that repeatedly stall without
promotion are also scanned by sb autotune self-diagnose as cycle_stall
targets.
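The gate routing described above (severe flags block, weak-evidence flags go to HITL, and a block covers the affected lane surface plus autotune and all) can be sketched like this. The verdict strings and the assumption that a lane name doubles as its surface name are hypothetical.

```python
BLOCKING_VERDICTS = frozenset({"failed", "missing", "malformed", "safety_penalized"})
HITL_VERDICTS = frozenset({"underpowered", "inconclusive", "not_run"})


def route_cycle(verdicts_by_lane: dict) -> dict:
    """Map per-lane experiment verdicts to blocked surfaces and HITL lanes."""
    blocked, hitl = set(), set()
    for lane, verdict in verdicts_by_lane.items():
        if verdict in BLOCKING_VERDICTS:
            # Block only the affected lane surface, plus the aggregate surfaces.
            blocked.update({lane, "autotune", "all"})
        elif verdict in HITL_VERDICTS:
            hitl.add(lane)
    return {"blocked_surfaces": sorted(blocked), "hitl_lanes": sorted(hitl)}
```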
Reproduce
# Hermetic, no effect on real state:
.venv/bin/python scripts/dogfood_karma_autotune.py
# Real ledger, dry-run (default) or commit:
.venv/bin/python scripts/dogfood_karma_real.py
.venv/bin/python scripts/dogfood_karma_real.py --commit
# Heuristic enrichment of every pending karma candidate:
sb autotune benchmark karma-enrich --lane repl_prompt
# Promote enriched cases to a pack, then run autotune:
sb autotune benchmark promote <candidate_id> --pack recent_failures
sb autotune run repl_prompt --pack recent_failures
Vocabulary extension (2026-05-08)
The earlier round's bottleneck was the prompt evaluator's controlled
vocabulary: complete_answer just checked for ≥5 output words, so
karma cases passed by default and the metric stuck at 1.0. Resolved
by adding three property keys with strict-evidence semantics:
| Key | Failure pattern it catches |
|---|---|
| no_generic_deflection | Output deflects with "more context required" / "could you clarify" on a declarative input |
| addresses_factual_query | Output is bland or abstract ("you should consider...") with zero factual indicators (digits, time/date words, direct-answer openers) |
| identifies_specific_target | Output drifts into "let me search broadly" without naming the actual target |
Strict semantics apply to the latter two: they require positive evidence of the right behavior, not just absence of the wrong one. A bland canned output that says nothing concrete FAILS, because showing that the prompt addressed the query requires the output to actually contain the answer.
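The difference between an absence check and a positive-evidence check can be sketched as follows. The phrase lists and heuristics are illustrative stand-ins, not the evaluator's actual vocabulary; only the property names and the kinds of indicators (digits, time/date words, direct-answer openers) come from the table above.

```python
import re

DEFLECTION_PHRASES = ("more context required", "could you clarify")
TIME_WORDS = {"today", "tomorrow", "yesterday", "monday", "tuesday", "wednesday",
              "thursday", "friday", "saturday", "sunday", "am", "pm"}
DIRECT_OPENERS = ("the answer is", "it is ", "yes, ", "no, ")


def no_generic_deflection(output: str) -> bool:
    """Absence check: passes unless a known deflection phrase appears."""
    text = output.lower()
    return not any(phrase in text for phrase in DEFLECTION_PHRASES)


def addresses_factual_query(output: str) -> bool:
    """Positive-evidence check: passes only if a factual indicator is present."""
    text = output.lower().lstrip()
    has_digit = re.search(r"\d", text) is not None
    has_time_word = any(w in TIME_WORDS for w in re.findall(r"[a-z]+", text))
    has_opener = text.startswith(DIRECT_OPENERS)
    return has_digit or has_time_word or has_opener
```

A bland sentence such as "You should consider several options." passes the absence check but fails the positive-evidence check, which is the behaviour the strict semantics are meant to produce.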
The karma enricher's pattern table now maps the three production
regret lessons onto these keys (with complete_answer still firing
additively as a sanity check):
| Karma lesson trigger | Now expects |
|---|---|
| "directly address the factual query" / "specific answer" | addresses_factual_query |
| "first identify the specific [meeting/target]" | identifies_specific_target |
| "more context required is a dead-end" / "declarative … specific suggestion" | no_generic_deflection |
Real-data result: After re-enriching the 3 production karma
candidates with the extended vocabulary and re-promoting them to
recent_failures, autotune's baseline on the lane dropped from
1.0 → 0.948718. The karma cases now register as failures; the
loop has measurable headroom for mutations to demonstrate
improvement.
Runtime extension (2026-05-08)
The vocabulary extension moved the bottleneck down to the
deterministic runtime stub: DeterministicPromptProvider produced
the same canned output regardless of prompt edits for scenarios it
didn't recognize. So the metric stayed flat at the post-vocabulary baseline of 0.948: every
mutation produced the same output, by definition.
Resolved by adding three runtime scenarios that vary output by system-prompt content:
| Scenario | Trigger phrases (good output) | Bad output |
|---|---|---|
| factual_query_answer | "directly address", "factual query", "specific answer" | abstract "you should consider..." |
| narrow_target_first | "identify the specific", "first identify", "narrow down" | broad-search opener ("Let me search broadly...") |
| no_generic_deflection | "specific suggestion", "proactive suggestion", "be proactive" | "I'd need more context..." |
_benchmark_case_to_runtime_case maps the corresponding
controlled-vocabulary keys to these scenarios so karma-derived cases route
correctly: addresses_factual_query → factual_query_answer,
identifies_specific_target → narrow_target_first,
no_generic_deflection → no_generic_deflection.
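A deterministic stub whose output depends on the system prompt could be sketched as below. The trigger phrases, bad outputs, and property-to-scenario mapping follow the text above; the good outputs, the function name, and the dict layout are hypothetical and do not describe DeterministicPromptProvider's real interface.

```python
PROPERTY_TO_SCENARIO = {
    "addresses_factual_query": "factual_query_answer",
    "identifies_specific_target": "narrow_target_first",
    "no_generic_deflection": "no_generic_deflection",
}

SCENARIOS = {
    "factual_query_answer": {
        "triggers": ("directly address", "factual query", "specific answer"),
        "good": "The review is at 3pm on Tuesday.",  # hypothetical good output
        "bad": "You should consider...",
    },
    "narrow_target_first": {
        "triggers": ("identify the specific", "first identify", "narrow down"),
        "good": "The target is the Q3 planning meeting.",  # hypothetical
        "bad": "Let me search broadly...",
    },
    "no_generic_deflection": {
        "triggers": ("specific suggestion", "proactive suggestion", "be proactive"),
        "good": "Suggestion: block 30 minutes tomorrow.",  # hypothetical
        "bad": "I'd need more context...",
    },
}


def run_scenario(scenario: str, system_prompt: str) -> str:
    """Return the good output only when the prompt contains a trigger phrase."""
    spec = SCENARIOS[scenario]
    prompt = system_prompt.lower()
    triggered = any(phrase in prompt for phrase in spec["triggers"])
    return spec["good"] if triggered else spec["bad"]
```

Because the output now depends on prompt content, a mutation that inserts a trigger phrase changes the result, which is what gives the metric room to move.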
End-to-end demonstration:
prompt_task_success_rate (recent_failures pack)
─────────────────────────
Before karma cases: 1.000 (no failing cases)
+ karma cases, weak vocab: 1.000 (complete_answer too lenient)
+ extended vocab, strict semantics: 0.948 (factual / target /
deflection failures
register against bland
canned output)
+ runtime branches for new scenarios: 0.923 (bad outputs from new
scenarios add more failures)
+ trigger phrases inserted into prompt: 1.000 ← measurable WIN
("first identify the
specific", "directly
address", "be proactive",
"specific suggestion")
The 7.7-point gap between 0.923 and 1.0 is the measurable headroom autotune can climb when its mutation strategy aligns its edits to the trigger phrases.
Honest limit: mutation alignment, not the eval surface
The autonomous mutation generator hasn't yet learned to insert the
trigger phrases reliably; observed mutations on --dry-run are
mostly YAML reformatting and unrelated content. The eval surface is
ready (vocabulary + runtime + benchmark cases all wired and proven
climbable); what's left is the strategy that proposes prompt edits.
Two paths forward (out of scope for this phase):
- Hint via --idea: the runner already accepts a free-form idea that biases candidate ordering. Operators can pass karma lessons directly as hints during a tuning session.
- Strategy alignment: the prompt-mutation strategy's LLM provider/template should consume IdeaMemory entries on the lane (the karma-derived ideas are already there) and steer mutations toward those phrases. Today the strategy ignores IdeaMemory in the run loop.
The full causal chain is now demonstrably wired:
KarmaLedger.record(regret)
├─ propose_idea_from_karma → IdeaRecord (direction)
└─ propose_benchmark_candidate → BenchmarkCandidate (fixture)
↓
karma-enrich CLI → controlled-vocab properties
↓
promote --pack → BenchmarkCase in recent_failures
↓
extended vocabulary → property fails on bland output
↓
runtime scenarios → output varies by prompt content
↓
autotune run → metric climb headroom (0.923→1.0)
# Counter scrape from a running Memory API:
make quickstart-docker
sleep 60
curl -s http://localhost:8765/metrics | grep -E '(chitta|manas|viveka|karma|autotune)'
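The scraped text can also be read programmatically. The sketch below is a simplified parser for the Prometheus text exposition format, written for illustration; it ignores escaped quotes in label values and timestamps, and the function name and secondbrain_ prefix filter are assumptions.

```python
import re
from collections import defaultdict

SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_counters(text: str, prefix: str = "secondbrain_") -> dict:
    """Return {metric_name: {sorted_label_pairs: value}} for matching samples."""
    parsed = defaultdict(dict)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or not match.group("name").startswith(prefix):
            continue
        labels = tuple(sorted(LABEL_RE.findall(match.group("labels") or "")))
        parsed[match.group("name")][labels] = float(match.group("value"))
    return dict(parsed)
```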
Citation envelope invariants (always-on, gated by tests)
These aren't measured — they're enforced. The contract test suite at
tests/memory/test_memory_api_contract.py and the end-to-end test at
tests/memory/test_memory_api_v1_e2e.py assert that every memory-bearing response
across the Memory API v1 route set carries:
- a non-empty chunk_hash (16-hex stable content-addressed identifier)
- a parseable retrieved_at ISO-8601 timestamp
- a score ∈ [0, 1] when present (sigmoid-normalised at the Memory API boundary)
- a source_path pointing into the workspace vault
Grounded answers additionally enforce the citation density gate: an
answer is rewritten with a [citation_gate:insufficient_evidence] marker
if it carries fewer than min_citations valid citations. This is on by
default; opt out with enforce_citations=False on GroundedEnvironment.
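The invariants and the density gate can be approximated with a short validator. This is a sketch, not the contract test: the field names and marker text come from this page, while the function names, the default of min_citations=1, prefixing the marker to the answer, and accepting upper-case hex are assumptions. Vault containment of source_path is not checked here.

```python
import re
from datetime import datetime

HEX16_RE = re.compile(r"^[0-9a-fA-F]{16}$")


def envelope_errors(citation: dict) -> list:
    """Return the names of envelope fields that violate an invariant."""
    errors = []
    if not HEX16_RE.match(str(citation.get("chunk_hash", ""))):
        errors.append("chunk_hash")
    try:
        # Normalise a trailing Z so older Python versions can parse it.
        datetime.fromisoformat(str(citation.get("retrieved_at", "")).replace("Z", "+00:00"))
    except ValueError:
        errors.append("retrieved_at")
    score = citation.get("score")
    if score is not None and not 0.0 <= score <= 1.0:
        errors.append("score")  # score is optional but must be in [0, 1]
    if not citation.get("source_path"):
        errors.append("source_path")
    return errors


def apply_citation_gate(answer: str, citations: list, min_citations: int = 1) -> str:
    """Mark the answer when too few citations pass the envelope checks."""
    valid = [c for c in citations if not envelope_errors(c)]
    if len(valid) < min_citations:
        return f"[citation_gate:insufficient_evidence] {answer}"
    return answer
```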
Method
make eval-memory-api
# Internally: sb embeddings benchmark --goldens brain/evals/fixtures/retrieval/seed_goldens.jsonl --json
The runner:
- Indexes the corpus into a fresh in-memory store
- Runs each query through the configured embedder + retriever
- Computes recall@k, precision@k, nDCG@10, MRR, and per-query latency
- Emits the Memory API scorecard via
RetrievalEvalRunner.to_memory_api_scorecard()
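For the nDCG@10 step, a standard formulation is sketched below. It uses linear gains with a log2 position discount; whether the runner uses linear or exponential gains is not stated here, so treat that choice and the function names as assumptions.

```python
import math


def dcg(gains: list) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(position + 1)
               for position, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains: list, all_gains: list, k: int = 10) -> float:
    """Normalise the DCG of the returned order by the best achievable DCG.

    ranked_gains: relevance of each returned chunk, in rank order.
    all_gains: relevance of every relevant chunk in the corpus.
    """
    ideal = dcg(sorted(all_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal else 0.0
```

A relevant chunk at rank 3 instead of rank 2 lowers the score only slightly, which is why nDCG stays high when the right answers sit near the top.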
Goldens are JSONL records of the form:
relevant_chunk_ids is also supported for chunk-id-pinned goldens — the
preferred long-term form, since substrings drift with content edits.
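Loading goldens and choosing the relevance mode per record might look like the sketch below. Only relevant_chunk_ids is a field name taken from this page; query and expected_substrings are hypothetical names used to make the example concrete, and the loader itself is illustrative.

```python
import json


def load_goldens(lines) -> list:
    """Parse JSONL goldens and tag each with the relevance mode it supports."""
    goldens = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Prefer pinned chunk ids when present; fall back to substrings.
        mode = "chunk_id" if record.get("relevant_chunk_ids") else "substring"
        goldens.append({**record, "mode": mode})
    return goldens
```

Any iterable of strings works as input, including an open file handle.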
What's coming
- Chunk-id-pinned goldens for tighter precision/recall measurement
- Per-route latency budget in tests/memory/test_memory_api_v1_e2e.py so the Memory API carries a published p50/p95 budget per HTTP route, not just retrieval
- Provider matrix: same goldens across bge-m3 / Voyage / Cohere / cross-encoder rerank, published as a comparison table in this doc