Memory API v1 — Published Quality Scorecard

The Memory API's brand promise is AI you can audit: every memory-bearing response carries a Citation envelope (defined in contracts/memory_api_v1.yaml), and every claim traces back to a chunk you can read. That promise is only credible if we publish quality numbers. This page captures the current scorecard and the methodology that produced it.

Reproduce locally: make eval-memory-api. The runner lives at brain/evals/retrieval/runner.py; the goldens at brain/evals/fixtures/retrieval/seed_goldens.jsonl.

Headline numbers

Run as of 2026-05-08, against the seed retrieval goldens (30 cases, 47 corpus items, hash-mode embeddings for reproducibility):

| Metric | Value | What it means |
| --- | --- | --- |
| citation_recall@10 | 1.00 | Every relevant chunk made it into the top 10 |
| citation_precision@10 | 0.17 | Roughly 1.7 of the top 10 results are relevant on average |
| ndcg@10 | 0.94 | Highly-relevant chunks are ranked near the top |
| mrr | 0.94 | The first relevant chunk shows up at rank ~1.07 on average |
| p50_ms | 0.3 ms | Median retrieval latency (hash mode; production embedders are slower) |
| p95_ms | 0.4 ms | 95th-percentile retrieval latency (hash mode) |
| n_cases | 30 | Number of golden queries scored |

Reading the numbers honestly

Recall is high because the seed goldens use substring-relevance, which is generous — any chunk containing one of the expected substrings counts as relevant. Real-world recall on chunk-id-pinned goldens will be lower; building those goldens is the next quality slice.

Precision is low because the goldens have a small relevant set per query (typically 1–3 chunks) but we return the full top-10. Precision climbs sharply as we tune top_k to the query type — see the adaptive-k logic in brain/retrieve/hybrid.py::_adaptive_k.

Latency here is hash-mode synthetic — bge-m3 / Voyage / Cohere embedders add 30–200 ms per query depending on backend. The published number is the Memory API's retrieval-side overhead; the embedder is a separate budget.

nDCG = 0.94 and MRR = 0.94 are the more meaningful numbers. They mean the Memory API is putting the right answers near the top of the result list, which is what callers feel.
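
To make the rank-based numbers concrete, here is a minimal sketch of how MRR, precision@k and binary-gain nDCG@k can be computed from a ranked list of 0/1 relevance judgements. The function names are illustrative and are not taken from brain/evals/retrieval/runner.py; the runner's gain function and ideal-ranking convention may differ.

```python
import math

def reciprocal_rank(relevance):
    """Return 1/rank of the first relevant hit, or 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

def precision_at_k(relevance, k=10):
    """Fraction of the k returned slots that hold a relevant chunk."""
    return sum(relevance[:k]) / k

def ndcg_at_k(relevance, n_relevant, k=10):
    """Binary-gain nDCG: DCG of this ranking over DCG of the ideal ranking.

    n_relevant is the total number of relevant chunks for the query
    (assumption: the ideal ranking places all of them first).
    """
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(relevance[:k], start=1))
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(n_relevant, k) + 1))
    return dcg / ideal if ideal else 0.0
```

Because MRR averages reciprocal ranks, 1/MRR is a convenient summary rather than the arithmetic mean rank: three queries whose first relevant hits land at ranks 1, 1 and 2 give MRR = (1 + 1 + 0.5) / 3 ≈ 0.83.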

Dogfood scorecard — SecondBrain on SecondBrain

The seed goldens above are synthetic — short queries against a generic corpus. To prove the Memory API works on the kind of corpus adopters actually have, we dogfood it on SecondBrain's own docs and source code. 25 questions a real adopter would ask ("How does multi-tenancy gate behavior?", "What is the chunk_hash format?", "How do I migrate Chroma → LanceDB?"); each query has distinctive substrings (function names, contract terms, decision keys) that uniquely identify the answer chunk.

Reproduce locally: make eval-dogfood. Goldens at brain/evals/fixtures/dogfood/sb_self_query.jsonl; harness at brain/evals/dogfood/runner.py.

Run as of 2026-05-08 — corpus = 30 files (docs + key source) split into ~566 chunks; 25 queries. Comparison against a ripgrep baseline so the Memory API's improvement over "just grep" is honest:

| Metric | Memory API | Grep baseline | Memory API / Grep |
| --- | --- | --- | --- |
| citation_recall@10 | 0.92 | 0.44 | 2.1× |
| citation_precision@10 | 0.33 | 0.10 | 3.5× |
| ndcg@10 | 0.68 | 0.29 | 2.3× |
| mrr | 0.57 | 0.24 | 2.4× |
| p50_ms | 0.9 ms | 14.9 ms | (Memory API 16× faster) |
| p95_ms | 1.3 ms | 19.2 ms | (Memory API 15× faster) |

What the numbers mean

  • The Memory API finds 92% of relevant chunks in the top 10, vs 44% for grep. Two-thirds of the gap (44% → 92%) is the API's tokenized scoring picking up partial-match queries that grep's whole-token regex misses (e.g. "near-duplicate detection" matching _find_near_duplicates).
  • Precision is 3.5× higher because the Memory API's top-10 is densely relevant (3.3 of 10 on average), where grep's top-10 is mostly noise from common tokens (1.0 of 10 on average).
  • MRR = 0.57 means the first relevant chunk shows up at rank ~1.75 on average — the Memory API puts the right answer in the top two slots more often than not. Grep's MRR = 0.24 means the right answer is typically rank 4–5 if it's in the top 10 at all.
  • Latency here is in-memory token-overlap retrieval, not LanceDB + cross-encoder. The production Memory API adds 30–200 ms for the embedder pass, which is more than grep's 15–19 ms on this corpus, so the speedup above applies to the retrieval step only, not to end-to-end production queries.

Honest caveats

  • The dogfood harness uses a simplified token-overlap retriever (TF + set-overlap) rather than the live LanceDB + reranker stack. That's a deliberate choice — the dogfood story is about whether SB's retrieval signals (chunking, scoring, semantic boundaries) beat grep on a real corpus, not whether bge-m3 embeddings are good.
  • Substring relevance is generous: any chunk containing a relevant substring counts. A chunk-id-pinned variant of the goldens (where the expected chunk_hash is fixed) would tighten precision further; on the to-do list once the SB corpus stops churning daily.
  • The grep baseline is rg --max-count=3 with a regex of the query's longer tokens. A more aggressive grep (full file content, bigger context windows, manual ranking by line proximity) could close some of the gap.
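
As an illustration of the grep baseline described in the last caveat, the sketch below builds an rg argument list from a query's longer tokens. The helper name and the `min_len=5` cutoff for "longer" are assumptions; only `rg --max-count=3` comes from the text above, and the real harness may tokenize differently.

```python
import re

def grep_baseline_args(query, corpus_dir, min_len=5):
    """Build an rg command that matches any of the query's longer tokens.

    min_len is a hypothetical cutoff for what counts as a "longer" token.
    Returns None when the query has no token long enough to search for.
    """
    tokens = [t for t in re.findall(r"[A-Za-z0-9_]+", query) if len(t) >= min_len]
    if not tokens:
        return None
    # Tokens contain only word characters, so they are safe to join as alternatives.
    pattern = "|".join(tokens)
    return ["rg", "--max-count=3", "-e", pattern, corpus_dir]
```

The returned list can be handed to `subprocess.run` unchanged; keeping it as a list avoids shell quoting issues with the alternation bars.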

The headline assertion tests/evals/test_dogfood_eval.py::test_dogfood_runner_returns_memory_api_better_than_grep runs in CI and fails if the Memory API ever drops below grep on any of the four quality metrics, or if citation_recall@10 < 0.80. That's the floor we commit to keeping.
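
The floor described above can be sketched as a small check over the two scorecards. This is not the body of the real test; the function name and dictionary shape are assumptions, while the metric names, the four-metric comparison and the 0.80 recall floor come from this page.

```python
QUALITY_METRICS = ("citation_recall@10", "citation_precision@10", "ndcg@10", "mrr")
RECALL_FLOOR = 0.80

def floor_violations(memory_api, grep):
    """List every way the Memory API scorecard breaks the committed floor."""
    problems = [f"{m} below grep" for m in QUALITY_METRICS if memory_api[m] < grep[m]]
    if memory_api["citation_recall@10"] < RECALL_FLOOR:
        problems.append("citation_recall@10 below floor")
    return problems

# Published dogfood numbers from the table above.
memory_api = {"citation_recall@10": 0.92, "citation_precision@10": 0.33,
              "ndcg@10": 0.68, "mrr": 0.57}
grep = {"citation_recall@10": 0.44, "citation_precision@10": 0.10,
        "ndcg@10": 0.29, "mrr": 0.24}
```

A CI test built on this would simply assert that `floor_violations(memory_api, grep)` is empty.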

Cognitive uplift — what each layer contributes

The Memory API surface is the public face; underneath it run the Antahkarana cognitive stack and Autotune evolutionary prompt optimizer. They are the engines that turn a static RAG layer into a self-improving memory. This section attributes Memory API quality improvements to the specific layer that produced them.

Every layer emits a Prometheus counter on /metrics with a stable name and label set, so the question "what is the system actually doing for me this week" has a concrete answer rather than a marketing one.

| Layer | Role | Prometheus metric | What it measures |
| --- | --- | --- | --- |
| Chitta | memory crystallization | secondbrain_chitta_synthesis_total{category} | Compound memory clusters created from access patterns. Each cluster is a synthesized higher-level memory that the system would otherwise have lost in granular noise. |
| Manas | predictive context | secondbrain_manas_preload_total{outcome} | Predicted-context preloads. outcome=hit means Manas surfaced something a query later asked for; miss means it preloaded but nothing matched; skipped means predictive was disabled. The hit/miss ratio is the felt-latency win. |
| Viveka | quality gate | secondbrain_viveka_block_total{reason} | Proposals rejected by the discrimination layer. Each block is a low-quality / overbroad / duplicate / unsafe memory or prompt mutation that didn't pollute the index. |
| Karma | consequence ledger | secondbrain_karma_outcome_total{outcome} | Decisions with measured outcomes attached (positive / negative / neutral / pending). The proportion of decisions with non-pending outcomes is the audit health metric. |
| Autotune | prompt evolution | secondbrain_autotune_mutation_total{outcome} | Lifecycle states of prompt mutations: proposed → landed (passed confirmation) or reverted (regression caught) or rejected (Viveka or operator gated). The landed/proposed ratio is the autotune yield. |

Reading the layers honestly

  • Chitta synthesis runs on a schedule; expect single-digit clusters per day on a normal vault, more after a heavy ingest week. A long zero stretch usually means the source memories haven't crossed the importance threshold — not a bug.
  • Manas hit rate below 30% is fine — the goal isn't perfect prediction, it's catching the obvious follow-on topics. Above 50% is a strong signal the user has predictable workflows.
  • Viveka blocks should be small but non-zero. Zero blocks means the gate isn't firing (suspicious). Many blocks means upstream generation quality dropped.
  • Karma outcome ratio rises slowly — most decisions take days to show outcomes. Track the trend, not the absolute number.
  • Autotune yield (landed / proposed) is the most direct self-improvement metric. A 7-day zero is fine if no mutations were proposed; persistent landed=0 with proposed>0 means confirmation testing is consistently catching regressions, which is good but worth investigating.
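
The ratios discussed above can be derived from a /metrics scrape with a few lines of parsing. The sketch below reads the Prometheus text format directly; the sample counter values are invented for illustration, and treating the Manas hit rate as hit / (hit + miss), with skipped calls excluded, is an assumption.

```python
import re
from collections import defaultdict

SAMPLE_LINE = re.compile(r"^(\w+)\{([^}]*)\}\s+(\S+)$")

def counters_by_label(text, metric, label):
    """Sum one metric's samples grouped by a single label value.

    Other labels (for example a per-tenant workspace label) are summed over.
    """
    totals = defaultdict(float)
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match or match.group(1) != metric:
            continue  # skips comments, blank lines and other metrics
        labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group(2)))
        totals[labels.get(label, "")] += float(match.group(3))
    return dict(totals)

def ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

# Illustrative scrape; these numbers are made up, not measured.
scrape = """
# TYPE secondbrain_manas_preload_total counter
secondbrain_manas_preload_total{outcome="hit"} 12
secondbrain_manas_preload_total{outcome="miss"} 28
secondbrain_manas_preload_total{outcome="skipped"} 5
secondbrain_autotune_mutation_total{outcome="proposed"} 4
secondbrain_autotune_mutation_total{outcome="landed"} 1
"""

manas = counters_by_label(scrape, "secondbrain_manas_preload_total", "outcome")
autotune = counters_by_label(scrape, "secondbrain_autotune_mutation_total", "outcome")
manas_hit_rate = ratio(manas["hit"], manas["hit"] + manas["miss"])
autotune_yield = ratio(autotune["landed"], autotune["proposed"])
```

Returning None for an empty denominator keeps "no mutations proposed this week" distinct from a genuine zero yield, which matches the reading guidance above.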

Wiring status

| Layer | Counter wired | Notes |
| --- | --- | --- |
| Chitta | ✅ wired in CompoundMemorySynthesizer.synthesize | Bumps when synthesis pass creates new clusters |
| Manas | ✅ wired in PredictiveContextLoader.preload | hit/miss/skipped per call |
| Viveka | ✅ wired in VivekaGate.evaluate | Bumps once per failed check on a denial |
| Karma | ✅ wired in KarmaLedger.record | Bumps per ledger insert; "pending" when outcome unset |
| Autotune | ✅ wired in PromptMutationStrategy.apply_mutation | Bumps outcome=landed after a prompt write |

All five layer counters now fire from real call sites; per-tenant attribution is on under SB_MULTI_TENANT=1 (workspace label on every counter, separate SQLite file per tenant under state/cognitive/<ws>/).

The closed loop — Karma → Autotune

Beyond observation, regrets in the Karma ledger now trigger both research direction AND measurable fixtures in Autotune. When karma_autotune_bridge_enabled=True on AntahkaranaConfig, every KarmaEntry with regret=True and a known action_type fans out into two artifacts:

  1. IdeaRecord (source="derived") — what to try. The lesson becomes the hypothesis; autotune's mutation strategies pick it up on the next sb autotune run <lane>.
  2. BenchmarkCandidate (origin="real_failure", status="pending_triage") — how to measure success. The regret's input becomes a probe; after operator triage + enrichment + promotion to a pack, it becomes an evaluation case the lane runs against on every attempt.

Together: regrets are both direction and fixture. Without the fixture half, autotune has lessons but no failing cases to verify mutation impact — exactly what the production dogfood revealed (2026-05-08): mutations rephrased the regret lessons but reverted because baseline = 1.0 with no failing case in the pack.

| Karma action_type | Autotune lane (spec name) |
| --- | --- |
| search_more, direct_answer, buddhi_decision, viveka_evaluation | repl_prompt |
| travel_query, synthesize, briefing_build, repl, prompt_response | repl_prompt |
| vault_query, vault_search, travel_search, grounded_answer | vault_retrieval |
| memory_promotion, memory_retrieval_action, memory_search | memory_retrieval |

Lane values are autotune spec names (snake_case, no dots) — the runner queries IdeaMemory.next_for_lane(lane) by spec name. An earlier round used the dotted evaluator name and silently orphaned the ideas; the regression is now pinned by tests.

Unknown action types skip cleanly. Each idea / candidate is dedupe-tagged karma:<action_id> so a regret fans out at most once per (action_id, lane) pair.

Counter: secondbrain_karma_to_autotune_total{action_type, lane} exposes the closed-loop signal directly.
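
The routing and dedupe rules above can be sketched as a lookup plus a seen-set. The mapping is copied from the table; the dictionary shapes standing in for KarmaEntry, IdeaRecord and BenchmarkCandidate, and the `fan_out` helper itself, are hypothetical simplifications rather than the real classes.

```python
LANE_ACTION_TYPES = {
    "repl_prompt": ("search_more", "direct_answer", "buddhi_decision",
                    "viveka_evaluation", "travel_query", "synthesize",
                    "briefing_build", "repl", "prompt_response"),
    "vault_retrieval": ("vault_query", "vault_search", "travel_search",
                        "grounded_answer"),
    "memory_retrieval": ("memory_promotion", "memory_retrieval_action",
                         "memory_search"),
}
ACTION_TYPE_TO_LANE = {action: lane
                       for lane, actions in LANE_ACTION_TYPES.items()
                       for action in actions}

def fan_out(entry, seen):
    """Turn one regret into an idea and a benchmark candidate, at most once.

    entry is a plain dict standing in for a KarmaEntry (hypothetical shape);
    seen is a set of (dedupe_tag, lane) pairs that were already emitted.
    """
    if not entry.get("regret"):
        return []
    lane = ACTION_TYPE_TO_LANE.get(entry["action_type"])
    if lane is None:
        return []  # unknown action types skip cleanly
    tag = f"karma:{entry['action_id']}"
    if (tag, lane) in seen:
        return []
    seen.add((tag, lane))
    return [
        {"kind": "IdeaRecord", "source": "derived", "lane": lane,
         "tag": tag, "hypothesis": entry["lesson"]},
        {"kind": "BenchmarkCandidate", "origin": "real_failure",
         "status": "pending_triage", "lane": lane, "tag": tag,
         "probe": entry["input"]},
    ]
```

Keying the seen-set on the (tag, lane) pair rather than the tag alone mirrors the "at most once per (action_id, lane) pair" rule stated above.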

Environment Episodes → Autotune

Replayable environment episodes are another closed-loop signal. Low normalized reward or incomplete task groups in environment_episodes are surfaced by SelfImprovementPlanner as environment_score targets. The EnvironmentReplayBridge turns those targets into derived IdeaRecord entries for existing lanes such as repl_prompt, and the self-improvement orchestrator can register matching Sankalpa goals before running bounded autotune attempts.

This is intentionally not a dedicated environment mutation lane yet. The environment subsystem supplies replayable pressure and evidence; lane contracts still decide what files may mutate and how a candidate is scored.

The quality plane also treats environment task groups as scenario suites. A failed environment:<env_id>:<task_id> suite can block sb quality gate --surface environments and can be promoted into an internal_replay:environment_<env_id> suite for future regression pressure. The self-improvement bridge also queues pending autotune benchmark candidates for those weak environment groups so lane evaluators can gain measurable real-failure fixtures after enrichment. Promoting any quality replay case that maps to an autotune lane also creates or reuses a pending quality_replay_case benchmark candidate, keeping the quality replay backlog and autotune triage queue linked by source ids. The candidate id is stored on the replay case metadata and replay evidence link for later audit. Pending real-failure candidates become benchmark_pressure targets in sb autotune self-diagnose, giving the self-improvement planner a deterministic lane signal from the quality replay queue.

Trace process contracts are another quality input. A trace can declare metadata.process_contract or metadata.process_contracts with required tools, events, decisions, evidence labels, forbidden tools, and ordered sequences. The quality summary scores those contracts as dimensions.tool_correctness; failed contracts emit process_contract_failed:<id> regression flags, block the relevant gate, and seed replay-case signatures even when the trace status is completed. Decision-trace replay cases re-evaluate the same contract and persist the observed process evidence plus process_contract_score in the eval case metrics. When promoted, contract-backed replay cases seed autotune benchmark candidates with executable expectations derived from the contract, including required tools, forbidden tools, and local-context properties when the contract requires evidence.
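
A minimal sketch of how such a contract could be scored against the tool calls observed in a trace. The key names (`required_tools`, `forbidden_tools`, `ordered_sequences`) and the fraction-of-checks scoring are assumptions made for illustration; the real contract schema and scoring are not specified on this page, and events, decisions and evidence labels are left out.

```python
def is_subsequence(needle, haystack):
    """True when needle's items appear in haystack in order, gaps allowed."""
    remaining = iter(haystack)
    return all(step in remaining for step in needle)

def score_process_contract(contract, observed_tools):
    """Score one contract against the ordered list of tools a trace called.

    Returns (score, flags), where score is the fraction of checks that passed
    (an assumed scoring rule) and flags carries the regression flag on failure.
    """
    checks = []
    for tool in contract.get("required_tools", []):
        checks.append(tool in observed_tools)
    for tool in contract.get("forbidden_tools", []):
        checks.append(tool not in observed_tools)
    for sequence in contract.get("ordered_sequences", []):
        checks.append(is_subsequence(sequence, observed_tools))
    score = sum(checks) / len(checks) if checks else 1.0
    flags = [] if score == 1.0 else [f"process_contract_failed:{contract['id']}"]
    return score, flags
```

The check is independent of the trace status, which is the point made above: a trace can finish as completed and still fail its contract.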

Replay execution is explicit. sb quality run-replay <case-id> records a fresh eval result against the replay suite and links it back to the replay case evidence chain. Each replay execution gets a new eval run id; the proposal or promotion eval id remains source lineage so repeated executions are auditable as independent evidence. Replay execution is adapter-based: environment episodes replay into fresh fixture environments and score terminal status plus normalized reward; enriched repl/chat cases execute through the repl_prompt runtime prompt evaluator when expected outputs, properties, tools, or schema assertions exist; retrieval replays execute against local memory/source-evidence FTS when expected_refs exist. Prompt replays without executable assertions fall back to the decision-trace adapter, which loads the persisted trace envelope and records a failed replay result when the source trace still carries failed, blocked, error, or failure-signature evidence. Replay families without an executable adapter are marked not_runnable and recorded as failed evidence instead of passing implicitly.

Autotune benchmark promotion is strict by default: a candidate must be reviewed, fully enriched, linked to source evidence, and backed by executable assertions. Property assertions are accepted only when a lane evaluator has a registered contract for that property; unknown or replay-only properties fail closed instead of becoming no-op passes. Operators can use sb autotune benchmark promote --force for exceptional migrations, and the force reason list is persisted in case metadata.

Autotune run promotion is also evidence-gated. Each candidate run records a scientific_evidence summary comparing paired baseline and candidate cases: sample size, primary metric delta, case wins/losses/ties, mean case delta, and a 95% lower confidence bound. Kept runs whose required paired-case evidence is underpowered do not create automatic promotion bundles; they remain reviewable run artifacts instead of being treated as proven self-improvement.
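
A paired summary of this kind can be sketched with the standard library. The field names below are illustrative, and the lower bound uses a one-sided normal approximation (z = 1.645), which is an assumption; the real scientific_evidence computation may use a different interval, and a t-based bound would be more conservative for small samples.

```python
import math
from statistics import mean, stdev

def paired_evidence(baseline, candidate, z=1.645):
    """Summarise per-case score deltas between paired baseline and candidate runs."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be non-empty and paired")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    mean_delta = mean(deltas)
    # A single pair carries no variance estimate, so the bound is unbounded below.
    std_error = stdev(deltas) / math.sqrt(n) if n > 1 else float("inf")
    return {
        "sample_size": n,
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
        "mean_case_delta": mean_delta,
        "lower_bound_95": mean_delta - z * std_error,
    }
```

A gate built on this would treat a run as proven only when the lower bound clears zero on enough paired cases; the exact thresholds for "underpowered" are not stated here.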

When sb autotune improve runs, the self-improvement cycle is also persisted with target evidence, source episode ids, and seeded benchmark candidate ids, then projected into quality as a self_improvement run. The persisted result includes the aggregate Antahkarana outcome for the cycle and the target-to-pack execution plan for every selected target. sb autotune improve --explain-plan prints that plan without mutating state, including source benchmark candidate ids, replay case ids, environment episode ids, required pack, and confirmation pack. Promotion gates enforce the same plan: a kept run cannot get a PromotionBundle when required replay cases were not rerun, failed, the wrong pack was evaluated, source benchmark candidate coverage is missing, unresolved failed replay suites remain, candidate properties lack executable contracts, or force-promoted candidates lack explicit review. High-severity target pressure escalates to the hard pack, and the derived confirmation pack is used by the confirmation evaluator rather than only being stored as metadata.

Each selected target also becomes a self_improvement_experiments record with hypothesis, source evidence, control, intervention, trial runs, confirmation evidence, closure decision, metrics, links, deterministic reward components, and verdict. Intervention records include exact candidate commits, files touched, diff summaries, mutation strategies, models, and token usage when run evidence provides them. The reward record keeps task reward, promotion-gate reward, Antahkarana outcome reward, pressure-reduction reward, regression penalty, safety penalty, complexity penalty, weights, and a bounded overall score together with the experiment. Cycle statuses are outcome-oriented: success, partial, no_effect, invalid, blocked, failed, or dry_run.

Quality projections include those experiment ids and payloads, derive the projected self_improvement score from experiment rewards, and emit explicit regression flags for missing experiments, malformed rewards, failed or underpowered verdicts, and non-dry-run low rewards. Those flags carry surface targets, so a mixed cycle can block only the affected lane surface plus autotune and all. Severe flags such as failed, missing, malformed, or safety-penalized experiments block the quality gate; underpowered, inconclusive, or not-run experiments route to HITL. This makes the self-improvement claim inspectable as a scientific experiment.

sb autotune cycle show <cycle-id>, sb autotune benchmark unresolved --lane <lane>, and sb quality replay-results expose the loop state for operators. Benchmark candidates carry resolution_status, resolved_by_run_id, resolved_at, and resolution_evidence; fixed sources stop contributing active pressure, while still-failing sources keep recurrence metadata. If the cycle promotes a run for a karma_regret target, matching active Karma regrets are marked purified so the loop records that the pressure was addressed. Prior non-dry-run cycles that repeatedly stall without promotion are also scanned by sb autotune self-diagnose as cycle_stall targets.

Reproduce

# Hermetic, no effect on real state:
.venv/bin/python scripts/dogfood_karma_autotune.py

# Real ledger, dry-run (default) or commit:
.venv/bin/python scripts/dogfood_karma_real.py
.venv/bin/python scripts/dogfood_karma_real.py --commit

# Heuristic enrichment of every pending karma candidate:
sb autotune benchmark karma-enrich --lane repl_prompt

# Promote enriched cases to a pack, then run autotune:
sb autotune benchmark promote <candidate_id> --pack recent_failures
sb autotune run repl_prompt --pack recent_failures

Vocabulary extension — 2026-05-08

The earlier round's bottleneck was the prompt evaluator's controlled vocabulary: complete_answer just checked for ≥5 output words, so karma cases passed by default and the metric stuck at 1.0. Resolved by adding three property keys with strict-evidence semantics:

| Key | Failure pattern it catches |
| --- | --- |
| no_generic_deflection | Output deflects with "more context required" / "could you clarify" on a declarative input |
| addresses_factual_query | Output is bland or abstract ("you should consider...") with zero factual indicators (digits, time/date words, direct-answer openers) |
| identifies_specific_target | Output drifts into "let me search broadly" without naming the actual target |

Strict semantics apply to the latter two: they require positive evidence of the right behavior, not just absence of the wrong one. A bland canned output that says nothing concrete FAILS: to prove the prompt addressed the query, the output must actually contain the answer.
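
The difference between the two kinds of check can be sketched as follows. The word lists and the opener pattern are illustrative guesses at what "factual indicators" and deflection phrases might look like; they are not the evaluator's actual vocabulary.

```python
import re

# Hypothetical indicator lists; the real evaluator's lists are not shown here.
TIME_WORDS = {"today", "tomorrow", "yesterday", "monday", "tuesday", "wednesday",
              "thursday", "friday", "saturday", "sunday"}
DIRECT_OPENER = re.compile(r"^\s*(yes|no|it is|the answer is)\b", re.IGNORECASE)
DEFLECTIONS = ("more context required", "could you clarify")

def addresses_factual_query(output):
    """Strict check: pass only on positive evidence of a concrete answer."""
    words = set(re.findall(r"[a-z]+", output.lower()))
    has_digit = any(ch.isdigit() for ch in output)
    return has_digit or bool(words & TIME_WORDS) or bool(DIRECT_OPENER.match(output))

def no_generic_deflection(output):
    """Absence check: pass unless a known deflection phrase appears."""
    text = output.lower()
    return not any(phrase in text for phrase in DEFLECTIONS)
```

A bland output such as "You should consider several options." passes the absence check but fails the strict one, which is exactly the case the extended vocabulary was added to catch.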

The karma enricher's pattern table now maps the three production regret lessons onto these keys (with complete_answer still firing additively as a sanity check):

| Karma lesson trigger | Now expects |
| --- | --- |
| "directly address the factual query" / "specific answer" | addresses_factual_query |
| "first identify the specific [meeting/target]" | identifies_specific_target |
| "more context required is a dead-end" / "declarative … specific suggestion" | no_generic_deflection |

Real-data result: After re-enriching the 3 production karma candidates with the extended vocabulary and re-promoting them to recent_failures, autotune's baseline on the lane dropped from 1.0 → 0.948718. The karma cases now register as failures; the loop has measurable headroom for mutations to demonstrate improvement.

Runtime extension — 2026-05-08

The vocabulary extension moved the bottleneck down to the deterministic runtime stub: DeterministicPromptProvider produced the same canned output regardless of prompt edits for scenarios it didn't recognize. So the metric stayed flat at 0.948: every mutation produced the same output, by definition.

Resolved by adding three runtime scenarios that vary output by system-prompt content:

| Scenario | Trigger phrases (good output) | Bad output |
| --- | --- | --- |
| factual_query_answer | "directly address", "factual query", "specific answer" | abstract ("you should consider...") |
| narrow_target_first | "identify the specific", "first identify", "narrow down" | broad-search opener ("Let me search broadly...") |
| no_generic_deflection | "specific suggestion", "proactive suggestion", "be proactive" | "I'd need more context..." |

_benchmark_case_to_runtime_case maps the corresponding controlled-vocabulary keys to these scenarios so karma-derived cases route correctly: addresses_factual_query → factual_query_answer, identifies_specific_target → narrow_target_first, no_generic_deflection → no_generic_deflection.
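
The behavior of a prompt-sensitive deterministic stub can be sketched as a trigger-phrase lookup. Trigger phrases, bad outputs and the key-to-scenario mapping are taken from this section; the `canned_output` helper and the placeholder good output are hypothetical and do not reflect the real DeterministicPromptProvider interface.

```python
# scenario -> (trigger phrases, bad output)
SCENARIOS = {
    "factual_query_answer": (
        ("directly address", "factual query", "specific answer"),
        "You should consider...",
    ),
    "narrow_target_first": (
        ("identify the specific", "first identify", "narrow down"),
        "Let me search broadly...",
    ),
    "no_generic_deflection": (
        ("specific suggestion", "proactive suggestion", "be proactive"),
        "I'd need more context...",
    ),
}
PROPERTY_TO_SCENARIO = {
    "addresses_factual_query": "factual_query_answer",
    "identifies_specific_target": "narrow_target_first",
    "no_generic_deflection": "no_generic_deflection",
}
GOOD_OUTPUT = "<concrete, on-target answer>"  # placeholder, not a real canned string

def canned_output(property_key, system_prompt):
    """Return the good output only when the prompt contains a trigger phrase."""
    triggers, bad_output = SCENARIOS[PROPERTY_TO_SCENARIO[property_key]]
    prompt = system_prompt.lower()
    return GOOD_OUTPUT if any(t in prompt for t in triggers) else bad_output
```

Because the output now depends on the system prompt, a mutation that inserts a trigger phrase changes the evaluated output, which is what gives the metric room to move.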

End-to-end demonstration:

                     prompt_task_success_rate (recent_failures pack)
                     ─────────────────────────
Before karma cases:                       1.000   (no failing cases)
+ karma cases, weak vocab:                1.000   (complete_answer too lenient)
+ extended vocab, strict semantics:       0.948   (factual / target /
                                                   deflection failures
                                                   register against bland
                                                   canned output)
+ runtime branches for new scenarios:     0.923   (bad outputs from new
                                                   scenarios add more failures)
+ trigger phrases inserted into prompt:   1.000   ← measurable WIN
                                                   ("first identify the
                                                    specific", "directly
                                                    address", "be proactive",
                                                    "specific suggestion")

The 7.7-point gap between 0.923 and 1.0 is the measurable headroom autotune can climb when its mutation strategy aligns its edits to the trigger phrases.

Honest limit — mutation alignment, not the eval surface

The autonomous mutation generator hasn't yet learned to insert the trigger phrases reliably; observed mutations on --dry-run are mostly YAML reformatting and unrelated content. The eval surface is ready (vocabulary + runtime + benchmark cases all wired and proven climbable); what's left is the strategy that proposes prompt edits.

Two paths forward (out of scope for this phase):

  1. Hint via --idea: the runner already accepts a free-form idea that biases candidate ordering. Operators can pass karma lessons directly as hints during a tuning session.
  2. Strategy alignment: the prompt-mutation strategy's LLM provider/template should consume IdeaMemory entries on the lane (the karma-derived ideas are already there) and steer mutations toward those phrases. Today the strategy ignores IdeaMemory in the run loop.

The full causal chain is now demonstrably wired:

KarmaLedger.record(regret)
    ├─ propose_idea_from_karma         → IdeaRecord (direction)
    └─ propose_benchmark_candidate     → BenchmarkCandidate (fixture)
                karma-enrich CLI       → controlled-vocab properties
                promote --pack         → BenchmarkCase in recent_failures
                extended vocabulary    → property fails on bland output
                runtime scenarios      → output varies by prompt content
                autotune run           → metric climb headroom (0.923→1.0)

# Counter scrape from a running Memory API:
make quickstart-docker
sleep 60
curl -s http://localhost:8765/metrics | grep -E '(chitta|manas|viveka|karma|autotune)'

Citation envelope invariants (always-on, gated by tests)

These aren't measured — they're enforced. The contract test suite at tests/memory/test_memory_api_contract.py and the end-to-end test at tests/memory/test_memory_api_v1_e2e.py assert that every memory-bearing response across the Memory API v1 route set carries:

  • a non-empty chunk_hash (16-hex stable content-addressed identifier)
  • a parseable retrieved_at ISO-8601 timestamp
  • a score ∈ [0, 1] when present (sigmoid-normalised at the Memory API boundary)
  • a source_path pointing into the workspace vault

Grounded answers additionally enforce the citation density gate: an answer is rewritten with a [citation_gate:insufficient_evidence] marker if it carries fewer than min_citations valid citations. This is on by default; opt out with enforce_citations=False on GroundedEnvironment.
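
The invariants and the density gate can be sketched as a validator over a citation dictionary. Field names follow the list above; the helper names, the lowercase-hex assumption, the purely lexical path check and the choice to prefix the marker to the answer are all assumptions, not the behavior of the contract tests or of GroundedEnvironment.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

HEX16 = re.compile(r"^[0-9a-f]{16}$")  # assumes lowercase hex

def envelope_violations(citation, vault_root):
    """Return the names of the envelope fields that break an invariant."""
    problems = []
    if not HEX16.match(str(citation.get("chunk_hash") or "")):
        problems.append("chunk_hash")
    try:
        # Older Python versions reject a trailing "Z", so normalise it first.
        stamp = str(citation.get("retrieved_at") or "")
        datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    except ValueError:
        problems.append("retrieved_at")
    score = citation.get("score")
    if score is not None and not 0.0 <= score <= 1.0:
        problems.append("score")
    # Lexical containment only; ".." segments are not resolved in this sketch.
    path = PurePosixPath(str(citation.get("source_path") or ""))
    if PurePosixPath(vault_root) not in path.parents:
        problems.append("source_path")
    return problems

def apply_citation_gate(answer, citations, min_citations, vault_root):
    """Mark the answer when too few citations pass the envelope checks."""
    valid = [c for c in citations if not envelope_violations(c, vault_root)]
    if len(valid) < min_citations:
        return f"[citation_gate:insufficient_evidence] {answer}"
    return answer
```

Counting only citations that pass the envelope checks means a malformed citation cannot be used to satisfy the density gate.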

Method

make eval-memory-api
# Internally: sb embeddings benchmark --goldens brain/evals/fixtures/retrieval/seed_goldens.jsonl --json

The runner:

  1. Indexes the corpus into a fresh in-memory store
  2. Runs each query through the configured embedder + retriever
  3. Computes recall@k, precision@k, nDCG@10, MRR, and per-query latency
  4. Emits the Memory API scorecard via RetrievalEvalRunner.to_memory_api_scorecard()

Goldens are JSONL records of the form:

{"id": "case-1", "query": "...", "relevant_substrings": ["..."]}

relevant_chunk_ids is also supported for chunk-id-pinned goldens — the preferred long-term form, since substrings drift with content edits.
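
The two relevance modes can be sketched as one judgement function over a golden record. The field names come from the record format above; the helper names, the case-sensitive substring match and the rule that pinned ids take precedence when both fields are present are assumptions.

```python
import json

def load_goldens(lines):
    """Parse JSONL golden records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def is_relevant(golden, chunk_id, chunk_text):
    """Judge one retrieved chunk against one golden record.

    Pinned chunk ids win when present (assumed precedence); otherwise any
    expected substring found in the chunk text counts as relevant.
    """
    pinned = golden.get("relevant_chunk_ids")
    if pinned:
        return chunk_id in pinned
    return any(s in chunk_text for s in golden.get("relevant_substrings", []))

def relevance_vector(golden, ranked_chunks):
    """Map a ranked list of (chunk_id, text) pairs to 0/1 relevance flags."""
    return [int(is_relevant(golden, cid, text)) for cid, text in ranked_chunks]
```

The resulting 0/1 vector is the input that rank metrics such as recall@k, nDCG and MRR are computed from, so switching a golden from substrings to pinned ids changes the judgements without changing the metric code.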

What's coming

  • Chunk-id-pinned goldens for tighter precision/recall measurement
  • Per-route latency budget in tests/memory/test_memory_api_v1_e2e.py so the Memory API carries a published p50/p95 budget per HTTP route, not just retrieval
  • Provider matrix — same goldens across bge-m3 / Voyage / Cohere / cross-encoder rerank, published as a comparison table in this doc