Quality Control Plane
The quality control plane unifies runtime traces, eval suites, grounded trajectories, autotune runs, and conversation improvement reports into one operator surface.
Use it when you need one answer to these questions:
- is the system healthy right now
- what regressed
- which surface is under-covered
- should this change be allowed, degraded, escalated to HITL, or blocked
Primary Entry Points
Use sb quality as the default operator entrypoint:
sb quality summary
sb quality runs
sb quality show <quality-run-id>
sb quality suites
sb quality improvement-issues
sb quality replay-cases
sb quality promote-replay <replay-case-id>
sb quality run-replay <replay-case-id>
sb quality gate --surface all
sb quality harness audit --json
sb quality harness review --role reliability --changed-only --json
sb quality harness scan --changed-only --json
sb quality harness process --json
sb quality harness revalidate --json
sb quality harness enrich --json
sb quality harness status --json
sb quality harness export --format json
sb quality harness gc --last 7d --json
The same materialized control-plane data is exposed in sb serve at:
- GET /quality/summary
- GET /quality/runs
- GET /quality/runs/{quality_run_id}
- GET /quality/suites
- GET /quality/improvement-issues
- GET /quality/replay-cases
- GET /quality/replay-results
- POST /quality/replay-cases/{case_id}/promote
- POST /quality/replay-cases/{case_id}/run
- GET /quality/gates/latest
- GET /autotune/cycles/{cycle_id}
- GET /autotune/benchmark/unresolved?lane=<lane>
The browser UI now prefers the /quality route for run health and drill-down. Existing /runs and /traces routes remain available for compatibility.
Operating Harness Checks
sb quality harness is the offline operating-harness layer. It does not call
live providers and does not edit files. It reports the local constraints and
review pressure that help coding agents work predictably:
- audit checks coverage gaps, oversized agent-hostile files, missing remediation hints, skill validation metadata, docs hygiene, and pending replay coverage.
- review --role <role> runs a deterministic role check. Supported roles are reliability, security-policy, runtime-contract, frontend-ux, docs-help, and test-strategy.
- gc turns repeated failures, replay backlog, missing validation metadata, and weak deterministic checks into human-reviewable improvement proposals.
- scan, process, revalidate, enrich, status, and export form the persisted local pipeline for candidate findings. Scan stores matcher candidates in the work database, process materializes deduplicated findings, revalidate marks current verdicts, enrich adds recent git context, status reports counts, and export emits JSON or per-finding Markdown.
Harness reports are structured for agents and CI. Every finding includes an
id, severity, surface, role, evidence payload, remediation fix_hint,
and gate recommendation. Audit reports also separate coverage gaps, stale rules,
oversized agent-hostile source files, missing remediation hints, and missing
replay/eval coverage so follow-up agents can route work without reading the
whole repository.
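As a rough illustration of how a follow-up agent or CI step might route work from such a report, the sketch below groups findings by role and collects the ones that recommend BLOCK. The top-level findings key, the gate key name, and the sample payload are assumptions made for this example; only the list of finding fields comes from the description above.

```python
import json
from collections import defaultdict

# Hypothetical report payload: key names are assumed, not taken from real sb output.
SAMPLE_REPORT = json.dumps({
    "findings": [
        {"id": "f1", "severity": "high", "surface": "retrieval", "role": "reliability",
         "evidence": {"file": "a.py"}, "fix_hint": "add a timeout", "gate": "BLOCK"},
        {"id": "f2", "severity": "low", "surface": "chat", "role": "docs-help",
         "evidence": {"file": "cli.py"}, "fix_hint": "refresh CLI docs", "gate": "ALLOW"},
    ]
})


def route_findings(report_json):
    """Group finding ids by role and list the ids whose gate recommendation is BLOCK."""
    report = json.loads(report_json)
    by_role = defaultdict(list)
    blocking = []
    for finding in report.get("findings", []):
        by_role[finding["role"]].append(finding["id"])
        if finding.get("gate") == "BLOCK":
            blocking.append(finding["id"])
    return dict(by_role), blocking


by_role, blocking = route_findings(SAMPLE_REPORT)
```

In practice the JSON would come from a command such as sb quality harness audit --json, and the key names should be checked against its actual output before relying on them.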
Trace Process Contracts
Runtime traces can declare expected process evidence in
metadata.process_contract or metadata.process_contracts. The quality plane
evaluates those contracts during sb quality summary, sb quality runs, and
sb quality gate, so a completed trace can still fail quality if it skipped a
required tool, event, decision, evidence item, or ordered step sequence.
Supported contract keys:
{
"id": "grounded_answer",
"required_tools": ["search_notes"],
"required_events": ["eval.response_faithfulness"],
"required_decisions": ["verify_answer"],
"required_evidence": ["source:policy"],
"required_sequence": ["tool:search_notes", "event:eval.response_faithfulness"],
"forbidden_tools": ["unsafe_write"]
}
Failed contracts set dimensions.tool_correctness, add a
process_contract_failed:<id> regression flag, block the relevant quality gate,
and seed replay backlog pressure for the affected trace. This keeps process
drift visible even when the final text looks correct. Running
sb quality run-replay <case-id> on one of these cases records the contract
checks, observed process evidence, and process_contract_score in the replay
result. Promoting one of these cases into benchmark pressure carries executable
contract expectations into the seeded candidate, such as required or forbidden
tools and local-context requirements.
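The contract check can be pictured with a small sketch. This is not the quality plane's implementation: the shape of the observed evidence (sets of names plus an ordered steps list) is hypothetical, and required_sequence is assumed to mean an ordered, not necessarily contiguous, subsequence of the observed steps.

```python
CONTRACT = {
    "id": "grounded_answer",
    "required_tools": ["search_notes"],
    "required_events": ["eval.response_faithfulness"],
    "required_decisions": ["verify_answer"],
    "required_evidence": ["source:policy"],
    "required_sequence": ["tool:search_notes", "event:eval.response_faithfulness"],
    "forbidden_tools": ["unsafe_write"],
}

REQUIRED_KEYS = [
    ("required_tools", "tools"),
    ("required_events", "events"),
    ("required_decisions", "decisions"),
    ("required_evidence", "evidence"),
]


def check_contract(contract, observed):
    """Return failure strings; an empty list means the contract passed.

    `observed` is a hypothetical dict of sets (tools, events, decisions,
    evidence) plus an ordered `steps` list such as "tool:search_notes".
    """
    failures = []
    for key, kind in REQUIRED_KEYS:
        for name in contract.get(key, []):
            if name not in observed.get(kind, ()):
                failures.append(f"missing_{kind}:{name}")
    for name in contract.get("forbidden_tools", []):
        if name in observed.get("tools", ()):
            failures.append(f"forbidden_tool:{name}")
    # Assumed semantics: each wanted step must appear after the previous match.
    steps = iter(observed.get("steps", []))
    for wanted in contract.get("required_sequence", []):
        if not any(step == wanted for step in steps):
            failures.append(f"sequence_broken_at:{wanted}")
            break
    return failures


OK_TRACE = {
    "tools": {"search_notes"},
    "events": {"eval.response_faithfulness"},
    "decisions": {"verify_answer"},
    "evidence": {"source:policy"},
    "steps": ["tool:search_notes", "decision:verify_answer",
              "event:eval.response_faithfulness"],
}
BAD_TRACE = {"tools": {"unsafe_write"}, "steps": ["tool:unsafe_write"]}

bad_failures = check_contract(CONTRACT, BAD_TRACE)
# A failing contract maps onto the regression flag format named above.
flag = f"process_contract_failed:{CONTRACT['id']}" if bad_failures else None
```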
The persisted harness pipeline is deterministic and offline-safe. It uses a
local matcher registry with noise tiers (precise, normal, noisy) for
runtime reliability, security-policy, prompt-boundary, integration side-effect,
filesystem-boundary, frontend, docs/help, and test-strategy pressure. Pipeline
state lives in the work database tables quality_harness_runs and
quality_harness_file_records; BLOCK findings are folded into the composite
quality gate.
Docs hygiene turns the agent-operability rules into checks. It verifies that
AGENTS.md stays a routing table, docs/INDEX.md matches the current docs
tree, active plans keep required lifecycle metadata, and active plans are listed
in both MkDocs nav and the component test-entrypoint table.
Review roles are intentionally narrow:
| Role | Primary pressure |
|---|---|
| reliability | timeouts, network/process calls, and brittle runtime behavior |
| security-policy | unsafe execution patterns and policy-sensitive code |
| runtime-contract | stream contracts, schema drift, and agent-hostile source files |
| frontend-ux | changed UI files that need ergonomic and visual review |
| docs-help | user-facing CLI changes without docs or schema refreshes |
| test-strategy | source changes that lack focused tests |
The harness GC command is advisory in v1. It reads existing quality summaries,
replay backlog, repeated regressions, missing fix hints, and skill validation
metadata, then emits proposals for humans to approve. It never edits AGENTS.md,
skills, docs, or replay cases on its own.
Multi-agent campaign descriptors live with background sessions. They describe a goal, role assignments, token/time budgets, success metrics, review roles, and merge gates. Write-capable role assignments default to task-workspace isolation so campaign runners can coordinate multiple agents without sharing one mutable checkout.
Canonical Quality Model
Phase 1 keeps raw stores where they already live and adds one normalized quality layer on top:
- runtime DB: events, tool_calls, decision_traces
- work DB: eval runs, autotune runs, improvement artifacts, materialized quality summaries
- grounded store: persisted grounded reasoning trajectories
- work DB environment tables: replayable environment_episodes and steps
Each control-plane record is summarized into a canonical QualityRunSummary keyed by QualityRunRef.
Quality dimensions
- completion_rate
- failure_rate
- blocked_rate
- latency_p50_ms
- latency_p95_ms
- retrieval_quality
- groundedness
- faithfulness
- tool_correctness
- tool_restraint
- clarity
- overall_score
Eval Families
Every major route or lane should map into one or more of these families:
| Family | What it covers | Typical sources |
|---|---|---|
| Runtime health | trace coverage, failures, blocked runs, latency, approval friction | decision traces, runtime events, tool calls |
| Deterministic regressions | golden SQL, workflow, and artifact checks | sb eval run, artifact evals |
| Behavioral judges | runtime prompt judges and session judges | eval.* events, conversation improvement judges |
| Scenario / trajectory evals | retrieval-heavy and replay-heavy scenarios | grounded evals, autotune packs, simulation guards |
Any surface without at least one deterministic or scenario check is treated as a coverage_gap.
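The coverage rule can be sketched in a few lines. The dictionary shape and the surface named experimental are hypothetical; the rule itself (a surface with neither a deterministic nor a scenario check is a gap, regardless of judge checks) is the one stated above.

```python
def coverage_gaps(surfaces):
    """Return names of surfaces that have neither deterministic nor scenario checks."""
    return sorted(
        name
        for name, checks in surfaces.items()
        if not checks.get("deterministic") and not checks.get("scenario")
    )


# Hypothetical coverage data; "experimental" is an invented surface for illustration.
SURFACES = {
    "chat": {
        "deterministic": ["improvement synthetic guards"],
        "judge": ["faithfulness / conversation judges"],
        "scenario": ["session replay and improvement runs"],
    },
    "experimental": {"judge": ["prompt judges"]},  # judge-only coverage is still a gap
}

gaps = coverage_gaps(SURFACES)
```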
Coverage Matrix
The control plane tracks an explicit coverage matrix in code.
Current surfaces:
| Surface | Route / lane | Source of truth | Deterministic checks | Judge checks | Scenario checks | Gate inputs |
|---|---|---|---|---|---|---|
| chat | repl_prompt | runtime decision traces + prompt judges + improvement runs | improvement synthetic guards | faithfulness / conversation judges | session replay and improvement runs | trace health + judge aggregates + improvement guards |
| retrieval | vault_retrieval | traces + grounded trajectories + prompt judges | retrieval-oriented eval suites | retrieval quality / faithfulness | grounded eval families | trace health + grounded suites + judges |
| cognition | antahkarana_loop | Antahkarana impulse/impression traces + autotune experience artifacts | full-loop cognitive fixtures | optional prompt/layer judges | cognitive-loop benchmark packs | layer outcome rates + safety blocks + trace health |
| environments | sb env / /environments | persisted environment episodes in work.db | replay comparison and verifier checks | none by default | environment manifests and stored episodes | normalized reward + terminal status + replay result |
| data | SQL / workflow surfaces | eval harness summaries | golden eval suites | optional runtime judges | artifact scenarios | trace health + deterministic suite status |
| autotune | benchmark lanes | autotune runs + scientific evidence + self-improvement experiments | confirmation guards + paired-case evidence + reward components | judge summaries | benchmark packs, confirmations, and self-improvement experiments | trace health + benchmark, scientific-evidence, reward, and confirmation outcomes |
Gate Policy
The composite gate wraps the existing release-gate evaluator instead of replacing it.
Possible decisions:
- ALLOW: deterministic checks pass and no degrade or block condition is active
- DEGRADE: latency regressed, blocked rate regressed, or judge scores dropped
- HITL: not enough sample size or coverage confidence to trust an automated decision, including underpowered or incomplete self-improvement experiment evidence
- BLOCK: relevant deterministic or scenario suites failed, trace coverage is broken, runtime failure spikes crossed severe thresholds, or a relevant self-improvement experiment is missing, malformed, failed, or safety-blocked
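One way to read these decisions is as an ordered resolution over active signals. The sketch below is a simplification under stated assumptions: the signal names are invented for illustration, and the precedence BLOCK over HITL over DEGRADE over ALLOW is assumed rather than confirmed by the release-gate evaluator.

```python
from enum import IntEnum


class Decision(IntEnum):
    ALLOW = 0
    DEGRADE = 1
    HITL = 2
    BLOCK = 3


# Hypothetical signal names, grouped by the decision they trigger.
BLOCK_SIGNALS = frozenset({
    "deterministic_failed", "scenario_failed", "trace_coverage_broken",
    "severe_failure_spike", "experiment_invalid",
})
HITL_SIGNALS = frozenset({
    "low_sample_size", "low_coverage_confidence", "experiment_underpowered",
})
DEGRADE_SIGNALS = frozenset({
    "latency_regressed", "blocked_rate_regressed", "judge_scores_dropped",
})


def decide(active_signals):
    """Resolve a set of active signal names to one decision (assumed precedence)."""
    active = set(active_signals)
    if active & BLOCK_SIGNALS:
        return Decision.BLOCK
    if active & HITL_SIGNALS:
        return Decision.HITL
    if active & DEGRADE_SIGNALS:
        return Decision.DEGRADE
    return Decision.ALLOW
```

For example, decide(["latency_regressed", "low_sample_size"]) resolves to Decision.HITL under this assumed ordering, because the stronger condition wins.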
Relevant surface routing:
- chat / prompt changes: prompt judges + improvement guards + trace health
- retrieval changes: retrieval evals + grounded evals + trace health
- SQL / data changes: golden eval harness + trace health
- autotune changes: benchmark packs + confirmation metrics + self-improvement experiment rewards + trace health
Autotune promotion evidence is explicit. Candidate runs carry a
scientific_evidence metadata payload with paired baseline/candidate sample
size, primary effect, case wins/losses/ties, mean case delta, and a 95% lower
confidence bound. The promotion gate treats required but underpowered evidence
as a block, which routes weak or tiny samples to human review instead of
creating an automatic PromotionBundle.
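A minimal sketch of how such paired evidence could be computed follows. It is an illustration, not the autotune implementation: the dictionary keys, the min_pairs threshold, and the normal-approximation one-sided bound (z = 1.645) are all assumptions, since this page does not state which statistical method is used.

```python
import math
from statistics import mean, stdev

Z_ONE_SIDED_95 = 1.645  # assumed: normal approximation for a one-sided 95% bound


def paired_evidence(baseline, candidate, min_pairs=10):
    """Summarize paired per-case scores; min_pairs is a hypothetical power threshold."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be paired case by case")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    mean_delta = mean(deltas) if n else 0.0
    lower = None
    if n >= 2:  # a sample standard deviation needs at least two pairs
        lower = mean_delta - Z_ONE_SIDED_95 * stdev(deltas) / math.sqrt(n)
    return {
        "sample_size": n,
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
        "mean_case_delta": mean_delta,
        "lower_bound_95": lower,
        "underpowered": n < min_pairs,
    }


evidence = paired_evidence([0.5, 0.5, 0.5, 0.5], [0.6, 0.7, 0.5, 0.4])
```

With only four pairs the example is flagged as underpowered, and its lower bound falls below zero, which is the kind of evidence the promotion gate is described as refusing to promote automatically.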
Quality gates consume self-improvement regression flags by surface. A failed
retrieval-lane experiment can block retrieval, autotune, and all without
polluting the chat gate. Severe experiment flags block; underpowered,
inconclusive, or not-run experiment flags route to HITL; lower non-dry-run
rewards degrade unless a stronger block condition is present.
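The surface scoping described here can be sketched as follows. The flag shape (surface, kind, dry_run) and the kind names are hypothetical; the mapping of kinds to decisions follows the paragraph above, and the set of affected surfaces (the experiment's own surface plus autotune and all) is inferred from the retrieval example.

```python
ORDER = ["ALLOW", "DEGRADE", "HITL", "BLOCK"]

# Hypothetical flag kinds mapped to the decisions described in the text.
KIND_DECISION = {
    "severe": "BLOCK",
    "underpowered": "HITL",
    "inconclusive": "HITL",
    "not_run": "HITL",
    "low_reward": "DEGRADE",
}


def affected_surfaces(experiment_surface):
    """An experiment flag reaches its own surface plus the autotune and all gates."""
    return {experiment_surface, "autotune", "all"}


def gate_for(surface, flags):
    """Return the strongest decision among the flags that apply to `surface`."""
    decision = "ALLOW"
    for flag in flags:
        if surface not in affected_surfaces(flag["surface"]):
            continue
        if flag["kind"] == "low_reward" and flag.get("dry_run"):
            continue  # only non-dry-run low rewards degrade
        candidate = KIND_DECISION.get(flag["kind"], "ALLOW")
        if ORDER.index(candidate) > ORDER.index(decision):
            decision = candidate
    return decision


FLAGS = [{"surface": "retrieval", "kind": "severe", "dry_run": False}]
```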
Signal Sources And Join Order
Signals are normalized from:
- decision traces
- trace process contracts declared in trace metadata
- runtime eval.* events
- eval harness runs
- grounded trajectories
- environment episodes
- autotune runs
- improvement run reports
- self-improvement cycle history
Environment task groups are summarized as environment:<env_id>:<task_id>
scenario suites. Failed or low-reward groups can enter the replay backlog and
be promoted into internal_replay:environment_<env_id> suites, which are gate
inputs for sb quality gate --surface environments. They also seed pending
autotune benchmark candidates as real-failure fixtures, giving prompt lanes
measurable pressure after operator enrichment and promotion.
Self-improvement cycles are persisted as self_improvement_cycles records and
surface as self_improvement quality runs. Their metadata links targets,
source environment episode ids, seeded ideas, seeded benchmark candidate ids,
Sankalpa goals, and executed autotune runs so the harness can audit the full
feedback chain. Each cycle also records an aggregate Antahkarana outcome. Each
selected target becomes a durable self_improvement_experiments row with the
hypothesis, intervention, source evidence, control baseline, trial runs,
confirmation evidence, measured metrics, deterministic reward components,
links, and verdict. The reward payload breaks the claim into task reward,
promotion-gate reward, Antahkarana outcome reward, regression penalty, safety
penalty, complexity penalty, weights, and a bounded overall score. Quality
summaries derive the self_improvement run score and regression flags from
those experiment rewards and verdicts. Missing referenced experiments,
malformed reward payloads, failed or underpowered verdicts, and non-dry-run low
rewards surface as explicit flags instead of silent gaps. The summaries also
include the experiment ids and compact experiment payloads, so a claimed
self-improvement can be audited as a falsifiable experiment instead of a loose
run note. Promoted cycles that were driven by karma_regret targets purify the
matching active Karma regrets, which gives the next diagnosis pass a
deterministic closure signal. Repeated non-dry-run cycles that end without
promotion or with errors become cycle_stall planner targets so the
self-improvement loop can diagnose its own failed improvement attempts.
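To make the reward payload concrete, here is a hedged sketch of one way the components could combine into a bounded score. The component key names, the example weights, the linear formula, and the [0, 1] bounds are all assumptions; the page names the components but does not give the formula.

```python
REWARDS = ("task", "promotion_gate", "antahkarana_outcome")
PENALTIES = ("regression", "safety", "complexity")


def overall_score(components, weights):
    """Weighted rewards minus weighted penalties, clamped to [0, 1] (assumed bounds)."""
    raw = sum(weights[k] * components[k] for k in REWARDS)
    raw -= sum(weights[k] * components[k] for k in PENALTIES)
    return max(0.0, min(1.0, raw))


# Hypothetical weights and measurements, for illustration only.
WEIGHTS = {"task": 0.5, "promotion_gate": 0.3, "antahkarana_outcome": 0.2,
           "regression": 1.0, "safety": 1.0, "complexity": 1.0}
COMPONENTS = {"task": 0.8, "promotion_gate": 1.0, "antahkarana_outcome": 0.5,
              "regression": 0.1, "safety": 0.0, "complexity": 0.05}

score = overall_score(COMPONENTS, WEIGHTS)  # 0.8 in rewards minus 0.15 in penalties
safety_blocked = overall_score(dict(COMPONENTS, safety=1.0), WEIGHTS)  # clamps to 0.0
```

Keeping each component in the payload, rather than only the final number, is what lets a reviewer see whether a low score came from weak task reward or from a penalty.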
Join precedence is:
- trace_id
- session_id
- explicit run_id or batch_id
- standalone suite artifact when no run-level join exists
New eval or improvement flows should emit trace_id whenever a run-level trace exists.
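The precedence can be read as a first-match lookup. The sketch below assumes records are plain dictionaries and uses a hypothetical suite_id field for the standalone fallback; the ordering of the run-level keys is the one listed above.

```python
JOIN_FIELDS = ("trace_id", "session_id", "run_id", "batch_id")


def join_key(record):
    """Return (field, value) for the highest-precedence join key present."""
    for field in JOIN_FIELDS:
        value = record.get(field)
        if value:
            return (field, value)
    # No run-level join exists: treat the record as a standalone suite artifact.
    return ("suite", record.get("suite_id"))  # suite_id is a hypothetical field name
```

For example, a record carrying both session_id and run_id joins on session_id, which is why emitting trace_id whenever it exists gives the most precise join.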
Baseline Prompt Judges
All non-grader prompt templates now receive a default runtime judge bundle unless a prompt explicitly opts out with metadata.judge_policy: none.
Baseline bundle:
- response_relevance
- response_completeness
- response_clarity
- response_faithfulness
response_faithfulness remains context-gated and only runs when grounded context is available. This keeps judge coverage broad without forcing unsupported faithfulness checks on prompts that have no context payload.
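The selection rule can be sketched as a small function. The function name and its parameters are hypothetical; the opt-out value, the grader exclusion, and the context gate on response_faithfulness follow the description above.

```python
BASELINE_JUDGES = (
    "response_relevance",
    "response_completeness",
    "response_clarity",
    "response_faithfulness",
)


def judges_for(prompt_metadata, has_grounded_context, is_grader=False):
    """Return the judges that would run for one prompt under the baseline policy."""
    if is_grader or prompt_metadata.get("judge_policy") == "none":
        return []
    return [
        judge
        for judge in BASELINE_JUDGES
        # Faithfulness is context-gated: skip it when no grounded context exists.
        if judge != "response_faithfulness" or has_grounded_context
    ]
```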
From Session Failure To Improvement Pressure
The intended operator loop is:
- A runtime trace, judge event, grounded rollout, or environment episode shows a failure or regression.
- The control plane materializes that as a quality run with regression flags, suite health impact, a proposed replay case, and a named improvement issue when the failure signature recurs or needs gated follow-up.
- Improvement and autotune artifacts linked by trace_id, session_id, or run_id are folded into the same quality run.
- Operators inspect /quality or sb quality show <id> to see the trace timeline, eval outcomes, approvals, replay-case proposal, and linked improvement/autotune artifacts.
- The named issue clusters source traces, replay cases, severity, suspected root cause, evaluator proposal, offline replay path, and gate status so the work remains reviewable instead of becoming an isolated bug note.
- The replay case is promoted into a private/internal scenario suite for the affected route or lane.
- Prompt, tool, routing, MCP, browser, and memory changes are gated against the replay suite before promotion.
- Recurring signatures increase benchmark pressure until the failure stops recurring in fresh traces.
This is deliberately trace-to-eval-to-fix, not public-benchmark chasing. Public benchmarks can still be useful smoke signals, but private replay cases from real failures are the control-plane source of truth.
sb quality summary --json now exposes both improvement_issues and
replay_backlog. An improvement issue is the operator-facing cluster: it names
the recurring failure, keeps trace/replay evidence attached, proposes an online
monitor and offline replay path, and records whether the current quality gate is
blocking or escalating the work. A replay backlog entry is the executable case
that can become release-gating pressure.
Each replay backlog entry includes:
- source quality_run_id, trace_id, and session_id
- route or lane target
- failure signatures
- target internal replay suite
- priority derived from severity and recurrence
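A backlog entry with these fields might be modeled as below. This is a sketch only: the class, the severity weights, and the priority formula (severity weight times recurrence count) are assumptions, since the page says priority is derived from severity and recurrence without giving the function.

```python
from dataclasses import dataclass, field

SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 4}  # hypothetical weights


@dataclass
class ReplayBacklogEntry:
    quality_run_id: str
    trace_id: str
    session_id: str
    target: str                      # route or lane, e.g. "repl_prompt"
    failure_signatures: list = field(default_factory=list)
    severity: str = "medium"
    recurrence: int = 1

    @property
    def target_suite(self):
        """Internal replay suite name for the affected route or lane."""
        return f"internal_replay:{self.target}"

    @property
    def priority(self):
        """Assumed formula: severity weight multiplied by recurrence count."""
        return SEVERITY_WEIGHT[self.severity] * self.recurrence


backlog = [
    ReplayBacklogEntry("qr-1", "t-1", "s-1", "repl_prompt", ["judge_drop"], "low", 5),
    ReplayBacklogEntry("qr-2", "t-2", "s-2", "vault_retrieval", ["missing_ref"], "high", 2),
]
ranked = sorted(backlog, key=lambda entry: entry.priority, reverse=True)
```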
sb quality show <quality-run-id> --json includes the replay proposal for that
run when the run failed, blocked, or carried regression flags.
Promote a replay proposal when it should become release-gating pressure:
sb quality replay-cases --status proposed --json
sb quality promote-replay replay:trace:<trace-id> --notes "guard regression before prompt changes"
sb quality run-replay replay:trace:<trace-id> --json
Promotion creates an internal_replay:<route-or-lane> eval run with a proposed
case and, when the replay maps to an autotune lane, a pending
BenchmarkCandidate with source_type="quality_replay_case". Proposed
internal replay suites are treated as HITL by the composite gate for matching
surfaces until they are executed and resolved. sb quality run-replay executes
the best available replay adapter and writes a measured passed or failed
eval run. Each execution creates a new eval run and preserves the proposal or
promotion eval id as source lineage instead of overwriting prior evidence. The
environment adapter replays stored environment episodes into fresh fixture
environments. Enriched repl/chat replay cases run through the real
repl_prompt runtime prompt evaluator when they carry executable expected
outputs, properties, tools, or schema assertions. Retrieval replay cases run
against local memory/source-evidence FTS when expected_refs are present. If a
prompt replay has no executable benchmark assertion, the decision-trace adapter
loads the persisted trace envelope and turns historical failed or blocked traces
into measured failed evidence. Replay families without an executable adapter
still fail visibly as not_runnable instead of passing by convention. Failed
replay suites block the gate like any other scenario failure. The benchmark
candidate stays in triage until an operator enriches it and promotes it into an
autotune pack. The candidate id is persisted on the replay case metadata and
evidence links so later quality views can trace the handoff. Pending
real-failure benchmark candidates also appear as benchmark_pressure targets
in sb autotune self-diagnose, so replay promotion can pull the lane into the
self-improvement planner even before another autotune run fails.
Self-improvement cycles execute against the evidence that selected the target. Use the explain surface before mutating anything:
The plan records source benchmark candidates, replay case ids, environment
episode ids, required pack, and confirmation pack. When a target came from
recent replay or benchmark pressure, the run uses that target-bound pack rather
than a generic smoke pack; high-severity pressure escalates to hard, and the
derived confirmation pack is passed into the confirmation evaluator. Promotion
bundles are blocked when required source replay cases were not rerun, did not
pass, the effective pack does not match the target plan, source benchmark
candidate coverage is missing, unresolved failed replay suites remain,
candidate properties lack executable contracts, or force-promoted candidates
lack explicit review.
After a cycle:
sb autotune cycle show <cycle-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb quality replay-results --json
Benchmark candidates keep resolution_status (open, fixed,
still_failing, invalid, or superseded), resolved_by_run_id, and
resolution_evidence. Fixed candidates stop contributing active
benchmark_pressure; failed attempts are marked still_failing and keep
recurrence metadata so they become more visible instead of disappearing.
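The resolution bookkeeping can be illustrated with a short sketch. The dictionary shape and the recurrence counter name are hypothetical, and treating only open and still_failing candidates as active pressure is an assumption consistent with, but not stated in, the text above.

```python
ACTIVE_STATUSES = ("open", "still_failing")  # assumed set of pressure-bearing statuses


def record_attempt(candidate, run_id, passed):
    """Return an updated copy of a benchmark candidate after one fix attempt."""
    updated = dict(candidate)
    if passed:
        updated["resolution_status"] = "fixed"
        updated["resolved_by_run_id"] = run_id
    else:
        updated["resolution_status"] = "still_failing"
        # Hypothetical counter: failed attempts become more visible over time.
        updated["recurrence"] = candidate.get("recurrence", 0) + 1
    return updated


def active_pressure(candidates):
    """Ids of candidates that still contribute benchmark pressure."""
    return [c["id"] for c in candidates if c["resolution_status"] in ACTIVE_STATUSES]


first = {"id": "bc-1", "resolution_status": "open"}
second = {"id": "bc-2", "resolution_status": "open"}
first_failed = record_attempt(first, "run-1", passed=False)
second_fixed = record_attempt(second, "run-2", passed=True)
pressure = active_pressure([first_failed, second_fixed])
```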
Benchmark promotion is fail-closed. Candidates must be reviewed, completely
enriched, carry source provenance, and define at least one executable assertion
whose property names are registered for the target lane. Replay-only labels such
as quality_replay_regression_fixed remain metadata until a replay adapter can
execute them. sb autotune benchmark promote --force is available for audited
operator exceptions and records the blocked validation issues on the promoted
case.
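Fail-closed promotion can be sketched as a validator that must return no issues before promotion proceeds. The candidate fields, issue names, and the registered property contains_citation are hypothetical; the four requirements and the behavior of a forced promotion follow the paragraph above.

```python
def promotion_issues(candidate, registered_properties):
    """List every reason a candidate cannot be promoted; empty means it may proceed."""
    issues = []
    if not candidate.get("reviewed"):
        issues.append("not_reviewed")
    if not candidate.get("enriched"):
        issues.append("incomplete_enrichment")
    if not candidate.get("source_provenance"):
        issues.append("missing_source_provenance")
    executable = [
        a for a in candidate.get("assertions", [])
        if a.get("property") in registered_properties
    ]
    if not executable:
        issues.append("no_executable_assertion")
    return issues


def promote(candidate, registered_properties, force=False):
    """Fail closed: block on any issue unless an operator forces the promotion."""
    issues = promotion_issues(candidate, registered_properties)
    if issues and not force:
        return {"promoted": False, "issues": issues}
    # A forced promotion keeps the blocked validation issues on the record.
    return {"promoted": True, "forced": bool(issues), "recorded_issues": issues}


REGISTERED = {"contains_citation"}  # hypothetical lane property registry
CANDIDATE = {
    "reviewed": True,
    "enriched": True,
    "source_provenance": "replay:trace:<trace-id>",
    # A replay-only label is metadata, not an executable assertion.
    "assertions": [{"property": "quality_replay_regression_fixed"}],
}

blocked = promote(CANDIDATE, REGISTERED)
forced = promote(CANDIDATE, REGISTERED, force=True)
```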
Typical Workflow
Check overall posture:
sb quality summary
Inspect recent runs:
sb quality runs
Inspect suite health:
sb quality suites
Inspect operating-harness pressure before or after agent-generated changes:
sb quality harness audit --changed-only --json
sb quality harness review --role runtime-contract --changed-only --json
sb quality harness gc --last 7d --json
Inspect the full-history autotune experience that mutation planners can use:
sb autotune experience list repl_prompt --json
sb autotune experience frontier repl_prompt --json
sb autotune experience show repl_prompt <run-id> --json
sb autotune experience context repl_prompt --json
sb autotune experience context antahkarana_loop --json
Experience manifests include raw per-case JSONL traces plus a compact
trace_summary.json. For Antahkarana runs, that summary captures layer outcome
counts, salience/confidence tuning signals, block layers, and failure patterns
such as over-processing, under-attention, missing layers, or missed blocks.
Run the Antahkarana lane directly when a change touches attention routing, confidence thresholds, safety gates, memory consolidation, or the full cognitive loop:
Run underlying sources directly when deeper validation is needed:
sb eval run --json
sb grounded eval --json
sb autotune report repl_prompt --last 5 --json
sb autotune report antahkarana_loop --last 5 --json
sb improve conversations reports <batch-id> --json
Phase 2 Direction
Phase 2 is the active-learning layer on top of the same control plane:
- promote replay backlog entries into executable private eval suites
- seed benchmark pressure from real failures across more routes
- rank top regressions by impact and recurrence
- gate prompt, tool, and routing changes against the replay suites they touch
- use composite quality scores for lane health
- make grounded low-score families influence related prompt and retrieval lanes