Quality Control Plane

The quality control plane unifies runtime traces, eval suites, grounded trajectories, autotune runs, and conversation improvement reports into one operator surface.

Use it when you need one place to answer these questions:

  • Is the system healthy right now?
  • What regressed?
  • Which surface is under-covered?
  • Should this change be allowed, degraded, escalated to HITL, or blocked?

Primary Entry Points

Use sb quality as the default operator entrypoint:

sb quality summary
sb quality runs
sb quality show <quality-run-id>
sb quality suites
sb quality improvement-issues
sb quality replay-cases
sb quality promote-replay <replay-case-id>
sb quality run-replay <replay-case-id>
sb quality gate --surface all
sb quality harness audit --json
sb quality harness review --role reliability --changed-only --json
sb quality harness scan --changed-only --json
sb quality harness process --json
sb quality harness revalidate --json
sb quality harness enrich --json
sb quality harness status --json
sb quality harness export --format json
sb quality harness gc --last 7d --json

The same materialized control-plane data is exposed in sb serve at:

  • GET /quality/summary
  • GET /quality/runs
  • GET /quality/runs/{quality_run_id}
  • GET /quality/suites
  • GET /quality/improvement-issues
  • GET /quality/replay-cases
  • GET /quality/replay-results
  • POST /quality/replay-cases/{case_id}/promote
  • POST /quality/replay-cases/{case_id}/run
  • GET /quality/gates/latest
  • GET /autotune/cycles/{cycle_id}
  • GET /autotune/benchmark/unresolved?lane=<lane>

The browser UI now prefers the /quality route for run health and drill-down. Existing /runs and /traces routes remain available for compatibility.

Operating Harness Checks

sb quality harness is the offline operating-harness layer. It does not call live providers and does not edit files. It reports the local constraints and review pressure that help coding agents work predictably:

  • audit checks coverage gaps, oversized agent-hostile files, missing remediation hints, skill validation metadata, docs hygiene, and pending replay coverage.
  • review --role <role> runs a deterministic role check. Supported roles are reliability, security-policy, runtime-contract, frontend-ux, docs-help, and test-strategy.
  • gc turns repeated failures, replay backlog, missing validation metadata, and weak deterministic checks into human-reviewable improvement proposals.
  • scan, process, revalidate, enrich, status, and export form the persisted local pipeline for candidate findings. Scan stores matcher candidates in the work database, process materializes deduplicated findings, revalidate marks current verdicts, enrich adds recent git context, status reports counts, and export emits JSON or per-finding Markdown.

Harness reports are structured for agents and CI. Every finding includes an id, severity, surface, role, evidence payload, remediation fix_hint, and gate recommendation. Audit reports also separate coverage gaps, stale rules, oversized agent-hostile source files, missing remediation hints, and missing replay/eval coverage so follow-up agents can route work without reading the whole repository.
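As a rough illustration of how a follow-up agent might route work from a harness report, the sketch below groups findings by review role and lists BLOCK recommendations first. The `Finding` field names mirror the prose above, but the actual JSON keys emitted by `sb quality harness` are an assumption, as is the `route_by_role` helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Finding:
    # Field names follow the prose; the real report keys are an assumption.
    id: str
    severity: str
    surface: str
    role: str
    evidence: dict
    fix_hint: str
    gate_recommendation: str


def route_by_role(findings):
    """Group findings by review role, with BLOCK recommendations first."""
    routed = defaultdict(list)
    for finding in findings:
        routed[finding.role].append(finding)
    for items in routed.values():
        # False sorts before True, so BLOCK findings move to the front.
        items.sort(key=lambda f: f.gate_recommendation != "BLOCK")
    return dict(routed)
```

Grouping by role keeps each reviewer's queue small, which matches the intent that agents should not need to read the whole repository.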

Trace Process Contracts

Runtime traces can declare expected process evidence in metadata.process_contract or metadata.process_contracts. The quality plane evaluates those contracts during sb quality summary, sb quality runs, and sb quality gate, so a completed trace can still fail quality if it skipped a required tool, event, decision, evidence item, or ordered step sequence.

Supported contract keys:

{
  "id": "grounded_answer",
  "required_tools": ["search_notes"],
  "required_events": ["eval.response_faithfulness"],
  "required_decisions": ["verify_answer"],
  "required_evidence": ["source:policy"],
  "required_sequence": ["tool:search_notes", "event:eval.response_faithfulness"],
  "forbidden_tools": ["unsafe_write"]
}

Failed contracts set dimensions.tool_correctness, add a process_contract_failed:<id> regression flag, block the relevant quality gate, and seed replay backlog pressure for the affected trace. This keeps process drift visible even when the final text looks correct. Running sb quality run-replay <case-id> on one of these cases records the contract checks, observed process evidence, and process_contract_score in the replay result. Promoting one of these cases into benchmark pressure carries executable contract expectations into the seeded candidate, such as required or forbidden tools and local-context requirements.
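The contract keys above can be read as a small set of membership and ordering checks. The following is a minimal sketch, not the quality plane's implementation: `ObservedProcess` is a hypothetical container for evidence extracted from a trace, and treating `required_sequence` as an ordered subsequence (gaps allowed) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ObservedProcess:
    """Hypothetical container for evidence extracted from one trace."""
    tools: list = field(default_factory=list)
    events: list = field(default_factory=list)
    decisions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    # Ordered steps, prefixed like the contract: "tool:<name>", "event:<name>".
    steps: list = field(default_factory=list)


def is_ordered_subsequence(required, steps):
    """True when `required` appears within `steps` in order (gaps allowed)."""
    remaining = iter(steps)
    # `item in remaining` advances the iterator, which enforces the ordering.
    return all(item in remaining for item in required)


def check_contract(contract, observed):
    """Return a list of failures; an empty list means the contract held."""
    failures = []
    required = [
        ("required_tools", observed.tools, "tool"),
        ("required_events", observed.events, "event"),
        ("required_decisions", observed.decisions, "decision"),
        ("required_evidence", observed.evidence, "evidence"),
    ]
    for key, seen, label in required:
        for name in contract.get(key, []):
            if name not in seen:
                failures.append(f"missing {label}: {name}")
    for name in contract.get("forbidden_tools", []):
        if name in observed.tools:
            failures.append(f"forbidden tool used: {name}")
    sequence = contract.get("required_sequence", [])
    if sequence and not is_ordered_subsequence(sequence, observed.steps):
        failures.append("required_sequence not satisfied")
    return failures
```

A non-empty result would correspond to the `process_contract_failed:<id>` flag described above, even when the trace itself completed.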

The persisted harness pipeline is deterministic and offline-safe. It uses a local matcher registry with noise tiers (precise, normal, noisy) for runtime reliability, security-policy, prompt-boundary, integration side-effect, filesystem-boundary, frontend, docs/help, and test-strategy pressure. Pipeline state lives in the work database tables quality_harness_runs and quality_harness_file_records; BLOCK findings are folded into the composite quality gate.

Docs hygiene turns the agent-operability rules into checks. It verifies that AGENTS.md stays a routing table, docs/INDEX.md matches the current docs tree, active plans keep required lifecycle metadata, and active plans are listed in both MkDocs nav and the component test-entrypoint table.

Review roles are intentionally narrow:

| Role | Primary pressure |
| --- | --- |
| reliability | timeouts, network/process calls, and brittle runtime behavior |
| security-policy | unsafe execution patterns and policy-sensitive code |
| runtime-contract | stream contracts, schema drift, and agent-hostile source files |
| frontend-ux | changed UI files that need ergonomic and visual review |
| docs-help | user-facing CLI changes without docs or schema refreshes |
| test-strategy | source changes that lack focused tests |

The harness GC command is advisory in v1. It reads existing quality summaries, replay backlog, repeated regressions, missing fix hints, and skill validation metadata, then emits proposals for humans to approve. It never edits AGENTS.md, skills, docs, or replay cases on its own.

Multi-agent campaign descriptors live with background sessions. They describe a goal, role assignments, token/time budgets, success metrics, review roles, and merge gates. Write-capable role assignments default to task-workspace isolation so campaign runners can coordinate multiple agents without sharing one mutable checkout.

Canonical Quality Model

Phase 1 keeps raw stores where they already live and adds one normalized quality layer on top:

  • runtime DB: events, tool_calls, decision_traces
  • work DB: eval runs, autotune runs, improvement artifacts, materialized quality summaries
  • grounded store: persisted grounded reasoning trajectories
  • work DB environment tables: replayable environment_episodes and steps

Each control-plane record is summarized into a canonical QualityRunSummary keyed by QualityRunRef.

Quality dimensions

  • completion_rate
  • failure_rate
  • blocked_rate
  • latency_p50_ms
  • latency_p95_ms
  • retrieval_quality
  • groundedness
  • faithfulness
  • tool_correctness
  • tool_restraint
  • clarity
  • overall_score

Eval Families

Every major route or lane should map into one or more of these families:

| Family | What it covers | Typical sources |
| --- | --- | --- |
| Runtime health | trace coverage, failures, blocked runs, latency, approval friction | decision traces, runtime events, tool calls |
| Deterministic regressions | golden SQL, workflow, and artifact checks | sb eval run, artifact evals |
| Behavioral judges | runtime prompt judges and session judges | eval.* events, conversation improvement judges |
| Scenario / trajectory evals | retrieval-heavy and replay-heavy scenarios | grounded evals, autotune packs, simulation guards |

Any surface without at least one deterministic or scenario check is treated as a coverage_gap.
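The coverage-gap rule above reduces to a simple predicate over the matrix. This is a minimal sketch under an assumed in-memory shape (a dict of surface name to check lists); the `voice` surface is made up purely to show a gap.

```python
def coverage_gaps(matrix):
    """Return surfaces with neither a deterministic nor a scenario check."""
    return sorted(
        surface
        for surface, checks in matrix.items()
        if not checks.get("deterministic") and not checks.get("scenario")
    )


# Illustrative matrix; "voice" is a hypothetical surface with judges only.
matrix = {
    "retrieval": {
        "deterministic": ["retrieval-oriented eval suites"],
        "scenario": ["grounded eval families"],
    },
    "data": {
        "deterministic": ["golden eval suites"],
        "scenario": ["artifact scenarios"],
    },
    "voice": {"judge": ["response_clarity"]},
}

print(coverage_gaps(matrix))  # ['voice']
```

Note that judge checks alone do not close the gap: a surface needs at least one deterministic or scenario check.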

Coverage Matrix

The control plane tracks an explicit coverage matrix in code.

Current surfaces:

| Surface | Route / lane | Source of truth | Deterministic checks | Judge checks | Scenario checks | Gate inputs |
| --- | --- | --- | --- | --- | --- | --- |
| chat | repl_prompt | runtime decision traces + prompt judges + improvement runs | improvement synthetic guards | faithfulness / conversation judges | session replay and improvement runs | trace health + judge aggregates + improvement guards |
| retrieval | vault_retrieval | traces + grounded trajectories + prompt judges | retrieval-oriented eval suites | retrieval quality / faithfulness | grounded eval families | trace health + grounded suites + judges |
| cognition | antahkarana_loop | Antahkarana impulse/impression traces + autotune experience artifacts | full-loop cognitive fixtures | optional prompt/layer judges | cognitive-loop benchmark packs | layer outcome rates + safety blocks + trace health |
| environments | sb env / /environments | persisted environment episodes in work.db | replay comparison and verifier checks | none by default | environment manifests and stored episodes | normalized reward + terminal status + replay result |
| data | SQL / workflow surfaces | eval harness summaries | golden eval suites | optional runtime judges | artifact scenarios | trace health + deterministic suite status |
| autotune | benchmark lanes | autotune runs + scientific evidence + self-improvement experiments | confirmation guards + paired-case evidence + reward components | judge summaries | benchmark packs, confirmations, and self-improvement experiments | trace health + benchmark, scientific-evidence, reward, and confirmation outcomes |

Gate Policy

The composite gate wraps the existing release-gate evaluator instead of replacing it.

Possible decisions:

  • ALLOW: deterministic checks pass and no degrade or block condition is active
  • DEGRADE: latency regressed, blocked rate regressed, or judge scores dropped
  • HITL: not enough sample size or coverage confidence to trust an automated decision, including underpowered or incomplete self-improvement experiment evidence
  • BLOCK: relevant deterministic or scenario suites failed, trace coverage is broken, runtime failure spikes crossed severe thresholds, or a relevant self-improvement experiment is missing, malformed, failed, or safety-blocked
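One way to read the four decisions is as a most-severe-wins cascade. The sketch below is illustrative only: the signal names are hypothetical, and ranking HITL above DEGRADE is an assumption, since the text only states that block conditions outrank degrade conditions.

```python
BLOCK_SIGNALS = (
    "suite_failed",
    "trace_coverage_broken",
    "severe_failure_spike",
    "experiment_blocked",
)
HITL_SIGNALS = (
    "low_sample_size",
    "low_coverage_confidence",
    "experiment_underpowered",
)
DEGRADE_SIGNALS = (
    "latency_regressed",
    "blocked_rate_regressed",
    "judge_scores_dropped",
)


def composite_decision(signals):
    """Map boolean gate signals to one decision, most severe first."""
    if any(signals.get(name) for name in BLOCK_SIGNALS):
        return "BLOCK"
    if any(signals.get(name) for name in HITL_SIGNALS):
        return "HITL"
    if any(signals.get(name) for name in DEGRADE_SIGNALS):
        return "DEGRADE"
    return "ALLOW"
```

For example, `composite_decision({"latency_regressed": True})` yields `DEGRADE`, while adding `"suite_failed": True` to the same signals yields `BLOCK`.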

Relevant surface routing:

  • chat / prompt changes: prompt judges + improvement guards + trace health
  • retrieval changes: retrieval evals + grounded evals + trace health
  • SQL / data changes: golden eval harness + trace health
  • autotune changes: benchmark packs + confirmation metrics + self-improvement experiment rewards + trace health

Autotune promotion evidence is explicit. Candidate runs carry a scientific_evidence metadata payload with paired baseline/candidate sample size, primary effect, case wins/losses/ties, mean case delta, and a 95% lower confidence bound. The promotion gate treats required but underpowered evidence as a block, which routes weak or tiny samples to human review instead of creating an automatic PromotionBundle.
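To make the evidence fields concrete, here is a minimal sketch of how paired-case statistics could be computed. It is not the autotune implementation: the normal-approximation one-sided bound (z = 1.645), the `min_pairs` threshold, and the output key names are all assumptions chosen for illustration.

```python
import math
from statistics import mean, stdev


def paired_evidence(baseline, candidate, min_pairs=30):
    """Summarize paired per-case scores (same cases, same order)."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be paired case by case")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    wins = sum(d > 0 for d in deltas)
    losses = sum(d < 0 for d in deltas)
    mean_delta = mean(deltas) if n else 0.0
    lower_bound = None
    if n >= 2:
        # One-sided 95% lower bound via a normal approximation (assumed).
        standard_error = stdev(deltas) / math.sqrt(n)
        lower_bound = mean_delta - 1.645 * standard_error
    return {
        "sample_size": n,
        "wins": wins,
        "losses": losses,
        "ties": n - wins - losses,
        "mean_case_delta": mean_delta,
        "lower_bound_95": lower_bound,
        # Hypothetical rule: too few pairs means the evidence is underpowered.
        "underpowered": n < min_pairs,
    }
```

Under this sketch, a candidate with only a handful of paired cases is marked underpowered regardless of how large its mean delta looks, which is the situation the promotion gate routes to human review.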

Quality gates consume self-improvement regression flags by surface. A failed retrieval-lane experiment can block retrieval, autotune, and all without polluting the chat gate. Severe experiment flags block; underpowered, inconclusive, or not-run experiment flags route to HITL; lower non-dry-run rewards degrade unless a stronger block condition is present.
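The surface scoping described above can be sketched as a lookup from the experiment's lane to the gates it may affect. The lane-to-surface pairs come from the coverage matrix; the flag kind names and the `gate_for_surface` helper are hypothetical, and the severity ordering of HITL over DEGRADE is an assumption.

```python
# Lane-to-surface pairs taken from the coverage matrix.
LANE_SURFACE = {
    "repl_prompt": "chat",
    "vault_retrieval": "retrieval",
    "antahkarana_loop": "cognition",
}

# Hypothetical flag kinds mapped to the decision the text assigns them.
FLAG_DECISION = {
    "failed": "BLOCK",
    "safety_blocked": "BLOCK",
    "underpowered": "HITL",
    "inconclusive": "HITL",
    "not_run": "HITL",
    "low_reward": "DEGRADE",
}

SEVERITY = ["ALLOW", "DEGRADE", "HITL", "BLOCK"]


def affected_surfaces(lane):
    """A lane's experiment flags reach its own surface, autotune, and all."""
    return {LANE_SURFACE.get(lane, lane), "autotune", "all"}


def gate_for_surface(surface, flags):
    """Return the most severe decision among flags relevant to `surface`.

    `flags` is a list of (lane, kind) tuples.
    """
    decision = "ALLOW"
    for lane, kind in flags:
        if surface not in affected_surfaces(lane):
            continue
        candidate = FLAG_DECISION.get(kind, "ALLOW")
        if SEVERITY.index(candidate) > SEVERITY.index(decision):
            decision = candidate
    return decision
```

With `flags = [("vault_retrieval", "failed")]`, this sketch blocks the `retrieval`, `autotune`, and `all` gates while leaving `chat` at `ALLOW`.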

Signal Sources And Join Order

Signals are normalized from:

  • decision traces
  • trace process contracts declared in trace metadata
  • runtime eval.* events
  • eval harness runs
  • grounded trajectories
  • environment episodes
  • autotune runs
  • improvement run reports
  • self-improvement cycle history

Environment task groups are summarized as environment:<env_id>:<task_id> scenario suites. Failed or low-reward groups can enter the replay backlog and be promoted into internal_replay:environment_<env_id> suites, which are gate inputs for sb quality gate --surface environments. They also seed pending autotune benchmark candidates as real-failure fixtures, giving prompt lanes measurable pressure after operator enrichment and promotion.

Self-improvement cycles are persisted as self_improvement_cycles records and surface as self_improvement quality runs. Their metadata links targets, source environment episode ids, seeded ideas, seeded benchmark candidate ids, Sankalpa goals, and executed autotune runs so the harness can audit the full feedback chain. Each cycle also records an aggregate Antahkarana outcome.

Each selected target becomes a durable self_improvement_experiments row with the hypothesis, intervention, source evidence, control baseline, trial runs, confirmation evidence, measured metrics, deterministic reward components, links, and verdict. The reward payload breaks the claim into task reward, promotion-gate reward, Antahkarana outcome reward, regression penalty, safety penalty, complexity penalty, weights, and a bounded overall score.

Quality summaries derive the self_improvement run score and regression flags from those experiment rewards and verdicts. Missing referenced experiments, malformed reward payloads, failed or underpowered verdicts, and non-dry-run low rewards surface as explicit flags instead of silent gaps. The summaries also include the experiment ids and compact experiment payloads, so a claimed self-improvement can be audited as a falsifiable experiment instead of a loose run note.

Promoted cycles that were driven by karma_regret targets purify the matching active Karma regrets, which gives the next diagnosis pass a deterministic closure signal. Repeated non-dry-run cycles that end without promotion or with errors become cycle_stall planner targets so the self-improvement loop can diagnose its own failed improvement attempts.
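The reward payload described above can be pictured as a weighted sum of rewards minus weighted penalties, clamped to a fixed range. This is only a sketch: the component key names, the example weights, and the [0, 1] bound are assumptions, and the actual formula may differ.

```python
REWARD_KEYS = ("task", "promotion_gate", "antahkarana_outcome")
PENALTY_KEYS = ("regression", "safety", "complexity")


def overall_reward(components, weights):
    """Weighted rewards minus weighted penalties, clamped to [0, 1]."""
    gained = sum(weights[key] * components[key] for key in REWARD_KEYS)
    lost = sum(weights[key] * components[key] for key in PENALTY_KEYS)
    return max(0.0, min(1.0, gained - lost))


# Hypothetical weights; real payloads carry their own.
weights = {
    "task": 0.5,
    "promotion_gate": 0.3,
    "antahkarana_outcome": 0.2,
    "regression": 1.0,
    "safety": 1.0,
    "complexity": 1.0,
}
```

Clamping keeps the score comparable across experiments: a large safety penalty drives the score to the floor instead of producing an unbounded negative value.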

Join precedence is:

  1. trace_id
  2. session_id
  3. explicit run_id or batch_id
  4. standalone suite artifact when no run-level join exists

New eval or improvement flows should emit trace_id whenever a run-level trace exists.
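The join precedence above amounts to picking the first identifier that is present. In this minimal sketch the record is a plain dict, checking `run_id` before `batch_id` is an assumption (the text lists them at the same level), and `suite_id` is a hypothetical field for the standalone case.

```python
def join_key(record):
    """Return (kind, value) for the strongest join identifier available."""
    for name in ("trace_id", "session_id", "run_id", "batch_id"):
        value = record.get(name)
        if value:
            return (name, value)
    # No run-level join exists: treat the record as a standalone suite artifact.
    return ("standalone", record.get("suite_id"))
```

For example, a record carrying both `session_id` and `trace_id` joins on `trace_id`, which is why new flows are asked to emit it whenever a run-level trace exists.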

Baseline Prompt Judges

All non-grader prompt templates now receive a default runtime judge bundle unless a prompt explicitly opts out with metadata.judge_policy: none.

Baseline bundle:

  • response_relevance
  • response_completeness
  • response_clarity
  • response_faithfulness

response_faithfulness remains context-gated and only runs when grounded context is available. This keeps judge coverage broad without forcing unsupported faithfulness checks on prompts that have no context payload.
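The selection rule for the baseline bundle can be sketched as follows. The function name and its arguments are hypothetical; only the judge names, the `judge_policy: none` opt-out, and the context gate on `response_faithfulness` come from the text.

```python
BASELINE_JUDGES = (
    "response_relevance",
    "response_completeness",
    "response_clarity",
    "response_faithfulness",
)


def judges_for_prompt(metadata, is_grader=False, has_grounded_context=False):
    """Return the runtime judges that would apply to one prompt template."""
    # Grader prompts and explicit opt-outs receive no default bundle.
    if is_grader or metadata.get("judge_policy") == "none":
        return []
    return [
        judge
        for judge in BASELINE_JUDGES
        # Faithfulness only runs when grounded context is available.
        if judge != "response_faithfulness" or has_grounded_context
    ]
```

A prompt without a context payload therefore still gets relevance, completeness, and clarity judges, but no faithfulness check.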

From Session Failure To Improvement Pressure

The intended operator loop is:

  1. A runtime trace, judge event, grounded rollout, or environment episode shows a failure or regression.
  2. The control plane materializes that as a quality run with regression flags, suite health impact, a proposed replay case, and a named improvement issue when the failure signature recurs or needs gated follow-up.
  3. Improvement and autotune artifacts linked by trace_id, session_id, or run_id are folded into the same quality run.
  4. Operators inspect /quality or sb quality show <id> to see the trace timeline, eval outcomes, approvals, replay-case proposal, and linked improvement/autotune artifacts.
  5. The named issue clusters source traces, replay cases, severity, suspected root cause, evaluator proposal, offline replay path, and gate status so the work remains reviewable instead of becoming an isolated bug note.
  6. The replay case is promoted into a private/internal scenario suite for the affected route or lane.
  7. Prompt, tool, routing, MCP, browser, and memory changes are gated against the replay suite before promotion.
  8. Recurring signatures increase benchmark pressure until the failure stops recurring in fresh traces.

This is deliberately trace-to-eval-to-fix, not public-benchmark chasing. Public benchmarks can still be useful smoke signals, but private replay cases from real failures are the control-plane source of truth.

sb quality summary --json now exposes both improvement_issues and replay_backlog. An improvement issue is the operator-facing cluster: it names the recurring failure, keeps trace/replay evidence attached, proposes an online monitor and offline replay path, and records whether the current quality gate is blocking or escalating the work. A replay backlog entry is the executable case that can become release-gating pressure.

Each replay backlog entry includes:

  • source quality_run_id, trace_id, and session_id
  • route or lane target
  • failure signatures
  • target internal replay suite
  • priority derived from severity and recurrence

sb quality show <quality-run-id> --json includes the replay proposal for that run when the run failed, blocked, or carried regression flags.

Promote a replay proposal when it should become release-gating pressure:

sb quality replay-cases --status proposed --json
sb quality promote-replay replay:trace:<trace-id> --notes "guard regression before prompt changes"
sb quality run-replay replay:trace:<trace-id> --json

Promotion creates an internal_replay:<route-or-lane> eval run with a proposed case and, when the replay maps to an autotune lane, a pending BenchmarkCandidate with source_type="quality_replay_case". Proposed internal replay suites are treated as HITL by the composite gate for matching surfaces until they are executed and resolved.

sb quality run-replay executes the best available replay adapter and writes a measured passed or failed eval run. Each execution creates a new eval run and preserves the proposal or promotion eval id as source lineage instead of overwriting prior evidence.

The environment adapter replays stored environment episodes into fresh fixture environments. Enriched repl/chat replay cases run through the real repl_prompt runtime prompt evaluator when they carry executable expected outputs, properties, tools, or schema assertions. Retrieval replay cases run against local memory/source-evidence FTS when expected_refs are present. If a prompt replay has no executable benchmark assertion, the decision-trace adapter loads the persisted trace envelope and turns historical failed or blocked traces into measured failed evidence. Replay families without an executable adapter still fail visibly as not_runnable instead of passing by convention. Failed replay suites block the gate like any other scenario failure.

The benchmark candidate stays in triage until an operator enriches it and promotes it into an autotune pack. The candidate id is persisted on the replay case metadata and evidence links so later quality views can trace the handoff. Pending real-failure benchmark candidates also appear as benchmark_pressure targets in sb autotune self-diagnose, so replay promotion can pull the lane into the self-improvement planner even before another autotune run fails.
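The adapter selection described above can be pictured as a dispatch over the replay case's family and the executable material it carries. This sketch is an illustration only: the `family` values, the case field names other than `expected_refs`, and the adapter labels are hypothetical.

```python
EXECUTABLE_PROMPT_FIELDS = (
    "expected_outputs",
    "properties",
    "tools",
    "schema_assertions",
)


def pick_adapter(case):
    """Choose a replay adapter label, or "not_runnable" when none applies."""
    family = case.get("family")
    if family == "environment" and case.get("episode_ids"):
        return "environment"
    if family == "retrieval" and case.get("expected_refs"):
        return "retrieval_fts"
    if family == "prompt":
        if any(case.get(name) for name in EXECUTABLE_PROMPT_FIELDS):
            return "repl_prompt_evaluator"
        if case.get("trace_id"):
            # Fall back to the persisted trace envelope as measured evidence.
            return "decision_trace"
    # Fail visibly instead of passing by convention.
    return "not_runnable"
```

The final branch is the important one: an unsupported family yields `not_runnable`, so a missing adapter can never be mistaken for a pass.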

Self-improvement cycles execute against the evidence that selected the target. Use the explain surface before mutating anything:

sb autotune improve --explain-plan --json

The plan records source benchmark candidates, replay case ids, environment episode ids, required pack, and confirmation pack. When a target came from recent replay or benchmark pressure, the run uses that target-bound pack rather than a generic smoke pack; high-severity pressure escalates to hard, and the derived confirmation pack is passed into the confirmation evaluator. Promotion bundles are blocked when any of the following holds: required source replay cases were not rerun or did not pass, the effective pack does not match the target plan, source benchmark candidate coverage is missing, unresolved failed replay suites remain, candidate properties lack executable contracts, or force-promoted candidates lack explicit review.

After a cycle:

sb autotune cycle show <cycle-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb quality replay-results --json

Benchmark candidates keep resolution_status (open, fixed, still_failing, invalid, or superseded), resolved_by_run_id, and resolution_evidence. Fixed candidates stop contributing active benchmark_pressure; failed attempts are marked still_failing and keep recurrence metadata so they become more visible instead of disappearing.

Benchmark promotion is fail-closed. Candidates must be reviewed, completely enriched, carry source provenance, and define at least one executable assertion whose property names are registered for the target lane. Replay-only labels such as quality_replay_regression_fixed remain metadata until a replay adapter can execute them. sb autotune benchmark promote --force is available for audited operator exceptions and records the blocked validation issues on the promoted case.
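A fail-closed check of this kind can be sketched as a validator that returns every unmet requirement, with the force path recording those issues rather than discarding them. The candidate field names, issue labels, and the `answer_cites_source` property used in the usage note are hypothetical.

```python
def promotion_issues(candidate, registered_properties):
    """List unmet promotion requirements; an empty list means promotable."""
    issues = []
    if not candidate.get("reviewed"):
        issues.append("not_reviewed")
    if not candidate.get("enriched"):
        issues.append("not_enriched")
    if not candidate.get("source_provenance"):
        issues.append("missing_provenance")
    executable = [
        assertion
        for assertion in candidate.get("assertions", [])
        # Every property named by the assertion must be registered for the lane.
        if assertion.get("properties")
        and all(p in registered_properties for p in assertion["properties"])
    ]
    if not executable:
        issues.append("no_executable_assertion")
    return issues


def promote(candidate, registered_properties, force=False):
    """Promote only when clean, unless an operator forces the exception."""
    issues = promotion_issues(candidate, registered_properties)
    if issues and not force:
        return {"promoted": False, "issues": issues}
    # A forced promotion keeps the validation issues for later audit.
    return {"promoted": True, "forced": bool(issues), "recorded_issues": issues}
```

In this sketch, a candidate whose only assertion names `quality_replay_regression_fixed` is rejected when that label is not a registered property, and a forced promotion of the same candidate carries `no_executable_assertion` in its recorded issues.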

Typical Workflow

Check overall posture:

sb quality summary
sb quality gate --surface retrieval

Inspect recent runs:

sb quality runs --limit 25
sb quality show trace:<trace-id> --json

Inspect suite health:

sb quality suites --json
sb quality replay-cases --json

Inspect operating-harness pressure before or after agent-generated changes:

sb quality harness audit --changed-only --json
sb quality harness review --role runtime-contract --changed-only --json
sb quality harness gc --last 7d --json

Inspect the full-history autotune experience that mutation planners can use:

sb autotune experience list repl_prompt --json
sb autotune experience frontier repl_prompt --json
sb autotune experience show repl_prompt <run-id> --json
sb autotune experience context repl_prompt --json
sb autotune experience context antahkarana_loop --json

Experience manifests include raw per-case JSONL traces plus a compact trace_summary.json. For Antahkarana runs, that summary captures layer outcome counts, salience/confidence tuning signals, block layers, and failure patterns such as over-processing, under-attention, missing layers, or missed blocks.

Run the Antahkarana lane directly when a change touches attention routing, confidence thresholds, safety gates, memory consolidation, or the full cognitive loop:

sb autotune bench antahkarana_loop --pack smoke --json

Run underlying sources directly when deeper validation is needed:

sb eval run --json
sb grounded eval --json
sb autotune report repl_prompt --last 5 --json
sb autotune report antahkarana_loop --last 5 --json
sb improve conversations reports <batch-id> --json

Phase 2 Direction

Phase 2 is the active-learning layer on top of the same control plane:

  • promote replay backlog entries into executable private eval suites
  • seed benchmark pressure from real failures across more routes
  • rank top regressions by impact and recurrence
  • gate prompt, tool, and routing changes against the replay suites they touch
  • use composite quality scores for lane health
  • make grounded low-score families influence related prompt and retrieval lanes