Quality Control Plane

The quality control plane unifies runtime traces, eval suites, grounded trajectories, autotune runs, and conversation improvement reports into one operator surface.

Use it when you need a single place to answer these questions:

Is the system healthy right now?
What regressed?
Which surface is under-covered?
Should this change be allowed, degraded, escalated to HITL, or blocked?

Primary Entry Points

Use sb quality as the default operator entrypoint:

sb quality summary
sb quality runs
sb quality show <quality-run-id>
sb quality suites
sb quality improvement-issues
sb quality replay-cases
sb quality promote-replay <replay-case-id>
sb quality run-replay <replay-case-id>
sb quality gate --surface all
sb quality harness audit --json
sb quality harness review --role reliability --changed-only --json
sb quality harness scan --changed-only --json
sb quality harness process --json
sb quality harness revalidate --json
sb quality harness enrich --json
sb quality harness status --json
sb quality harness export --format json
sb quality harness gc --last 7d --json

The same materialized control-plane data is exposed in sb serve at:

GET /quality/summary
GET /quality/runs
GET /quality/runs/{quality_run_id}
GET /quality/suites
GET /quality/improvement-issues
GET /quality/replay-cases
GET /quality/replay-results
POST /quality/replay-cases/{case_id}/promote
POST /quality/replay-cases/{case_id}/run
GET /quality/gates/latest
GET /autotune/cycles/{cycle_id}
GET /autotune/benchmark/unresolved?lane=<lane>
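As a rough illustration of reading these endpoints from a script, the sketch below builds request URLs and decodes a response. The base URL is an assumption (the address `sb serve` binds to is not stated here), and the helper names `quality_url` and `get_json` are hypothetical. It also assumes the endpoints return JSON bodies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the host and port that `sb serve` binds to are not given here.
BASE_URL = "http://127.0.0.1:8000"


def quality_url(path: str, base_url: str = BASE_URL, **params: str) -> str:
    """Join the base URL, a documented path, and optional query parameters."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urlencode(params)
    return url


def get_json(path: str, **params: str):
    """GET a control-plane endpoint and decode the body, assuming it is JSON."""
    with urlopen(quality_url(path, **params), timeout=10) as response:
        return json.load(response)
```

With a server running, `get_json("/quality/gates/latest")` would fetch the latest gate record, and `get_json("/autotune/benchmark/unresolved", lane="repl_prompt")` would pass the lane as a query parameter.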

The browser UI now prefers the /quality route for run health and drill-down. Existing /runs and /traces routes remain available for compatibility.

Operating Harness Checks

sb quality harness is the offline operating-harness layer. It does not call live providers and does not edit files. It reports the local constraints and review pressure that help coding agents work predictably:

audit checks coverage gaps, oversized agent-hostile files, missing remediation hints, skill validation metadata, docs hygiene, and pending replay coverage.
review --role <role> runs a deterministic role check. Supported roles are reliability, security-policy, runtime-contract, frontend-ux, docs-help, and test-strategy.
gc turns repeated failures, replay backlog, missing validation metadata, and weak deterministic checks into human-reviewable improvement proposals.
scan, process, revalidate, enrich, status, and export form the persisted local pipeline for candidate findings. Scan stores matcher candidates in the work database, process materializes deduplicated findings, revalidate marks current verdicts, enrich adds recent git context, status reports counts, and export emits JSON or per-finding Markdown.

Harness reports are structured for agents and CI. Every finding includes an id, severity, surface, role, evidence payload, remediation fix_hint, and gate recommendation. Audit reports also separate coverage gaps, stale rules, oversized agent-hostile source files, missing remediation hints, and missing replay/eval coverage so follow-up agents can route work without reading the whole repository.
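A follow-up agent could route findings without reading the repository along these lines. This is a minimal sketch: the `Finding` class and the exact key names (in particular `gate` for the gate recommendation) are assumptions, since the JSON schema of the harness report is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Finding:
    # Field names mirror the prose above; the real JSON keys may differ.
    id: str
    severity: str
    surface: str
    role: str
    evidence: dict[str, Any] = field(default_factory=dict)
    fix_hint: str = ""
    gate: str = "ALLOW"  # assumed key for the gate recommendation


def route_by_role(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Group findings by review role so each role gets its own work queue."""
    routed: dict[str, list[Finding]] = defaultdict(list)
    for finding in findings:
        routed[finding.role].append(finding)
    return dict(routed)


def blocking(findings: list[Finding]) -> list[Finding]:
    """Select the findings whose gate recommendation is BLOCK."""
    return [finding for finding in findings if finding.gate == "BLOCK"]
```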

Trace Process Contracts

Runtime traces can declare expected process evidence in metadata.process_contract or metadata.process_contracts. The quality plane evaluates those contracts during sb quality summary, sb quality runs, and sb quality gate, so a completed trace can still fail quality if it skipped a required tool, event, decision, evidence item, or ordered step sequence.

Supported contract keys:

{
  "id": "grounded_answer",
  "required_tools": ["search_notes"],
  "required_events": ["eval.response_faithfulness"],
  "required_decisions": ["verify_answer"],
  "required_evidence": ["source:policy"],
  "required_sequence": ["tool:search_notes", "event:eval.response_faithfulness"],
  "forbidden_tools": ["unsafe_write"]
}

Failed contracts set dimensions.tool_correctness, add a process_contract_failed:<id> regression flag, block the relevant quality gate, and seed replay backlog pressure for the affected trace. This keeps process drift visible even when the final text looks correct. Running sb quality run-replay <case-id> on one of these cases records the contract checks, observed process evidence, and process_contract_score in the replay result. Promoting one of these cases into benchmark pressure carries executable contract expectations into the seeded candidate, such as required or forbidden tools and local-context requirements.
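To make the contract semantics concrete, here is a minimal sketch of how such a contract could be checked against observed process evidence. It is not the quality plane's evaluator: the `Observed` container, the failure labels, and the choice to treat `required_sequence` as an ordered (not necessarily contiguous) subsequence are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Observed:
    """Hypothetical container for process evidence extracted from one trace."""
    tools: list[str] = field(default_factory=list)
    events: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    # Ordered steps in "kind:name" form, e.g. "tool:search_notes".
    steps: list[str] = field(default_factory=list)


def is_subsequence(required: list[str], steps: list[str]) -> bool:
    """True when `required` appears in `steps` in order, gaps allowed."""
    remaining = iter(steps)
    return all(step in remaining for step in required)


def check_contract(contract: dict, observed: Observed) -> list[str]:
    """Return failure labels; an empty list means the contract passed."""
    failures = []
    required = [
        ("required_tools", observed.tools, "tool"),
        ("required_events", observed.events, "event"),
        ("required_decisions", observed.decisions, "decision"),
        ("required_evidence", observed.evidence, "evidence"),
    ]
    for key, seen, label in required:
        for name in contract.get(key, []):
            if name not in seen:
                failures.append(f"missing_{label}:{name}")
    for name in contract.get("forbidden_tools", []):
        if name in observed.tools:
            failures.append(f"forbidden_tool:{name}")
    sequence = contract.get("required_sequence", [])
    if sequence and not is_subsequence(sequence, observed.steps):
        failures.append("sequence_violated")
    return failures
```

Under this sketch, a trace that produced a plausible final answer but never called `search_notes` would still report `missing_tool:search_notes`, which is the kind of process drift the contracts are meant to surface.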

The persisted harness pipeline is deterministic and offline-safe. It uses a local matcher registry with noise tiers (precise, normal, noisy) for runtime reliability, security-policy, prompt-boundary, integration side-effect, filesystem-boundary, frontend, docs/help, and test-strategy pressure. Pipeline state lives in the work database tables quality_harness_runs and quality_harness_file_records; BLOCK findings are folded into the composite quality gate.

Docs hygiene turns the agent-operability rules into checks. It verifies that AGENTS.md stays a routing table, docs/INDEX.md matches the current docs tree, active plans keep required lifecycle metadata, and active plans are listed in both MkDocs nav and the component test-entrypoint table.

Review roles are intentionally narrow:

| Role | Primary pressure |
| --- | --- |
| `reliability` | timeouts, network/process calls, and brittle runtime behavior |
| `security-policy` | unsafe execution patterns and policy-sensitive code |
| `runtime-contract` | stream contracts, schema drift, and agent-hostile source files |
| `frontend-ux` | changed UI files that need ergonomic and visual review |
| `docs-help` | user-facing CLI changes without docs or schema refreshes |
| `test-strategy` | source changes that lack focused tests |

The harness GC command is advisory in v1. It reads existing quality summaries, replay backlog, repeated regressions, missing fix hints, and skill validation metadata, then emits proposals for humans to approve. It never edits AGENTS.md, skills, docs, or replay cases on its own.

Multi-agent campaign descriptors live with background sessions. They describe a goal, role assignments, token/time budgets, success metrics, review roles, and merge gates. Write-capable role assignments default to task-workspace isolation so campaign runners can coordinate multiple agents without sharing one mutable checkout.

Canonical Quality Model

Phase 1 keeps raw stores where they already live and adds one normalized quality layer on top:

runtime DB: events, tool_calls, decision_traces
work DB: eval runs, autotune runs, improvement artifacts, materialized quality summaries
grounded store: persisted grounded reasoning trajectories
work DB environment tables: replayable environment_episodes and steps

Each control-plane record is summarized into a canonical QualityRunSummary keyed by QualityRunRef.

Quality dimensions

completion_rate
failure_rate
blocked_rate
latency_p50_ms
latency_p95_ms
retrieval_quality
groundedness
faithfulness
tool_correctness
tool_restraint
clarity
overall_score

Eval Families

Every major route or lane should map into one or more of these families:

| Family | What it covers | Typical sources |
| --- | --- | --- |
| Runtime health | trace coverage, failures, blocked runs, latency, approval friction | decision traces, runtime events, tool calls |
| Deterministic regressions | golden SQL, workflow, and artifact checks | `sb eval run`, artifact evals |
| Behavioral judges | runtime prompt judges and session judges | `eval.*` events, conversation improvement judges |
| Scenario / trajectory evals | retrieval-heavy and replay-heavy scenarios | grounded evals, autotune packs, simulation guards |

Any surface without at least one deterministic or scenario check is treated as a coverage_gap.
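The coverage rule can be expressed as a small predicate. The sketch below assumes a hypothetical in-memory matrix keyed by surface, with `deterministic` and `scenario` lists of check names; the real coverage matrix structure in code may differ.

```python
def coverage_gaps(matrix: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return surfaces that have neither a deterministic nor a scenario check."""
    return sorted(
        surface
        for surface, checks in matrix.items()
        if not checks.get("deterministic") and not checks.get("scenario")
    )
```

Judge checks alone do not close a gap under this rule, because only deterministic or scenario checks count.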

Coverage Matrix

The control plane tracks an explicit coverage matrix in code.

Current surfaces:

| Surface | Route / lane | Source of truth | Deterministic checks | Judge checks | Scenario checks | Gate inputs |
| --- | --- | --- | --- | --- | --- | --- |
| chat | `repl_prompt` | runtime decision traces + prompt judges + improvement runs | improvement synthetic guards | faithfulness / conversation judges | session replay and improvement runs | trace health + judge aggregates + improvement guards |
| retrieval | `vault_retrieval` | traces + grounded trajectories + prompt judges | retrieval-oriented eval suites | retrieval quality / faithfulness | grounded eval families | trace health + grounded suites + judges |
| cognition | `antahkarana_loop` | Antahkarana impulse/impression traces + autotune experience artifacts | full-loop cognitive fixtures | optional prompt/layer judges | cognitive-loop benchmark packs | layer outcome rates + safety blocks + trace health |
| environments | `sb env` / `/environments` | persisted environment episodes in `work.db` | replay comparison and verifier checks | none by default | environment manifests and stored episodes | normalized reward + terminal status + replay result |
| data | SQL / workflow surfaces | eval harness summaries | golden eval suites | optional runtime judges | artifact scenarios | trace health + deterministic suite status |
| autotune | benchmark lanes | autotune runs + scientific evidence + self-improvement experiments | confirmation guards + paired-case evidence + reward components | judge summaries | benchmark packs, confirmations, and self-improvement experiments | trace health + benchmark, scientific-evidence, reward, and confirmation outcomes |

Gate Policy

The composite gate wraps the existing release-gate evaluator instead of replacing it.

Possible decisions:

ALLOW: deterministic checks pass and no degrade or block condition is active
DEGRADE: latency regressed, blocked rate regressed, or judge scores dropped
HITL: not enough sample size or coverage confidence to trust an automated decision, including underpowered or incomplete self-improvement experiment evidence
BLOCK: relevant deterministic or scenario suites failed, trace coverage is broken, runtime failure spikes crossed severe thresholds, or a relevant self-improvement experiment is missing, malformed, failed, or safety-blocked
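The four decisions can be read as a precedence ladder. The sketch below is one plausible reading, not the composite gate itself: the `GateInputs` field names are hypothetical, and the ordering BLOCK over HITL over DEGRADE over ALLOW is an assumption, since the list above does not state how HITL and DEGRADE rank against each other.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    """Hypothetical boolean signals derived from the conditions listed above."""
    suite_failed: bool = False            # deterministic or scenario suite failed
    trace_coverage_broken: bool = False
    severe_failure_spike: bool = False
    experiment_blocked: bool = False      # missing, malformed, failed, or safety-blocked
    low_confidence: bool = False          # sample size or coverage too small
    latency_regressed: bool = False
    blocked_rate_regressed: bool = False
    judge_scores_dropped: bool = False


def decide(inputs: GateInputs) -> str:
    """Return the strongest decision that applies (assumed precedence)."""
    if (inputs.suite_failed or inputs.trace_coverage_broken
            or inputs.severe_failure_spike or inputs.experiment_blocked):
        return "BLOCK"
    if inputs.low_confidence:
        return "HITL"
    if (inputs.latency_regressed or inputs.blocked_rate_regressed
            or inputs.judge_scores_dropped):
        return "DEGRADE"
    return "ALLOW"
```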

Relevant surface routing:

chat / prompt changes: prompt judges + improvement guards + trace health
retrieval changes: retrieval evals + grounded evals + trace health
SQL / data changes: golden eval harness + trace health
autotune changes: benchmark packs + confirmation metrics + self-improvement experiment rewards + trace health

Autotune promotion evidence is explicit. Candidate runs carry a scientific_evidence metadata payload with paired baseline/candidate sample size, primary effect, case wins/losses/ties, mean case delta, and a 95% lower confidence bound. The promotion gate treats required but underpowered evidence as a block, which routes weak or tiny samples to human review instead of creating an automatic PromotionBundle.
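For intuition about what such a payload summarizes, the sketch below computes paired-case statistics from per-case scores. The function name, the output keys, and the use of a one-sided normal approximation (`z = 1.645`) for the lower bound are assumptions; the actual method behind the `scientific_evidence` payload is not described here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_evidence(baseline: list[float], candidate: list[float], z: float = 1.645) -> dict:
    """Summarize paired baseline/candidate scores (normal-approximation sketch)."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be paired case by case")
    if len(baseline) < 2:
        raise ValueError("need at least two paired cases to estimate spread")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    mean_delta = mean(deltas)
    # Lower bound on the mean delta: mean minus z standard errors.
    lower_bound = mean_delta - z * stdev(deltas) / sqrt(n)
    return {
        "sample_size": n,
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
        "mean_case_delta": mean_delta,
        "lower_confidence_bound": lower_bound,
    }
```

A tiny sample tends to produce a wide interval, so the lower bound can fall below zero even when the mean delta is positive. That is the situation the promotion gate treats as underpowered.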

Quality gates consume self-improvement regression flags by surface. A failed retrieval-lane experiment can block retrieval, autotune, and all without polluting the chat gate. Severe experiment flags block; underpowered, inconclusive, or not-run experiment flags route to HITL; lower non-dry-run rewards degrade unless a stronger block condition is present.
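The per-surface routing of experiment flags might look like the following sketch. The flag kind names and the generalization that an experiment on any surface also affects `autotune` and `all` are assumptions extrapolated from the retrieval example above.

```python
# Assumed ordering of decisions from weakest to strongest.
RANK = {"ALLOW": 0, "DEGRADE": 1, "HITL": 2, "BLOCK": 3}

# Hypothetical flag kinds mapped to the decisions described above.
FLAG_DECISION = {
    "severe": "BLOCK",
    "underpowered": "HITL",
    "inconclusive": "HITL",
    "not_run": "HITL",
    "low_reward": "DEGRADE",
}


def affected_surfaces(experiment_surface: str) -> set[str]:
    """Surfaces whose gates consume a flag raised by this experiment."""
    return {experiment_surface, "autotune", "all"}


def gate_for(surface: str, flags: list[tuple[str, str]]) -> str:
    """Fold (experiment_surface, flag_kind) pairs into one decision for a surface."""
    decision = "ALLOW"
    for experiment_surface, kind in flags:
        if surface not in affected_surfaces(experiment_surface):
            continue  # e.g. a retrieval flag does not touch the chat gate
        candidate = FLAG_DECISION[kind]
        if RANK[candidate] > RANK[decision]:
            decision = candidate
    return decision
```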

Signal Sources And Join Order

Signals are normalized from:

decision traces
trace process contracts declared in trace metadata
runtime eval.* events
eval harness runs
grounded trajectories
environment episodes
autotune runs
improvement run reports
self-improvement cycle history

Environment task groups are summarized as environment:<env_id>:<task_id> scenario suites. Failed or low-reward groups can enter the replay backlog and be promoted into internal_replay:environment_<env_id> suites, which are gate inputs for sb quality gate --surface environments. They also seed pending autotune benchmark candidates as real-failure fixtures, giving prompt lanes measurable pressure after operator enrichment and promotion.

Self-improvement cycles are persisted as self_improvement_cycles records and surface as self_improvement quality runs. Their metadata links targets, source environment episode ids, seeded ideas, seeded benchmark candidate ids, Sankalpa goals, and executed autotune runs so the harness can audit the full feedback chain. Each cycle also records an aggregate Antahkarana outcome.

Each selected target becomes a durable self_improvement_experiments row with the hypothesis, intervention, source evidence, control baseline, trial runs, confirmation evidence, measured metrics, deterministic reward components, links, and verdict. The reward payload breaks the claim into task reward, promotion-gate reward, Antahkarana outcome reward, regression penalty, safety penalty, complexity penalty, weights, and a bounded overall score.

Quality summaries derive the self_improvement run score and regression flags from those experiment rewards and verdicts. Missing referenced experiments, malformed reward payloads, failed or underpowered verdicts, and non-dry-run low rewards surface as explicit flags instead of silent gaps. The summaries also include the experiment ids and compact experiment payloads, so a claimed self-improvement can be audited as a falsifiable experiment instead of a loose run note.

Promoted cycles that were driven by karma_regret targets purify the matching active Karma regrets, which gives the next diagnosis pass a deterministic closure signal. Repeated non-dry-run cycles that end without promotion or with errors become cycle_stall planner targets so the self-improvement loop can diagnose its own failed improvement attempts.
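One way to picture a bounded overall score built from reward components is a weighted sum of rewards minus weighted penalties, clamped to a fixed range. This is only an illustrative sketch: the component key names, the linear combination, and the `[0, 1]` bound are assumptions, not the documented reward formula.

```python
REWARDS = ("task", "promotion_gate", "antahkarana_outcome")
PENALTIES = ("regression", "safety", "complexity")


def overall_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted rewards minus weighted penalties, clamped to [0, 1]."""
    gained = sum(weights[key] * components[key] for key in REWARDS)
    lost = sum(weights[key] * components[key] for key in PENALTIES)
    return max(0.0, min(1.0, gained - lost))
```

Keeping the components and weights alongside the score is what lets a reviewer see why a claimed improvement scored the way it did.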

Join precedence is:

trace_id
session_id
explicit run_id or batch_id
standalone suite artifact when no run-level join exists

New eval or improvement flows should emit trace_id whenever a run-level trace exists.
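The join precedence amounts to picking the first identifier that is present. In the sketch below the record is a plain dict with hypothetical keys, and placing `run_id` ahead of `batch_id` is an assumption, since the list above puts them on the same tier.

```python
from typing import Optional

# First match wins; run_id before batch_id is an assumed tie-break.
JOIN_ORDER = ("trace_id", "session_id", "run_id", "batch_id")


def join_key(record: dict) -> Optional[tuple[str, str]]:
    """Return (key_name, value) for the strongest join, or None for a standalone suite artifact."""
    for key in JOIN_ORDER:
        value = record.get(key)
        if value:
            return key, value
    return None
```

A record that carries both a `session_id` and a `trace_id` joins on `trace_id`, which is why new flows are asked to emit it whenever a run-level trace exists.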

Baseline Prompt Judges

All non-grader prompt templates now receive a default runtime judge bundle unless a prompt explicitly opts out with metadata.judge_policy: none.

Baseline bundle:

response_relevance
response_completeness
response_clarity
response_faithfulness

response_faithfulness remains context-gated and only runs when grounded context is available. This keeps judge coverage broad without forcing unsupported faithfulness checks on prompts that have no context payload.
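The selection rule can be sketched as a small function. The function name and its parameters are hypothetical; only the judge names and the `judge_policy: none` opt-out come from the text above.

```python
BASELINE_JUDGES = (
    "response_relevance",
    "response_completeness",
    "response_clarity",
    "response_faithfulness",
)


def judges_for(metadata: dict, has_grounded_context: bool, is_grader: bool = False) -> list[str]:
    """Pick the runtime judges for one prompt template."""
    if is_grader or metadata.get("judge_policy") == "none":
        return []  # grader templates and explicit opt-outs get no default bundle
    return [
        judge
        for judge in BASELINE_JUDGES
        # Faithfulness is context-gated: skip it when no grounded context exists.
        if judge != "response_faithfulness" or has_grounded_context
    ]
```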

From Session Failure To Improvement Pressure

The intended operator loop is:

A runtime trace, judge event, grounded rollout, or environment episode shows a failure or regression.
The control plane materializes that as a quality run with regression flags, suite health impact, a proposed replay case, and a named improvement issue when the failure signature recurs or needs gated follow-up.
Improvement and autotune artifacts linked by trace_id, session_id, or run_id are folded into the same quality run.
Operators inspect /quality or sb quality show <id> to see the trace timeline, eval outcomes, approvals, replay-case proposal, and linked improvement/autotune artifacts.
The named issue clusters source traces, replay cases, severity, suspected root cause, evaluator proposal, offline replay path, and gate status so the work remains reviewable instead of becoming an isolated bug note.
The replay case is promoted into a private/internal scenario suite for the affected route or lane.
Prompt, tool, routing, MCP, browser, and memory changes are gated against the replay suite before promotion.
Recurring signatures increase benchmark pressure until the failure stops recurring in fresh traces.

This is deliberately trace-to-eval-to-fix, not public-benchmark chasing. Public benchmarks can still be useful smoke signals, but private replay cases from real failures are the control-plane source of truth.

sb quality summary --json now exposes both improvement_issues and replay_backlog. An improvement issue is the operator-facing cluster: it names the recurring failure, keeps trace/replay evidence attached, proposes an online monitor and offline replay path, and records whether the current quality gate is blocking or escalating the work. A replay backlog entry is the executable case that can become release-gating pressure.

Each replay backlog entry includes:

source quality_run_id, trace_id, and session_id
route or lane target
failure signatures
target internal replay suite
priority derived from severity and recurrence
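A backlog entry with these fields could be modeled as below. The class, the severity levels, and the priority formula (severity weight multiplied by recurrence count) are all illustrative assumptions; the text only says that priority is derived from severity and recurrence.

```python
from dataclasses import dataclass, field

# Hypothetical severity levels and weights.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3}


@dataclass
class ReplayBacklogEntry:
    quality_run_id: str
    trace_id: str
    session_id: str
    target: str                      # route or lane
    replay_suite: str                # target internal replay suite
    failure_signatures: list[str] = field(default_factory=list)
    severity: str = "low"
    recurrence: int = 1


def priority(entry: ReplayBacklogEntry) -> int:
    """Assumed scoring: more severe and more frequent failures rank higher."""
    return SEVERITY_WEIGHT[entry.severity] * entry.recurrence


def ranked(entries: list[ReplayBacklogEntry]) -> list[ReplayBacklogEntry]:
    """Order the backlog from highest to lowest priority."""
    return sorted(entries, key=priority, reverse=True)
```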

sb quality show <quality-run-id> --json includes the replay proposal for that run when the run failed, blocked, or carried regression flags.

Promote a replay proposal when it should become release-gating pressure:

sb quality replay-cases --status proposed --json
sb quality promote-replay replay:trace:<trace-id> --notes "guard regression before prompt changes"
sb quality run-replay replay:trace:<trace-id> --json

Promotion creates an internal_replay:<route-or-lane> eval run with a proposed case and, when the replay maps to an autotune lane, a pending BenchmarkCandidate with source_type="quality_replay_case". Proposed internal replay suites are treated as HITL by the composite gate for matching surfaces until they are executed and resolved.

sb quality run-replay executes the best available replay adapter and writes a measured passed or failed eval run. Each execution creates a new eval run and preserves the proposal or promotion eval id as source lineage instead of overwriting prior evidence. The environment adapter replays stored environment episodes into fresh fixture environments. Enriched repl/chat replay cases run through the real repl_prompt runtime prompt evaluator when they carry executable expected outputs, properties, tools, or schema assertions. Retrieval replay cases run against local memory/source-evidence FTS when expected_refs are present. If a prompt replay has no executable benchmark assertion, the decision-trace adapter loads the persisted trace envelope and turns historical failed or blocked traces into measured failed evidence. Replay families without an executable adapter still fail visibly as not_runnable instead of passing by convention. Failed replay suites block the gate like any other scenario failure.

The benchmark candidate stays in triage until an operator enriches it and promotes it into an autotune pack. The candidate id is persisted on the replay case metadata and evidence links so later quality views can trace the handoff. Pending real-failure benchmark candidates also appear as benchmark_pressure targets in sb autotune self-diagnose, so replay promotion can pull the lane into the self-improvement planner even before another autotune run fails.

Self-improvement cycles execute against the evidence that selected the target. Use the explain surface before mutating anything:

sb autotune improve --explain-plan --json

The plan records source benchmark candidates, replay case ids, environment episode ids, required pack, and confirmation pack. When a target came from recent replay or benchmark pressure, the run uses that target-bound pack rather than a generic smoke pack; high-severity pressure escalates to hard, and the derived confirmation pack is passed into the confirmation evaluator. Promotion bundles are blocked when any of the following holds: required source replay cases were not rerun or did not pass, the effective pack does not match the target plan, source benchmark candidate coverage is missing, unresolved failed replay suites remain, candidate properties lack executable contracts, or force-promoted candidates lack explicit review.

After a cycle:

sb autotune cycle show <cycle-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb quality replay-results --json

Benchmark candidates keep resolution_status (open, fixed, still_failing, invalid, or superseded), resolved_by_run_id, and resolution_evidence. Fixed candidates stop contributing active benchmark_pressure; failed attempts are marked still_failing and keep recurrence metadata so they become more visible instead of disappearing.
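The resolution lifecycle can be sketched with an enum and a filter. The status values come from the text above; treating only `open` and `still_failing` as contributing to active pressure is an assumption, since the text does not say how `invalid` or `superseded` candidates are counted.

```python
from enum import Enum


class ResolutionStatus(str, Enum):
    OPEN = "open"
    FIXED = "fixed"
    STILL_FAILING = "still_failing"
    INVALID = "invalid"
    SUPERSEDED = "superseded"


# Assumption: only these statuses keep contributing benchmark_pressure.
ACTIVE = {ResolutionStatus.OPEN, ResolutionStatus.STILL_FAILING}


def active_pressure(candidates: list[dict]) -> list[dict]:
    """Keep candidates whose resolution_status still counts as active pressure."""
    return [
        candidate
        for candidate in candidates
        if ResolutionStatus(candidate["resolution_status"]) in ACTIVE
    ]
```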

Benchmark promotion is fail-closed. Candidates must be reviewed, completely enriched, carry source provenance, and define at least one executable assertion whose property names are registered for the target lane. Replay-only labels such as quality_replay_regression_fixed remain metadata until a replay adapter can execute them. sb autotune benchmark promote --force is available for audited operator exceptions and records the blocked validation issues on the promoted case.
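Fail-closed promotion means a candidate is rejected unless every check passes. The sketch below illustrates that shape with hypothetical candidate keys (`reviewed`, `enriched`, `source`, `assertions`) and issue labels; it also simplifies each assertion to a single property name.

```python
def validation_issues(candidate: dict, registered_properties: set[str]) -> list[str]:
    """Collect every reason the candidate is not ready for promotion."""
    issues = []
    if not candidate.get("reviewed"):
        issues.append("not_reviewed")
    if not candidate.get("enriched"):
        issues.append("not_enriched")
    if not candidate.get("source"):
        issues.append("missing_source_provenance")
    assertions = candidate.get("assertions", [])
    if not any(name in registered_properties for name in assertions):
        issues.append("no_registered_executable_assertion")
    return issues


def promote(candidate: dict, registered_properties: set[str], force: bool = False) -> dict:
    """Reject on any issue unless forced; a forced promotion records the issues."""
    issues = validation_issues(candidate, registered_properties)
    if issues and not force:
        return {"promoted": False, "issues": issues}
    return {"promoted": True, "issues": issues}
```

A candidate whose only assertion is a replay-only label such as `quality_replay_regression_fixed` would be rejected under this sketch unless that name is registered for the lane.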

Typical Workflow

Check overall posture:

sb quality summary
sb quality gate --surface retrieval

Inspect recent runs:

sb quality runs --limit 25
sb quality show trace:<trace-id> --json

Inspect suite health:

sb quality suites --json
sb quality replay-cases --json

Inspect operating-harness pressure before or after agent-generated changes:

sb quality harness audit --changed-only --json
sb quality harness review --role runtime-contract --changed-only --json
sb quality harness gc --last 7d --json

Inspect the full-history autotune experience that mutation planners can use:

sb autotune experience list repl_prompt --json
sb autotune experience frontier repl_prompt --json
sb autotune experience show repl_prompt <run-id> --json
sb autotune experience context repl_prompt --json
sb autotune experience context antahkarana_loop --json

Experience manifests include raw per-case JSONL traces plus a compact trace_summary.json. For Antahkarana runs, that summary captures layer outcome counts, salience/confidence tuning signals, block layers, and failure patterns such as over-processing, under-attention, missing layers, or missed blocks.

Run the Antahkarana lane directly when a change touches attention routing, confidence thresholds, safety gates, memory consolidation, or the full cognitive loop:

sb autotune bench antahkarana_loop --pack smoke --json

Run underlying sources directly when deeper validation is needed:

sb eval run --json
sb grounded eval --json
sb autotune report repl_prompt --last 5 --json
sb autotune report antahkarana_loop --last 5 --json
sb improve conversations reports <batch-id> --json

Phase 2 Direction

Phase 2 is the active-learning layer on top of the same control plane:

promote replay backlog entries into executable private eval suites
seed benchmark pressure from real failures across more routes
rank top regressions by impact and recurrence
gate prompt, tool, and routing changes against the replay suites they touch
use composite quality scores for lane health
make grounded low-score families influence related prompt and retrieval lanes