Environment And Improvement Loop User Guide

SecondBrain's improvement loop is a local, auditable harness for turning real failures into measured changes. It combines four planes:

Environments: replayable task episodes with reset, action, observation, reward, persistence, export, and replay.
Quality: a control plane that summarizes runtime health, environment suites, replay cases, benchmark pressure, and self-improvement experiments.
Autotune: bounded lane-specific mutation and evaluation with worktrees, benchmark packs, paired-case evidence, and promotion gates.
Antahkarana: the cognitive loop that records goals, regret, strategy priors, cycle outcomes, and closure.

The design follows the same core shape used by modern RL environment guides: define a task, reset a bounded world, expose a small action space, return structured observations, score rewards, stop episodes deterministically, and keep rollouts replayable. SecondBrain applies that shape to local agent work rather than online training: every episode, replay, candidate, run, gate, and closure decision is stored as inspectable evidence.

Mental Model

The loop is not "run a model and hope it improves." It is a scientific control system:

flowchart LR
  A["Runtime or environment failure"] --> B["Replayable evidence"]
  B --> C["Quality replay case or benchmark candidate"]
  C --> D["Improvement target"]
  D --> E["Target execution plan"]
  E --> F["Autotune baseline on required pack"]
  F --> G["Bounded mutation in a git worktree"]
  G --> H["Treatment on same evidence"]
  H --> I["Promotion gate"]
  I --> J["Source closure"]
  J --> K["Cycle lineage and quality summary"]

Every successful cycle must account for each of these invariants:

Invariant	What SecondBrain records
Hypothesis	The pressure being fixed, usually from a candidate, replay case, environment task, regret, or stalled cycle.
Source evidence	Candidate ids, replay case ids, environment episode ids, or cycle ids.
Control measurement	Baseline score on the required pack or replay before mutation.
Intervention	Mutation strategy, files touched, candidate branch, model/provider if any.
Treatment measurement	Candidate score on the same evidence.
Confirmation	Paired-case scientific evidence, replay obligations, and promotion gate metadata.
Closure	`fixed`, `still_failing`, `invalid`, `superseded`, or escalated for review.

What Counts As An Environment

An environment is a bounded local task executor. The contract lives in brain/environments/models.py and brain/environments/base.py.

Contract	Purpose
`EnvironmentSpec`	Static environment metadata: id, action types, observation types, max steps, default task.
`TaskSpec`	Goal, success conditions, max steps, and initial values for one task.
`Action`	One structured action submitted to the environment.
`Observation`	Structured state returned by reset or step.
`RewardResult`	Total and normalized reward plus named components.
`StepResult`	Result of applying one action, including reward and terminal flags.
`EpisodeRecord`	Persisted envelope for reset observation, steps, final state, reward, and metadata.
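
As a rough illustration of how these contracts nest, the sketch below models a few of them as Python dataclasses. The field names are assumptions inferred from the table, not the actual definitions in brain/environments/models.py, and the task id `counter.reach_3` is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Action:
    """One structured action submitted to the environment."""
    type: str
    payload: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class RewardResult:
    """Total and normalized reward plus named components."""
    total: float
    normalized: float
    components: Dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class StepResult:
    """Result of applying one action, including reward and terminal flags."""
    observation: Dict[str, Any]
    reward: RewardResult
    terminated: bool = False
    truncated: bool = False


@dataclass
class EpisodeRecord:
    """Persisted envelope for one episode (hypothetical field layout)."""
    env_id: str
    task_id: str
    reset_observation: Dict[str, Any]
    steps: List[StepResult] = field(default_factory=list)

    @property
    def done(self) -> bool:
        # An episode is finished once its last step terminated or truncated.
        return bool(self.steps) and (self.steps[-1].terminated or self.steps[-1].truncated)


reward = RewardResult(total=1.0, normalized=1.0, components={"reached_target": 1.0})
record = EpisodeRecord(env_id="counter", task_id="counter.reach_3",
                       reset_observation={"value": 0})
record.steps.append(StepResult(observation={"value": 3}, reward=reward, terminated=True))
```

The point of the envelope is that reset observation, steps, and reward travel together, so a stored episode carries everything needed to inspect or replay it.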

Current built-in environments:

Environment	Use
`counter`	Pure deterministic fixture for lifecycle, reward, persistence, export, and replay checks.
`workspace`	Root-bounded file-task environment backed by `SessionEnv`, useful for small coding or file-verifier tasks.

The important boundary: environments are not the production MCP/tool plane. They are the replayable measurement plane. Production tool execution stays in the agent runtime; environment episodes preserve trajectory and reward evidence for evaluation and improvement.

Run Environment Episodes

List available environments:

sb env list --json

Run a simple counter episode:

sb env run counter --target 3 --json

Run a workspace verifier episode:

sb env run workspace \
  --target-path answer.txt \
  --answer "ready" \
  --json

Run from a task manifest:

sb env run --task-file examples/env-task.yaml --json

List and inspect persisted episodes:

sb env episodes --env-id workspace --json
sb env show <episode_id> --json

Replay a stored episode against current environment code:

sb env replay <episode_id> --json

Export a trajectory for replay, analysis, or trainer ingestion:

sb env export <episode_id> --format openenv-json --output rollout.json
sb env export <episode_id> --format steps-jsonl --output rollout.steps.jsonl

The export path is data interchange. Replay is the regression check. If replay diverges on reward, terminal status, step count, or stable state signature, the environment implementation changed in a way that needs review.
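
The divergence check can be pictured as a field-by-field comparison between the stored episode and its replay. The following is a minimal sketch under assumed names (`EpisodeSummary`, `replay_divergences`, and the signature strings are hypothetical); the real comparison lives in brain/environments/replay.py and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EpisodeSummary:
    """Hypothetical summary of the fields compared during replay."""
    normalized_reward: float
    terminated: bool
    truncated: bool
    step_count: int
    state_signature: str


def replay_divergences(stored: EpisodeSummary, replayed: EpisodeSummary,
                       reward_tolerance: float = 1e-9) -> List[str]:
    """Return the names of the fields on which the replay diverged."""
    diverged = []
    if abs(stored.normalized_reward - replayed.normalized_reward) > reward_tolerance:
        diverged.append("reward")
    if (stored.terminated, stored.truncated) != (replayed.terminated, replayed.truncated):
        diverged.append("terminal_status")
    if stored.step_count != replayed.step_count:
        diverged.append("step_count")
    if stored.state_signature != replayed.state_signature:
        diverged.append("state_signature")
    return diverged


stored = EpisodeSummary(1.0, True, False, 3, "sig-a")
replayed = EpisodeSummary(0.5, True, False, 4, "sig-a")
divergences = replay_divergences(stored, replayed)  # ["reward", "step_count"]
```

An empty list means the current environment code still reproduces the stored trajectory; any entry is a prompt to review what changed.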

Task Manifests

Task manifests are the preferred way to make an environment task repeatable. They are JSON or YAML files loaded by sb env run --task-file.

schema_version: secondbrain.environment_task.v1
env_id: workspace
task:
  task_id: workspace.write_answer
  goal: Write the expected answer file.
  success_conditions:
    - answer.txt equals expected_text
reset_options:
  target_path: answer.txt
  expected_text: "ready\n"
verifiers:
  - type: file_exists
    name: answer_exists
    path: answer.txt
  - type: file_equals
    name: answer_exact
    path: answer.txt
    expected_text: "ready\n"
actions:
  - type: write_file
    payload:
      path: answer.txt
      content: "ready\n"
  - type: submit
metadata:
  suite: local-fixture

Workspace verifiers are declarative and do not run shell commands:

file_exists
file_equals
file_contains
file_matches_regex

Each verifier emits one reward component. The normalized reward is the weighted pass score divided by total verifier weight. This keeps task success inspectable instead of hiding it behind a single opaque scalar.
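
To make the scoring rule concrete, here is a small sketch of declarative verifiers and the weighted normalization. It is an illustration only: the `weight` and `pattern` fields are assumptions (the manifest above shows only `type`, `name`, `path`, and `expected_text`), and the helper names are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def check(verifier: dict, root: Path) -> bool:
    """Evaluate one declarative verifier against files under root."""
    path = root / verifier["path"]
    kind = verifier["type"]
    if kind == "file_exists":
        return path.is_file()
    if not path.is_file():
        return False
    text = path.read_text(encoding="utf-8")
    if kind == "file_equals":
        return text == verifier["expected_text"]
    if kind == "file_contains":
        return verifier["expected_text"] in text
    if kind == "file_matches_regex":
        return re.search(verifier["pattern"], text) is not None  # assumed field name
    raise ValueError(f"unknown verifier type: {kind}")


def score(verifiers: List[dict], root: Path) -> Tuple[float, Dict[str, float]]:
    """Return (normalized reward, one reward component per verifier)."""
    components = {}
    total_weight = 0.0
    for v in verifiers:
        weight = float(v.get("weight", 1.0))  # assumed field, defaulting to 1.0
        total_weight += weight
        components[v["name"]] = weight if check(v, root) else 0.0
    normalized = sum(components.values()) / total_weight if total_weight else 0.0
    return normalized, components


verifiers = [
    {"type": "file_exists", "name": "answer_exists", "path": "answer.txt"},
    {"type": "file_equals", "name": "answer_exact", "path": "answer.txt",
     "expected_text": "ready\n"},
]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # The trailing newline is missing, so only the existence check passes.
    (root / "answer.txt").write_text("ready", encoding="utf-8")
    normalized, components = score(verifiers, root)
```

Here the file exists but does not match exactly, so the normalized reward is 0.5 and the component breakdown shows which verifier failed.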

From Environment Failure To Improvement Pressure

Low-reward or incomplete environment task groups are not dead ends. They feed the improvement loop through two bridges:

SelfImprovementPlanner scans environment_episodes in work.db.
EnvironmentReplayBridge turns weak task groups into an autotune idea and a BenchmarkCandidate with source_type="environment_episode".

Environment pressure includes:

env_id
task_id
latest failed or weak episode_id
average normalized reward
completion count
target lane
suggested mutation strategy

The current mapping routes counter and workspace to repl_prompt because their failures usually indicate prompt or policy behavior around structured actions, submit timing, and direct answers. Dedicated environment-specific lanes can be added later when they have deterministic evaluators and mutation surfaces.
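
The grouping step might look roughly like the sketch below. It assumes episodes arrive as plain dictionaries, oldest first, and that a task group counts as weak when its average normalized reward falls below a threshold or any episode is incomplete. The threshold, the dictionary keys, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Mapping described in the text: both built-in environments route to repl_prompt.
LANE_FOR_ENV = {"counter": "repl_prompt", "workspace": "repl_prompt"}


def environment_pressure(episodes: List[dict], weak_below: float = 0.8) -> List[dict]:
    """Summarize weak (env_id, task_id) groups; input is assumed oldest-first."""
    groups: Dict[tuple, List[dict]] = defaultdict(list)
    for ep in episodes:
        groups[(ep["env_id"], ep["task_id"])].append(ep)

    pressure = []
    for (env_id, task_id), eps in sorted(groups.items()):
        average = sum(e["normalized_reward"] for e in eps) / len(eps)
        incomplete = any(not e["completed"] for e in eps)
        if average >= weak_below and not incomplete:
            continue  # healthy group, no pressure
        weak = [e for e in eps
                if e["normalized_reward"] < weak_below or not e["completed"]]
        pressure.append({
            "env_id": env_id,
            "task_id": task_id,
            "latest_weak_episode_id": weak[-1]["episode_id"],
            "average_normalized_reward": average,
            "completion_count": sum(1 for e in eps if e["completed"]),
            "target_lane": LANE_FOR_ENV.get(env_id),
        })
    return pressure


episodes = [
    {"episode_id": "ep1", "env_id": "workspace", "task_id": "workspace.write_answer",
     "normalized_reward": 0.5, "completed": True},
    {"episode_id": "ep2", "env_id": "workspace", "task_id": "workspace.write_answer",
     "normalized_reward": 1.0, "completed": True},
    {"episode_id": "ep3", "env_id": "counter", "task_id": "counter.reach_3",
     "normalized_reward": 1.0, "completed": True},
]
pressure = environment_pressure(episodes)
```

Only the workspace group surfaces as pressure in this example; the counter group is healthy and is skipped.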

Quality Replay Cases

Quality replay cases are the bridge between observed failures and executable evidence:

sb quality replay-cases --json
sb quality run-replay <case-id> --json
sb quality replay-results --json
sb quality promote-replay <case-id> --json

Replay execution uses adapter contracts in brain/quality/replay_execution.py:

Adapter	Source
`runtime_prompt`	Prompt-backed quality cases linked to benchmark cases.
`retrieval_replay`	Retrieval cases with expected references.
`environment_replay`	Environment episodes replayed through environment fixtures.

Replay results are persisted as measured pass/fail evidence. Promotion gates can block a candidate when a required replay case was not rerun, did not pass, or belongs to an unresolved failed suite.

Benchmark Candidates And Packs

Benchmark candidates are pressure sources. They are not trusted until they carry executable expectations.

Common workflow:

sb autotune benchmark unresolved --lane repl_prompt --json
sb autotune benchmark show <candidate-id> --json
sb autotune benchmark enrich <candidate-id> ...
sb autotune benchmark promote <candidate-id> --pack recent_failures
sb autotune benchmark report repl_prompt --json

Important fields:

Field	Meaning
`source_type`	Where the pressure came from: replay case, environment episode, runtime event, etc.
`source_ref`	Pointer to the replay case id, episode id, trace id, or artifact.
`expected_properties`	Executable behavior contract for the lane evaluator.
`output_schema`	Optional schema contract such as `json_object`.
`severity`	Escalates pack selection and review pressure.
`pack_suggestions`	Hints for `recent_failures`, `hard`, or regression packs.
`resolution_status`	Lifecycle status: `open`, `fixed`, `still_failing`, `invalid`, or `superseded`.

Pack selection is target-bound:

low/default evidence can run smoke;
recent real failures run recent_failures;
high-severity evidence escalates to hard;
broad or risky changes can escalate to regression.

The source evidence determines the pack. A run cannot claim improvement by passing an unrelated smoke pack when the pressure came from a hard failure.
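
One way to read the target-bound rule is as a "strictest pack wins" selection. The sketch below is an assumption-heavy illustration: the strictness ranking, the `"high"` severity label, and the `recent_failure` flag are hypothetical, and the actual selection logic may weigh evidence differently.

```python
# Assumed strictness ordering; the guide names these packs but not a formal ranking.
PACK_RANK = {"smoke": 0, "recent_failures": 1, "hard": 2, "regression": 3}


def required_pack(candidates, broad_change=False):
    """Pick the strictest pack demanded by any source candidate."""
    required = "smoke"
    for candidate in candidates:
        if candidate.get("severity") == "high":        # assumed severity label
            wanted = "hard"
        elif candidate.get("recent_failure", False):   # hypothetical flag
            wanted = "recent_failures"
        else:
            wanted = "smoke"
        if PACK_RANK[wanted] > PACK_RANK[required]:
            required = wanted
    # Broad or risky changes escalate to the regression pack.
    return "regression" if broad_change else required


def claim_is_valid(ran_pack, candidates, broad_change=False):
    """A run only counts if it ran a pack at least as strict as required."""
    return PACK_RANK[ran_pack] >= PACK_RANK[required_pack(candidates, broad_change)]


candidates = [
    {"candidate_id": "cand-1", "severity": "high", "recent_failure": True},
    {"candidate_id": "cand-2", "severity": "low"},
]
print(required_pack(candidates))            # hard
print(claim_is_valid("smoke", candidates))  # False
```

Under these assumptions, a smoke-pack pass cannot close pressure that originated from high-severity evidence.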

Autotune Lanes

An autotune lane is a bounded mutation/evaluation contract. Lane specs live in brain/autotune/specs/.

For example, repl_prompt defines:

mutable prompt file: brain/prompts/specs/agents/agent.profile.default.v1.yaml
frozen implementation paths
evaluator: prompt.repl
benchmark packs: smoke, core, recent_failures, hard, regression
acceptance thresholds
guard metrics
confirmation policy

Run a standalone benchmark:

sb autotune bench repl_prompt --pack hard --json

Run one bounded lane attempt:

sb autotune run repl_prompt --pack hard --ignore-pause --json

Run the full self-improvement loop:

sb autotune improve --lane repl_prompt --max-lanes 1 --pack smoke --explain-plan --json
sb autotune improve --lane repl_prompt --max-lanes 1 --pack smoke --ignore-pause --json

Running with the --explain-plan flag is the safest first step. It shows:

selected target;
target key;
required pack;
confirmation pack;
source candidate ids;
replay case ids;
environment episode ids;
unsupported candidate properties;
force-promoted candidates that need explicit review.

True Agent Harness Improvement Loop

The full loop is implemented by SelfImprovementOrchestrator in brain/autotune/orchestrator.py.

One cycle does this:

Diagnose pressure with SelfImprovementPlanner.
Seed replay/environment/LLM ideas where appropriate.
Register Sankalpa goals for selected targets.
Advance meta-learning from prior autotune history.
Build a target execution plan for the selected target.
Run source replay obligations.
Run autotune against the required evidence-bound pack.
Persist run evidence, scientific evidence, promotion gate metadata, and cycle lineage.
Update source closure when promotion succeeds, or record the recurrence when attempts keep failing.
Record Antahkarana cycle outcome and quality projection.

Target sources include:

Karma regrets;
autotune failure rates;
grounded trajectory scores;
environment scores;
Sankalpa goal gaps;
stalled self-improvement cycles;
benchmark pressure.

The loop can improve itself. Repeated partial, no-effect, blocked, or failed cycles become cycle_stall targets so the planner can diagnose failure of the improvement process itself.

Scientific Promotion Gate

Autotune does not promote just because a run was "kept." Promotion requires evidence.

brain/autotune/scientific_evidence.py builds paired-case evidence from baseline and candidate evaluations:

baseline primary metric;
candidate primary metric;
primary delta in the lane's metric direction;
paired case count;
case wins/losses/ties;
mean case delta;
standard error;
95 percent lower confidence bound;
promotion readiness and blockers.
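
The paired statistics listed above can be sketched as follows. This is an approximation under stated assumptions: higher scores are better, the bound is a one-sided 95 percent lower bound from a normal approximation (z = 1.645), and the function and key names are hypothetical. brain/autotune/scientific_evidence.py may use a different estimator.

```python
import math
from statistics import mean, stdev
from typing import Dict


def paired_evidence(baseline: Dict[str, float], candidate: Dict[str, float],
                    z: float = 1.645) -> dict:
    """Summarize per-case deltas for cases present in both evaluations."""
    case_ids = sorted(baseline.keys() & candidate.keys())
    deltas = [candidate[c] - baseline[c] for c in case_ids]
    n = len(deltas)
    mean_delta = mean(deltas) if n else 0.0
    # The standard error of the mean needs at least two paired cases.
    std_err = stdev(deltas) / math.sqrt(n) if n >= 2 else float("inf")
    return {
        "paired_cases": n,
        "wins": sum(1 for d in deltas if d > 0),
        "losses": sum(1 for d in deltas if d < 0),
        "ties": sum(1 for d in deltas if d == 0),
        "mean_delta": mean_delta,
        "standard_error": std_err,
        # One-sided lower bound; negative infinity when evidence is underpowered.
        "lower_bound_95": mean_delta - z * std_err,
    }


baseline = {"a": 0.5, "b": 0.5, "c": 1.0, "d": 0.0}
candidate = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.5}
evidence = paired_evidence(baseline, candidate)
single_case = paired_evidence({"a": 0.0}, {"a": 1.0})
```

With four paired cases the mean delta is 0.375 and the lower bound stays positive. With a single case the standard error is undefined, so the bound collapses to negative infinity no matter how large the one delta is.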

The promotion gate can block when:

the run was not kept;
baseline or candidate metric is missing;
guard metrics regressed;
confirmation failed;
paired evidence is underpowered;
lower confidence bound is not positive;
source replay obligations are missing or failed;
touched lane has unresolved failed replay suites;
candidate properties are unsupported;
force-promoted candidates lack explicit review.

This is why a tiny improvement on one case may be kept for review but not promoted automatically. Scientific promotion needs enough paired evidence to avoid treating noise as progress.
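
A gate of this kind can be sketched as a function that collects every blocker rather than stopping at the first. The example covers only a subset of the conditions listed above; the `RunEvidence` fields, blocker names, the `min_paired_cases` default, and the metric values are illustrative assumptions, not the contents of brain/autotune/promotion_gate.py.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunEvidence:
    """Hypothetical subset of what a gate might read from a finished run."""
    kept: bool
    baseline_metric: Optional[float]
    candidate_metric: Optional[float]
    guard_regressions: List[str] = field(default_factory=list)
    confirmation_passed: bool = True
    paired_cases: int = 0
    lower_bound_95: float = float("-inf")
    failed_replay_case_ids: List[str] = field(default_factory=list)


def gate_blockers(run: RunEvidence, min_paired_cases: int = 5) -> List[str]:
    """Collect every reason to block promotion; an empty list means promote."""
    blockers = []
    if not run.kept:
        blockers.append("run_not_kept")
    if run.baseline_metric is None or run.candidate_metric is None:
        blockers.append("missing_metric")
    if run.guard_regressions:
        blockers.append("guard_metrics_regressed")
    if not run.confirmation_passed:
        blockers.append("confirmation_failed")
    if run.paired_cases < min_paired_cases:
        blockers.append("underpowered_paired_evidence")
    if not run.lower_bound_95 > 0:
        blockers.append("lower_bound_not_positive")
    if run.failed_replay_case_ids:
        blockers.append("replay_obligations_failed")
    return blockers


# A kept run with a small gain on a single paired case.
tiny_gain = RunEvidence(kept=True, baseline_metric=0.78, candidate_metric=0.80,
                        paired_cases=1)
# A kept run with enough paired cases and a positive lower bound.
strong_gain = RunEvidence(kept=True, baseline_metric=0.78, candidate_metric=1.0,
                          paired_cases=12, lower_bound_95=0.09)
```

In this sketch `tiny_gain` is blocked for being underpowered and lacking a positive lower bound, while `strong_gain` returns no blockers.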

Source Closure

After a run clears the promotion gate, the source pressure behind it is closed. Confirm closure with:

sb autotune benchmark show <candidate-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json

Closure fields:

resolved_by_run_id
resolved_at
resolution_status
resolution_evidence

Fixed candidates stop appearing as active benchmark_pressure. Failed attempts increment recurrence metadata instead of disappearing. This gives the next diagnosis pass a durable memory of what was tried.

Operator Runbook

Use this runbook when you want one falsifiable improvement loop.

1. Start from a clean tree

Autotune creates git worktrees. The run guard requires a clean repo:

git status --short --branch

Commit or stash unrelated edits before running a treatment cycle.

2. Inspect current pressure

sb quality summary --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb quality replay-cases --json

3. Run a control benchmark

sb autotune bench repl_prompt --pack hard --json

Save the result when demonstrating a loop:

sb autotune bench repl_prompt --pack hard --json \
  > out/checkins/<name>/control-hard-benchmark.json

4. Explain the target-bound plan

sb autotune improve \
  --lane repl_prompt \
  --max-lanes 1 \
  --pack smoke \
  --explain-plan \
  --json

Do not continue if the selected target is not the evidence you intend to test. If an earlier failed attempt created cycle_stall, sankalpa_gap, or autotune_failures pressure, use a fresh state for a demonstration or close the higher-priority pressure first.

5. Run treatment

sb autotune improve \
  --lane repl_prompt \
  --max-lanes 1 \
  --pack smoke \
  --ignore-pause \
  --json

The run may create candidate branches such as:

autotune/repl_prompt/cand-<run-id>

6. Inspect cycle and closure

sb autotune cycle show <cycle-id> --json
sb autotune benchmark show <candidate-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb autotune report repl_prompt --last 4 --json

7. Capture a checkin

A good checkin folder contains:

seeded source evidence;
explain-plan output;
control benchmark;
treatment cycle JSON;
cycle show JSON;
source candidate after closure;
unresolved-after output;
quality summary;
candidate branch diff;
README.md with the invariant summary.

Example from a successful local run:

out/checkins/true-improvement-loop-20260514-110100/

That run selected benchmark_pressure, ran the hard pack, improved the score from 0.786667 to 1.0, passed the scientific promotion gate with a positive lower confidence bound, closed the source candidate as fixed, and left no unresolved benchmark pressure for that lane in the isolated run state.

How To Interpret Outcomes

Outcome	Meaning
`success`	Source pressure fixed and promotion gate passed.
`partial`	Mutation ran, but source is still open or gate did not pass.
`no_effect`	No measurable gain or no pressure reduction.
`invalid`	Evidence was not executable.
`blocked`	Safety, replay, quality, or promotion gate blocked the change.
`failed`	Runtime or infrastructure error.

Treat partial as useful evidence, not success. A kept run with an underpowered gate is a candidate for review; it is not a closed loop.

Implementation Map

Area	Files
Environment contracts	`brain/environments/models.py`, `brain/environments/base.py`
Built-in environments	`brain/environments/fixtures.py`, `brain/environments/workspace_env.py`
Manifests	`brain/environments/manifest.py`
Episode store	`brain/environments/store.py`
Replay	`brain/environments/replay.py`
Export	`brain/environments/export.py`
Serve routes	`brain/serve/routers/environments.py`
Environment to autotune bridge	`brain/autotune/environment_bridge.py`
Planner	`brain/autotune/self_improve.py`
Orchestrator	`brain/autotune/orchestrator.py`
Runner and worktrees	`brain/autotune/runner.py`, `brain/autotune/worktrees.py`, `brain/autotune/gitops.py`
Promotion gate	`brain/autotune/promotion_gate.py`
Scientific evidence	`brain/autotune/scientific_evidence.py`
Scientific experiments	`brain/autotune/scientific_experiments.py`
Rewards	`brain/autotune/self_improvement_rewards.py`
Cycle lineage	`brain/autotune/cycle_lineage.py`
Replay execution	`brain/quality/replay_execution.py`
Quality projection	`brain/quality/service.py`, `brain/quality/self_improvement_cycles.py`

Common Failure Modes

Failure	Diagnosis	Fix
`git working tree must be clean before autotune runs`	Autotune refuses to mutate from a dirty checkout.	Commit, stash, or move unrelated edits before treatment.
Worktree creation denied	Sandbox or filesystem blocked `.git/worktrees`.	Run in an environment that allows git worktree metadata.
`explain-plan` selects the wrong target	Earlier attempts created higher-priority stall or goal pressure.	Close that pressure, or use a fresh isolated state for a demonstration.
Kept run does not promote	Scientific evidence or replay obligation blocked the promotion.	Add enough paired executable cases, rerun required replay cases, or route to human review.
Candidate remains unresolved	Source closure did not run because gate failed or no source ids were covered.	Inspect `target_execution_plan`, `covered_candidate_ids`, and gate metadata.
Replay case is not runnable	No adapter can execute the replay or assertions are incomplete.	Add expected properties, expected refs, or environment episode metadata.

Build A New Environment

Use this checklist when adding an environment:

Define an EnvironmentSpec.
Implement BaseEnvironment._reset and _step.
Keep actions structured and serializable.
Return observations with stable state values.
Emit reward components with names and reasons.
Terminate or truncate deterministically.
Persist reset options needed for replay.
Add replay comparison coverage.
Add export coverage if trainer ingestion matters.
Decide whether weak episodes should route to an existing autotune lane or a new dedicated lane.

The environment is ready for the improvement loop only when a stored failure can be replayed and converted into executable benchmark or replay evidence.
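
The checklist can be illustrated with a toy environment. The sketch below uses a stand-in `BaseEnvironment` whose public `reset` and `step` delegate to `_reset` and `_step`; the real base class in brain/environments/base.py is not shown in this guide, so its signatures, the `Step` type, and the `increment` and `submit` action names here are assumptions. `ToyCounterEnv` is not the built-in `counter` environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    observation: dict
    reward: dict       # named reward components, each with a value and a reason
    terminated: bool
    truncated: bool


class BaseEnvironment:
    """Stand-in base class: public reset/step delegate to _reset/_step."""
    max_steps = 5

    def reset(self, **options) -> dict:
        self.steps_taken = 0
        self.reset_options = dict(options)  # kept so the episode can be replayed
        return self._reset(**options)

    def step(self, action: dict) -> Step:
        self.steps_taken += 1
        result = self._step(action)
        if not result.terminated and self.steps_taken >= self.max_steps:
            # Deterministic truncation once the step budget is spent.
            return Step(result.observation, result.reward, False, True)
        return result

    def _reset(self, **options) -> dict:
        raise NotImplementedError

    def _step(self, action: dict) -> Step:
        raise NotImplementedError


class ToyCounterEnv(BaseEnvironment):
    def _reset(self, target: int = 3) -> dict:
        self.target = target
        self.value = 0
        return {"value": self.value, "target": self.target}

    def _step(self, action: dict) -> Step:
        if action["type"] == "increment":
            self.value += 1
            return Step({"value": self.value, "target": self.target}, {}, False, False)
        if action["type"] == "submit":
            hit = self.value == self.target
            reward = {"reached_target": {
                "value": 1.0 if hit else 0.0,
                "reason": f"value={self.value}, target={self.target}",
            }}
            return Step({"value": self.value, "target": self.target}, reward, True, False)
        raise ValueError(f"unknown action type: {action['type']}")


def rollout(env: BaseEnvironment, options: dict, actions: list) -> list:
    """Reset with the stored options, then apply the stored actions in order."""
    env.reset(**options)
    return [env.step(a) for a in actions]


actions = [{"type": "increment"}] * 3 + [{"type": "submit"}]
first = rollout(ToyCounterEnv(), {"target": 3}, actions)
second = rollout(ToyCounterEnv(), {"target": 3}, actions)  # replay of the same episode
```

Because reset options and actions fully determine the trajectory, the two rollouts are equal step for step, which is the property replay comparison coverage is meant to protect.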