
Environment And Improvement Loop User Guide

SecondBrain's improvement loop is a local, auditable harness for turning real failures into measured changes. It combines four planes:

  • Environments: replayable task episodes with reset, action, observation, reward, persistence, export, and replay.
  • Quality: a control plane that summarizes runtime health, environment suites, replay cases, benchmark pressure, and self-improvement experiments.
  • Autotune: bounded lane-specific mutation and evaluation with worktrees, benchmark packs, paired-case evidence, and promotion gates.
  • Antahkarana: the cognitive loop that records goals, regret, strategy priors, cycle outcomes, and closure.

The design follows the same core shape used by modern RL environment guides: define a task, reset a bounded world, expose a small action space, return structured observations, score rewards, stop episodes deterministically, and keep rollouts replayable. SecondBrain applies that shape to local agent work rather than online training: every episode, replay, candidate, run, gate, and closure decision is stored as inspectable evidence.

Mental Model

The loop is not "run a model and hope it improves." It is a scientific control system:

flowchart LR
  A["Runtime or environment failure"] --> B["Replayable evidence"]
  B --> C["Quality replay case or benchmark candidate"]
  C --> D["Improvement target"]
  D --> E["Target execution plan"]
  E --> F["Autotune baseline on required pack"]
  F --> G["Bounded mutation in a git worktree"]
  G --> H["Treatment on same evidence"]
  H --> I["Promotion gate"]
  I --> J["Source closure"]
  J --> K["Cycle lineage and quality summary"]

Every successful cycle must answer:

| Invariant | What SecondBrain records |
| --- | --- |
| Hypothesis | The pressure being fixed, usually from a candidate, replay case, environment task, regret, or stalled cycle. |
| Source evidence | Candidate ids, replay case ids, environment episode ids, or cycle ids. |
| Control measurement | Baseline score on the required pack or replay before mutation. |
| Intervention | Mutation strategy, files touched, candidate branch, model/provider if any. |
| Treatment measurement | Candidate score on the same evidence. |
| Confirmation | Paired-case scientific evidence, replay obligations, and promotion gate metadata. |
| Closure | fixed, still_failing, invalid, superseded, or escalated for review. |

What Counts As An Environment

An environment is a bounded local task executor. The contract lives in brain/environments/models.py and brain/environments/base.py.

| Contract | Purpose |
| --- | --- |
| EnvironmentSpec | Static environment metadata: id, action types, observation types, max steps, default task. |
| TaskSpec | Goal, success conditions, max steps, and initial values for one task. |
| Action | One structured action submitted to the environment. |
| Observation | Structured state returned by reset or step. |
| RewardResult | Total and normalized reward plus named components. |
| StepResult | Result of applying one action, including reward and terminal flags. |
| EpisodeRecord | Persisted envelope for reset observation, steps, final state, reward, and metadata. |
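To make the contract concrete, here is a minimal sketch of how a few of these records might look as Python dataclasses. The field names (`type`, `payload`, `total`, `normalized`, `components`, `terminated`, `truncated`) are assumptions inferred from the table above, not the actual definitions in brain/environments/models.py.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Action:
    # One structured action; `type` and `payload` mirror the manifest's action entries.
    type: str
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Observation:
    # Structured state returned by reset or step (hypothetical single field).
    state: dict[str, Any]


@dataclass(frozen=True)
class RewardResult:
    # Total and normalized reward plus named components (hypothetical field names).
    total: float
    normalized: float
    components: dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class StepResult:
    # Result of applying one action, including reward and terminal flags.
    observation: Observation
    reward: RewardResult
    terminated: bool = False
    truncated: bool = False


step = StepResult(
    observation=Observation(state={"value": 3}),
    reward=RewardResult(total=1.0, normalized=1.0, components={"reached_target": 1.0}),
    terminated=True,
)
```

Keeping every record a plain, serializable structure is what lets an episode be persisted, exported, and replayed later.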

Current built-in environments:

| Environment | Use |
| --- | --- |
| counter | Pure deterministic fixture for lifecycle, reward, persistence, export, and replay checks. |
| workspace | Root-bounded file-task environment backed by SessionEnv, useful for small coding or file-verifier tasks. |

The important boundary: environments are not the production MCP/tool plane. They are the replayable measurement plane. Production tool execution stays in the agent runtime; environment episodes preserve trajectory and reward evidence for evaluation and improvement.

Run Environment Episodes

List available environments:

sb env list --json

Run a simple counter episode:

sb env run counter --target 3 --json

Run a workspace verifier episode:

sb env run workspace \
  --target-path answer.txt \
  --answer "ready" \
  --json

Run from a task manifest:

sb env run --task-file examples/env-task.yaml --json

List and inspect persisted episodes:

sb env episodes --env-id workspace --json
sb env show <episode_id> --json

Replay a stored episode against current environment code:

sb env replay <episode_id> --json

Export a trajectory for replay, analysis, or trainer ingestion:

sb env export <episode_id> --format openenv-json --output rollout.json
sb env export <episode_id> --format steps-jsonl --output rollout.steps.jsonl

The export path is data interchange. Replay is the regression check. If replay diverges on reward, terminal status, step count, or stable state signature, the environment implementation changed in a way that needs review.
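The divergence check described above can be sketched as a field-by-field comparison between the stored episode and its replay. This is an illustration only: the dictionary keys and the SHA-256 state signature are assumptions, not the logic in brain/environments/replay.py.

```python
import hashlib
import json


def state_signature(state: dict) -> str:
    """Hash a state dict with sorted keys so key order cannot cause false divergence."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_divergences(stored: dict, replayed: dict) -> list[str]:
    """Return the names of the fields on which the replay differs from the stored episode."""
    diffs = []
    for key in ("normalized_reward", "terminated", "step_count"):
        if stored[key] != replayed[key]:
            diffs.append(key)
    if state_signature(stored["final_state"]) != state_signature(replayed["final_state"]):
        diffs.append("state_signature")
    return diffs


stored = {
    "normalized_reward": 1.0,
    "terminated": True,
    "step_count": 3,
    "final_state": {"value": 3},
}
same = dict(stored, final_state={"value": 3})
drifted = dict(stored, normalized_reward=0.5, final_state={"value": 2})

print(replay_divergences(stored, same))     # no divergence
print(replay_divergences(stored, drifted))  # reward and state signature differ
```

An empty list means the current environment code reproduces the stored trajectory; any other result is a signal to review what changed.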

Task Manifests

Task manifests are the preferred way to make an environment task repeatable. They are JSON or YAML files loaded by sb env run --task-file.

schema_version: secondbrain.environment_task.v1
env_id: workspace
task:
  task_id: workspace.write_answer
  goal: Write the expected answer file.
  success_conditions:
    - answer.txt equals expected_text
reset_options:
  target_path: answer.txt
  expected_text: "ready\n"
verifiers:
  - type: file_exists
    name: answer_exists
    path: answer.txt
  - type: file_equals
    name: answer_exact
    path: answer.txt
    expected_text: "ready\n"
actions:
  - type: write_file
    payload:
      path: answer.txt
      content: "ready\n"
  - type: submit
metadata:
  suite: local-fixture
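Because manifests may also be JSON, the same task can be expressed and sanity-checked with the standard library alone. The sketch below is not the loader in brain/environments/manifest.py; the `manifest_problems` helper and its required-key list are hypothetical.

```python
import json

EXPECTED_SCHEMA = "secondbrain.environment_task.v1"
REQUIRED_KEYS = ("schema_version", "env_id", "task")  # assumed minimum set


def manifest_problems(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the basic shape looks fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]
    schema = manifest.get("schema_version")
    if schema is not None and schema != EXPECTED_SCHEMA:
        problems.append(f"unexpected schema_version: {schema}")
    return problems


# Raw string so the JSON escape \n reaches json.loads unchanged.
MANIFEST_JSON = r"""
{
  "schema_version": "secondbrain.environment_task.v1",
  "env_id": "workspace",
  "task": {"task_id": "workspace.write_answer", "goal": "Write the expected answer file."},
  "reset_options": {"target_path": "answer.txt", "expected_text": "ready\n"},
  "actions": [
    {"type": "write_file", "payload": {"path": "answer.txt", "content": "ready\n"}},
    {"type": "submit"}
  ]
}
"""

manifest = json.loads(MANIFEST_JSON)
print(manifest_problems(manifest))
```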

Workspace verifiers are declarative and do not run shell commands:

  • file_exists
  • file_equals
  • file_contains
  • file_matches_regex

Each verifier emits one reward component. The normalized reward is the weighted pass score divided by total verifier weight. This keeps task success inspectable instead of hiding it behind a single opaque scalar.
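As a worked example of that formula, the sketch below computes the normalized reward from per-verifier outcomes. The `VerifierOutcome` type and the per-verifier `weight` field are assumptions for illustration; the manifest above does not show how weights are declared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierOutcome:
    name: str            # becomes the reward component name
    passed: bool
    weight: float = 1.0  # hypothetical per-verifier weight


def normalized_reward(outcomes: list[VerifierOutcome]) -> float:
    """Weighted pass score divided by total verifier weight."""
    total_weight = sum(o.weight for o in outcomes)
    if total_weight == 0:
        return 0.0
    passed_weight = sum(o.weight for o in outcomes if o.passed)
    return passed_weight / total_weight


outcomes = [
    VerifierOutcome("answer_exists", passed=True),
    VerifierOutcome("answer_exact", passed=False, weight=3.0),
]
# One named component per verifier keeps the score inspectable.
components = {o.name: (o.weight if o.passed else 0.0) for o in outcomes}
score = normalized_reward(outcomes)  # 1.0 / 4.0 = 0.25
```

Here the file exists but its content is wrong, so the episode earns a quarter of the available reward, and the components show exactly which check failed.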

From Environment Failure To Improvement Pressure

Low-reward or incomplete environment task groups are not dead ends. They feed the improvement loop through two bridges:

  1. SelfImprovementPlanner scans environment_episodes in work.db.
  2. EnvironmentReplayBridge turns weak task groups into:
     • an autotune idea, and
     • a BenchmarkCandidate with source_type="environment_episode".

Environment pressure includes:

  • env_id
  • task_id
  • latest failed or weak episode_id
  • average normalized reward
  • completion count
  • target lane
  • suggested mutation strategy

The current mapping routes counter and workspace to repl_prompt because their failures usually indicate prompt or policy behavior around structured actions, submit timing, and direct answers. Dedicated environment-specific lanes can be added later when they have deterministic evaluators and mutation surfaces.
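The aggregation behind environment pressure can be sketched as a group-by over stored episodes. This is a simplified illustration, not the code in brain/autotune/environment_bridge.py: the episode dictionary keys, the `weak_below` threshold, and the `counter.reach_target` task id are assumptions, and episodes are assumed to be in chronological order.

```python
from collections import defaultdict

# Lane routing taken from the mapping described above.
LANE_BY_ENV = {"counter": "repl_prompt", "workspace": "repl_prompt"}


def environment_pressure(episodes: list[dict], weak_below: float = 1.0) -> list[dict]:
    """Group episodes by (env_id, task_id) and report groups whose average reward is weak."""
    groups = defaultdict(list)
    for episode in episodes:
        groups[(episode["env_id"], episode["task_id"])].append(episode)

    pressure = []
    for (env_id, task_id), group in sorted(groups.items()):
        average = sum(e["normalized_reward"] for e in group) / len(group)
        if average >= weak_below:
            continue  # healthy task group, no pressure
        weak = [e for e in group if e["normalized_reward"] < weak_below]
        pressure.append({
            "env_id": env_id,
            "task_id": task_id,
            "latest_weak_episode_id": weak[-1]["episode_id"],
            "average_normalized_reward": average,
            "completion_count": sum(1 for e in group if e["completed"]),
            "target_lane": LANE_BY_ENV.get(env_id),
        })
    return pressure


episodes = [
    {"episode_id": "ep-1", "env_id": "workspace", "task_id": "workspace.write_answer",
     "normalized_reward": 0.5, "completed": False},
    {"episode_id": "ep-2", "env_id": "workspace", "task_id": "workspace.write_answer",
     "normalized_reward": 1.0, "completed": True},
    {"episode_id": "ep-3", "env_id": "counter", "task_id": "counter.reach_target",
     "normalized_reward": 1.0, "completed": True},
]
pressure = environment_pressure(episodes)
```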

Quality Replay Cases

Quality replay cases are the bridge between observed failures and executable evidence:

sb quality replay-cases --json
sb quality run-replay <case-id> --json
sb quality replay-results --json
sb quality promote-replay <case-id> --json

Replay execution uses adapter contracts in brain/quality/replay_execution.py:

| Adapter | Source |
| --- | --- |
| runtime_prompt | Prompt-backed quality cases linked to benchmark cases. |
| retrieval_replay | Retrieval cases with expected references. |
| environment_replay | Environment episodes replayed through environment fixtures. |

Replay results are persisted as measured pass/fail evidence. Promotion gates can block a candidate when a required replay case was not rerun, did not pass, or belongs to an unresolved failed suite.

Benchmark Candidates And Packs

Benchmark candidates are pressure sources. They are not trusted as evidence until they carry executable expectations.

Common workflow:

sb autotune benchmark unresolved --lane repl_prompt --json
sb autotune benchmark show <candidate-id> --json
sb autotune benchmark enrich <candidate-id> ...
sb autotune benchmark promote <candidate-id> --pack recent_failures
sb autotune benchmark report repl_prompt --json

Important fields:

| Field | Meaning |
| --- | --- |
| source_type | Where the pressure came from: replay case, environment episode, runtime event, etc. |
| source_ref | Pointer to the replay case id, episode id, trace id, or artifact. |
| expected_properties | Executable behavior contract for the lane evaluator. |
| output_schema | Optional schema contract such as json_object. |
| severity | Escalates pack selection and review pressure. |
| pack_suggestions | Hints for recent_failures, hard, or regression packs. |
| resolution_status | Lifecycle status: open, fixed, still_failing, invalid, or superseded. |

Pack selection is target-bound:

  • low/default evidence can run smoke;
  • recent real failures run recent_failures;
  • high-severity evidence escalates to hard;
  • broad or risky changes can escalate to regression.

The source evidence determines the pack. A run cannot claim improvement by passing an unrelated smoke pack when the pressure came from a hard failure.
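The selection rules above can be read as a small decision function. The sketch below is an assumption-laden illustration: the function names, the `"high"` severity label, the precedence order, and the strict equality check are all hypothetical simplifications of whatever the planner actually does.

```python
def required_pack(severity: str, from_recent_failure: bool, broad_change: bool = False) -> str:
    """Pick the pack that the source evidence demands (assumed precedence order)."""
    if broad_change:
        return "regression"       # broad or risky changes
    if severity == "high":
        return "hard"             # high-severity evidence
    if from_recent_failure:
        return "recent_failures"  # recent real failures
    return "smoke"                # low/default evidence


def can_claim_improvement(ran_pack: str, needed_pack: str) -> bool:
    """A run only counts if it was measured on the pack its evidence required."""
    return ran_pack == needed_pack


needed = required_pack(severity="high", from_recent_failure=True)
print(needed, can_claim_improvement("smoke", needed))
```

The second function captures the rule in the paragraph above: passing smoke does not count when the pressure came from a hard failure.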

Autotune Lanes

An autotune lane is a bounded mutation/evaluation contract. Lane specs live in brain/autotune/specs/.

For example, repl_prompt defines:

  • mutable prompt file: brain/prompts/specs/agents/agent.profile.default.v1.yaml
  • frozen implementation paths
  • evaluator: prompt.repl
  • benchmark packs: smoke, core, recent_failures, hard, regression
  • acceptance thresholds
  • guard metrics
  • confirmation policy

Run a standalone benchmark:

sb autotune bench repl_prompt --pack hard --json

Run one bounded lane attempt:

sb autotune run repl_prompt --pack hard --ignore-pause --json

Run the full self-improvement loop:

sb autotune improve --lane repl_prompt --max-lanes 1 --pack smoke --explain-plan --json
sb autotune improve --lane repl_prompt --max-lanes 1 --pack smoke --ignore-pause --json

Running with the --explain-plan flag is the safest first step. It shows:

  • selected target;
  • target key;
  • required pack;
  • confirmation pack;
  • source candidate ids;
  • replay case ids;
  • environment episode ids;
  • unsupported candidate properties;
  • force-promoted candidates that need explicit review.

True Agent Harness Improvement Loop

The full loop is implemented by SelfImprovementOrchestrator in brain/autotune/orchestrator.py.

One cycle does this:

  1. Diagnose pressure with SelfImprovementPlanner.
  2. Seed replay/environment/LLM ideas where appropriate.
  3. Register Sankalpa goals for selected targets.
  4. Advance meta-learning from prior autotune history.
  5. Build a target execution plan for the selected target.
  6. Run source replay obligations.
  7. Run autotune against the required evidence-bound pack.
  8. Persist run evidence, scientific evidence, promotion gate metadata, and cycle lineage.
  9. Update source closure: mark the source fixed when promotion succeeds, or record the recurrence when attempts keep failing.
  10. Record Antahkarana cycle outcome and quality projection.

Target sources include:

  • Karma regrets;
  • autotune failure rates;
  • grounded trajectory scores;
  • environment scores;
  • Sankalpa goal gaps;
  • stalled self-improvement cycles;
  • benchmark pressure.

The loop can improve itself. Repeated partial, no-effect, blocked, or failed cycles become cycle_stall targets so the planner can diagnose failure of the improvement process itself.
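One simple way to picture stall detection is a check over the most recent cycle outcomes for a target. This is a sketch only: the `is_stalled` helper and the threshold of three consecutive non-success cycles are assumptions, not the planner's actual rule.

```python
# Outcomes that count toward a stall, as listed in the paragraph above.
STALL_OUTCOMES = {"partial", "no_effect", "blocked", "failed"}


def is_stalled(recent_outcomes: list[str], threshold: int = 3) -> bool:
    """True when the last `threshold` cycles all ended without success."""
    tail = recent_outcomes[-threshold:]
    return len(tail) == threshold and all(outcome in STALL_OUTCOMES for outcome in tail)


print(is_stalled(["success", "partial", "no_effect", "blocked"]))  # three misses in a row
print(is_stalled(["partial", "success", "failed"]))                # a success breaks the run
```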

Scientific Promotion Gate

Autotune does not promote just because a run was "kept." Promotion requires evidence.

brain/autotune/scientific_evidence.py builds paired-case evidence from baseline and candidate evaluations:

  • baseline primary metric;
  • candidate primary metric;
  • primary delta in the lane's metric direction;
  • paired case count;
  • case wins/losses/ties;
  • mean case delta;
  • standard error;
  • 95 percent lower confidence bound;
  • promotion readiness and blockers.
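The statistics in this list can be sketched with the standard library. This is not the implementation in brain/autotune/scientific_evidence.py: the function name, the returned keys, and the one-sided normal approximation (z = 1.645) for the 95 percent lower bound are assumptions.

```python
import math
from statistics import mean, stdev


def paired_evidence(baseline: list[float], candidate: list[float],
                    higher_is_better: bool = True) -> dict:
    """Compare per-case scores measured on the same cases before and after a mutation."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must cover the same non-empty case list")
    sign = 1.0 if higher_is_better else -1.0
    # Deltas are oriented so that positive always means "candidate is better".
    deltas = [sign * (c - b) for b, c in zip(baseline, candidate)]
    n = len(deltas)
    mean_delta = mean(deltas)
    # A single case gives no variance estimate, so treat the error as unbounded.
    std_error = stdev(deltas) / math.sqrt(n) if n > 1 else float("inf")
    return {
        "paired_case_count": n,
        "wins": sum(1 for d in deltas if d > 0),
        "losses": sum(1 for d in deltas if d < 0),
        "ties": sum(1 for d in deltas if d == 0),
        "mean_delta": mean_delta,
        "std_error": std_error,
        "lower_bound": mean_delta - 1.645 * std_error,  # assumed one-sided 95% bound
    }


evidence = paired_evidence([0.6, 0.7, 0.8, 0.9], [0.9, 0.9, 1.0, 1.0])
single = paired_evidence([0.5], [0.6])
```

With four consistent wins the lower bound stays above zero, while a single paired case yields an unbounded standard error and therefore cannot demonstrate a positive bound.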

The promotion gate can block when:

  • the run was not kept;
  • baseline or candidate metric is missing;
  • guard metrics regressed;
  • confirmation failed;
  • paired evidence is underpowered;
  • lower confidence bound is not positive;
  • source replay obligations are missing or failed;
  • touched lane has unresolved failed replay suites;
  • candidate properties are unsupported;
  • force-promoted candidates lack explicit review.

This is why a tiny improvement on one case may be kept for review but not promoted automatically. Scientific promotion needs enough paired evidence to avoid treating noise as progress.
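A gate of this kind can be sketched as a function that collects blockers rather than returning a bare boolean, so the reason for a block stays inspectable. The dictionary keys, blocker names, and the `min_paired_cases` threshold below are hypothetical, and only a subset of the conditions listed above is shown.

```python
def promotion_blockers(run: dict, min_paired_cases: int = 5) -> list[str]:
    """Return every reason this run may not be promoted; an empty list means it can be."""
    blockers = []
    if not run.get("kept"):
        blockers.append("run_not_kept")
    if run.get("baseline_metric") is None or run.get("candidate_metric") is None:
        blockers.append("missing_metric")
    if run.get("guard_regressions"):
        blockers.append("guard_regressed")
    if run.get("paired_case_count", 0) < min_paired_cases:
        blockers.append("underpowered")
    if run.get("lower_bound", float("-inf")) <= 0:
        blockers.append("lower_bound_not_positive")
    if run.get("failed_replay_case_ids"):
        blockers.append("replay_obligation_failed")
    return blockers


# A kept run that improved on a single case: reviewable, but not promotable.
tiny_win = {
    "kept": True,
    "baseline_metric": 0.5,
    "candidate_metric": 0.6,
    "paired_case_count": 1,
    "lower_bound": float("-inf"),
}
print(promotion_blockers(tiny_win))
```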

Source Closure

After a promoted run clears the gate, the source pressure is closed. Confirm the closure with:

sb autotune benchmark show <candidate-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json

Closure fields:

  • resolved_by_run_id
  • resolved_at
  • resolution_status
  • resolution_evidence

Fixed candidates stop appearing as active benchmark_pressure. Failed attempts increment recurrence metadata instead of disappearing. This gives the next diagnosis pass a durable memory of what was tried.
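The closure behavior can be sketched as a pure update on a candidate record. The closure field names come from the list above; the `recurrence_count` field and both helper functions are assumptions for illustration.

```python
def apply_closure(candidate: dict, run_id: str, promoted: bool, resolved_at: str) -> dict:
    """Return an updated copy of the candidate after one improvement attempt."""
    updated = dict(candidate)
    if promoted:
        updated.update(
            resolution_status="fixed",
            resolved_by_run_id=run_id,
            resolved_at=resolved_at,
        )
    else:
        # Failed attempts stay visible and accumulate history (hypothetical field).
        updated["resolution_status"] = "still_failing"
        updated["recurrence_count"] = candidate.get("recurrence_count", 0) + 1
    return updated


def active_pressure(candidates: list[dict]) -> list[dict]:
    """Candidates that should still appear as benchmark pressure."""
    return [c for c in candidates if c.get("resolution_status") in ("open", "still_failing")]


open_candidate = {"candidate_id": "cand-1", "resolution_status": "open"}
failed_once = apply_closure(open_candidate, "run-1", promoted=False, resolved_at="")
fixed = apply_closure(failed_once, "run-2", promoted=True, resolved_at="2026-05-14T11:01:00Z")
```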

Operator Runbook

Use this runbook when you want one falsifiable improvement loop.

1. Start from a clean tree

Autotune creates git worktrees. The run guard requires a clean repo:

git status --short --branch

Commit or stash unrelated edits before running a treatment cycle.

2. Inspect current pressure

sb quality summary --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb quality replay-cases --json

3. Run a control benchmark

sb autotune bench repl_prompt --pack hard --json

Save the result when demonstrating a loop:

sb autotune bench repl_prompt --pack hard --json \
  > out/checkins/<name>/control-hard-benchmark.json

4. Explain the target-bound plan

sb autotune improve \
  --lane repl_prompt \
  --max-lanes 1 \
  --pack smoke \
  --explain-plan \
  --json

Do not continue if the selected target is not the evidence you intend to test. If an earlier failed attempt created cycle_stall, sankalpa_gap, or autotune_failures pressure, use a fresh state for a demonstration or close the higher-priority pressure first.

5. Run treatment

sb autotune improve \
  --lane repl_prompt \
  --max-lanes 1 \
  --pack smoke \
  --ignore-pause \
  --json

The run may create candidate branches such as:

autotune/repl_prompt/cand-<run-id>

6. Inspect cycle and closure

sb autotune cycle show <cycle-id> --json
sb autotune benchmark show <candidate-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb autotune report repl_prompt --last 4 --json

7. Capture a checkin

A good checkin folder contains:

  • seeded source evidence;
  • explain-plan output;
  • control benchmark;
  • treatment cycle JSON;
  • cycle show JSON;
  • source candidate after closure;
  • unresolved-after output;
  • quality summary;
  • candidate branch diff;
  • README.md with the invariant summary.

Example from a successful local run:

out/checkins/true-improvement-loop-20260514-110100/

That run selected benchmark_pressure, ran the hard pack, improved the score from 0.786667 to 1.0, passed the scientific promotion gate with a positive lower confidence bound, closed the source candidate as fixed, and left no unresolved benchmark pressure for that lane in the isolated run state.

How To Interpret Outcomes

| Outcome | Meaning |
| --- | --- |
| success | Source pressure fixed and promotion gate passed. |
| partial | Mutation ran, but source is still open or gate did not pass. |
| no_effect | No measurable gain or no pressure reduction. |
| invalid | Evidence was not executable. |
| blocked | Safety, replay, quality, or promotion gate blocked the change. |
| failed | Runtime or infrastructure error. |

Treat partial as useful evidence, not success. A kept run with an underpowered gate is a candidate for review; it is not a closed loop.

Implementation Map

| Area | Files |
| --- | --- |
| Environment contracts | brain/environments/models.py, brain/environments/base.py |
| Built-in environments | brain/environments/fixtures.py, brain/environments/workspace_env.py |
| Manifests | brain/environments/manifest.py |
| Episode store | brain/environments/store.py |
| Replay | brain/environments/replay.py |
| Export | brain/environments/export.py |
| Serve routes | brain/serve/routers/environments.py |
| Environment to autotune bridge | brain/autotune/environment_bridge.py |
| Planner | brain/autotune/self_improve.py |
| Orchestrator | brain/autotune/orchestrator.py |
| Runner and worktrees | brain/autotune/runner.py, brain/autotune/worktrees.py, brain/autotune/gitops.py |
| Promotion gate | brain/autotune/promotion_gate.py |
| Scientific evidence | brain/autotune/scientific_evidence.py |
| Scientific experiments | brain/autotune/scientific_experiments.py |
| Rewards | brain/autotune/self_improvement_rewards.py |
| Cycle lineage | brain/autotune/cycle_lineage.py |
| Replay execution | brain/quality/replay_execution.py |
| Quality projection | brain/quality/service.py, brain/quality/self_improvement_cycles.py |

Common Failure Modes

| Failure | Diagnosis | Fix |
| --- | --- | --- |
| git working tree must be clean before autotune runs | Autotune refuses to mutate from a dirty checkout. | Commit, stash, or move unrelated edits before treatment. |
| Worktree creation denied | Sandbox or filesystem blocked .git/worktrees. | Run in an environment that allows git worktree metadata. |
| explain-plan selects the wrong target | Earlier attempts created higher-priority stall or goal pressure. | Close that pressure, or use a fresh isolated state for a demonstration. |
| Kept run does not promote | Scientific evidence or replay obligation blocked the promotion. | Add enough paired executable cases, rerun required replay cases, or route to human review. |
| Candidate remains unresolved | Source closure did not run because gate failed or no source ids were covered. | Inspect target_execution_plan, covered_candidate_ids, and gate metadata. |
| Replay case is not runnable | No adapter can execute the replay or assertions are incomplete. | Add expected properties, expected refs, or environment episode metadata. |

Build A New Environment

Use this checklist when adding an environment:

  1. Define an EnvironmentSpec.
  2. Implement BaseEnvironment._reset and _step.
  3. Keep actions structured and serializable.
  4. Return observations with stable state values.
  5. Emit reward components with names and reasons.
  6. Terminate or truncate deterministically.
  7. Persist reset options needed for replay.
  8. Add replay comparison coverage.
  9. Add export coverage if trainer ingestion matters.
  10. Decide whether weak episodes should route to an existing autotune lane or a new dedicated lane.

The environment is ready for the improvement loop only when a stored failure can be replayed and converted into executable benchmark or replay evidence.
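To illustrate the checklist, here is a standalone toy environment with deterministic termination and truncation. It deliberately does not subclass BaseEnvironment, whose exact hook signatures are not shown in this guide; the class name, the `increment` action, and the returned dictionary shape are all hypothetical.

```python
class TinyCounterEnv:
    """Toy environment: increment a counter until it reaches the target."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps

    def reset(self, target: int) -> dict:
        self.target, self.value, self.steps = target, 0, 0
        # Keep the reset options so a replay can rebuild the same episode.
        self.reset_options = {"target": target}
        return {"value": self.value, "target": self.target}

    def step(self, action: dict) -> dict:
        if action["type"] != "increment":
            raise ValueError(f"unsupported action type: {action['type']}")
        self.value += 1
        self.steps += 1
        terminated = self.value == self.target
        truncated = not terminated and self.steps >= self.max_steps
        return {
            "observation": {"value": self.value, "target": self.target},
            "reward": {"reached_target": 1.0 if terminated else 0.0},  # named component
            "terminated": terminated,
            "truncated": truncated,
        }


def rollout(target: int, max_steps: int = 5) -> list[dict]:
    """Run one episode with a fixed policy and return every step result."""
    env = TinyCounterEnv(max_steps)
    env.reset(target)
    results = []
    while True:
        result = env.step({"type": "increment"})
        results.append(result)
        if result["terminated"] or result["truncated"]:
            return results
```

Because nothing in the environment depends on wall-clock time or randomness, two rollouts with the same reset options produce identical step results, which is the property replay comparison relies on.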

External Reading