Environment And Improvement Loop User Guide
SecondBrain's improvement loop is a local, auditable harness for turning real failures into measured changes. It combines four planes:
- Environments: replayable task episodes with reset, action, observation, reward, persistence, export, and replay.
- Quality: a control plane that summarizes runtime health, environment suites, replay cases, benchmark pressure, and self-improvement experiments.
- Autotune: bounded lane-specific mutation and evaluation with worktrees, benchmark packs, paired-case evidence, and promotion gates.
- Antahkarana: the cognitive loop that records goals, regret, strategy priors, cycle outcomes, and closure.
The design follows the same core shape used by modern RL environment guides: define a task, reset a bounded world, expose a small action space, return structured observations, score rewards, stop episodes deterministically, and keep rollouts replayable. SecondBrain applies that shape to local agent work rather than online training: every episode, replay, candidate, run, gate, and closure decision is stored as inspectable evidence.
Mental Model
The loop is not "run a model and hope it improves." It is a scientific control system:
flowchart LR
A["Runtime or environment failure"] --> B["Replayable evidence"]
B --> C["Quality replay case or benchmark candidate"]
C --> D["Improvement target"]
D --> E["Target execution plan"]
E --> F["Autotune baseline on required pack"]
F --> G["Bounded mutation in a git worktree"]
G --> H["Treatment on same evidence"]
H --> I["Promotion gate"]
I --> J["Source closure"]
J --> K["Cycle lineage and quality summary"]
Every successful cycle must answer:
| Invariant | What SecondBrain records |
|---|---|
| Hypothesis | The pressure being fixed, usually from a candidate, replay case, environment task, regret, or stalled cycle. |
| Source evidence | Candidate ids, replay case ids, environment episode ids, or cycle ids. |
| Control measurement | Baseline score on the required pack or replay before mutation. |
| Intervention | Mutation strategy, files touched, candidate branch, model/provider if any. |
| Treatment measurement | Candidate score on the same evidence. |
| Confirmation | Paired-case scientific evidence, replay obligations, and promotion gate metadata. |
| Closure | fixed, still_failing, invalid, superseded, or escalated for review. |
What Counts As An Environment
An environment is a bounded local task executor. The contract lives in
brain/environments/models.py and brain/environments/base.py.
| Contract | Purpose |
|---|---|
| EnvironmentSpec | Static environment metadata: id, action types, observation types, max steps, default task. |
| TaskSpec | Goal, success conditions, max steps, and initial values for one task. |
| Action | One structured action submitted to the environment. |
| Observation | Structured state returned by reset or step. |
| RewardResult | Total and normalized reward plus named components. |
| StepResult | Result of applying one action, including reward and terminal flags. |
| EpisodeRecord | Persisted envelope for reset observation, steps, final state, reward, and metadata. |
Current built-in environments:
| Environment | Use |
|---|---|
| counter | Pure deterministic fixture for lifecycle, reward, persistence, export, and replay checks. |
| workspace | Root-bounded file-task environment backed by SessionEnv, useful for small coding or file-verifier tasks. |
The important boundary: environments are not the production MCP/tool plane. They are the replayable measurement plane. Production tool execution stays in the agent runtime; environment episodes preserve trajectory and reward evidence for evaluation and improvement.
Run Environment Episodes
List available environments:
Run a simple counter episode:
Run a workspace verifier episode:
Run from a task manifest:
List and inspect persisted episodes:
Replay a stored episode against current environment code:
Export a trajectory for replay, analysis, or trainer ingestion:
sb env export <episode_id> --format openenv-json --output rollout.json
sb env export <episode_id> --format steps-jsonl --output rollout.steps.jsonl
The export path is data interchange. Replay is the regression check. If replay diverges on reward, terminal status, step count, or stable state signature, the environment implementation changed in a way that needs review.
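The four divergence checks named above can be sketched in a few lines. This is an illustration only: `EpisodeSummary`, `state_signature`, and `replay_divergences` are hypothetical names, and the real comparison in brain/environments/replay.py may use different fields, tolerances, or signature rules.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeSummary:
    """Hypothetical reduction of an episode to the fields a replay compares."""
    normalized_reward: float
    terminated: bool
    truncated: bool
    step_count: int
    final_state: dict


def state_signature(state: dict) -> str:
    """Hash a canonical JSON encoding so that key order does not matter."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_divergences(stored: EpisodeSummary, replayed: EpisodeSummary,
                       reward_tolerance: float = 1e-9) -> list[str]:
    """Return the names of every dimension on which the replay differs."""
    divergences = []
    if abs(stored.normalized_reward - replayed.normalized_reward) > reward_tolerance:
        divergences.append("reward")
    if (stored.terminated, stored.truncated) != (replayed.terminated, replayed.truncated):
        divergences.append("terminal_status")
    if stored.step_count != replayed.step_count:
        divergences.append("step_count")
    if state_signature(stored.final_state) != state_signature(replayed.final_state):
        divergences.append("state_signature")
    return divergences
```

An empty list means the replay reproduced the stored episode; any entry is a reason to review the environment change before trusting new measurements.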
Task Manifests
Task manifests are the preferred way to make an environment task repeatable.
They are JSON or YAML files loaded by sb env run --task-file.
schema_version: secondbrain.environment_task.v1
env_id: workspace
task:
task_id: workspace.write_answer
goal: Write the expected answer file.
success_conditions:
- answer.txt equals expected_text
reset_options:
target_path: answer.txt
expected_text: "ready\n"
verifiers:
- type: file_exists
name: answer_exists
path: answer.txt
- type: file_equals
name: answer_exact
path: answer.txt
expected_text: "ready\n"
actions:
- type: write_file
payload:
path: answer.txt
content: "ready\n"
- type: submit
metadata:
suite: local-fixture
Workspace verifiers are declarative and do not run shell commands:
- file_exists
- file_equals
- file_contains
- file_matches_regex
Each verifier emits one reward component. The normalized reward is the weighted pass score divided by total verifier weight. This keeps task success inspectable instead of hiding it behind a single opaque scalar.
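The normalization rule is simple enough to state as code. The sketch below assumes a hypothetical `VerifierResult` record with a default weight of 1.0; the actual reward types live in the environment contracts and may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierResult:
    """Hypothetical record of one verifier outcome (one reward component)."""
    name: str
    passed: bool
    weight: float = 1.0


def normalized_reward(results: list[VerifierResult]) -> float:
    """Weighted pass score divided by total verifier weight."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        return 0.0  # assumption: a task with no weighted verifiers scores zero
    passed_weight = sum(r.weight for r in results if r.passed)
    return passed_weight / total_weight


results = [
    VerifierResult("answer_exists", passed=True),
    VerifierResult("answer_exact", passed=False),
]
print(normalized_reward(results))  # 0.5
```

Because each component keeps its name, a score of 0.5 here can be traced directly to the failing answer_exact verifier.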
From Environment Failure To Improvement Pressure
Low-reward or incomplete environment task groups are not dead ends. They feed the improvement loop through two bridges:
- SelfImprovementPlanner scans environment_episodes in work.db.
- EnvironmentReplayBridge turns weak task groups into:
  - an autotune idea, and
  - a BenchmarkCandidate with source_type="environment_episode".
Environment pressure includes:
- env_id
- task_id
- latest failed or weak episode_id
- average normalized reward
- completion count
- target lane
- suggested mutation strategy
The current mapping routes counter and workspace to repl_prompt because
their failures usually indicate prompt or policy behavior around structured
actions, submit timing, and direct answers. Dedicated environment-specific
lanes can be added later when they have deterministic evaluators and mutation
surfaces.
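As a rough illustration of how the pressure fields listed above could be derived from stored episodes, the following sketch groups episodes by task and keeps only weak groups. The `EpisodeRow` shape, the `weak_below` threshold of 0.8, and the function name are assumptions for the example, not the planner's actual logic; the suggested mutation strategy is left out.

```python
from collections import defaultdict
from dataclasses import dataclass

# Mapping stated in the text: both built-in environments route to repl_prompt.
LANE_BY_ENV = {"counter": "repl_prompt", "workspace": "repl_prompt"}


@dataclass(frozen=True)
class EpisodeRow:
    """Hypothetical projection of one persisted episode."""
    episode_id: str
    env_id: str
    task_id: str
    normalized_reward: float
    completed: bool


@dataclass(frozen=True)
class EnvironmentPressure:
    env_id: str
    task_id: str
    latest_weak_episode_id: str
    average_normalized_reward: float
    completion_count: int
    target_lane: str


def summarize_pressure(rows, weak_below=0.8):
    """Group episodes (assumed oldest-first) by task and keep weak groups only."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row.env_id, row.task_id)].append(row)

    pressures = []
    for (env_id, task_id), episodes in groups.items():
        average = sum(e.normalized_reward for e in episodes) / len(episodes)
        completions = sum(1 for e in episodes if e.completed)
        if average >= weak_below and completions > 0:
            continue  # healthy group: no pressure emitted
        weak = [e for e in episodes
                if e.normalized_reward < weak_below or not e.completed]
        pressures.append(EnvironmentPressure(
            env_id=env_id,
            task_id=task_id,
            latest_weak_episode_id=weak[-1].episode_id,
            average_normalized_reward=average,
            completion_count=completions,
            target_lane=LANE_BY_ENV[env_id],  # unknown environments raise KeyError
        ))
    return pressures
```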
Quality Replay Cases
Quality replay cases are the bridge between observed failures and executable evidence:
sb quality replay-cases --json
sb quality run-replay <case-id> --json
sb quality replay-results --json
sb quality promote-replay <case-id> --json
Replay execution uses adapter contracts in brain/quality/replay_execution.py:
| Adapter | Source |
|---|---|
| runtime_prompt | Prompt-backed quality cases linked to benchmark cases. |
| retrieval_replay | Retrieval cases with expected references. |
| environment_replay | Environment episodes replayed through environment fixtures. |
Replay results are persisted as measured pass/fail evidence. Promotion gates can block a candidate when a required replay case was not rerun, did not pass, or belongs to an unresolved failed suite.
Benchmark Candidates And Packs
Benchmark candidates are pressure sources. They are not automatically trusted until they carry executable expectations.
Common workflow:
sb autotune benchmark unresolved --lane repl_prompt --json
sb autotune benchmark show <candidate-id> --json
sb autotune benchmark enrich <candidate-id> ...
sb autotune benchmark promote <candidate-id> --pack recent_failures
sb autotune benchmark report repl_prompt --json
Important fields:
| Field | Meaning |
|---|---|
| source_type | Where the pressure came from: replay case, environment episode, runtime event, etc. |
| source_ref | Pointer to the replay case id, episode id, trace id, or artifact. |
| expected_properties | Executable behavior contract for the lane evaluator. |
| output_schema | Optional schema contract such as json_object. |
| severity | Escalates pack selection and review pressure. |
| pack_suggestions | Hints for recent_failures, hard, or regression packs. |
| resolution_status | Lifecycle status: open, fixed, still_failing, invalid, or superseded. |
Pack selection is target-bound:
- low/default evidence can run smoke;
- recent real failures run recent_failures;
- high-severity evidence escalates to hard;
- broad or risky changes can escalate to regression.
The source evidence determines the pack. A run cannot claim improvement by passing an unrelated smoke pack when the pressure came from a hard failure.
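A minimal sketch of target-bound pack selection follows. The `SourceEvidence` fields, the severity values, and the precedence order (regression over hard over recent_failures over smoke) are assumptions made for illustration; the text only states which kind of evidence maps to which pack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceEvidence:
    """Hypothetical summary of the evidence behind an improvement target."""
    severity: str = "low"             # assumed values: "low", "medium", "high"
    from_recent_failure: bool = False
    broad_or_risky_change: bool = False


def required_pack(evidence: SourceEvidence) -> str:
    """Pick the strictest pack the evidence calls for (assumed precedence)."""
    if evidence.broad_or_risky_change:
        return "regression"
    if evidence.severity == "high":
        return "hard"
    if evidence.from_recent_failure:
        return "recent_failures"
    return "smoke"


def run_counts_as_evidence(evidence: SourceEvidence, pack_run: str) -> bool:
    """A run only counts when it used the pack the source evidence requires.

    Requiring an exact match is an assumption; a real gate might also
    accept a stricter pack.
    """
    return pack_run == required_pack(evidence)


hard_failure = SourceEvidence(severity="high", from_recent_failure=True)
print(required_pack(hard_failure))                    # hard
print(run_counts_as_evidence(hard_failure, "smoke"))  # False
```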
Autotune Lanes
An autotune lane is a bounded mutation/evaluation contract. Lane specs live in
brain/autotune/specs/.
For example, repl_prompt defines:
- mutable prompt file: brain/prompts/specs/agents/agent.profile.default.v1.yaml
- frozen implementation paths
- evaluator: prompt.repl
- benchmark packs: smoke, core, recent_failures, hard, regression
- acceptance thresholds
- guard metrics
- confirmation policy
Run a standalone benchmark:
Run one bounded lane attempt:
Run the full self-improvement loop:
sb autotune improve --lane repl_prompt --max-lanes 1 --pack smoke --explain-plan --json
sb autotune improve --lane repl_prompt --max-lanes 1 --pack smoke --ignore-pause --json
Running with the --explain-plan flag is the safest first step. It shows:
- selected target;
- target key;
- required pack;
- confirmation pack;
- source candidate ids;
- replay case ids;
- environment episode ids;
- unsupported candidate properties;
- force-promoted candidates that need explicit review.
True Agent Harness Improvement Loop
The full loop is implemented by SelfImprovementOrchestrator in
brain/autotune/orchestrator.py.
One cycle does this:
- Diagnose pressure with SelfImprovementPlanner.
- Seed replay/environment/LLM ideas where appropriate.
- Register Sankalpa goals for selected targets.
- Advance meta-learning from prior autotune history.
- Build a target execution plan for the selected target.
- Run source replay obligations.
- Run autotune against the required evidence-bound pack.
- Persist run evidence, scientific evidence, promotion gate metadata, and cycle lineage.
- Update source closure when promotion succeeds or failed attempts continue.
- Record Antahkarana cycle outcome and quality projection.
Target sources include:
- Karma regrets;
- autotune failure rates;
- grounded trajectory scores;
- environment scores;
- Sankalpa goal gaps;
- stalled self-improvement cycles;
- benchmark pressure.
The loop can improve itself. Repeated partial, no-effect, blocked, or failed
cycles become cycle_stall targets so the planner can diagnose failure of the
improvement process itself.
Scientific Promotion Gate
Autotune does not promote just because a run was "kept." Promotion requires evidence.
brain/autotune/scientific_evidence.py builds paired-case evidence from
baseline and candidate evaluations:
- baseline primary metric;
- candidate primary metric;
- primary delta in the lane's metric direction;
- paired case count;
- case wins/losses/ties;
- mean case delta;
- standard error;
- 95 percent lower confidence bound;
- promotion readiness and blockers.
The promotion gate can block when:
- the run was not kept;
- baseline or candidate metric is missing;
- guard metrics regressed;
- confirmation failed;
- paired evidence is underpowered;
- lower confidence bound is not positive;
- source replay obligations are missing or failed;
- touched lane has unresolved failed replay suites;
- candidate properties are unsupported;
- force-promoted candidates lack explicit review.
This is why a tiny improvement on one case may be kept for review but not promoted automatically. Scientific promotion needs enough paired evidence to avoid treating noise as progress.
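To make the "noise versus progress" point concrete, here is a small sketch of paired-case evidence. It assumes per-case scores keyed by case id, a one-sided normal approximation for the 95 percent lower bound, and a hypothetical minimum of five paired cases; the real statistics in brain/autotune/scientific_evidence.py may differ on all three.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedEvidence:
    paired_case_count: int
    wins: int
    losses: int
    ties: int
    mean_case_delta: float
    standard_error: float
    lower_confidence_bound: float
    blockers: tuple

    @property
    def promotion_ready(self) -> bool:
        return not self.blockers


def build_paired_evidence(baseline, candidate, higher_is_better=True, min_pairs=5):
    """Compare per-case scores on the cases both runs evaluated."""
    shared = sorted(set(baseline) & set(candidate))
    sign = 1.0 if higher_is_better else -1.0
    # Deltas are oriented so that positive always means "candidate is better".
    deltas = [sign * (candidate[c] - baseline[c]) for c in shared]
    n = len(deltas)
    mean = statistics.fmean(deltas) if n else 0.0
    if n >= 2:
        se = statistics.stdev(deltas) / math.sqrt(n)
        # One-sided 95% bound under a normal approximation (assumption).
        lower = mean - statistics.NormalDist().inv_cdf(0.95) * se
    else:
        se, lower = math.inf, -math.inf  # a single pair carries no variance estimate
    blockers = []
    if n < min_pairs:
        blockers.append("paired_evidence_underpowered")
    if lower <= 0:
        blockers.append("lower_confidence_bound_not_positive")
    return PairedEvidence(
        paired_case_count=n,
        wins=sum(d > 0 for d in deltas),
        losses=sum(d < 0 for d in deltas),
        ties=sum(d == 0 for d in deltas),
        mean_case_delta=mean,
        standard_error=se,
        lower_confidence_bound=lower,
        blockers=tuple(blockers),
    )
```

Under these assumptions a single improved case yields both blockers, while six consistently improved cases produce a positive lower bound and no blockers.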
Source Closure
After a promoted run clears the gate, source pressure is closed:
sb autotune benchmark show <candidate-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json
Closure fields:
- resolved_by_run_id
- resolved_at
- resolution_status
- resolution_evidence
Fixed candidates stop appearing as active benchmark_pressure. Failed attempts
increment recurrence metadata instead of disappearing. This gives the next
diagnosis pass a durable memory of what was tried.
Operator Runbook
Use this runbook when you want one falsifiable improvement loop.
1. Start from a clean tree
Autotune creates git worktrees. The run guard requires a clean repo:
Commit or stash unrelated edits before running a treatment cycle.
2. Inspect current pressure
sb quality summary --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb quality replay-cases --json
3. Run a control benchmark
Save the result when demonstrating a loop:
sb autotune bench repl_prompt --pack hard --json \
> out/checkins/<name>/control-hard-benchmark.json
4. Explain the target-bound plan
Do not continue if the selected target is not the evidence you intend to test.
If an earlier failed attempt created cycle_stall, sankalpa_gap, or
autotune_failures pressure, use a fresh state for a demonstration or close the
higher-priority pressure first.
5. Run treatment
The run may create candidate branches such as:
6. Inspect cycle and closure
sb autotune cycle show <cycle-id> --json
sb autotune benchmark show <candidate-id> --json
sb autotune benchmark unresolved --lane repl_prompt --json
sb autotune report repl_prompt --last 4 --json
7. Capture a checkin
A good checkin folder contains:
- seeded source evidence;
- explain-plan output;
- control benchmark;
- treatment cycle JSON;
- cycle show JSON;
- source candidate after closure;
- unresolved-after output;
- quality summary;
- candidate branch diff;
- README.md with the invariant summary.
Example from a successful local run:
That run selected benchmark_pressure, ran the hard pack, improved the score
from 0.786667 to 1.0, passed the scientific promotion gate with a positive
lower confidence bound, closed the source candidate as fixed, and left no
unresolved benchmark pressure for that lane in the isolated run state.
How To Interpret Outcomes
| Outcome | Meaning |
|---|---|
| success | Source pressure fixed and promotion gate passed. |
| partial | Mutation ran, but source is still open or gate did not pass. |
| no_effect | No measurable gain or no pressure reduction. |
| invalid | Evidence was not executable. |
| blocked | Safety, replay, quality, or promotion gate blocked the change. |
| failed | Runtime or infrastructure error. |
Treat partial as useful evidence, not success. A kept run with an underpowered
gate is a candidate for review; it is not a closed loop.
Implementation Map
| Area | Files |
|---|---|
| Environment contracts | brain/environments/models.py, brain/environments/base.py |
| Built-in environments | brain/environments/fixtures.py, brain/environments/workspace_env.py |
| Manifests | brain/environments/manifest.py |
| Episode store | brain/environments/store.py |
| Replay | brain/environments/replay.py |
| Export | brain/environments/export.py |
| Serve routes | brain/serve/routers/environments.py |
| Environment to autotune bridge | brain/autotune/environment_bridge.py |
| Planner | brain/autotune/self_improve.py |
| Orchestrator | brain/autotune/orchestrator.py |
| Runner and worktrees | brain/autotune/runner.py, brain/autotune/worktrees.py, brain/autotune/gitops.py |
| Promotion gate | brain/autotune/promotion_gate.py |
| Scientific evidence | brain/autotune/scientific_evidence.py |
| Scientific experiments | brain/autotune/scientific_experiments.py |
| Rewards | brain/autotune/self_improvement_rewards.py |
| Cycle lineage | brain/autotune/cycle_lineage.py |
| Replay execution | brain/quality/replay_execution.py |
| Quality projection | brain/quality/service.py, brain/quality/self_improvement_cycles.py |
Common Failure Modes
| Failure | Diagnosis | Fix |
|---|---|---|
| git working tree must be clean before autotune runs | Autotune refuses to mutate from a dirty checkout. | Commit, stash, or move unrelated edits before treatment. |
| Worktree creation denied | Sandbox or filesystem blocked .git/worktrees. | Run in an environment that allows git worktree metadata. |
| explain-plan selects the wrong target | Earlier attempts created higher-priority stall or goal pressure. | Close that pressure, or use a fresh isolated state for a demonstration. |
| Kept run does not promote | Scientific evidence or replay obligation blocked the promotion. | Add enough paired executable cases, rerun required replay cases, or route to human review. |
| Candidate remains unresolved | Source closure did not run because gate failed or no source ids were covered. | Inspect target_execution_plan, covered_candidate_ids, and gate metadata. |
| Replay case is not runnable | No adapter can execute the replay or assertions are incomplete. | Add expected properties, expected refs, or environment episode metadata. |
Build A New Environment
Use this checklist when adding an environment:
- Define an EnvironmentSpec.
- Implement BaseEnvironment._reset and _step.
- Keep actions structured and serializable.
- Return observations with stable state values.
- Emit reward components with names and reasons.
- Terminate or truncate deterministically.
- Persist reset options needed for replay.
- Add replay comparison coverage.
- Add export coverage if trainer ingestion matters.
- Decide whether weak episodes should route to an existing autotune lane or a new dedicated lane.
The environment is ready for the improvement loop only when a stored failure can be replayed and converted into executable benchmark or replay evidence.
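The checklist above can be illustrated with a standalone toy environment. This sketch deliberately does not subclass the real BaseEnvironment, whose signatures are not shown in this guide; `CounterSketchEnv` and the simplified `Action` and `StepResult` dataclasses are hypothetical stand-ins that only mirror the contract names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """Structured, serializable action (simplified stand-in)."""
    type: str
    payload: dict = field(default_factory=dict)


@dataclass(frozen=True)
class StepResult:
    observation: dict
    reward: float
    reward_components: dict
    terminated: bool
    truncated: bool


class CounterSketchEnv:
    """Toy counter task: increment until the target is reached, then submit."""

    action_types = ("increment", "submit")

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.reset_options = {}
        self.target = 0
        self.count = 0
        self.steps = 0

    def reset(self, target: int = 3) -> dict:
        # Keep reset options so a later replay can rebuild the same world.
        self.reset_options = {"target": target}
        self.target, self.count, self.steps = target, 0, 0
        return self._observe()

    def _observe(self) -> dict:
        # Stable state values only: no timestamps or random ids.
        return {"count": self.count, "target": self.target, "steps": self.steps}

    def step(self, action: Action) -> StepResult:
        if action.type not in self.action_types:
            raise ValueError(f"unsupported action type: {action.type}")
        self.steps += 1
        if action.type == "increment":
            self.count += 1
        terminated = action.type == "submit"
        # Deterministic truncation once the step budget is spent.
        truncated = not terminated and self.steps >= self.max_steps
        reached = terminated and self.count == self.target
        components = {
            "target_reached": {
                "value": 1.0 if reached else 0.0,
                "reason": f"count={self.count}, target={self.target}",
            }
        }
        return StepResult(self._observe(), components["target_reached"]["value"],
                          components, terminated, truncated)
```

Given the same reset options and action sequence, this environment always produces the same observations, reward, and terminal flags, which is the property replay comparison relies on.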