AutoData: Agentic Self-Instruct

Re-implementation of Meta AI's AutoData (Kulikov et al., 2026) inside SecondBrain. Generates high-quality grounded QA datasets through a five-agent loop, scores quality with the AutoData paper's gates, and exposes its tuning surface as a first-class autotune lane plus an MCP tool.

When to use it

Use it for	Don't use it for
Building a synthetic QA training set from a corpus you trust	One-off Q&A — use `sb ask`
Stress-testing a new strong model against your weak baseline	Inference at runtime
Producing eval data with verifiable quality gates and gap discrimination	Anything time-critical (real-LLM mode burns tokens)
Tuning the data-quality thresholds via the autotune lane	Free-form text generation

The five sub-agents

                    ┌────────────────┐
   source.md ──────▶│  Challenger    │── proposes QA + rubric
                    └───────┬────────┘
                            │
                            ▼
                    ┌────────────────┐  rejects on context-leak,
                    │ QualityVerifier│  rubric coverage, generic Q
                    └───────┬────────┘
                            │ pass
                            ▼
              ┌───────────────────────────────┐
              │   N × WeakSolver + Judge      │── weak_avg, weak_max
              └───────────────────────────────┘
              ┌───────────────────────────────┐
              │   N × StrongSolver + Judge    │── strong_avg, strong_min
              └───────────────────────────────┘
                            │
                            ▼
                    ┌──────────────────────────────┐
                    │ Acceptance gates             │
                    │ • weak_avg ≤ 0.65            │
                    │ • strong_avg ∈ [0.60, 0.95]  │
                    │ • gap ≥ 0.20                 │
                    └───────┬──────────────────────┘
                       │           │
                  accept           reject → categorized feedback
                       │              ↑                   │
                       │              └───────────────────┘
                       ▼
                  result.json

The thresholds are the AutoData paper's CS-research defaults; they live in brain/autodata/contracts.py::AcceptanceCriteria and are configurable via brain/autodata/tuning.yaml.
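
The gate logic in the diagram can be sketched in a few lines. This is a minimal illustration, not the code in contracts.py: `GateThresholds` and `check_gates` are hypothetical names, the defaults are copied from the diagram and from tuning.yaml, and treating `weak_max_cap` as a ceiling on the best weak-solver score is an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GateThresholds:
    """Hypothetical stand-in for AcceptanceCriteria (defaults from the docs)."""
    weak_avg_max: float = 0.65
    weak_max_cap: float = 0.75   # assumed to cap the best single weak score
    strong_avg_min: float = 0.60
    strong_avg_max: float = 0.95
    gap_min: float = 0.20


def check_gates(weak_scores, strong_scores, t=GateThresholds()):
    """Return (accepted, reasons) for one refinement round."""
    weak_avg = mean(weak_scores)
    strong_avg = mean(strong_scores)
    gap = strong_avg - weak_avg
    reasons = []
    if weak_avg > t.weak_avg_max:
        reasons.append(f"weak_avg {weak_avg:.2f} > {t.weak_avg_max:.2f}")
    if max(weak_scores) > t.weak_max_cap:
        reasons.append(f"weak_max {max(weak_scores):.2f} > {t.weak_max_cap:.2f}")
    if not (t.strong_avg_min <= strong_avg <= t.strong_avg_max):
        reasons.append(
            f"strong_avg {strong_avg:.2f} outside "
            f"[{t.strong_avg_min:.2f}, {t.strong_avg_max:.2f}]"
        )
    if gap < t.gap_min:
        reasons.append(f"gap {gap:.2f} < {t.gap_min:.2f}")
    return (not reasons, reasons)


# An item that weak solvers miss and strong solvers mostly get right.
accepted, reasons = check_gates([0.40, 0.35, 0.45], [0.85, 0.80, 0.90])
```

Collecting every failed gate, rather than stopping at the first, matches the diagram's "categorized feedback" arrow: the Challenger can be told about all the problems in one refinement round.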

Layers

brain/autodata/
├── contracts.py     Pydantic schemas (Rubric, QAItem, AcceptanceCriteria,
│                    SolverScores, RoundRecord, AutoDataResult)
├── prompts.py       System prompts for the five sub-agents + refinement
│                    feedback templating
├── agents.py        Challenger, QualityVerifier, Weak/StrongSolver, Judge
│                    — each wraps an LLMProvider
├── judge.py         score_solver_outputs(): rubric-graded judgment with
│                    parser fallbacks
├── loop.py          AutoDataLoop — inner loop (refinement) + middle loop
│                    (sweep over sources). Emits autodata.{round,accepted,
│                    qv_failed} events.
├── meta.py          MetaOptimizer — Boltzmann (T=0.1) evolutionary search
│                    over HarnessVariant tunings
├── tuning.py        load/apply/coerce/render of brain/autodata/tuning.yaml
├── store.py         AutoDataStore — JSONL persistence under
│                    state/autodata/<run_id>/
├── karma_feed.py    record_round() — KarmaEntry per round; rejects → regrets
├── fixtures.py      StubProvider + fixture router for offline evaluation
├── proposer.py      Bridge: meta-optimizer proposes; autotune lane validates
└── tuning.yaml      THE ONLY MUTABLE FILE in brain/autodata/. The autotune
                     `autodata` lane is allowed to mutate this and nothing else.
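
To make the "Boltzmann (T=0.1)" note on meta.py concrete, here is a sketch of Boltzmann-weighted parent selection. How MetaOptimizer actually applies the temperature is not described here, so this is one common reading rather than its implementation; the function names, variant labels, and the 0.800 score are illustrative.

```python
import math
import random


def boltzmann_weights(scores, temperature=0.1):
    """Softmax of score / T; subtracting the max keeps exp() stable."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_parent(population, scores, rng, temperature=0.1):
    """Sample one variant; a low temperature favours the best scorer."""
    weights = boltzmann_weights(scores, temperature)
    return rng.choices(population, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded, so repeated runs pick the same parents
variants = ["baseline", "looser-gap", "stricter-weak"]  # placeholder labels
scores = [0.571, 1.000, 0.800]                           # illustrative scores
parent = pick_parent(variants, scores, rng)
```

At T=0.1 a score difference of 0.2 already translates into a weight ratio of about e² ≈ 7.4, so the search stays close to greedy while still occasionally exploring weaker variants.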

CLI surface

# Generate QA items from grounded source files (real LLMs)
sb autodata generate vault/01_projects/paper-*.md \
    [--max-rounds 8] [--weak-n 3 --strong-n 3] \
    [--feed-karma] [--json]

# Search the tuning surface stochastically against the autotune fixture
sb autodata meta-optimize \
    --train fixtures/a.md --validate fixtures/b.md \
    [--iterations 30] [--temperature 0.1] [--seed 0]

# Meta-opt proposes; the autotune lane judges; --apply commits an accepted candidate
sb autodata propose-validate --pack core --iterations 30 --seed 0 [--apply]

# Bootstrap an autotune fixture from real vault markdown
sb autodata fixture-init "vault/01_projects/*.md" --out my_fixture.yaml \
    [--weak-score 0.40 --strong-score 0.85]

# Dataset-level analysis of a generated run
sb autodata diversity state/autodata/<run-id> [--dedup-threshold 0.85] [--json]

# Export accepted items as a flat dataset (CSV/JSONL/JSON)
sb autodata to-dataset state/autodata/<run-id> --out dataset.jsonl [--dedup]

# Profile an export via brain.datasets.profile_dataset
sb autodata profile state/autodata/<run-id> [--dedup]

# Show summary for a generated run
sb autodata status state/autodata/<run-id>
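
The `--dedup-threshold` flag and the token-Jaccard tests listed under Tests suggest a set-overlap similarity. A minimal sketch of that idea follows; the tokenizer, the greedy keep-first policy, and the function names are assumptions, not the behaviour of `sb autodata diversity`.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set (tokenizer chosen for this sketch)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over token sets; two empty texts count as identical."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dedup(questions, threshold=0.85):
    """Greedy filter: keep a question unless it is too close to one already kept."""
    kept = []
    for question in questions:
        if all(jaccard(question, k) < threshold for k in kept):
            kept.append(question)
    return kept


unique = dedup([
    "What is the learning rate schedule?",
    "what is the learning rate schedule",   # same tokens, dropped
    "Which optimizer was used?",
])
```

Token sets ignore word order and repetition, which makes the measure cheap and deterministic but blind to paraphrase; that trade-off is usually acceptable for catching near-verbatim duplicates.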

MCP surface

brain/mcp/cc_server.py exposes one AutoData tool today:

secondbrain_autodata_propose_validate(
    pack: str = "core",          # autotune fixture pack name
    iterations: int = 30,        # meta-optimizer iterations
    seed: int = 0,
    temperature: float = 0.1,    # Boltzmann temperature
) -> str  # JSON: {proposal:{...}, validation:{...}}

The tool is read-only: it never writes tuning.yaml and never creates a commit. Call it from Claude Code, the API, or any MCP client.

Autotune lane integration

Aspect	Value
Lane spec	`brain/autotune/specs/autodata.yaml`
Evaluator	`autodata.synth` → `brain/autotune/evaluators/autodata.py::AutoDataEvaluator`
Mutable path	`brain/autodata/tuning.yaml` (only)
Mutation strategy	`AutoDataTuningMutationStrategy` (±0.05 grid + prompt-patch toggle)
Primary metric	`autodata_acceptance_rate` (accepted / total sources)
Guards	`schema_valid`, `avg_gap`, `qv_pass_rate`, `avg_strong_score`, `p95_latency_ms`
Fixtures	`brain/autotune/fixtures/autodata.{smoke,core}.yaml`
Karma mapping	`autodata_round` → `autodata` lane (regrets surface as autotune ideas)

Run the lane the same way as any other:

sb autotune run autodata --pack core --attempts 12      # actual commits land
sb autotune run autodata --pack core --dry-run --attempts 12   # preview only

The lane uses a stub provider + canned fixture responses so it runs offline (no LLM calls). Dispatch lives in brain/autotune/runner.py::_mutation_strategy(spec) — lane-name match first (autodata → AutoDataTuningMutationStrategy), kind-fallback second.
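
The "lane-name match first, kind-fallback second" dispatch can be pictured as two dictionary lookups. This sketch is not the code in runner.py: `LaneSpec`, the `kind` value, and `GenericYamlStrategy` are hypothetical, and strategies are represented by their names so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class LaneSpec:
    """Hypothetical minimal lane spec: just a name and a kind."""
    name: str
    kind: str


# Lane-specific strategies win over kind-level defaults.
STRATEGY_BY_NAME = {"autodata": "AutoDataTuningMutationStrategy"}
STRATEGY_BY_KIND = {"yaml": "GenericYamlStrategy"}  # hypothetical fallback


def mutation_strategy(spec):
    """Resolve a strategy name: exact lane name first, then the lane's kind."""
    if spec.name in STRATEGY_BY_NAME:
        return STRATEGY_BY_NAME[spec.name]
    if spec.kind in STRATEGY_BY_KIND:
        return STRATEGY_BY_KIND[spec.kind]
    raise LookupError(f"no mutation strategy for lane {spec.name!r}")


chosen = mutation_strategy(LaneSpec(name="autodata", kind="yaml"))
```

Checking the name first lets a single lane override the default for its kind without affecting other lanes of the same kind.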

Tuning surface

# brain/autodata/tuning.yaml — bounded, lane-mutable
acceptance:
  weak_avg_max:    0.65   # paper default; bounds [0.05, 0.95]
  weak_max_cap:    0.75   # bounds [0.05, 0.99]
  strong_avg_min:  0.60   # bounds [0.05, 0.99]
  strong_avg_max:  0.95   # bounds [0.10, 0.99]
  gap_min:         0.20   # bounds [0.0, 0.80]

prompt_patches:
  paper_specific_insight:   false
  source_unique_knowledge:  false
  criterion_non_redundancy: false

Each prompt patch is a canned addition to the Challenger system prompt; the patch text lives in brain/autodata/tuning.py::PROMPT_PATCH_TEXT so diffs are reviewable in source control.
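
The bounds in the YAML comments and the ±0.05 grid mentioned for `AutoDataTuningMutationStrategy` combine naturally into a clamp-then-step routine. The sketch below assumes that reading; `BOUNDS`, `clamp`, and `grid_neighbors` are illustrative names, and it does not enforce cross-field constraints such as `strong_avg_min` ≤ `strong_avg_max`.

```python
# Bounds copied from the comments in tuning.yaml above.
BOUNDS = {
    "weak_avg_max": (0.05, 0.95),
    "weak_max_cap": (0.05, 0.99),
    "strong_avg_min": (0.05, 0.99),
    "strong_avg_max": (0.10, 0.99),
    "gap_min": (0.0, 0.80),
}


def clamp(key, value):
    """Pull a threshold back inside its documented bounds."""
    low, high = BOUNDS[key]
    return min(max(value, low), high)


def grid_neighbors(acceptance, step=0.05):
    """Yield copies of `acceptance` with one threshold moved by +/- step."""
    for key, value in acceptance.items():
        for delta in (-step, step):
            candidate = round(clamp(key, value + delta), 2)
            if candidate != value:  # skip moves that clamping turned into no-ops
                yield {**acceptance, key: candidate}


defaults = {
    "weak_avg_max": 0.65,
    "weak_max_cap": 0.75,
    "strong_avg_min": 0.60,
    "strong_avg_max": 0.95,
    "gap_min": 0.20,
}
candidates = list(grid_neighbors(defaults))
```

Each candidate differs from the current tuning in exactly one field, which keeps the resulting diffs to tuning.yaml small and easy to review.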

Karma feed

When the karma feed is enabled, every refinement round emits an autodata_round KarmaEntry:

Round outcome	KarmaEntry shape
Accepted (gates pass)	`outcome="success"`, `regret=False`, no lesson
Rejected (gates fail)	`outcome="failure"`, `regret=True`, `lesson=` joined gate reasons

Regrets flow through brain/autotune/karma_bridge.py::propose_idea_from_karma and surface as autotune ideas on the autodata lane. Enable the feed with --feed-karma:

sb autodata generate vault/path/*.md --feed-karma

The next sb autotune run autodata cycle picks up those lessons as research directions.
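
The round-to-KarmaEntry mapping in the table above can be sketched as a small function. `KarmaEntrySketch` is a hypothetical stand-in for the real KarmaEntry, the `kind` field name is assumed, and joining gate reasons with "; " is a guess at the separator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KarmaEntrySketch:
    """Hypothetical stand-in for KarmaEntry, limited to the documented fields."""
    kind: str
    outcome: str
    regret: bool
    lesson: Optional[str] = None


def record_round_sketch(accepted: bool, reject_reasons: List[str]) -> KarmaEntrySketch:
    """Map one refinement round to a karma entry, as in the table above."""
    if accepted:
        # Gates passed: a success with nothing to learn from.
        return KarmaEntrySketch("autodata_round", "success", False)
    # Gates failed: a regret whose lesson is the joined gate reasons.
    return KarmaEntrySketch(
        "autodata_round", "failure", True, "; ".join(reject_reasons)
    )


entry = record_round_sketch(False, ["gap 0.05 < 0.20", "weak_avg 0.71 > 0.65"])
```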

Propose-validate flow

The meta-optimizer is fast (about 0.03 s for 30 iterations) but ungoverned. The autotune lane is governed (full judge ensemble, complexity tax, regression detection) but searches a deterministic grid whose gains do not compound within a single invocation. sb autodata propose-validate combines the two: the meta-optimizer finds the best variant in-process, then the same evaluator and ensemble that the autotune lane runs score baseline against candidate and decide whether to accept. With --apply, a single autotune-style commit replaces the chain of incremental --resume invocations.

Meta-optimizer (30 iters):                Autotune lane validation:
  best score:    1.000                      baseline:        0.571
  accepted: 4 / rejected: 26                candidate:       1.000
  proposed: weak=0.70, strong=0.55,         raw_gain:        +0.429
            gap=0.15                        ensemble:        ACCEPTED
                                            metric:    pass score=+0.429 required
                                            contract:  pass score=+1.000
                                            regression:pass score=+1.000
                                            pairwise:  pass score=+1.000

Fixture format

# brain/autotune/fixtures/autodata.<pack>.yaml
name: autodata.<pack>
sources:
  - id: paper-foo
    title: "Short title"
    text: |
      Source markdown / text the challenger grounds on.
    responses:
      challenger: |     # canned strict-JSON QA + rubric
        {"question": "...", "context": "...",
         "reference_answer": "...",
         "rubric": {"criteria": [{"name": "...", "description": "...", "weight": 5}]}}
      qv: '{"passed": true, "issues": [], "feedback": "ok"}'
      weak_score: 0.40   # judge score for weak-solver responses on this source
      strong_score: 0.85 # judge score for strong-solver responses

The fixture router in brain/autodata/fixtures.py latches the active source on each challenger call (since solver/judge calls don't carry the source id), so per-source scores are honored across the inner loop.
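
The latching behaviour can be illustrated with a small router that remembers the source named in the most recent challenger call. This is a sketch under assumptions, not the router in fixtures.py: `LatchingRouter` and `call` are hypothetical names, and roles are simply the keys of each fixture's `responses` mapping.

```python
class LatchingRouter:
    """Hypothetical router: later calls reuse the last challenger's source."""

    def __init__(self, sources):
        self._responses = {s["id"]: s["responses"] for s in sources}
        self._active = None  # id of the source latched by the last challenger call

    def call(self, role, source_id=None):
        if role == "challenger":
            if source_id not in self._responses:
                raise KeyError(f"unknown source: {source_id!r}")
            self._active = source_id  # latch for the calls that follow
        if self._active is None:
            raise RuntimeError("no challenger call has latched a source yet")
        return self._responses[self._active][role]


router = LatchingRouter([
    {"id": "paper-foo", "responses": {"challenger": "{}", "weak_score": 0.40, "strong_score": 0.85}},
    {"id": "paper-bar", "responses": {"challenger": "{}", "weak_score": 0.10, "strong_score": 0.70}},
])
router.call("challenger", "paper-foo")
weak_for_foo = router.call("weak_score")  # no source id needed after the latch
```

The latch is simple shared state, so it relies on sources being processed one at a time; interleaving challenger calls for different sources would return scores for whichever source was latched last.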

Result schema

Mirrors AutoData's result.json:

{
  "source_id": "paper-foo",
  "source_title": "...",
  "rounds": [{
    "refinement_round": 1,
    "question": "...", "context": "...", "reference_answer": "...",
    "rubric": [{"name": "...", "weight": 5, ...}, ...],
    "accepted": true,
    "quality_verifier_passed": true,
    "weak_solver_avg": 0.40,
    "strong_solver_avg": 0.80,
    "gap": 0.40,
    "reject_reasons": [],
    "eval_report": "..."
  }, ...],
  "final_accepted_round": 1,
  "total_rounds": 1
}

Persisted as JSONL under state/autodata/<run_id>/sources.jsonl; accepted items also append to accepted.jsonl and a run-level summary lands in index.json.
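
A run's sources.jsonl can be summarized with the standard library alone. The sketch below assumes one result object per line and that `final_accepted_round` is null when no round was accepted; neither detail is confirmed above, and `summarize` is a hypothetical helper rather than part of AutoDataStore.

```python
import json


def summarize(jsonl_text):
    """Compute acceptance rate and mean accepted gap from sources.jsonl content."""
    results = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    accepted_gaps = []
    for result in results:
        final = result.get("final_accepted_round")
        if final is None:  # assumed marker for "no round accepted"
            continue
        accepted_round = next(
            r for r in result["rounds"] if r["refinement_round"] == final
        )
        accepted_gaps.append(accepted_round["gap"])
    total = len(results)
    return {
        "sources": total,
        "accepted": len(accepted_gaps),
        "acceptance_rate": len(accepted_gaps) / total if total else 0.0,
        "avg_gap": sum(accepted_gaps) / len(accepted_gaps) if accepted_gaps else None,
    }


sample = "\n".join([
    json.dumps({"source_id": "paper-foo", "final_accepted_round": 1, "total_rounds": 1,
                "rounds": [{"refinement_round": 1, "accepted": True, "gap": 0.40}]}),
    json.dumps({"source_id": "paper-bar", "final_accepted_round": None, "total_rounds": 1,
                "rounds": [{"refinement_round": 1, "accepted": False, "gap": 0.05}]}),
])
summary = summarize(sample)
```

The `acceptance_rate` here follows the same accepted / total sources definition used for the lane's primary metric.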

Tests

File	Coverage
`tests/autodata/test_autodata.py`	Contracts, gates, refinement, judge fallbacks, inner loop, persistence, CLI registration (26)
`tests/autodata/test_autodata_autotune.py`	Tuning load/apply, lane spec resolution, evaluator end-to-end, MetaOptimizer over the new tuning surface (11)
`tests/autodata/test_autodata_karma.py`	record_round shapes, make_karma_feed filter, loop integration, karma_bridge mapping (8)
`tests/autodata/test_autodata_proposer.py`	propose/validate/apply flow, no-op handling, seed determinism (9)
`tests/autodata/test_autodata_mcp.py`	MCP tool registration, return shapes, error handling, determinism (5)
`tests/autodata/test_autodata_diversity.py`	Token-Jaccard math, near-dup clustering, accepted.jsonl loading, full report (16)
`tests/autodata/test_autodata_fixture_init.py`	Glob discovery, H1 title extraction, id collisions, score-bound validation, evaluator absolute-path support (14)
`tests/autodata/test_autodata_dataset_export.py`	flatten_item shape, CSV/JSONL/JSON formats, dedup integration, profile_run round-trip via brain.datasets (15)

123 tests total, all offline (deterministic stub provider); the suite runs in about 4.7 s.

.venv/bin/python -m pytest tests/autodata/test_autodata*.py -q --override-ini="addopts="

References

Kulikov et al., "Autodata: an automatic data scientist to create high quality data", Meta AI RAM, 2026 — https://facebookresearch.github.io/RAM/blogs/autodata/
Source repo: https://github.com/facebookresearch/RAM/tree/main/projects/autodata