Testing Strategy

SecondBrain has outgrown the "put new coverage in one more smoke file" phase. Going forward, the practical strategy is:

Test Pyramid

  1. Unit tests
     • Pure logic only.
     • No real provider setup, embeddings, vector search, MCP boot, or filesystem-heavy state unless the unit under test owns it.
     • Must be parallel-safe under pytest -n auto.
     • Preferred tools: small fakes, direct constructor injection, no CLI round-trips.

  2. Contract tests
     • Verify stable request/response or persistence contracts.
     • Good fit for session stores, event logs, CLI JSON payloads, SSE reducers, and adapter boundary objects.
     • Use real serialization and storage when cheap, but stub network and model layers.

  3. Integration tests
     • Exercise one user-facing surface end to end with controlled dependencies.
     • Typical examples: sb chat slash commands, sb sessions flows, API route handlers.
     • Should stub providers and heavy retrieval paths so the test only proves the product wiring it cares about.
     • Mark with @pytest.mark.integration.
     • Prefer serial execution (-n 0) when the flow touches shared global state, thread-heavy runtimes, or embedding stacks.

  4. End-to-end and heavy stack tests
     • Only for a few high-value journeys.
     • Opt-in, small in count, and clearly separated from the fast default loop.
     • Real embeddings or full retrieval boot should live here, not in generic smoke files.
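
As a rough illustration of the unit-test tier, the sketch below injects a small fake through the constructor instead of booting a provider. FakeProvider and TitleGenerator are hypothetical names invented for this example and are not part of SecondBrain; the point is only the shape of the test.

```python
class FakeProvider:
    """Hypothetical stand-in for a model provider: records prompts, no network."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


class TitleGenerator:
    """Hypothetical unit under test; the provider is injected, not looked up."""

    def __init__(self, provider, max_len: int = 40) -> None:
        self.provider = provider
        self.max_len = max_len

    def title_for(self, first_message: str) -> str:
        raw = self.provider.complete(f"Title for: {first_message}")
        return raw.strip()[: self.max_len]


def test_title_is_trimmed_and_truncated():
    # No CLI round-trip and no shared state, so this is safe under -n auto.
    provider = FakeProvider(reply="  A very long session title  ")
    generator = TitleGenerator(provider, max_len=11)

    assert generator.title_for("hello") == "A very long"
    assert provider.prompts == ["Title for: hello"]
```

Because the fake holds all of its state on the instance, two workers running this test at the same time cannot interfere with each other.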

File Boundaries

  • Keep test_*_smoke.py for a narrow happy-path sanity slice.
  • Split feature families into focused modules such as test_chat_sessions.py, test_chat_transport.py, and test_chat_antahkarana.py.
  • Prefer one behavioral theme per file over thousand-line catch-all files.

Fixture Rules

  • Default fixtures should build isolated temp config, temp state, and fake providers.
  • Heavy runtime initialization must be opt-in.
  • If a test only needs persisted chat rows, use EventLog or SessionStore directly instead of invoking the whole chat stack.
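
One possible shape for such a default fixture is sketched below using only the standard library. The config file name, its keys, and FakeProvider are assumptions made for illustration; in a real suite the same body could be wrapped in a pytest fixture built on the built-in tmp_path fixture.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path


class FakeProvider:
    """Hypothetical fake; returns a canned reply instead of calling a model."""

    def complete(self, prompt: str) -> str:
        return "stub reply"


@contextmanager
def isolated_env():
    """Yield a temp config path, temp state dir, and fake provider; clean up on exit."""
    with tempfile.TemporaryDirectory() as root:
        root_path = Path(root)
        state_dir = root_path / "state"
        state_dir.mkdir()
        config_path = root_path / "config.json"  # hypothetical config layout
        config_path.write_text(
            json.dumps({"provider": "fake", "state_dir": str(state_dir)})
        )
        yield {
            "config_path": config_path,
            "state_dir": state_dir,
            "provider": FakeProvider(),
        }


def test_env_is_isolated():
    with isolated_env() as first, isolated_env() as second:
        # Each test gets its own directories, so nothing leaks between tests.
        assert first["state_dir"] != second["state_dir"]
        assert json.loads(first["config_path"].read_text())["provider"] == "fake"
    assert not first["state_dir"].exists()
```

Nothing heavy is initialized here; a test that needs embeddings or retrieval would have to ask for that explicitly through a separate, opt-in fixture.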

Parallelism Rules

  • -n auto is appropriate for unit and cheap contract tests.
  • Chat/integration slices that can trigger embedding threads, retrieval startup, or other native-code concurrency should run with -n 0 unless explicitly hardened for parallel workers.
  • When a test is not parallel-safe, fix the fixture boundary first; do not hide the problem by adding more unrelated smoke coverage.
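
A minimal sketch of how the two execution modes could be driven from one script is shown below. It assumes pytest-xdist is installed (it provides the -n option) and that integration tests carry the integration marker described earlier; the helper names and the tests default are hypothetical.

```python
import subprocess
import sys


def build_pytest_commands(test_root: str = "tests") -> list[list[str]]:
    """Return the fast parallel run followed by the serial integration run."""
    fast = [
        sys.executable, "-m", "pytest", test_root,
        "-n", "auto", "-m", "not integration",
    ]
    serial = [
        sys.executable, "-m", "pytest", test_root,
        "-n", "0", "-m", "integration",
    ]
    return [fast, serial]


def run_all(test_root: str = "tests") -> int:
    """Run both phases in order and stop at the first failing phase."""
    for command in build_pytest_commands(test_root):
        result = subprocess.run(command)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling run_all() keeps the fast loop parallel while the integration slice stays on a single process, which matches the split described above without hiding fixture problems.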

Agent Behavior Evals

Do not write pytest tests that assert on open-ended LLM response text, persona, tone, or exact phrasing. Those tests look deterministic but fail for model, temperature, provider, and context-routing reasons that are unrelated to code correctness.

Use pytest for stable contracts:

  • tool schemas, validators, serializers, reducers, and persistence records
  • deterministic routing, policy, approval, and provider-selection logic
  • offline fixtures that assert event shapes, not model prose
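
For example, a contract test over an event record might look like the sketch below. ToolCallEvent and its field names are hypothetical and only stand in for whatever the real event log persists; the assertions target keys, types, and round-tripping, never generated text.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToolCallEvent:
    """Hypothetical persisted event; field names are illustrative only."""

    session_id: str
    kind: str
    tool_name: str
    arguments: dict


def serialize(event: ToolCallEvent) -> str:
    # Real serialization is cheap, so the test uses it instead of a stub.
    return json.dumps(asdict(event), sort_keys=True)


def test_tool_call_event_shape():
    event = ToolCallEvent("s-1", "tool_call", "search_notes", {"query": "pytest"})
    payload = json.loads(serialize(event))

    # Assert on the shape of the record, not on any model prose.
    assert set(payload) == {"session_id", "kind", "tool_name", "arguments"}
    assert payload["kind"] == "tool_call"
    assert isinstance(payload["arguments"], dict)
    assert ToolCallEvent(**payload) == event
```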

Use the quality plane for behavior:

  • replay cases for failures seen in real sessions or traces
  • grounded and retrieval evals for faithfulness, citations, and hallucination risk
  • tool trajectory checks for whether the right tools were selected in the right order
  • rubric or judge-based checks when semantic quality matters more than exact text
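
A tool trajectory check can be as small as an ordered-subsequence test over a recorded trace. The sketch below assumes a hypothetical trace format (a list of dicts with type and tool keys); the real trace schema may differ.

```python
def tools_in_order(trace: list[dict], expected: list[str]) -> bool:
    """Return True if the expected tools were called in order (gaps allowed)."""
    called = iter(
        step["tool"] for step in trace if step.get("type") == "tool_call"
    )
    # Membership tests consume the iterator, which enforces ordering.
    return all(name in called for name in expected)


# Hypothetical recorded trace from a replayed session.
example_trace = [
    {"type": "message", "text": "looking that up"},
    {"type": "tool_call", "tool": "search_notes"},
    {"type": "tool_call", "tool": "read_note"},
    {"type": "tool_call", "tool": "summarize"},
]
```

With this trace, tools_in_order(example_trace, ["search_notes", "summarize"]) holds, while the reversed order does not. The check ignores the wording of any message steps, so it stays stable across models and temperatures.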

Start with one or two core behavior cases, fix the failure, rerun the same case, then expand. Lowering thresholds or deleting flaky behavior cases should be a human-reviewed decision, because those cases often reveal missing instructions, loose tool descriptions, or hidden nondeterminism.

Current Cleanup Targets

  • Continue splitting tests/chat/test_chat_smoke.py by feature area.
  • Move session/resume coverage into its own module.
  • Keep resume and persistence regressions near the session helpers instead of mixing them with unrelated tool-loop tests.