
sb chat Reliability Architecture

Added 2026-04-16 — see brain/chat/transport_events.py, brain/chat/transport.py, brain/chat/turn_journal.py, brain/chat/commands/diag.py.

Related frontend contract: specs/agent-harness-frontend-contract.md.


Current flow (before this change)

User input
  → ChatRepl.run() in repl.py
    → harness.run_turn()
      → TurnPreparer.prepare()       — message assembly
      → provider.generate() or       — LLM call
        _consume_stream()
      → ToolScheduler.schedule()     — wave partitioning
      → BoundedToolExecutor.execute() — per-tool deadline
      → TurnFinalizer.finalize()     — memory, citations
    → TurnResult returned to REPL
  → render to terminal

OTEL tracing already wraps LLM calls and tool waves with spans.

ProviderChain already handles failover/cooldown between providers and exposes result.degraded, result.responding_provider, and result.provider_failures.

TurnRuntimeState.close_writes() guards late-arriving tool results via a write-gate that timed-out threads check before committing side-effects.
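The write-gate idea can be illustrated with a minimal sketch. The class and method names other than close_writes() are hypothetical, and the real TurnRuntimeState may be structured differently; the point is only that the check and the commit happen under one lock, so a thread that lost the race cannot commit after the gate closes.

```python
import threading


class WriteGate:
    """Hypothetical stand-in for the gate inside TurnRuntimeState."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._open = True
        self.committed: list[str] = []

    def close_writes(self) -> None:
        # Called once the turn is cancelled or its deadline has passed.
        with self._lock:
            self._open = False

    def try_commit(self, result: str) -> bool:
        # Check and commit atomically; late arrivals are refused.
        with self._lock:
            if not self._open:
                return False
            self.committed.append(result)
            return True
```

A tool thread would call try_commit() when it finishes and discard its result if the call returns False.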


Current failure points

| Failure | Current handling | Gap |
|---|---|---|
| Provider 429/503 | ProviderChain failover + cooldown | Fallback is silent; the user never sees it |
| Stream stalls (no bytes) | Single timeout_s deadline on entire turn | No separate network-idle vs content-idle distinction |
| Connection never starts | Same timeout_s | No fast connection timeout |
| LLM thinking long (alive, no output) | No progress indicator update | Spinner just sits; user cannot distinguish stall from thinking |
| Tool completes after cancellation | Late-write gate (close_writes) | No structured event for post-cancel completion |
| No turn_id | trace_id = session_id:call_id | Cannot join tool events to a specific turn across sessions |
| No phase visibility | Spinner says "Thinking…" throughout | User cannot tell: connecting / streaming / running tools / recovering |
| No diagnostic query path | None | Cannot retroactively detect orphaned tools, retries, fallbacks |

Chosen implementation

A. Normalized transport events (brain/chat/transport_events.py)

A flat dataclass hierarchy for all observable events in a turn. These are emitted by the transport layer and consumed by:

  • TurnJournal (durable storage)
  • TerminalCallbacks (visible status)
  • Observability/metrics hooks
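A minimal sketch of what such a flat hierarchy could look like. The event class names come from this document; the base class name, every field, and the fan_out helper are assumptions made for illustration and are not taken from transport_events.py.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TransportEvent:
    """Assumed common base: every event belongs to one turn."""
    turn_id: str


@dataclass(frozen=True)
class TurnStarted(TransportEvent):
    provider: str  # assumed field


@dataclass(frozen=True)
class TurnCompleted(TransportEvent):
    latency_ms: int  # assumed field


@dataclass(frozen=True)
class TurnFailed(TransportEvent):
    abort_reason: str  # assumed field


@dataclass(frozen=True)
class TurnCancelled(TransportEvent):
    pass


@dataclass(frozen=True)
class FallbackApplied(TransportEvent):
    expected_provider: str  # assumed field
    responding_provider: str  # assumed field
    reason: str  # assumed field


def fan_out(event: TransportEvent,
            consumers: Iterable[Callable[[TransportEvent], None]]) -> None:
    """Deliver one event to every consumer (journal, terminal, metrics)."""
    for consume in consumers:
        consume(event)
```

Keeping the hierarchy one level deep means a consumer can dispatch on the concrete type without walking a class tree.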

B. Transport config + monitor (brain/chat/transport.py)

TransportConfig separates three timeout concepts:

| Timeout | Meaning | Default | Behavior on breach |
|---|---|---|---|
| connection_timeout_s | Time until provider sends first byte | 30 s | Abort, raise |
| network_idle_timeout_s | No bytes received for N seconds | 45 s | Abort, allow fallback |
| content_idle_timeout_s | Alive connection, no content yet | 120 s | Show "still thinking"; do not abort |
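The table's semantics can be sketched as follows. The three field names and defaults come from the table; classify_breach, its parameters, and its return labels are hypothetical and describe the intended behavior rather than what transport.py does today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransportConfig:
    connection_timeout_s: float = 30.0
    network_idle_timeout_s: float = 45.0
    content_idle_timeout_s: float = 120.0


def classify_breach(
    cfg: TransportConfig,
    elapsed_s: float,
    last_byte_age_s: Optional[float],
    last_content_age_s: float,
) -> str:
    """Map timing observations to the 'behavior on breach' column.

    last_byte_age_s is None while no byte has arrived yet.
    """
    if last_byte_age_s is None:
        # Still waiting for the first byte from the provider.
        return "abort" if elapsed_s > cfg.connection_timeout_s else "ok"
    if last_byte_age_s > cfg.network_idle_timeout_s:
        # The socket has gone quiet: abort, but let the chain fall back.
        return "abort_allow_fallback"
    if last_content_age_s > cfg.content_idle_timeout_s:
        # Bytes are flowing but no content: hint to the user, never abort.
        return "still_thinking"
    return "ok"
```

The ordering matters: the network-idle check comes before the content-idle check, so a dead socket is never mistaken for a model that is merely thinking.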

ChatTransport wraps harness.run_turn():

  • Emits TurnStarted / TurnCompleted / TurnFailed / TurnCancelled.
  • Detects result.degraded, emits FallbackApplied, and prints a visible warning.
  • Journalizes every turn to the chat_turn_journal SQLite table.
  • Feeds the REPL health monitor, which can pause or request approval when the session crosses operator-configured runtime, token, cost, context, loop, tool-failure, or latency thresholds.

C. Structured turn journal (brain/chat/turn_journal.py)

New SQLite table chat_turn_journal (migration 048) with fields:

turn_id, session_id, trace_id, provider, model_requested, model_actual
phase, retry_index, fallback_reason, abort_reason
latency_ms, input_tokens, output_tokens
transport_warnings_json, tool_summary_json
created_at, completed_at

TurnJournal wraps EventLog and writes structured rows on turn lifecycle events.
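For orientation, here is a sketch of a table with the fields listed above. The column names are from this document; the column types, the primary key, and the open_journal helper are assumptions, and the actual DDL in migration 048 may differ.

```python
import sqlite3

# Assumed types; only the column names are taken from the field list.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chat_turn_journal (
    turn_id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL,
    trace_id TEXT,
    provider TEXT,
    model_requested TEXT,
    model_actual TEXT,
    phase TEXT,
    retry_index INTEGER DEFAULT 0,
    fallback_reason TEXT,
    abort_reason TEXT,
    latency_ms INTEGER,
    input_tokens INTEGER,
    output_tokens INTEGER,
    transport_warnings_json TEXT,
    tool_summary_json TEXT,
    created_at TEXT,
    completed_at TEXT
)
"""


def open_journal(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)  # IF NOT EXISTS makes repeated calls harmless
    return conn
```

Using CREATE TABLE IF NOT EXISTS is one simple way to get an idempotent schema step; whether the real migration uses it is not stated here.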

D. Visible runtime states

TerminalCallbacks is extended with phase-aware spinner text:

| Phase | Spinner text |
|---|---|
| connecting | Connecting… (Ns) |
| streaming | Streaming… (Ns) |
| thinking | Thinking… (esc to interrupt, Ns) |
| tool_running | Using <tool>… (Ns) |
| recovering | Recovering… (retry N) |
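The mapping from phase to spinner text can be sketched as a single function. The phase names and text formats follow the table; the function name, signature, and fallback label are illustrative and stand in for, rather than reproduce, the real label helper in TerminalCallbacks.

```python
from typing import Optional


def phase_label(phase: str, elapsed_s: int,
                tool: Optional[str] = None, retry: int = 0) -> str:
    """Return the spinner text for a phase (hypothetical helper)."""
    if phase == "connecting":
        return f"Connecting… ({elapsed_s}s)"
    if phase == "streaming":
        return f"Streaming… ({elapsed_s}s)"
    if phase == "thinking":
        return f"Thinking… (esc to interrupt, {elapsed_s}s)"
    if phase == "tool_running":
        return f"Using {tool or 'tool'}… ({elapsed_s}s)"
    if phase == "recovering":
        return f"Recovering… (retry {retry})"
    # Unknown phase: fall back to a generic label instead of raising.
    return f"Working… ({elapsed_s}s)"
```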

Fallback / model-downgrade events are always printed as visible lines, never swallowed silently.

E. Diagnostic command (/diag or sb chat-doctor)

brain/chat/commands/diag.py implements /diag with sub-commands:

  • /diag turns — recent turn journal (latency, phase, abort reason)
  • /diag retries — turns with retry_index > 0 or fallback_reason set
  • /diag tools — tool latency distribution, failure rate
  • /diag stalls — turns with transport_warning events (stream stalls)

F. Enforced operator health policy (/health)

brain/chat/health_policy.py adds the chat-native health gate. It turns the turn journal, session events, context estimator, and cost tracker into one operator snapshot. The REPL checks the policy before and after natural-language turns:

  • approval mode creates a local approval request with tool_name=chat_health_policy and blocks further natural-language turns until it is approved.
  • pause mode blocks further natural-language turns immediately.
  • observe mode leaves the snapshot visible without enforcement.

Each enforcement writes policy_decision, approval_required where applicable, and a session_checkpoint with checkpoint_type=health_policy, so the audit trail explains what was attempted, what threshold was crossed, what policy applied, and how to resume.
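The three modes can be modeled as a small state machine. This is a sketch under stated assumptions: the class, method, and attribute names are invented for illustration, and the audit writes (policy_decision, session_checkpoint) are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(str, Enum):
    OBSERVE = "observe"
    APPROVAL = "approval"
    PAUSE = "pause"


@dataclass
class HealthGate:
    """Hypothetical gate checked before each natural-language turn."""
    mode: Mode
    pending_approval: bool = False
    paused: bool = False

    def on_threshold_crossed(self) -> None:
        if self.mode is Mode.APPROVAL:
            self.pending_approval = True  # blocks until approved
        elif self.mode is Mode.PAUSE:
            self.paused = True  # blocks immediately
        # observe mode: snapshot stays visible, nothing is enforced

    def approve(self) -> None:
        self.pending_approval = False

    def resume(self) -> None:
        self.paused = False

    def allows_turn(self) -> bool:
        return not (self.pending_approval or self.paused)
```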


Tool lifecycle invariants

Every tool call already has:

  • call_id (ToolCall.id, provider-assigned)
  • OTEL span: secondbrain.agent.tool_call
  • on_tool_call / on_tool_result callbacks

After this change, the turn journal also records per-tool summary (name, latency_ms, outcome) in tool_summary_json on the turn row.

Invariant checks performed in ChatTransport.run_turn():

  1. tool_call_id collision (same id requested twice): logged as a warning.
  2. terminal_reason == "cancelled" with a non-empty tool trace: logged as a "tools abandoned at cancellation" diagnostic event.
  3. result.degraded and responding_provider != expected_provider: emits a FallbackApplied event (visible and journalized).
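The three checks can be sketched as one pure function over the turn's outcome. The function name, its parameters, and the returned labels are assumptions for illustration; the real checks operate on the harness result objects.

```python
from typing import Optional, Sequence


def check_tool_invariants(
    tool_call_ids: Sequence[str],
    terminal_reason: str,
    degraded: bool,
    responding_provider: Optional[str],
    expected_provider: str,
) -> list[str]:
    """Return one finding label per violated invariant."""
    findings: list[str] = []

    # 1. The same tool_call_id must not be requested twice.
    seen: set[str] = set()
    for call_id in tool_call_ids:
        if call_id in seen:
            findings.append(f"duplicate_tool_call_id:{call_id}")
        seen.add(call_id)

    # 2. A cancelled turn with a non-empty tool trace abandoned its tools.
    if terminal_reason == "cancelled" and tool_call_ids:
        findings.append("tools_abandoned_at_cancellation")

    # 3. A degraded result answered by another provider is a fallback.
    if degraded and responding_provider != expected_provider:
        findings.append("fallback_applied")

    return findings
```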


Known limitations / next-step refactors

  1. Network-idle timeout inside streaming — the content_idle_timeout_s is surfaced as a visible hint but does not abort the stream. A true network-idle abort requires wrapping provider.generate_stream() in a per-iteration select()-style watcher. Deferred because Anthropic SSE parsing is synchronous (no coroutine hooks).

  2. Unified cancellation token — harness.cancel() is a threading.Event. A CancellationScope tree (parent turn → child tools → child rendering) is the right model but requires refactoring BoundedToolExecutor signatures. Current: REPL ESC → harness.cancel() → checked after each wave. Next: propagate scope to BoundedToolExecutor's DeadlineToken.

  3. Provider heartbeat — Anthropic SSE sends : (comment-only) pings. These are currently discarded by the streaming parser. Surfacing them as Heartbeat events would let the network-idle watchdog distinguish live connections from dead sockets. Requires modifying anthropic_messages.py.

  4. Mid-stream provider fallback — once a stream starts, ProviderChain commits to that provider. Mid-stream failover would require buffering partial completions and replaying; not implemented.

  5. Tool delivered_to_model event — the responded_ids set in harness already tracks which tool results were appended to messages. Exposing this as a callback event (on_tool_delivered) would complete the tool lifecycle. Deferred to avoid callback protocol churn.
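Item 3 above can be illustrated with a small watchdog. In the SSE wire format a line that starts with a colon is a comment, so counting such lines as proof of life keeps a quiet but healthy connection from looking dead. The class name, its methods, and the injectable clock are hypothetical; nothing here reflects the current anthropic_messages.py parser.

```python
import time
from typing import Callable


class NetworkIdleWatchdog:
    """Tracks when the socket last produced any line, pings included."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()

    def feed_line(self, line: str) -> str:
        # Any line, comment or data, proves the connection is alive.
        self._last_seen = self._clock()
        return "heartbeat" if line.startswith(":") else "data"

    def is_stalled(self) -> bool:
        return self._clock() - self._last_seen > self._timeout_s
```

Passing the clock in as a parameter keeps the watchdog testable without sleeping.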


Implementation status (as of 2026-04-17)

Wiring map

| Concern | Code path |
|---|---|
| REPL entry → transport | ChatRepl.__init__ instantiates ChatTransport and stores it on CommandContext.transport (brain/chat/repl.py). _run_turn_interruptible calls self._transport.run_turn(...). |
| Transport wrapper | ChatTransport.run_turn (brain/chat/transport.py): wraps harness.callbacks with _PhaseTrackingCallbacks, starts _ContentIdleWatchdog, records the turn row, prints fallback notices. |
| Degradation signal | TurnRuntimeState.degraded/degradation_reason/responding_provider/provider_failures are populated by harness._execute_turn_loop_v2 and copied to TurnResult via TurnFinalizer.finalize. Transport reads these fields. |
| Per-tool journaling | _JournalingToolCallback in brain/chat/repl.py, installed on the harness callback chain. Writes record_tool_event for every on_tool_call and on_tool_result; detects duplicate tool_call_id. |
| Phase-aware spinner | TerminalCallbacks.set_phase() + _phase_label() in brain/agent/callbacks.py. Transport pushes phase changes into TerminalCallbacks each time it observes on_stream_chunk / on_tool_call / on_tool_result. |
| Visible fallback notice | TerminalCallbacks.print_transport_notice("fallback", ...), called from ChatTransport._print_fallback_notice. Always stops the spinner before printing. |
| Invariant checks | ChatTransport._check_tool_invariants runs after each turn and writes tools_abandoned_at_cancellation or orphan_tool_completion transport warnings. |

CLI / REPL entry points

| Command | What it does | File |
|---|---|---|
| sb chat-doctor [summary\|turns\|retries\|tools\|stalls] | Read-only query over chat_turn_journal from outside the REPL (e.g. post-incident triage). | brain/cli/chat.py |
| /diag in REPL (alias /doctor) | Same queries rendered inside the running session. | brain/chat/commands/diag.py |
| /health [status\|risks\|policy\|approve\|deny\|resume\|observe\|json] | Live enforced health snapshot and recovery controls for the running REPL session. | brain/chat/commands/health.py |

Test coverage

| File | What it covers |
|---|---|
| tests/chat/test_turn_journal.py | Schema creation, record_start/end, transport warnings, fallback recording, per-tool events, all query_* methods, disabled-journal no-op fallback. |
| tests/chat/test_chat_transport.py | Happy-path journal row, kwargs forwarding, fallback notice (degraded=True), silent-downgrade (mismatched responding_provider), suppressed console when show_fallback_notice=False, harness exception records abort_reason, abandoned/orphan tool invariants, phase propagation to TerminalCallbacks, _ContentIdleWatchdog fire + reset, _JournalingToolCallback duplicate detection, composite chain walk. |
| tests/chat/test_chat_transport_soak.py | Long session (20 turns) without leakage, alternating success/fallback, exception does not poison next turn, repeated overload counter, _current_turn_id reset between turns. |
| tests/chat/test_diag_command.py | All /diag sub-commands render Rich tables correctly, empty-journal hint, missing-transport hint, return-False-to-stay-in-REPL invariant. |

Run just the reliability tests:

.venv/bin/python -m pytest tests/chat/test_turn_journal.py \
    tests/chat/test_chat_transport.py \
    tests/chat/test_chat_transport_soak.py \
    tests/chat/test_diag_command.py -q

Operator notes

  • Journal table lives in the same DB as EventLog.db_path (runtime.db). The migration is idempotent; it runs lazily on first TurnJournal instantiation.
  • sb chat-doctor is the post-mortem entry point when the REPL has already exited. /diag is the live counterpart during a session.
  • Silent fallbacks are gone: every ProviderChain failover is recorded in chat_turn_journal.fallback_reason, and it also prints a yellow ⚠ provider fallback: line unless show_fallback_notice=False. Either way the signal is preserved.
  • To reset the journal during testing, delete runtime.db or point SB_STATE_DIR at an empty directory. The migration recreates the schema on the next sb chat invocation.