`sb chat` Reliability Architecture¶

Added 2026-04-16 — see brain/chat/transport_events.py, brain/chat/transport.py, brain/chat/turn_journal.py, brain/chat/commands/diag.py.

Related frontend contract: specs/agent-harness-frontend-contract.md.

Current flow (before this change)¶

User input
  → ChatRepl.run() in repl.py
    → harness.run_turn()
      → TurnPreparer.prepare()       — message assembly
      → provider.generate() or       — LLM call
        _consume_stream()
      → ToolScheduler.schedule()     — wave partitioning
      → BoundedToolExecutor.execute() — per-tool deadline
      → TurnFinalizer.finalize()     — memory, citations
    → TurnResult returned to REPL
  → render to terminal

OTEL tracing already wraps LLM calls and tool waves with spans.

ProviderChain already handles failover/cooldown between providers and exposes result.degraded, result.responding_provider, and result.provider_failures.

TurnRuntimeState.close_writes() guards late-arriving tool results via a write-gate that timed-out threads check before committing side-effects.
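The late-write gate can be pictured with the following minimal sketch. It is not the actual `TurnRuntimeState`: the class name `WriteGate` and the method `try_commit` are hypothetical, and only `close_writes()` is named in this document. It assumes the check and the write happen under one lock.

```python
import threading


class WriteGate:
    """Hypothetical stand-in for the write-gate on TurnRuntimeState."""

    def __init__(self):
        self._lock = threading.Lock()
        self._open = True
        self.committed = []

    def close_writes(self):
        # Called once the turn has been cancelled or has timed out.
        with self._lock:
            self._open = False

    def try_commit(self, result):
        # Check and write under the same lock, so a late tool thread
        # cannot slip in between the check and the side-effect.
        with self._lock:
            if not self._open:
                return False
            self.committed.append(result)
            return True


gate = WriteGate()
on_time = gate.try_commit("search: 3 hits")
gate.close_writes()
late = gate.try_commit("fetch: finished after deadline")
```

The late result is dropped rather than raising, which matches the gap noted in the table below: nothing structured is emitted for the post-cancel completion.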

Current failure points¶

Failure	Current handling	Gap
Provider 429/503	ProviderChain failover + cooldown	Fallback is silent — user never sees it
Stream stalls (no bytes)	Single `timeout_s` deadline on entire turn	No separate network-idle vs content-idle distinction
Connection never starts	Same `timeout_s`	No fast connection timeout
LLM thinking long (alive, no output)	No progress indicator update	Spinner just sits; user cannot distinguish stall from thinking
Tool completes after cancellation	Late-write gate (close_writes)	No structured event for post-cancel completion
No turn_id	trace_id = session_id:call_id	Cannot join tool events to a specific turn across sessions
No phase visibility	Spinner says "Thinking…" throughout	User cannot tell: connecting / streaming / running tools / recovering
No diagnostic query path	None	Cannot retroactively detect orphaned tools, retries, fallbacks

Chosen implementation¶

A. Normalized transport events (`brain/chat/transport_events.py`)¶

A flat dataclass hierarchy for all observable events in a turn. These are emitted by the transport layer and consumed by:
- TurnJournal (durable storage)
- TerminalCallbacks (visible status)
- Observability/metrics hooks
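As an illustration of what a flat event hierarchy might look like, here is a minimal sketch. The event names `TurnStarted`, `TurnCompleted`, and `FallbackApplied` appear in this document; the base class `TransportEvent`, every field, and the `describe` consumer are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportEvent:
    """Hypothetical base: every event names the turn it belongs to."""
    turn_id: str


@dataclass(frozen=True)
class TurnStarted(TransportEvent):
    provider: str


@dataclass(frozen=True)
class FallbackApplied(TransportEvent):
    expected_provider: str
    responding_provider: str
    reason: str


@dataclass(frozen=True)
class TurnCompleted(TransportEvent):
    latency_ms: int


def describe(event):
    """One possible consumer: render an event as a status line."""
    if isinstance(event, FallbackApplied):
        return (
            f"provider fallback: {event.expected_provider} -> "
            f"{event.responding_provider} ({event.reason})"
        )
    return type(event).__name__


events = [
    TurnStarted("t1", provider="primary"),
    FallbackApplied("t1", "primary", "backup", "429"),
    TurnCompleted("t1", latency_ms=1840),
]
lines = [describe(e) for e in events]
```

Keeping the hierarchy one level deep means each consumer can dispatch on the event type alone.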

B. Transport config + monitor (`brain/chat/transport.py`)¶

TransportConfig separates three timeout concepts:

Timeout	Meaning	Default	Behavior on breach
`connection_timeout_s`	Time until provider sends first byte	30 s	Abort, raise
`network_idle_timeout_s`	No bytes received for N seconds	45 s	Abort, allow fallback
`content_idle_timeout_s`	Alive connection, no content yet	120 s	Show "still thinking" — do not abort
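The three timeouts in the table could be evaluated roughly as follows. This is a sketch of the table's intended semantics, not the shipped `TransportConfig`: the field names and defaults come from the table, while `classify_breach`, its arguments, and its return strings are hypothetical. Timestamps are seconds since the turn started.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportConfig:
    connection_timeout_s: float = 30.0
    network_idle_timeout_s: float = 45.0
    content_idle_timeout_s: float = 120.0


def classify_breach(cfg, now, first_byte_at=None, last_byte_at=None,
                    last_content_at=None):
    """Return the action for the first breached timeout, or None."""
    if first_byte_at is None:
        # Connection never started: fail fast.
        return "abort" if now >= cfg.connection_timeout_s else None
    last_byte = first_byte_at if last_byte_at is None else last_byte_at
    if now - last_byte >= cfg.network_idle_timeout_s:
        # Socket has gone quiet: abort, but let the chain fall back.
        return "abort_allow_fallback"
    last_content = first_byte_at if last_content_at is None else last_content_at
    if now - last_content >= cfg.content_idle_timeout_s:
        # Bytes still arrive but no content: hint only, never abort.
        return "still_thinking"
    return None


cfg = TransportConfig()
never_connected = classify_breach(cfg, now=31)
dead_socket = classify_breach(cfg, now=100, first_byte_at=1, last_byte_at=50)
slow_model = classify_breach(cfg, now=130, first_byte_at=1, last_byte_at=125)
healthy = classify_breach(cfg, now=10, first_byte_at=1, last_byte_at=9,
                          last_content_at=8)
```

Checking network idleness before content idleness means a dead socket is never misreported as a model that is merely thinking.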

ChatTransport wraps harness.run_turn():
- Emits TurnStarted / TurnCompleted / TurnFailed / TurnCancelled
- Detects result.degraded → emits FallbackApplied, prints visible warning
- Journalizes every turn to chat_turn_journal SQLite table
- Feeds the REPL health monitor, which can pause or request approval when the session crosses operator-configured runtime, token, cost, context, loop, tool-failure, or latency thresholds.

C. Structured turn journal (`brain/chat/turn_journal.py`)¶

New SQLite table chat_turn_journal (migration 048) with fields:

turn_id, session_id, trace_id, provider, model_requested, model_actual
phase, retry_index, fallback_reason, abort_reason
latency_ms, input_tokens, output_tokens
transport_warnings_json, tool_summary_json
created_at, completed_at

TurnJournal wraps EventLog and writes structured rows on turn lifecycle events.
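For orientation, the field list above could map onto a table like the one in this sketch. The column names are taken from the list; the column types, the primary key, the default on `retry_index`, and the start/end write pattern are assumptions, not the contents of migration 048.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS chat_turn_journal (
    turn_id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL,
    trace_id TEXT,
    provider TEXT,
    model_requested TEXT,
    model_actual TEXT,
    phase TEXT,
    retry_index INTEGER DEFAULT 0,
    fallback_reason TEXT,
    abort_reason TEXT,
    latency_ms INTEGER,
    input_tokens INTEGER,
    output_tokens INTEGER,
    transport_warnings_json TEXT,
    tool_summary_json TEXT,
    created_at TEXT,
    completed_at TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(SCHEMA)  # IF NOT EXISTS makes a repeated run a no-op

# One row per turn: inserted at turn start, updated at turn end.
conn.execute(
    "INSERT INTO chat_turn_journal (turn_id, session_id, phase, created_at) "
    "VALUES (?, ?, ?, ?)",
    ("t1", "s1", "connecting", "2026-04-17T10:00:00Z"),
)
conn.execute(
    "UPDATE chat_turn_journal SET phase = ?, latency_ms = ?, completed_at = ? "
    "WHERE turn_id = ?",
    ("completed", 1840, "2026-04-17T10:00:02Z", "t1"),
)
row = conn.execute(
    "SELECT phase, latency_ms, retry_index FROM chat_turn_journal "
    "WHERE turn_id = 't1'"
).fetchone()
```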

D. Visible runtime states¶

TerminalCallbacks is extended with phase-aware spinner text:

Phase	Spinner text
`connecting`	`Connecting… (Ns)`
`streaming`	`Streaming… (Ns)`
`thinking`	`Thinking… (esc to interrupt, Ns)`
`tool_running`	`Using <tool>… (Ns)`
`recovering`	`Recovering… (retry N)`

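Fallback / model-downgrade events are printed as visible lines by default and are never swallowed silently: with show_fallback_notice=False the console line is suppressed, but the event is still written to the turn journal.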
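The phase table can be read as a simple lookup. The sketch below assumes a free function `phase_label` with hypothetical parameters; the real `_phase_label()` may take different arguments. Falling back to the "thinking" label for an unknown phase is also an assumption.

```python
def phase_label(phase, elapsed_s, tool=None, retry=0):
    """Map a transport phase to spinner text (hypothetical signature)."""
    labels = {
        "connecting": f"Connecting… ({elapsed_s}s)",
        "streaming": f"Streaming… ({elapsed_s}s)",
        "thinking": f"Thinking… (esc to interrupt, {elapsed_s}s)",
        "tool_running": f"Using {tool}… ({elapsed_s}s)",
        "recovering": f"Recovering… (retry {retry})",
    }
    # Assumption: an unrecognized phase degrades to the generic label.
    return labels.get(phase, labels["thinking"])


tool_text = phase_label("tool_running", 4, tool="web_search")
retry_text = phase_label("recovering", 0, retry=2)
unknown_text = phase_label("warming_up", 7)
```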

E. Diagnostic command (`/diag` or `sb chat-doctor`)¶

brain/chat/commands/diag.py implements /diag with sub-commands:

/diag turns — recent turn journal (latency, phase, abort reason)
/diag retries — turns with retry_index > 0 or fallback_reason set
/diag tools — tool latency distribution, failure rate
/diag stalls — turns with transport_warning events (stream stalls)
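As an example of what one of these sub-commands might run, the sketch below expresses the `/diag retries` filter as SQL. The predicate comes from the description above; the reduced table, the sample rows, and the name `RETRIES_SQL` are illustrative assumptions rather than the query in `diag.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for chat_turn_journal with only the columns needed here.
conn.execute(
    "CREATE TABLE chat_turn_journal ("
    "turn_id TEXT, retry_index INTEGER, fallback_reason TEXT, latency_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO chat_turn_journal VALUES (?, ?, ?, ?)",
    [
        ("t1", 0, None, 900),             # clean turn
        ("t2", 1, None, 2400),            # retried once
        ("t3", 0, "provider_429", 3100),  # fell back without a retry
    ],
)

RETRIES_SQL = """
SELECT turn_id, retry_index, fallback_reason
FROM chat_turn_journal
WHERE retry_index > 0 OR fallback_reason IS NOT NULL
ORDER BY turn_id
"""
retried = conn.execute(RETRIES_SQL).fetchall()
```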

F. Enforced operator health policy (`/health`)¶

brain/chat/health_policy.py adds the chat-native health gate. It turns the turn journal, session events, context estimator, and cost tracker into one operator snapshot. The REPL checks the policy before and after natural-language turns:

approval mode creates a local approval request with tool_name=chat_health_policy and blocks further natural-language turns until it is approved.
pause mode blocks further natural-language turns immediately.
observe mode leaves the snapshot visible without enforcement.

Each enforcement writes policy_decision, approval_required where applicable, and a session_checkpoint with checkpoint_type=health_policy, so the audit trail explains what was attempted, what threshold was crossed, what policy applied, and how to resume.
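The three modes differ only in what happens once a threshold is crossed. A minimal sketch of that decision, assuming a hypothetical `HealthGate` class (the real gate in `health_policy.py` also writes the audit events described above, which are omitted here):

```python
from dataclasses import dataclass


@dataclass
class HealthGate:
    """Hypothetical gate; mode is 'observe', 'pause', or 'approval'."""
    mode: str
    paused: bool = False
    pending_approval: bool = False

    def on_threshold_crossed(self):
        if self.mode == "approval":
            # Block until the operator approves the pending request.
            self.pending_approval = True
        elif self.mode == "pause":
            self.paused = True
        # observe: the snapshot stays visible, nothing is enforced.

    def approve(self):
        self.pending_approval = False

    def allows_turn(self):
        return not (self.paused or self.pending_approval)


observe_gate = HealthGate("observe")
observe_gate.on_threshold_crossed()

pause_gate = HealthGate("pause")
pause_gate.on_threshold_crossed()

approval_gate = HealthGate("approval")
approval_gate.on_threshold_crossed()
blocked_before_approval = not approval_gate.allows_turn()
approval_gate.approve()
```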

Tool lifecycle invariants¶

Every tool call already has:
- call_id (ToolCall.id, provider-assigned)
- OTEL span: secondbrain.agent.tool_call
- on_tool_call / on_tool_result callbacks

After this change, the turn journal also records per-tool summary (name, latency_ms, outcome) in tool_summary_json on the turn row.

Invariant checks performed in ChatTransport.run_turn():
1. tool_call_id collision (same id requested twice) → logged as a warning
2. terminal_reason == "cancelled" + non-empty tool trace → logged as "tools abandoned at cancellation" diagnostic event
3. result.degraded and responding_provider != expected_provider → emits FallbackApplied event (visible + journalized)
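The three checks above can be sketched as one pure function. The function name, its arguments, and the finding strings are hypothetical; the actual checks in `ChatTransport` emit events and journal rows rather than returning a list.

```python
from collections import Counter


def check_tool_invariants(tool_call_ids, terminal_reason, degraded,
                          responding_provider, expected_provider):
    """Return a list of findings for one finished turn (illustrative)."""
    findings = []
    # 1. The same tool_call_id requested more than once.
    dupes = sorted(cid for cid, n in Counter(tool_call_ids).items() if n > 1)
    if dupes:
        findings.append("tool_call_id_collision:" + ",".join(dupes))
    # 2. Turn was cancelled while tools had already been requested.
    if terminal_reason == "cancelled" and tool_call_ids:
        findings.append("tools_abandoned_at_cancellation")
    # 3. A different provider answered than the one expected.
    if degraded and responding_provider != expected_provider:
        findings.append("fallback_applied")
    return findings


bad_turn = check_tool_invariants(
    ["a", "b", "a"], "cancelled", True, "backup", "primary"
)
clean_turn = check_tool_invariants(
    ["a"], "completed", False, "primary", "primary"
)
```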

Known limitations / next-step refactors¶

Network-idle timeout inside streaming — the content_idle_timeout_s is surfaced as a visible hint but does not abort the stream. A true network-idle abort requires wrapping provider.generate_stream() in a per-iteration select()-style watcher. Deferred because Anthropic SSE parsing is synchronous (no coroutine hooks).
Unified cancellation token — harness.cancel() is a threading.Event. A CancellationScope tree (parent turn → child tools → child rendering) is the right model but requires refactoring BoundedToolExecutor signatures. Current: REPL ESC → harness.cancel() → checked after each wave. Next: propagate scope to BoundedToolExecutor's DeadlineToken.
Provider heartbeat — Anthropic SSE sends : (comment-only) pings. These are currently discarded by the streaming parser. Surfacing them as Heartbeat events would let the network-idle watchdog distinguish live connections from dead sockets. Requires modifying anthropic_messages.py.
Mid-stream provider fallback — once a stream starts, ProviderChain commits to that provider. Mid-stream failover would require buffering partial completions and replaying; not implemented.
Tool delivered_to_model event — the responded_ids set in harness already tracks which tool results were appended to messages. Exposing this as a callback event (on_tool_delivered) would complete the tool lifecycle. Deferred to avoid callback protocol churn.
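To make the provider-heartbeat item concrete, the sketch below shows how comment-only SSE lines (lines beginning with `:`) could refresh a liveness timestamp without counting as content. `LivenessTracker` and its methods are hypothetical and do not reflect the parser in `anthropic_messages.py`; timestamps are plain seconds supplied by the caller.

```python
class LivenessTracker:
    """Hypothetical tracker separating socket liveness from content."""

    def __init__(self, network_idle_timeout_s, started_at):
        self.network_idle_timeout_s = network_idle_timeout_s
        self.last_alive_at = started_at
        self.last_content_at = None
        self.heartbeats = 0

    def observe(self, line, now):
        if line.startswith(":"):
            # SSE comment line: proves liveness, carries no content.
            kind = "heartbeat"
            self.heartbeats += 1
        elif line.startswith("data:"):
            kind = "content"
            self.last_content_at = now
        else:
            kind = "other"
        # Any received line shows the socket is still delivering bytes.
        self.last_alive_at = now
        return kind

    def network_idle(self, now):
        return now - self.last_alive_at >= self.network_idle_timeout_s


tracker = LivenessTracker(45.0, started_at=0.0)
kinds = [
    tracker.observe(": ping", 30.0),
    tracker.observe('data: {"type": "content_block_delta"}', 31.0),
    tracker.observe(": ping", 70.0),
]
alive_at_100 = not tracker.network_idle(100.0)
dead_at_120 = tracker.network_idle(120.0)
```

With pings counted, a connection that is silent on content but still pinging stays classified as live, which is the distinction the watchdog currently cannot make.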

Implementation status (as of 2026-04-17)¶

Wiring map¶

Concern	Code path
REPL entry → transport	`ChatRepl.__init__` instantiates `ChatTransport` and stores it on `CommandContext.transport` (`brain/chat/repl.py`). `_run_turn_interruptible` calls `self._transport.run_turn(...)`.
Transport wrapper	`ChatTransport.run_turn` (`brain/chat/transport.py`) — wraps `harness.callbacks` with `_PhaseTrackingCallbacks`, starts `_ContentIdleWatchdog`, records turn row, prints fallback notices.
Degradation signal	`TurnRuntimeState.degraded/degradation_reason/responding_provider/provider_failures` populated by `harness._execute_turn_loop_v2`, copied to `TurnResult` via `TurnFinalizer.finalize`. Transport reads these fields.
Per-tool journaling	`_JournalingToolCallback` in `brain/chat/repl.py` — installed on the harness callback chain. Writes `record_tool_event` for every `on_tool_call` and `on_tool_result`; detects duplicate `tool_call_id`.
Phase-aware spinner	`TerminalCallbacks.set_phase()` + `_phase_label()` in `brain/agent/callbacks.py`. Transport pushes phase changes into TerminalCallbacks each time it observes `on_stream_chunk` / `on_tool_call` / `on_tool_result`.
Visible fallback notice	`TerminalCallbacks.print_transport_notice("fallback", ...)` — called from `ChatTransport._print_fallback_notice`. Always stops the spinner before printing.
Invariant checks	`ChatTransport._check_tool_invariants` — runs after each turn; writes `tools_abandoned_at_cancellation` or `orphan_tool_completion` transport warnings.
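The fire-and-reset behaviour of a content-idle watchdog can be sketched without threads by injecting the clock. This is an illustrative stand-in, not `_ContentIdleWatchdog` itself: the constructor arguments, `poll()`, and the fire-once-per-idle-period rule are assumptions.

```python
import time


class ContentIdleWatchdog:
    """Hypothetical poll-based watchdog with an injectable clock."""

    def __init__(self, timeout_s, on_idle, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._on_idle = on_idle
        self._clock = clock
        self._last_content = clock()
        self._fired = False

    def reset(self):
        # Call on every content chunk; re-arms the watchdog.
        self._last_content = self._clock()
        self._fired = False

    def poll(self):
        # Fire at most once per idle period; never abort the stream.
        idle_for = self._clock() - self._last_content
        if not self._fired and idle_for >= self._timeout_s:
            self._fired = True
            self._on_idle()
            return True
        return False


now = [0.0]
notices = []
watchdog = ContentIdleWatchdog(
    120.0, lambda: notices.append("still thinking"), clock=lambda: now[0]
)

now[0] = 119.0
before_deadline = watchdog.poll()
now[0] = 121.0
at_deadline = watchdog.poll()
repeat = watchdog.poll()

watchdog.reset()  # a content chunk arrived at t=121
now[0] = 200.0
after_reset = watchdog.poll()
now[0] = 241.0
second_fire = watchdog.poll()
```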

CLI / REPL entry points¶

Command	What it does	File
`sb chat-doctor [summary|turns|retries|tools|stalls]`	Read-only query over `chat_turn_journal` from outside the REPL (e.g. post-incident triage).	`brain/cli/chat.py`
`/diag` in REPL (alias `/doctor`)	Same queries rendered inside the running session.	`brain/chat/commands/diag.py`
`/health [status|risks|policy|approve|deny|resume|observe|json]`	Live enforced health snapshot and recovery controls for the running REPL session.	`brain/chat/commands/health.py`

Test coverage¶

File	What it covers
`tests/chat/test_turn_journal.py`	Schema creation, record_start/end, transport warnings, fallback recording, per-tool events, all `query_*` methods, disabled-journal no-op fallback.
`tests/chat/test_chat_transport.py`	Happy-path journal row, kwargs forwarding, fallback notice (degraded=True), silent-downgrade (mismatched responding_provider), suppressed console when `show_fallback_notice=False`, harness exception records abort_reason, abandoned/orphan tool invariants, phase propagation to TerminalCallbacks, `_ContentIdleWatchdog` fire + reset, `_JournalingToolCallback` duplicate detection, composite chain walk.
`tests/chat/test_chat_transport_soak.py`	Long session (20 turns) without leakage, alternating success/fallback, exception does not poison next turn, repeated overload counter, `_current_turn_id` reset between turns.
`tests/chat/test_diag_command.py`	All `/diag` sub-commands render Rich tables correctly, empty-journal hint, missing-transport hint, return-False-to-stay-in-REPL invariant.

Run just the reliability tests::

.venv/bin/python -m pytest tests/chat/test_turn_journal.py \
    tests/chat/test_chat_transport.py \
    tests/chat/test_chat_transport_soak.py \
    tests/chat/test_diag_command.py -q

Operator notes¶

The journal table lives in the same DB as EventLog.db_path (runtime.db). The migration is idempotent and runs lazily on first TurnJournal instantiation.
sb chat-doctor is the post-mortem entry point when the REPL has already exited. /diag is the live counterpart during a session.
Silent fallbacks are gone: every ProviderChain failover now either prints a yellow ⚠ provider fallback: line or, if show_fallback_notice=False, records the reason in chat_turn_journal.fallback_reason. Neither path loses the signal.
To reset the journal during testing, delete runtime.db or point SB_STATE_DIR at an empty directory. The migration recreates the schema on the next sb chat invocation.
