sb chat Reliability Architecture¶
Added 2026-04-16 — see brain/chat/transport_events.py, brain/chat/transport.py,
brain/chat/turn_journal.py, brain/chat/commands/diag.py.
Related frontend contract: specs/agent-harness-frontend-contract.md.
Current flow (before this change)¶
User input
→ ChatRepl.run() in repl.py
→ harness.run_turn()
→ TurnPreparer.prepare() — message assembly
→ provider.generate() or _consume_stream() — LLM call
→ ToolScheduler.schedule() — wave partitioning
→ BoundedToolExecutor.execute() — per-tool deadline
→ TurnFinalizer.finalize() — memory, citations
→ TurnResult returned to REPL
→ render to terminal
OTEL tracing already wraps LLM calls and tool waves with spans.
ProviderChain already handles failover/cooldown between providers and
exposes result.degraded, result.responding_provider, and
result.provider_failures.
TurnRuntimeState.close_writes() guards late-arriving tool results via a write-gate that timed-out threads check before committing side-effects.
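The write-gate idea can be sketched as follows. This is a minimal illustration, not the actual TurnRuntimeState: the WriteGate class and its commit() method are hypothetical names, and only close_writes() is taken from the text above. The point it shows is that the open check and the side-effect run under the same lock.

```python
import threading
from typing import Callable


class WriteGate:
    """Hypothetical stand-in for the gate held by TurnRuntimeState."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._open = True

    def close_writes(self) -> None:
        # Called once the turn has timed out or been cancelled.
        with self._lock:
            self._open = False

    def commit(self, side_effect: Callable[[], None]) -> bool:
        # Check and write under one lock, so a late thread cannot slip
        # a write in between the check and the close.
        with self._lock:
            if not self._open:
                return False
            side_effect()
            return True


results: list[str] = []
gate = WriteGate()
committed_early = gate.commit(lambda: results.append("on-time result"))
gate.close_writes()
committed_late = gate.commit(lambda: results.append("late result"))
```

A timed-out tool thread that calls commit() after close_writes() gets False back and its result is dropped rather than written.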
Current failure points¶
| Failure | Current handling | Gap |
|---|---|---|
| Provider 429/503 | ProviderChain failover + cooldown | Fallback is silent — user never sees it |
| Stream stalls (no bytes) | Single timeout_s deadline on entire turn | No separate network-idle vs content-idle distinction |
| Connection never starts | Same timeout_s | No fast connection timeout |
| LLM thinking long (alive, no output) | No progress indicator update | Spinner just sits; user cannot distinguish stall from thinking |
| Tool completes after cancellation | Late-write gate (close_writes) | No structured event for post-cancel completion |
| No turn_id | trace_id = session_id:call_id | Cannot join tool events to a specific turn across sessions |
| No phase visibility | Spinner says "Thinking…" throughout | User cannot tell: connecting / streaming / running tools / recovering |
| No diagnostic query path | None | Cannot retroactively detect orphaned tools, retries, fallbacks |
Chosen implementation¶
A. Normalized transport events (brain/chat/transport_events.py)¶
A flat dataclass hierarchy for all observable events in a turn. These are emitted by the transport layer and consumed by:
- TurnJournal (durable storage)
- TerminalCallbacks (visible status)
- Observability/metrics hooks
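A flat event hierarchy of this kind might look like the sketch below. The event names TurnStarted, TurnCompleted, and FallbackApplied appear in this document; the base class name, every field, and the to_record() helper are assumptions made for illustration and may not match transport_events.py.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TransportEvent:
    """Assumed common base: every event belongs to one turn."""
    turn_id: str


@dataclass(frozen=True)
class TurnStarted(TransportEvent):
    provider: str


@dataclass(frozen=True)
class TurnCompleted(TransportEvent):
    latency_ms: int


@dataclass(frozen=True)
class FallbackApplied(TransportEvent):
    expected_provider: str
    responding_provider: str
    reason: str


def to_record(event: TransportEvent) -> dict:
    # Flatten an event into a dict that a journal or metrics hook can store.
    return {"type": type(event).__name__, **asdict(event)}


event = FallbackApplied(
    turn_id="t-1",
    expected_provider="primary",
    responding_provider="backup",
    reason="429",
)
record = to_record(event)
```

Keeping the hierarchy one level deep means each consumer can dispatch on the event type alone, without walking nested payloads.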
B. Transport config + monitor (brain/chat/transport.py)¶
TransportConfig separates three timeout concepts:
| Timeout | Meaning | Default | Behavior on breach |
|---|---|---|---|
| connection_timeout_s | Time until provider sends first byte | 30 s | Abort, raise |
| network_idle_timeout_s | No bytes received for N seconds | 45 s | Abort, allow fallback |
| content_idle_timeout_s | Alive connection, no content yet | 120 s | Show "still thinking" — do not abort |
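The three timeouts can be read as a small decision function. The sketch below uses the field names and defaults from the table; classify_stall(), its arguments, and its return strings are hypothetical and only illustrate how the three concepts differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransportConfig:
    connection_timeout_s: float = 30.0
    network_idle_timeout_s: float = 45.0
    content_idle_timeout_s: float = 120.0


def classify_stall(
    cfg: TransportConfig,
    since_start_s: float,
    since_last_byte_s: Optional[float],
    since_last_content_s: Optional[float],
) -> str:
    """Return 'abort', 'abort_allow_fallback', 'still_thinking', or 'ok'.

    None means no byte (or no content) has arrived yet; this convention
    is an assumption of the sketch.
    """
    if since_last_byte_s is None:
        # Nothing received yet: only the connection timeout applies.
        return "abort" if since_start_s > cfg.connection_timeout_s else "ok"
    if since_last_byte_s > cfg.network_idle_timeout_s:
        return "abort_allow_fallback"
    content_gap = since_start_s if since_last_content_s is None else since_last_content_s
    if content_gap > cfg.content_idle_timeout_s:
        # The connection is alive, so hint to the user instead of aborting.
        return "still_thinking"
    return "ok"


cfg = TransportConfig()
```

For example, a stream that received a byte 2 s ago but no content for 130 s classifies as still_thinking, while one with no bytes for 50 s classifies as abort_allow_fallback.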
ChatTransport wraps harness.run_turn():
- Emits TurnStarted / TurnCompleted / TurnFailed / TurnCancelled
- Detects result.degraded → emits FallbackApplied, prints visible warning
- Journalizes every turn to chat_turn_journal SQLite table
- Feeds the REPL health monitor, which can pause or request approval when the
session crosses operator-configured runtime, token, cost, context, loop,
tool-failure, or latency thresholds.
C. Structured turn journal (brain/chat/turn_journal.py)¶
New SQLite table chat_turn_journal (migration 048) with fields:
turn_id, session_id, trace_id, provider, model_requested, model_actual
phase, retry_index, fallback_reason, abort_reason
latency_ms, input_tokens, output_tokens
transport_warnings_json, tool_summary_json
created_at, completed_at
TurnJournal wraps EventLog and writes structured rows on turn lifecycle events.
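As a rough illustration of the table shape, the sketch below creates chat_turn_journal in an in-memory SQLite database and records one turn. Column names come from the field list above; the column types, the primary key, the default for retry_index, and the phase values are assumptions, and the real migration 048 may differ.

```python
import sqlite3

# Column names follow the field list; types and constraints are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chat_turn_journal (
    turn_id TEXT PRIMARY KEY,
    session_id TEXT,
    trace_id TEXT,
    provider TEXT,
    model_requested TEXT,
    model_actual TEXT,
    phase TEXT,
    retry_index INTEGER DEFAULT 0,
    fallback_reason TEXT,
    abort_reason TEXT,
    latency_ms INTEGER,
    input_tokens INTEGER,
    output_tokens INTEGER,
    transport_warnings_json TEXT,
    tool_summary_json TEXT,
    created_at TEXT,
    completed_at TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(SCHEMA)  # IF NOT EXISTS makes a second run a no-op

# Row written when the turn starts.
conn.execute(
    "INSERT INTO chat_turn_journal (turn_id, session_id, phase, created_at) "
    "VALUES (?, ?, ?, datetime('now'))",
    ("t-1", "s-1", "connecting"),
)
# Same row updated when the turn ends.
conn.execute(
    "UPDATE chat_turn_journal SET phase = ?, latency_ms = ?, "
    "completed_at = datetime('now') WHERE turn_id = ?",
    ("completed", 840, "t-1"),
)
row = conn.execute(
    "SELECT phase, latency_ms, retry_index FROM chat_turn_journal WHERE turn_id = ?",
    ("t-1",),
).fetchone()
```

One row per turn, updated in place at completion, keeps the later diagnostic queries to simple single-table filters.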
D. Visible runtime states¶
TerminalCallbacks is extended with phase-aware spinner text:
| Phase | Spinner text |
|---|---|
| connecting | Connecting… (Ns) |
| streaming | Streaming… (Ns) |
| thinking | Thinking… (esc to interrupt, Ns) |
| tool_running | Using <tool>… (Ns) |
| recovering | Recovering… (retry N) |
Fallback / model-downgrade events are always printed as visible lines, never swallowed silently.
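The phase-to-text mapping could be expressed as a single function. This is a sketch built from the spinner table in this section; the function name, its parameters, and the choice to fall back to the thinking label for unknown phases are assumptions rather than the behavior of the real _phase_label().

```python
def phase_label(phase: str, elapsed_s: int, tool: str = "", retry: int = 0) -> str:
    """Build spinner text for a phase, following the spinner table."""
    if phase == "connecting":
        return f"Connecting… ({elapsed_s}s)"
    if phase == "streaming":
        return f"Streaming… ({elapsed_s}s)"
    if phase == "tool_running":
        return f"Using {tool}… ({elapsed_s}s)"
    if phase == "recovering":
        return f"Recovering… (retry {retry})"
    # "thinking" also serves as the fallback for unknown phases (assumed).
    return f"Thinking… (esc to interrupt, {elapsed_s}s)"
```

For instance, phase_label("tool_running", 7, tool="search") yields "Using search… (7s)".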
E. Diagnostic command (/diag or sb chat-doctor)¶
brain/chat/commands/diag.py implements /diag with sub-commands:
- /diag turns: recent turn journal (latency, phase, abort reason)
- /diag retries: turns with retry_index > 0 or fallback_reason set
- /diag tools: tool latency distribution, failure rate
- /diag stalls: turns with transport_warning events (stream stalls)
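The retries filter translates directly into SQL. The sketch below runs it against a cut-down, in-memory copy of chat_turn_journal; the query_retries() name and signature are hypothetical, and treating "fallback_reason set" as IS NOT NULL is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns this query needs; the real table has more.
conn.execute(
    "CREATE TABLE chat_turn_journal ("
    "turn_id TEXT, retry_index INTEGER, fallback_reason TEXT, latency_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO chat_turn_journal VALUES (?, ?, ?, ?)",
    [
        ("t-1", 0, None, 900),    # clean turn
        ("t-2", 1, None, 4200),   # retried once
        ("t-3", 0, "429", 2100),  # served by a fallback provider
    ],
)


def query_retries(db: sqlite3.Connection, limit: int = 20) -> list[tuple]:
    # Most recent first; rowid order stands in for insertion time here.
    return db.execute(
        "SELECT turn_id, retry_index, fallback_reason, latency_ms "
        "FROM chat_turn_journal "
        "WHERE retry_index > 0 OR fallback_reason IS NOT NULL "
        "ORDER BY rowid DESC LIMIT ?",
        (limit,),
    ).fetchall()


rows = query_retries(conn)
```

The clean turn t-1 is excluded; t-3 and t-2 come back newest first.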
F. Enforced operator health policy (/health)¶
brain/chat/health_policy.py adds the chat-native health gate. It turns the
turn journal, session events, context estimator, and cost tracker into one
operator snapshot. The REPL checks the policy before and after natural-language
turns:
- approval mode creates a local approval request with tool_name=chat_health_policy and blocks further natural-language turns until it is approved.
- pause mode blocks further natural-language turns immediately.
- observe mode leaves the snapshot visible without enforcement.
Each enforcement writes policy_decision, approval_required where applicable,
and a session_checkpoint with checkpoint_type=health_policy, so the audit
trail explains what was attempted, what threshold was crossed, what policy
applied, and how to resume.
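The three modes behave like a small state machine. The sketch below is illustrative only: HealthGate and its methods are hypothetical names, the returned strings are placeholders, and the real health_policy.py also writes the audit records described above, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass
class HealthGate:
    mode: str  # "approval", "pause", or "observe"
    blocked: bool = False
    pending_approval: bool = False

    def on_threshold_crossed(self) -> str:
        if self.mode == "approval":
            # Block until an operator approves the pending request.
            self.blocked = True
            self.pending_approval = True
            return "approval_required"
        if self.mode == "pause":
            self.blocked = True
            return "paused"
        # observe: record the crossing, enforce nothing.
        return "observed"

    def approve(self) -> None:
        if self.pending_approval:
            self.pending_approval = False
            self.blocked = False

    def allows_turn(self) -> bool:
        return not self.blocked


approval_gate = HealthGate("approval")
approval_decision = approval_gate.on_threshold_crossed()
blocked_before_approval = not approval_gate.allows_turn()
approval_gate.approve()

pause_gate = HealthGate("pause")
pause_decision = pause_gate.on_threshold_crossed()

observe_gate = HealthGate("observe")
observe_decision = observe_gate.on_threshold_crossed()
```

The REPL would call allows_turn() before each natural-language turn and on_threshold_crossed() whenever the snapshot exceeds a configured limit.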
Tool lifecycle invariants¶
Every tool call already has:
- call_id (ToolCall.id, provider-assigned)
- OTEL span: secondbrain.agent.tool_call
- on_tool_call / on_tool_result callbacks
After this change, the turn journal also records per-tool summary
(name, latency_ms, outcome) in tool_summary_json on the turn row.
Invariant checks performed in ChatTransport.run_turn():
1. tool_call_id collision (same id requested twice) — logged as warning
2. terminal_reason == "cancelled" + non-empty tool trace → logged as
"tools abandoned at cancellation" diagnostic event
3. result.degraded and responding_provider != expected_provider → emits
FallbackApplied event (visible + journalized)
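The three checks can be sketched as one pure function over the turn result. The function name, its flat argument list, and the finding strings are assumptions for illustration; the real ChatTransport works on TurnResult and emits events rather than returning strings.

```python
from collections import Counter
from typing import Optional


def check_tool_invariants(
    tool_call_ids: list[str],
    terminal_reason: str,
    degraded: bool,
    responding_provider: Optional[str],
    expected_provider: str,
) -> list[str]:
    findings: list[str] = []

    # 1. The same tool_call_id requested more than once.
    dupes = sorted(i for i, n in Counter(tool_call_ids).items() if n > 1)
    if dupes:
        findings.append("tool_call_id_collision:" + ",".join(dupes))

    # 2. Turn was cancelled while tools had already been requested.
    if terminal_reason == "cancelled" and tool_call_ids:
        findings.append("tools_abandoned_at_cancellation")

    # 3. A different provider answered than the one expected.
    if degraded and responding_provider != expected_provider:
        findings.append("fallback_applied")

    return findings


bad = check_tool_invariants(["a", "b", "a"], "cancelled", True, "backup", "primary")
good = check_tool_invariants(["a"], "completed", False, "primary", "primary")
```

A healthy turn returns an empty list, so the caller only has work to do when something was violated.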
Known limitations / next-step refactors¶
- Network-idle timeout inside streaming: the content_idle_timeout_s is surfaced as a visible hint but does not abort the stream. A true network-idle abort requires wrapping provider.generate_stream() in a per-iteration select()-style watcher. Deferred because Anthropic SSE parsing is synchronous (no coroutine hooks).
- Unified cancellation token: harness.cancel() is a threading.Event. A CancellationScope tree (parent turn → child tools → child rendering) is the right model but requires refactoring BoundedToolExecutor signatures. Current: REPL ESC → harness.cancel() → checked after each wave. Next: propagate scope to BoundedToolExecutor's DeadlineToken.
- Provider heartbeat: Anthropic SSE sends : (comment-only) pings. These are currently discarded by the streaming parser. Surfacing them as Heartbeat events would let the network-idle watchdog distinguish live connections from dead sockets. Requires modifying anthropic_messages.py.
- Mid-stream provider fallback: once a stream starts, ProviderChain commits to that provider. Mid-stream failover would require buffering partial completions and replaying; not implemented.
- Tool delivered_to_model event: the responded_ids set in harness already tracks which tool results were appended to messages. Exposing this as a callback event (on_tool_delivered) would complete the tool lifecycle. Deferred to avoid callback protocol churn.
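To make the proposed CancellationScope tree concrete, here is one possible shape. Nothing in this sketch exists in the codebase yet: the class layout, the cancelled property, and the rule that a child attached to an already-cancelled parent starts cancelled are all assumptions about how such a refactor might look.

```python
import threading
from typing import Optional


class CancellationScope:
    """Hypothetical scope tree: cancelling a parent cancels all descendants."""

    def __init__(self, parent: Optional["CancellationScope"] = None) -> None:
        self._event = threading.Event()
        self._lock = threading.Lock()
        self._children: list["CancellationScope"] = []
        if parent is not None:
            parent._attach(self)

    def _attach(self, child: "CancellationScope") -> None:
        with self._lock:
            self._children.append(child)
            already_cancelled = self._event.is_set()
        if already_cancelled:
            # Late children of a cancelled parent start out cancelled.
            child.cancel()

    def cancel(self) -> None:
        self._event.set()
        with self._lock:
            children = list(self._children)
        for child in children:
            child.cancel()

    @property
    def cancelled(self) -> bool:
        return self._event.is_set()


turn = CancellationScope()
tool = CancellationScope(turn)
render = CancellationScope(turn)

tool.cancel()  # cancelling a child leaves the parent running
turn_alive_after_tool_cancel = not turn.cancelled

turn.cancel()  # cancelling the turn reaches every child
late_child = CancellationScope(turn)
```

Cancellation only flows downward, so aborting one tool does not end the turn, while ESC on the turn scope reaches tools and rendering alike.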
Implementation status (as of 2026-04-17)¶
Wiring map¶
| Concern | Code path |
|---|---|
| REPL entry → transport | ChatRepl.__init__ instantiates ChatTransport and stores it on CommandContext.transport (brain/chat/repl.py). _run_turn_interruptible calls self._transport.run_turn(...). |
| Transport wrapper | ChatTransport.run_turn (brain/chat/transport.py) — wraps harness.callbacks with _PhaseTrackingCallbacks, starts _ContentIdleWatchdog, records turn row, prints fallback notices. |
| Degradation signal | TurnRuntimeState.degraded/degradation_reason/responding_provider/provider_failures populated by harness._execute_turn_loop_v2, copied to TurnResult via TurnFinalizer.finalize. Transport reads these fields. |
| Per-tool journaling | _JournalingToolCallback in brain/chat/repl.py — installed on the harness callback chain. Writes record_tool_event for every on_tool_call and on_tool_result; detects duplicate tool_call_id. |
| Phase-aware spinner | TerminalCallbacks.set_phase() + _phase_label() in brain/agent/callbacks.py. Transport pushes phase changes into TerminalCallbacks each time it observes on_stream_chunk / on_tool_call / on_tool_result. |
| Visible fallback notice | TerminalCallbacks.print_transport_notice("fallback", ...) — called from ChatTransport._print_fallback_notice. Always stops the spinner before printing. |
| Invariant checks | ChatTransport._check_tool_invariants — runs after each turn; writes tools_abandoned_at_cancellation or orphan_tool_completion transport warnings. |
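The fire-and-reset behavior of a content-idle watchdog can be illustrated with a polling variant. This is a simplified sketch with an injectable clock so it runs deterministically; the real _ContentIdleWatchdog presumably runs in the background, and the class name, poll() method, and callback signature here are assumptions.

```python
import time
from typing import Callable


class ContentIdleWatchdog:
    """Hypothetical polling variant: fires at most once per idle period."""

    def __init__(
        self,
        timeout_s: float,
        on_idle: Callable[[float], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._on_idle = on_idle
        self._clock = clock
        self._last_content = clock()
        self._fired = False

    def reset(self) -> None:
        # Call whenever a content chunk arrives.
        self._last_content = self._clock()
        self._fired = False

    def poll(self) -> bool:
        # Returns True only on the poll that triggers the idle hint.
        idle_for = self._clock() - self._last_content
        if idle_for >= self._timeout_s and not self._fired:
            self._fired = True
            self._on_idle(idle_for)
            return True
        return False


now = [0.0]  # fake clock so the example is deterministic
hints: list[float] = []
dog = ContentIdleWatchdog(120.0, hints.append, clock=lambda: now[0])

now[0] = 119.0
fired_early = dog.poll()
now[0] = 121.0
fired_on_time = dog.poll()
fired_twice = dog.poll()

dog.reset()  # a content chunk arrived at t=121
now[0] = 200.0
fired_after_reset = dog.poll()
```

Firing once per idle period matches the intent of a visible "still thinking" hint: the user is told once, not on every poll.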
CLI / REPL entry points¶
| Command | What it does | File |
|---|---|---|
| sb chat-doctor [summary\|turns\|retries\|tools\|stalls] | Read-only query over chat_turn_journal from outside the REPL (e.g. post-incident triage). | brain/cli/chat.py |
| /diag in REPL (alias /doctor) | Same queries rendered inside the running session. | brain/chat/commands/diag.py |
| /health [status\|risks\|policy\|approve\|deny\|resume\|observe\|json] | Live enforced health snapshot and recovery controls for the running REPL session. | brain/chat/commands/health.py |
Test coverage¶
| File | What it covers |
|---|---|
| tests/chat/test_turn_journal.py | Schema creation, record_start/end, transport warnings, fallback recording, per-tool events, all query_* methods, disabled-journal no-op fallback. |
| tests/chat/test_chat_transport.py | Happy-path journal row, kwargs forwarding, fallback notice (degraded=True), silent-downgrade (mismatched responding_provider), suppressed console when show_fallback_notice=False, harness exception records abort_reason, abandoned/orphan tool invariants, phase propagation to TerminalCallbacks, _ContentIdleWatchdog fire + reset, _JournalingToolCallback duplicate detection, composite chain walk. |
| tests/chat/test_chat_transport_soak.py | Long session (20 turns) without leakage, alternating success/fallback, exception does not poison next turn, repeated overload counter, _current_turn_id reset between turns. |
| tests/chat/test_diag_command.py | All /diag sub-commands render Rich tables correctly, empty-journal hint, missing-transport hint, return-False-to-stay-in-REPL invariant. |
Run just the reliability tests::
.venv/bin/python -m pytest tests/chat/test_turn_journal.py \
tests/chat/test_chat_transport.py \
tests/chat/test_chat_transport_soak.py \
tests/chat/test_diag_command.py -q
Operator notes¶
- Journal table lives in the same DB as EventLog.db_path (runtime.db). The migration is idempotent; it runs lazily on first TurnJournal instantiation.
- sb chat-doctor is the post-mortem entry point when the REPL has already exited. /diag is the live counterpart during a session.
- Silent fallbacks are gone: every ProviderChain failover now either prints a yellow ⚠ provider fallback: line or, if show_fallback_notice=False, writes the row to chat_turn_journal.fallback_reason. Neither path loses the signal.
- To reset the journal during testing, delete runtime.db or point SB_STATE_DIR at an empty directory. The migration recreates the schema on the next sb chat invocation.