Observability

Current as of 2026-04-25.

SecondBrain's observability is local-first. The goal is to make a run inspectable on the user's machine without requiring an external telemetry platform.

What Exists

structured SQLite state across the split DB topology
chat/session persistence through EventLog and SessionStore
session_events for streamable chat, gateway, and background-session events
typed chat stream events in brain/serve/chat_runtime.py
React stream reduction and drift warnings in serve-ui/src/lib/chat.ts
runtime observation envelopes in brain/obs/
decision traces exposed through sb trace / sb traces
background-session heartbeats, checkpoints, artifacts, and terminal state
agent health snapshots derived from local session events, tool calls, checkpoints, memory proposals, approvals, budget signals, and recovery state
quality, eval, simulation, conversation-improvement, and autotune stores

What A User Should Be Able To Inspect

what request or automation ran
which context, memory, or retrieval records were used
what provider/model handled the turn
what tools were selected, called, denied, failed, or timed out
what policy or approval decision was applied
what artifact, answer, outbound message, or memory update was produced
why the run ended
whether the agent is running, stuck, waiting for approval, burning budget, losing tool calls, missing memory context, or safe to resume/branch

Main Places To Inspect

sb sessions list and sb sessions show
sb traces ...
sb quality ...
sb autotune ...
sb daemon status
the operator UI run/chat/quality/approvals pages
GET /sessions/{session_id}/health/stream for real-time background-session health snapshots
SQLite databases under settings.paths.state_dir
artifacts, logs, reports, and output directories under local state paths
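
For a quick look at what the local databases contain, a small read-only script is often enough. The sketch below is illustrative rather than part of SecondBrain: it assumes the database files under the state directory use a `.db` suffix, and the helper name `list_sqlite_tables` is hypothetical. It takes the state directory as an argument and uses only the Python standard library.

```python
import sqlite3
from pathlib import Path


def list_sqlite_tables(state_dir: Path) -> dict[str, list[str]]:
    """Map each SQLite file under state_dir to its sorted table names."""
    tables: dict[str, list[str]] = {}
    # Assumption: database files end in .db; adjust the glob if they do not.
    for db_path in sorted(state_dir.rglob("*.db")):
        # Open read-only so inspection can never modify run state.
        uri = db_path.resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            ).fetchall()
        finally:
            conn.close()
        tables[db_path.name] = [name for (name,) in rows]
    return tables
```

Opening each file with `mode=ro` keeps the inspection side-effect free, which matters when a daemon or background session may still be writing to the same database.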

Main Implementation Areas

brain/obs/
brain/state/event_log.py
brain/chat/session_store.py
brain/serve/chat_runtime.py
brain/serve/routers/core.py
brain/serve/payloads.py
brain/serve/routers/sessions.py
serve-ui/src/lib/chat.ts
serve-ui/src/pages/AgentCockpitPage.tsx
serve-ui/src/pages/AgentsPage.tsx
brain/background_sessions/store.py
brain/quality/
brain/eval/
brain/evals/
brain/simulations/
brain/improvement/
brain/autotune/

Stream Contract Check

The serve chat stream contract is intentionally explicit:

server event names live in STREAM_EVENT_TYPES
the UI handled set lives in HANDLED_EVENT_TYPES
GET /stream-events exposes the server list
the browser warns if it receives an event the reducer does not handle

When adding a new chat event, update the server callbacks, server event list, UI reducer, and chat-page rendering in the same diff.
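
The drift check described above amounts to a set comparison between the two lists. The following is a minimal sketch of that idea in Python, not the actual reducer logic in `serve-ui/src/lib/chat.ts`; the function name `find_stream_drift` and the event names in the usage example are hypothetical.

```python
from collections.abc import Iterable


def find_stream_drift(
    server_events: Iterable[str],
    handled_events: Iterable[str],
) -> dict[str, list[str]]:
    """Compare the server's event names with the set the UI handles."""
    server = set(server_events)
    handled = set(handled_events)
    return {
        # Events the server can emit but the UI reducer would ignore.
        "unhandled_by_ui": sorted(server - handled),
        # Events the UI still handles but the server no longer lists.
        "unknown_to_server": sorted(handled - server),
    }


# Hypothetical event names, for illustration only.
drift = find_stream_drift(
    server_events={"token", "tool_call", "done"},
    handled_events={"token", "done"},
)
print(drift["unhandled_by_ui"])
```

A check of this shape could run in a test that reads the server list (for example from `GET /stream-events`) and fails when either side of the result is non-empty.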

Agent Health Snapshot

Background sessions expose one operator-facing health snapshot per active or recent run. The snapshot is computed from local state and does not depend on an external telemetry service. It includes status, score, risk flags, last event/tool, warnings, approval state, evidence links, memory scope/provenance, budget usage, audit trail rows, and recovery hints.

The snapshot turns raw events into operator-facing state. It detects stale or missing heartbeats, approval waits, loop warnings, repeated plan text, repeated failed tools, slow or escalating tool latency, token/cost/context pressure, retries without progress, context growth without evidence, missing memory context, provider degradation, retry exhaustion, expiry, and safe resume/branch points.
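
Two of these detections, stale or missing heartbeats and repeated failed tools, can be sketched as simple rules over a list of events. This is an illustrative sketch only: the `Event` shape, the event kinds `heartbeat` and `tool_failed`, the flag names, and the default thresholds are all assumptions, not SecondBrain's actual schema or values.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                   # hypothetical kinds: "heartbeat", "tool_failed"
    at: datetime                # when the event was recorded
    tool: Optional[str] = None  # tool name, for tool events only


def risk_flags(
    events: list[Event],
    now: datetime,
    heartbeat_timeout: timedelta = timedelta(minutes=2),  # assumed default
    max_tool_failures: int = 3,                           # assumed default
) -> list[str]:
    """Derive a few risk flags from raw session events."""
    flags: list[str] = []

    # A run with no heartbeat at all is flagged differently from a stale one.
    heartbeats = [e.at for e in events if e.kind == "heartbeat"]
    if not heartbeats:
        flags.append("missing_heartbeat")
    elif now - max(heartbeats) > heartbeat_timeout:
        flags.append("stale_heartbeat")

    # Count failures per tool and flag any tool at or over the threshold.
    failures = Counter(
        e.tool for e in events if e.kind == "tool_failed" and e.tool
    )
    for tool, count in sorted(failures.items()):
        if count >= max_tool_failures:
            flags.append(f"repeated_tool_failure:{tool}")

    return flags
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rules deterministic and easy to test against recorded events.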

Background sessions also carry a local health_policy in session metadata. New sessions get a default policy, and the serve UI exposes the main operator budgets: runtime minutes, total tokens, cost, context percentage, and whether a breach should request approval or pause automatically. The supervisor and runtime enforce those thresholds from local events, tool observations, checkpoints, and artifacts. Approval escalation uses the existing local approval store; pause escalation records a policy decision and checkpoint before stopping the session.
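
The threshold check behind such a policy can be pictured as a comparison of current usage against each budget, followed by the configured escalation. The sketch below is an assumption-laden illustration: the field names, the `Usage` type, and the `"approval"` / `"pause"` / `"continue"` action strings are hypothetical and do not reflect the real `health_policy` schema.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class HealthPolicy:
    # Hypothetical field names mirroring the budgets listed above.
    max_runtime_minutes: float
    max_total_tokens: int
    max_cost: float
    max_context_pct: float
    on_breach: Literal["approval", "pause"] = "approval"


@dataclass(frozen=True)
class Usage:
    runtime_minutes: float
    total_tokens: int
    cost: float
    context_pct: float


def evaluate_policy(policy: HealthPolicy, usage: Usage) -> tuple[str, list[str]]:
    """Return the action to take and the list of breached budgets."""
    breaches: list[str] = []
    if usage.runtime_minutes > policy.max_runtime_minutes:
        breaches.append("runtime_minutes")
    if usage.total_tokens > policy.max_total_tokens:
        breaches.append("total_tokens")
    if usage.cost > policy.max_cost:
        breaches.append("cost")
    if usage.context_pct > policy.max_context_pct:
        breaches.append("context_pct")

    # Escalate only when at least one budget is exceeded.
    action = policy.on_breach if breaches else "continue"
    return action, breaches
```

Returning the breached budgets alongside the action gives the caller enough detail to record why an approval was requested or why the session was paused.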

Guidance

Prefer adding structured local events and clear inspection commands before adding a new telemetry dependency. A good feature leaves enough local evidence behind to explain what happened without replaying the whole run.
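
As a rough illustration of this guidance, the sketch below records structured events in a local SQLite table and then answers "why did the run end?" from that evidence alone. The table name `demo_events`, its columns, the `run_ended` event kind, and the helper names are all hypothetical; they are not SecondBrain's schema or API.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def open_event_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a demo event table; the schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS demo_events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " session_id TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " created_at TEXT NOT NULL)"
    )
    return conn


def record_event(
    conn: sqlite3.Connection, session_id: str, kind: str, payload: dict
) -> None:
    """Append one structured event with a JSON payload and a UTC timestamp."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO demo_events (session_id, kind, payload, created_at)"
            " VALUES (?, ?, ?, ?)",
            (
                session_id,
                kind,
                json.dumps(payload, sort_keys=True),
                datetime.now(timezone.utc).isoformat(),
            ),
        )


def explain_end(conn: sqlite3.Connection, session_id: str) -> Optional[dict]:
    """Return the payload of the latest 'run_ended' event, if any."""
    row = conn.execute(
        "SELECT payload FROM demo_events"
        " WHERE session_id = ? AND kind = 'run_ended'"
        " ORDER BY id DESC LIMIT 1",
        (session_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because the terminal event carries its own reason, an inspection command can answer the question with a single query instead of replaying the run.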