Observability
Current as of 2026-04-25.
SecondBrain's observability is local-first. The goal is to make a run inspectable on the user's machine without requiring an external telemetry platform.
What Exists
- structured SQLite state across the split DB topology
- chat/session persistence through `EventLog` and `SessionStore`
- `session_events` for streamable chat, gateway, and background-session events
- typed chat stream events in `brain/serve/chat_runtime.py`
- React stream reduction and drift warnings in `serve-ui/src/lib/chat.ts`
- runtime observation envelopes in `brain/obs/`
- decision traces exposed through `sb trace` / `sb traces`
- background-session heartbeats, checkpoints, artifacts, and terminal state
- agent health snapshots derived from local session events, tool calls, checkpoints, memory proposals, approvals, budget signals, and recovery state
- quality, eval, simulation, conversation-improvement, and autotune stores
What A User Should Be Able To Inspect
- what request or automation ran
- which context, memory, or retrieval records were used
- what provider/model handled the turn
- what tools were selected, called, denied, failed, or timed out
- what policy or approval decision was applied
- what artifact, answer, outbound message, or memory update was produced
- why the run ended
- whether the agent is running, stuck, waiting for approval, burning budget, losing tool calls, missing memory context, or safe to resume/branch
Main Places To Inspect
- `sb sessions list` and `sb sessions show`
- `sb traces ...`
- `sb quality ...`
- `sb autotune ...`
- `sb daemon status`
- the operator UI run/chat/quality/approvals pages
- `GET /sessions/{session_id}/health/stream` for real-time background-session health snapshots
- SQLite databases under `settings.paths.state_dir`
- artifacts, logs, reports, and output directories under local state paths
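As a rough illustration of inspecting the local SQLite state directly, the sketch below lists the tables in every database file under a directory. The `list_tables` helper and the `.db` suffix are assumptions made for this example; the actual file names and layout under `settings.paths.state_dir` are not specified here.

```python
import sqlite3
from pathlib import Path


def list_tables(state_dir: str) -> dict[str, list[str]]:
    """Map each SQLite file under state_dir to its table names.

    Hypothetical helper: assumes databases use a .db suffix.
    """
    tables: dict[str, list[str]] = {}
    for db_path in sorted(Path(state_dir).rglob("*.db")):
        # Open read-only so inspection can never modify run state.
        uri = db_path.resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            ).fetchall()
        finally:
            conn.close()
        tables[db_path.name] = [name for (name,) in rows]
    return tables
```

Opening each file with `mode=ro` keeps the inspection side-effect free, which matters when a background session may still be writing to the same database.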
Main Implementation Areas
- `brain/obs/`
- `brain/state/event_log.py`
- `brain/chat/session_store.py`
- `brain/serve/chat_runtime.py`
- `brain/serve/routers/core.py`
- `brain/serve/payloads.py`
- `brain/serve/routers/sessions.py`
- `serve-ui/src/lib/chat.ts`
- `serve-ui/src/pages/AgentCockpitPage.tsx`
- `serve-ui/src/pages/AgentsPage.tsx`
- `brain/background_sessions/store.py`
- `brain/quality/`
- `brain/eval/`
- `brain/evals/`
- `brain/simulations/`
- `brain/improvement/`
- `brain/autotune/`
Stream Contract Check
The serve chat stream contract is intentionally explicit:
- server event names live in `STREAM_EVENT_TYPES`
- the UI handled set lives in `HANDLED_EVENT_TYPES`
- `GET /stream-events` exposes the server list
- the browser warns if it receives an event the reducer does not handle
When adding a new chat event, update the server callbacks, server event list, UI reducer, and chat-page rendering in the same diff.
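The contract check amounts to a set comparison between the two lists. The sketch below shows one way such a drift check could look, for example in a test; `find_stream_drift` and the event names in the usage example are hypothetical and do not reflect the real contents of `STREAM_EVENT_TYPES` or `HANDLED_EVENT_TYPES`.

```python
from typing import Iterable


def find_stream_drift(
    server_events: Iterable[str], handled_events: Iterable[str]
) -> dict[str, list[str]]:
    """Compare the server event list with the UI handled set.

    Hypothetical helper: returns the names present on one side only.
    """
    server = set(server_events)
    handled = set(handled_events)
    return {
        # Events the server can emit but the reducer would ignore.
        "unhandled_by_ui": sorted(server - handled),
        # Events the reducer handles but the server no longer lists.
        "unknown_to_server": sorted(handled - server),
    }


# Illustrative event names only.
drift = find_stream_drift(
    ["token", "tool_call", "done"],
    ["token", "done", "legacy_ping"],
)
```

An empty result on both keys means the server list and the UI reducer agree; anything else points at a diff that updated only one side.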
Agent Health Snapshot
Background sessions expose one operator-facing health snapshot per active or recent run. The snapshot is a local-first summary, not an external telemetry dependency. It includes status, score, risk flags, last event/tool, warnings, approval state, evidence links, memory scope/provenance, budget usage, audit trail rows, and recovery hints.
The snapshot turns raw events into product state. It detects stale or missing heartbeats, approval waits, loop warnings, repeated plan text, repeated failed tools, slow or escalating tool latency, token/cost/context pressure, retries without progress, context growth without evidence, missing memory context, provider degradation, retry exhaustion, expiry, and safe resume/branch points.
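To make the idea of turning raw events into product state concrete, the following is a minimal sketch covering only two of the signals named above: stale or missing heartbeats and repeated failed tools. The `Event` shape, the event kind strings, the flag names, and the thresholds are all assumptions for illustration, not the actual snapshot schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                   # hypothetical kinds: "heartbeat", "tool_failed"
    at: datetime
    tool: Optional[str] = None  # set only for tool events


def risk_flags(
    events: list[Event],
    now: datetime,
    heartbeat_timeout: timedelta = timedelta(minutes=2),  # assumed threshold
    max_tool_failures: int = 3,                           # assumed threshold
) -> list[str]:
    """Derive a few risk flags from a session's local events."""
    flags: list[str] = []

    heartbeats = [e.at for e in events if e.kind == "heartbeat"]
    if not heartbeats:
        flags.append("missing_heartbeat")
    elif now - max(heartbeats) > heartbeat_timeout:
        flags.append("stale_heartbeat")

    # Count failures per tool and flag any tool that keeps failing.
    failures = Counter(
        e.tool for e in events if e.kind == "tool_failed" and e.tool
    )
    for tool, count in sorted(failures.items()):
        if count >= max_tool_failures:
            flags.append(f"repeated_tool_failure:{tool}")
    return flags
```

Because the function reads only stored events and a supplied `now`, the same inputs always produce the same flags, which keeps a snapshot reproducible from local evidence.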
Background sessions also carry a local `health_policy` in session metadata.
New sessions get a default policy, and the serve UI exposes the main operator
budgets: runtime minutes, total tokens, cost, context percentage, and whether a
breach should request approval or pause automatically. The supervisor and
runtime enforce those thresholds from local events, tool observations,
checkpoints, and artifacts. Approval escalation uses the existing local approval
store; pause escalation records a policy decision and checkpoint before stopping
the session.
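A policy of this kind can be pictured as a small set of limits plus a breach action. The sketch below is an assumption-laden illustration: the field names, the `Usage` record, and the `"request_approval"` / `"pause"` strings are hypothetical and are not taken from the real `health_policy` schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthPolicy:
    # Hypothetical field names mirroring the operator budgets above.
    max_runtime_minutes: float
    max_total_tokens: int
    max_cost: float
    max_context_pct: float
    on_breach: str = "request_approval"  # or "pause"


@dataclass(frozen=True)
class Usage:
    runtime_minutes: float
    total_tokens: int
    cost: float
    context_pct: float


def check_policy(
    policy: HealthPolicy, usage: Usage
) -> tuple[list[str], Optional[str]]:
    """Return the breached budgets and the escalation to apply, if any."""
    limits = [
        ("runtime_minutes", usage.runtime_minutes, policy.max_runtime_minutes),
        ("total_tokens", usage.total_tokens, policy.max_total_tokens),
        ("cost", usage.cost, policy.max_cost),
        ("context_pct", usage.context_pct, policy.max_context_pct),
    ]
    breaches = [name for name, used, limit in limits if used > limit]
    action = policy.on_breach if breaches else None
    return breaches, action
```

Returning the list of breached budgets alongside the action lets the caller record why an approval request or pause was triggered.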
Guidance
Prefer adding structured local events and clear inspection commands before adding a new telemetry dependency. A good feature leaves enough local evidence behind to explain what happened without replaying the whole run.
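As one possible reading of this guidance, the sketch below records structured events in a local SQLite table and reads them back to explain a session. The `demo_events` table, its columns, and both helper functions are invented for this example and are unrelated to the project's real `EventLog` or `session_events` schema.

```python
import json
import sqlite3
import time
from typing import Any


def record_event(
    conn: sqlite3.Connection, session_id: str, kind: str, payload: dict[str, Any]
) -> None:
    """Append one structured event to a hypothetical local table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS demo_events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT NOT NULL, kind TEXT NOT NULL, "
        "payload TEXT NOT NULL, created_at REAL NOT NULL)"
    )
    conn.execute(
        "INSERT INTO demo_events (session_id, kind, payload, created_at) "
        "VALUES (?, ?, ?, ?)",
        (session_id, kind, json.dumps(payload, sort_keys=True), time.time()),
    )
    conn.commit()


def explain_session(
    conn: sqlite3.Connection, session_id: str
) -> list[tuple[str, dict[str, Any]]]:
    """Return a session's events in insertion order, payloads decoded."""
    rows = conn.execute(
        "SELECT kind, payload FROM demo_events "
        "WHERE session_id = ? ORDER BY id",
        (session_id,),
    ).fetchall()
    return [(kind, json.loads(payload)) for kind, payload in rows]
```

Storing the payload as JSON keeps each event self-describing, so a later reader can reconstruct what happened from the rows alone.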