Observability

Current as of 2026-04-25.

SecondBrain's observability is local-first. The goal is to make a run inspectable on the user's machine without requiring an external telemetry platform.

What Exists

  • structured SQLite state across the split DB topology
  • chat/session persistence through EventLog and SessionStore
  • session_events for streamable chat, gateway, and background-session events
  • typed chat stream events in brain/serve/chat_runtime.py
  • React stream reduction and drift warnings in serve-ui/src/lib/chat.ts
  • runtime observation envelopes in brain/obs/
  • decision traces exposed through sb trace / sb traces
  • background-session heartbeats, checkpoints, artifacts, and terminal state
  • agent health snapshots derived from local session events, tool calls, checkpoints, memory proposals, approvals, budget signals, and recovery state
  • quality, eval, simulation, conversation-improvement, and autotune stores

What A User Should Be Able To Inspect

  • what request or automation ran
  • which context, memory, or retrieval records were used
  • what provider/model handled the turn
  • what tools were selected, called, denied, failed, or timed out
  • what policy or approval decision was applied
  • what artifact, answer, outbound message, or memory update was produced
  • why the run ended
  • whether the agent is running, stuck, waiting for approval, burning budget, losing tool calls, missing memory context, or safe to resume/branch

Main Places To Inspect

  • sb sessions list and sb sessions show
  • sb traces ...
  • sb quality ...
  • sb autotune ...
  • sb daemon status
  • the operator UI run/chat/quality/approvals pages
  • GET /sessions/{session_id}/health/stream for real-time background-session health snapshots
  • SQLite databases under settings.paths.state_dir
  • artifacts, logs, reports, and output directories under local state paths
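For a quick look at the local SQLite state without going through the CLI, a small read-only script can list each database's tables and row counts. This is a minimal sketch under stated assumptions: the `*.db` filename pattern and the `summarize_sqlite_dir` helper are hypothetical, not part of SecondBrain. Pass whatever directory settings.paths.state_dir resolves to on your machine.

```python
import sqlite3
from pathlib import Path


def summarize_sqlite_dir(state_dir: Path) -> dict:
    """Return {db_filename: {table_name: row_count}} for each SQLite file found.

    Assumption: databases use a ".db" suffix. Adjust the glob if they do not.
    """
    summary = {}
    for db_path in sorted(state_dir.glob("*.db")):
        # Open read-only via a URI so inspection can never modify run state.
        uri = db_path.resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            tables = [
                row[0]
                for row in conn.execute(
                    "SELECT name FROM sqlite_master "
                    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                    "ORDER BY name"
                )
            ]
            counts = {}
            for name in tables:
                # Quote the identifier; embedded double quotes are doubled.
                quoted = '"' + name.replace('"', '""') + '"'
                counts[name] = conn.execute(
                    f"SELECT COUNT(*) FROM {quoted}"
                ).fetchone()[0]
            summary[db_path.name] = counts
        finally:
            conn.close()
    return summary
```

Opening with mode=ro keeps the script safe to run while a session is active, since it cannot write to the databases it inspects.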

Main Implementation Areas

  • brain/obs/
  • brain/state/event_log.py
  • brain/chat/session_store.py
  • brain/serve/chat_runtime.py
  • brain/serve/routers/core.py
  • brain/serve/payloads.py
  • brain/serve/routers/sessions.py
  • serve-ui/src/lib/chat.ts
  • serve-ui/src/pages/AgentCockpitPage.tsx
  • serve-ui/src/pages/AgentsPage.tsx
  • brain/background_sessions/store.py
  • brain/quality/
  • brain/eval/
  • brain/evals/
  • brain/simulations/
  • brain/improvement/
  • brain/autotune/

Stream Contract Check

The serve chat stream contract is intentionally explicit:

  • server event names live in STREAM_EVENT_TYPES
  • the UI handled set lives in HANDLED_EVENT_TYPES
  • GET /stream-events exposes the server list
  • the browser warns if it receives an event the reducer does not handle

When adding a new chat event, update the server callbacks, server event list, UI reducer, and chat-page rendering in the same diff.
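The contract check amounts to a set comparison between the two lists. The sketch below shows the idea only; `stream_contract_drift` is a hypothetical helper, and the event names in the usage example are made up rather than taken from STREAM_EVENT_TYPES or HANDLED_EVENT_TYPES.

```python
def stream_contract_drift(server_events, ui_handled):
    """Compare the server's event names with the names the UI reducer handles.

    Both arguments are iterables of event-name strings. How you obtain them
    (for example, from GET /stream-events and from the UI source) is up to you.
    """
    server, ui = set(server_events), set(ui_handled)
    return {
        # Emitted by the server but not handled by the reducer.
        "unhandled_by_ui": sorted(server - ui),
        # Handled by the reducer but no longer emitted by the server.
        "unknown_to_server": sorted(ui - server),
    }


# Illustrative names only.
drift = stream_contract_drift(
    server_events=["token", "tool_call", "done"],
    ui_handled=["token", "done", "legacy_ping"],
)
```

An empty result on both keys means the two sides agree. A check like this could run in CI so that a diff touching only one side fails early.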

Agent Health Snapshot

Background sessions expose one operator-facing health snapshot per active or recent run. The snapshot is a local-first summary, not an external telemetry dependency. It includes status, score, risk flags, last event/tool, warnings, approval state, evidence links, memory scope/provenance, budget usage, audit trail rows, and recovery hints.

The snapshot turns raw events into product state. It detects stale or missing heartbeats, approval waits, loop warnings, repeated plan text, repeated failed tools, slow or escalating tool latency, token/cost/context pressure, retries without progress, context growth without evidence, missing memory context, provider degradation, retry exhaustion, and expiry, and it identifies safe resume/branch points.
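To make two of these detections concrete, the sketch below derives risk flags for heartbeats and repeated tool failures from a list of events. It is an illustration, not the implementation in brain/: the `Event` shape, the flag names, and the default thresholds are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape; real session events will differ."""
    kind: str                  # "heartbeat" or "tool_call" in this sketch
    at: datetime
    tool: Optional[str] = None
    ok: bool = True


def risk_flags(events, now, stale_after=timedelta(minutes=2), max_repeat_failures=3):
    """Return risk flags for heartbeat freshness and repeated tool failures."""
    flags = []

    heartbeats = [e.at for e in events if e.kind == "heartbeat"]
    if not heartbeats:
        flags.append("missing_heartbeat")
    elif now - max(heartbeats) > stale_after:
        flags.append("stale_heartbeat")

    # Track the current trailing run of consecutive failures of one tool.
    streak, last_tool = 0, None
    for e in events:
        if e.kind != "tool_call":
            continue
        if e.ok:
            streak = 0
        elif e.tool == last_tool:
            streak += 1
        else:
            streak = 1
        last_tool = e.tool
    if streak >= max_repeat_failures:
        flags.append("repeated_failed_tool")

    return flags
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to replay against stored events.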

Background sessions also carry a local health_policy in session metadata. New sessions get a default policy, and the serve UI exposes the main operator budgets: runtime minutes, total tokens, cost, context percentage, and whether a breach should request approval or pause automatically. The supervisor and runtime enforce those thresholds from local events, tool observations, checkpoints, and artifacts. Approval escalation uses the existing local approval store; pause escalation records a policy decision and checkpoint before stopping the session.
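A threshold check of this kind can be sketched as a pure function over the policy and the current usage. The field names, the `"request_approval"` and `"pause"` action strings, and the `evaluate_policy` helper below are assumptions for illustration; the actual health_policy schema in session metadata may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthPolicy:
    """Hypothetical policy shape mirroring the operator budgets listed above."""
    max_runtime_minutes: float
    max_total_tokens: int
    max_cost: float
    max_context_pct: float
    on_breach: str = "request_approval"  # or "pause"


@dataclass(frozen=True)
class Usage:
    runtime_minutes: float
    total_tokens: int
    cost: float
    context_pct: float


def evaluate_policy(policy: HealthPolicy, usage: Usage):
    """Return (action, breached_budgets) for the current usage."""
    checks = [
        ("runtime_minutes", usage.runtime_minutes, policy.max_runtime_minutes),
        ("total_tokens", usage.total_tokens, policy.max_total_tokens),
        ("cost", usage.cost, policy.max_cost),
        ("context_pct", usage.context_pct, policy.max_context_pct),
    ]
    breaches = [name for name, used, limit in checks if used > limit]
    action = policy.on_breach if breaches else "continue"
    return action, breaches


policy = HealthPolicy(
    max_runtime_minutes=30, max_total_tokens=200_000,
    max_cost=5.0, max_context_pct=90.0, on_breach="pause",
)
action, breaches = evaluate_policy(
    policy,
    Usage(runtime_minutes=12, total_tokens=250_000, cost=1.2, context_pct=40.0),
)
```

In this example the token budget is exceeded, so the result is the policy's configured action together with the name of the breached budget. Returning the list of breaches alongside the action gives the caller something concrete to record in the policy decision.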

Guidance

Prefer adding structured local events and clear inspection commands before adding a new telemetry dependency. A good feature leaves enough local evidence behind to explain what happened without replaying the whole run.