Local Agent Stack

SecondBrain can route private work to local model profiles and run generated code or shell commands under a deterministic execution-isolation policy. The local-agent stack has two separable responsibilities:

Model fit: choose an explainable local model profile from detected RAM, accelerator memory, context requirements, quantization, and native tool-call support.
Execution isolation: preflight commands, enforce allowlists and denylists, restrict network access, and require Docker/E2B isolation for write-like actions unless unsafe host execution is explicitly enabled.

Model Profiles

Local model profiles live in brain/providers/model_profiles.py and are merged with models.local_profiles from configuration. Each profile records the provider, model name, backend, quantization, weight format, context policy, memory envelope, and tool-calling support.

Use the provider diagnostics command to see how each profile fits the current host:

sb providers local-profiles --context 4096 --requires-tools
sb providers local-profiles --context 32768 --requires-tools --json

The output includes:

detected hardware and effective accelerator memory
every profile decision as fit, degraded, or unavailable
score, effective context window, reasons, and warnings
the preferred profile ordering from model_routing.preferred_local_profile
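
The fit, degraded, and unavailable decisions can be pictured with a minimal sketch. The `MemoryEnvelope` class, the `classify` function, and the threshold semantics below are assumptions for illustration only; they are not the logic in brain/providers/model_profiles.py. The field names mirror the memory keys used in models.local_profiles.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryEnvelope:
    # Hypothetical container; field names mirror the models.local_profiles keys.
    min_ram_gib: float
    recommended_ram_gib: float
    min_vram_gib: float
    recommended_vram_gib: float


def classify(env: MemoryEnvelope, ram_gib: float, vram_gib: Optional[float]) -> str:
    """Return "fit", "degraded", or "unavailable" (assumed semantics)."""
    # Below the minimum RAM the profile is treated as unable to run at all.
    if ram_gib < env.min_ram_gib:
        return "unavailable"
    # Unknown or insufficient accelerator memory: assume a slower fallback path.
    if vram_gib is None or vram_gib < env.min_vram_gib:
        return "degraded"
    # Both recommended thresholds met: full fit.
    if ram_gib >= env.recommended_ram_gib and vram_gib >= env.recommended_vram_gib:
        return "fit"
    # Between minimum and recommended on at least one axis.
    return "degraded"
```

With the envelope from the example profile (10/16 GiB RAM, 6/10 GiB VRAM), a host with 12 GiB RAM and 8 GiB VRAM would land in the degraded bucket under these assumed rules.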

Configured profiles override built-ins by profile_id:

models:
  local_profiles:
    - profile_id: local-qwen-custom
      provider: ollama
      model: qwen2.5:7b-instruct-q4_K_M
      backend: ollama
      locality: local
      parameter_count_b: 7
      quantization: q4_K_M
      weight_format: gguf
      context_tokens: 32768
      min_ram_gib: 10
      recommended_ram_gib: 16
      min_vram_gib: 6
      recommended_vram_gib: 10
      supports_native_tool_calling: true

model_routing:
  preferred_local_profile: local-qwen-custom
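
One way to picture the override-by-profile_id behavior is a keyed merge. This is a sketch under the assumption that a configured profile replaces a built-in one wholesale; the real merge may combine fields instead. `merge_profiles` is a hypothetical helper name.

```python
from typing import Any, Dict, List

Profile = Dict[str, Any]


def merge_profiles(builtins: List[Profile], configured: List[Profile]) -> List[Profile]:
    """Merge two profile lists; configured entries win on matching profile_id."""
    # Index built-ins by profile_id, preserving their original order.
    merged: Dict[str, Profile] = {p["profile_id"]: p for p in builtins}
    for profile in configured:
        # Assumed: whole-profile replacement, not a field-by-field merge.
        merged[profile["profile_id"]] = profile
    return list(merged.values())
```

A configured profile with a new profile_id is simply appended after the built-ins.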

Hardware Detection

Hardware detection is offline-safe and dependency-light. It uses platform data for CPU/RAM, nvidia-smi when available for CUDA memory, and Apple Silicon platform signals for unified-memory MPS fit checks.
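
A dependency-light CUDA memory probe could look roughly like the sketch below. It shells out to nvidia-smi with its standard `--query-gpu` and `--format` options and returns None when the tool is missing or fails, so it stays offline-safe. The function names and the choice to report the largest single GPU are assumptions, not SecondBrain's actual detection code.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_vram_mib(output: str) -> List[int]:
    """Parse one integer MiB value per GPU line; skip non-numeric lines such as [N/A]."""
    return [int(line.strip()) for line in output.splitlines() if line.strip().isdigit()]


def detect_cuda_vram_gib() -> Optional[float]:
    """Return the largest GPU's total memory in GiB, or None if it cannot be determined."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA tooling on this host
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # treat any failure as "unknown" rather than raising
    values = parse_vram_mib(result.stdout)
    # Assumed: use the largest single GPU rather than summing across devices.
    return max(values) / 1024 if values else None
```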

Ollama-managed GGUF profiles use a VRAM-aware context policy:

Effective accelerator memory	Default effective context
Below 24 GiB or unknown	4,096 tokens
24 GiB to below 48 GiB	32,768 tokens
48 GiB or more	256,000 tokens

Operators can override that policy with OLLAMA_CONTEXT_LENGTH or SB_OLLAMA_CONTEXT_LENGTH.
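
The table and override rule can be expressed as a small function. This is a sketch: the thresholds come from the table above, but the precedence between the two environment variables (SB_OLLAMA_CONTEXT_LENGTH checked first) and the handling of invalid values are assumptions.

```python
import os
from typing import Mapping, Optional


def effective_context(vram_gib: Optional[float],
                      env: Mapping[str, str] = os.environ) -> int:
    """Return the effective context window in tokens for an Ollama-managed GGUF profile."""
    # Assumed precedence: the SB_-prefixed variable wins over the plain Ollama one.
    for name in ("SB_OLLAMA_CONTEXT_LENGTH", "OLLAMA_CONTEXT_LENGTH"):
        raw = env.get(name, "").strip()
        if raw.isdigit() and int(raw) > 0:
            return int(raw)
    # No valid override: fall back to the VRAM-aware defaults.
    if vram_gib is None or vram_gib < 24:
        return 4096
    if vram_gib < 48:
        return 32768
    return 256000
```

Passing `env` explicitly keeps the function easy to test without mutating the process environment.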

Execution Isolation

The central policy lives in brain/sandbox/policy.py. It is used by shell.exec and by the sb policy preflight diagnostics command.
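
To illustrate what a deterministic preflight could look like, here is a minimal sketch. The `Policy` fields, the example allowlist and denylist entries, and the decision strings are all hypothetical; they are not the contents of brain/sandbox/policy.py.

```python
import shlex
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    # Hypothetical example values; a real policy would load these from configuration.
    allow: FrozenSet[str] = frozenset({"ls", "cat", "grep", "rm", "mv"})
    deny: FrozenSet[str] = frozenset({"sudo", "dd"})
    write_like: FrozenSet[str] = frozenset({"rm", "mv"})
    isolation_backend: str = "none"  # "docker", "e2b", or "none"
    allow_unsafe_host: bool = False


def preflight(command: str, policy: Policy) -> str:
    """Return "allow" or "deny: <reason>". Pure function: no I/O, same input gives same output."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return "deny: unparseable command"
    if not argv:
        return "deny: empty command"
    program = argv[0]
    # The denylist is checked first so it always wins over the allowlist.
    if program in policy.deny:
        return "deny: denylisted"
    if program not in policy.allow:
        return "deny: not allowlisted"
    isolated = policy.isolation_backend in ("docker", "e2b")
    # Write-like actions need isolation unless unsafe host execution is explicitly enabled.
    if program in policy.write_like and not (isolated or policy.allow_unsafe_host):
        return "deny: write-like action requires isolation"
    return "allow"
```

Keeping the check free of I/O is what makes the decision reproducible for both shell.exec and a diagnostics command.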

Docker Shell Execution

When shell_exec.isolation_backend is docker, shell.exec mounts the approved working directory at /workspace and runs the command in a hardened ephemeral container:

no network by default when network_mode: none
read-only container filesystem
writable /tmp tmpfs with noexec and nosuid
dropped Linux capabilities
no-new-privileges
bounded process count
host UID/GID for mounted workspace writes

If Docker is not reachable, shell.exec fails closed instead of falling back to host execution. E2B remains supported for generated Python code paths, but arbitrary shell execution requires Docker or explicitly enabled unsafe host execution.
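
The hardening properties listed above map onto standard `docker run` flags. The sketch below shows one plausible argument list; the image name, the process limit of 128, and both function names are assumptions, and the actual shell.exec invocation may differ.

```python
import shutil
import subprocess
from typing import List


def ensure_docker_reachable() -> None:
    """Fail closed: raise instead of silently falling back to host execution."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker CLI not found; refusing host fallback")
    try:
        probe = subprocess.run(["docker", "info"], capture_output=True, timeout=10)
    except (subprocess.SubprocessError, OSError) as exc:
        raise RuntimeError("docker daemon not reachable") from exc
    if probe.returncode != 0:
        raise RuntimeError("docker daemon not reachable")


def build_docker_argv(workdir: str, command: str, uid: int, gid: int,
                      image: str = "python:3.12-slim",  # hypothetical image
                      network_mode: str = "none",
                      pids_limit: int = 128) -> List[str]:
    """Build the argv for a hardened, ephemeral container run."""
    return [
        "docker", "run", "--rm",                # ephemeral container
        "--network", network_mode,              # "none" disables networking
        "--read-only",                          # read-only container filesystem
        "--tmpfs", "/tmp:rw,noexec,nosuid",     # writable scratch space only
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "--pids-limit", str(pids_limit),        # bounded process count
        "--user", f"{uid}:{gid}",               # host UID/GID for workspace writes
        "--volume", f"{workdir}:/workspace",
        "--workdir", "/workspace",
        image, "sh", "-c", command,
    ]
```

A caller on a POSIX host would typically pass `os.getuid()` and `os.getgid()`, call `ensure_docker_reachable()` first, and hand the resulting list to `subprocess.run` without `shell=True`.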

Diagnostics API

sb serve exposes the authenticated local-agent stack diagnostics endpoint:

GET /diagnostics/local-agent-stack?context=4096&requires_tools=true

The response includes a single score and status plus detailed sections:

hardware
model_profiles
execution_isolation
recommendations

Use this endpoint for operator UI surfaces that need to show whether the current machine is ready for private local-agent work.
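
A client call might look like the following sketch, using only the standard library. The base URL, the port, and the bearer-token header are assumptions: this page states that the endpoint is authenticated but does not specify the scheme.

```python
import json
import urllib.parse
import urllib.request


def build_diagnostics_request(base_url: str, token: str, context: int = 4096,
                              requires_tools: bool = True) -> urllib.request.Request:
    """Build a GET request for the local-agent stack diagnostics endpoint."""
    query = urllib.parse.urlencode({
        "context": context,
        "requires_tools": str(requires_tools).lower(),  # "true" / "false"
    })
    url = f"{base_url.rstrip('/')}/diagnostics/local-agent-stack?{query}"
    # Assumed auth scheme: bearer token in the Authorization header.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_diagnostics(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

For example, `fetch_diagnostics(build_diagnostics_request("http://127.0.0.1:8000", token))` would return the decoded response, from which a UI could read the hardware, model_profiles, execution_isolation, and recommendations sections.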