Agent Harness — Testing Guide¶

Test Files¶

File	Tests	What it covers
`tests/agent/test_harness_refactor.py`	78	All 10 collaborators, unit-level
`tests/agent/test_agent_harness_timeouts.py`	2	Timeout handling, `_tool_timeout_cap()`

Running Tests¶

# All harness tests
.venv/bin/python -m pytest tests/agent/test_harness_refactor.py tests/agent/test_agent_harness_timeouts.py -v

# Just the collaborator unit tests
.venv/bin/python -m pytest tests/agent/test_harness_refactor.py -v

# Quick smoke run (no output capture)
.venv/bin/python -m pytest tests/agent/ -q

Coverage Map¶

TurnBudget / DeadlineToken (8 tests)¶

Test	What it verifies
`test_turn_budget_create_sets_deadline`	`deadline = perf_counter() + timeout_s`
`test_turn_budget_claim_step_exhaustion`	Returns `False` after `max_steps` claims
`test_turn_budget_is_expired`	Wall-clock deadline check
`test_turn_budget_per_tool_remaining_min_floor`	Never returns less than 5.0s
`test_turn_budget_snapshot`	Snapshot dict has all expected keys
`test_deadline_token_expires`	Expiry after deadline
`test_deadline_token_cancel`	`cancel()` makes it immediately expired
`test_deadline_token_from_budget`	Deadline = min(turn_deadline, now + cap)

TurnRuntimeState (11 tests)¶

Test	What it verifies
`test_turn_state_initial_values`	All fields at safe defaults
`test_turn_state_add_tokens`	Accumulation is additive
`test_turn_state_trace_lifecycle_normal`	running → complete returns True
`test_turn_state_trace_lifecycle_timeout`	running → timed_out → late returns False
`test_turn_state_late_write_gate`	`close_writes()` makes `is_accepting_writes()` False
`test_turn_state_repair_count`	`increment_repair` + `repair_count`
`test_turn_state_record_tool_result`	Appends to `tool_results`
`test_turn_state_blocked_tool_names`	Set operations
`test_turn_state_citations`	`CitationAccumulator` fields
`test_turn_state_trace_lock_thread_safety`	No data race under concurrent writes
`test_turn_state_writes_lock_thread_safety`	No data race under concurrent close/check

ToolOutcomes (8 tests)¶

Test	What it verifies
`test_tool_execution_result_frozen`	Mutation raises `FrozenInstanceError`
`test_outcome_to_model_content_success`	JSON of output dict
`test_outcome_to_model_content_timeout`	Error JSON with `timed_out: true`
`test_outcome_to_model_content_failure`	Error JSON
`test_outcome_to_model_content_denied_duplicate`	Warning dict
`test_outcome_to_model_content_denied_validation`	Hint dict
`test_outcome_to_model_content_artifact`	Reference dict
`test_outcome_predicates`	`outcome_is_error`, `outcome_blocks_tool`

ArtifactStore (8 tests)¶

Test	What it verifies
`test_artifact_store_creates_file`	File written to disk
`test_artifact_store_retrieve_success`	Round-trip read
`test_artifact_store_retrieve_expired`	Returns None after TTL
`test_artifact_store_retrieve_missing`	Returns None for unknown ID
`test_artifact_store_cleanup_session`	Removes all session artifacts
`test_artifact_store_cleanup_expired`	Removes expired artifacts
`test_artifact_store_stats`	Returns count and total bytes
`test_get_global_artifact_store_singleton`	Same instance returned twice

ReplayControl (8 tests)¶

Test	What it verifies
`test_replay_first_call_allowed`	No prior history → allow
`test_replay_idempotent_duplicate_skipped`	Prior success → skip
`test_replay_idempotent_failure_retry_allowed`	Prior failure → allow
`test_replay_idempotent_timeout_retry_allowed`	Prior timeout → allow
`test_replay_non_idempotent_always_allowed`	`idempotent=False` → never skip
`test_replay_no_spec_conservative_dedup`	No ToolSpec → treated as idempotent
`test_replay_record_success_then_skip`	Record + check cycle
`test_replay_history_size`	Counter increments

ToolScheduler (6 tests)¶

Test	What it verifies
`test_scheduler_empty_calls`	Returns empty list
`test_scheduler_single_call`	Always sequential
`test_scheduler_parallel_read_only`	Two non-overlapping READ_ONLY → parallel wave
`test_scheduler_local_write_sequential`	LOCAL_WRITE → own sequential wave
`test_scheduler_resource_overlap_splits`	Overlapping READ_ONLY calls → two waves
`test_scheduler_parallel_disabled`	`parallel_enabled=False` → all sequential

TurnPreparer (8 tests)¶

Test	What it verifies
`test_turn_preparer_builds_messages`	System + user message present
`test_turn_preparer_includes_history`	History injected at correct position
`test_turn_preparer_memory_context`	Memory text injected as system message
`test_turn_preparer_render_hash_per_turn`	Hash changes when prompt changes
`test_turn_preparer_hash_not_sticky`	Two calls with different prompts → different hashes
`test_turn_preparer_skill_context`	Skill context message injected
`test_turn_preparer_no_skill_context`	No extra message when skill_context=None
`test_turn_preparer_resolved_model_inferred`	Model inferred from provider

BoundedToolExecutor (9 tests)¶

Test	What it verifies
`test_executor_deadline_gate`	Expired budget → `ToolDenied(reason="deadline")`
`test_executor_successful_execution`	Returns `ToolExecutionResult`
`test_executor_tool_failure`	Exception → `ToolFailure`
`test_executor_timeout`	Thread timeout → `ToolTimeout`
`test_executor_late_write_suppressed`	Post-deadline completion → side effects skipped
`test_executor_oversized_output`	Output > 12k → `ToolArtifactReference`
`test_executor_web_mode_off`	`web_search` disabled
`test_executor_write_confirmation_denied`	Write denied → `ToolDenied(reason="write_denied")`
`test_executor_write_confirmation_approved`	Write approved → tool re-executed

ReflectionEngine (5 tests)¶

Test	What it verifies
`test_reflection_budget_exhausted`	Budget=0 → stub result returned
`test_reflection_deadline_expired`	Expired budget → stub result
`test_reflection_success`	LLM call returns text → `ReflectionResult`
`test_reflection_llm_error`	Provider raises → stub result, no exception
`test_reflection_timeout_capped`	`timeout_s` ≤ `budget.remaining_s()`

TurnFinalizer (10 tests)¶

Test	What it verifies
`test_finalizer_returns_turn_result`	Return type and field values
`test_finalizer_citation_append`	Citations added when `show_citations=True`
`test_finalizer_no_citations`	No append when `show_citations=False`
`test_finalizer_memory_extraction`	`extract_turn()` called
`test_finalizer_memory_extraction_exception`	Exception swallowed
`test_finalizer_decision_record_direct`	`direct_answer` strategy
`test_finalizer_decision_record_retrieval`	`retrieval_augmented` strategy
`test_finalizer_decision_record_exception`	Exception swallowed
`test_finalizer_no_decision_store`	Skipped when `decision_store=None`
`test_finalizer_judge_scheduling`	`schedule_prompt_judges` called

Writing New Tests¶

Testing a collaborator in isolation¶

Each collaborator accepts all its dependencies as constructor arguments. Use MagicMock for anything you don't need:

from unittest.mock import MagicMock
from brain.agent.turn_budget import TurnBudget
from brain.agent.turn_state import TurnRuntimeState
from brain.agent.tool_executor_v2 import BoundedToolExecutor

def test_my_feature():
    budget = TurnBudget.create(timeout_s=60)
    state  = TurnRuntimeState(session_id="test", budget=budget)

    executor = BoundedToolExecutor(
        tools_registry=MagicMock(),
        callbacks=MagicMock(),
        event_log=MagicMock(),
        checkpoint_store=None,
        artifact_store=MagicMock(),
        thinking_manager=MagicMock(),
    )
    ...

Testing timeout behaviour¶

Patch BoundedToolExecutor.execute directly on a harness._bounded_executor instance to simulate ToolTimeout or ToolFailure outcomes without needing real tool threads:

from brain.agent.tool_outcomes import ToolFailure

def fake_execute(call, state, *, web_mode, budget, tool_timeout_cap_s):
    return ToolFailure(call_id=call.id, tool_name=call.name,
                       error="simulated timeout", retryable=False)

harness._bounded_executor.execute = fake_execute

Testing `run_turn()` end-to-end¶

Provide a _Provider class that returns ProviderResult with controlled tool_calls / text values. Use a _Tools stub that responds to tools.run() and tools.available_tool_schemas(). See tests/agent/test_agent_harness_timeouts.py for a complete example.