
Agent Harness — Testing Guide


Test Files

| File | Tests | What it covers |
|------|-------|----------------|
| `tests/agent/test_harness_refactor.py` | 78 | All 10 collaborators, unit-level |
| `tests/agent/test_agent_harness_timeouts.py` | 2 | Timeout handling, `_tool_timeout_cap()` |

Running Tests

# All harness tests
.venv/bin/python -m pytest tests/agent/test_harness_refactor.py tests/agent/test_agent_harness_timeouts.py -v

# Just the collaborator unit tests
.venv/bin/python -m pytest tests/agent/test_harness_refactor.py -v

# Quick smoke run of everything under tests/agent/ (quiet output)
.venv/bin/python -m pytest tests/agent/ -q

Coverage Map

TurnBudget / DeadlineToken (8 tests)

| Test | What it verifies |
|------|------------------|
| `test_turn_budget_create_sets_deadline` | `deadline = perf_counter() + timeout_s` |
| `test_turn_budget_claim_step_exhaustion` | Returns `False` after `max_steps` claims |
| `test_turn_budget_is_expired` | Wall-clock deadline check |
| `test_turn_budget_per_tool_remaining_min_floor` | Never returns less than 5.0 s |
| `test_turn_budget_snapshot` | Snapshot dict has all expected keys |
| `test_deadline_token_expires` | Expiry after deadline |
| `test_deadline_token_cancel` | `cancel()` makes it immediately expired |
| `test_deadline_token_from_budget` | Deadline = `min(turn_deadline, now + cap)` |
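
The invariants in this table fit in a few lines. The sketch below uses a hypothetical `SketchBudget` class, not the real `TurnBudget`, whose internals may differ. The 5.0 s floor and the `min(turn_deadline, now + cap)` rule come from the table. The field names and the default `max_steps=8` are assumptions.

```python
import time
from dataclasses import dataclass

MIN_TOOL_TIMEOUT_S = 5.0  # floor asserted by test_turn_budget_per_tool_remaining_min_floor


@dataclass
class SketchBudget:
    """Hypothetical stand-in for TurnBudget; not the real class."""

    deadline: float
    max_steps: int
    steps_used: int = 0

    @classmethod
    def create(cls, timeout_s: float, max_steps: int = 8) -> "SketchBudget":
        # Deadline is measured on the perf_counter() clock, as in the table above.
        return cls(deadline=time.perf_counter() + timeout_s, max_steps=max_steps)

    def claim_step(self) -> bool:
        """Reserve one step; refuse once max_steps have been claimed."""
        if self.steps_used >= self.max_steps:
            return False
        self.steps_used += 1
        return True

    def remaining_s(self) -> float:
        return max(0.0, self.deadline - time.perf_counter())

    def is_expired(self) -> bool:
        return time.perf_counter() >= self.deadline

    def per_tool_remaining(self) -> float:
        # Tools always get at least the floor, even when the turn is nearly over.
        return max(MIN_TOOL_TIMEOUT_S, self.remaining_s())


def token_deadline(budget: SketchBudget, cap_s: float) -> float:
    """Deadline for a per-call token: never later than the turn deadline."""
    return min(budget.deadline, time.perf_counter() + cap_s)
```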

TurnRuntimeState (11 tests)

| Test | What it verifies |
|------|------------------|
| `test_turn_state_initial_values` | All fields at safe defaults |
| `test_turn_state_add_tokens` | Accumulation is additive |
| `test_turn_state_trace_lifecycle_normal` | Transition running → complete returns `True` |
| `test_turn_state_trace_lifecycle_timeout` | running → timed_out → late returns `False` |
| `test_turn_state_late_write_gate` | `close_writes()` makes `is_accepting_writes()` return `False` |
| `test_turn_state_repair_count` | `increment_repair()` and `repair_count` |
| `test_turn_state_record_tool_result` | Appends to `tool_results` |
| `test_turn_state_blocked_tool_names` | Set operations on blocked tool names |
| `test_turn_state_citations` | `CitationAccumulator` fields |
| `test_turn_state_trace_lock_thread_safety` | No data race under concurrent writes |
| `test_turn_state_writes_lock_thread_safety` | No data race under concurrent close/check |
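
The two thread-safety tests guard against a check-then-act race. Here is a minimal sketch of the pattern, assuming a lock-protected gate. `SketchWriteGate` is hypothetical, not the `TurnRuntimeState` API. It races writers against `close_writes()` and checks two things: every accepted write was stored, and nothing is accepted after the close.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor


class SketchWriteGate:
    """Hypothetical stand-in for TurnRuntimeState's late-write gate."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._open = True
        self.writes: list[int] = []

    def close_writes(self) -> None:
        with self._lock:
            self._open = False

    def is_accepting_writes(self) -> bool:
        with self._lock:
            return self._open

    def try_write(self, value: int) -> bool:
        # Check and append under one lock, so close_writes() cannot land in between.
        with self._lock:
            if not self._open:
                return False
            self.writes.append(value)
            return True


def race_close_against_writes(gate: SketchWriteGate, n: int = 2000) -> int:
    """Submit n writes with a close in the middle; return how many were accepted."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(gate.try_write, i) for i in range(n // 2)]
        pool.submit(gate.close_writes)
        futures += [pool.submit(gate.try_write, i) for i in range(n // 2, n)]
        return sum(f.result() for f in futures)
```

The exact number of accepted writes varies from run to run, so a test should assert only the invariants, not a count.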

ToolOutcomes (8 tests)

| Test | What it verifies |
|------|------------------|
| `test_tool_execution_result_frozen` | Mutation raises `FrozenInstanceError` |
| `test_outcome_to_model_content_success` | Output dict serialized as JSON |
| `test_outcome_to_model_content_timeout` | Error JSON with `timed_out: true` |
| `test_outcome_to_model_content_failure` | Error JSON |
| `test_outcome_to_model_content_denied_duplicate` | Warning dict |
| `test_outcome_to_model_content_denied_validation` | Hint dict |
| `test_outcome_to_model_content_artifact` | Reference dict |
| `test_outcome_predicates` | `outcome_is_error` and `outcome_blocks_tool` predicates |
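
Here is a sketch of the outcome-to-content mapping. Hypothetical frozen dataclasses (`SketchSuccess`, `SketchTimeout`, `SketchFailure`) stand in for the real outcome types. Apart from `timed_out`, the JSON keys are assumptions.

```python
from __future__ import annotations

import json
from dataclasses import FrozenInstanceError, dataclass  # FrozenInstanceError: raised on mutation
from typing import Union


@dataclass(frozen=True)
class SketchSuccess:
    call_id: str
    tool_name: str
    output: dict


@dataclass(frozen=True)
class SketchTimeout:
    call_id: str
    tool_name: str
    timeout_s: float


@dataclass(frozen=True)
class SketchFailure:
    call_id: str
    tool_name: str
    error: str


SketchOutcome = Union[SketchSuccess, SketchTimeout, SketchFailure]


def to_model_content(outcome: SketchOutcome) -> str:
    """Render an outcome as the JSON string the model sees."""
    if isinstance(outcome, SketchSuccess):
        return json.dumps(outcome.output)
    if isinstance(outcome, SketchTimeout):
        # The timed_out flag lets the model tell a timeout apart from other errors.
        return json.dumps({"error": f"{outcome.tool_name} timed out after {outcome.timeout_s}s",
                           "timed_out": True})
    return json.dumps({"error": outcome.error})


def outcome_is_error(outcome: SketchOutcome) -> bool:
    return not isinstance(outcome, SketchSuccess)
```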

ArtifactStore (8 tests)

| Test | What it verifies |
|------|------------------|
| `test_artifact_store_creates_file` | File written to disk |
| `test_artifact_store_retrieve_success` | Round-trip read |
| `test_artifact_store_retrieve_expired` | Returns `None` after TTL |
| `test_artifact_store_retrieve_missing` | Returns `None` for unknown ID |
| `test_artifact_store_cleanup_session` | Removes all session artifacts |
| `test_artifact_store_cleanup_expired` | Removes expired artifacts |
| `test_artifact_store_stats` | Returns count and total bytes |
| `test_get_global_artifact_store_singleton` | Same instance returned twice |
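
TTL tests are easier to write without sleeping if the store's clock can be replaced. Whether the real `ArtifactStore` accepts an injectable clock is an assumption; if it doesn't, patching its time source achieves the same effect. Here is a minimal sketch with a hypothetical `SketchArtifactStore`:

```python
from __future__ import annotations

import tempfile
import uuid
from pathlib import Path
from typing import Callable, Optional


class SketchArtifactStore:
    """Hypothetical stand-in; the real ArtifactStore's constructor may differ."""

    def __init__(self, root: Path, ttl_s: float, clock: Callable[[], float]) -> None:
        self.root, self.ttl_s, self.clock = root, ttl_s, clock
        self._created: dict[str, float] = {}

    def store(self, content: str) -> str:
        artifact_id = uuid.uuid4().hex
        (self.root / artifact_id).write_text(content, encoding="utf-8")
        self._created[artifact_id] = self.clock()  # timestamp from the injected clock
        return artifact_id

    def retrieve(self, artifact_id: str) -> Optional[str]:
        created = self._created.get(artifact_id)
        if created is None:
            return None  # unknown ID
        if self.clock() - created > self.ttl_s:
            return None  # expired
        return (self.root / artifact_id).read_text(encoding="utf-8")
```

In a test, a mutable "now" value stands in for time, so expiry is a single assignment rather than a sleep.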

ReplayControl (8 tests)

| Test | What it verifies |
|------|------------------|
| `test_replay_first_call_allowed` | No prior history → allow |
| `test_replay_idempotent_duplicate_skipped` | Prior success → skip |
| `test_replay_idempotent_failure_retry_allowed` | Prior failure → allow |
| `test_replay_idempotent_timeout_retry_allowed` | Prior timeout → allow |
| `test_replay_non_idempotent_always_allowed` | `idempotent=False` → never skip |
| `test_replay_no_spec_conservative_dedup` | No `ToolSpec` → treated as idempotent |
| `test_replay_record_success_then_skip` | Record + check cycle |
| `test_replay_history_size` | Counter increments |
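
The decision table above reduces to a small function. This is a hypothetical sketch, not the `ReplayControl` API. `SketchReplayControl`, the status strings, and the `args_key` parameter are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchReplayControl:
    """Hypothetical stand-in that mirrors the ReplayControl decision table."""

    # Last recorded status per (tool name, canonical args): "success", "failure" or "timeout".
    history: dict[tuple[str, str], str] = field(default_factory=dict)

    def record(self, tool_name: str, args_key: str, status: str) -> None:
        self.history[(tool_name, args_key)] = status

    def should_execute(self, tool_name: str, args_key: str,
                       idempotent: Optional[bool]) -> bool:
        # idempotent=None means "no ToolSpec"; it is deduplicated like an idempotent tool.
        if idempotent is False:
            return True  # non-idempotent calls are never skipped
        # Only a prior success makes a repeat redundant; failures and timeouts may retry.
        return self.history.get((tool_name, args_key)) != "success"
```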

ToolScheduler (6 tests)

| Test | What it verifies |
|------|------------------|
| `test_scheduler_empty_calls` | Returns empty list |
| `test_scheduler_single_call` | Always sequential |
| `test_scheduler_parallel_read_only` | Two non-overlapping `READ_ONLY` → parallel wave |
| `test_scheduler_local_write_sequential` | `LOCAL_WRITE` → own sequential wave |
| `test_scheduler_resource_overlap_splits` | Overlapping `READ_ONLY` calls → two waves |
| `test_scheduler_parallel_disabled` | `parallel_enabled=False` → all sequential |
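
Here is a greedy wave planner that reproduces the six scheduler cases. It assumes calls are scanned in order and that a new wave starts whenever a call cannot join the open one. `Effect`, `Call`, and `plan_waves` are hypothetical names, and the real `ToolScheduler` may group calls differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    READ_ONLY = "read_only"
    LOCAL_WRITE = "local_write"


@dataclass(frozen=True)
class Call:
    name: str
    effect: Effect
    resources: frozenset[str]


def plan_waves(calls: list[Call], parallel_enabled: bool = True) -> list[list[Call]]:
    """Split calls into waves; calls within one wave may run in parallel."""
    waves: list[list[Call]] = []
    current: list[Call] = []
    claimed: set[str] = set()
    for call in calls:
        parallel_ok = parallel_enabled and call.effect is Effect.READ_ONLY
        if parallel_ok and not (call.resources & claimed):
            current.append(call)  # joins the open read-only wave
            claimed |= call.resources
            continue
        if current:  # close the open wave before starting another
            waves.append(current)
            current, claimed = [], set()
        if parallel_ok:
            current, claimed = [call], set(call.resources)
        else:
            waves.append([call])  # writes, or every call when parallelism is off, run alone
    if current:
        waves.append(current)
    return waves
```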

TurnPreparer (8 tests)

| Test | What it verifies |
|------|------------------|
| `test_turn_preparer_builds_messages` | System + user message present |
| `test_turn_preparer_includes_history` | History injected at correct position |
| `test_turn_preparer_memory_context` | Memory text injected as system message |
| `test_turn_preparer_render_hash_per_turn` | Hash changes when prompt changes |
| `test_turn_preparer_hash_not_sticky` | Two calls with different prompts → different hashes |
| `test_turn_preparer_skill_context` | Skill context message injected |
| `test_turn_preparer_no_skill_context` | No extra message when `skill_context=None` |
| `test_turn_preparer_resolved_model_inferred` | Model inferred from provider |
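
The two hash tests pass when the hash is derived from the rendered messages on every call rather than cached on the preparer. Here is a sketch of a deterministic render hash. The function name and the choice of JSON serialization plus SHA-256 are assumptions, not the preparer's actual scheme.

```python
from __future__ import annotations

import hashlib
import json


def render_hash(messages: list[dict]) -> str:
    """SHA-256 of the rendered messages, recomputed on every call (no caching)."""
    # sort_keys and fixed separators make equal message lists serialize identically.
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```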

BoundedToolExecutor (9 tests)

| Test | What it verifies |
|------|------------------|
| `test_executor_deadline_gate` | Expired budget → `ToolDenied(reason="deadline")` |
| `test_executor_successful_execution` | Returns `ToolExecutionResult` |
| `test_executor_tool_failure` | Exception → `ToolFailure` |
| `test_executor_timeout` | Thread timeout → `ToolTimeout` |
| `test_executor_late_write_suppressed` | Post-deadline completion → side effects skipped |
| `test_executor_oversized_output` | Output > 12k → `ToolArtifactReference` |
| `test_executor_web_mode_off` | `web_search` disabled when web mode is off |
| `test_executor_write_confirmation_denied` | Write denied → `ToolDenied(reason="write_denied")` |
| `test_executor_write_confirmation_approved` | Write approved → tool re-executed |
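
The timeout and failure rows map onto the standard `concurrent.futures` pattern. This sketch is an assumption about the mechanism, not the `BoundedToolExecutor` implementation. `Ok`, `TimedOut`, and `Failed` are hypothetical stand-ins for `ToolExecutionResult`, `ToolTimeout`, and `ToolFailure`. A timed-out worker thread keeps running in the background, which is why post-deadline side effects need their own gate.

```python
from __future__ import annotations

import concurrent.futures
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Ok:
    value: Any


@dataclass(frozen=True)
class TimedOut:
    timeout_s: float


@dataclass(frozen=True)
class Failed:
    error: str


# Shared worker pool. A timed-out call is abandoned, not killed: its thread runs on.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def run_bounded(fn: Callable[[], Any], timeout_s: float) -> Ok | TimedOut | Failed:
    future = _POOL.submit(fn)
    try:
        return Ok(future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        return TimedOut(timeout_s)
    except Exception as exc:  # the tool itself raised
        return Failed(f"{type(exc).__name__}: {exc}")
```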

ReflectionEngine (5 tests)

| Test | What it verifies |
|------|------------------|
| `test_reflection_budget_exhausted` | Budget=0 → stub result returned |
| `test_reflection_deadline_expired` | Expired budget → stub result |
| `test_reflection_success` | LLM call returns text → `ReflectionResult` |
| `test_reflection_llm_error` | Provider raises → stub result, no exception |
| `test_reflection_timeout_capped` | `timeout_s <= budget.remaining_s()` |

TurnFinalizer (10 tests)

| Test | What it verifies |
|------|------------------|
| `test_finalizer_returns_turn_result` | Return type and field values |
| `test_finalizer_citation_append` | Citations added when `show_citations=True` |
| `test_finalizer_no_citations` | No append when `show_citations=False` |
| `test_finalizer_memory_extraction` | `extract_turn()` called |
| `test_finalizer_memory_extraction_exception` | Exception swallowed |
| `test_finalizer_decision_record_direct` | `direct_answer` strategy |
| `test_finalizer_decision_record_retrieval` | `retrieval_augmented` strategy |
| `test_finalizer_decision_record_exception` | Exception swallowed |
| `test_finalizer_no_decision_store` | Skipped when `decision_store=None` |
| `test_finalizer_judge_scheduling` | `schedule_prompt_judges` called |
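
The exception-swallowing rows follow a common mocking pattern. Give the dependency a `side_effect` that raises, then assert two things: the turn still completes, and the hook was called. `finalize_sketch` and `decision_store.record()` are hypothetical names, not the `TurnFinalizer` API.

```python
import logging
from unittest.mock import MagicMock

log = logging.getLogger("finalizer_sketch")


def finalize_sketch(answer: str, memory, decision_store=None) -> str:
    """Hypothetical finalizer: best-effort hooks must never fail the turn."""
    try:
        memory.extract_turn(answer)
    except Exception:
        log.warning("memory extraction failed; continuing", exc_info=True)
    if decision_store is not None:  # skipped entirely when no store is configured
        try:
            decision_store.record(strategy="direct_answer")
        except Exception:
            log.warning("decision record failed; continuing", exc_info=True)
    return answer
```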

Writing New Tests

Testing a collaborator in isolation

Each collaborator takes all of its dependencies as constructor arguments, so a test can build one directly and pass a MagicMock for any dependency it doesn't exercise:

from unittest.mock import MagicMock
from brain.agent.turn_budget import TurnBudget
from brain.agent.turn_state import TurnRuntimeState
from brain.agent.tool_executor_v2 import BoundedToolExecutor

def test_my_feature():
    # Real budget and state objects, so deadline and bookkeeping logic actually run.
    budget = TurnBudget.create(timeout_s=60)
    state = TurnRuntimeState(session_id="test", budget=budget)

    # Every dependency this test doesn't exercise is a MagicMock.
    executor = BoundedToolExecutor(
        tools_registry=MagicMock(),
        callbacks=MagicMock(),
        event_log=MagicMock(),
        checkpoint_store=None,
        artifact_store=MagicMock(),
        thinking_manager=MagicMock(),
    )
    ...  # exercise the executor, then assert on state and on the mocks' calls
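
MagicMock accepts any attribute name, so a misspelled method call passes silently. Where that matters, `unittest.mock.create_autospec` builds a mock that rejects unknown attributes and checks call signatures. `ArtifactStoreLike` below is a hypothetical interface, not the real `ArtifactStore`.

```python
from unittest.mock import MagicMock, create_autospec


class ArtifactStoreLike:
    """Hypothetical interface standing in for a real collaborator."""

    def store(self, session_id: str, content: str) -> str:
        raise NotImplementedError


# A plain MagicMock records the typo as a new attribute and never complains.
loose = MagicMock()
loose.stroe("s1", "data")

# An autospecced mock only exposes methods the spec defines, with their signatures.
strict = create_autospec(ArtifactStoreLike, instance=True)
strict.store("s1", "data")
strict.store.assert_called_once_with("s1", "data")
```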

Testing timeout behaviour

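To simulate a ToolTimeout or ToolFailure outcome without real tool threads, replace execute on the harness's _bounded_executor instance rather than on the BoundedToolExecutor class, so only that harness is affected. The example below returns a non-retryable ToolFailure: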

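from brain.agent.tool_outcomes import ToolFailure

# Assigned as an instance attribute, so the fake takes no `self` parameter.
def fake_execute(call, state, *, web_mode, budget, tool_timeout_cap_s):
    # Report a failure immediately instead of running the tool.
    return ToolFailure(call_id=call.id, tool_name=call.name,
                       error="simulated timeout", retryable=False)

# `harness` is the harness instance under test, built earlier in the test.
harness._bounded_executor.execute = fake_execute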
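
Plain assignment leaves the fake in place until the harness object is discarded. `unittest.mock.patch.object` restores the original method when the `with` block exits, and pytest's `monkeypatch.setattr` does the same at test teardown. The self-contained sketch below uses a hypothetical `_Executor`, with a `SimpleNamespace` standing in for the harness.

```python
from types import SimpleNamespace
from unittest.mock import patch


class _Executor:
    """Hypothetical stand-in for BoundedToolExecutor."""

    def execute(self, call, state, *, web_mode, budget, tool_timeout_cap_s):
        return "real"


harness = SimpleNamespace(_bounded_executor=_Executor())


def fake_execute(call, state, *, web_mode, budget, tool_timeout_cap_s):
    return "simulated failure"


kwargs = dict(web_mode="off", budget=None, tool_timeout_cap_s=1.0)
with patch.object(harness._bounded_executor, "execute", fake_execute):
    inside = harness._bounded_executor.execute(None, None, **kwargs)
# On exit the instance attribute is removed, so the class method is visible again.
after = harness._bounded_executor.execute(None, None, **kwargs)
```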

Testing run_turn() end-to-end

Provide a _Provider class whose calls return ProviderResult objects with scripted tool_calls and text values, and a _Tools stub that implements tools.run() and tools.available_tool_schemas(). See tests/agent/test_agent_harness_timeouts.py for a complete example.