Agents fail in ways that traditional software doesn't. The debugging skills are different too.
Traditional debugging asks "why did this code path execute?" Agent debugging asks "why did the model choose this action?" The former traces deterministic execution; the latter reverse-engineers probabilistic decisions from limited observability. This fundamental difference shapes every debugging technique in this chapter.
The Debugging Mindset
Agent Debugging vs Traditional Debugging
| Traditional Code | Agent Systems |
|---|---|
| Deterministic execution | Probabilistic decisions |
| Stack traces show cause | Only inputs/outputs visible |
| Same input → same output | Same input → varying outputs |
| Bugs reproduce reliably | Failures may be intermittent |
| Fix the code | Fix the prompt, context, tools, OR code |
Agent debugging requires working backward from observed behavior to infer what the model "thought." The model's internal reasoning remains opaque, so debugging relies on:
- Structured logging at decision points
- Systematic hypothesis testing
- Understanding model behavior patterns
The Core Four Framework
Every agent failure traces to one or more of the "core four" components. Before debugging, identify which component is suspect:
| Component | Symptoms | First Check |
|---|---|---|
| Prompt | Wrong interpretation, missed instructions | Does the prompt clearly specify what NOT to do? |
| Model | Capability limits, reasoning errors | Can a more capable model do this task? |
| Context | Hallucination, outdated info, missed details | What context was actually available? |
| Tools | Wrong tool choice, tool errors, bad outputs | Did tools return what was expected? |
The diagnostic sequence: check tools first (easiest to verify in isolation), then context (what did the agent actually know?), then the prompt (any ambiguity?), and finally the model (capability limits).
Diagnostic Decision Tree
When an agent fails, work through this decision tree to identify root cause:
Step 1: Characterize the Failure
Q: What type of failure occurred?
| If the agent... | Jump to |
|---|---|
| Produced wrong output | → Step 2A |
| Got stuck / looped | → Step 2B |
| Stopped too early | → Step 2C |
| Used wrong tool | → Step 2D |
| Made something up | → Step 2E |
| Crashed / errored | → Step 2F |
Step 2A: Wrong Output
The agent completed but produced incorrect results.
- Check prompt clarity: Does the prompt specify the expected format? Are there ambiguous instructions?
- Check context completeness: Did the agent have access to all necessary information?
- Compare to examples: If using few-shot examples, do they match the desired behavior?
- Verify tool outputs: Did tools return correct data that the agent then misinterpreted?
Common root causes:
- Ambiguous success criteria in prompt
- Context window exceeded (lost early instructions)
- Examples contradicted by instructions
- Model capability insufficient for task complexity
Step 2B: Stuck / Looping
The agent repeats actions without progress or enters infinite loops.
- Check exit conditions: Is there a clear completion criterion?
- Review tool feedback: Are tool errors being ignored?
- Examine retry logic: Is the agent retrying the same failed action?
- Look for oscillation: Is the agent alternating between two states?
Common root causes:
- No explicit stopping condition
- Tool returning same error without useful feedback
- Agent misinterpreting partial success as failure
- Context pollution from repeated attempts
Step 2C: Stopped Too Early
The agent declared completion prematurely.
- Check success criteria: Are completion conditions too easily satisfied?
- Review final state: Did the agent miss steps in a multi-step process?
- Examine context usage: Did the task description get truncated?
Common root causes:
- Vague completion criteria ("do your best")
- Long task list in prompt, model stopped after early items
- Context window filled, later instructions dropped
- Model chose easier interpretation of ambiguous requirements
Step 2D: Wrong Tool Selection
The agent chose an inappropriate tool for the task.
- Check tool descriptions: Are tools clearly distinguished?
- Review similar tools: Are there tools with overlapping purposes?
- Examine tool availability: Was the correct tool available?
- Look for capability assumptions: Did the agent assume a tool could do something it couldn't?
Common root causes:
- Tool descriptions don't differentiate use cases
- Too many similar tools creating decision fatigue
- Correct tool not in allowed tools list
- Tool name or description misleading
Step 2E: Hallucination
The agent fabricated information not present in context.
- Verify context provided: Was the needed information actually in context?
- Check retrieval: If using RAG, did retrieval return relevant results?
- Review confidence signals: Did the agent express certainty about made-up facts?
- Examine pressure: Was the agent pressured to answer despite missing info?
Common root causes:
- Information genuinely missing from context
- Context too long, relevant info buried
- Prompt doesn't allow "I don't know" responses
- Similar but different information confused the model
Step 2F: Crash / Error
The agent or tooling produced an error.
- Read the error message: What does it actually say?
- Check tool configuration: Are tools properly configured?
- Review resource limits: Token limits? Rate limits? Timeouts?
- Examine input validation: Did invalid input cause the error?
Common root causes:
- API rate limits or authentication issues
- Context window overflow
- Tool returned unexpected format
- Environment configuration missing
Common Failure Modes
Context Overflow
Symptoms: Agent forgets early instructions, misses requirements mentioned at the start of a long conversation, produces outputs that ignore constraints specified initially.
Diagnosis:
- Check total token count vs model context limit
- Look for recency effects (late tokens dominate attention, early instructions fade)
- Note if failures correlate with conversation length
Root causes:
- Context budget exhausted by tool outputs
- Long system prompts combined with long conversations
- Accumulated context from multi-turn interactions
Fixes:
- Summarize intermediate results instead of preserving raw outputs
- Move critical instructions to end of prompt (recency bias)
- Implement context pruning strategies
- Split into multiple smaller agent calls
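The first two fixes can be enforced mechanically. Below is a minimal sketch of a context-budget guard; the 4-characters-per-token estimate, the limits, and the summarize() placeholder are illustrative assumptions, not any specific framework's API.

```python
MAX_CONTEXT_TOKENS = 100_000   # assumed model limit, for illustration
TOOL_OUTPUT_BUDGET = 2_000     # per-result cap before summarizing

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return len(text) // 4

def summarize(text: str, max_tokens: int) -> str:
    """Placeholder: in practice, compress with a cheap model instead of truncating."""
    return text[: max_tokens * 4] + "\n[...truncated; full output logged separately]"

def add_tool_output(context: list[str], output: str) -> None:
    """Append a tool result, compressing it if it would blow the budget."""
    if estimate_tokens(output) > TOOL_OUTPUT_BUDGET:
        output = summarize(output, TOOL_OUTPUT_BUDGET)
    context.append(output)
    total = sum(estimate_tokens(chunk) for chunk in context)
    if total > MAX_CONTEXT_TOKENS:
        # Surface the problem instead of silently letting early instructions fade.
        raise RuntimeError(f"Context budget exceeded: ~{total} tokens; prune or split the task")
```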
Tool Errors
Symptoms: Agent reports tool failures, produces partial results, or works around tool limitations in unexpected ways.
Diagnosis:
- Check tool return values in logs
- Verify tool authentication and configuration
- Test tools in isolation outside agent context
Root causes:
- Tool misconfiguration (API keys, endpoints)
- Tool returning error messages agent can't interpret
- Tool success format different than agent expects
- Race conditions in async tool calls
Fixes:
- Add structured error handling in tool implementations
- Provide clear error messages with remediation hints
- Validate tool outputs before returning to agent
- Add retry logic with exponential backoff
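A minimal sketch of the validation fix above: wrap the tool, check the output shape, and turn failures into messages the agent can act on. run_search_tool() and the required keys are hypothetical stand-ins.

```python
import json

REQUIRED_KEYS = {"results", "source"}  # assumed schema for this hypothetical tool

def run_search_tool(query: str) -> str:
    """Stand-in for a real tool call that returns a JSON string."""
    return json.dumps({"results": [f"doc about {query}"], "source": "index-v2"})

def call_tool_safely(query: str) -> dict:
    """Validate tool output before handing it back to the agent."""
    try:
        data = json.loads(run_search_tool(query))
    except json.JSONDecodeError:
        return {"error": "Tool returned non-JSON output; retry or report the tool as broken."}
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return {"error": f"Tool output missing fields: {sorted(missing)}"}
    return data

print(call_tool_safely("rate limits"))
```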
Hallucination
Symptoms: Agent states facts not in context, makes up file paths or function names, fabricates API responses, invents prior conversation history.
Diagnosis:
- Compare agent claims to actual context provided
- Check retrieval results if using RAG
- Look for "confident but wrong" patterns
Root causes:
- Information genuinely not in context (expected knowledge)
- RAG retrieval returning irrelevant results
- Similar information creating confusion
- Pressure to answer without "I don't know" escape hatch
Fixes:
- Explicitly state what information is NOT available
- Add "if information is not in context, say so" instruction
- Improve retrieval quality and relevance filtering
- Use verification tools (agent can check its own claims)
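The verification fix can be as simple as checking that whatever the agent cites actually exists. A sketch under assumed conventions (path-like citations, a local checkout); the regex and example paths are illustrative.

```python
import re
from pathlib import Path

def extract_cited_paths(answer: str) -> list[str]:
    """Pull things that look like relative source paths out of the answer."""
    return re.findall(r"\b[\w./-]+\.(?:py|md|js|ts)\b", answer)

def verify_citations(answer: str, repo_root: Path) -> list[str]:
    """Return cited paths that do not exist on disk (possible hallucinations)."""
    return [p for p in extract_cited_paths(answer) if not (repo_root / p).exists()]

missing = verify_citations("The check lives in auth.py and utils/validate.py", Path("."))
if missing:
    print(f"Possible hallucination, unverified citations: {missing}")
```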
Instruction Drift
Symptoms: Agent follows instructions accurately at first, gradually deviates over extended conversations, ignores constraints that were respected earlier.
Diagnosis:
- Compare early vs late behavior in same conversation
- Check if critical instructions are near prompt start
- Look for context approaching limits
Root causes:
- Context filling with conversation history
- Critical instructions too far from active context
- Competing instructions accumulated over time
Fixes:
- Repeat key constraints at intervals
- Use structured conversation resets
- Summarize and restart for long workflows
- Place critical rules in multiple prompt positions
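A minimal sketch of the first fix (repeat key constraints at intervals). The message shape and the every-five-turns cadence are assumptions.

```python
CONSTRAINT_REMINDER = (
    "Reminder of standing constraints:\n"
    "- Never use external dependencies\n"
    "- Ask before modifying files outside src/"
)

def with_reminder(messages: list[dict], turn: int, every: int = 5) -> list[dict]:
    """Re-inject the constraint block periodically so it stays near the end of context."""
    if turn > 0 and turn % every == 0:
        return messages + [{"role": "user", "content": CONSTRAINT_REMINDER}]
    return messages

history = [{"role": "user", "content": "Continue with step 6"}]
print(with_reminder(history, turn=10))
```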
Tool Selection Confusion
Symptoms: Agent chooses suboptimal tools, oscillates between similar tools, uses tool A when tool B would be more appropriate.
Diagnosis:
- Review available tools and their descriptions
- Check for overlapping tool capabilities
- Examine if tool descriptions distinguish use cases
Root causes:
- Too many tools with unclear differentiation
- Tool descriptions emphasize capabilities over use cases
- No guidance on tool selection strategy
Fixes:
- Reduce number of available tools
- Rewrite tool descriptions with clear "when to use" guidance
- Add tool selection hints in system prompt
- Use tool groups/categories to organize options
Premature Termination
Symptoms: Agent stops before completing all steps, declares success when only partially done, skips items in a list.
Diagnosis:
- Check completion criteria in prompt
- Review what the agent reported as "done"
- Look for truncation in task lists
Root causes:
- Vague or easily-satisfied completion criteria
- Long task lists where model attention fades
- Implicit rather than explicit requirements
Fixes:
- Add explicit completion checklist
- Use numbered steps with "confirm each step complete"
- Require verification of completion conditions
- Break long task lists into phases
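A sketch of the checklist fix: keep completion criteria as data and refuse to accept "done" until every item is confirmed. The items and the report format are illustrative assumptions.

```python
CHECKLIST = [
    "tests updated",
    "docs updated",
    "changelog entry added",
]

def incomplete_items(agent_report: set[str]) -> list[str]:
    """Return checklist items the agent has not confirmed as complete."""
    return [item for item in CHECKLIST if item not in agent_report]

report = {"tests updated", "docs updated"}
remaining = incomplete_items(report)
if remaining:
    # Feed this back to the agent instead of accepting "done".
    print(f"Not finished. Remaining steps: {remaining}")
```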
State Confusion in Multi-Agent Systems
Symptoms: Agent acts on outdated information, conflicts between parallel agents, inconsistent state after handoffs.
Diagnosis:
- Trace state passing between agents
- Check timestamp of context provided to each agent
- Verify state serialization/deserialization
Root causes:
- State not passed completely between agents
- Parallel agents modifying shared state
- Stale context provided to downstream agents
Fixes:
- Use explicit state objects passed between agents
- Implement read/write locks for shared state
- Refresh context before critical decisions
- Add state verification steps at handoffs
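A minimal sketch of the explicit-state-object and handoff-verification fixes; the field names are assumptions, not a framework API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PipelineState:
    task_id: str
    relevant_files: dict[str, str] = field(default_factory=dict)  # path -> full contents
    plan: str = ""
    updated_at: str = ""

    def stamp(self) -> None:
        self.updated_at = datetime.now(timezone.utc).isoformat()

def verify_handoff(state: PipelineState, required_files: list[str]) -> None:
    """Fail loudly if the downstream agent would start with missing content."""
    missing = [p for p in required_files if p not in state.relevant_files]
    if missing:
        raise ValueError(f"Handoff incomplete, missing file contents: {missing}")

state = PipelineState(task_id="feat-123", relevant_files={"auth.py": "def login(): ..."})
state.stamp()
try:
    verify_handoff(state, required_files=["auth.py", "middleware.py"])
except ValueError as err:
    print(err)  # Handoff incomplete, missing file contents: ['middleware.py']
```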
Debugging Tools and Techniques
Structured Logging
Effective agent debugging requires structured logs at decision points, not just outcomes.
What to log:
[DECISION] Tool selection
- Available tools: [list]
- Chosen tool: {tool}
- Selection reason: {from model output if available}
[TOOL_CALL] {tool_name}
- Input: {parameters}
- Output: {result}
- Duration: {ms}
- Status: {success/error}
[STATE] Context checkpoint
- Token count: {current}/{max}
- Key facts in context: [summary]
- Turn number: {n}
[VALIDATION] Output check
- Expected format: {schema}
- Actual format: {observed}
- Valid: {yes/no}
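One way to implement this template is as JSON lines, so runs can be grepped and diffed. A minimal sketch; the log path and field names are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("agent_events.jsonl")  # hypothetical location

def log_event(event: str, **fields) -> None:
    """Append one structured event per line."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **fields}
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

log_event("DECISION", available_tools=["file_search", "web_search"], chosen="file_search")
log_event("TOOL_CALL", tool="file_search", status="success", duration_ms=130)
log_event("STATE", token_count=18400, token_max=200000, turn=7)
log_event("VALIDATION", expected="json", actual="json", valid=True)
```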
Log analysis patterns:
- Search for tool calls that precede failures
- Track token count growth through conversation
- Compare successful vs failed runs for divergence points
Minimal Reproduction
When debugging, create the smallest case that reproduces the failure.
Reduction process:
- Start with the full failing case
- Remove context elements one at a time
- Simplify the prompt while preserving failure
- Reduce tool set to minimum needed
- Document the minimal reproduction case
Benefits:
- Faster iteration on fixes
- Clearer root cause identification
- Portable test case for regression prevention
A/B Testing Prompts
When the cause is unclear, test variations systematically.
Structure:
Base case: [original prompt]
Variation A: [change one element]
Variation B: [change different element]
Run each variation N times (N >= 5)
Compare success rates
What to vary:
- Instruction ordering
- Specificity of requirements
- Format of examples
- Position of constraints (beginning vs end)
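A sketch of the harness implied above; run_agent() and check_output() are placeholders for your actual agent call and success check.

```python
import random

def run_agent(prompt: str) -> str:
    """Placeholder agent call; replace with a real invocation."""
    return "ok" if random.random() > 0.3 else "wrong"

def check_output(output: str) -> bool:
    return output == "ok"

def success_rate(prompt: str, n: int = 5) -> float:
    """Run the variation n times and return the fraction of successes."""
    return sum(check_output(run_agent(prompt)) for _ in range(n)) / n

variants = {
    "base": "Summarize the report.",
    "A (format constraint added)": "Summarize the report. Output exactly three bullets.",
}
for name, prompt in variants.items():
    print(f"{name}: {success_rate(prompt):.0%} over 5 runs")
```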
Context Inspection
Examine exactly what the agent saw.
Techniques:
- Dump full context at failure point
- Highlight what was present vs missing
- Check token counts for each context section
- Verify encoding/special characters
Questions to answer:
- Was the needed information in context?
- Where in the context was it? (position matters)
- What other information competed for attention?
- Did formatting obscure important content?
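A sketch of the inspection questions above: per-section token estimates plus where (if anywhere) the needed fact sits. The section names and 4-characters-per-token heuristic are assumptions.

```python
def inspect_context(sections: dict[str, str], needle: str) -> None:
    """Print rough token counts per section and the approximate position of `needle`."""
    offset = 0
    found_at = None
    for name, text in sections.items():
        tokens = len(text) // 4
        print(f"{name}: ~{tokens} tokens")
        if found_at is None and needle in text:
            found_at = offset + text.index(needle) // 4
        offset += tokens
    if found_at is None:
        print(f"'{needle}' NOT present in context")
    else:
        print(f"'{needle}' sits near token ~{found_at} of ~{offset}")

inspect_context(
    {
        "system": "Never use external dependencies. " * 10,
        "history": "user: please add an HTTP client ... " * 200,
        "files": "def fetch(url): ...",
    },
    needle="Never use external dependencies",
)
```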
Trace Replay
For complex multi-step failures, replay the trace step by step.
Process:
- Capture full execution trace
- Identify the step where behavior first diverged
- Extract context as of that step
- Re-run from that point with variations
- Isolate the specific decision that went wrong
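A sketch of step 2 (finding the divergence point), assuming traces are stored as JSON lines like the logging sketch earlier in this chapter; the file names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

def load_trace(path: str) -> list[dict]:
    """Load one event per line from a JSONL trace file."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line]

def first_divergence(good: list[dict], bad: list[dict]) -> Optional[int]:
    """Compare event type and tool choice step by step; return the first differing index."""
    for i, (g, b) in enumerate(zip(good, bad)):
        if (g["event"], g.get("chosen")) != (b["event"], b.get("chosen")):
            return i
    return None if len(good) == len(bad) else min(len(good), len(bad))

# Usage (file names are hypothetical):
# step = first_divergence(load_trace("run_good.jsonl"), load_trace("run_bad.jsonl"))
# print(f"First divergence at step {step}; re-run from there with variations")
```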
Useful for:
- Multi-agent coordination failures
- Long workflow debugging
- Intermittent failures that are hard to reproduce
Model Comparison
When suspecting model capability, test across models.
| If behavior... | Suggests... |
|---|---|
| Fails on all models | Prompt/context issue, not model |
| Fails only on smaller models | Task complexity exceeds model capability |
| Fails inconsistently on same model | Edge case or prompt ambiguity |
| Succeeds on API but fails embedded | Environment/configuration issue |
Caution: Model comparison is expensive. Use only after ruling out prompt, context, and tool issues.
Anti-Patterns
Debugging Without Logs
What it looks like: Agent fails. The engineer stares at the prompt and tries to guess what went wrong. Changes are made based on intuition rather than data.
Why it fails:
- Debugging the theory of what happened, not what actually happened
- Intermittent failures are impossible to diagnose
- Can't distinguish "agent didn't try" from "agent tried and failed"
- Waste time fixing problems that don't exist
Better alternative: Instrument before debugging. Add logging at decision points, tool calls, and validation gates. Use event logs to trace execution. Read agent outputs and intermediate states. Debug the data, not the mental model.
Changing Multiple Things at Once
What it looks like: Agent isn't working right. The engineer simultaneously changes the prompt structure, swaps the model, adjusts the context window, and modifies the tool permissions. One of those fixes it—but which one?
Why it fails:
- Can't identify root cause
- Might have fixed one thing but broken another
- No reproducible solution to apply elsewhere
- Regression hunting becomes impossible
Better alternative: Change one variable at a time. Make a hypothesis, change only what tests that hypothesis, verify the impact. If that didn't work, revert and try the next hypothesis. Keep notes on what was tried.
Not Reproducing Before Fixing
What it looks like: Agent failed once. The engineer makes a change without verifying the original failure can be reproduced or that the fix prevents it.
Why it fails:
- "Fix" might do nothing, failure was intermittent
- Can't tell if future failures are regressions or new issues
- No confidence the problem is actually solved
- Accumulate superstitious fixes that don't address real problems
Better alternative: First, reproduce the failure reliably (even if just 2-3 times in a row). Document the repro steps. Then make the fix. Then verify the fix prevents the failure in 5+ consecutive runs. Without reproduction, there's no verification.
Assuming It's the Model
What it looks like: Agent produces wrong output. Model capability gets blamed immediately: "Haiku isn't smart enough for this" or "Opus hallucinates here." Switch to bigger/newer model without investigating further.
Why it fails:
- Most agent failures are prompt, context, or tool issues, not model capability
- Bigger models are slower and more expensive
- Doesn't fix the actual problem—new model often fails the same way
- Misses learning opportunity about prompt design
Better alternative: Check the other "core four" first. Is the instruction clear? Does the agent have the right context? Are tools working correctly? Is the output format well-specified? Model capability is usually the last resort, not the first suspect.
Debugging in Production
What it looks like: Agent fails in production. The engineer modifies the production prompt to add debugging output or try different approaches. Iteration happens on the live system.
Why it fails:
- Every iteration affects real users or systems
- Can't experiment freely without consequences
- Pressure leads to quick hacks instead of root cause fixes
- Lost changes if rollback is needed
Better alternative: Reproduce the failure locally or in a dev environment. Use production logs to understand the failure mode, but fix and test in isolation. Only deploy to production after verifying the fix in a safe environment.
Ignoring Partial Successes
What it looks like: Agent works 70% of the time, fails 30%. Focus goes only to the failures, trying to fix what went wrong. Successes are never analyzed for what makes them different.
Why it fails:
- Successes often show what conditions enable correct behavior
- Failure analysis alone doesn't reveal missing prerequisites
- Can't identify environmental or input patterns that matter
- Miss the insight that "it works when X is true"
Better alternative: Compare successes to failures. What's different about the inputs, context, or environment when it works? Often the pattern is "succeeds when context includes Y" or "fails when input is ambiguous." This points directly to the fix.
The Infinite Retry Loop
What it looks like: Add retry logic to handle intermittent failures. Agent fails, retries, fails, retries... infinitely, never surfacing the underlying issue.
Why it fails:
- Masks systemic problems with tactical patches
- Wastes tokens and time on doomed attempts
- No signal to humans that something is fundamentally broken
- Logs fill with noise, hiding real issues
Better alternative: Limit retries (2-3 max). On persistent failure, escalate to human or halt with detailed error. Log each retry with context. If retries are frequently triggered, that's a signal to fix the root cause, not increase the retry limit.
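A sketch of the bounded-retry-plus-escalation alternative; attempt_task() and escalate_to_human() are placeholders for your own agent call and escalation channel.

```python
import time

MAX_RETRIES = 3

def attempt_task() -> bool:
    """Placeholder for one agent attempt; returns True on success."""
    return False

def escalate_to_human(context: str) -> None:
    """Placeholder escalation channel (pager, ticket, Slack, etc.)."""
    print(f"ESCALATION: {context}")

def run_with_retries() -> bool:
    for attempt in range(1, MAX_RETRIES + 1):
        if attempt_task():
            return True
        print(f"Attempt {attempt}/{MAX_RETRIES} failed; backing off")
        time.sleep(2 ** attempt)  # exponential backoff between attempts
    escalate_to_human(f"Task failed after {MAX_RETRIES} attempts; see logs for each retry.")
    return False

run_with_retries()
```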
Debugging War Stories
Document specific debugging sessions—what happened, how you figured it out:
Multi-Agent Debugging Strategies
[2025-12-09]: Multi-agent systems have distinct failure modes that require different debugging approaches than single-agent systems.
Check Single-Message Parallelism First
When parallel agents run slower than expected or produce suspiciously sequential results, check whether they were invoked in a single message. This is the most common multi-agent debugging issue—developers expect parallelism but the framework serializes execution because Task calls were spread across multiple messages. Look at your orchestrator's outputs: are all parallel Task calls in one response?
Graceful Degradation Patterns
Multi-agent systems should degrade gracefully when individual agents fail:
- If one expert fails, note the failure and continue with available analyses
- Recommend manual review for failed expert domains
- Include recovery instructions in output
- Don't let one failing subagent tank the entire workflow
Implement this by checking subagent return status before synthesis. The orchestrator should be resilient to partial failures.
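A sketch of status-checking before synthesis; the {"status", "analysis"} result shape is an illustrative assumption.

```python
def synthesize(expert_results: dict[str, dict]) -> str:
    """Build a report from successful experts and flag failed domains for manual review."""
    sections, failures = [], []
    for domain, result in expert_results.items():
        if result.get("status") == "success":
            sections.append(f"## {domain}\n{result['analysis']}")
        else:
            failures.append(domain)
    report = "\n\n".join(sections)
    if failures:
        report += "\n\n## Incomplete\nManual review recommended for: " + ", ".join(failures)
    return report

print(synthesize({
    "security": {"status": "success", "analysis": "No issues found."},
    "performance": {"status": "error", "analysis": ""},
}))
```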
Partial Success Handling
Unlike a single agent (which either works or doesn't), multi-agent systems have complex success states:
- Commit successful changes before reporting failures
- Allow selective retry via phase parameters
- Never leave the workflow in an inconsistent state
- Track which subagents succeeded vs. failed
Use Hooks to Trace Agent Transitions
SubagentStop hooks record what each subagent produced. ErrorEscalation surfaces failures to the orchestrator. These create an audit trail for debugging:
- Which subagent ran when?
- What did it return?
- Where did it fail?
Without these hooks, multi-agent failures are nearly impossible to diagnose.
Domain Isolation Debugging
Each agent stays focused on its domain and doesn't need to know about other agents. If you see cross-domain confusion, something is leaking context incorrectly. The orchestrator is responsible for synthesis—individual experts should be testable in isolation.
Coordination vs. Agent Failures
Two distinct failure modes:
- Agent failure: Individual subagent does wrong thing (debug like single-agent)
- Coordination failure: Orchestrator routes incorrectly, synthesizes badly, or mismanages state
For coordination failures, the problem is usually in the orchestrator prompt or the spec file, not in the subagents themselves. Test subagents in isolation to rule out agent-level issues before debugging coordination.
Sources: Orchestrator Pattern, Building agents with the Claude Agent SDK
Testing and Debugging Hooks
Hooks are automation that runs at specific events in your workflow. When they break, they're invisible—the system just doesn't behave as expected. Here's how to diagnose and fix hook issues.
Manual Hook Testing
Test hooks outside of Claude Code before deploying them. This isolates hook logic from the rest of the system.
Basic Testing Pattern:
# Pipe JSON to stdin
echo '{"prompt": "/orchestrators:test"}' | python .claude/hooks/orchestrator_context.py
# Check exit code
echo $?
# Test with project environment variables
cd /path/to/project
echo '{"tool_name": "Write"}' | CLAUDE_PROJECT_DIR=$(pwd) python .claude/hooks/orchestrator_guard.py
Testing with Realistic Input:
Create test JSON files that mirror what Claude Code actually sends:
# test-pretool-input.json
{
"sessionId": "test-123",
"projectDir": "/Users/jayminwest/Projects/agentic-engineering-knowledge-base",
"tool_name": "Write",
"parameters": {
"file_path": "/path/to/file.md",
"content": "test content"
}
}
# Test the hook
cat test-pretool-input.json | python .claude/hooks/orchestrator_guard.py
Verifying Exit Codes:
Exit codes control hook behavior:
- Exit 0: Success, allow operation
- Exit 2: Block operation, show stderr to Claude
- Other: Non-blocking error, visible in debug mode only
# Test blocking behavior
cat test-input.json | python hook.py
if [ $? -eq 2 ]; then
echo "Hook correctly blocked the operation"
fi
Testing Environment File Interaction:
For hooks that read/write CLAUDE_ENV_FILE, test with a temporary file:
# Create temp environment file
export CLAUDE_ENV_FILE=$(mktemp)
# Run hook that writes to env file
echo '{"prompt": "/orchestrators:knowledge test"}' | python .claude/hooks/orchestrator_context.py
# Verify what was written
cat "$CLAUDE_ENV_FILE"
# Clean up
rm "$CLAUDE_ENV_FILE"Debugging Hook Issues
Use Debug Mode:
Run Claude Code with the --debug flag to see detailed hook execution:
claude --debug
This shows:
- Which hooks are triggered
- stdin/stdout/stderr content
- Exit codes
- Execution timing
- Decision results
Use /hooks Command:
Inside Claude Code, run /hooks to see all registered hooks and their configuration. This confirms:
- Hook is properly registered
- Matcher patterns are correct
- Hook file paths are valid
- Multiple hooks aren't conflicting
Common Hook Issues:
1. Hook Not Running:
Check matcher patterns. If you have:
{
"matcher": "Write",
"type": "command",
"command": "python hook.py"
}
This only matches the exact tool name "Write". It won't match "Edit" or "MultiEdit". List the related tools to broaden the match:
{
"matchers": ["Write", "Edit", "MultiEdit"]
}
2. Timeout Errors:
Default timeouts may be too short for complex hooks. Increase the timeout:
{
"type": "command",
"command": "python slow-hook.py",
"timeout": 30000
}
If timeouts persist, optimize the hook script. Hooks should complete in seconds, not minutes.
3. JSON Parse Errors:
Hooks must handle malformed JSON gracefully:
try:
    input_data = json.load(sys.stdin)
except json.JSONDecodeError:
    # Don't crash - allow operation to continue
    sys.exit(0)
Crashing on bad JSON breaks Claude Code's workflow. Exit 0 to allow operations when you can't parse input.
4. Permission Denied:
Hook scripts must be executable:
chmod +x .claude/hooks/orchestrator_context.py
If using #!/usr/bin/env -S uv run --script, ensure uv is installed and in PATH.
Verifying Orchestrator Enforcement
The orchestrator pattern uses two hooks working together:
- orchestrator_context.py (UserPromptSubmit) - Sets context when orchestrator commands run
- orchestrator_guard.py (PreToolUse) - Blocks Write/Edit tools in orchestrator context
Verify Context is Set:
When you run /orchestrators:knowledge or /questions:orchestrator, check that the context is set:
# During a Claude Code session, in another terminal:
cat "$HOME/.claude/env/session-<id>.env"
# Should contain:
# export CLAUDE_ORCHESTRATOR_CONTEXT="knowledge"
The session ID can be found via /hooks output or by listing ~/.claude/env/ files sorted by modification time.
Verify Tools are Blocked:
When orchestrator context is active, attempting to use Write or Edit should fail with a clear error message:
BLOCKED: Orchestrator 'knowledge' cannot use Write directly.
Delegate to a build agent using the Task tool instead.
Example: Task(subagent_type='knowledge-build-agent', prompt='...')
If you're not seeing this error when you should, check:
- Is CLAUDE_ORCHESTRATOR_CONTEXT actually set in the env file?
- Is the PreToolUse hook registered? (use /hooks to verify)
- Is the hook script executable and working? (test manually)
What to Look For in Error Messages:
Good hook errors should:
- Identify the problem clearly: "BLOCKED: Orchestrator 'X' cannot use Y"
- Explain why: "Delegate to a build agent"
- Show how to fix it: "Example: Task(...)"
If errors are vague or missing, add better error messages to your hooks. Write to stderr for blocking errors:
print(
f"BLOCKED: Orchestrator '{orchestrator}' cannot use {tool_name} directly.\n"
f"Delegate to a build agent using the Task tool instead.\n"
f"Example: Task(subagent_type='knowledge-build-agent', prompt='...')",
file=sys.stderr
)
sys.exit(2)
Hook Logging Pattern
Hooks are hard to debug because they're invisible during normal operation. Add logging to understand what's happening.
Logging to File:
Don't use stdout (it becomes context for Claude). Write to a log file instead:
import json
import sys
from pathlib import Path
from datetime import datetime
LOG_FILE = Path.home() / ".claude" / "logs" / "orchestrator_guard.log"
def log_hook_execution(message: str, data: dict = None):
    """Log hook execution for debugging."""
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().isoformat()
    log_entry = f"[{timestamp}] {message}"
    if data:
        log_entry += f"\n{json.dumps(data, indent=2)}"
    with LOG_FILE.open("a") as f:
        f.write(log_entry + "\n\n")

def main():
    try:
        input_data = json.load(sys.stdin)
        tool_name = input_data.get("tool_name", "")
        log_hook_execution("PreToolUse triggered", {
            "tool": tool_name,
            "orchestrator": get_orchestrator_context()
        })
        # ... rest of hook logic
    except Exception as e:
        log_hook_execution(f"Hook error: {e}")
        sys.exit(0)
What to Log:
Log decision points, not every line:
- Hook triggered (with input summary)
- Context detected (orchestrator name, environment state)
- Decisions made (allow/block, why)
- Errors encountered
Reviewing Logs:
Tail the log during a Claude Code session:
tail -f ~/.claude/logs/orchestrator_guard.log
This shows hook execution in real-time, helping you understand:
- Which hooks are actually running
- What input they're receiving
- What decisions they're making
- Where errors occur
Log Rotation:
Hook logs can grow large. Implement basic rotation:
def rotate_log_if_needed():
"""Keep log file under 1MB."""
if LOG_FILE.exists() and LOG_FILE.stat().st_size > 1_000_000:
# Move old log to backup
backup = LOG_FILE.with_suffix(".log.old")
LOG_FILE.rename(backup)Conditional Logging:
Enable detailed logging only when debugging:
import os

DEBUG = os.environ.get("CLAUDE_HOOK_DEBUG", "false") == "true"

def log_hook_execution(message: str, data: dict = None):
    if not DEBUG:
        return
    # ... logging code
Then enable when needed:
export CLAUDE_HOOK_DEBUG=true
claude --debug
This keeps logs clean during normal operation while providing detail when troubleshooting.
Case Studies
Case Study 1: The Disappearing Instructions
Symptom: Code generation agent produced output that violated explicit constraints. The constraint "never use external dependencies" was clearly stated, but the agent added import requests to the code.
Investigation:
- Checked the prompt—constraint was present at line 5 of a 200-line system prompt
- Estimated token count: system prompt (4k) + conversation history (12k) + code context (8k) = 24k tokens
- Constraint was in the first 3% of context
- Model exhibited recency bias—late tokens dominated attention
Root cause: Context overflow creating effective instruction loss. The constraint existed but was too far from active generation for the model to weight it appropriately.
Fix:
- Moved critical constraints to the end of the system prompt
- Added constraint reminders in user messages
- Implemented a "constraints" section that gets repeated before code generation
Lesson: Position matters. Critical instructions near the end of context get more attention than those at the beginning in long contexts.
Case Study 2: The Tool Selection Oscillation
Symptom: Agent alternated between file_search and web_search for the same query, never settling on one approach. Task timed out after 15 iterations.
Investigation:
- Logged tool selection reasoning
- Both tools returned partial results
- Agent interpreted partial results as "wrong tool chosen"
- Each tool's partial success triggered switching to the other
Root cause: No guidance on tool selection criteria. Tools had similar descriptions. Agent had no way to determine which was "correct" when both returned something.
Fix:
- Added explicit tool selection criteria to prompt: "Use file_search for content already in context. Use web_search only for information not in any provided files."
- Added "stick with your choice unless result is empty" instruction
- Implemented a max-switches-per-query limit
Lesson: Tool descriptions must include differentiation criteria, not just capability descriptions.
Case Study 3: The Phantom Hallucination
Symptom: Agent confidently cited a function validate_user_input() that didn't exist in the codebase. Claimed it was in auth.py at line 45.
Investigation:
- Searched codebase—no such function
- Checked RAG retrieval—returned validate_input() from forms.py
- Context also included user_auth_check() from auth.py
- Agent combined parts of both into a fabricated function name and location
Root cause: Conflation of similar-but-different retrieved results. RAG returned relevant but not exact matches; agent synthesized a plausible-sounding but non-existent reference.
Fix:
- Added verification step: "Before citing any function, use grep to confirm it exists"
- Improved RAG retrieval to return exact matches when available
- Added instruction: "If you cannot find the exact function, say so rather than approximating"
Lesson: Hallucinations often emerge from synthesis of partially-correct information, not pure fabrication. Verification tools prevent confident-but-wrong citations.
Case Study 4: The Multi-Agent State Drift
Symptom: In a three-agent pipeline (scout → planner → builder), the builder produced code that contradicted the planner's specifications. Scout found the right files, planner made a correct plan, builder ignored parts of the plan.
Investigation:
- Traced state passing between agents
- Scout returned: "Found 3 relevant files: auth.py, middleware.py, routes.py"
- Planner created spec referencing all three files
- Builder received spec but only "auth.py" was in its context—the spec mentioned the files but didn't include their contents
Root cause: Spec file referenced files by name, but builder didn't have file contents. Builder proceeded with incomplete context rather than requesting missing information.
Fix:
- Changed spec format to include file contents, not just names
- Added builder validation: "Verify all referenced files are accessible before proceeding"
- Implemented explicit context handoff verification at phase transitions
Lesson: "Mentioned in spec" is not "available in context." State handoffs must include actual content, not references to content.
Case Study 5: The Intermittent Success
Symptom: Identical prompt, identical context, but 60% success rate. Some runs produced correct output, others failed with apparent reasoning errors.
Investigation:
- Compared successful vs failed runs
- No obvious pattern in inputs
- Examined model temperature—set to 0.7
- Tracked reasoning chains: successful runs took different reasoning paths
Root cause: Prompt had multiple valid interpretation paths. At temperature 0.7, model sometimes followed the successful path, sometimes followed a plausible but wrong path.
Fix:
- Reduced temperature to 0.1 for this task
- Added explicit step-by-step reasoning structure to reduce path variance
- Made success criteria unambiguous
Lesson: Intermittent failures with identical inputs often indicate prompt ambiguity amplified by temperature. Low temperature + explicit structure reduces variance.
Debugging Checklist
Quick reference for systematic debugging:
Before Starting
- Can the failure be reproduced? (Try 3 times)
- Is the failure documented? (Inputs, outputs, error messages)
- Is there logging/tracing available?
Initial Diagnosis
- Categorize failure type (wrong output, stuck, early stop, tool error, hallucination, crash)
- Check which "core four" component is suspect (tools → context → prompt → model)
- Review recent changes (what was different when it worked?)
Investigation
- Examine exact context provided to agent
- Check token counts vs limits
- Review tool call inputs and outputs
- Compare successful vs failed runs
Fix Verification
- Change only one variable at a time
- Verify fix in 5+ consecutive runs
- Document what was tried and what worked
- Add logging to detect regression
Expanded Connections
- To Evaluation: Systematic evaluation catches bugs before they become debugging sessions. Eval failures provide reproducible test cases for debugging. Debug fixes should be validated against eval sets.
- To Prompt: Many "agent bugs" are actually prompt bugs. Ambiguous instructions, missing constraints, and unclear formats cause predictable failures. Debug the prompt before debugging the agent.
- To Context: Context issues cause hallucination, instruction drift, and state confusion. Debugging often reveals context wasn't what you thought. Context inspection should be routine.
- To Tool Use: Tool errors, wrong tool selection, and tool output misinterpretation cause observable failures. Tools are the easiest component to verify in isolation.
- To Model Behavior: Understanding model attention patterns, recency bias, and capability limits informs debugging hypotheses. Model behavior explains why position and format matter.
- To Production Concerns: Production debugging requires logging, monitoring, and observability infrastructure. Invest in debugging capability before production deployment.
- To Orchestrator Pattern: Multi-agent debugging requires understanding coordination failures vs agent failures. Orchestrator patterns have distinct failure modes documented in this chapter.
Open Questions
- How do you debug emergent behavior in multi-agent systems where no single agent is wrong?
- What's the minimum viable observability for agent systems in production?
- How do you distinguish model capability limits from prompt/context issues without expensive A/B testing?
- Can agents be designed to self-diagnose and report their own failure modes?