Agent Workflow & Orchestration Issues | Troubleshooting

When you chain multiple agents together for real work, the problems shift from "model got it wrong" to "the workflow went sideways" — handoffs lose context, one branch burns the whole budget, two agents edit the same file, a critical tool call is missing from the trace, checkpoint restore desyncs state. This hub is vendor-agnostic: orchestrator deadlocks, mis-routed tasks, retry storms, missing cycle detection, orphaned subprocesses, too-loose promotion criteria, prompt-template drift across versions. Difference vs. [[claude-code-agent-issues]] / [[codex-agent-issues]]: those hubs fix a single tool`s internal failure; this hub fixes the orchestration layer when multiple agents collaborate.

Common problems

Agent handoff loses context between steps
Previous agent`s conclusions don`t reach the next; use a structured handoff schema.
Agent budget exhausted halfway through the task
One branch eats the whole token budget; per-phase caps + early-stop.
Two parallel agents editing the same file
No file locking; use a file-ownership map or worktree isolation.
Agent orchestrator deadlocks waiting on each other
A waits for B, B waits for A; add timeouts + DAG validation.
Task routed to the wrong agent
Classifier sends a coding task to the writing agent; add a router validation step.
Agent state desyncs after restart
Checkpoint half-written; atomic writes + explicit version stamps.
Agent skipped a required validation step
"Looks fine" promoted prematurely; enforce a must-pass gate.
Agent output not machine-parseable
Decorated with markdown; use JSON mode + schema validation.
Flaky tool triggers an agent retry storm
No backoff; exponential backoff + circuit breaker.
Critical tool call missing from the agent trace
Synchronous execution unlogged; unify instrumentation entry points.
Promotion criteria too loose — bad output slips through
Only "non-empty" checked; layer structural, content, semantic gates.
Cost tracking misses sub-agent usage
Parent logged, subagents not; report into a central ledger.
One agent`s rate limit cascades into a chain failure
Shared API key; per-agent keys + global limiter.
Restored checkpoint is corrupted
Writes not fsync`d; append-only WAL + checksums.
Agent skipped a pre-flight check it was supposed to run
Marked "optional" in prompt; rewrite as must-do.
Agent`s subprocess orphaned after agent exits
No process group; spawn detached + cleanup hook.
Shared memory corrupted by overlapping agent writes
No CAS; use versioned writes or a single-writer pattern.
Cycle in agent call graph goes undetected
A→B→C→A; trace-id chain + max-depth guard.
Prompt templates drift between agent versions
Prod / test use different versions; hash + registry.
Agent output leaks secrets into downstream logs
Tool echoed full env; add a redaction filter pre-output.

🕸️ Agent Workflow & Orchestration Issues

Common problems