🕸️ Agent Workflow & Orchestration Issues
Agent orchestration, handoffs, budget exhaustion, parallel conflicts, state drift, trace loss, checkpoint corruption.
When you chain multiple agents together for real work, the problems shift from "model got it wrong" to "the workflow went sideways" — handoffs lose context, one branch burns the whole budget, two agents edit the same file, a critical tool call is missing from the trace, checkpoint restore desyncs state. This hub is vendor-agnostic: orchestrator deadlocks, mis-routed tasks, retry storms, missing cycle detection, orphaned subprocesses, too-loose promotion criteria, prompt-template drift across versions. Difference vs. [[claude-code-agent-issues]] / [[codex-agent-issues]]: those hubs fix a single tool`s internal failure; this hub fixes the orchestration layer when multiple agents collaborate.
Common problems
- Agent handoff loses context between steps Previous agent`s conclusions don`t reach the next; use a structured handoff schema.
- Agent budget exhausted halfway through the task One branch eats the whole token budget; per-phase caps + early-stop.
- Two parallel agents editing the same file No file locking; use a file-ownership map or worktree isolation.
- Agent orchestrator deadlocks waiting on each other A waits for B, B waits for A; add timeouts + DAG validation.
- Task routed to the wrong agent Classifier sends a coding task to the writing agent; add a router validation step.
- Agent state desyncs after restart Checkpoint half-written; atomic writes + explicit version stamps.
- Agent skipped a required validation step "Looks fine" promoted prematurely; enforce a must-pass gate.
- Agent output not machine-parseable Decorated with markdown; use JSON mode + schema validation.
- Flaky tool triggers an agent retry storm No backoff; exponential backoff + circuit breaker.
- Critical tool call missing from the agent trace Synchronous execution unlogged; unify instrumentation entry points.
- Promotion criteria too loose — bad output slips through Only "non-empty" checked; layer structural, content, semantic gates.
- Cost tracking misses sub-agent usage Parent logged, subagents not; report into a central ledger.
- One agent`s rate limit cascades into a chain failure Shared API key; per-agent keys + global limiter.
- Restored checkpoint is corrupted Writes not fsync`d; append-only WAL + checksums.
- Agent skipped a pre-flight check it was supposed to run Marked "optional" in prompt; rewrite as must-do.
- Agent`s subprocess orphaned after agent exits No process group; spawn detached + cleanup hook.
- Shared memory corrupted by overlapping agent writes No CAS; use versioned writes or a single-writer pattern.
- Cycle in agent call graph goes undetected A→B→C→A; trace-id chain + max-depth guard.
- Prompt templates drift between agent versions Prod / test use different versions; hash + registry.
- Agent output leaks secrets into downstream logs Tool echoed full env; add a redaction filter pre-output.