You build a CrewAI or LangGraph router with three specialists: a “code agent,” a “test agent,” and a “docs agent.” You submit “Write a unit test for the authentication module.” The router sends it to the docs agent, which produces a Markdown README about authentication. Or, in an AutoGen setup, a “database migration” task routes to the general-purpose assistant instead of the migration specialist, and the assistant runs ALTER TABLE directly on the production connection instead of generating a migration file. Routing misfires waste tokens, produce wrong output, and, in the worst cases, trigger the wrong side effects with the wrong tools.
Common causes
1. Router prompt is too vague — category descriptions overlap
The most common cause. You defined agent roles in natural language (“handles code tasks,” “handles test tasks”), but the router model must choose between them for ambiguous inputs. “Write a test” involves both code and tests. “Update the migration” involves both database and code. Overlapping category descriptions produce inconsistent routing.
How to spot it: Take the last 10 misrouted tasks and, for each, note the agent it went to and the agent it should have gone to. Then compare those agents’ descriptions. A pair whose embedded descriptions have a cosine similarity above roughly 0.85, or whose keywords largely overlap, is a likely source of recurring misroutes.
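The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not a production check: the descriptions are hypothetical, and `bag_of_words` is a standard-library stand-in for a real embedding model. Word-count cosine scores are not on the same scale as embedding scores, so the 0.85 figure does not carry over; only the ranking of pairs is meaningful here.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately vague descriptions to show the overlap.
DESCRIPTIONS = {
    "code_agent": "handles code tasks and writes code",
    "test_agent": "handles test tasks and writes code for tests",
    "docs_agent": "writes documentation and README files",
}


def bag_of_words(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_overlaps(descriptions: dict) -> list:
    """Return (similarity, agent_a, agent_b) tuples, most similar pair first."""
    vectors = {name: bag_of_words(text) for name, text in descriptions.items()}
    pairs = [
        (cosine(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(pairs, reverse=True)


for score, a, b in rank_overlaps(DESCRIPTIONS):
    print(f"{a} vs {b}: {score:.2f}")
```

The pair at the top of the ranking is the first candidate for a description rewrite.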
2. Few-shot examples in the router prompt are unrepresentative
The router has 3-4 example tasks per agent. Those examples all use specific terminology (“write a Jest test,” “create a Sequelize migration”). Real tasks use different phrasing (“add coverage for the login flow,” “bump the schema”). The model often fails to generalize from a handful of examples to novel phrasing.
How to spot it: Collect 20 recent misrouted tasks and compare their phrasing with the examples in the router prompt. If most of the misrouted tasks use wording that appears in none of the examples, the example set is too narrow.
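One rough way to run that comparison is token overlap between each misrouted task and its nearest prompt example. The sketch below assumes Jaccard similarity over lowercase words and an arbitrary cutoff of 0.5; the sample tasks, the helper names, and the cutoff are all illustrative choices rather than recommended values.

```python
import re

# Hypothetical data: examples from the router prompt and recent misroutes.
ROUTER_EXAMPLES = [
    "write a Jest test for the login form",
    "create a Sequelize migration for the users table",
]
MISROUTED_TASKS = [
    "add coverage for the login flow",
    "bump the schema",
    "write a Jest test for the signup form",
]

NOVELTY_CUTOFF = 0.5  # assumed value; tune it against your own data


def tokens(text: str) -> set:
    """Lowercase word set for a task or example."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_example_score(task: str, examples: list) -> float:
    """Highest token overlap between the task and any prompt example."""
    task_tokens = tokens(task)
    return max(jaccard(task_tokens, tokens(example)) for example in examples)


# Misrouted tasks whose wording is far from every example in the prompt.
novel = [
    task for task in MISROUTED_TASKS
    if nearest_example_score(task, ROUTER_EXAMPLES) < NOVELTY_CUTOFF
]
print(f"{len(novel)} of {len(MISROUTED_TASKS)} misroutes use unseen phrasing")
```

If most misroutes land in `novel`, the example set is the likely culprit; if few do, look at the other causes first.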
3. Router uses keyword matching instead of semantic classification
The router checks if "sql" in task.lower() and routes to the database agent. A task like “fix the SQL injection vulnerability in the auth layer” hits the database agent keyword but should go to the security agent. Keyword matching cannot handle context.
How to spot it: Read the routing code. If it contains in task.lower(), startswith, re.match on keywords, or a simple if/elif chain, it is keyword-based and will misfire on context-dependent tasks.
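To make the failure mode concrete, here is a small keyword router of the kind described above. The function and agent names are illustrative only. It shows two misfires: the SQL injection task from the text, and a substring accident where "latest" happens to contain "test".

```python
def keyword_route(task: str) -> str:
    """Naive keyword router (illustrative only; this is the anti-pattern)."""
    lowered = task.lower()
    if "sql" in lowered:
        return "database_agent"
    elif "test" in lowered:
        return "test_agent"
    return "code_agent"


# A security task is sent to the database agent because it mentions SQL.
print(keyword_route("Fix the SQL injection vulnerability in the auth layer"))

# A build task is sent to the test agent because "latest" contains "test".
print(keyword_route("Ship the latest build to staging"))
```

Neither task has anything to do with the agent it lands on, and no amount of extra keywords fixes the underlying problem.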
4. Missing “default” or “ambiguous” route — router picks the closest wrong match
When no good match exists, the router routes to the first agent or the one with the highest softmax probability even when that probability is 0.52 vs. 0.48. There is no “I’m not sure” path that escalates to a human or asks for clarification.
How to spot it: Add confidence logging to the router. If decisions with confidence below about 0.7 account for a disproportionate share of the misrouted outputs, the router needs a confidence threshold with an escalation path.
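Once confidence is logged and a sample of decisions has been reviewed by hand, the correlation can be checked with a simple bucket comparison. The records below are made-up values for illustration, and `misroute_rate` is a hypothetical helper.

```python
# Hypothetical reviewed log records: (confidence, was_misrouted).
REVIEWED = [
    (0.52, True), (0.61, True), (0.66, False), (0.68, True),
    (0.74, False), (0.81, False), (0.88, False), (0.93, False),
    (0.97, True), (0.99, False),
]


def misroute_rate(records, low: float, high: float):
    """Fraction of misroutes among decisions with low <= confidence < high."""
    bucket = [bad for confidence, bad in records if low <= confidence < high]
    return sum(bucket) / len(bucket) if bucket else None


below = misroute_rate(REVIEWED, 0.0, 0.7)
above = misroute_rate(REVIEWED, 0.7, 1.01)
print(f"misroute rate below 0.7: {below:.0%}, at or above 0.7: {above:.0%}")
```

A large gap between the two rates suggests a threshold would catch most misroutes; a small gap suggests the confidence score is not informative and another cause is at work.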
5. Agent capability list is stale — agent was deprecated or renamed
The orchestrator’s routing table references agent_v2_code but the actual active agent is agent_v3_code_and_test. The v2 agent either no longer exists (routing fails silently and falls through to a default) or still exists but lacks recent capabilities (test writing was added in v3).
How to spot it: List all agent IDs in the routing table and compare against the list of currently active agent instances. Any ID in the routing table that doesn’t match an active agent is stale.
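The comparison amounts to a set difference. A minimal sketch follows, assuming the routing table is a dict from route label to agent ID and the active agents are available as a set; both structures and the `agent_v1_docs` ID are assumptions for the example.

```python
# Hypothetical routing table and active-agent list.
ROUTING_TABLE = {
    "code": "agent_v2_code",
    "docs": "agent_v1_docs",
}
ACTIVE_AGENTS = {"agent_v3_code_and_test", "agent_v1_docs"}


def find_stale_routes(routing_table: dict, active: set) -> dict:
    """Routes whose target agent ID is not among the active agents."""
    return {
        label: agent_id
        for label, agent_id in routing_table.items()
        if agent_id not in active
    }


def find_unrouted_agents(routing_table: dict, active: set) -> set:
    """Active agents that no route points at."""
    return active - set(routing_table.values())


print("stale routes:", find_stale_routes(ROUTING_TABLE, ACTIVE_AGENTS))
print("unrouted agents:", find_unrouted_agents(ROUTING_TABLE, ACTIVE_AGENTS))
```

Checking both directions is useful: a stale route usually comes paired with a new agent that nothing routes to yet.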
6. Task description is too short — router lacks enough signal
“Fix it” — two words — gives the router nothing to work with. It routes by guessing, and guesses wrong. Short tasks often occur when the orchestrator summarizes a larger task before routing.
How to spot it: Compare the median word count of misrouted task descriptions with that of correctly routed ones. If the misrouted tasks are significantly shorter (for example, under 30 words), brevity is a likely cause.
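The length comparison takes a few lines with the standard library. This sketch measures length in words and uses small made-up samples; in practice the two lists would come from reviewed routing logs.

```python
from statistics import median

# Hypothetical samples taken from reviewed routing logs.
MISROUTED = [
    "Fix it",
    "Update the migration",
    "Make the tests pass",
]
CORRECTLY_ROUTED = [
    "Write a unit test for the authentication module",
    "Document the JWT decoder API for external consumers",
    "Fix the JWT decoder implementation so expired tokens are rejected",
]


def median_words(tasks: list) -> float:
    """Median number of whitespace-separated words per task."""
    return median(len(task.split()) for task in tasks)


print("misrouted median words:", median_words(MISROUTED))
print("correctly routed median words:", median_words(CORRECTLY_ROUTED))
```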
Shortest path to fix
Step 1: Log every routing decision with the task text and confidence score
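import logging

logger = logging.getLogger(__name__)

# router_model is assumed to expose classify(), returning labels and scores
# sorted from most to least likely. AGENT_REGISTRY maps agent names to agents.
def route_task(task: str, router_model) -> tuple[str, float]:
    response = router_model.classify(
        task,
        labels=list(AGENT_REGISTRY.keys()),
        return_scores=True,
    )
    top_agent = response.labels[0]
    confidence = response.scores[0]
    # Truncate the task text so long inputs do not flood the log.
    logger.info(
        "ROUTE: agent=%s confidence=%.3f task=%r",
        top_agent, confidence, task[:120],
    )
    return top_agent, confidence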
Review the last 50 routing decisions to find patterns in misroutes.
Step 2: Add a confidence threshold with an escalation path
CONFIDENCE_THRESHOLD = 0.75

def route_with_fallback(task: str) -> str:
    # router_model is the module-level classifier passed to route_task in Step 1.
    agent, confidence = route_task(task, router_model)
    if confidence < CONFIDENCE_THRESHOLD:
        logger.warning(
            "Low-confidence route (%.2f); escalating to clarification agent",
            confidence,
        )
        return "clarification_agent"
    return agent
The clarification agent asks one question to disambiguate, then re-routes with more context.
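The ask-once-then-re-route behavior might look like the sketch below. Everything here is an assumption made for illustration: the `classify` and `ask` callables, the `human_review` fallback name, and the stub classifier, which uses a keyword check only so that the example runs without a model.

```python
from typing import Callable, Tuple


def route_with_clarification(
    task: str,
    classify: Callable[[str], Tuple[str, float]],
    ask: Callable[[str], str],
    threshold: float = 0.75,
) -> Tuple[str, str]:
    """Return (agent, task_text), asking at most one clarifying question."""
    agent, confidence = classify(task)
    if confidence >= threshold:
        return agent, task

    # Ask exactly one question, then re-route with the answer appended.
    answer = ask(f"What should this task produce? Task: {task!r}")
    enriched = f"{task}\nClarification: {answer}"
    agent, confidence = classify(enriched)
    if confidence < threshold:
        # Still unsure after one question: hand off instead of guessing.
        return "human_review", enriched
    return agent, enriched


def stub_classify(task: str) -> Tuple[str, float]:
    """Demo-only classifier; a real router would call a model here."""
    if "test" in task.lower():
        return "test_agent", 0.9
    return "code_agent", 0.5


agent, final_task = route_with_clarification(
    "Fix it", stub_classify, ask=lambda question: "The login tests need new assertions"
)
print(agent)
```

Capping the loop at one question keeps the escalation path from becoming a conversation; if one answer does not resolve the ambiguity, a human is usually the better next step.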
Step 3: Rewrite agent descriptions to be mutually exclusive
Replace vague descriptions with explicit scope boundaries:
AGENT_DESCRIPTIONS = {
    "code_agent": (
        "Writes, edits, or refactors production source code in .py, .ts, .go files. "
        "Does NOT write tests, migration files, or documentation."
    ),
    "test_agent": (
        "Writes or edits test files (*.test.ts, test_*.py, *_spec.rb). "
        "Does NOT edit production source files or migration files."
    ),
    "migration_agent": (
        "Generates database migration files using the project's migration framework. "
        "Never runs migrations directly; only creates the migration file."
    ),
}
The “Does NOT” clauses are as important as the “Does” clauses for preventing overlap.
Step 4: Expand few-shot examples to cover diverse phrasing
For each agent, add at least 10 examples that cover:
- Direct phrasing (“write a test for X”)
- Indirect phrasing (“add coverage for X”)
- Jargon variants (“spec for X,” “unit test for X,” “test case for X”)
- Cross-domain tasks that should NOT route here (“fix the code that X tests” → code_agent, not test_agent)
TEST_AGENT_EXAMPLES = [
    "Write a unit test for the authentication module",
    "Add test coverage for the payment flow",
    "Create a spec for the UserService class",
    "The login tests are failing; update the test assertions",
    # Counter-examples (what NOT to route here):
    # "Fix the authentication module so the tests pass" → code_agent
    # "Write docs for the test suite" → docs_agent
]
Step 5: Validate routing on a labeled evaluation set before deploying
ROUTING_EVAL = [
    {"task": "Add a test for the JWT decoder", "expected": "test_agent"},
    {"task": "Fix the JWT decoder implementation", "expected": "code_agent"},
    {"task": "Document the JWT decoder API", "expected": "docs_agent"},
    # ... 50+ examples
]

def evaluate_router(router):
    # Count the examples where the top-ranked agent matches the label.
    correct = sum(
        1 for ex in ROUTING_EVAL
        if route_task(ex["task"], router)[0] == ex["expected"]
    )
    accuracy = correct / len(ROUTING_EVAL)
    print(f"Router accuracy: {accuracy:.1%}")
    # Fail the CI job when accuracy drops below the agreed threshold.
    assert accuracy >= 0.90, "Router accuracy below 90% threshold"
Run this as a CI check whenever router prompts or agent descriptions change.
Prevention
- Define agent capabilities using explicit scope boundaries with “Does NOT handle” clauses — ambiguity in descriptions directly causes misrouting.
- Build a labeled routing evaluation set of at least 50 examples before deploying any router, and enforce a 90% accuracy threshold in CI.
- Log every routing decision with confidence score; alert on decisions below 0.75 confidence.
- Add a “clarification agent” or human escalation path for low-confidence routing rather than guessing.
- Version your agent registry; any time an agent is added, removed, or renamed, run the routing evaluation suite before deploying.
- Keep task descriptions sent to the router at least 20 words — add a task-enrichment step if the orchestrator generates short tasks.
- Use semantic classification (embedding similarity or a classifier model) rather than keyword matching for anything beyond trivial routing.
- Review misrouted tasks weekly in production; use them to expand the evaluation set and improve examples.
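The task-enrichment step mentioned in the list above could be as simple as attaching the original request whenever the orchestrator's summary is too short. This is a sketch under assumptions: `enrich_task` is a hypothetical helper, the 20-word minimum mirrors the guideline above, and a real system might pull richer context such as file paths or conversation history.

```python
MIN_ROUTER_WORDS = 20  # mirrors the guideline above


def enrich_task(task: str, parent_request: str, min_words: int = MIN_ROUTER_WORDS) -> str:
    """Append the original request when the task is too short to route on."""
    if len(task.split()) >= min_words:
        return task
    return f"{task}\n\nOriginal request: {parent_request}"


short_task = "Fix it"
parent = (
    "The login endpoint returns a 500 error when the JWT has expired; "
    "it should return 401 and the existing tests should cover this case."
)
print(enrich_task(short_task, parent))
```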
FAQ
Q: Should I use a dedicated router model or build routing into the orchestrator LLM? A: For 3-5 agents, building routing into the orchestrator prompt works well. For 10+ agents, use a dedicated lightweight classifier (a fine-tuned small model or embedding similarity) — the orchestrator’s general-purpose model degrades in accuracy as the number of choices grows.
Q: How do I handle tasks that legitimately belong to two agents? A: Split the task before routing. Have a “task decomposer” step that breaks composite tasks into atomic subtasks, each of which maps cleanly to one agent. Do not try to route a composite task to a single agent.
Q: What if my router model changes between deployments and routing regresses? A: Pin the router model version explicitly and run the routing evaluation suite as a pre-deployment check. Treat a routing accuracy regression as a breaking change.
Q: Can vector-based routing replace prompt-based routing? A: Often, yes, particularly for large agent registries. Embed each task and each agent’s capability description, then route to the agent with the highest cosine similarity. This is typically faster and cheaper than asking a large model to classify every task, and it gives the same answer every time for a fixed embedding model. Validate it against the same labeled evaluation set before switching.
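A minimal sketch of vector-based routing follows. To stay runnable without external dependencies it uses word counts as a stand-in for embeddings, so it only matches on shared words; a real deployment would pass an embedding model's function as `embed`. The capability descriptions and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical capability descriptions, one per agent.
CAPABILITIES = {
    "code_agent": "write edit or refactor production source code",
    "test_agent": "write or edit unit test files and test coverage",
    "docs_agent": "write documentation readme and api reference",
}


def bag_of_words(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_by_similarity(task: str, capabilities: dict, embed=bag_of_words):
    """Return (agent, score) for the capability closest to the task."""
    task_vector = embed(task)
    scored = [
        (cosine(task_vector, embed(description)), name)
        for name, description in capabilities.items()
    ]
    score, agent = max(scored)
    return agent, score


print(route_by_similarity("add unit test coverage for the login flow", CAPABILITIES))
print(route_by_similarity("update the api reference documentation", CAPABILITIES))
```

The returned score can feed the same confidence threshold and escalation path used for prompt-based routing, which matters most when every similarity is near zero and the top match is arbitrary.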