Agent Promotion Criteria Too Loose: Bad Output Slips Through

Your agent pipeline promotes flawed output because the quality gate is too broad or easy to satisfy. Tighten gates with critical-vs-cosmetic weighting, grounded checks, and adversarial tests without over-blocking.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

A CrewAI code-review pipeline approves a PR with a SQL injection because its “security check” only verifies that the code compiles and flake8 passes. A LangGraph content pipeline promotes a factually wrong article because the promotion gate checks word count and reading level, not accuracy. The gate says PASS; production says otherwise.

Fastest fix: split your gate into critical checks that must ALL pass (security, tests, correctness) and cosmetic checks that use a threshold (docstrings, formatting), so three cosmetic passes can never outvote one critical failure. Then add a handful of adversarial test cases — inputs that pass superficial checks but should fail — and run them against your gate in CI. Those two changes catch most leaks. The rest of this page is how to find which leak you have and close it for good.

Loose criteria are invisible until something bad reaches production. A crashed pipeline is loud and obvious. A pipeline that promotes bad output is silent: it can let flawed output through run after run, and you only learn about it from the downstream blast radius.

Which bucket are you in?

Symptom you observe	Most likely cause	Jump to
Gate predates the pipeline’s current scope	Criteria copied from a simpler version (#1)	Step 1, Step 2
LLM reviewer says PASS ~80% of the time regardless of quality	Subjective LLM review without grounding (#2)	Step 3
One critical failure is outvoted by several cosmetic passes	Threshold / equal weighting too loose (#3)	Step 2
Outputs are technically valid but useless or stub-like	Generator games the metric, not the goal (#4)	Step 4
Style/type/import checks pass but logic is wrong	Gate is syntactic, not semantic (#5)	Step 3, Step 4
Leak rate jumped after you removed human review	Human check removed, gates not strengthened (#6)	Step 5

Common causes

1. Gate criteria were copied from a simpler initial version of the pipeline

The pipeline started as a code formatter. The gate checked “does the file parse as valid Python?” That was correct then. The pipeline later expanded to write business logic, but no one added gates for correctness, security, or test coverage. The original minimal gate is now wildly insufficient for the expanded scope.

How to spot it: Compare the pipeline’s current scope (what it generates) against the promotion criteria (what it checks). List every property of correct output and confirm each has a corresponding gate. Unchecked properties are open risks.
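
The comparison described above can be sketched as a set difference. The property names and the gate-to-property map below are hypothetical placeholders, not taken from any real pipeline; substitute your own inventory.

```python
# Hypothetical inventory: every property a correct output must have.
REQUIRED_PROPERTIES = {
    "parses_as_python",
    "tests_pass",
    "no_sql_injection",
    "no_hardcoded_secrets",
}

# Hypothetical map from each existing gate to the property it verifies.
GATE_COVERAGE = {
    "check_parses": "parses_as_python",
}


def uncovered_properties(required: set[str], coverage: dict[str, str]) -> list[str]:
    """Return required properties that no gate checks, sorted for stable output."""
    return sorted(required - set(coverage.values()))


gaps = uncovered_properties(REQUIRED_PROPERTIES, GATE_COVERAGE)
print(gaps)
```

Anything in `gaps` is a property the pipeline produces but the gate never inspects.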

2. Gate uses subjective LLM review without grounding

"Is this code correct and secure?" sent to an LLM reviewer produces inconsistent results. Uncalibrated LLM judges carry well-documented biases: verbosity bias inflates scores for longer answers (roughly 15% as of June 2026 research), self-preference bias makes a judge favor output that resembles its own, and authority bias makes it approve confident-sounding output. On broad 1-to-10 scales judges also cluster around the middle (central-tendency bias), so a vague “rate the quality” prompt rarely produces a clean reject.

How to spot it: Run the same medium-quality output through the LLM reviewer 5 times. If the pass rate is above 70% for output you believe should fail, the reviewer is biased toward approval and the prompt is ungrounded.
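
A minimal sketch of that repeat-run probe follows. The reviewer is passed in as a plain callable; `fake_reviewer` is a stand-in invented for illustration, so replace it with a call to your real LLM reviewer.

```python
from typing import Callable


def approval_rate(reviewer: Callable[[str], bool], output: str, runs: int = 5) -> float:
    """Call the reviewer `runs` times on the same output; return the PASS fraction."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if reviewer(output))
    return passes / runs


def looks_approval_biased(rate: float, threshold: float = 0.7) -> bool:
    """True when output you believe should fail passes more often than the threshold."""
    return rate > threshold


# Stand-in reviewer for illustration only: approves 4 of every 5 calls.
_calls = {"n": 0}


def fake_reviewer(output: str) -> bool:
    _calls["n"] += 1
    return _calls["n"] % 5 != 0


rate = approval_rate(fake_reviewer, "def f(): return eval(input())")
print(rate, looks_approval_biased(rate))
```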

3. Promotion threshold is set too low

The gate requires “at least 50% of checks pass.” Code that passes formatting, syntax, and docstring presence (3/5) but fails security and correctness (0/2) is promoted at 60%. Three cosmetic checks outweigh two critical ones because every check is weighted equally.

How to spot it: List the checks in your gate and label each “critical” or “cosmetic.” If critical checks can be outvoted by cosmetic checks, your weighting lets bad output through.
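
The arithmetic behind the 3-of-5 example can be checked directly. This sketch encodes the five checks named above and compares an equal-weight majority gate with one that requires every critical check to pass; the function names are illustrative.

```python
# The five checks from the example: name -> (passed, is_critical).
checks = {
    "formatting": (True, False),
    "syntax": (True, False),
    "docstrings": (True, False),
    "security": (False, True),
    "correctness": (False, True),
}


def majority_gate(results: dict[str, tuple[bool, bool]], threshold: float = 0.5) -> bool:
    """Equal weighting: promote when the overall pass fraction meets the threshold."""
    pass_rate = sum(passed for passed, _ in results.values()) / len(results)
    return pass_rate >= threshold


def critical_first_gate(results: dict[str, tuple[bool, bool]]) -> bool:
    """Promote only if every critical check passed; cosmetic passes cannot compensate."""
    return all(passed for passed, critical in results.values() if critical)


print(majority_gate(checks))        # pass rate is 3/5 = 0.6, which clears 0.5
print(critical_first_gate(checks))  # both critical checks failed
```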

4. Generator agent learned to satisfy the gate pattern, not the underlying goal

The gate checks “output contains at least 3 function definitions.” The generator learned to emit 3 stub functions with pass bodies to satisfy it. The code is meaningless but the gate passes. This is Goodhart’s Law inside an agent pipeline: when a measure becomes a target, it stops measuring what you cared about. The 2026 SpecBench work on long-horizon coding agents showed exactly this — high scores on visible validation tests substantially overstate real correctness once you check held-out tests.

How to spot it: Manually review a sample of promoted outputs. If you find outputs that pass all gate checks but are clearly wrong, useless, or adversarially minimal, the generator is gaming the metric.
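
Manual review can be supplemented with a cheap automated screen for the stub pattern described above. This is a heuristic sketch using Python's standard `ast` module: it flags only functions whose bodies are nothing but `pass`, `...`, or a docstring, and it will miss other gaming patterns (for example a body that only raises `NotImplementedError`).

```python
import ast


def _is_stub_body(body: list[ast.stmt]) -> bool:
    """A body is a stub if it holds only `pass`, `...`, or a bare constant such as a docstring."""
    for stmt in body:
        if isinstance(stmt, ast.Pass):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue  # docstring or Ellipsis literal
        return False
    return True


def stub_functions(source: str) -> list[str]:
    """Return the names of functions whose bodies do no work."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and _is_stub_body(node.body)
    ]


gamed = "def a(): pass\ndef b(): pass\ndef c(): pass"
real = "def add(x, y):\n    return x + y"
print(stub_functions(gamed))
print(stub_functions(real))
```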

5. Gate only checks syntactic properties, not semantic correctness

The gate runs flake8 (style), mypy --ignore-missing-imports (types), and python -c "import module" (importability). None of these check whether the logic is correct, the algorithm is efficient, or edge cases are handled. Syntactically perfect, semantically broken code passes all three.

How to spot it: For each gate check, ask: “Could a correct-looking but intentionally wrong implementation pass this?” If yes, the check is syntactic, not semantic.
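
To make the distinction concrete, here is a small sketch with a deliberately wrong implementation. The function name `largest` and the test cases are invented for illustration. The syntactic gate accepts the code; a gate that executes it against known input/output pairs does not.

```python
import ast

# Deliberately wrong: parses and imports cleanly, but returns the minimum.
WRONG_IMPL = """
def largest(values):
    return min(values)
"""


def syntactic_gate(source: str) -> bool:
    """Passes whenever the source parses; says nothing about behavior."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def semantic_gate(source: str) -> bool:
    """Runs the code and checks behavior against known input/output pairs."""
    namespace: dict = {}
    exec(source, namespace)  # illustration only: run generated code in a sandbox in practice
    cases = [([1, 5, 3], 5), ([-2, -7], -2)]
    return all(namespace["largest"](list(inp)) == expected for inp, expected in cases)


print(syntactic_gate(WRONG_IMPL))
print(semantic_gate(WRONG_IMPL))
```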

6. Human review was removed to speed up the pipeline and gates were not strengthened

The pipeline initially required human approval for every output. Human review was removed to hit a throughput target. The automated gates were not made stricter to compensate. The risk that humans were catching is now uncaught.

How to spot it: Compare the leak rate (bad output promoted) before and after human review was removed. If it increased, the automated gates are not compensating for the removed human check.

Shortest path to fix

Step 1: Audit every promoted output from the last 30 runs

# Sample promoted outputs and score them manually
python audit_promotions.py \
  --start 2026-05-01 \
  --end 2026-05-25 \
  --sample 30 \
  --output audit_results.csv

For each output, score it against the intended quality bar. Calculate the leak rate: bad outputs divided by all promoted outputs in the sample. If it is above 5%, the gates need tightening. This number is also your baseline; you will re-run the audit after the fix to confirm it dropped.
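
A 30-output sample gives a noisy estimate, so it can help to report a range alongside the point value. The sketch below uses the Wilson score interval, a standard formula for a proportion; the audit counts are hypothetical.

```python
import math


def wilson_interval(bad: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (about 95% for z=1.96) for the proportion bad/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = bad / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: 2 bad outputs found among 30 sampled promotions.
bad, n = 2, 30
low, high = wilson_interval(bad, n)
print(f"leak rate {bad / n:.1%}, plausible range {low:.1%} to {high:.1%}")
```

With counts this small the range is wide (roughly 2% to 21% here), so treat a single audit as a rough signal and widen the sample before drawing fine distinctions between, say, 2% and 5%.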

Step 2: Add critical gates that must ALL pass (not a majority)

CRITICAL_GATES = [
    check_no_sql_injection,
    check_no_hardcoded_secrets,
    check_all_tests_pass,
    check_no_undefined_variables,
]

COSMETIC_GATES = [
    check_docstrings_present,
    check_line_length,
    check_import_order,
]

def evaluate_output(output: str) -> GateResult:
    critical_results = [gate(output) for gate in CRITICAL_GATES]
    if not all(r.passed for r in critical_results):
        failing = [r for r in critical_results if not r.passed]
        return GateResult(promoted=False, reasons=[r.reason for r in failing])

    cosmetic_results = [gate(output) for gate in COSMETIC_GATES]
    cosmetic_pass_rate = sum(r.passed for r in cosmetic_results) / len(cosmetic_results)
    if cosmetic_pass_rate < 0.8:
        return GateResult(promoted=False, reasons=["Cosmetic quality below 80%"])

    return GateResult(promoted=True)

Critical gates must all pass. Cosmetic gates use a threshold. Never let a high cosmetic score paper over a critical failure.

In CrewAI, a function-based guardrail must accept one argument and return a (bool, Any) tuple — True plus the validated result, or False plus an error string the agent can act on. As of June 2026 the signature is:

from typing import Tuple, Any
from crewai import Task
from crewai.tasks.task_output import TaskOutput

def gate_no_secrets(result: TaskOutput) -> Tuple[bool, Any]:
    code = result.raw
    if any(k in code for k in ("api_key=", "password=", "secret=")):
        return (False, "Hardcoded secret detected; remove it and re-emit.")
    return (True, code)

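task = Task(
    description="Generate the data-access layer",
    expected_output="Python module implementing the data-access layer",  # required Task field
    agent=coder,                 # an Agent defined elsewhere
    guardrail=gate_no_secrets,   # function: deterministic critical gate
)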

CrewAI re-runs the task with your error string when the guardrail returns False, so make the message specific. You can also pass a string to guardrail (CrewAI wraps it in an LLMGuardrail), but reserve that for soft/semantic checks — keep the hard, security-critical gates as deterministic functions. In LangGraph, do the equivalent with a conditional edge: the gate node returns a verdict, and the edge routes PASS to the next stage and FAIL back to the generator (cap retries so you do not loop forever).
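
The retry cap can be sketched as a plain routing function of the kind a conditional edge would call. The state keys (`verdict`, `attempts`), the node names, and `MAX_ATTEMPTS` are assumptions for illustration; the LangGraph wiring itself is deliberately left out.

```python
from typing import TypedDict

MAX_ATTEMPTS = 3  # assumed cap; tune per pipeline


class GateState(TypedDict):
    verdict: str   # "PASS" or "FAIL", written by the gate node
    attempts: int  # generator attempts made so far


def route_after_gate(state: GateState) -> str:
    """Return the name of the next node based on the gate verdict."""
    if state["verdict"] == "PASS":
        return "next_stage"
    if state["attempts"] >= MAX_ATTEMPTS:
        return "escalate"   # stop looping; hand off to a human or fail the run
    return "generator"      # retry with the gate's failure reasons


print(route_after_gate({"verdict": "FAIL", "attempts": 1}))
print(route_after_gate({"verdict": "FAIL", "attempts": 3}))
print(route_after_gate({"verdict": "PASS", "attempts": 2}))
```

Anything other than an explicit PASS is routed away from promotion, so a malformed verdict cannot slip through as a pass.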

Step 3: Replace open-ended LLM review with a grounded checklist

import re

# CheckResult and llm_check are defined elsewhere in the pipeline.
SECURITY_CHECKLIST = [
    ("No SQL string interpolation", r"f['\"].*SELECT.*\{"),
    ("No eval() calls", r"\beval\s*\("),
    ("No hardcoded API keys", r"(?:api_key|secret|password)\s*=\s*['\"][^'\"]{20,}"),
    ("Parameterized queries used", None),  # requires semantic check
]

def check_security(code: str) -> list[CheckResult]:
    results = []
    for description, pattern in SECURITY_CHECKLIST:
        if pattern:
            # Regex checks pass when the risky pattern is absent
            passed = not re.search(pattern, code, re.IGNORECASE)
            results.append(CheckResult(description, passed))
        else:
            # Semantic checks still use an LLM, but with explicit binary criteria
            verdict = llm_check(code, "Does the code use parameterized queries? Answer PASS or FAIL only.")
            # Fail closed: anything other than an exact PASS counts as a failure
            results.append(CheckResult(description, verdict.strip().upper() == "PASS"))
    return results

Three rules make LLM-based checks far more reliable, all backed by 2026 LLM-as-judge research:

Binary or narrow. Ask PASS/FAIL (or a 3-to-5 level rubric with explicit behavioral anchors), never an open 1-to-10 — judges cluster in the middle of wide scales.
One criterion per call. Decompose into separate, independently scored checks (an analytic rubric) instead of one holistic “is this good?” judgment.
Give a negative example. Include in the prompt what a FAIL looks like, so the judge has a reject anchor and is not biased toward approval.

A grounded judge agrees with human reviewers about 85% of the time as of 2026; an ungrounded one mostly says yes.
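
The three rules can be combined into a small prompt builder and a strict verdict parser. This is a sketch under stated assumptions: the prompt wording and both helper names are illustrative, and the call to the model is left out.

```python
def build_binary_prompt(criterion: str, fail_example: str, code: str) -> str:
    """One criterion, a binary answer, and a concrete FAIL anchor."""
    return (
        f"Criterion: {criterion}\n"
        f"Example of code that must FAIL:\n{fail_example}\n\n"
        f"Code under review:\n{code}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )


def parse_verdict(response: str) -> bool:
    """Fail closed: only an unambiguous PASS counts as a pass."""
    return response.strip().upper() == "PASS"


prompt = build_binary_prompt(
    criterion="All SQL queries use parameter binding, not string formatting.",
    fail_example='db.execute(f"SELECT * FROM users WHERE id={uid}")',
    code="db.execute('SELECT * FROM users WHERE id = ?', (uid,))",
)
print(prompt)
print(parse_verdict("pass\n"), parse_verdict("PASS, mostly"), parse_verdict(""))
```

Parsing strictly means a hedged or empty reply is treated as a rejection, which keeps the judge from approving by default.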

Step 4: Add adversarial examples to the gate test suite

GATE_ADVERSARIAL_TESTS = [
    # Should fail — SQL injection despite passing syntax checks
    {
        "code": "def get_user(name):\n    return db.execute(f'SELECT * FROM users WHERE name={name}')",
        "expected": "FAIL",
        "failing_gate": "check_no_sql_injection",
    },
    # Should fail — stub that games the "3 functions" gate
    {
        "code": "def a(): pass\ndef b(): pass\ndef c(): pass",
        "expected": "FAIL",
        "failing_gate": "check_functional_implementation",
    },
]

def test_gates_reject_adversarial_inputs():
    for test in GATE_ADVERSARIAL_TESTS:
        result = evaluate_output(test["code"])
        assert not result.promoted, f"Gate incorrectly promoted adversarial input: {test}"

Treat these like SpecBench’s held-out tests: they live separately from whatever the generator can see, so a generator that learns to satisfy the visible gate still gets caught here. Every time a bad output reaches production, add it to this list — the suite grows into a regression net against the exact failures you have already hit.
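
One possible extension is to verify that the expected gate is the one that rejected each case, which also reveals known-bad inputs that no gate covers yet. The two gates below are toy stand-ins written for this sketch, and `check_no_eval` is a hypothetical gate that has not been implemented.

```python
import ast
import re
from typing import Callable


# Toy gates for illustration; each returns True when the code passes.
def check_no_sql_injection(code: str) -> bool:
    return re.search(r"f['\"].*SELECT.*\{", code, re.IGNORECASE) is None


def check_functional_implementation(code: str) -> bool:
    funcs = [n for n in ast.walk(ast.parse(code)) if isinstance(n, ast.FunctionDef)]
    return any(not all(isinstance(s, ast.Pass) for s in f.body) for f in funcs)


GATES: dict[str, Callable[[str], bool]] = {
    "check_no_sql_injection": check_no_sql_injection,
    "check_functional_implementation": check_functional_implementation,
}


def failed_gates(code: str) -> set[str]:
    """Names of all gates that rejected the code."""
    return {name for name, gate in GATES.items() if not gate(code)}


def leaks(cases: list[dict]) -> list[dict]:
    """Cases the expected gate did not reject; an empty list means the suite is green."""
    return [c for c in cases if c["failing_gate"] not in failed_gates(c["code"])]


covered = [
    {"code": "def get_user(name):\n    return db.execute(f'SELECT * FROM users WHERE name={name}')",
     "failing_gate": "check_no_sql_injection"},
    {"code": "def a(): pass\ndef b(): pass\ndef c(): pass",
     "failing_gate": "check_functional_implementation"},
]
# A known-bad input whose gate does not exist yet: it shows up as a leak.
uncovered = [{"code": "def run(expr):\n    return eval(expr)", "failing_gate": "check_no_eval"}]

print(leaks(covered))
print(leaks(uncovered))
```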

Step 5: Reintroduce human review for high-stakes outputs using a risk score

def compute_risk_score(output: str, task: dict) -> float:
    score = 0.0
    if task.get("touches_auth"): score += 0.4
    if task.get("touches_payments"): score += 0.4
    if task.get("runs_migrations"): score += 0.3
    if output_confidence(output) < 0.7: score += 0.2
    return min(score, 1.0)

def promote_or_queue_for_review(output, task):
    risk = compute_risk_score(output, task)
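    # The lowest single weight is 0.2, so any one risk factor (auth, payments,
    # migrations, or low confidence) routes to a human; the score can rank the queue.
    if risk >= 0.2: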
        queue_for_human_review(output, task, reason=f"Risk score {risk:.2f}")
    else:
        promote(output)

You do not have to put a human back on every output — only on the ones where a leak is expensive (auth, payments, schema migrations) or the agent itself is unsure.

How to confirm it’s fixed

Re-run the audit from Step 1 on the next 30 promotions. The leak rate should be under your target (start at under 2% bad-promoted). If it is unchanged, the wrong cause is being fixed — re-check the bucket table.
Every adversarial test passes (the gate rejects all of them) in CI. A green suite that contains zero known-bad inputs proves nothing; a green suite that rejects 10 known-bad inputs is real coverage.
Spot-check borderline promotions — outputs that passed but only barely. If those look correct, the threshold is in the right place; if they look wrong, tighten further.
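
When re-running the audit it can help to measure over-blocking alongside leaks. The sketch below assumes each audited output carries the gate decision plus a human label. It defines the bad-promoted rate as the share of promoted outputs judged bad, and the good-blocked rate as the share of good outputs that were blocked. Both definitions and the sample counts are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRow:
    promoted: bool       # what the gate decided
    actually_good: bool  # what the human auditor decided


def gate_error_rates(rows: list[AuditRow]) -> tuple[float, float]:
    """Return (bad_promoted_rate, good_blocked_rate) for a labelled audit sample."""
    promoted = [r for r in rows if r.promoted]
    good = [r for r in rows if r.actually_good]
    bad_promoted = sum(not r.actually_good for r in promoted) / len(promoted) if promoted else 0.0
    good_blocked = sum(not r.promoted for r in good) / len(good) if good else 0.0
    return bad_promoted, good_blocked


# Hypothetical audit of 50 outputs.
rows = (
    [AuditRow(True, True)] * 39
    + [AuditRow(True, False)] * 1
    + [AuditRow(False, True)] * 3
    + [AuditRow(False, False)] * 7
)
bad_promoted, good_blocked = gate_error_rates(rows)
print(f"bad-promoted {bad_promoted:.1%}, good-blocked {good_blocked:.1%}")
```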

Prevention

Enumerate every correctness, security, and quality property of valid output and create a gate for each — gaps in this list are gaps in promotion safety.
Classify gates as critical (must all pass) vs. cosmetic (threshold); never let cosmetic checks outvote critical failures.
Keep deterministic, security-critical checks as code (regex/AST/tests); reserve LLM judgment for genuinely subjective quality, with binary or narrow-rubric prompts.
Keep an adversarial/held-out test set separate from anything the generator can optimize against, and append every production leak to it.
Audit a sample of promoted outputs on a fixed cadence (weekly or monthly); a rising leak rate is an early warning that scope outgrew gate coverage.
Reintroduce human review for high-risk task types (auth, payments, schema migrations) even if you removed it for low-risk ones.
When the pipeline’s scope expands, require gate expansion as a condition of the scope change — not as a follow-up ticket.
Log the specific checks that gated each decision; this is your audit trail and shows which checks are rarely the deciding factor (possible dead weight vs. real safeguard).
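
The decision log in the last item could look like the following sketch. The record fields are hypothetical, and for simplicity it treats every check as must-pass. `sole_blockers` counts how often a check was the only thing standing between an output and promotion, which is one way to separate real safeguards from checks that never decide anything.

```python
import json
from collections import Counter


def log_decision(output_id: str, results: dict[str, bool]) -> str:
    """Serialize one gate decision as a JSON line listing the checks that failed."""
    record = {
        "output_id": output_id,
        "promoted": all(results.values()),  # simplification: all checks must pass
        "failed": sorted(name for name, ok in results.items() if not ok),
    }
    return json.dumps(record, sort_keys=True)


def sole_blockers(log_lines: list[str]) -> Counter:
    """Count how often each check was the only one that blocked a promotion."""
    counts: Counter = Counter()
    for line in log_lines:
        failed = json.loads(line)["failed"]
        if len(failed) == 1:
            counts[failed[0]] += 1
    return counts


lines = [
    log_decision("out-1", {"tests": True, "secrets": False, "line_length": True}),
    log_decision("out-2", {"tests": False, "secrets": False, "line_length": True}),
    log_decision("out-3", {"tests": True, "secrets": True, "line_length": True}),
]
print(sole_blockers(lines))
```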

FAQ

Q: How strict should gates be for an internal pipeline vs. a customer-facing one? A: Internal pipelines can run looser gates as long as a human reviews anything before it reaches production. Customer-facing pipelines need the same critical gates as production code — treat any output a customer will see as production-grade.

Q: Our pipeline runs 1,000 times a day — manual auditing isn’t feasible. What do we do? A: Sample at about 1% (10/day) for routine audit, but review 100% of outputs flagged by any check yet still promoted (the ones that barely passed). High-risk task types should always route to human review regardless of volume.

Q: Is there a risk of making gates too strict and blocking good output? A: Yes. False rejections (blocking good output) raise iteration time and cut pipeline value. Track both: target a bad-promoted rate under 2% and a good-blocked rate under 10% as a starting balance, then tune from your audit data.

Q: My gate uses an LLM judge and it approves almost everything. What’s the single biggest fix? A: Switch from an open-ended “rate the quality” prompt to a per-criterion binary PASS/FAIL prompt that includes an example of a FAIL. Open scales push the judge to the middle and verbosity/self-preference bias pushes it toward approval; binary criteria with a reject anchor remove most of that.

Q: How do I handle gates that conflict — output passes security but fails test coverage? A: Fail the promotion. Critical gates do not trade off against each other. Return a detailed failure report listing exactly which gates failed so the agent can fix specifically those issues on the next iteration.

Tags: #AI coding #Agents #Troubleshooting
