Your CrewAI code-review pipeline approves a PR with a SQL injection vulnerability because the “security check agent” only verifies that the code compiles and basic linting passes, which were the original criteria from when the pipeline was first set up. Or a LangGraph content-generation pipeline promotes an article that is factually wrong because the promotion gate checks only word count and reading level, not accuracy. Loose promotion criteria are invisible until something bad reaches production. Unlike a crashed pipeline, which is loud and obvious, a pipeline with loose gates fails silently: every run reports success, including the runs that promote bad output.
Common causes
1. Gate criteria were copied from a simpler initial version of the pipeline
The pipeline started as a code formatter. The gate checked “does the file parse as valid Python?” That was correct then. The pipeline later expanded to write business logic, but no one added new gates for correctness, security, or test coverage. The original minimal gate is now wildly insufficient for the expanded scope.
How to spot it: Compare the pipeline’s current scope (what it generates) against the promotion criteria (what it checks). List every property of correct output and check whether each has a corresponding gate. Unchecked properties are risks.
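The comparison can be kept as a small inventory that is checked mechanically. The sketch below is illustrative only: the property names, the gate name `check_syntax`, and the helper `uncovered_properties` are hypothetical placeholders, not part of any framework.

```python
# Hypothetical inventory: property and gate names are placeholders for illustration.
REQUIRED_PROPERTIES = {
    "parses as valid Python",
    "tests pass",
    "no injection vulnerabilities",
    "no hardcoded secrets",
}

# Map each existing gate to the single property it actually verifies.
GATE_COVERAGE = {
    "check_syntax": "parses as valid Python",
}


def uncovered_properties(required: set[str], coverage: dict[str, str]) -> list[str]:
    """Return required properties that no gate verifies, sorted for stable output."""
    return sorted(required - set(coverage.values()))


for gap in uncovered_properties(REQUIRED_PROPERTIES, GATE_COVERAGE):
    print(f"UNCHECKED: {gap}")
```

Every line this prints is a property the pipeline produces but never verifies.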
2. Gate uses subjective LLM review without grounding
“Is this code correct and secure?” sent to an LLM reviewer produces inconsistent results. The LLM may say “yes” most of the time (for example, in 80% of runs) regardless of the actual quality because the prompt has no objective checklist, no reference implementation, and no examples of what “incorrect” looks like.
How to spot it: Run the same medium-quality output through the LLM reviewer at least 5 times. If it passes more than 70% of the time (4 or more of 5 runs) on output you believe should fail, the reviewer is biased toward approval.
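A minimal sketch of that repeated-run check follows. It assumes the reviewer can be wrapped as a callable that returns `True` on approval; `approval_rate` and `is_approval_biased` are hypothetical helper names, and the 0.7 threshold is simply the figure suggested above.

```python
from typing import Callable


def approval_rate(review: Callable[[str], bool], output: str, runs: int = 5) -> float:
    """Send the same output to the reviewer `runs` times; return the fraction approved."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    approvals = sum(1 for _ in range(runs) if review(output))
    return approvals / runs


def is_approval_biased(rate: float, threshold: float = 0.7) -> bool:
    """Flag a reviewer that approves known-bad output more often than `threshold`."""
    return rate > threshold
```

Five runs give only a coarse estimate, so a larger `runs` value is worth the cost when the result is borderline.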
3. Promotion threshold is set too low
The gate requires “at least 50% of checks pass.” A piece of code that passes the three cosmetic checks (formatting, syntax check, and docstring presence) but fails both critical checks (security and correctness) scores 3/5 = 60% and is promoted. Three cosmetic checks outweigh two critical ones because all checks are weighted equally.
How to spot it: List the checks in your gate and classify each as “critical” or “cosmetic.” If critical checks can be outvoted by cosmetic checks, your weighting allows bad output through.
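The arithmetic from the example above can be reproduced in a few lines. This is a sketch with hypothetical check names; `majority_gate` mimics the equal-weight threshold and `critical_first_gate` shows the stricter alternative.

```python
# Results from the example: three cosmetic checks pass, two critical checks fail.
RESULTS = {
    "formatting": True,
    "syntax": True,
    "docstrings": True,
    "security": False,
    "correctness": False,
}
CRITICAL = {"security", "correctness"}


def majority_gate(results: dict[str, bool], threshold: float = 0.5) -> bool:
    """Equal-weight gate: promote when the share of passing checks meets the threshold."""
    return sum(results.values()) / len(results) >= threshold


def critical_first_gate(results: dict[str, bool], critical: set[str]) -> bool:
    """Promote only if every critical check passed."""
    return all(results[name] for name in critical)


print(majority_gate(RESULTS))                  # equal weighting promotes the output
print(critical_first_gate(RESULTS, CRITICAL))  # critical-first blocks it
```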
4. Generator agent learned to satisfy the gate pattern, not the underlying goal
The gate checks “output contains at least 3 function definitions.” The generator learned to produce 3 stub functions with pass bodies to satisfy the gate. The code is meaningless but the gate passes. This is Goodhart’s Law in an agent pipeline.
How to spot it: Manually review a sample of promoted outputs. If you find outputs that pass all gate checks but are clearly wrong, useless, or adversarially minimal, the generator is gaming the metrics.
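Part of that manual review can be automated for the stub-function case. The sketch below uses Python's standard `ast` module to list functions whose bodies contain nothing but `pass`, `...`, or a docstring. The names `find_stub_functions` and `_is_stub_body` are hypothetical, and this catches only the most obvious form of gaming.

```python
import ast


def _is_stub_body(body: list[ast.stmt]) -> bool:
    """True if a function body holds only `pass` statements or bare constants."""
    for stmt in body:
        if isinstance(stmt, ast.Pass):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            # Covers docstrings, a bare Ellipsis (`...`), and other bare literals.
            continue
        return False
    return True


def find_stub_functions(source: str) -> list[str]:
    """Return the names of all functions in `source` that do no real work."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and _is_stub_body(node.body)
    ]
```

A generator can still evade this with trivially non-empty bodies such as `return None`, so it complements sampling promoted outputs rather than replacing it.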
5. Gate only checks syntactic properties, not semantic correctness
The gate runs flake8 (style), mypy --ignore-missing-imports (types), and python -c "import module" (importability). None of these check whether the logic is correct, the algorithm is efficient, or the edge cases are handled. Syntactically perfect, semantically broken code passes all three.
How to spot it: For each gate check, ask: “Could a syntactically valid but intentionally wrong implementation pass this?” If yes, the check is syntactic, not semantic.
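One way to make a gate semantic is to run the generated function against known input/output cases. The sketch below is an illustration under assumptions: `passes_behavioral_cases`, `broken_max`, and `CASES` are hypothetical, and a real pipeline would more likely run an existing test suite in a sandbox.

```python
from typing import Any, Callable, Sequence


def passes_behavioral_cases(
    func: Callable[..., Any],
    cases: Sequence[tuple[tuple, Any]],
) -> bool:
    """Return True only if func(*args) equals the expected value for every case."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash on a reference case counts as a failure, not an error to propagate.
            return False
    return True


def broken_max(values):
    """Valid syntax, clean style, importable, and wrong: returns the first element."""
    return values[0]


# Reference cases for a "return the largest element" task.
CASES = [
    (([1, 2, 3],), 3),
    (([5],), 5),
    (([-1, -7],), -1),
]
```

Here `broken_max` would satisfy flake8 and an import check, yet it fails the first reference case, while the built-in `max` passes all three.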
6. Human review was removed to speed up the pipeline and gates were not strengthened
The pipeline initially required human approval for every output. Human review was removed to hit a throughput target. The automated gates were not made stricter to compensate. The risk that humans were catching is now uncaught.
How to spot it: Compare the false-positive rate (bad output promoted) before and after human review was removed. If it increased, the automated gates are not compensating for the removed human check.
Shortest path to fix
Step 1: Audit a sample of 30 recently promoted outputs
# Sample promoted outputs and score them manually
python audit_promotions.py \
    --start 2026-05-01 \
    --end 2026-05-25 \
    --sample 30 \
    --output audit_results.csv
For each output, score it against the intended quality bar, then calculate the false-positive rate: the share of sampled promoted outputs that should not have been promoted. If it’s above 5% (two or more bad outputs in a sample of 30), the gates need tightening.
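Once the sample is scored, the rate is a simple ratio. The sketch below assumes the audit CSV has a manually filled `verdict` column containing `good` or `bad`. That column name and the helper `false_positive_rate` are assumptions, since the output format of `audit_promotions.py` is not specified here.

```python
import csv
import io
from typing import TextIO


def false_positive_rate(audit_file: TextIO, verdict_column: str = "verdict") -> float:
    """Fraction of audited promoted outputs that a human marked as bad."""
    rows = list(csv.DictReader(audit_file))
    if not rows:
        raise ValueError("audit file has no rows")
    bad = sum(1 for row in rows if row[verdict_column].strip().lower() == "bad")
    return bad / len(rows)


# Usage with an in-memory sample; in practice, pass open("audit_results.csv", newline="").
sample = io.StringIO("output_id,verdict\n1,good\n2,bad\n3,good\n4,good\n")
print(f"False-positive rate: {false_positive_rate(sample):.0%}")
```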
Step 2: Add critical gates that must ALL pass (not a majority)
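from dataclasses import dataclass, field

@dataclass
class GateResult:
    promoted: bool
    reasons: list[str] = field(default_factory=list)

# Each gate takes the output and returns a result with .passed and .reason.
# The check_* functions are assumed to be defined elsewhere in the pipeline.
CRITICAL_GATES = [
    check_no_sql_injection,
    check_no_hardcoded_secrets,
    check_all_tests_pass,
    check_no_undefined_variables,
]
COSMETIC_GATES = [
    check_docstrings_present,
    check_line_length,
    check_import_order,
]

def evaluate_output(output: str) -> GateResult:
    # Any critical failure blocks promotion, regardless of cosmetic results.
    critical_results = [gate(output) for gate in CRITICAL_GATES]
    if not all(r.passed for r in critical_results):
        failing = [r for r in critical_results if not r.passed]
        return GateResult(promoted=False, reasons=[r.reason for r in failing])
    # Cosmetic checks are scored only after every critical gate has passed.
    # Note: with only three cosmetic gates, 2/3 (67%) is below 0.8, so all three must pass.
    cosmetic_results = [gate(output) for gate in COSMETIC_GATES]
    cosmetic_pass_rate = sum(r.passed for r in cosmetic_results) / len(cosmetic_results)
    if cosmetic_pass_rate < 0.8:
        return GateResult(promoted=False, reasons=["Cosmetic quality below 80%"])
    return GateResult(promoted=True)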
Critical gates must all pass. Cosmetic gates use a threshold.
Step 3: Replace LLM-only review with a grounded checklist
import re
from typing import NamedTuple

class CheckResult(NamedTuple):
    description: str
    passed: bool

# Each regex matches a violation; None marks a check that needs semantic review.
SECURITY_CHECKLIST = [
    ("No SQL string interpolation", r"f['\"].*SELECT.*\{"),
    ("No eval() calls", r"\beval\s*\("),
    ("No hardcoded API keys", r"(?:api_key|secret|password)\s*=\s*['\"][^'\"]{20,}"),
    ("Parameterized queries used", None),  # requires semantic check
]

def check_security(code: str) -> list[CheckResult]:
    results = []
    for description, pattern in SECURITY_CHECKLIST:
        if pattern:
            # The check passes when the violation pattern is absent.
            passed = not re.search(pattern, code, re.IGNORECASE)
            results.append(CheckResult(description, passed))
        else:
            # Semantic checks still use LLM, but with explicit criteria.
            # llm_check is assumed to return the model's reply as a string.
            verdict = llm_check(code, "Does the code use parameterized queries? Answer PASS or FAIL only.")
            results.append(CheckResult(description, verdict.strip().upper() == "PASS"))
    return results
Even LLM-based checks should use binary “PASS/FAIL only” prompts with explicit criteria, not open-ended review.
Step 4: Add adversarial examples to the gate test suite
# The second case requires check_functional_implementation to be added to
# CRITICAL_GATES; the Step 2 list does not include it yet.
GATE_ADVERSARIAL_TESTS = [
    # Should fail: SQL injection despite passing syntax checks
    {
        "code": "def get_user(name):\n    return db.execute(f'SELECT * FROM users WHERE name={name}')",
        "expected": "FAIL",
        "failing_gate": "check_no_sql_injection",
    },
    # Should fail: stub that games the "3 functions" gate
    {
        "code": "def a(): pass\ndef b(): pass\ndef c(): pass",
        "expected": "FAIL",
        "failing_gate": "check_functional_implementation",
    },
]

def test_gates_reject_adversarial_inputs():
    for test in GATE_ADVERSARIAL_TESTS:
        result = evaluate_output(test["code"])
        assert not result.promoted, f"Gate incorrectly promoted adversarial input: {test}"
Step 5: Reintroduce human review for high-stakes outputs using a risk score
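# Any single high-risk flag (auth, payments, migrations) scores at least 0.3,
# so each one is enough on its own to require human review.
REVIEW_THRESHOLD = 0.3

def compute_risk_score(output: str, task: dict) -> float:
    score = 0.0
    if task.get("touches_auth"):
        score += 0.4
    if task.get("touches_payments"):
        score += 0.4
    if task.get("runs_migrations"):
        score += 0.3
    # output_confidence is an assumed helper returning a value in [0, 1].
    if output_confidence(output) < 0.7:
        score += 0.2
    return min(score, 1.0)

def promote_or_queue_for_review(output, task):
    risk = compute_risk_score(output, task)
    if risk >= REVIEW_THRESHOLD:
        queue_for_human_review(output, task, reason=f"Risk score {risk:.2f}")
    else:
        promote(output)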
Prevention
- Enumerate every correctness, security, and quality property of valid output and create a gate for each one — gaps in this list are gaps in promotion safety.
- Classify gates as critical (must all pass) vs. cosmetic (threshold); never allow cosmetic checks to outvote critical failures.
- Add adversarial test cases to your gate suite: inputs that pass superficial checks but should fail quality checks.
- Use binary “PASS/FAIL” prompts with explicit criteria for LLM-based checks; open-ended LLM review tends to approve too readily.
- Audit a sample of promoted outputs monthly; a rising false-positive rate is an early warning that scope has grown beyond gate coverage.
- Reintroduce human review for high-risk task types (auth, payments, schema migrations) even if you removed it for low-risk tasks.
- When the pipeline’s scope expands, require gate expansion as a condition of the scope change — not as a follow-up.
- Log the specific checks that gated each promotion decision; this creates an audit trail and shows which checks are rarely the deciding factor, so you can judge whether each one is dead weight or a rarely triggered but important safeguard.
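The logging item in the list above could be sketched as one JSON line per decision plus a tally of which check was the only blocker. The record layout and the helper names `promotion_log_line` and `sole_blocker_counts` are assumptions for illustration, not an established schema.

```python
import json
from collections import Counter


def promotion_log_line(output_id: str, promoted: bool, checks: dict[str, bool]) -> str:
    """Serialize one promotion decision, including every check result, as a JSON line."""
    record = {
        "output_id": output_id,
        "promoted": promoted,
        "checks": checks,
        "failed_checks": sorted(name for name, ok in checks.items() if not ok),
    }
    return json.dumps(record, sort_keys=True)


def sole_blocker_counts(log_lines: list[str]) -> Counter:
    """Count how often each check was the only failing check on a blocked output."""
    counts: Counter = Counter()
    for line in log_lines:
        record = json.loads(line)
        failed = record["failed_checks"]
        if not record["promoted"] and len(failed) == 1:
            counts[failed[0]] += 1
    return counts
```

A check that never appears in `sole_blocker_counts` over a long period is either redundant with another check or guarding against a rare failure, and telling the two apart still takes human judgment.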
FAQ
Q: How strict should gates be for an internal development pipeline vs. a customer-facing one?
A: Internal pipelines can use looser gates with human review before anything reaches production. Customer-facing pipelines need the same critical gates as production code: treat any output that customers will see as production-grade.
Q: Our pipeline runs 1,000 times a day, so manual auditing isn’t feasible. What do we do?
A: Sample at 1% (10/day) for routine audit. Immediately review 100% of outputs that are flagged by any gate check but still promoted (i.e., outputs that only barely passed). High-risk task types should always have human review, regardless of volume.
Q: Is there a risk of making gates too strict and blocking good output?
A: Yes. False negatives (blocking good output) increase iteration time and reduce pipeline value. Track both the false-positive rate (bad promoted) and the false-negative rate (good blocked). Target a false-positive rate under 2% and a false-negative rate under 10% as a starting balance.
Q: How do I handle gates that conflict, for example when output passes security but fails test coverage?
A: Fail the promotion. Critical gates do not trade off against each other. Generate a detailed failure report listing exactly which gates failed so the agent can attempt to fix specifically those issues on the next iteration.
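A failure report for the last answer can be as simple as a formatted list of gate names and reasons that is fed back to the generator. The sketch below assumes failures are collected as a mapping from gate name to reason; `build_failure_report` is a hypothetical helper.

```python
def build_failure_report(failures: dict[str, str]) -> str:
    """Format failed gates (name -> reason) as feedback for the generator's next attempt."""
    if not failures:
        return "All gates passed."
    lines = ["Promotion blocked. Fix only these gate failures:"]
    for gate, reason in sorted(failures.items()):
        lines.append(f"- {gate}: {reason}")
    return "\n".join(lines)


print(build_failure_report({
    "check_all_tests_pass": "2 of 14 tests failed",
    "check_no_sql_injection": "f-string used to build a SELECT query",
}))
```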