AI-Generated Tests Pass But Feature Is Broken

All green, ship it, prod breaks. The tests covered the happy path only and mocks shielded the real branches. Five causes and the fastest fix.

Published: May 21, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

CI is fully green, coverage says 92%, you merge the PR, and the very first real call in production blows up. You open the AI-generated tests and find: the database is mocked, the payment API is mocked, and the only assertion is expect(result).toBeDefined(). This is the single most common failure mode when Claude Code, Cursor, or Codex write tests autonomously. They optimize for “make the test pass,” not “make the real scenario pass.”

Fastest fix: hand-write one end-to-end test that replays the real failing input against a real (or staging) database and API. If that one test fails while the suite stays green, you have confirmed the unit tests never exercised the path, and you now have a regression gate. Then audit the mock list and strengthen weak assertions. Steps 1 and 2 below typically surface 50% of the false-green tests within an hour.

This is not anecdotal. As of June 2026, peer-reviewed benchmarks put LLM-generated unit tests at an average mutation score of roughly 40% on real-world functions, and industry write-ups report scores near 20% on complex functions. In other words, most injected bugs slip past the suite. High line coverage with low mutation score is the measurable signature of this problem.

Common causes

Ordered by hit rate, highest first.

1. Tests only cover the happy path

AI defaults to “normal input returns normal output” and skips empty arrays, oversized strings, network timeouts, race conditions, and expired tokens, the branches that actually break in production.

// Typical AI-generated test
it("returns user data", async () => {
  const user = await getUser("123");
  expect(user).toBeDefined();
});
// Missing: getUser("") / getUser(null) / network down / 404 / 5xx

How to spot it: grep test files for throw / reject / error keywords. If a module has 10 tests and 0 error branches, this is it.
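The grep check can also be scripted so it reports a ratio per file. The following is a minimal sketch, assuming tests are declared with `it(` or `test(`; `errorBranchStats` and `sampleSource` are hypothetical names, and the keyword match is a heuristic, not a parser.

```typescript
// Heuristic: a test block counts as an error-branch test if it mentions
// throw / reject / error anywhere (case-insensitive), mirroring the grep above.
const ERROR_HINTS = /throw|reject|error/i;

function errorBranchStats(source: string): { total: number; errorTests: number } {
  // Split before each it( or test( call; the first chunk is file preamble.
  const blocks = source.split(/\b(?:it|test)\s*\(/).slice(1);
  const errorTests = blocks.filter((block) => ERROR_HINTS.test(block)).length;
  return { total: blocks.length, errorTests };
}

// Hypothetical test file content: one happy-path test, one error-branch test.
const sampleSource = [
  'it("returns user data", async () => {',
  '  expect(await getUser("123")).toBeDefined();',
  "});",
  'it("rejects an empty id", async () => {',
  '  await expect(getUser("")).rejects.toThrow("id required");',
  "});",
].join("\n");

const stats = errorBranchStats(sampleSource);
// stats.total is 2 and stats.errorTests is 1
```

A file where `errorTests` is 0 is a strong candidate for cause 1.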

2. Critical dependencies mocked away, real integration untested

AI tends to mock the database, HTTP client, and filesystem to “make tests fast.” Result: schema mismatches, renamed API fields, and SQL errors slip through. This is the highest-severity bucket because the mock and the assertion are written together, so the test is really just checking the mock returns what you told it to return.

jest.mock("./db", () => ({
  getUser: jest.fn().mockResolvedValue({ id: "123", name: "Test" })
}));
// Real db.getUser returns { user_id, full_name } - schema changed long ago,
// test stays green

How to spot it: count jest.mock / vi.mock / unittest.mock.patch calls in the test file. More than 3 and you should audit which core dependencies are being shielded.
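Counting mocks tells you where to look; comparing a mock's return value against one recorded real response tells you whether it has drifted. Here is a rough sketch of that comparison, reusing the field names from the example above. `shapeDrift` is a hypothetical helper and `recordedUser` is an assumed sample row, not real data.

```typescript
// Reports top-level keys that exist on only one side of the comparison.
function shapeDrift(
  mocked: Record<string, unknown>,
  real: Record<string, unknown>,
): { onlyInMock: string[]; onlyInReal: string[] } {
  return {
    onlyInMock: Object.keys(mocked).filter((key) => !(key in real)),
    onlyInReal: Object.keys(real).filter((key) => !(key in mocked)),
  };
}

// What the jest.mock factory returns (old field names).
const mockedUser = { id: "123", name: "Test" };
// Assumed sample of what the real db.getUser returns after the schema change.
const recordedUser = { user_id: "123", full_name: "Test" };

const drift = shapeDrift(mockedUser, recordedUser);
// drift.onlyInMock -> ["id", "name"]
// drift.onlyInReal -> ["user_id", "full_name"]
```

Any non-empty list means the mock describes a schema the real dependency no longer has.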

3. Assertions too weak, only verifying “something returned”

toBeDefined() / toBeTruthy() / toHaveBeenCalled() on their own carry almost no information. A function returning {} will pass.

// weak
expect(result).toBeDefined();
expect(saveUser).toHaveBeenCalled();

// strong
expect(result).toEqual({ id: "123", email: "a@b.com", status: "active" });
expect(saveUser).toHaveBeenCalledWith({ id: "123", email: "a@b.com" });
expect(saveUser).toHaveBeenCalledTimes(1);

How to spot it: search the test files for bare uses of toBeDefined / toBeTruthy / toHaveBeenCalled() without arguments. More than 5 hits and you need structural assertions.
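If you would rather script the search than eyeball grep output, a small scanner is enough. This is a sketch under the assumption that each assertion sits on one line; `findBareAssertions` is a hypothetical helper, not part of Jest or Vitest.

```typescript
// Matches .toBeDefined(), .toBeTruthy(), and .toHaveBeenCalled() with empty parentheses.
// toHaveBeenCalledWith(...) and toHaveBeenCalledTimes(...) do not match.
const BARE_ASSERTION = /\.(toBeDefined|toBeTruthy|toHaveBeenCalled)\(\s*\)/;

function findBareAssertions(source: string): { line: number; text: string }[] {
  return source
    .split("\n")
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter((entry) => BARE_ASSERTION.test(entry.text));
}

const sampleTest = [
  "expect(result).toBeDefined();",
  "expect(saveUser).toHaveBeenCalled();",
  'expect(saveUser).toHaveBeenCalledWith({ id: "123" });',
  "expect(saveUser).toHaveBeenCalledTimes(1);",
].join("\n");

const hits = findBareAssertions(sampleTest);
// hits contains lines 1 and 2 only
```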

4. High coverage but dead branches

100% line coverage is achievable with a few calls that execute every line, but each branch then sees only one input shape, so boundary values and flipped operators go unchecked. Mutation testing (Stryker for JS/TS, mutmut for Python) exposes this immediately.

How to spot it: run npx stryker run. A mutation score below 60% means the tests check shape, not values. Or do it manually: change return a + b to return a - b in the unit under test. Does any test fail?
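The manual check can be shown in miniature. The sketch below uses a hypothetical `isAdult` function and a hand-made mutant; the "suites" are plain functions rather than Jest tests so the example stays self-contained.

```typescript
// Unit under test and a hand-made mutant (>= flipped to >).
const isAdult = (age: number): boolean => age >= 18;
const isAdultMutant = (age: number): boolean => age > 18;

type Check = (fn: (age: number) => boolean) => boolean;

// Executes every line of isAdult but never touches the boundary.
const weakSuite: Check[] = [
  (fn) => fn(30) === true,
  (fn) => fn(5) === false,
];

// Adds the boundary value, the only input where the mutant behaves differently.
const strongSuite: Check[] = [...weakSuite, (fn) => fn(18) === true];

const passes = (suite: Check[], fn: (age: number) => boolean): boolean =>
  suite.every((check) => check(fn));

const weakKillsMutant = !passes(weakSuite, isAdultMutant); // false: the mutant survives
const strongKillsMutant = !passes(strongSuite, isAdultMutant); // true: the mutant is killed
```

Both suites pass against the original function and report identical line coverage; only the second one notices the bug.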

5. Test data the AI invented, not real samples

AI-fabricated test data is suspiciously tidy: "test@test.com", "John Doe", 123. Real data has emoji, oversized unicode, null fields, and leading whitespace. The edges are exactly where production breaks.

How to spot it: replay a scrubbed sample of production data through the same tests. If a lot of them fail, your AI test data is too idealized.
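To make the difference concrete, here is a small table-driven sketch. Everything in it is hypothetical: `naiveDisplayName` stands in for code that only ever saw tidy inputs, `displayName` for a hardened version, and the samples imitate scrubbed production rows.

```typescript
type UserRecord = { name: string | null };
type NameFn = (user: UserRecord) => string;

// Naive version that looks fine against "John Doe".
const naiveDisplayName: NameFn = (user) => String(user.name);

// Hardened version: trims whitespace and falls back when the name is missing.
const displayName: NameFn = (user) => {
  const trimmed = (user.name ?? "").trim();
  return trimmed.length > 0 ? trimmed : "Anonymous";
};

const samples: { input: UserRecord; expected: string }[] = [
  { input: { name: "John Doe" }, expected: "John Doe" }, // tidy, AI-style
  { input: { name: "  Alice  " }, expected: "Alice" }, // stray whitespace
  { input: { name: null }, expected: "Anonymous" }, // null field
  { input: { name: "" }, expected: "Anonymous" }, // empty string
  { input: { name: "Zoë 🚀" }, expected: "Zoë 🚀" }, // emoji and non-ASCII
];

const failingInputs = (fn: NameFn): UserRecord[] =>
  samples.filter((s) => fn(s.input) !== s.expected).map((s) => s.input);

const naiveFailures = failingInputs(naiveDisplayName).length; // 3
const hardenedFailures = failingInputs(displayName).length; // 0
```

With only the first sample in the table, both implementations would look identical.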

Which bucket am I in

Run these checks in order and stop at the first match.

Symptom you see	Most likely cause	Go to
Prod error is a missing/renamed field or SQL/HTTP error	Critical dependency mocked (cause 2)	Step 1 + Step 2
Test file is mostly `toBeDefined` / `toHaveBeenCalled()`	Weak assertions (cause 3)	Step 3
Coverage is high but no error-branch tests exist	Happy-path only (cause 1)	Step 4
Coverage high, but `stryker` score is low	Dead branches (cause 4)	Step 5
Fails only on real customer data	Invented test data (cause 5)	Step 1

Shortest path to fix

Ordered by ROI. Step 1 plus Step 2 typically expose 50% of the false-green tests within an hour.

Step 1: Hand-write one end-to-end test that reproduces the prod failure

Do not ask the AI to write it. Pull the real failing input from production logs and manually write one e2e test that uses the real DB and real API (or a staging mirror). If that one test fails, it proves the unit tests never covered the path.

// tests/e2e/checkout.e2e.test.ts
it("processes a real Stripe checkout end to end", async () => {
  const order = await placeOrder({
    userId: "real-staging-user-123",
    items: [{ sku: "SKU-001", qty: 2 }],
    paymentToken: process.env.STRIPE_TEST_TOKEN,
  });
  expect(order.status).toBe("paid");
  expect(order.stripeChargeId).toMatch(/^ch_/);
});

Run npm test -- checkout.e2e to execute it alone. If it fails, keep it as a regression gate.

Step 2: Audit the mock list, kick out core dependencies

List every mock in the repo:

grep -rn "jest.mock\|vi.mock" tests/ src/ | grep -v node_modules

Walk each entry and ask “is this a core dependency?” Rules of thumb:

Database, ORM, HTTP client, message queue, payments: do not mock. Use Testcontainers for a real DB and msw / nock for HTTP. Pin the container image to the same major version you run in production (for example postgres:16, not postgres:latest), and reuse one container per test suite rather than per test so startup does not dominate runtime.
Third-party SaaS (OpenAI, Stripe, SendGrid): mocking is OK, but validate the request payload shape, not just the response.
Time, RNG, filesystem: mocking is fine.

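// Use msw to intercept real HTTP - schema mismatches fail immediately.
// msw v2 syntax (the current major as of June 2026):
import { http, HttpResponse } from "msw";
import { setupServer } from "msw/node";

const server = setupServer(
  http.get("/api/users/:id", () => HttpResponse.json({ id: "123", email: "a@b.com" }))
);

// Start interception for the suite and fail on any request that has no handler.
beforeAll(() => server.listen({ onUnhandledRequest: "error" }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());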
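For third-party SaaS where mocking is acceptable, the rule about validating the request payload can be sketched with a fake that rejects malformed input instead of returning a canned success. The payload fields below are assumptions for illustration, not any provider's real API, and `makeFakeSender` is a hypothetical helper.

```typescript
// Hypothetical payload shape for a transactional email provider.
type EmailPayload = { to: string; template: string; vars: Record<string, string> };

// Type guard for the shape the provider is assumed to require.
function isEmailPayload(value: unknown): value is EmailPayload {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const to = candidate.to;
  const template = candidate.template;
  const vars = candidate.vars;
  return (
    typeof to === "string" &&
    to.indexOf("@") > 0 &&
    typeof template === "string" &&
    typeof vars === "object" &&
    vars !== null
  );
}

// Fake standing in for the SaaS client: records valid payloads, throws on invalid ones.
function makeFakeSender() {
  const sent: EmailPayload[] = [];
  return {
    sent,
    send(payload: unknown): { accepted: boolean } {
      if (!isEmailPayload(payload)) {
        throw new Error("payload does not match the expected provider shape");
      }
      sent.push(payload);
      return { accepted: true };
    },
  };
}
```

A test that wires this fake into the code under test fails as soon as the code sends a renamed or missing field, which a plain `mockResolvedValue` would never notice.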

Step 3: Strengthen assertions to “value equals specific expectation”

Replace every toBeDefined with toEqual / toMatchObject, and every bare toHaveBeenCalled() with toHaveBeenCalledWith(...) carrying specific arguments.

// before
expect(emailSpy).toHaveBeenCalled();

// after
expect(emailSpy).toHaveBeenCalledWith({
  to: "user@example.com",
  template: "welcome",
  vars: { name: "Alice" }
});
expect(emailSpy).toHaveBeenCalledTimes(1);

Step 4: Give the AI a new test-generation prompt template

Paste this as a standing instruction in Cursor / Claude Code:

For <function>, write tests that include:
1. One real happy path (use specific real-looking data, not "test"/"foo")
2. At least 2 error branches: empty input, network failure, upstream 4xx/5xx
3. At least 1 edge case: unicode, empty string, oversized, race condition
4. Every assertion must check concrete values - bare toBeDefined / toBeTruthy banned
5. Do NOT mock the database, HTTP client, or payment API; use msw / Testcontainers
6. After generating, run mutation testing (npx stryker run); mutation score must be >= 70%

Step 5: Use mutation testing to expose fake coverage

Run periodically:

npx stryker run --mutate "src/**/*.ts"

Stryker flips + to -, > to >=, and true to false, then re-runs the suite. If the suite still passes after a mutation, no test was actually checking that behavior. Stryker’s default thresholds (as of June 2026) are high: 80, low: 60, and break: null. With break unset the run never fails CI, so set it explicitly in stryker.conf.json to gate merges:

{
  "thresholds": { "high": 80, "low": 60, "break": 70 }
}

Any module below your break threshold goes back through the prompt template in Step 4.
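To decide which modules go back, you need a per-module score. The sketch below follows the score definition in Stryker's metrics documentation as we read it (detected divided by valid mutants); the `MutantCounts` field names and the sample numbers are assumptions for illustration, not Stryker's report format.

```typescript
type MutantCounts = { killed: number; timeout: number; survived: number; noCoverage: number };

// detected = killed + timeout; valid = detected + survived + noCoverage.
function mutationScore(counts: MutantCounts): number {
  const detected = counts.killed + counts.timeout;
  const valid = detected + counts.survived + counts.noCoverage;
  // Assumption for this sketch: a module with no valid mutants is treated as passing.
  return valid === 0 ? 100 : (detected / valid) * 100;
}

// Returns the module names whose score falls below the break threshold.
function modulesBelow(report: Record<string, MutantCounts>, breakAt: number): string[] {
  return Object.keys(report).filter((moduleName) => mutationScore(report[moduleName]) < breakAt);
}

// Made-up counts for two hypothetical modules.
const sampleReport: Record<string, MutantCounts> = {
  "src/checkout.ts": { killed: 45, timeout: 0, survived: 5, noCoverage: 0 }, // 90%
  "src/refund.ts": { killed: 8, timeout: 0, survived: 10, noCoverage: 2 }, // 40%
};

const needsWork = modulesBelow(sampleReport, 70);
// needsWork -> ["src/refund.ts"]
```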

How to confirm it is fixed

The hand-written e2e test from Step 1 now passes against staging, and it stays in CI as a regression gate.
grep -rn "jest.mock\|vi.mock" src/ tests/ shows no database, ORM, or payment client in the mock list.
A bare-assertion grep (toBeDefined, toBeTruthy, toHaveBeenCalled() with no args) returns near zero in the changed files.
npx stryker run reports a mutation score at or above your break threshold (70% is a reasonable starting bar), and the run exits non-zero if it drops below.
Manual sanity check: flip one operator in the unit under test (+ to -). At least one test must fail. If none do, the suite is still decorative.

Prevention

Bake into every test-generation prompt: “at least 2 error branches, do not mock DB/HTTP/payments, assertions check concrete values.”
In CLAUDE.md / .cursorrules, explicitly ban bare toBeDefined.
Add mutation testing (Stryker / mutmut) to CI with an explicit break threshold so a low score blocks merge.
Critical paths (checkout, auth, payments) must have at least 1 e2e test using real staging data.
After every production incident, add a regression test and require it in the PR template.
Monthly, review the mock list to make sure no core dependency is being shielded.

FAQ

Why does 92% coverage still ship a broken feature? Line coverage only records that a line ran, not that the assertion would catch a wrong value. A test that runs every line but only asserts toBeDefined() reports high coverage and catches almost nothing. Mutation score, not line coverage, is the metric that reflects whether your tests would notice a bug.

Is it ever OK to let AI write the tests? Yes, for the happy path and for boilerplate. The fix is not “ban AI tests,” it is “constrain them”: give it the Step 4 prompt template, forbid mocking core dependencies, require concrete-value assertions, and gate the result with mutation testing. AI is good at volume and weak at picking the inputs that break things.

What mutation score should I require? Start at 70% for the break threshold and raise it as the suite matures. Stryker’s defaults flag below 60% as a warning and below 80% as not-yet-green. New code under active development can sit higher; legacy modules may need a temporary lower floor. The number matters less than having a floor that fails CI at all.

Should I really run a real database in tests? For the integration and e2e layer, yes. Testcontainers spins up a real database in Docker so SQL quirks, migrations, connection pooling, and schema drift surface in the test instead of in production. Keep fast unit tests with mocks for pure logic, but never mock the DB on the path you ship.

My AI test mocks the function and then asserts the mock’s own return value. Is that useful? No. That is the most common false-green pattern: the test confirms the mock returns what you configured it to return, which tells you nothing about real behavior. Either drop the mock and hit the real dependency, or at minimum assert on the request payload the code sends, not on the canned response.

External references: StrykerJS configuration docs and the Mock Service Worker setupServer API.

Tags: #AI coding #Debug #Troubleshooting
