CI is fully green, coverage says 92%, you merge the PR, and the very first real call in production blows up. You open the AI-generated tests and find: the database is mocked, the payment API is mocked, and the only assertion is expect(result).toBeDefined(). This is the single most common failure mode when Claude Code / Cursor / Codex / Aider write tests autonomously — they optimize for “make the test pass,” not “make the real scenario pass.” This article breaks down five common root causes and the path to turn the test suite from decoration into a real safety net.
Common causes
Ordered by hit rate, highest first.
1. Tests only cover the happy path
AI defaults to “normal input returns normal output” and skips empty arrays, oversized strings, network timeouts, race conditions, expired tokens — the branches that actually break in production.
// Typical AI-generated test
it("returns user data", async () => {
const user = await getUser("123");
expect(user).toBeDefined();
});
// Missing: getUser("") / getUser(null) / network down / 404 / 5xx
How to spot it: grep test files for throw / reject / error keywords; if a module has 10 tests and 0 error branches, this is it.
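The grep check can also be scripted. The sketch below is a minimal illustration, not a complete linter: `auditErrorBranches`, the keyword list, and both sample strings are hypothetical, and the regexes only approximate what counts as a test case or an error branch.

```typescript
// Hypothetical helper: count declared test cases and error-path hints in one
// test file's source text. The keyword list is an assumption; tune it per repo.
interface ErrorBranchReport {
  testCount: number;
  errorHits: number;
  suspicious: boolean;
}

const TEST_CASE = /\b(?:it|test)\s*\(/g;
const ERROR_HINT = /\b(?:toThrow|rejects|throw|reject|error)\b/gi;

function auditErrorBranches(source: string): ErrorBranchReport {
  const testCount = (source.match(TEST_CASE) ?? []).length;
  const errorHits = (source.match(ERROR_HINT) ?? []).length;
  // Flag files that declare tests but never mention an error path.
  return { testCount, errorHits, suspicious: testCount > 0 && errorHits === 0 };
}

// Sample inputs: one happy-path-only file, one that exercises a rejection.
const happyOnly = `
it("returns user data", async () => {
  const user = await getUser("123");
  expect(user).toBeDefined();
});
`;

const withErrorPath = `
it("rejects an empty id", async () => {
  await expect(getUser("")).rejects.toThrow("id required");
});
`;

console.log(auditErrorBranches(happyOnly).suspicious);     // true
console.log(auditErrorBranches(withErrorPath).suspicious); // false
```

A keyword count is only a first filter. A file can mention "error" in a test title and still never assert on a failure, so flagged and unflagged files both deserve a quick read.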
2. Critical dependencies mocked away, real integration untested
AI tends to mock the database, HTTP client, and filesystem to “make tests fast.” Result: schema mismatches, renamed API fields, and SQL errors slip through.
jest.mock("./db", () => ({
getUser: jest.fn().mockResolvedValue({ id: "123", name: "Test" })
}));
// Real db.getUser returns { user_id, full_name } — schema changed long ago,
// test stays green
How to spot it: count jest.mock / vi.mock / unittest.mock.patch calls in the test file. More than 3 and you should audit which core dependencies are being shielded.
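One cheap way to catch the drift described above is to compare the keys a mock returns with the keys of a row captured from the real dependency. This is a sketch under assumptions: `diffKeys` is a hypothetical helper, and the sample values are invented to match the field names in the example above.

```typescript
// Hypothetical drift check: which keys exist only in the mock, and which
// exist only in a sample captured from the real dependency?
type Row = Record<string, unknown>;

interface KeyDrift {
  onlyInMock: string[];
  onlyInReal: string[];
}

function diffKeys(mockValue: Row, realSample: Row): KeyDrift {
  const mockKeys = Object.keys(mockValue);
  const realKeys = Object.keys(realSample);
  return {
    onlyInMock: mockKeys.filter((k) => !realKeys.includes(k)).sort(),
    onlyInReal: realKeys.filter((k) => !mockKeys.includes(k)).sort(),
  };
}

// The mocked value from the example above, next to an invented sample
// shaped like the real row ({ user_id, full_name }).
const mockedUser: Row = { id: "123", name: "Test" };
const realUserSample: Row = { user_id: "123", full_name: "Test" };

const drift = diffKeys(mockedUser, realUserSample);
console.log(drift.onlyInMock); // [ 'id', 'name' ]
console.log(drift.onlyInReal); // [ 'full_name', 'user_id' ]
```

Any non-empty list means the mock no longer describes the real dependency, and the tests built on it are checking a schema that does not exist.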
3. Assertions too weak — only verifying “something returned”
toBeDefined() / toBeTruthy() / toHaveBeenCalled() on their own carry almost no information. A function returning {} will pass.
// weak
expect(result).toBeDefined();
expect(saveUser).toHaveBeenCalled();
// strong
expect(result).toEqual({ id: "123", email: "a@b.com", status: "active" });
expect(saveUser).toHaveBeenCalledWith({ id: "123", email: "a@b.com" });
expect(saveUser).toHaveBeenCalledTimes(1);
How to spot it: search the test files for bare uses of toBeDefined / toBeTruthy / toHaveBeenCalled() without arguments. More than 5 hits and you need structural assertions.
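To see why a "defined" check carries so little information, the sketch below runs a deliberately broken function through two plain TypeScript stand-ins. `weakCheck` and `strongCheck` are hypothetical helpers that only mimic the idea behind a defined-style matcher and a structural matcher; they are not Jest's implementations.

```typescript
interface User {
  id: string;
  email: string;
  status: string;
}

// Deliberately broken implementation: it returns an empty object.
function brokenGetUser(_id: string): Partial<User> {
  return {};
}

// Stand-in for a "defined" style check: anything except undefined passes.
function weakCheck(value: unknown): boolean {
  return value !== undefined;
}

// Stand-in for a structural check on flat objects: same keys, same values.
function strongCheck(actual: Partial<User>, expected: User): boolean {
  const actualKeys = Object.keys(actual).sort();
  const expectedKeys = Object.keys(expected).sort();
  if (actualKeys.length !== expectedKeys.length) return false;
  return expectedKeys.every(
    (key, i) =>
      actualKeys[i] === key &&
      actual[key as keyof User] === expected[key as keyof User],
  );
}

const brokenResult = brokenGetUser("123");
const expectedUser: User = { id: "123", email: "a@b.com", status: "active" };

console.log(weakCheck(brokenResult));                 // true: the bug goes unnoticed
console.log(strongCheck(brokenResult, expectedUser)); // false: the bug is caught
```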
4. High coverage but dead branches
100% line coverage is achievable with a handful of calls whose inputs touch every line, yet each branch still sees only one input shape, so a wrong operator or boundary inside it goes unchecked. Mutation testing (Stryker / mutmut) exposes this immediately.
How to spot it: run npx stryker run. Mutation score < 60% means the tests check shape, not values. Or manually: change return a + b to return a - b in the unit under test — does any test fail?
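The manual check can be modeled in a few lines. The sketch below is a toy illustration of the idea, not how a mutation-testing tool works internally: a "suite" is just a function returning true when its checks pass, and `kills`, `shapeOnlySuite`, and `valueSuite` are hypothetical names.

```typescript
type BinaryOp = (a: number, b: number) => number;

const add: BinaryOp = (a, b) => a + b;       // unit under test
const addMutant: BinaryOp = (a, b) => a - b; // hand-made mutant: + flipped to -

// A suite is modeled as a function that returns true when every check passes.
type Suite = (impl: BinaryOp) => boolean;

// Shape-only suite: checks the return type, never the value.
const shapeOnlySuite: Suite = (impl) => typeof impl(2, 3) === "number";

// Value suite: pins concrete results.
const valueSuite: Suite = (impl) => impl(2, 3) === 5 && impl(-1, 1) === 0;

// A mutant is killed when the suite passes on the original and fails on the mutant.
function kills(suite: Suite, original: BinaryOp, mutant: BinaryOp): boolean {
  return suite(original) && !suite(mutant);
}

console.log(kills(shapeOnlySuite, add, addMutant)); // false: the mutant survives
console.log(kills(valueSuite, add, addMutant));     // true: the mutant is killed
```

Both suites execute every line of `add`, so line coverage is identical. Only the suite that pins values notices the flipped operator.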
5. Test data the AI invented, not real samples
AI-fabricated test data is suspiciously tidy: "test@test.com", "John Doe", 123. Real data has emoji, oversized unicode, null fields, leading whitespace. The edges are exactly where production breaks.
How to spot it: replay a scrubbed sample of production data through the same tests. If a lot of them fail, your AI test data is too idealized.
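A replay run can be as simple as feeding every sample through the unit and collecting the ones that throw. The sketch below assumes a hypothetical `initials` function as the unit under test; the messy samples are invented stand-ins for scrubbed production rows, not real data.

```typescript
// Hypothetical unit under test: naive, and only ever exercised with tidy data.
function initials(fullName: string): string {
  return fullName
    .split(" ")
    .map((part) => part[0].toUpperCase())
    .join("");
}

// Replay samples through the unit and collect the ones that make it throw.
function replay(samples: unknown[], unit: (input: string) => string): unknown[] {
  const failures: unknown[] = [];
  for (const sample of samples) {
    try {
      // The cast is deliberate: production rows do not respect the type.
      unit(sample as string);
    } catch {
      failures.push(sample);
    }
  }
  return failures;
}

const tidySamples: unknown[] = ["John Doe", "Jane Roe"];
// Leading and doubled whitespace, a null field, an empty string.
const messySamples: unknown[] = ["  Ana  Silva", null, ""];

console.log(replay(tidySamples, initials).length);  // 0
console.log(replay(messySamples, initials).length); // 3
```

This only catches inputs that throw. Samples that return a wrong value without throwing still need value assertions against expected outputs.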
Shortest path to fix
Ordered by ROI. Step 1 + Step 2 typically expose 50% of the false-green tests within an hour.
Step 1: Hand-write one end-to-end test that reproduces the prod failure
Don’t ask the AI to write it. Pull the real failing input from production logs and manually write one e2e test that uses the real DB / real API (or a staging mirror). If that one test fails, it proves the unit tests never covered the path.
// tests/e2e/checkout.e2e.test.ts
it("processes a real Stripe checkout end to end", async () => {
const order = await placeOrder({
userId: "real-staging-user-123",
items: [{ sku: "SKU-001", qty: 2 }],
paymentToken: process.env.STRIPE_TEST_TOKEN,
});
expect(order.status).toBe("paid");
expect(order.stripeChargeId).toMatch(/^ch_/);
});
Run npm test -- checkout.e2e to execute it alone. If it fails, you have reproduced the gap the unit tests missed; fix the code, then keep the test as a regression gate.
Step 2: Audit the mock list, kick out core dependencies
List every mock in the repo:
grep -rn "jest.mock\|vi.mock" tests/ src/ | grep -v node_modules
Walk each entry and ask “is this a core dependency?” Rules of thumb:
- Database, ORM, HTTP client, message queue, payments: do not mock. Use testcontainers / msw / nock for realistic simulation.
- Third-party SaaS (OpenAI, Stripe, SendGrid): mocking is OK, but validate the request payload shape.
- Time, RNG, filesystem: mocking is fine.
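// Use msw to intercept real HTTP calls; schema mismatches fail immediately
import { http, HttpResponse } from "msw";
import { setupServer } from "msw/node";
const server = setupServer(
  http.get("/api/users/:id", () => HttpResponse.json({ id: "123", email: "a@b.com" }))
);
// Start interception before the tests run and stop it afterwards
beforeAll(() => server.listen());
afterAll(() => server.close());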
Step 3: Strengthen assertions to “value equals specific expectation”
Replace every toBeDefined with toEqual / toMatchObject, and every bare toHaveBeenCalled() with toHaveBeenCalledWith(...) carrying specific arguments.
// before
expect(emailSpy).toHaveBeenCalled();
// after
expect(emailSpy).toHaveBeenCalledWith({
to: "user@example.com",
template: "welcome",
vars: { name: "Alice" }
});
expect(emailSpy).toHaveBeenCalledTimes(1);
Step 4: Give the AI a new test-generation prompt template
Paste this as a standing instruction in Cursor / Claude Code:
For <function>, write tests that include:
1. One real happy path (use specific real-looking data, not "test"/"foo")
2. At least 2 error branches: empty input, network failure, upstream 4xx/5xx
3. At least 1 edge case: unicode, empty string, oversized, race condition
4. Every assertion must check concrete values — bare toBeDefined / toBeTruthy banned
5. Do NOT mock the database, HTTP client, or payment API; use msw / testcontainers
6. Run mutation testing (npx stryker run) — mutation score must be ≥ 70%
Step 5: Use mutation testing to expose fake coverage
Run periodically:
npx stryker run --mutate "src/**/*.ts"
Stryker flips + to -, > to >=, true to false, then re-runs the suite. Tests that still pass = tests that don’t matter. Any module with mutation score < 60% goes back through the prompt template above.
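The threshold rule is plain arithmetic. The sketch below uses a simplified score (killed mutants as a percentage of killed plus survived) and the 60% cut-off from the text; `mutationScore` and `verdict` are hypothetical helpers, and a real mutation report tracks more categories than this.

```typescript
interface MutationCounts {
  killed: number;
  survived: number;
}

// Simplified score: killed mutants as a percentage of all mutants counted here.
// Real reports also track categories such as timeouts, which this sketch ignores.
function mutationScore({ killed, survived }: MutationCounts): number {
  const total = killed + survived;
  return total === 0 ? 0 : (killed * 100) / total;
}

type Verdict = "pass" | "rework";

// Modules below the threshold go back through the prompt template.
function verdict(counts: MutationCounts, threshold = 60): Verdict {
  return mutationScore(counts) < threshold ? "rework" : "pass";
}

console.log(mutationScore({ killed: 45, survived: 55 })); // 45
console.log(verdict({ killed: 45, survived: 55 }));       // rework
console.log(verdict({ killed: 80, survived: 20 }));       // pass
```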
Prevention
- Bake into every test-generation prompt: “at least 2 error branches + don’t mock DB/HTTP/payments + assertions check concrete values”
- In CLAUDE.md / .cursorrules, explicitly ban bare toBeDefined
- Add mutation testing (Stryker / mutmut) to CI; mutation score < 70% blocks merge
- Critical paths (checkout, auth, payments) must have ≥ 1 e2e test using real staging data
- After every production incident, add a regression test; require it in the PR template
- Monthly, review the mock list to make sure no core dependency is being shielded