AI Test Generation Workflow: Tests You Can Actually Trust

AI-written tests often pass while testing nothing. This adversarial workflow plus mutation testing gives you real coverage, with the 2026 tooling that proves it.

Published: May 17, 2026 Updated: Jun 05, 2026 Author: AI Productivity Guide Team

TL;DR

Asking an AI to “add tests for this function” produces tests that mirror the implementation: they pass forever and catch nothing. The fix is to make generation adversarial. Ask the model for edge cases first, write tests second, then mutate the function and confirm the tests fail. Budget 15 to 30 minutes per non-trivial function. Verify the result with a mutation tester (Stryker-JS for JavaScript/TypeScript, PIT for Java, mutmut for Python), because as of June 2026 mutation testing is the only technique that reliably tells you whether AI-generated tests assert anything at all.

Why “add tests” produces tests that test nothing

When you prompt Claude Code, Cursor, or any LLM IDE with “write tests for this function,” the model reads the implementation and writes assertions that match what the code currently does. If the code has a bug, the test encodes the bug as correct behavior. The suite goes green, your coverage number climbs, and the next regression ships unnoticed.

This is not a hypothetical. Thoughtworks’ Technology Radar Vol 34 (April 2026) moved mutation testing into its Trial ring specifically because, in their words, “in an era of LLM-generated tests, mutation testing is what tells you whether your tests actually assert anything.” The same Radar notes the related failure mode: agents prompted to write tests routinely hallucinate selectors, fixture paths, and APIs from training data that do not exist in your codebase. A test green on the first run is the symptom, not the goal.

The workflow below flips the order. You force the model to enumerate failure modes before it writes a single assertion, then you prove the tests bite.
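
To make that failure mode concrete, here is a minimal sketch in plain TypeScript (no test framework, so it runs anywhere). The function applyDiscount, its bug, and the small passes and expectEqual helpers are hypothetical illustrations, not code from a real project. The first check copies its expected value from what the code returns today and passes; the second derives its expected value from the contract and fails, which is the signal you want.

```typescript
// Hypothetical function under test. Contract: percent is 0..100 and the
// result is price reduced by that percentage.
// Deliberate bug for illustration: divides by 10 instead of 100.
function applyDiscount(price: number, percent: number): number {
  return price - (price * percent) / 10;
}

// Tiny stand-in for a test runner: true if the body does not throw.
function passes(body: () => void): boolean {
  try {
    body();
    return true;
  } catch {
    return false;
  }
}

// Minimal equality assertion (hypothetical helper).
function expectEqual(actual: number, expected: number): void {
  if (actual !== expected) {
    throw new Error(`expected ${expected}, got ${actual}`);
  }
}

// Mirrored test: the expected value was read off the current output,
// so the bug is encoded as correct behavior and the test stays green.
const mirroredTestPasses = passes(() => expectEqual(applyDiscount(200, 10), 0));

// Contract-first test: 10% off 200 must be 180, whatever the code does today.
const contractTestPasses = passes(() => expectEqual(applyDiscount(200, 10), 180));
```

Against the buggy function, mirroredTestPasses is true and contractTestPasses is false. Only the second test gives you any information about the bug.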

Who this is for

Developers using Claude Code (Opus 4.7 / Sonnet 4.6), Cursor, GitHub Copilot, or Codex to add tests to existing untested code. It is most valuable for utility libraries, parsing functions, billing and pricing logic, permission checks, and anywhere a wrong answer is silent and expensive. Tech leads writing coverage guidelines can hand this workflow to a team as a defensible definition of “what good looks like.”

When this is the wrong tool

End-to-end browser flows that need a real Chromium, real network, and realistic timing. Use Playwright with handwritten scenarios (or a Playwright MCP server so the agent reads selectors from the live DOM instead of guessing).
Property-based testing, where the value is thousands of randomized inputs. Use fast-check (JS) or Hypothesis (Python) directly.
Performance benchmarks and UI snapshot tests, where the real signal is timing or a visual diff, not logic.

Before you start

Write the function’s contract in a few lines: inputs, outputs, side effects, error modes. If you cannot, the model cannot either.
Pin the test framework in context. Otherwise the model writes Jest when your repo uses Vitest.
Add a CLAUDE.md (Claude Code) or .cursor/rules (Cursor) file noting conventions: assertion style, fixture locations, naming, and what to mock versus use real.
Run the existing suite once and confirm it is green. You need to know which failures are new versus pre-existing.
Flag hidden side effects (filesystem, network, Date.now()). Each needs an explicit mock or fake timer.
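
As an illustration of the contract step, the sketch below writes the four parts (inputs, outputs, side effects, error modes) as a doc comment above a small function. The helper parsePositiveInt and its exact rules are assumptions made up for this example; the point is only the shape of the contract, which can be pasted into the prompt as-is.

```typescript
/**
 * Contract for a hypothetical parsePositiveInt helper.
 *
 * Inputs:       raw, any string; surrounding whitespace is allowed.
 * Output:       the parsed integer, always >= 1 and a safe integer.
 * Side effects: none (pure; no I/O, no clock, no globals).
 * Error modes:  throws RangeError for empty, non-numeric, fractional,
 *               zero, negative, or too-large input.
 */
function parsePositiveInt(raw: string): number {
  const trimmed = raw.trim();
  // Digits only: rejects "", "abc", "1.5", and "-5" in one check.
  if (!/^\d+$/.test(trimmed)) {
    throw new RangeError(`not a positive integer: "${raw}"`);
  }
  const value = Number(trimmed);
  // Rejects "0" and values beyond Number.MAX_SAFE_INTEGER.
  if (value < 1 || !Number.isSafeInteger(value)) {
    throw new RangeError(`out of range: "${raw}"`);
  }
  return value;
}
```

Each line of the contract maps directly onto edge cases: the error-modes line alone suggests six tests before the model has proposed anything.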

Step by step

Pick one function. Ask for failure modes, not tests. Prompt the model to list 10 inputs that could break it: empty, null, very large, negative, unicode, boundary values, locale-specific, DST transitions, race conditions. Say explicitly: do not write tests yet.
Read the list and add 2 to 4 domain cases the model missed. These are usually the costly ones: a coupon that expired yesterday at midnight UTC, a customer with zero subscriptions, a price that rounds to a half-cent.
Ask for one test per case, arrange-act-assert, descriptive name. Example prompt:

Write one test per case from the list above. Use Vitest. Arrange-act-assert.
Test name pattern: should [behavior] when [condition].
Do not refactor the function under test. Do not add helpers unless necessary.

Run the tests. Any that pass on the first run are suspect. Open them and confirm the assertion exercises the path. The classic tell is expect(result).toBeDefined() where it should be expect(result).toEqual(specificValue).
For each failing test, decide: is the test wrong or the function wrong? If the function is wrong, you found a real bug. Fix the function. If the test is wrong, fix the test.
Run mutation testing to prove the tests bite. Manually, ask the model to mutate the function in three plausible ways (flip a comparison, remove a guard clause, swap two parameters) and check each is caught. For an objective number, run a mutation tester (see the table below) and read the mutation score, not the line-coverage percentage.
Commit tests and any function fixes as separate commits so the regression history stays clean.
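
The steps above can be sketched end to end in plain TypeScript. Everything here is a hypothetical illustration: the isCouponValid function, the case list, and the three hand-written mutants stand in for what you and the model would produce, and the small failingCases runner stands in for Vitest. The sketch assumes a coupon is valid strictly before its expiry instant.

```typescript
type CouponCheck = (expiresAtMs: number, nowMs: number) => boolean;

// Hypothetical function under test: valid strictly before the expiry instant.
const isCouponValid: CouponCheck = (expiresAtMs, nowMs) => nowMs < expiresAtMs;

// Edge cases listed first, then one expectation per case.
const EXPIRY = Date.UTC(2026, 5, 1); // midnight UTC on 1 June 2026
const DAY_MS = 24 * 60 * 60 * 1000;
const cases: Array<{ label: string; nowMs: number; expected: boolean }> = [
  { label: "should accept when one ms before expiry", nowMs: EXPIRY - 1, expected: true },
  { label: "should reject when exactly at expiry", nowMs: EXPIRY, expected: false },
  { label: "should reject when one ms after expiry", nowMs: EXPIRY + 1, expected: false },
  { label: "should reject when expired a day ago", nowMs: EXPIRY + DAY_MS, expected: false },
];

// Run every case against an implementation and return the failing labels.
function failingCases(impl: CouponCheck): string[] {
  return cases
    .filter((c) => impl(EXPIRY, c.nowMs) !== c.expected)
    .map((c) => c.label);
}

// Manual mutation step: three plausible ways to break the function.
const mutants: Record<string, CouponCheck> = {
  flippedComparison: (expiresAtMs, nowMs) => nowMs <= expiresAtMs,
  removedGuard: () => true,
  swappedParameters: (expiresAtMs, nowMs) => expiresAtMs < nowMs,
};

// A mutant survives if no case fails against it. Survivors mean weak tests.
const survivors = Object.keys(mutants).filter(
  (mutantName) => failingCases(mutants[mutantName]).length === 0,
);
```

With these cases, survivors is empty: every mutant fails at least one case. Delete the "exactly at expiry" case and the flipped comparison survives, which is how a missing boundary test shows up in practice.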

Mutation testing tools (as of June 2026)

Line coverage tells you which lines ran. Mutation score tells you which lines your tests actually verify. These are the maintained options by language:

Language	Tool	Latest (June 2026)	Notes
JS / TS	Stryker-JS	v9.6.x (Apr 2026)	Vitest 4.x and Jest runners; HTML report on by default; the de facto standard
Java / JVM	PIT (pitest)	Maven plugin 1.15.x	Operates on bytecode; JUnit 5 plugin; Maven + Gradle integration
Python	mutmut / cosmic-ray	actively maintained	mutmut for quick local runs; cosmic-ray for build-tool integration
.NET	Stryker.NET	current	Same Stryker family and report format; C# mutators

A mutation score in the 70 to 90% range on critical code is a realistic target. Chasing 100% wastes time on equivalent mutants that no test can distinguish.
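
The arithmetic behind that target is simple enough to sketch. The MutationRun shape and both function names below are assumptions for illustration; real tools such as Stryker-JS also track categories like timeouts and uncovered mutants, so their reported number may be computed somewhat differently.

```typescript
// Hypothetical summary of one mutation run.
interface MutationRun {
  killed: number;     // mutants that made at least one test fail
  survived: number;   // mutants the suite did not notice
  equivalent: number; // survivors judged by hand to behave like the original
}

// Raw score: share of all mutants that the tests killed, as a percentage.
function mutationScore(run: MutationRun): number {
  const total = run.killed + run.survived;
  if (total === 0) {
    throw new RangeError("no mutants were generated");
  }
  return (run.killed / total) * 100;
}

// Adjusted score: the same ratio after setting aside equivalent mutants,
// which no test can distinguish from the original code.
function adjustedMutationScore(run: MutationRun): number {
  if (run.equivalent > run.survived) {
    throw new RangeError("equivalent mutants must be a subset of survivors");
  }
  const distinguishable = run.killed + run.survived - run.equivalent;
  if (distinguishable === 0) {
    throw new RangeError("no distinguishable mutants");
  }
  return (run.killed / distinguishable) * 100;
}
```

For a run with 36 killed, 12 survived, and 3 of those survivors judged equivalent, the raw score is 75% and the adjusted score is about 80%. The gap between the two is the part of "100%" that is not worth chasing.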

Coverage tooling note (Vitest)

If you use Vitest, set coverage.provider deliberately. AST-based remapping for the v8 provider arrived in Vitest 3.2 as an opt-in and is the default behavior in Vitest 4, so you get V8’s speed with Istanbul-level accuracy, which removes the old reason to switch to istanbul. Either way, treat coverage as a side effect, not a goal: 60% coverage with strong assertions beats 95% with weak ones, and a high line-coverage number with all the gaps in error-handling paths is worse than an honest lower one.

First-run exercise

Pick the smallest pure function you have: a string parser, date utility, or ID generator. Run the full workflow end to end, including a real mutation run. Most developers find the model catches 60 to 80% of edge cases on the first list, while the human additions account for the genuinely surprising bugs. For a second pass, change one variable (a different model, or a CLAUDE.md with your conventions) and rerun. The diff in mutation score tells you which factor mattered.

Quality checklist

Every test has a non-trivial assertion. No standalone toBeTruthy() or toBeDefined(), which pass on almost anything.
Test names describe behavior, not implementation. “returns null on empty input” yes; “calls slice with 0” no.
No test mocks the function under test. Models sometimes do this to force a pass.
Mutation score recorded, not just line coverage. At least the three hand-injected bugs were caught.
Tests and fixes committed separately, so regressions trace cleanly.
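
The first checklist item is easiest to see side by side. In this sketch the slugify function, its broken variant, and the two check helpers are hypothetical; weakCheck plays the role of toBeDefined() and strongCheck plays the role of toEqual(specificValue), written as plain functions so the example runs without a test framework.

```typescript
type Slugify = (title: string) => string;

// Hypothetical function under test.
const slugify: Slugify = (title) =>
  title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of other characters to "-"
    .replace(/^-+|-+$/g, "");    // strip leading and trailing dashes

// A broken variant, such as a bad refactor might leave behind.
const brokenSlugify: Slugify = (title) => title;

const SAMPLE_TITLE = "  Hello, World!  ";

// Weak: equivalent in spirit to toBeDefined(); true for any returned value.
const weakCheck = (impl: Slugify): boolean => impl(SAMPLE_TITLE) !== undefined;

// Strong: equivalent in spirit to toEqual("hello-world").
const strongCheck = (impl: Slugify): boolean => impl(SAMPLE_TITLE) === "hello-world";

const weakResults = { good: weakCheck(slugify), broken: weakCheck(brokenSlugify) };
const strongResults = { good: strongCheck(slugify), broken: strongCheck(brokenSlugify) };
```

The weak check passes for both implementations, so it cannot tell them apart. The strong check passes only for the working one, which is the whole job of an assertion.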

Common mistakes

Asking for tests directly, skipping the edge-case step. The tests stay shallow and mirror the implementation.
Not running the tests. Models occasionally write a test whose intent is right but whose assertion is wrong.
Generating tests and fixing failures in the same prompt. The model often “fixes” by editing the test to pass instead of fixing the function.
Treating 100% line coverage as the goal. Coverage without strong assertions catches no bugs.
Forgetting fake timers and mocks for time- or network-dependent code. Tests pass locally and fail in CI.
Letting the model move existing assertions into helper functions during generation, which makes the diff unreadable.

Advanced tips

For branching functions, prompt explicitly: one test per branch (true and false for each conditional, every switch case, every error path).
For async or time-dependent code, instruct the model to set up fake timers and show one worked example in CLAUDE.md.
For parsing functions, add an explicit malformed-input section to the edge-case list. Models under-test garbage inputs by default.
For database-touching code, ask for an in-memory or transaction-rollback pattern. Otherwise the model suggests broad mocks that prove nothing.
Save three prompts as snippets: edge-case-list, write-tests-from-cases, mutation-check. Each then needs only a one-line tweak per function.
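
For the time-dependent tip, one worked example to keep in CLAUDE.md might look like the sketch below. It uses an injected clock rather than a framework's fake timers, which is a swapped-in technique chosen so the example runs without Vitest; in a Vitest suite, vi.useFakeTimers() and vi.setSystemTime() serve the same purpose by patching the global clock. The isSessionExpired function and the fixedClock helper are hypothetical.

```typescript
type Clock = () => number;

// Hypothetical function under test. The clock is an explicit parameter that
// defaults to Date.now, so production callers do not have to pass it.
function isSessionExpired(
  startedAtMs: number,
  ttlMs: number,
  now: Clock = Date.now,
): boolean {
  return now() - startedAtMs >= ttlMs;
}

// Test helper: a clock frozen at a known instant.
function fixedClock(ms: number): Clock {
  return () => ms;
}

const SESSION_START = Date.UTC(2026, 0, 1); // midnight UTC on 1 January 2026
const SESSION_TTL = 30 * 60 * 1000;         // 30 minutes

// Boundary cases that would be flaky against the real clock.
const expiredJustBefore = isSessionExpired(
  SESSION_START,
  SESSION_TTL,
  fixedClock(SESSION_START + SESSION_TTL - 1),
);
const expiredExactlyAt = isSessionExpired(
  SESSION_START,
  SESSION_TTL,
  fixedClock(SESSION_START + SESSION_TTL),
);
```

With the clock frozen, the one-millisecond boundary is testable: expiredJustBefore is false and expiredExactlyAt is true, on every machine and in CI.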

FAQ

Should I use AI for unit or integration tests? Unit tests are safest. Integration tests need real environment setup that models frequently get wrong (wrong port, wrong fixture path, wrong cleanup).
Will AI write better tests than me? It generates more cases faster and surfaces edge cases you forget. Final quality still depends on your review and on the mutation score, not the model.
My coverage tool says 95%. Why does this workflow still find bugs? Coverage measures lines executed, not behaviors verified. A function with no assertions can hit 100% line coverage.
Can I generate tests for legacy code with no documentation? Yes. Have the model read the function and write the contract first. Verify the contract is correct before generating tests against it.
How long does this take per function? 15 to 30 minutes for non-trivial pure functions, longer for async or stateful code. It pays back the first time it catches a real regression instead of letting it ship.
Should I automate mutation testing in CI? For most teams, manual mutation inside the prompt is enough for day-to-day work. Wire Stryker-JS or PIT into CI when test quality is a measured KPI, and run it on critical paths only since full-repo mutation runs are slow.

Tags: #AI coding #Tutorial #Workflow
