What this tutorial solves
Asking an AI to “add tests for this function” usually produces tests that mirror the implementation — they pass forever, they catch no bugs, and they give you a false sense of coverage. Real AI test generation has to be adversarial: ask for edge cases first, write tests second, then mutate the code to prove the tests actually fail when they should. This workflow takes about 15 to 30 minutes per non-trivial function and routinely surfaces 2 to 4 real bugs in code you thought was solid.
Who this is for
Developers using Claude Code, Cursor, Copilot, or any LLM IDE to add tests to existing untested code. Especially useful for utility libraries, parsing functions, billing logic, permission checks, and anywhere a wrong answer is silent and costly. Also for tech leads writing test-coverage guidelines: this workflow encodes a defensible “what good looks like” you can hand to the team.
When to reach for it
Untested utility functions you inherited, business logic with branching behavior (pricing, tax, eligibility), code that just absorbed a bug fix and needs a regression test, and any pre-refactor moment where you need a safety net before touching the implementation. Also valuable when onboarding to a new codebase — generating tests is the fastest way to learn what a function actually does.
When this is NOT the right tool
End-to-end browser tests where you need real Chromium, real network, and realistic timing — use Playwright with handwritten scenarios. Property-based tests where the value is generating thousands of randomized inputs — use fast-check or Hypothesis directly. Performance benchmarks. Anything UI snapshot-based where the actual signal is visual diff, not logic.
Before you start
- Be explicit about the function’s contract: inputs, outputs, side effects, error modes. If you cannot summarize it in three lines, AI cannot either.
- Choose the test framework AI should use and put the answer in context. Otherwise it picks Jest when you have Vitest.
- Have a CLAUDE.md or .cursorrules file noting conventions: assertion style, fixture locations, naming, what to mock vs use real.
- Run the test suite once on green before starting. You need to know which failures are new vs pre-existing.
- Identify whether the function has hidden side effects (filesystem, network, time). Each requires explicit mock setup.
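One way to make the contract and the hidden time dependency explicit is to write the three-line summary as a comment and pass the clock in as a parameter. The sketch below is only an illustration: `Coupon`, `isCouponActive`, and the expiry rule are hypothetical, not taken from any real codebase.

```typescript
// Hypothetical example: the type, function name, and rules are assumptions.
interface Coupon {
  code: string;
  expiresAt: Date; // the coupon is valid strictly before this instant
}

/**
 * Contract, in three lines:
 * Input:  a coupon and the current time (`now` is injected, defaulting to the system clock).
 * Output: true if the code is non-blank and `now` is strictly before `expiresAt`.
 * Errors: throws RangeError if `expiresAt` is an invalid Date; no other side effects.
 */
function isCouponActive(coupon: Coupon, now: Date = new Date()): boolean {
  if (Number.isNaN(coupon.expiresAt.getTime())) {
    throw new RangeError("expiresAt is not a valid date");
  }
  if (coupon.code.trim() === "") return false;
  return now.getTime() < coupon.expiresAt.getTime();
}
```

Because `now` is a parameter, a test can pin the time to one second before or exactly at expiry without any mock setup.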
Step by step
- Pick one function. Ask AI to list 10 inputs that could break it: empty, null, very large, negative, unicode, boundary values, locale-specific, DST transitions, race conditions. Tell it explicitly: do not write tests yet.
- Read the list. Add 2 to 4 cases you can think of that AI missed — usually the domain-specific ones (a coupon code that expired yesterday at midnight UTC, a customer with zero subscriptions).
- Now ask AI to write one test per case, arrange-act-assert structure, descriptive name. Example prompt:
Write one test per case from the list above. Use Vitest. Arrange-act-assert.
Test name: should <behavior> when <condition>.
Do not refactor the function under test. Do not add helpers unless necessary.
- Run the tests. Any that pass on day 1 are suspect: open them and verify the assertion actually exercises the path. Common failure: expect(result).toBeDefined() instead of expect(result).toEqual(specific value).
- For each failing test, decide: is the test wrong or is the function wrong? If the function is wrong, you just found a real bug, so fix the function. If the test is wrong, fix the test.
- After tests pass, run mutation testing manually: ask AI to mutate the function in 3 plausible ways (flip a comparison, remove a guard, swap a parameter). Do the tests catch each mutation? If not, write the missing test.
- Commit tests and any function fixes as separate commits so the regression history is clean.
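The steps above can be sketched end to end in plain TypeScript. This is a minimal illustration under stated assumptions: `applyDiscount` and its pricing rules are hypothetical, and the checks use plain comparisons rather than Vitest so the block runs on its own. In a real project each row would become one Vitest test.

```typescript
// Hypothetical function under test; the name and pricing rules are assumptions.
type Discount = (priceCents: number, percent: number) => number;

const applyDiscount: Discount = (priceCents, percent) => {
  if (!Number.isInteger(priceCents) || priceCents < 0) throw new RangeError("bad price");
  if (!(percent >= 0 && percent <= 100)) throw new RangeError("bad percent"); // also rejects NaN
  return Math.round(priceCents * (1 - percent / 100));
};

// Steps 1-2: the reviewed edge-case list, kept as data. "throws" marks an error case.
type Case = { name: string; args: [number, number]; expected: number | "throws" };
const cases: Case[] = [
  { name: "should return full price when percent is 0", args: [1000, 0], expected: 1000 },
  { name: "should return 0 when percent is 100", args: [1000, 100], expected: 0 },
  { name: "should return 0 when price is 0", args: [0, 50], expected: 0 },
  { name: "should round half up when result is fractional", args: [999, 50], expected: 500 },
  { name: "should throw when percent is negative", args: [1000, -1], expected: "throws" },
  { name: "should throw when percent exceeds 100", args: [1000, 101], expected: "throws" },
  { name: "should throw when price is negative", args: [-1, 10], expected: "throws" },
  { name: "should throw when percent is NaN", args: [1000, NaN], expected: "throws" },
];

// Steps 3-5: one check per case; returns the names of the cases that fail for `impl`.
function failingCases(impl: Discount): string[] {
  return cases
    .filter((c) => {
      let actual: number | "throws";
      try {
        actual = impl(c.args[0], c.args[1]);
      } catch {
        actual = "throws";
      }
      return actual !== c.expected;
    })
    .map((c) => c.name);
}

// Step 6: three hand-injected mutants. A mutant survives if no case fails against it.
const mutants: Array<{ label: string; impl: Discount }> = [
  {
    label: "flip a comparison: percent <= 100 becomes percent < 100",
    impl: (priceCents, percent) => {
      if (!Number.isInteger(priceCents) || priceCents < 0) throw new RangeError("bad price");
      if (!(percent >= 0 && percent < 100)) throw new RangeError("bad percent");
      return Math.round(priceCents * (1 - percent / 100));
    },
  },
  {
    label: "remove a guard: price validation deleted",
    impl: (priceCents, percent) => {
      if (!(percent >= 0 && percent <= 100)) throw new RangeError("bad percent");
      return Math.round(priceCents * (1 - percent / 100));
    },
  },
  {
    label: "swap a parameter: price and percent exchanged in the formula",
    impl: (priceCents, percent) => {
      if (!Number.isInteger(priceCents) || priceCents < 0) throw new RangeError("bad price");
      if (!(percent >= 0 && percent <= 100)) throw new RangeError("bad percent");
      return Math.round(percent * (1 - priceCents / 100));
    },
  },
];

const survivors = mutants
  .filter((m) => failingCases(m.impl).length === 0)
  .map((m) => m.label);

console.log("failures on original:", failingCases(applyDiscount));
console.log("surviving mutants:", survivors);
```

Any label that shows up in `survivors` points at a missing test: add a case that fails for that mutant and passes for the original.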
First-run exercise
Pick the smallest pure function you have — string parser, date utility, ID generator. Run the full workflow end to end including mutation testing. Most developers find AI catches 60 to 80% of the edge cases on the first list, and the human additions account for the genuinely surprising bugs. For the second run, change one thing: use a different model, or add a CLAUDE.md with your conventions and rerun. The diff in test quality tells you which factor mattered most.
Quality check
- Every test has a non-trivial assertion. No standalone toBeTruthy() or toBeDefined(): those pass on almost anything.
- Test names describe behavior, not implementation. “returns null on empty input” yes; “calls slice with 0” no.
- No test mocks the function under test. AI sometimes does this to “make it pass”.
- Coverage is a side effect, not a goal. 60% coverage with strong assertions beats 95% with weak ones.
- Mutation testing passed: at least 3 hand-injected bugs were caught by the tests.
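To see why a standalone toBeDefined() fails the first check in this list, compare a weak and a strong check against a correct and a deliberately broken implementation. The sketch is illustrative only: `parsePort` and its broken twin are hypothetical, and plain comparisons stand in for the Vitest matchers so the block runs by itself.

```typescript
// Hypothetical function: parse a TCP port, returning null for anything invalid.
function parsePort(input: string): number | null {
  if (!/^\d+$/.test(input)) return null;
  const port = Number(input);
  return port >= 1 && port <= 65535 ? port : null;
}

// Deliberately broken variant: the range check is missing.
function parsePortBroken(input: string): number | null {
  if (!/^\d+$/.test(input)) return null;
  return Number(input);
}

type Parser = (input: string) => number | null;

// Weak check, equivalent in spirit to expect(result).toBeDefined():
// it only requires that the result is not undefined.
const weakCheck = (parse: Parser): boolean => parse("70000") !== undefined;

// Strong check, equivalent in spirit to asserting specific values.
const strongCheck = (parse: Parser): boolean =>
  parse("70000") === null && parse("8080") === 8080;

console.log("weak check, correct impl:", weakCheck(parsePort));
console.log("weak check, broken impl:", weakCheck(parsePortBroken));
console.log("strong check, correct impl:", strongCheck(parsePort));
console.log("strong check, broken impl:", strongCheck(parsePortBroken));
```

The weak check cannot tell the two implementations apart, which is the property to look for when reviewing AI-written assertions.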
How to reuse this workflow
- Save three prompts as snippets: edge-case-list, write-tests-from-cases, mutation-check. Each needs only a one-line tweak per function.
- Maintain a test-patterns/ directory with examples of good tests for your domain. Reference them in the prompt: AI mimics shape.
- Keep an ai-test-misses/ log: every time a bug ships and tests did not catch it, add the case to the edge-case prompt template.
- Treat tests like content: review the diff carefully before merging. AI-written tests get rubber-stamped, then rot.
- Rerun the workflow on critical functions every quarter — the function evolved, the model evolved, the edge cases evolved.
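The three saved prompts could be kept as small template functions so that the per-function tweak really is one argument. This is a hypothetical sketch: the object name, the parameter names, and the exact wording are assumptions assembled from the steps in this tutorial, not a prescribed format.

```typescript
// Hypothetical snippet templates; adapt the wording to your team's conventions.
const prompts = {
  // Step 1: ask for breaking inputs only.
  edgeCaseList: (fnName: string): string =>
    `List 10 inputs that could break ${fnName}: empty, null, very large, negative, ` +
    `unicode, boundary values, locale-specific, DST transitions, race conditions. ` +
    `Do not write tests yet.`,

  // Step 3: turn the reviewed list into tests.
  writeTestsFromCases: (framework: string): string =>
    `Write one test per case from the list above. Use ${framework}. Arrange-act-assert. ` +
    `Test name: should <behavior> when <condition>. ` +
    `Do not refactor the function under test. Do not add helpers unless necessary.`,

  // Step 6: manual mutation check.
  mutationCheck: (fnName: string): string =>
    `Mutate ${fnName} in 3 plausible ways (flip a comparison, remove a guard, ` +
    `swap a parameter). For each mutation, say which existing test fails, or that none does.`,
};

console.log(prompts.edgeCaseList("parsePort"));
```

Cases collected in the ai-test-misses/ log can be appended to the `edgeCaseList` wording over time.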
Recommended workflow
Pick function → ask AI for 10 edge cases (no tests yet) → human adds 2 to 4 domain cases → AI writes one test per case → run → fix failures (test or function) → mutation check via 3 hand-injected bugs → commit tests separately from fixes.
Common mistakes
- Asking for tests directly without the edge-case list step. The tests will be shallow and mirror the implementation.
- Not running the tests. AI sometimes writes a test whose assertion is wrong but whose intent is right.
- Generating tests and fixing failures in the same prompt. AI often “fixes” by editing the test to pass instead of fixing the function.
- Treating 100% coverage as the goal. Coverage without strong assertions catches no bugs.
- Forgetting fake timers and mocks for time- or network-dependent code. Tests pass locally, fail in CI.
- Letting AI move existing assertions into helper functions during test generation. Suddenly diffs are unreadable.
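The time-dependence mistake is easiest to avoid when the test fully controls the clock. The sketch below uses a hand-rolled injected clock rather than a framework's fake timers, so it is a swapped-in technique, not the Vitest API; Vitest's `vi.useFakeTimers()` and `vi.setSystemTime()` serve the same purpose globally. `TtlCache` and `Clock` are hypothetical names.

```typescript
// Hypothetical cache; the clock is injected so tests never depend on real time.
type Clock = () => number; // milliseconds since epoch

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: Clock;

  constructor(ttlMs: number, now: Clock = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired entries are evicted on read
      return undefined;
    }
    return entry.value;
  }
}

// A fake clock that only moves when the test says so.
let fakeNow = 1000;
const cache = new TtlCache<string>(500, () => fakeNow);

cache.set("token", "abc");
fakeNow += 499;
const beforeExpiry = cache.get("token"); // 1 ms inside the TTL
fakeNow += 1;
const atExpiry = cache.get("token"); // boundary: exactly ttlMs after set
```

The boundary read at exactly `ttlMs` is the kind of case that behaves differently on a slow CI machine when real time is used.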
Advanced tips
- For functions with branches, prompt explicitly: write one test per branch (true and false for each conditional, every switch case, every error path).
- For async or time-dependent code, instruct AI to set up fake timers explicitly. Show one example in CLAUDE.md.
- Keep a “test patterns” file describing conventions: where fixtures live, naming, mock vs real, allowed external dependencies. AI follows it within the same context.
- For parsing functions, include a malformed-input section in the edge-case list explicitly. AI under-tests garbage inputs.
- For database-touching code, ask AI for tests using an in-memory or transaction-rollback pattern. Otherwise it suggests broad mocks that prove nothing.
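As an illustration of the one-test-per-branch tip, the cases can be laid out as a table with one labeled row per branch, which makes a missing branch easy to spot in review. Everything here is hypothetical: `seatLimit`, the tiers, and the limits are assumptions chosen to give a switch, a conditional, and an error path.

```typescript
// Hypothetical function with several branches; tiers and limits are assumptions.
type Tier = "free" | "pro" | "enterprise";

function seatLimit(tier: Tier, trialActive: boolean): number {
  switch (tier) {
    case "free":
      return trialActive ? 5 : 1; // conditional: needs a true row and a false row
    case "pro":
      return 25;
    case "enterprise":
      return Infinity;
    default:
      throw new Error("unknown tier: " + String(tier)); // error path
  }
}

// One row per branch. "throws" marks the error path.
type BranchCase = {
  branch: string;
  tier: Tier;
  trialActive: boolean;
  expected: number | "throws";
};

const branchCases: BranchCase[] = [
  { branch: "free, trial true", tier: "free", trialActive: true, expected: 5 },
  { branch: "free, trial false", tier: "free", trialActive: false, expected: 1 },
  { branch: "pro", tier: "pro", trialActive: false, expected: 25 },
  { branch: "enterprise", tier: "enterprise", trialActive: false, expected: Infinity },
  // The cast simulates bad data arriving at runtime despite the type.
  { branch: "default (error path)", tier: "legacy" as unknown as Tier, trialActive: false, expected: "throws" },
];

// Names of branches whose observed behavior differs from the expected value.
const failingBranches = branchCases
  .filter((c) => {
    let actual: number | "throws";
    try {
      actual = seatLimit(c.tier, c.trialActive);
    } catch {
      actual = "throws";
    }
    return actual !== c.expected;
  })
  .map((c) => c.branch);

console.log("failing branches:", failingBranches);
```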
Output checklist
- Edge-case list reviewed by a human before tests are written.
- Tests run locally and pass.
- Mutation check done: each of the 3 hand-injected bugs makes at least one test fail.
- No test that always passes regardless of behavior.
- Tests and fixes committed separately so regressions trace cleanly.
FAQ
- Should I use AI for unit or integration tests?: Unit tests are the safest. Integration tests need real environment setup that AI often gets wrong (wrong port, wrong fixture path, wrong cleanup).
- Will AI write better tests than me?: It generates more cases, faster, and surfaces edge cases you forget. Final quality still depends on your review.
- My coverage tool says 95%. Why does the workflow still find bugs?: Coverage measures lines executed, not behaviors verified. A test suite with no assertions can still hit 100% of lines.
- Can I generate tests for legacy code with no documentation?: Yes. Have AI read the function and produce a contract first. Verify the contract is right before generating tests against it.
- How long does this take per function?: 15 to 30 minutes for non-trivial pure functions, longer for async or stateful code. Compared to letting bugs ship and triaging in prod, it pays back the first time it catches a real regression.
- Should I run mutation testing automatically?: Stryker (JS) and PIT (Java) exist for this. For most teams, manual mutation in the prompt is enough; automation is overkill unless test quality is a measured KPI.