Codex Test Suggestions Are Too Generic

"Test happy path and error path" — useless in 2026. Force tests bound to the function's actual signature, edge cases, and bug history.

You asked Codex to suggest tests for parseInvoice(input: InvoiceRaw): Invoice. It returned: “test happy path, test error path, test edge cases, test boundary conditions, test invalid input.” None of these reference the actual function signature. None mention the real shape of InvoiceRaw or what makes an invoice “invalid” in your domain. The same suggestions would apply to literally any function.

Generic test suggestions come from generic prompts. The fix is to ground the request in the real signature, the real existing test conventions, and — most importantly — the actual bug history of the area. A test that catches a bug you’ve already had is worth ten tests for hypothetical edge cases.

Common causes

Ordered by hit rate, highest first.

1. Prompt didn’t name the function under test

“Suggest tests for this file” with a 200-line file → Codex hedges by listing tests that apply to all functions in the file. Result: generic patterns, not function-specific cases.

How to spot it: Re-read your prompt. If the function under test isn’t identified by name and file path, Codex worked from the file as a whole.

2. Codex skimmed the function, didn’t read the type definitions

The function takes InvoiceRaw. Codex never opened InvoiceRaw’s definition (in types/invoice.ts), so it doesn’t know which fields are optional, what enums exist, or what the validation contract is.

How to spot it: Suggested tests treat all inputs as generic objects. No mention of specific fields, enum values, or constraints from the type.

3. Codex didn’t check existing tests

Existing tests cover happy path. Codex suggests “add a happy path test” — duplicating what’s there. Or it uses jest.mock when your existing tests use vi.mock.

How to spot it: Diff Codex’s suggestions against the existing test file. If roughly half or more are duplicates or use the wrong framework, the prompt was missing a “read existing tests first” instruction.

4. No bug history fed in

Codex doesn’t know that you’ve had three bugs in this area: a timezone bug, a leap-day bug, and a unicode normalization bug. Suggestions don’t include regression tests for any of them.

How to spot it: Compare suggested tests against your closed-bug history for the file (for example, git log --grep="fix" -- src/parsers/invoice.ts). If known-bug shapes aren’t covered, you didn’t feed them in.

5. Mock setup is wrong because Codex guessed dependencies

Codex suggested mocking axios — but your code uses ky. Or it suggested mocking the database — but your tests use a real test DB. The mock setup is shaped right but pointed at the wrong target.

How to spot it: Imports in suggested tests don’t match imports in real tests in the same area.

6. Coverage thinking dominated over usefulness

“Add a test for line 47” because line 47 is uncovered, but line 47 is if (debug) console.log(...). The test adds zero value while bumping coverage %. Codex over-indexed on the coverage metric rather than on bug-catching potential.

How to spot it: Suggested tests target trivial branches (logging, fallback strings). Coverage chasing, not real risk reduction.

Shortest path to fix

Ordered by ROI. Steps 1 and 2 together turn generic suggestions into actionable tests.

Step 1: Anchor to function + types + existing tests

Use this template:

Suggest tests for `parseInvoice(input: InvoiceRaw): Invoice` in `src/parsers/invoice.ts`.

Before suggesting:
1. Read `src/parsers/invoice.ts` and quote the function signature.
2. Read `src/types/invoice.ts` and quote `InvoiceRaw` and `Invoice` types.
3. Read `src/parsers/invoice.test.ts` and summarize what's already covered.

Then suggest 5 NEW tests (no duplicates of existing coverage):
- Each test names the specific input shape and expected output.
- Use the actual types — no generic `any` or fake field names.
- Match the existing file's style (vitest, `assert.deepStrictEqual`).

The “read first” steps force grounding.
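As a rough illustration of what a grounded suggestion looks like once it is turned into code, the sketch below uses hypothetical `InvoiceRaw` and `Invoice` shapes and a stand-in `parseInvoice`; none of these field names come from a real codebase. The point is that each case names a concrete input shape and the exact expected output, including the optional field that a generic “happy path” test would never mention.

```typescript
// Hypothetical types: stand-ins for whatever src/types/invoice.ts really defines.
type Currency = "USD" | "EUR";

interface LineItem {
  sku: string;
  amount: number; // minor units (cents), assumed for this sketch
}

interface InvoiceRaw {
  id: string;
  currency: Currency;
  lineItems: LineItem[];
  memo?: string; // optional: a grounded test covers its absence
}

interface Invoice {
  id: string;
  currency: Currency;
  total: number;
  memo: string | null;
}

// Minimal stand-in implementation so the cases below can run.
function parseInvoice(input: InvoiceRaw): Invoice {
  const total = input.lineItems.reduce((sum, item) => sum + item.amount, 0);
  const memo = input.memo === undefined ? null : input.memo;
  return { id: input.id, currency: input.currency, total, memo };
}

// Each case names a concrete input shape and the exact expected output.
interface GroundedCase {
  title: string;
  input: InvoiceRaw;
  expected: Invoice;
}

const groundedCases: GroundedCase[] = [
  {
    title: "omitted optional memo is parsed as null",
    input: { id: "INV-1", currency: "EUR", lineItems: [{ sku: "A", amount: 1000 }] },
    expected: { id: "INV-1", currency: "EUR", total: 1000, memo: null },
  },
  {
    title: "USD line items are summed into total",
    input: {
      id: "INV-2",
      currency: "USD",
      memo: "net 30",
      lineItems: [{ sku: "A", amount: 500 }, { sku: "B", amount: 700 }],
    },
    expected: { id: "INV-2", currency: "USD", total: 1200, memo: "net 30" },
  },
];

function sameInvoice(a: Invoice, b: Invoice): boolean {
  return a.id === b.id && a.currency === b.currency && a.total === b.total && a.memo === b.memo;
}

// Returns the titles of failing cases; an empty array means every case passed.
function failingGroundedCases(): string[] {
  return groundedCases
    .filter((c) => !sameInvoice(parseInvoice(c.input), c.expected))
    .map((c) => c.title);
}
```

In a real suite each entry would become one test in the existing test file, written in whatever assertion style that file already uses.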

Step 2: Feed in bug history

Past bugs in this area (from `git log --grep="fix.*invoice"`):
- 2026-01: leap-day not parsed when format was "2024-02-29"
- 2026-03: unicode normalization (NFC vs NFD) caused field-name mismatch
- 2026-04: empty array `lineItems: []` returned NaN total instead of 0

Suggest a regression test for each. Place them in a `describe("regression", ...)` block.

A regression test for a known bug is worth more than 10 hypothetical edge cases.
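To make the three bug shapes above concrete, here is a minimal sketch of what the regression checks could look like. The types, the `fields` map, the `reference` lookup, and the parser itself are all assumptions made for illustration; only the three bug descriptions come from the list above. It uses plain functions so it runs without a test runner.

```typescript
// Hypothetical shapes: the real InvoiceRaw and Invoice will differ.
interface LineItem {
  amount: number;
}

interface InvoiceRaw {
  issuedOn: string; // "YYYY-MM-DD"
  fields: Record<string, string>; // free-form keys that may arrive as NFC or NFD
  lineItems: LineItem[];
}

interface Invoice {
  issuedOn: Date;
  reference: string | undefined;
  total: number;
}

// Looks up a key after normalizing both sides to NFC.
function getField(fields: Record<string, string>, key: string): string | undefined {
  const wanted = key.normalize("NFC");
  const hit = Object.keys(fields).find((k) => k.normalize("NFC") === wanted);
  return hit === undefined ? undefined : fields[hit];
}

// Stand-in parser, written only so the regression checks can run.
function parseInvoice(input: InvoiceRaw): Invoice {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(input.issuedOn);
  if (match === null) throw new Error("invalid date: " + input.issuedOn);
  const issuedOn = new Date(Date.UTC(Number(match[1]), Number(match[2]) - 1, Number(match[3])));
  const total = input.lineItems.reduce((sum, item) => sum + item.amount, 0);
  return { issuedOn, reference: getField(input.fields, "r\u00e9f\u00e9rence"), total };
}

const baseRaw: InvoiceRaw = { issuedOn: "2025-06-01", fields: {}, lineItems: [{ amount: 10 }] };

// One entry per past bug; each check returns true while the bug stays fixed.
const regressionChecks: { bug: string; check: () => boolean }[] = [
  {
    bug: "2026-01 leap-day 2024-02-29 not parsed",
    check: () =>
      parseInvoice({ ...baseRaw, issuedOn: "2024-02-29" }).issuedOn.toISOString().slice(0, 10) ===
      "2024-02-29",
  },
  {
    bug: "2026-03 NFD field name did not match NFC lookup",
    // "re\u0301fe\u0301rence" is the decomposed (NFD) spelling of the NFC key used by the parser.
    check: () =>
      parseInvoice({ ...baseRaw, fields: { "re\u0301fe\u0301rence": "PO-7" } }).reference === "PO-7",
  },
  {
    bug: "2026-04 empty lineItems returned NaN total",
    check: () => parseInvoice({ ...baseRaw, lineItems: [] }).total === 0,
  },
];

// Returns the bugs that have regressed; an empty array means none.
function regressedBugs(): string[] {
  return regressionChecks.filter((r) => !r.check()).map((r) => r.bug);
}
```

In the real test file, each entry would be one test inside the `describe("regression", ...)` block, with the bug date kept in the test name so the history stays searchable.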

Step 3: Ask for adversarial inputs

What's the worst input that `parseInvoice` could receive and still be considered valid?
- Longest possible field
- Empty arrays/strings
- Unicode surrogate pairs
- Numbers near JS `Number` boundary (Infinity, MAX_SAFE_INTEGER)
- Mismatched currency / locale combos

For each, write a test case showing the input + expected behavior.
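A table of adversarial inputs is one way to capture the answers. The sketch below assumes hypothetical validity rules (non-empty `id`, amounts as safe integers in minor units, line-item currency equal to invoice currency); the real domain rules will differ, and `validateInvoiceRaw` is a stand-in, not part of any existing code.

```typescript
// Hypothetical shapes and validity rules; the real domain rules will differ.
type Currency = "USD" | "EUR";

interface LineItem {
  description: string;
  amount: number; // minor units (cents), assumed
  currency: Currency;
}

interface InvoiceRaw {
  id: string;
  currency: Currency;
  lineItems: LineItem[];
}

// Stand-in validator: returns a list of problems, empty when the input is valid.
function validateInvoiceRaw(input: InvoiceRaw): string[] {
  const problems: string[] = [];
  if (input.id.length === 0) problems.push("empty id");
  for (const item of input.lineItems) {
    if (!Number.isSafeInteger(item.amount)) problems.push("unsafe amount");
    if (item.currency !== input.currency) problems.push("currency mismatch");
  }
  return problems;
}

const makeItem = (description: string, amount: number, currency: Currency): LineItem => ({
  description,
  amount,
  currency,
});

const usdInvoice = (id: string, lineItems: LineItem[]): InvoiceRaw => ({
  id,
  currency: "USD",
  lineItems,
});

interface AdversarialCase {
  label: string;
  input: InvoiceRaw;
  expected: string[]; // expected problems; [] means "still valid"
}

const adversarialCases: AdversarialCase[] = [
  { label: "very long description", input: usdInvoice("INV-1", [makeItem("x".repeat(10000), 100, "USD")]), expected: [] },
  { label: "empty lineItems", input: usdInvoice("INV-1", []), expected: [] },
  { label: "empty id", input: usdInvoice("", []), expected: ["empty id"] },
  { label: "surrogate pair in description", input: usdInvoice("INV-1", [makeItem("\uD83D\uDE00", 100, "USD")]), expected: [] },
  { label: "Infinity amount", input: usdInvoice("INV-1", [makeItem("widget", Infinity, "USD")]), expected: ["unsafe amount"] },
  { label: "MAX_SAFE_INTEGER amount", input: usdInvoice("INV-1", [makeItem("widget", Number.MAX_SAFE_INTEGER, "USD")]), expected: [] },
  { label: "amount just past MAX_SAFE_INTEGER", input: usdInvoice("INV-1", [makeItem("widget", Number.MAX_SAFE_INTEGER + 1, "USD")]), expected: ["unsafe amount"] },
  { label: "currency mismatch", input: usdInvoice("INV-1", [makeItem("widget", 100, "EUR")]), expected: ["currency mismatch"] },
];

// Returns labels of cases whose actual problems differ from the expected ones.
function failingAdversarialCases(): string[] {
  return adversarialCases
    .filter((c) => validateInvoiceRaw(c.input).join("|") !== c.expected.join("|"))
    .map((c) => c.label);
}
```

Writing down the expected column forces a decision about each input: the table records whether an empty `lineItems` array is valid, rather than leaving it to whatever the implementation happens to do.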

Step 4: For property-based testing, ask explicitly

Suggest 3 property-based tests using `fast-check`:
- Property 1 (round-trip): for any valid Invoice x, parseInvoice(serialize(x)) deep-equals x.
- Property 2 (sum invariant): total equals the sum of lineItems[].amount.
- Property 3 (currency consistency): every line item's currency matches the invoice's currency.

Generate the `fast-check` setup and assertions for each.
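The sketch below shows the shape of these three properties without depending on `fast-check`: a small seeded generator stands in for the library's arbitraries, and `findCounterexample` stands in for its runner. With `fast-check` installed, the loop would typically be replaced by `fc.assert(fc.property(arbitrary, predicate))`. The types, `parseInvoice`, and `serializeInvoice` are hypothetical stand-ins.

```typescript
// Hypothetical shapes; amounts are integer minor units so sums stay exact.
type Currency = "USD" | "EUR";
interface LineItem { amount: number; currency: Currency }
interface InvoiceRaw { id: string; currency: Currency; lineItems: LineItem[] }
interface Invoice extends InvoiceRaw { total: number }

// Stand-in implementations, assumed for this sketch.
function parseInvoice(raw: InvoiceRaw): Invoice {
  const lineItems = raw.lineItems.map((item) => ({ ...item }));
  const total = lineItems.reduce((sum, item) => sum + item.amount, 0);
  return { id: raw.id, currency: raw.currency, lineItems, total };
}

function serializeInvoice(invoice: Invoice): InvoiceRaw {
  return { id: invoice.id, currency: invoice.currency, lineItems: invoice.lineItems };
}

// Seeded linear congruential generator so any failure is reproducible.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Generates a valid InvoiceRaw: 0 to 5 line items, all in the invoice's currency.
function randomInvoiceRaw(rng: () => number): InvoiceRaw {
  const currency: Currency = rng() < 0.5 ? "USD" : "EUR";
  const count = Math.floor(rng() * 6);
  const lineItems: LineItem[] = [];
  for (let i = 0; i < count; i++) {
    lineItems.push({ amount: Math.floor(rng() * 100000), currency });
  }
  return { id: "INV-" + Math.floor(rng() * 1000), currency, lineItems };
}

// Runs the predicate on `runs` generated inputs; returns the first counterexample or null.
function findCounterexample(
  predicate: (raw: InvoiceRaw) => boolean,
  runs = 200,
  seed = 42,
): InvoiceRaw | null {
  const rng = makeRng(seed);
  for (let i = 0; i < runs; i++) {
    const sample = randomInvoiceRaw(rng);
    if (!predicate(sample)) return sample;
  }
  return null;
}

// Property 1: round-trip. Parsing a serialized Invoice yields an equal Invoice.
const roundTrips = (raw: InvoiceRaw): boolean => {
  const x = parseInvoice(raw);
  return JSON.stringify(parseInvoice(serializeInvoice(x))) === JSON.stringify(x);
};

// Property 2: sum invariant. total equals the sum of line item amounts.
const totalIsSum = (raw: InvoiceRaw): boolean =>
  parseInvoice(raw).total === raw.lineItems.reduce((sum, item) => sum + item.amount, 0);

// Property 3: currency consistency. Every line item matches the invoice currency.
const currencyConsistent = (raw: InvoiceRaw): boolean => {
  const invoice = parseInvoice(raw);
  return invoice.lineItems.every((item) => item.currency === invoice.currency);
};
```

A call such as `findCounterexample(totalIsSum)` returns `null` when the property held for every generated input. The comparison in Property 1 goes through `JSON.stringify` because `===` on two objects checks identity, not structure.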

Step 5: Reject suggestions that don’t compile

After Codex suggests tests, drop them into the test file and run:

pnpm vitest run src/parsers/invoice.test.ts --reporter=verbose

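Vitest strips TypeScript types without checking them, so type errors only surface if you also run a type check (for example, pnpm exec tsc --noEmit). If any test references non-existent fields or uses the wrong types, reject it: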

Test `parseInvoice handles tax: undefined` references `input.tax` — but the type is `taxes: TaxLine[]` (plural). Re-write to match the actual type.

This trains the session to ground in real types.
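The mismatch in that rejection message is exactly the kind of error a type check catches mechanically. The sketch below uses hypothetical `TaxLine` and `InvoiceRaw` definitions that mirror the field names in the message; the `@ts-expect-error` directive marks the line that would otherwise fail when the file is type-checked (for example with `tsc --noEmit`).

```typescript
// Hypothetical types mirroring the field names in the rejection message.
interface TaxLine {
  rate: number;
  amount: number;
}

interface InvoiceRaw {
  id: string;
  taxes: TaxLine[]; // plural: there is no `tax` field
}

const sampleRaw: InvoiceRaw = { id: "INV-1", taxes: [] };

// A suggested test that reads `tax` only passes a type check because of the
// directive below, which tells the compiler an error is expected on the next line.
// @ts-expect-error Property 'tax' does not exist on type 'InvoiceRaw'.
const wrongField = sampleRaw.tax;

// The grounded version uses the field that actually exists.
const taxTotal = sampleRaw.taxes.reduce((sum, line) => sum + line.amount, 0);
```

Without the directive, the access to `sampleRaw.tax` is reported as a compile-time error, which is the signal to send the suggestion back.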

Step 6: For new functions, ask Codex to write the test first

For greenfield functions:

You're about to implement `parseInvoice`. BEFORE the implementation:
1. Read the type definitions for InvoiceRaw and Invoice.
2. Write 6 tests covering: happy path, each enum branch, boundaries, unicode, leap-year edge, error throw.
3. Tests should fail at this point (no implementation yet); that's expected.
4. Then implement the function to make all tests pass.

Test-first with Codex produces tighter tests than test-after.

Prevention

  • Every test-suggestion prompt anchors to function signature + types + existing test file
  • Maintain a per-area “known bug shapes” doc; reference in test prompts so regression tests don’t get re-forgotten
  • Demand suggestions reference actual fields/types — reject anything with generic any or fake fields
  • Run suggested tests immediately; reject any that don’t compile against the real signature
  • For new code, test-first with Codex gives tighter coverage than test-after
  • Coverage % is not the goal — bugs caught per test is. Drop tests that only chase coverage

Tags: #Codex #Coding agent #Troubleshooting #Debug #Generic tests