Codex Test Suggestions Are Too Generic

"Test the happy path and error path" is useless filler. Force Codex to bind tests to the real function signature, the actual types, edge cases, and bug history.

Published: May 17, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

You asked Codex to suggest tests for parseInvoice(input: InvoiceRaw): Invoice. It returned: “test the happy path, test the error path, test edge cases, test boundary conditions, test invalid input.” None of these reference the actual function signature. None mention the real shape of InvoiceRaw or what makes an invoice “invalid” in your domain. The same five bullets would apply to literally any function in any codebase.

Fastest fix: generic test suggestions come from generic prompts. In one prompt, identify the function by name and file path, tell Codex to read the type file and the existing test file first, then ask for NEW tests bound to the real types. That single change turns boilerplate into runnable tests. Everything below makes those tests sharper: bug-history regression tests, adversarial inputs, and a check that they actually compile.

A test that catches a bug you have already had is worth ten tests for hypothetical edge cases. Grounding beats volume.

Which bucket are you in?

Match the smell to the cause, then jump to the fix.

| What the suggestions look like | Most likely cause | Fix |
| --- | --- | --- |
| Five generic bullets, no field names | Prompt named a file, not a function | Step 1 |
| Inputs treated as plain objects, no enum/field detail | Codex never read the type definition | Step 1 (point 2) |
| Duplicates of tests you already have; wrong test framework | Codex never read the existing test file | Step 1 (point 3) |
| No regression test for bugs you have actually shipped | No bug history fed in | Step 2 |
| Mocks point at the wrong library (`axios` vs `ky`) | Codex guessed dependencies | Step 1 + reject loop |
| Tests target trivial lines (logging, fallbacks) | Coverage chasing, not risk reduction | Step 3 |

Common causes

Ordered by hit rate, highest first.

1. The prompt named a file, not the function under test

“Suggest tests for this file” pointed at a 200-line file makes Codex hedge: it lists tests that apply to every function in the file at once. Result: generic patterns, not function-specific cases.

How to spot it: re-read your prompt. If the function being tested is not named with both its name and its file path, Codex worked from the file as a whole.

2. Codex skimmed the function and never read the type definitions

The function takes InvoiceRaw. Codex never opened that type’s definition (in src/types/invoice.ts), so it does not know which fields are optional, what enums exist, or what the validation contract is.

How to spot it: suggested tests treat every input as a generic object. No mention of specific fields, enum values, or constraints from the type.

3. Codex did not check the existing tests

Existing tests already cover the happy path. Codex suggests “add a happy path test,” duplicating what is there. Or it writes jest.mock(...) when your existing suite uses vi.mock(...).

How to spot it: diff Codex’s suggestions against the existing test file. If half are duplicates or use the wrong framework, the prompt did not say “read the existing tests first.”
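Part of that diff can be automated. The sketch below is a hypothetical helper (none of these functions belong to Vitest or Codex): it pulls `it(...)` and `test(...)` titles out of two source strings with a regular expression, reports the overlap, and flags `jest.*` calls. It assumes titles are plain string literals without escaped quotes, and it compares titles only, so duplicates that are worded differently still need a human read.

```typescript
import * as assert from "node:assert/strict";

// Hypothetical helper: collect the titles of it(...) / test(...) calls.
// Assumes each title is a plain string literal with no escaped quotes.
function extractTestTitles(source: string): string[] {
  const pattern = /\b(?:it|test)\s*\(\s*(["'`])(.*?)\1/g;
  const titles: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    titles.push(match[2].trim().toLowerCase());
  }
  return titles;
}

// Titles in the suggestions that already exist in the current test file.
function findDuplicateTitles(existing: string, suggested: string): string[] {
  const seen = new Set(extractTestTitles(existing));
  return extractTestTitles(suggested).filter((title) => seen.has(title));
}

// True when the suggestions call jest.* although the suite runs on Vitest.
function usesJestApi(suggested: string): boolean {
  return /\bjest\.\w+\s*\(/.test(suggested);
}

const existingTests = `
  it("parses a valid invoice", () => {});
  it("rejects an unknown currency", () => {});
`;
const suggestedTests = `
  test('Parses a valid invoice', () => {});
  test('treats empty lineItems as total 0', () => {});
  jest.mock("axios");
`;

assert.deepStrictEqual(findDuplicateTitles(existingTests, suggestedTests), [
  "parses a valid invoice",
]);
assert.equal(usesJestApi(suggestedTests), true);
```

A non-empty duplicate list or a `true` from `usesJestApi` is a quick signal that the prompt skipped the "read the existing tests first" step.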

4. No bug history was fed in

Codex does not know you have had three bugs in this area: a timezone bug, a leap-day bug, and a Unicode-normalization bug. The suggestions include zero regression tests for any of them.

How to spot it: compare the suggested tests against your closed-bug history. Run `git log --grep="fix" -- src/parsers/invoice.ts` to list the fix commits that touched the file. If known-bug shapes are not covered, you did not feed them in.

5. Mock setup is wrong because Codex guessed the dependencies

Codex suggested mocking axios, but your code uses ky. Or it suggested mocking the database, but your tests run against a real test DB. The mock setup is shaped correctly and pointed at the wrong target.

How to spot it: imports in the suggested tests do not match the imports in real tests in the same directory.

6. Coverage thinking dominated over usefulness

Codex suggests “add a test for line 47” because line 47 is uncovered, but line 47 is `if (debug) console.log(...)`. The test adds zero value while bumping the coverage percentage. Codex over-indexed on the coverage metric instead of bug-catching potential.

How to spot it: suggested tests target trivial branches (logging, fallback strings). That is coverage chasing, not real risk reduction.

Shortest path to fix

Ordered by ROI. Steps 1 and 2 together turn generic suggestions into actionable tests.

Step 1: Anchor to function + types + existing tests

Use this template:

Suggest tests for `parseInvoice(input: InvoiceRaw): Invoice` in `src/parsers/invoice.ts`.

Before suggesting anything:
1. Read `src/parsers/invoice.ts` and quote the function signature back to me.
2. Read `src/types/invoice.ts` and quote the `InvoiceRaw` and `Invoice` types.
3. Read `src/parsers/invoice.test.ts` and summarize what is already covered.

Then suggest 5 NEW tests (no duplicates of existing coverage):
- Each test names the specific input shape and the expected output.
- Use the actual types: no generic `any`, no invented field names.
- Match the existing file's style (vitest, `assert.deepStrictEqual`).

The “read first” steps force grounding. Codex’s default mode reads files before acting, but naming the exact paths removes the guesswork about which files matter.
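To make "bound to the real types" concrete, here is a minimal sketch of what a grounded suggestion might look like. Everything in it is an assumption for illustration: the `Currency` union, the optional `taxes` field, and the stand-in `parseInvoice` are hypothetical, not the contents of the real `src/types/invoice.ts` or `src/parsers/invoice.ts`. The checks use `node:assert/strict` directly; in a Vitest suite each would sit inside its own `it(...)` block.

```typescript
import * as assert from "node:assert/strict";

// Hypothetical types standing in for src/types/invoice.ts.
type Currency = "USD" | "EUR";
interface TaxLine { label: string; amount: number }
interface LineItem { description: string; amount: number }
interface InvoiceRaw {
  id: string;
  currency: string; // checked against Currency at parse time
  lineItems: LineItem[];
  taxes?: TaxLine[]; // optional in the raw payload
}
interface Invoice {
  id: string;
  currency: Currency;
  lineItems: LineItem[];
  taxes: TaxLine[];
  total: number;
}

function isCurrency(value: string): value is Currency {
  return value === "USD" || value === "EUR";
}

// Minimal stand-in so the checks below run; not the real parser.
function parseInvoice(input: InvoiceRaw): Invoice {
  const currency = input.currency;
  if (!isCurrency(currency)) {
    throw new RangeError(`unsupported currency: ${currency}`);
  }
  const taxes = input.taxes ?? [];
  const total = [...input.lineItems, ...taxes].reduce((sum, line) => sum + line.amount, 0);
  return { id: input.id, currency, lineItems: input.lineItems, taxes, total };
}

// Grounded check 1: a `currency` value outside the Currency union must throw.
assert.throws(
  () => parseInvoice({ id: "inv-1", currency: "GBP", lineItems: [] }),
  RangeError,
);

// Grounded check 2: omitting the optional `taxes` field yields [] and a
// total computed from lineItems alone.
const withoutTaxes = parseInvoice({
  id: "inv-2",
  currency: "EUR",
  lineItems: [{ description: "Hosting", amount: 40 }],
});
assert.deepStrictEqual(withoutTaxes.taxes, []);
assert.equal(withoutTaxes.total, 40);
```

Contrast this with "test invalid input": each check names a field, a specific value, and the expected result, which is only possible once the type file has been read.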

Step 2: Feed in the bug history

Past bugs in this area (from `git log --grep="fix" -- src/parsers/invoice.ts`):
- 2026-01: leap-day not parsed when the format was "2024-02-29"
- 2026-03: Unicode normalization (NFC vs NFD) caused a field-name mismatch
- 2026-04: empty array `lineItems: []` returned a NaN total instead of 0

Suggest one regression test for each. Place them in a `describe("regression", ...)` block.

A regression test for a known bug is worth more than ten hypothetical edge cases, because it guards a failure mode this code has already exhibited.
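As an illustration of the three regression tests that prompt asks for, the sketch below pairs each listed bug with one check. The `InvoiceRaw` shape, the `reference` field, and the stand-in `parseInvoice` are hypothetical and exist only so the example runs; the real parser and field names will differ.

```typescript
import * as assert from "node:assert/strict";

// Hypothetical raw shape; the field names are illustrative only.
interface InvoiceRaw {
  issuedOn: string; // "YYYY-MM-DD"
  fields: Record<string, string>; // keys may arrive in NFC or NFD form
  lineItems: { amount: number }[];
}
interface Invoice {
  issuedOn: Date;
  reference: string | undefined;
  total: number;
}

const REFERENCE_KEY = "r\u00e9f\u00e9rence"; // accented key in NFC form

// Minimal stand-in with the three fixes applied; not the real parser.
function parseInvoice(input: InvoiceRaw): Invoice {
  const [year, month, day] = input.issuedOn.split("-").map(Number);
  const issuedOn = new Date(Date.UTC(year, month - 1, day));
  // Date.UTC rolls invalid days forward (2023-02-29 becomes March 1), so
  // compare the parts to reject such input while accepting real leap days.
  if (issuedOn.getUTCMonth() !== month - 1 || issuedOn.getUTCDate() !== day) {
    throw new RangeError(`invalid date: ${input.issuedOn}`);
  }
  // Normalize keys to NFC so composed and decomposed spellings match.
  const fields = new Map(
    Object.entries(input.fields).map(
      ([key, value]): [string, string] => [key.normalize("NFC"), value],
    ),
  );
  // An explicit initial value makes the empty-array case return 0.
  const total = input.lineItems.reduce((sum, item) => sum + item.amount, 0);
  return { issuedOn, reference: fields.get(REFERENCE_KEY), total };
}

// Regression checks, one per past bug. In the real suite these would sit
// inside the describe("regression", ...) block.

// 2026-01: leap day "2024-02-29" must parse.
const leapDay = parseInvoice({ issuedOn: "2024-02-29", fields: {}, lineItems: [] });
assert.equal(leapDay.issuedOn.toISOString(), "2024-02-29T00:00:00.000Z");

// 2026-03: an NFD-encoded key must match the NFC field name.
const nfdKey = "re\u0301fe\u0301rence"; // "e" followed by a combining acute accent
const decomposed = parseInvoice({
  issuedOn: "2024-03-01",
  fields: { [nfdKey]: "PO-7" },
  lineItems: [],
});
assert.equal(decomposed.reference, "PO-7");

// 2026-04: empty lineItems must total 0, not NaN.
assert.equal(leapDay.total, 0);
```

Each check carries the date of the bug it guards, which makes it easy to map a future failure back to the original fix commit.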

Step 3: Ask for adversarial inputs (and skip the trivial branches)

What is the worst input `parseInvoice` could receive and still be considered valid?
- Longest possible field value
- Empty arrays and empty strings
- Unicode surrogate pairs
- Numbers near the JS `Number` boundary (`Infinity`, `Number.MAX_SAFE_INTEGER`)
- Mismatched currency / locale combinations

For each, write a test case showing the input and the expected behavior.
Do NOT add tests for trivial branches like debug logging or fallback strings.

The last line directly kills cause 6 (coverage chasing).
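Adversarial inputs tend to fit a table-driven layout, where each row pairs one hostile input with the expected verdict. The sketch below assumes a hypothetical `invalidFields` validator, a made-up `amountCents` field, and an invented 10-character description limit; none of these come from the real invoice code.

```typescript
import * as assert from "node:assert/strict";

// Hypothetical line-item shape and limit, for illustration only.
interface LineItem { description: string; amountCents: number }
const MAX_DESCRIPTION_CHARS = 10;

// Returns the names of the fields that fail validation.
function invalidFields(item: LineItem): string[] {
  const problems: string[] = [];
  // Count code points, not UTF-16 units, so one emoji counts as one character.
  const chars = Array.from(item.description).length;
  if (chars === 0 || chars > MAX_DESCRIPTION_CHARS) problems.push("description");
  // Safe integers exclude Infinity, NaN, fractions, and values past 2^53 - 1.
  if (!Number.isSafeInteger(item.amountCents) || item.amountCents < 0) {
    problems.push("amountCents");
  }
  return problems;
}

const EMOJI = "\u{1F600}"; // one code point, two UTF-16 units (a surrogate pair)

// Each row: an adversarial input and the fields expected to be rejected.
const cases: { name: string; item: LineItem; expected: string[] }[] = [
  { name: "ten emoji (20 UTF-16 units)", item: { description: EMOJI.repeat(10), amountCents: 1 }, expected: [] },
  { name: "empty description", item: { description: "", amountCents: 1 }, expected: ["description"] },
  { name: "largest safe integer", item: { description: "max", amountCents: Number.MAX_SAFE_INTEGER }, expected: [] },
  { name: "one past the safe range", item: { description: "max+1", amountCents: Number.MAX_SAFE_INTEGER + 1 }, expected: ["amountCents"] },
  { name: "Infinity", item: { description: "inf", amountCents: Infinity }, expected: ["amountCents"] },
];

for (const { name, item, expected } of cases) {
  assert.deepStrictEqual(invalidFields(item), expected, name);
}
```

The surrogate-pair row is the kind of case a generic "test edge cases" bullet never produces: it only matters because the limit is counted in characters while `String.prototype.length` counts UTF-16 units.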

Step 4: For property-based tests, ask explicitly

Suggest 3 property-based tests using `fast-check`:
- Property 1 (round-trip): for any valid `Invoice` x, `parseInvoice(serialize(x))` deep-equals x, where `serialize` turns an `Invoice` back into an `InvoiceRaw`.
- Property 2 (sum invariant): `total` equals the sum of `amount` over `lineItems`.
- Property 3 (currency consistency): every line item matches the invoice's currency.

Generate the `fast-check` setup and assertions for each.
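For readers new to property-based testing, the sketch below shows the idea behind properties 2 and 3 without any dependency. It deliberately swaps `fast-check` for a hand-rolled generator and runner so the example stays self-contained; in a real suite, `fc.assert(fc.property(...))` would play the role of the hypothetical `findCounterexample`, with shrinking and better generators. The types and the stand-in `parseInvoice` are assumptions, not the real code.

```typescript
import * as assert from "node:assert/strict";

// Hypothetical types; amounts are integer cents so sums are exact.
type Currency = "USD" | "EUR";
interface LineItem { amount: number; currency: Currency }
interface InvoiceRaw { currency: Currency; lineItems: LineItem[] }
interface Invoice extends InvoiceRaw { total: number }

// Minimal stand-in for the parser under test.
function parseInvoice(input: InvoiceRaw): Invoice {
  const total = input.lineItems.reduce((sum, item) => sum + item.amount, 0);
  return { ...input, total };
}

// Small deterministic generator so a failure can be reproduced from the seed.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Generates a valid InvoiceRaw: 0 to 20 items, all in the invoice's currency.
function arbitraryInvoiceRaw(random: () => number): InvoiceRaw {
  const currency: Currency = random() < 0.5 ? "USD" : "EUR";
  const count = Math.floor(random() * 21);
  const lineItems = Array.from({ length: count }, (): LineItem => ({
    amount: Math.floor(random() * 100000),
    currency,
  }));
  return { currency, lineItems };
}

// Runs the property on `runs` generated inputs; returns the first counterexample.
function findCounterexample<T>(
  generate: (random: () => number) => T,
  property: (value: T) => boolean,
  runs = 200,
  seed = 42,
): T | undefined {
  const random = makeRandom(seed);
  for (let i = 0; i < runs; i++) {
    const value = generate(random);
    if (!property(value)) return value;
  }
  return undefined;
}

// Property 2: total equals the sum of the line-item amounts.
const sumInvariant = (raw: InvoiceRaw): boolean => {
  let expected = 0;
  for (const item of raw.lineItems) expected += item.amount;
  return parseInvoice(raw).total === expected;
};

// Property 3: every parsed line item carries the invoice's currency.
const currencyConsistent = (raw: InvoiceRaw): boolean => {
  const invoice = parseInvoice(raw);
  return invoice.lineItems.every((item) => item.currency === invoice.currency);
};

assert.equal(findCounterexample(arbitraryInvoiceRaw, sumInvariant), undefined);
assert.equal(findCounterexample(arbitraryInvoiceRaw, currencyConsistent), undefined);
```

The generator is where grounding matters most: it has to produce only inputs the real `InvoiceRaw` type allows, which is why the prompt should make Codex read the type file before writing arbitraries.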

Step 5: Reject suggestions that do not compile

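After Codex suggests tests, drop them into the test file and run them. Vitest transpiles TypeScript without type-checking it by default, so pair the run with a type check such as `pnpm tsc --noEmit` (assuming your tsconfig.json includes the test files). With Vitest: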

pnpm vitest run src/parsers/invoice.test.ts --reporter=verbose

If any test fails to compile (referencing non-existent fields or the wrong types), reject it by name:

Test `parseInvoice handles tax: undefined` references `input.tax`, but the type is `taxes: TaxLine[]` (plural). Re-write it to match the actual type.

Naming the exact mismatch puts the real type into the session context, so later suggestions are more likely to use it instead of guessing.
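The `tax` versus `taxes` mismatch is worth seeing in code, because a runner that strips types without checking them (Vitest's default behavior) will execute such a test without complaint. The sketch below uses hypothetical types and a `@ts-expect-error` directive to mark the line the type checker rejects.

```typescript
import * as assert from "node:assert/strict";

// Hypothetical types; `taxes` (plural) is the only tax field.
interface TaxLine { label: string; amount: number }
interface InvoiceRaw { id: string; taxes: TaxLine[] }

const input: InvoiceRaw = { id: "inv-1", taxes: [] };

// A suggestion that reads a non-existent field. The directive below records
// that the type checker rejects the next line.
// @ts-expect-error Property 'tax' does not exist on type 'InvoiceRaw'.
const guessedField = input.tax;

// At runtime the bad access is just `undefined`, so this assertion passes
// without testing anything about the parser.
assert.equal(guessedField, undefined);

// The corrected check uses the field the type actually declares.
assert.deepStrictEqual(input.taxes, []);
```

A green run is therefore not proof that the suggestion matches the types; the type check is what surfaces the invented field.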

Step 6: For new functions, ask Codex to write the test first

For greenfield functions:

You are about to implement `parseInvoice`. BEFORE the implementation:
1. Read the type definitions for `InvoiceRaw` and `Invoice`.
2. Write 6 tests covering: happy path, each enum branch, boundaries, unicode, the leap-year edge, and the error throw.
3. The tests should fail to compile (no implementation yet) — that is expected.
4. Then implement the function so all tests pass.

Test-first with Codex produces tighter tests than test-after, because the tests describe the contract before the implementation can bias them.
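One way to picture "the tests describe the contract before the implementation" is to write the checks against a function type and pass the implementation in later. The sketch below is an illustration under assumptions: the `Status` union, the trimmed-down types, and the `checkContract` helper are hypothetical, and a Vitest suite would express the same checks as `it(...)` blocks that import the not-yet-written function.

```typescript
import * as assert from "node:assert/strict";

// Hypothetical contract types, written before any implementation exists.
type Status = "draft" | "paid";
interface InvoiceRaw { id: string; status: string; lineItems: { amount: number }[] }
interface Invoice { id: string; status: Status; total: number }
type ParseInvoice = (input: InvoiceRaw) => Invoice;

// Contract checks written first. They take the implementation as a parameter,
// so they can be written and reviewed before parseInvoice exists.
function checkContract(parse: ParseInvoice): void {
  // Happy path.
  assert.deepStrictEqual(
    parse({ id: "inv-1", status: "paid", lineItems: [{ amount: 5 }, { amount: 7 }] }),
    { id: "inv-1", status: "paid", total: 12 },
  );
  // The other enum branch.
  assert.equal(parse({ id: "inv-2", status: "draft", lineItems: [] }).status, "draft");
  // Boundary: no line items totals 0.
  assert.equal(parse({ id: "inv-3", status: "draft", lineItems: [] }).total, 0);
  // Error throw: a status outside the union is rejected.
  assert.throws(() => parse({ id: "inv-4", status: "void", lineItems: [] }), RangeError);
}

function isStatus(value: string): value is Status {
  return value === "draft" || value === "paid";
}

// Written afterwards, to satisfy the contract above.
const parseInvoice: ParseInvoice = (input) => {
  const status = input.status;
  if (!isStatus(status)) {
    throw new RangeError(`unknown status: ${status}`);
  }
  const total = input.lineItems.reduce((sum, item) => sum + item.amount, 0);
  return { id: input.id, status, total };
};

checkContract(parseInvoice);
```

Because the checks were fixed before the implementation, an implementation that quietly accepts `"void"` fails them instead of getting a test written to match its behavior.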

Make it stick: AGENTS.md and a reusable prompt

The steps above fix one session. To stop re-typing “read the type file, match the existing framework” on every request, push the conventions down into config so every Codex run inherits them.

Put test conventions in AGENTS.md. As of June 2026, Codex reads AGENTS.md files before doing any work. It merges from your global file (~/.codex/AGENTS.md) down through the Git root to your current directory, with files closer to your working directory overriding earlier ones (discovery stops at the project_doc_max_bytes limit, 32 KiB by default). Add a short testing section:

## Testing conventions
- Test runner: Vitest. Run a single file with `pnpm vitest run <path>`.
- Assertions: `assert.deepStrictEqual` from `node:assert/strict`. Do not introduce Jest.
- HTTP client: `ky`. Mock `ky` itself (for example with `vi.mock("ky")`), never `axios`.
- Group regression tests in a `describe("regression", ...)` block.
- Never add tests for debug logging or pure fallback branches just to raise coverage.

Now “match the existing framework” and “skip coverage filler” are enforced without you saying them every time.

Save the test-suggestion prompt as a reusable skill. OpenAI deprecated ~/.codex/prompts custom prompts in favor of skills (reusable instructions Codex can invoke explicitly or implicitly and that ship in your repo). Turn the Step 1 + Step 2 template into a skill with a placeholder so you can run it against any function and have Codex pull the bug history itself. That makes the grounded prompt the default, not a thing you remember to type.

How to confirm it is fixed

You have fixed the genericness when all of these hold for the suggested tests:

- Every test names a concrete input and expected output drawn from the real type: no `any`, no invented fields.
- `pnpm vitest run <path>` runs the suggestions, and a type check (for example `pnpm tsc --noEmit`, since Vitest does not type-check by default) reports no type errors in them.
- At least one regression test maps to a real entry in `git log --grep="fix" -- <file>`.
- Zero duplicates of tests already in the existing file (diff to confirm).
- No test targets a trivial branch (logging, fallback strings) purely for coverage.

If any check fails, the prompt was still under-grounded — return to Step 1 and name what was missing.

Prevention

- Every test-suggestion prompt anchors to the function signature, the types, and the existing test file.
- Keep test conventions (runner, assertion style, mock library) in AGENTS.md so Codex inherits them without being told.
- Maintain a per-area “known bug shapes” note and reference it in test prompts so regression tests do not get re-forgotten.
- Demand that suggestions reference actual fields and types; reject anything with generic `any` or invented fields.
- Run suggested tests immediately and reject any that do not compile against the real signature.
- For new code, test-first with Codex gives tighter coverage than test-after.
- Coverage percentage is not the goal; bugs caught per test is. Drop tests that only chase coverage.

FAQ

Codex still lists generic tests even when I name the function. Why?

It probably did not read the supporting files. Make the read step explicit and demand evidence: “Quote the function signature and the InvoiceRaw type back to me before suggesting tests.” If Codex cannot quote them, it did not open them, and any test it writes is a guess.

Should I tell Codex to chase a coverage number?

No. A coverage target pushes Codex toward tests for trivial branches (logging, fallbacks) that add a percentage point and catch nothing. Ask for tests that guard real failure modes and known bugs instead; coverage will follow the useful ones.

Why does Codex keep using the wrong test framework or mock library?

It defaults to the most common library it has seen (often Jest and axios) when your prompt does not pin the choice. Put the runner and mock library in AGENTS.md, and the wrong-framework suggestions stop appearing.

Which model should Codex CLI use for this?

The default GPT-5.5 model in Codex CLI (the default since the April 23, 2026 release) handles grounded test generation well once the prompt names the files. The model is rarely the bottleneck here; under-specified prompts are.

Is test-first or test-after better with Codex?

Test-first. Writing the tests before the implementation forces Codex to commit to a contract, so the tests describe intended behavior instead of rubber-stamping whatever the implementation happens to do.

Tags: #Codex #Coding agent #Troubleshooting #Debug #Generic tests