Codex Code Review Quality Feels Shallow

"Consider error handling" / "add tests" — bullets that work on any PR. Fix with specific review questions tied to file:line.

You asked Codex to review a PR. The output: 8 bullets that read like a code-review template — “consider error handling,” “add tests for edge cases,” “is this thread-safe?”, “watch for null inputs.” Useful in 2018 when AI couldn’t read code; useless now. You can’t act on any of it, and you can’t tell whether Codex even opened the changed files.

A shallow review is almost always a vague prompt. “Review this PR” gets the model’s default summary mode. Specific questions, anchored to file:line, get specific answers. Below: the six prompt patterns that produce reviews worth reading.

Common causes

Ordered by hit rate, highest first.

1. The prompt was “review this PR” with no criteria

Generic prompt → generic review. Codex defaults to a template because nothing in the prompt tells it what to look for in this specific code.

How to spot it: Re-read the prompt. If it doesn’t name a dimension (security / perf / correctness / API contract) or a specific concern, you got a template.

2. Codex skimmed the diff, didn’t read whole files

The diff alone often lacks context. A 3-line change might be correct or catastrophic depending on the rest of the function. Without reading the whole file, the review can only comment on what’s visible in the diff hunk.

How to spot it: Review comments only reference lines that are in the diff, not surrounding context. Codex missed the bigger picture.

3. Diff was too large; Codex sampled

A 1,500-line PR exceeds context. Codex read the first few files in detail, then sampled or summarized the rest. Comments are dense on early files, vague on later ones.

How to spot it: Comment density per file is uneven — first 3 files get specific feedback, last 7 files get one generic comment each.

4. Codex used a “reviewer voice” trained for safety

Without anchoring, Codex falls back to non-committal phrasing: “consider”, “you might want to”, “watch out for”. These never claim a definite issue, so they’re un-actionable but un-rebuttable.

How to spot it: Count uses of “consider”, “might”, “could potentially”. High count = hedging mode, not real review.

5. No threat model / context provided

Codex doesn’t know the PR is for a payments path. It reviews like generic CRUD. Specific concerns (idempotency, race conditions, fraud surface) never surface because nothing told Codex this is high-stakes code.

How to spot it: Review doesn’t mention domain-specific risks even when they’re obviously relevant.

6. Review prompt asked for everything at once

“Review for correctness, security, performance, style, and accessibility” gets a thin pass at each — Codex tries to cover all dimensions and ends up shallow on all of them.

How to spot it: Review touches many dimensions but goes deep on none.

Shortest path to fix

Ordered by ROI. Step 1 alone produces 80% of the depth gain.

Step 1: Replace “review this PR” with focused questions

Use this template:

Review this PR for [specific concern].

For each issue:
- File:line (mandatory — no generic comments)
- One-sentence problem
- One-sentence proposed fix
- Severity (P0 ship-blocker / P1 before merge / P2 follow-up)

Do not comment on style, formatting, or anything caught by ESLint.
Do not suggest "consider X" — either flag X as an issue or don't mention it.

The “do not consider X” line is critical — it removes the hedge-mode escape hatch.

Step 2: Ask one question per review

Don’t bundle dimensions. Run separate review passes:

Pass 1: "Review for correctness. Are there any cases where this code produces a wrong result for valid input?"

Pass 2: "Review for null/undefined handling. List every place the new code dereferences a value without checking for null."

Pass 3: "Review for API contract changes. Does this PR change any public function signature or behavior?"

Each pass returns deeper, file-anchored feedback than one combined pass would.

Step 3: Provide context up front

Context: This PR is in a payments path. We process ~10K transactions/day.
We care about:
- Idempotency (same request twice should not double-charge)
- Race conditions (two parallel webhooks updating the same order)
- Audit trail (every state change must log who/when/from-state)

Review the diff against these concerns specifically. Ignore other dimensions.

Domain context elevates a generic review to a domain review.

Step 4: Demand adversarial framing

Try to break this code:
- What's the worst input that could reach `processPayment(req)`?
- What's the worst concurrent timing (race condition) that could occur?
- What state could violate the function's pre/post conditions?

For each, paste the input + expected vs. actual behavior.

Adversarial questions force Codex out of “looks plausible” mode into “find a failure” mode.

Step 5: Verify by spot-check

Don’t trust the review wholesale. Pick 2 comments, verify them:

Review claimed: "billing.ts:42 has a race condition between read and update."
Open billing.ts:42, read the surrounding code, decide for yourself.

If the spot-check passes, the rest of the review is more credible. If it fails (“there’s nothing on line 42”), the review is hallucinating and you need to re-prompt.

Step 6: Reject reviews without file:line anchors

If Codex returns “consider error handling” with no line reference, push back:

"Consider error handling" is not actionable. For each concern:
- Specific file:line
- The exact line(s) of code that have the problem
- What goes wrong with what input

Re-do the review with these constraints. Drop any concern you can't anchor.

This trains the session — subsequent reviews stay specific.

Prevention

  • Maintain a “review prompts” directory: one prompt per area (security, perf, correctness)
  • Always provide domain context (this is payments, this is auth, this is internal tooling) at the start
  • Require file:line anchors and severity for every comment — reject the rest
  • Run multiple narrow review passes instead of one wide pass
  • Spot-check 2 comments per review to calibrate trust
  • AI review is a complement, not a replacement, for human review on important PRs

Tags: #Codex #Coding agent #Troubleshooting #Debug #Shallow review