Production incidents often trace back to code that looked fine at review time. This workflow uses AI as a second pair of eyes to flag the suspicious spots a human review tends to miss (error swallowing, race conditions, silent fallbacks) before the module ships. It is aimed at developers maintaining live code who want a 20-minute audit habit, not a 3-hour formal review.
What this covers
Have AI surface likely bug spots in a module before they cause incidents — the targeted, follow-up-driven version of “review this code.”
Who this is for
Developers maintaining production code, on-call engineers preparing a feature for launch, and tech leads asked to bless a teammate's PR without having full context on it. Also useful for solo founders who own the whole codebase and need a synthetic reviewer.
When to reach for it
- Before launching a customer-facing feature
- Before deleting “dead” code that has been around for a while
- Auditing a legacy module you are about to touch for the first time
- After a near-miss incident, to find sibling bugs in the same area
Before you start
- Collect the module file(s), the conventions doc (or CONTRIBUTING.md), and a short note about how the module is used in production.
- Pick a model with strong reasoning: code audits reward the slower, deeper model far more than draft-writing tasks do.
- Decide ahead of time what you will do with each finding: ticket, fix-now, or “watch only.” Without this, the list ends up ignored.
Step by step
- Point AI at the module and your conventions doc with this opener: “I am auditing this module before launch. List likely bug spots grouped by category: error handling, edge cases, race conditions, input validation, resource cleanup, silent fallbacks.”
- For each flagged spot, ask: “What is the smallest failing input or scenario that would trigger this bug? Write a test that would catch it.”
- Ask the model to rate each finding by likelihood and blast radius (1-5 scale each). Sort by likelihood times blast radius.
- Triage: top-quartile findings get fixed in this PR, middle get tickets, bottom get added to a “low-risk maybe” doc.
- Re-run the audit after fixing — sometimes new bugs appear next to the ones you patched.
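The scoring and triage steps above can be sketched in a few lines of TypeScript. This is a minimal sketch, not a prescribed implementation: the `Finding` shape, the bucket names, and the sample findings are all hypothetical, and the quartile boundaries (first 25% rounded up, last 25% rounded down) are an assumption about what "top" and "bottom" mean.

```typescript
// Hypothetical shape for one audit finding; field names are illustrative.
interface Finding {
  title: string;
  likelihood: number; // 1-5, as rated by the model (and challenged by you)
  blastRadius: number; // 1-5
}

type Bucket = "fix-now" | "ticket" | "low-risk-maybe";

interface TriagedFinding extends Finding {
  score: number;
  bucket: Bucket;
}

// Sort by likelihood x blast radius, then split the sorted list by rank.
// Assumption: "top quartile" is the first 25% (rounded up), "bottom" is the
// last 25% (rounded down), and everything in between gets a ticket.
function triage(findings: Finding[]): TriagedFinding[] {
  const scored = findings
    .map((f) => ({ ...f, score: f.likelihood * f.blastRadius }))
    .sort((a, b) => b.score - a.score);

  const n = scored.length;
  const topCount = Math.ceil(n / 4);
  const bottomCount = Math.floor(n / 4);

  return scored.map((f, i): TriagedFinding => {
    let bucket: Bucket = "ticket";
    if (i < topCount) bucket = "fix-now";
    else if (i >= n - bottomCount) bucket = "low-risk-maybe";
    return { ...f, bucket };
  });
}

// Made-up findings, only to show the output shape.
const triaged = triage([
  { title: "catch block swallows DB error", likelihood: 4, blastRadius: 5 },
  { title: "retry loop never releases lock", likelihood: 2, blastRadius: 4 },
  { title: "empty array not handled", likelihood: 3, blastRadius: 2 },
  { title: "log message has wrong id", likelihood: 5, blastRadius: 1 },
]);

for (const f of triaged) {
  console.log(`${f.score}\t${f.bucket}\t${f.title}`);
}
```

With four findings, the highest score lands in "fix-now", the lowest in "low-risk-maybe", and the two in between become tickets. Ranking by position rather than by a fixed score threshold keeps the fix-now list short even when the model rates everything highly.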
Example prompt
You are auditing this Node.js module for production-readiness.
Conventions: errors must surface to the caller, never be swallowed.
Async code must handle cancellation. No global state.
For each function, output:
- 2-3 likely bug spots (one line each)
- Smallest input or sequence that triggers them
- Severity: critical / high / medium / low
- A failing test (vitest, async/await style)
Skip cosmetic issues. Focus on correctness, races, and resource leaks.
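To make the requested output concrete, here is a sketch of what one finding could look like for the "errors must surface to the caller" convention. Everything in it is hypothetical: `loadRetryConfig`, the `RetryConfig` shape, and the default value are invented for illustration. It is written synchronously with a plain helper so it runs without a test runner; in vitest the same check would sit inside an `it(...)` block and use `expect(() => ...).toThrow()`.

```typescript
// Hypothetical config shape for the function under audit.
interface RetryConfig {
  maxRetries: number;
}

// Version the audit would flag: a parse failure is hidden behind a default.
function loadRetryConfigSwallowing(raw: string): RetryConfig {
  try {
    return JSON.parse(raw) as RetryConfig;
  } catch {
    // Bug spot: the error is swallowed, so the caller never learns
    // that the config was malformed (a silent fallback).
    return { maxRetries: 3 };
  }
}

// Fixed version: the parse error reaches the caller with context attached.
function loadRetryConfig(raw: string): RetryConfig {
  try {
    return JSON.parse(raw) as RetryConfig;
  } catch (err) {
    throw new Error(`invalid retry config: ${(err as Error).message}`);
  }
}

// Smallest triggering input: a truncated JSON document.
const smallestFailingInput = '{"maxRetries": 5';

// Returns true when fn throws on the input, i.e. the error surfaced.
function surfacesError(fn: (raw: string) => RetryConfig, input: string): boolean {
  try {
    fn(input);
    return false;
  } catch {
    return true;
  }
}

console.log(surfacesError(loadRetryConfigSwallowing, smallestFailingInput)); // false
console.log(surfacesError(loadRetryConfig, smallestFailingInput)); // true
```

A test asserting that the error surfaces fails against the swallowing version and passes against the fixed one, which is the evidence a finding needs to carry.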
First-run exercise
- Pick one ~200-line module you wrote in the last month — recent enough that you remember the context.
- Run the prompt above. Time-box to 30 minutes including triage.
- Mark each finding as “real bug,” “would-be-nice,” or “false alarm.” Aim for at least 1 real bug per audit; if you get zero across 3 audits, your prompt is too generic.
- Save the version of the prompt that produced the most real bugs as your team template.
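The bookkeeping for this exercise is small enough to keep in a script. The sketch below assumes a hypothetical `AuditRun` record with a self-assigned `promptVersion` string, and it reads "zero across 3 audits" as the three most recent runs; both are assumptions, and the sample data is made up.

```typescript
type Label = "real bug" | "would-be-nice" | "false alarm";

// One audit run: which prompt version was used and how each finding was labeled.
// `promptVersion` is a hypothetical identifier you assign yourself.
interface AuditRun {
  promptVersion: string;
  labels: Label[];
}

function realBugCount(run: AuditRun): number {
  return run.labels.filter((l) => l === "real bug").length;
}

// Assumption: "zero across 3 audits" means the three most recent runs.
function promptLooksTooGeneric(runs: AuditRun[]): boolean {
  if (runs.length < 3) return false;
  return runs.slice(-3).every((run) => realBugCount(run) === 0);
}

// Total real bugs per prompt version; the highest total becomes the template.
function bestPromptVersion(runs: AuditRun[]): string | undefined {
  const totals: { [version: string]: number } = {};
  for (const run of runs) {
    totals[run.promptVersion] = (totals[run.promptVersion] || 0) + realBugCount(run);
  }
  let best: string | undefined;
  for (const version of Object.keys(totals)) {
    if (best === undefined || totals[version] > totals[best]) best = version;
  }
  return best;
}

// Made-up history of four audit runs.
const runs: AuditRun[] = [
  { promptVersion: "v1", labels: ["false alarm", "would-be-nice"] },
  { promptVersion: "v2", labels: ["real bug", "false alarm", "real bug"] },
  { promptVersion: "v2", labels: ["would-be-nice"] },
  { promptVersion: "v3", labels: ["real bug"] },
];

console.log(promptLooksTooGeneric(runs)); // false
console.log(bestPromptVersion(runs)); // v2
```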
Quality check
- Every “critical” finding should come with a reproducible test the model wrote, not just a worry.
- Did the model flag anything that touches user data, money, or auth? Those go to the top regardless of likelihood.
- Did the model invent functions or fields that do not exist? If yes, your context was incomplete — re-feed it the actual file, not a paraphrase.
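Parts of this checklist can be mechanized. The following is a rough sketch under stated assumptions: the `AuditFinding` record is hypothetical and filled in by you after reading the model's output, `touchesSensitive` is your own judgment call, and the identifier check is a plain token match against the file text, so it cannot tell a real function from a name that only appears in a comment.

```typescript
// Hypothetical record for one finding after you have read the model's output.
interface AuditFinding {
  title: string;
  severity: "critical" | "high" | "medium" | "low";
  hasFailingTest: boolean; // did the model supply a reproducing test?
  touchesSensitive: boolean; // user data, money, or auth (your judgment)
  identifiers: string[]; // function/field names the finding refers to
}

// Names the finding mentions that never appear in the source text.
function inventedIdentifiers(finding: AuditFinding, source: string): string[] {
  const tokens: string[] = source.match(/[A-Za-z_$][A-Za-z0-9_$]*/g) || [];
  return finding.identifiers.filter((id) => tokens.indexOf(id) === -1);
}

// Collects the quality-check failures for one finding.
function qualityProblems(finding: AuditFinding, source: string): string[] {
  const problems: string[] = [];
  if (finding.severity === "critical" && !finding.hasFailingTest) {
    problems.push("critical finding has no reproducing test");
  }
  const invented = inventedIdentifiers(finding, source);
  if (invented.length > 0) {
    problems.push(`mentions names not in the file: ${invented.join(", ")}`);
  }
  return problems;
}

// Sensitive findings first; relative order is otherwise unchanged.
function sensitiveFirst(findings: AuditFinding[]): AuditFinding[] {
  return findings
    .filter((f) => f.touchesSensitive)
    .concat(findings.filter((f) => !f.touchesSensitive));
}

// Made-up module text and finding, for illustration only.
const moduleSource = `
export async function chargeCard(order) {
  const total = order.items.reduce((sum, item) => sum + item.price, 0);
  return gateway.charge(order.cardToken, total);
}`;

const suspectFinding: AuditFinding = {
  title: "refund path double-charges on retry",
  severity: "critical",
  hasFailingTest: false,
  touchesSensitive: true,
  identifiers: ["chargeCard", "refundCard"],
};

console.log(qualityProblems(suspectFinding, moduleSource));
```

Here the finding fails both checks: it is critical without a test, and it refers to `refundCard`, which the file never defines. That combination is the signal to re-feed the actual file rather than argue with the finding.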
How to reuse this workflow
- Save the working prompt as a Cursor snippet or ChatGPT Custom GPT named “bug-audit.” Replace only the module each run.
- Add a 4-line “bug audit summary” section to every PR that uses this workflow: top finding, fix status, test added, ticket link.
- Keep a bug-audit-misses.md log. When a real incident happens, check whether the audit caught it; that tells you where to improve the prompt.
Recommended workflow
Module + conventions doc → categorized bug list → tests for top findings → triage matrix → fix + ticket → re-audit. The whole loop should fit in one focused session per module.
Common mistakes
- Asking “review this code” instead of “find likely bugs” — you get style nits instead of correctness flags.
- Skipping the test-writing step. A finding without a failing test is just a vibe.
- Letting the model pick severity unchallenged. Disagree out loud when its 1-5 scoring feels off.
- Auditing files in isolation when the bug lives at the seam between two modules.
- Treating every flag as something to fix — that buries the genuinely scary ones.
- Auditing only your own code. Sibling files written by teammates often share the same bug pattern.
FAQ
- Should I include tests in the context window? Yes. Existing tests tell the model what you already cover, so it can focus on uncovered cases.
- Can this replace code review? No. Use it as a pre-review pass so the human reviewer can spend time on design, not on hunting for null-pointer cases.
- What if the model invents bugs that are not real? Treat each finding as a hypothesis. The test it writes is the evidence; if the test passes against current code, dismiss the finding.
- Does this work on a whole codebase? Not well. Run it per module of under 500 lines. Past that, the model misses cross-cutting issues and over-flags style.