Bug Audit Prompt Workflow: Find Bugs Before They Ship

A 20-minute AI bug-audit habit that beats 'review this code': category-grouped findings, a failing test per bug, and likelihood × blast-radius triage. Models and tools current to June 2026.

Published: May 17, 2026 | Updated: Jun 05, 2026 | Author: AI Productivity Guide Team

Production incidents almost always trace back to code that looked fine at review time: a swallowed error, a race that only fires under load, a fallback that silently returns stale data. This workflow uses a reasoning model as a second pair of eyes to flag those spots before the module ships. It is the targeted, follow-up-driven version of “review this code” — built for developers maintaining live code who want a 20-minute audit habit, not a 3-hour formal review.

TL;DR

Don’t say “review this code.” Say “find likely bugs, grouped by category, with a failing test for each.” Category prompts surface correctness flags; open-ended prompts surface style nits.
Run it per module under ~500 lines, with the real file (not a paraphrase) and your conventions doc in context.
Make the model rate each finding by likelihood (1-5) and blast radius (1-5), then triage by the product of the two scores. The top quartile gets fixed in this PR.
Use a strong reasoning model — Claude Opus 4.7 or GPT-5.5 (Thinking), not a fast draft model. Audits reward depth.
A finding without a reproducible failing test is a vibe, not a bug. Make the test the unit of work.

Why “review this code” underperforms

Open-ended review prompts optimize for coverage of the diff, so the model spreads attention evenly and surfaces whatever is cheapest to notice: naming, formatting, missing comments. A bug audit inverts that. By naming the failure categories up front — error handling, edge cases, race conditions, input validation, resource cleanup, silent fallbacks — you push the model to reason about what could go wrong at runtime instead of what looks off on the page.

This matters because the dedicated AI review tools that have shipped agentic redesigns work the same way. Cursor’s Bugbot was rebuilt in late 2025 from a fixed multi-pass pipeline into an agent that reasons over the diff and decides where to dig deeper; as of June 2026 it adds Default/High/Custom effort levels so you can tell it to think longer on risky PRs (Cursor docs). The hand-rolled prompt below borrows that posture: categorize, then go deep where the category is hot.
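
To make one of those categories concrete, here is a minimal sketch of a silent fallback and the check that exposes it. Everything in it is hypothetical: the price lookup, the `Fetcher` type, and the helper names are invented for illustration, and the code is kept synchronous and free of a test runner so it runs standalone. In a real audit the same check would be written as a vitest case.

```typescript
// Hypothetical upstream call; it throws when the upstream service is down.
type Fetcher = (sku: string) => number;
type Cache = Partial<Record<string, number>>;
type Lookup = (sku: string) => number;
type LookupFactory = (fetchPrice: Fetcher, cache: Cache) => Lookup;

// Buggy version: an upstream failure is swallowed and a cached (possibly
// stale) price, or 0, is returned. The caller cannot tell fresh from stale.
const makeBuggyLookup: LookupFactory = (fetchPrice, cache) => (sku) => {
  try {
    const price = fetchPrice(sku);
    cache[sku] = price;
    return price;
  } catch {
    return cache[sku] ?? 0; // silent fallback
  }
};

// Fixed version: the error surfaces to the caller, per the conventions.
const makeStrictLookup: LookupFactory = (fetchPrice, cache) => (sku) => {
  const price = fetchPrice(sku);
  cache[sku] = price;
  return price;
};

// The "failing test" as a plain function: the smallest triggering sequence
// is one lookup while the upstream is down and the cache is warm.
// Returns true only if the lookup lets the upstream error reach the caller.
function surfacesUpstreamError(makeLookup: LookupFactory): boolean {
  const failingFetch: Fetcher = () => {
    throw new Error("upstream down");
  };
  const lookup = makeLookup(failingFetch, { "sku-1": 9.99 });
  try {
    lookup("sku-1");
    return false; // bug reproduced: a stale price came back with no error
  } catch {
    return true;
  }
}
```

Run against `makeBuggyLookup` the check returns false, which is the evidence that the finding is real; against `makeStrictLookup` it returns true.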

Who this is for

Developers maintaining production code, on-call engineers preparing a feature for launch, and tech leads asked to bless a PR from a teammate whose context they do not fully share. Also useful for solo founders who own the whole codebase and need a synthetic reviewer that never gets tired at 2 a.m.

When to reach for it

Before launching a customer-facing feature, especially anything touching money, auth, or user data.
Before deleting “dead” code that has been around long enough that nobody remembers why it exists.
Auditing a legacy module you are about to touch for the first time.
After a near-miss incident, to find sibling bugs in the same area — the same author often repeats the same mistake.

Before you start

Collect the module file(s), the conventions doc (or CONTRIBUTING.md), and a one-line note on how the module is used in production (request path, cron job, queue consumer).
Pick a model with strong reasoning. As of June 2026, Claude Opus 4.7 (87.6% on SWE-bench Verified, 64.3% on the harder SWE-bench Pro) and GPT-5.5 in Thinking mode are the two that reliably reason about runtime behavior rather than pattern-matching surface issues. A fast/Instant model will skim. See ChatGPT vs Claude vs Gemini for the trade-offs.
Decide ahead of time what you will do with each finding: ticket, fix-now, or “watch only.” Without this, the list gets generated and then ignored.

Step by step

Point the model at the module and your conventions doc with this opener: “I am auditing this module before launch. List likely bug spots grouped by category: error handling, edge cases, race conditions, input validation, resource cleanup, silent fallbacks. Use only the code I gave you.”
For each flagged spot, ask: “What is the smallest failing input or sequence that would trigger this bug? Write a test that would catch it.”
Ask the model to rate each finding by likelihood (1-5) and blast radius (1-5). Sort by likelihood × blast radius.
Triage: top-quartile findings get fixed in this PR, mid-range findings get tickets, and the lowest-scoring ones go to a “low-risk maybe” doc.
Re-run the audit on the patched file. Fixing one bug often exposes a sibling next to it, and the model re-reasons over the new control flow.
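
The scoring and triage steps above can be sketched as a small function. This is one possible reading, not a fixed rule: the article only says "top quartile", "middle", and "bottom", so treating the bottom quartile as the "low-risk maybe" bucket is an assumption, as are the `Finding` shape and the function name.

```typescript
type Action = "fix-now" | "ticket" | "low-risk-maybe";

interface Finding {
  id: string;
  likelihood: number; // 1-5, as rated by the model (and challenged by you)
  blastRadius: number; // 1-5
}

interface TriagedFinding extends Finding {
  score: number; // likelihood × blast radius
  action: Action;
}

function triage(findings: Finding[]): TriagedFinding[] {
  // Score each finding and sort highest-risk first.
  const ranked = findings
    .map((f) => ({ ...f, score: f.likelihood * f.blastRadius }))
    .sort((a, b) => b.score - a.score);

  // Assumption: a "quartile" is the list length divided by four, rounded up,
  // so even a single finding lands in the fix-now bucket.
  const quartile = Math.ceil(ranked.length / 4);

  return ranked.map((f, i) => {
    let action: Action = "ticket";
    if (i < quartile) {
      action = "fix-now";
    } else if (i >= ranked.length - quartile) {
      action = "low-risk-maybe";
    }
    return { ...f, action };
  });
}
```

With four findings scored 20, 12, 6, and 2, this yields one fix-now, two tickets, and one low-risk entry. Findings that touch money, auth, or user data may deserve a manual bump to fix-now regardless of score.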

The prompt

You are auditing this Node.js module for production-readiness.

Conventions: errors must surface to the caller, never be swallowed.
Async code must handle cancellation. No global state.

For each function, output:
- 2-3 likely bug spots (one line each)
- Smallest input or sequence that triggers them
- Severity: critical / high / medium / low
- A failing test (vitest, async/await style)

Rules:
- Use only the code I gave you. Do not assume helper functions exist.
- If you reference a function or field, it must appear in the source above.
- Skip cosmetic issues. Focus on correctness, races, and resource leaks.

The “use only the code I gave you” line matters more than it looks. Open-ended audits are where reasoning models hallucinate most — inventing a validateInput() helper that does not exist and then “finding a bug” in it. Pinning the model to the supplied source cuts that failure mode hard.
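
If you reuse the prompt often, assembling it in code keeps the rules from drifting between runs. The sketch below is a hypothetical helper, not part of any tool: `buildAuditPrompt`, the 500-line guard, and the exact wording of the assembled prompt are assumptions that combine the opener from the steps with the rules from the prompt above.

```typescript
// The six categories named in the article; replace with your own list.
const DEFAULT_CATEGORIES: string[] = [
  "error handling",
  "edge cases",
  "race conditions",
  "input validation",
  "resource cleanup",
  "silent fallbacks",
];

// Rough per-module ceiling suggested in the article.
const MAX_LINES = 500;

function buildAuditPrompt(
  source: string,
  conventions: string,
  categories: string[] = DEFAULT_CATEGORIES,
): string {
  const lineCount = source.split("\n").length;
  if (lineCount > MAX_LINES) {
    // Refuse rather than silently audit a module that is too large.
    throw new Error(
      `Module has ${lineCount} lines; split it before auditing (limit ~${MAX_LINES}).`,
    );
  }
  return [
    "You are auditing this module for production-readiness.",
    "",
    `Conventions: ${conventions}`,
    "",
    `List likely bug spots grouped by category: ${categories.join(", ")}.`,
    "For each spot, give the smallest failing input or sequence and a failing test.",
    "Rate each finding by likelihood (1-5) and blast radius (1-5).",
    "",
    "Rules:",
    "- Use only the code I gave you. Do not assume helper functions exist.",
    "- If you reference a function or field, it must appear in the source below.",
    "- Skip cosmetic issues. Focus on correctness, races, and resource leaks.",
    "",
    "Source:",
    source,
  ].join("\n");
}
```

Passing the real file contents as `source`, rather than a summary, is what makes the "use only the code I gave you" rule enforceable.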

Do it yourself or use a dedicated tool?

A hand-rolled prompt audit is free, runs on code you cannot paste into a third-party service, and lets you tune the category list to your domain. A dedicated PR-review tool runs automatically on every push and scales across a team. Use the prompt workflow for deep audits of risky modules; layer a tool on top for blanket coverage. Independent accuracy data (June 2026) helps set expectations:

Tool / approach	Independent accuracy signal	Cost (June 2026)	Best for
Hand-rolled prompt (Opus 4.7 / GPT-5.5)	Depends on prompt + context; your test is the gatekeeper	Your existing Plus/Pro/Max plan	Deep audit of one risky module
Cursor Bugbot	~70%+ of its flags resolved before merge; Autofix merges 35%+ of proposed changes	Usage-based, ~$1.00-$1.50 per run	Automatic per-PR review in a Cursor team
DeepSource	84.51% F1 on the OpenSSF CVE Benchmark (highest tested); <5% false positives on its deterministic pass	Paid plans per repo	Teams wanting low-noise security coverage
Greptile	~82% F1; higher catch rate but more false positives	Paid plans per seat	Catch-everything reviews, noise tolerated
CodeRabbit	59.39% accuracy / 36.19% F1 on OpenSSF CVE Benchmark	Free tier + paid seats	Lightweight inline PR comments

The F1 and accuracy figures for DeepSource, Greptile, and CodeRabbit are from the OpenSSF CVE Benchmark, the only public, independent dataset (200+ real CVEs) used to compare these tools head-to-head. The Cursor Bugbot row reports resolution and merge rates instead, so it is not directly comparable. The takeaway for the prompt workflow: even the best dedicated tools miss real bugs and over-flag, which is exactly why every finding in your audit must carry a failing test as proof.
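
For readers unfamiliar with F1: it is the harmonic mean of precision (how many flags were real) and recall (how many real bugs were flagged), so a tool that catches more but also over-flags can still score lower. A minimal sketch follows; the numbers in the usage note are hypothetical and are not taken from the benchmark.

```typescript
// F1 = 2 * precision * recall / (precision + recall), with both inputs in [0, 1].
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) {
    return 0; // avoid dividing by zero when nothing was flagged correctly
  }
  return (2 * precision * recall) / (precision + recall);
}
```

With a hypothetical precision of 0.90 and recall of 0.80, `f1Score` gives about 0.85. A noisier hypothetical tool at 0.70 precision and 0.95 recall gives about 0.81, despite catching more.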

First-run exercise

Pick one ~200-line module you wrote in the last month — recent enough that you remember the context.
Run the prompt above. Time-box to 30 minutes including triage.
Mark each finding as “real bug,” “would-be-nice,” or “false alarm.” Aim for at least 1 real bug per audit; if you get zero across 3 audits, your prompt is too generic — tighten the category list to your stack.
Save the version of the prompt that produced the most real bugs as your team template.

Quality check

Every “critical” finding should come with a reproducible test the model wrote, not just a worry. If the test passes against current code, the finding was a false alarm — dismiss it.
Did the model flag anything that touches user data, money, or auth? Those go to the top regardless of likelihood score.
Did the model invent functions or fields that do not exist? If yes, your context was incomplete — re-feed it the actual file, not a paraphrase, and keep the “use only the code I gave you” rule in the prompt.
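
The invented-function check can be partly automated. The sketch below is a rough heuristic under stated assumptions: it only looks for `name(` patterns in the model's finding text and checks that each name appears somewhere in the supplied source. The function name and the ignore list are hypothetical, and it will miss invented fields and aliased references.

```typescript
// Keywords that can appear directly before "(" without being function calls.
const IGNORED_WORDS: string[] = [
  "if", "for", "while", "switch", "catch", "function", "return",
];

// Returns the names the finding text calls that never appear in the source.
function findInventedCalls(findingText: string, source: string): string[] {
  const callPattern = /\b([A-Za-z_]\w*)\(/g;
  const invented: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = callPattern.exec(findingText)) !== null) {
    const name = match[1];
    if (name === undefined || IGNORED_WORDS.indexOf(name) !== -1) {
      continue;
    }
    // Whole-word match so "charge" does not count as a hit for "chargeCard".
    const inSource = new RegExp(`\\b${name}\\b`).test(source);
    if (!inSource && invented.indexOf(name) === -1) {
      invented.push(name);
    }
  }
  return invented;
}
```

A non-empty result is a cue to re-feed the actual file and re-run, not proof that the whole finding is wrong.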

How to reuse this workflow

Save the working prompt as a Cursor snippet or a ChatGPT Custom GPT named “bug-audit.” Replace only the module each run.
Add a 4-line “bug audit summary” to every PR that uses this workflow: top finding, fix status, test added, ticket link.
Keep a bug-audit-misses.md log. When a real incident happens, check whether the audit caught it. The misses tell you which category to strengthen in the prompt.
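
One way to turn the misses log into a decision is to tally missed incidents per category. The record shape below is an assumption, since the article does not prescribe a format for bug-audit-misses.md; the `IncidentRecord` type and the function name are hypothetical.

```typescript
interface IncidentRecord {
  category: string; // e.g. "race conditions", matching your prompt's category list
  caughtByAudit: boolean; // did an earlier audit flag this bug?
}

// Returns [category, missCount] pairs, most-missed category first.
function missesByCategory(records: IncidentRecord[]): Array<[string, number]> {
  const counts: Record<string, number> = {};
  for (const record of records) {
    if (!record.caughtByAudit) {
      counts[record.category] = (counts[record.category] ?? 0) + 1;
    }
  }
  return Object.keys(counts)
    .map((category): [string, number] => [category, counts[category] ?? 0])
    .sort((a, b) => b[1] - a[1]);
}
```

The category at the top of the result is the first candidate for sharper wording or extra examples in the prompt.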

Common mistakes

Asking “review this code” instead of “find likely bugs by category” — you get style nits instead of correctness flags.
Skipping the test-writing step. A finding without a failing test is just a vibe.
Letting the model pick severity unchallenged. Disagree out loud when its 1-5 scoring feels off; you have context it doesn’t.
Auditing files in isolation when the bug lives at the seam between two modules. Feed both files for interface bugs.
Treating every flag as something to fix. That buries the genuinely scary ones under busywork.
Auditing only your own code. Sibling files written by teammates often share the same bug pattern, which is why the post-incident sweep pays off.

FAQ

Should I include existing tests in the context?
Yes. Existing tests tell the model what you already cover, so it can focus on uncovered cases instead of re-flagging handled ones.

Can this replace code review?
No. Use it as a pre-review pass so the human reviewer spends time on design and trade-offs, not on hunting for null-pointer cases. Even the best dedicated tools top out around 82-85% F1 on the OpenSSF CVE Benchmark, so a human still owns the final call.

What if the model invents bugs that are not real?
Treat each finding as a hypothesis. The test it writes is the evidence; if the test passes against current code, dismiss the finding. Keep the “use only the code I gave you” rule to cut hallucinated helpers.

Which model should I use?
As of June 2026, Claude Opus 4.7 or GPT-5.5 in Thinking mode for the audit itself. Both reason about runtime behavior rather than skimming surface style; a fast/Instant model will miss the races and silent fallbacks that matter most.

Does this work on a whole codebase?
Not well. Run it per module under ~500 lines. Past that, the model misses cross-cutting issues and over-flags style. For repo-wide coverage, layer a dedicated PR tool on top.

Tags: #AI coding #Tutorial
