Bug Audit Prompts: Hunt Hidden Bugs Before Prod

13 copy-ready prompts that hunt specific bug families — race, null, off-by-one, leaks — before they ship. Tuned for Claude Opus 4.7, Cursor, and Codex (June 2026).

Published: May 17, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team

A bug audit is not a code review. A review covers a diff and asks “is this change OK?” An audit covers a whole module and asks “what category of bug is hiding here?” The two need different prompts. The prompts below each target one bug family — race conditions, null dereferences, off-by-one, leaks, money math — because a model told to “find all bugs” returns a shallow, hallucination-prone list, while a model told to “find every place two requests can mutate this shared map” returns something you can actually verify.

TL;DR

Run one bug family per pass. Never ask one prompt to find everything.
Every finding must carry file:line, a concrete trigger scenario, and a severity. No scenario means no real bug.
Diagnose first, fix second — separate prompts. Mixing the two degrades both.
Pair the audit with prompt #13 to turn each finding into a minimal failing test. Tests are the only proof.
Best model for this (June 2026): Claude Opus 4.7 (87.6% SWE-bench Verified) for the reasoning, run inside Claude Code or Cursor so the agent can `Grep` and `Read` the surrounding code.
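
The evidence rule in this list can be checked mechanically before anyone reads the findings. The following is a minimal sketch in Python, assuming the model output has already been parsed into records. The `Finding` type, the `is_actionable` helper, and the five-word scenario threshold are illustrative assumptions, not part of any tool named here.

```python
import re
from dataclasses import dataclass

# Assumed location shape: a path without spaces, a colon, then a line number.
LOCATION = re.compile(r"^[^\s:]+:\d+$")
SEVERITIES = {"Critical", "High", "Med"}


@dataclass(frozen=True)
class Finding:
    location: str   # e.g. "cache.go:42"
    scenario: str   # the concrete trigger, not a label
    severity: str


def is_actionable(finding: Finding) -> bool:
    """Keep a finding only if it can be located, triggered, and ranked."""
    has_location = bool(LOCATION.match(finding.location))
    # Arbitrary heuristic: a two-word label like "Race here" is not a scenario.
    has_scenario = len(finding.scenario.split()) >= 5
    return has_location and has_scenario and finding.severity in SEVERITIES
```

Findings that fail the check are candidates to send back to the model with a request for the missing piece, rather than to triage by hand.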

Who this is for

On-call engineers prepping a release, solo founders shipping code no one else reviews, security-adjacent teams who cannot afford a regression, and anyone chasing the root cause of a live incident.

When not to use these prompts

Skip them on throwaway scripts or once-a-month automation — the overhead outweighs the payoff. And never mix an audit with a refactor in the same prompt. Two goals means two passes; combining them produces a vague diff plus a half-finished bug list.

Anatomy of a good bug-audit prompt

Every prompt below carries the same six elements. Drop any one and quality falls off a cliff.

| Element | What it does | Failure if omitted |
| --- | --- | --- |
| Bug family | Pick ONE (race / null / off-by-one / leak / timezone) | “Find all bugs” returns a shallow grab-bag |
| Scope | Which files, functions, or commits | Model wanders the whole repo, misses the hot path |
| Trigger scenario | The exact input or interleaving that fires | “Race here” is a label, not a verifiable claim |
| Evidence | `file:line` + a repro path or test idea | You cannot confirm a finding you cannot locate |
| Severity | Critical / High / Med | Flat enumeration with no ranking |
| Output format | Numbered list or `file \| line \| scenario \| fix` table | Wall of prose you have to re-parse |
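
One way to avoid dropping an element is to assemble the prompt from named fields. This is a rough Python sketch under that assumption; the `AuditPrompt` class, its field names, and the exact wording it renders are hypothetical and should be adapted to your own templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditPrompt:
    """Hypothetical container for the elements of a bug-audit prompt."""
    bug_family: str                 # exactly one family per pass
    scope: str                      # files, functions, or commits
    code: str                       # the pasted source
    output_format: str = "file | line | scenario | severity | fix"

    def render(self) -> str:
        # Trigger scenario, evidence, and severity are demanded in fixed text
        # so they cannot be left out by accident.
        return (
            f"Audit the code below for {self.bug_family} bugs only.\n"
            f"Scope: {self.scope}.\n"
            "For each finding give: file:line, the exact trigger scenario "
            "(input or interleaving), a repro path or test idea, and a "
            "severity (Critical / High / Med).\n"
            f"Output: one row per finding as {self.output_format}.\n\n"
            f"{self.code}"
        )
```

Usage: `AuditPrompt("race condition", "cache/store.go", source_text).render()` produces a prompt that names one family, one scope, and the required evidence.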

Best for

Pre-release audit
Inherited codebase debugging
Incident root-cause hunt
Refactor safety net
Pre-launch regression sweep

Which tool runs these best (June 2026)

Paste-into-chat works for a single file, but a real audit needs the model to see call sites and type definitions. That argues for an agent that can read the repo itself.

| Tool | Model | Why it helps an audit | Cost (June 2026) |
| --- | --- | --- | --- |
| Claude Code | Opus 4.7 / Sonnet 4.6 | Agent runs `Grep` for danger patterns, `Read`s call sites; bundled in Claude Pro | Pro $20/mo |
| Cursor | Opus 4.7, GPT-5.5, Gemini 3.1 Pro | In-IDE; BugBot add-on auto-reviews GitHub PRs and proposes fixes | Pro $20/mo; BugBot $40/seat |
| Codex | GPT-5.5 | Strong terminal autonomy (82.7% Terminal-Bench 2.0) | In ChatGPT Plus $20/mo |
| Plain chat | Opus 4.7 / GPT-5.5 | Fine for one self-contained file; no repo context | Plus/Pro tier |

For pure bug-reasoning quality, Opus 4.7 leads with 87.6% on SWE-bench Verified vs Gemini 3.1 Pro’s 80.6% (as of June 2026). Use it for the audit; use whichever tool can read your repo for the legwork.

13 copy-ready prompt templates

Replace [paste] with your code and [framework] with your test runner. Keep the structure — the named bug family and the file:line demand are what make these work.

1. Race condition hunt

Audit the code below for race conditions: shared-state mutation, missing
locks, check-then-act gaps, unsynchronized map/slice writes. For each
finding give: file:line, the exact interleaving where two goroutines or
threads collide, severity, and a suggested mitigation.

[paste]
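
To make "the exact interleaving" concrete, here is an illustrative Python sketch of a check-then-act gap. The `Inventory` class is hypothetical, and the bad interleaving is replayed by hand in a single thread so the result is deterministic; a real race would depend on scheduler timing.

```python
import threading


class Inventory:
    """Hypothetical shared state with a check-then-act gap."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._lock = threading.Lock()

    def has_stock(self) -> bool:       # the "check"
        return self.stock > 0

    def take_one(self) -> None:        # the "act"
        self.stock -= 1

    def reserve_atomic(self) -> bool:
        """Mitigation: check and act under one lock."""
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False


def replay_bad_interleaving() -> int:
    """Replay: A checks, B checks, A acts, B acts."""
    inv = Inventory(stock=1)
    a_ok = inv.has_stock()    # request A sees stock == 1
    b_ok = inv.has_stock()    # request B also sees stock == 1
    if a_ok:
        inv.take_one()
    if b_ok:
        inv.take_one()        # oversells the last item
    return inv.stock
```

A finding in this shape ("A and B both pass the check at this line before either decrements") is something a reviewer can verify; a bare "race here" is not.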

2. Null / undefined hunt

Audit for likely null/undefined dereference. List call sites where the
input could plausibly be null/undefined and isn't checked. For each:
file:line, the upstream path that yields null, severity.

[paste]
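
As an illustration of "the upstream path that yields null", consider this small Python sketch. The `find_nickname` and `greeting` functions are made up for the example; the point is that the dereference and the source of `None` sit in different functions, which is why the model needs to see both.

```python
from typing import Optional


def find_nickname(profile: dict) -> Optional[str]:
    # Upstream path: returns None when the key is absent.
    return profile.get("nickname")


def greeting_buggy(profile: dict) -> str:
    # Dereferences the result without a check.
    return "Hi " + find_nickname(profile).strip()


def greeting(profile: dict) -> str:
    nickname = find_nickname(profile)
    if nickname is None:
        return "Hi there"
    return "Hi " + nickname.strip()
```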

3. Off-by-one hunt

Hunt off-by-one errors: loop bounds, array slicing, pagination offsets,
date arithmetic, inclusive-vs-exclusive ranges. For each: file:line, the
input size where it breaks, fix.

[paste]
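
A typical finding from this prompt is a pagination count that drops the final partial page. The Python sketch below is an assumed example (the function names are hypothetical) showing the input size where it breaks and one possible fix.

```python
def page_count_buggy(total_items: int, page_size: int) -> int:
    # Floor division drops the final partial page.
    return total_items // page_size


def page_count(total_items: int, page_size: int) -> int:
    # Integer ceiling: round up whenever there is a remainder.
    return (total_items + page_size - 1) // page_size


def page_slice(items: list, page: int, page_size: int) -> list:
    # Pages are 1-based here; the end index is exclusive.
    start = (page - 1) * page_size
    return items[start:start + page_size]
```

With 21 items and a page size of 10, the buggy version reports 2 pages and the 21st item becomes unreachable.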

4. Error-handling audit

Audit error handling: swallowed errors, generic catch-all blocks, errors
logged but not propagated, missing context on rethrow. List each
suspicious site plus suggested logging or propagation.

[paste]
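
For reference, this is roughly what a swallowed error and a propagated one look like side by side. It is a Python sketch with hypothetical function names, not code from any audited project.

```python
def parse_port_swallowing(raw: str) -> int:
    try:
        return int(raw)
    except Exception:
        # Catch-all that hides the failure behind a plausible default.
        return 0


def parse_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as exc:
        # Re-raise with context and keep the original error as the cause.
        raise ValueError(f"invalid port value {raw!r}") from exc
```

The first version turns a typo in configuration into port 0 with no trace; the second fails loudly and names the offending value.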

5. Resource-leak hunt

Audit for resource leaks: open files, DB connections, event listeners,
subscriptions, timers, goroutines. Flag every open-without-close pattern
and every early return that skips cleanup. For each: file:line, the path
that leaks.

[paste]
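
The "early return that skips cleanup" pattern is easy to miss in review. Below is a small Python sketch; `TrackedConnection` is a hypothetical stand-in for a file handle or DB connection that counts how many instances remain open.

```python
class TrackedConnection:
    """Hypothetical resource that counts instances still open."""
    open_count = 0

    def __init__(self) -> None:
        TrackedConnection.open_count += 1

    def close(self) -> None:
        TrackedConnection.open_count -= 1

    def __enter__(self) -> "TrackedConnection":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


def fetch_leaky(user_id):
    conn = TrackedConnection()
    if user_id is None:
        return None           # early return skips close(): the leak
    conn.close()
    return f"user-{user_id}"


def fetch_safe(user_id):
    with TrackedConnection():
        if user_id is None:
            return None       # the context manager still closes here
        return f"user-{user_id}"
```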

6. Timezone bug hunt

Audit for timezone bugs: implicit local time, naive datetime, conversions
during DST transitions, storing local instead of UTC, day-boundary math.
List each plus how it fails and on which dates.

[paste]
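
Day-boundary math is the easiest of these to reproduce. The Python sketch below uses a fixed UTC+9 offset as a stand-in for a user's timezone (an assumption made to keep the example free of DST and timezone-database dependencies); the function names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo

# Fixed offset used for illustration only; real code would use a named zone.
UTC_PLUS_9 = timezone(timedelta(hours=9))


def report_date_buggy(ts_utc: datetime) -> date:
    # Treats the UTC calendar date as the user's local date.
    return ts_utc.date()


def report_date(ts_utc: datetime, user_tz: tzinfo) -> date:
    # Convert to the user's zone first, then take the date.
    return ts_utc.astimezone(user_tz).date()


late_evening_utc = datetime(2026, 3, 31, 23, 30, tzinfo=timezone.utc)
```

For `late_evening_utc`, the buggy version files the event under March 31 while the user's calendar already shows April 1, which matters for month-end reports.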

7. State-machine inconsistency

Below is a state-machine-like flow. List impossible states, unreachable
transitions, and missing guards. Then suggest one cleaner state model with
explicit allowed transitions.

[paste]
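
A "cleaner state model with explicit allowed transitions" might look like the following Python sketch. The order states and the transition table are invented for illustration; the shape, a single table consulted by one guard function, is the part worth asking the model for.

```python
from enum import Enum


class OrderState(Enum):
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Every legal move lives in one table; anything absent is rejected.
ALLOWED = {
    OrderState.PENDING: {OrderState.PAID, OrderState.CANCELLED},
    OrderState.PAID: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: set(),
    OrderState.CANCELLED: set(),
}


def transition(current: OrderState, target: OrderState) -> OrderState:
    if target not in ALLOWED[current]:
        raise ValueError(
            f"illegal transition: {current.value} -> {target.value}"
        )
    return target
```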

8. Boundary-input hunt

For each function below, list boundary inputs (empty, single element, max,
negative, zero, special chars, unicode, very large) where behavior is
unclear. Suggest one test per case.

[paste]

9. Float / money-math hunt

Audit this code for floating-point and money-arithmetic bugs: 0.1 + 0.2
accumulation drift, currency rounding at the wrong layer, mixing cents and
dollars, division-before-multiplication losing precision, tax or discount
applied in an inconsistent order. For each: file:line, the input that
produces a wrong total, suggested fix (Decimal type, integer cents, etc.).

[paste]

Optimization: For invoice or order code, add: “Also flag any place where rounding happens twice in the same calculation chain.”
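
Two of these bugs can be reproduced in a few lines. The Python sketch below shows float drift and then rounding at the wrong layer, using the standard `decimal` module. The 10% tax rate, the prices, and the helper names are assumptions chosen to make the discrepancy visible.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
TAX_RATE = Decimal("0.10")  # assumed rate, for illustration only


def round_money(amount: Decimal) -> Decimal:
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def tax_rounded_per_line(prices: list) -> Decimal:
    # Rounds each line's tax, then sums the rounded values.
    return sum((round_money(p * TAX_RATE) for p in prices), Decimal("0"))


def tax_rounded_once(prices: list) -> Decimal:
    # Sums first, then rounds a single time.
    return round_money(sum(prices, Decimal("0")) * TAX_RATE)


float_drift = (0.1 + 0.2) != 0.3          # binary floats cannot hold 0.1 exactly
exact_sum = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
prices = [Decimal("1.05")] * 3
```

For three items at 1.05, per-line rounding yields 0.33 in tax while rounding once yields 0.32. Neither is wrong in isolation; the bug is two code paths disagreeing on which one applies.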

10. Idempotency / retry hunt

Audit for retry-safety bugs: external API calls without idempotency keys,
DB writes that double-fire on retry, webhook handlers that aren't
idempotent, message consumers without dedupe. For each: file:line, what
double-fires, suggested key/window/dedupe strategy.

[paste]
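
A dedupe strategy based on an idempotency key can be sketched as follows. This is an in-memory Python illustration with a hypothetical `PaymentHandler`; a real handler would persist the seen keys and expire them after some window.

```python
class PaymentHandler:
    """Hypothetical handler that dedupes retries by idempotency key."""

    def __init__(self) -> None:
        self.charges = []        # stands in for the external side effect
        self._seen = {}          # idempotency key -> charge id

    def handle(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._seen:
            # Retry of a request already processed: return the same result.
            return self._seen[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        self._seen[idempotency_key] = charge_id
        return charge_id
```

Without the `_seen` lookup, a client retry after a timeout would append a second charge for the same order.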

11. Cache-coherence hunt

Audit for cache bugs: writes that update the DB but not the cache, cache
keys missing tenant/user scoping, stale reads after writes, TTLs longer
than the data's natural change rate, cache-stampede risk. For each:
file:line, the stale-read scenario, fix sketch.

[paste]
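
Missing tenant scoping in a cache key is worth a concrete picture. The Python sketch below is hypothetical (`SettingsCache` and the `load` callback are invented names) and keeps both the unscoped and scoped variants side by side.

```python
from typing import Callable, Dict, Tuple


class SettingsCache:
    """Hypothetical cache showing unscoped versus tenant-scoped keys."""

    def __init__(self) -> None:
        self._unscoped: Dict[str, str] = {}
        self._scoped: Dict[Tuple[str, str], str] = {}

    def get_unscoped(self, tenant: str, key: str,
                     load: Callable[[str, str], str]) -> str:
        # Bug: the cache key ignores the tenant.
        if key not in self._unscoped:
            self._unscoped[key] = load(tenant, key)
        return self._unscoped[key]

    def get_scoped(self, tenant: str, key: str,
                   load: Callable[[str, str], str]) -> str:
        cache_key = (tenant, key)
        if cache_key not in self._scoped:
            self._scoped[cache_key] = load(tenant, key)
        return self._scoped[cache_key]


def load_setting(tenant: str, key: str) -> str:
    # Stand-in for a database read.
    return f"{tenant}-{key}"
```

With the unscoped variant, the first tenant to request a key decides what every other tenant sees for it.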

12. Unicode / encoding hunt

Audit for string and encoding bugs: byte-length vs character-length
confusion, lowercasing non-ASCII, slugs that drop emoji or CJK,
surrogate-pair truncation, NFC-vs-NFD normalization mismatches across
DB and UI, header or URL decoding inconsistencies. For each: file:line,
an input that breaks, fix.

[paste]
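
Three of these bugs can be shown with the Python standard library alone. The helper names below are hypothetical; `unicodedata.normalize` and the `errors="ignore"` decode option are standard. Treat this as a sketch of the failure modes, not a complete text-handling policy.

```python
import unicodedata

CAFE_NFC = "caf\u00e9"                                # precomposed e-acute
CAFE_NFD = unicodedata.normalize("NFD", CAFE_NFC)     # "e" + combining accent


def same_text(a: str, b: str) -> bool:
    # Compare after normalizing both sides to one form.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def truncate_utf8(text: str, max_bytes: int) -> str:
    # Cut at a byte budget, then drop any character left incomplete.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


JAPANESE = "\u65e5\u672c\u8a9e"   # three CJK characters, three bytes each
```

The two spellings of the same word differ in code-point count and compare unequal until normalized, and a naive byte slice of the CJK string would end mid-character.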

13. Audit → minimal failing test

Run this last. It turns each audit finding into a runnable repro — the step that actually proves a bug is real.

Take each finding from the bug audit above and write the minimal failing
test that reproduces it in [framework]. Each test: one assertion,
deterministic input, no mocks unless strictly needed. Mark which tests
fail today vs which need infra (DB / queue / timezone faking).

Findings: [paste]

Variables to swap: [framework] (vitest, jest, pytest, go test, etc.).

Common mistakes

Mixing categories into one “find all bugs” prompt.
Findings with no file:line.
Trusting model confidence without a spot-check.
Asking for the fix in the same prompt as the audit — diagnosis blurs.
Never converting findings into tests, so the audit becomes a read-only doc no one acts on.

How to push results further

One family per pass. Cross-pollination dilutes findings.
Demand a trigger scenario, not a label. “Race here” is hallucination-prone; “if request A finishes after B but before C” is verifiable.
Pair with prompt #13. Only the test proves the bug is real. Independent 2026 industry studies on LLM bug detection report meaningful false-positive rates, so treat every finding as a hypothesis until a failing test confirms it.
Pre-filter with Grep. On large codebases, let Claude Code or Cursor `Grep` for danger patterns such as `catch (e) {}`, `setTimeout`, `Date(`, and `==` on currency, then audit only the matches.
Add a confidence threshold: “Only report findings you would bet $50 on.” This cuts noise sharply.
Re-run the same prompt after the fix. If findings reappear at new file:lines, you have a systemic issue, not a one-off.
Keep an ignore list of known false-positives in the prompt so each pass stops re-surfacing them.
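
The pre-filter and the ignore list can also be combined outside the agent. This Python sketch scans source text for a few of the danger patterns mentioned in this list and drops known false positives. The regular expressions, the `prefilter` function, and the `(location, label)` shape of the ignore list are assumptions for illustration, and the patterns are deliberately loose.

```python
import re

# Loose, illustrative patterns; tune them for your codebase.
DANGER_PATTERNS = {
    "empty catch": re.compile(r"catch\s*\(\s*\w*\s*\)\s*\{\s*\}"),
    "setTimeout": re.compile(r"\bsetTimeout\s*\("),
    "Date(": re.compile(r"\bDate\s*\("),
}


def prefilter(source: str, filename: str, ignore=frozenset()) -> list:
    """Return (file:line, label) pairs worth sending to the audit prompt."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        location = f"{filename}:{lineno}"
        for label, pattern in DANGER_PATTERNS.items():
            if pattern.search(line) and (location, label) not in ignore:
                hits.append((location, label))
    return hits


SAMPLE = (
    "try { run() } catch (e) {}\n"
    "const t = setTimeout(fn, 10)\n"
    "const d = new Date()\n"
    "const x = 1\n"
)
```

Passing `ignore={("app.js:3", "Date(")}` removes that match from later passes, which mirrors keeping an ignore list in the prompt itself.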

FAQ

How is a bug audit different from a code review? A review covers a diff and asks whether a change is safe. An audit covers a whole module or repo and drills into one bug family deeply. They are complementary, not interchangeable, so run both.
How often should I run a bug audit? Before major releases, right after inheriting a codebase, post-incident on the module that failed, and quarterly on production-critical paths.
Why does AI miss obvious bugs sometimes? Usually because it cannot see the call site or the type definition. Either expand the context you paste, or run the audit inside Claude Code or Cursor so the agent can `Read` the surrounding files itself.
Should I trust the severity the model assigns? Treat it as a starting point and re-rank by business impact. The model does not know which code path carries revenue or PII.
What about false positives? Expect a meaningful rate: independent 2026 studies on LLM static bug detection confirm it is non-trivial. The cost of a missed prod bug usually exceeds the cost of confirming false positives, so accept the asymmetry and verify each finding with a test.
Can AI fix the bugs it finds? Yes, but in a separate pass. Mixing diagnosis and remediation makes both worse. Cursor’s BugBot, for example, splits the steps: it flags issues on the PR first, then spawns a Cloud Agent to propose a fix you review.

For deeper background, see Anthropic’s Claude Code docs and Cursor’s BugBot page.

Tags: #Prompt #AI coding #Bug audit