Which model should I use?

As of June 2026, Claude Opus 4.7 with adaptive thinking at high/xhigh effort, or GPT-5.5 Thinking on Extended thinking time. Both reason before answering. Speed-tier models (GPT-5.5 Instant, Claude Sonnet 4.6 without thinking) give weaker critique — fine for drafting the doc, not for attacking it.

Does this replace human design review?

No, it's a pre-filter. Your senior teammate's time goes much further on a doc that already survived AI critique and arrives with mitigations and rejected alternatives noted.

What if the AI's critique is wrong?

Often it will be, and that's fine. Wrong critique still surfaces an assumption worth documenting. Just don't apply mitigations for issues that aren't real.

How long does it take?

20-40 minutes per design with a reasoning model — the thinking modes are slower than Instant. Against weeks of refactoring, it's the best ROI in the toolkit.

Can I skip the steelman step to save time?

Don't. Without it the critique skews one-sided and you'll over-correct a design that was mostly fine.

Plus or Pro for ChatGPT?

Plus ($20/mo) is enough — it includes up to 3,000 GPT-5.5 Thinking messages/week. The $200 Pro plan only matters if your design docs are large enough to need the full 1M-token context.

AI Tool Tutorials

AI Architecture Review Workflow

Use a reasoning-grade AI as a structured devil's advocate to catch 3-5 real architecture flaws per design doc before you commit weeks of code.

Published: May 17, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team 🌐 查看中文版本

Architecture mistakes are paid for in weeks of refactoring, not days. The cheapest way to find them is to argue with someone smart before you write code — but most teams don’t have that someone free on a Tuesday morning. This walks through using a reasoning-grade model (Claude Opus 4.7 with adaptive thinking, or GPT-5.5 Thinking) as a structured devil’s advocate that surfaces 3-5 real issues per design doc, in about 25 minutes.

TL;DR

Write a one-page design doc with four sections: Goal, Constraints, Approach, Alternatives considered.
Run a fixed five-prompt sequence: steelman → devil’s advocate → minimal mitigations → alternative architectures → failure scenarios.
Use a reasoning model, not a speed model. As of June 2026, that means Claude Opus 4.7 (set effort to high or xhigh) or GPT-5.5 Thinking on Extended time. Skip Instant/Haiku-class models — their critique stays shallow.
Output is a sharpened doc with mitigations and explicitly-rejected alternatives baked in. That doc is what you hand to a human reviewer, not the raw critique.

Which model to use (June 2026)

The single biggest determinant of critique quality is whether you used a model that actually reasons before answering. Both leading options expose a thinking/effort control you have to turn up.

Option	How to enable	Context window	Best for	Notes
Claude Opus 4.7 (adaptive thinking)	Default on Claude.ai / Claude Code; set effort to `high` or `xhigh` for design review	1M tokens (standard pricing)	Long design docs, distributed-systems failure analysis	Adaptive thinking is the only mode; the model decides how deep to think. Knowledge cutoff Jan 2026.
GPT-5.5 Thinking	Pick GPT-5.5 Thinking in the model picker, set thinking time to Extended	~320 pages in-app on Plus; full 1M only on the $200 Pro plan	Quick second opinion, cross-checking Claude’s critique	Plus/Business get up to 3,000 Thinking messages/week. Pro adds Light/Heavy thinking-time rungs.
Speed models (GPT-5.5 Instant, Claude Sonnet 4.6 no-think)	Default model, thinking off	—	Drafting the doc itself, not reviewing it	Critique is generic. Do not use for the attack step.

Pricing as of June 2026: Claude Pro is $20/mo ($17/mo billed annually, now bundling Claude Code) and ChatGPT Plus is $20/mo. Both tiers are enough to run this workflow daily. Official pages: Claude pricing and ChatGPT pricing.

When to run this

Before writing code on any feature involving new data models, new services, non-trivial state management, distributed coordination, payment/auth flows, or anything where rollback would be painful. Rule of thumb: if undoing this would take more than 2 days, run the review.

Skip it for trivial features, well-trodden patterns your team already has a standard for (no point asking AI to reconsider your CRUD endpoint pattern), and throwaway time-boxed spikes.

Before you start

Have a one-page design doc. Bullet points are fine, but it must include: goal, constraints, proposed approach, and 2-3 alternatives you considered.
Decide what “good” means for THIS design in measurable terms. “p95 latency under 200ms” is good; “scalable” is not. The model critiques against your stated criteria, so vague criteria yield vague critique.
Open a reasoning model with thinking turned up (see the table above). For a doc you’ll iterate on, a Claude Project or Custom GPT holding the prompt sequence saves re-typing.

The five-prompt sequence

Run these in order, in one conversation, pasting the design doc once at the top.

Steelman. “Steelman this design. Give the 3 strongest reasons it is the right call, each tied to a specific constraint in the doc.” Forcing the model to defend the design first prevents one-sided takedowns later.
Devil’s advocate. “Now play devil’s advocate. Find the 5 biggest weaknesses. For each: name the failure mode, state when it would trigger, and estimate the cost when it does. Be specific — no ‘this could be slow.’”
Minimal mitigations. For each weakness: “What is the smallest mitigation that does NOT change the overall architecture?” This separates fixable concerns from architectural blockers.
Alternative architectures. “List 2 alternative architectures I should have considered, each with explicit tradeoffs against my proposal.” If it surfaces an option you genuinely hadn’t weighed, the review just earned its keep.
Failure scenarios (high-stakes designs). “Walk through 3 realistic failure scenarios. What state could become inconsistent during a partition, restart, or partial failure? Where does retry logic get fragile?”

Then update the doc with the mitigations, the rejected alternatives (with reasons), and a “considered failures” section. That updated doc is what goes to a human reviewer.

Targeted prompts by design type

Swap in the one that fits after step 4:

Data-model designs: “Walk through 5 realistic queries against this schema. Which require joins or denormalization that aren’t in the design? Where do you see N+1 patterns?”
Distributed systems: “Where can this design partially fail? What state can become inconsistent during a network partition or a node restart mid-write?”
API designs: “Generate 3 example calls that look correct but violate an unstated assumption in the implementation.”
Migrations: “Walk the cutover step by step. At which step is the system in a state where both the old and new path are live? What breaks if the migration pauses there for an hour?”

Run it in Claude Code for live code

If the design touches an existing codebase, run the review inside Claude Code’s plan mode instead of a chat window. Plan mode is read-only: Claude reads the relevant files, asks clarifying questions, and produces a step-by-step plan with all write tools disabled until you approve it. That means the same steelman → attack sequence runs against your actual code, not just a prose summary — so the model can spot, for example, that your “stateless service” design actually reads a module-level cache.

For repeat use, save the five-prompt sequence as a .claude/agents/architecture-reviewer.md subagent so any teammate can invoke the same structured critique. See the AI agent code review workflow for the subagent setup pattern.

Quality check on the critique

Did the model find issues you genuinely hadn’t considered, or restate what you’d already documented? The latter is fine but low value.
Is each weakness verifiable? “This could be slow” is vibes; “this issues N+1 queries once a user has more than 50 items” is testable.
Did the alternatives have real tradeoffs, or were they strictly-worse options padded in to make yours look good? Strictly-worse alternatives are noise.
Are the mitigations actually minimal, or did the model quietly redesign the system? Push back on creeping redesign — hold it to “comment, don’t rewrite.”

Calibrate it on a shipped design first

Before you trust this on a live decision, run it once on a design you already shipped and know the outcome of. Compare the model’s predicted weaknesses against what actually went wrong in production: did it catch the real issues, or fixate on theoretical ones? Note which prompt phrasings produced sharp versus vague critique. After ~10 reviews you’ll see the pattern — reasoning models catch race conditions, missing failure modes, and schema-evolution traps well, and miss organizational/political constraints almost entirely (they can’t see your roadmap or your on-call rotation).

Common mistakes

Asking “is this design good?” — you get yes-and-fluff. Use the steelman-then-attack sequence.
Using a speed model (GPT-5.5 Instant, Sonnet 4.6 with thinking off) for the attack step. Critique stays surface-level. Turn thinking up.
Letting the AI redesign instead of critique. Hold it to “comment, don’t rewrite” until you’ve heard the full critique.
Skipping the steelman step. Without it you get one-sided takedowns that miss the design’s real strengths and over-correct it.
Treating AI critique as authority. It surfaces issues; you decide which matter, given context the model can’t see.
Running this after you’ve written the code. Sunk-cost bias rejects every critique. Do it before code.
Handing humans the raw AI critique. Give them the sharpened design with mitigations baked in — that’s the point.

FAQ

Which model should I use?: As of June 2026, Claude Opus 4.7 with adaptive thinking at high/xhigh effort, or GPT-5.5 Thinking on Extended thinking time. Both reason before answering. Speed-tier models (GPT-5.5 Instant, Claude Sonnet 4.6 without thinking) give weaker critique — fine for drafting the doc, not for attacking it.
Does this replace human design review?: No, it’s a pre-filter. Your senior teammate’s time goes much further on a doc that already survived AI critique and arrives with mitigations and rejected alternatives noted.
What if the AI’s critique is wrong?: Often it will be, and that’s fine. Wrong critique still surfaces an assumption worth documenting. Just don’t apply mitigations for issues that aren’t real.
How long does it take?: 20-40 minutes per design with a reasoning model — the thinking modes are slower than Instant. Against weeks of refactoring, it’s the best ROI in the toolkit.
Can I skip the steelman step to save time?: Don’t. Without it the critique skews one-sided and you’ll over-correct a design that was mostly fine.
Plus or Pro for ChatGPT?: Plus ($20/mo) is enough — it includes up to 3,000 GPT-5.5 Thinking messages/week. The $200 Pro plan only matters if your design docs are large enough to need the full 1M-token context.

Tags: #AI coding #Tutorial #Workflow

TL;DR

Which model to use (June 2026)

When to run this

Before you start

The five-prompt sequence

Targeted prompts by design type

Run it in Claude Code for live code

Quality check on the critique

Calibrate it on a shipped design first

Common mistakes

FAQ

Related

Related Articles

AI Changelog Generation: From Commits to a Release Note Humans Read

AI-Assisted Database Migrations — Reversible, Backfilled, Tested

AI for Incident Postmortems Without Sanitizing the Lessons

AI Merge Conflict Resolution: When to Trust the Auto-Merge

AI On-Call Debugging: From Page to Fix Without Panic

AI PR Descriptions: From Diff to Reviewable