Negative App Review Analysis Prompts for Root-Cause Themes

15 AI prompts that cluster 1-2 star reviews by root cause, separate symptoms from real problems, and surface the 3 fixes that lift your rating most. Updated June 2026.

Published: May 19, 2026 Updated: Jun 14, 2026 Author: AI Productivity Guide Team

A 1-2 star review is rarely about what it says on the surface. A user writes “crashes on my phone” when the root cause was a login-flow regression in last week’s release. The 15 prompts below cluster negative reviews by root cause (not by topic), tie each cluster to a product area, separate one-time bug bursts from chronic patterns, and produce a fix-priority list that maps to your next sprint.

This matters because the math is brutal: most apps that fall below 4.0 stars never climb back, and a 0.5-star drop can roughly halve store-page conversion (page view to install). Yet apps that reply to 30-50% of reviews average 3.77 stars versus 3.25 for apps that reply to under 1% (AppFollow, 2026), so the public reply is half the recovery, and analysis is the other half.

TL;DR

Paste your raw 1-2 star reviews into the prompts below; AI clusters them by root cause, not by surface topic.
Start with prompt 1 (root-cause cluster) and prompt 2 (release correlation) — most rating drops trace to a single release.
Use Claude (Opus 4.7 / Sonnet 4.6, 1M-token context) or Gemini 3.1 Pro (1M) to process a full export in one pass; ChatGPT Plus caps in-app context near 320 pages, so batch large CSVs there.
Recent reviews dominate: both stores weight the last ~90 days most, so fix the cause first, then push fresh reviews to dilute the old ones.

Which model to run these on (June 2026)

Review analysis is a long-context job: you paste hundreds or thousands of rows, then ask for clustering across all of them at once. Pick the model by how much you can fit in one pass.

Model	Standard context	Best for review work	Notes
Claude Opus 4.7 / Sonnet 4.6	1M tokens	Full CSV export in one pass; nuanced clustering	Sonnet 4.6 is the cheaper workhorse; Opus 4.7 for the hardest root-cause calls
Gemini 3.1 Pro	1M tokens	Large exports; included in Google AI Pro ($19.99/mo)	Strong at table output
ChatGPT (GPT-5.5, Plus $20/mo)	~320 pages in-app	Smaller batches; quick interactive triage	Full 1M context only on the $200 Pro tier

Rule of thumb: a CSV of roughly 5,000-8,000 full-text reviews fits in a single 1M-token Claude or Gemini session, so you avoid the “analyze the first 500” workaround. For anything larger, split by month or by release window — prompt 2 below already works release-by-release.
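
A rough way to check whether an export fits before you paste it is sketched below. It assumes about 4 characters per token, which is only a common rule of thumb for English text (real tokenizers vary by model and language). The `reviews` structure, the function names, and the 20% headroom are illustrative choices, not part of any vendor API.

```python
from collections import defaultdict

CHARS_PER_TOKEN = 4          # assumption: rough heuristic for English text
CONTEXT_TOKENS = 1_000_000   # the 1M-token window discussed above
HEADROOM = 0.8               # assumption: keep 20% free for the prompt and the answer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; use the vendor's tokenizer for real numbers."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_one_pass(reviews: list[dict]) -> bool:
    """True if the estimated total stays under the context budget."""
    total = sum(estimate_tokens(r["text"]) for r in reviews)
    return total <= CONTEXT_TOKENS * HEADROOM


def split_by_month(reviews: list[dict]) -> dict[str, list[dict]]:
    """Group reviews by 'YYYY-MM', read from an ISO date string."""
    batches = defaultdict(list)
    for r in reviews:
        batches[r["date"][:7]].append(r)
    return dict(batches)


# Hypothetical sample rows; a real export would come from the store console.
reviews = [
    {"date": "2026-04-28", "text": "Crashes on launch since the update."},
    {"date": "2026-05-02", "text": "Login loop, cannot get past the code screen."},
    {"date": "2026-05-09", "text": "Lost my saved drafts."},
]

if fits_in_one_pass(reviews):
    print(f"{len(reviews)} reviews fit in one pass")
else:
    for month, batch in split_by_month(reviews).items():
        print(month, len(batch))
```

If the estimate lands close to the budget, splitting is the safer call, since the heuristic can be off by a wide margin for non-English reviews.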

Who this is for, and when to skip it

Built for mobile app PMs, support leads at app studios, growth teams tracking rating velocity, and founders recovering from a bad release.

Skip these prompts when an app has under 50 reviews — read each one by hand instead. Skip them for one-off troll or extortion reviews too; those are a moderation and reporting job, not an analysis job.

The six elements every review-analysis prompt needs

Strong results come from prompts that spell out all six:

Role: who the AI plays (senior PM / solo founder / product designer / indie dev / growth lead).
Context: stage (idea / MVP / growth / scale), team size, traffic or ARR, platform (web / iOS / Android), audience, constraints.
Goal: one concrete deliverable — one cluster table, one fix-priority list, one reply template set.
Constraints: timeline (this sprint / this quarter), scope cuts, must-not-break flows (billing, login, compliance).
Output format: table, checklist, ticket-ready JSON, or labeled blocks you paste straight into Linear / Notion / Jira.
Examples / signal: 1-2 reviews you already understand the root cause of, plus 1 you find ambiguous, so the model calibrates.
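
If you assemble prompts in a script rather than by hand, a small guard can catch a missing element before the prompt is sent. The sketch below is one possible shape; `ELEMENTS`, `build_prompt`, and the sample values are hypothetical names invented for this illustration.

```python
ELEMENTS = ("role", "context", "goal", "constraints", "output_format", "examples")


def build_prompt(parts: dict[str, str], reviews: str) -> str:
    """Assemble a prompt and refuse to build one with a missing element."""
    missing = [name for name in ELEMENTS if not parts.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing prompt elements: {', '.join(missing)}")
    sections = [
        f"{name.replace('_', ' ').title()}: {parts[name].strip()}"
        for name in ELEMENTS
    ]
    return "\n".join(sections) + "\n\nReviews:\n" + reviews


parts = {
    "role": "Senior PM for a consumer iOS app",
    "context": "Growth stage, 6-person team, rating fell after the latest release",
    "goal": "One root-cause cluster table",
    "constraints": "This sprint only; login and billing must not break",
    "output_format": "Table with count, hypothesis, 3 quotes, verification step",
    "examples": "2 reviews with a known cause, 1 ambiguous review",
}
prompt = build_prompt(parts, "1-star: crashes when I open the app after updating")
print(prompt)
```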

Best moments to run a full review sweep

Post-release rating-drop investigation (within 48 hours of a spike)
Quarterly rating-velocity review
Pre-launch risk assessment against a competitor’s recent 1-star themes
Roadmap input from real-user pain
Critical-bug burn-down prioritization

15 copy-ready prompt templates

1. Root-cause cluster (not topic cluster)

The core template. Forces causal grouping, not surface-word grouping.

You are a product analyst. Below are {N} 1-2 star reviews of {app}. Cluster by ROOT CAUSE, not by topic. The same root cause may manifest as different complaints; the same complaint may have different root causes. For each cluster: count, hypothesized root cause, 3 representative verbatim quotes, suggested verification (logs, code area, recent release).

Reviews: {paste}

Variables to swap: N, reviews, app

Optimization: If clusters look like topic clusters, add: “Each cluster name must be a causal hypothesis built around a verb (‘login flow regressed after auth refactor’), not a noun phrase (‘login issues’).”

2. Release-impact correlation

Below are 1-2 star reviews for the last 90 days, with timestamps. Map them to our recent releases ({list with dates}). For each release: review count spike, dominant complaint, hypothesized regression. Identify any release that triggered a sustained spike.

Reviews: {paste}
Releases: {paste}
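
You can also pre-compute the per-release counts yourself and paste them alongside the reviews, which gives the model a number to check its own mapping against. A minimal sketch, assuming each review is attributed to the latest release shipped on or before its date (a simplification, since a reviewer may still be on an older version); the version numbers and dates are made up.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date

# Hypothetical releases, sorted by ship date.
releases = [
    ("4.1.0", date(2026, 3, 2)),
    ("4.2.0", date(2026, 4, 6)),
    ("4.3.0", date(2026, 5, 11)),
]
# Hypothetical dates of 1-2 star reviews.
review_dates = [
    date(2026, 3, 5),
    date(2026, 4, 7),
    date(2026, 4, 8),
    date(2026, 4, 20),
    date(2026, 5, 12),
]


def release_for(day: date) -> str:
    """Return the latest release shipped on or before `day`."""
    ship_days = [shipped for _, shipped in releases]
    index = bisect_right(ship_days, day) - 1
    return releases[index][0] if index >= 0 else "pre-" + releases[0][0]


per_release = Counter(release_for(d) for d in review_dates)
for version, _ in releases:
    print(version, per_release[version])
```

If your export includes the app version the reviewer was running, prefer that field over the date-based attribution used here.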

3. Crash vs feature-miss vs UX-friction split

Classify each of these 1-2 star reviews into: crash / data-loss, missing feature, UX friction, pricing complaint, support complaint, abuse / spam. For each bucket, count and % of total. Output a 6-row table with examples per bucket.

Reviews: {paste}
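
Percentages in model output are worth recomputing. The sketch below assumes you asked the model to return one bucket label per review and then tallies the six buckets itself; `tally` and the sample labels are illustrative, and any label outside the six buckets is reported rather than silently counted.

```python
from collections import Counter

BUCKETS = [
    "crash / data-loss",
    "missing feature",
    "UX friction",
    "pricing complaint",
    "support complaint",
    "abuse / spam",
]


def tally(labels: list[str]) -> tuple[list[tuple[str, int, float]], list[str]]:
    """Return (bucket, count, percent) rows plus any labels outside the buckets."""
    unknown = sorted(set(labels) - set(BUCKETS))
    counts = Counter(label for label in labels if label in BUCKETS)
    total = sum(counts.values())
    table = [
        (bucket, counts[bucket], round(100 * counts[bucket] / total, 1) if total else 0.0)
        for bucket in BUCKETS
    ]
    return table, unknown


# Hypothetical model output; the last label is not one of the six buckets.
labels = [
    "crash / data-loss",
    "UX friction",
    "crash / data-loss",
    "pricing complaint",
    "crash / data-loss",
    "login",
]
table, unknown = tally(labels)
for bucket, count, percent in table:
    print(f"{bucket}\t{count}\t{percent}%")
print("unrecognized labels:", unknown)
```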

4. Persona × root-cause matrix

Below are reviews tagged with inferred persona (free / paid / new / power user). Cluster by root cause, then show distribution across personas. Highlight any root cause that disproportionately affects paid users — those move revenue.

Reviews: {paste}
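
Once reviews carry a persona tag and a root-cause label, the matrix itself is a simple cross-tab. The sketch below flags a root cause as paid-skewed when its paid share is more than 1.5 times the paid share across all reviews; that 1.5 factor, the labels, and the sample rows are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical (persona, root cause) pairs.
tagged = [
    ("paid", "sync conflict"), ("paid", "sync conflict"),
    ("paid", "sync conflict"), ("free", "sync conflict"),
    ("free", "ad frequency"), ("free", "ad frequency"),
    ("new", "ad frequency"), ("paid", "ad frequency"),
    ("power", "export limit"), ("free", "export limit"),
]

matrix: dict[str, Counter] = defaultdict(Counter)
for persona, cause in tagged:
    matrix[cause][persona] += 1

overall_paid = sum(1 for persona, _ in tagged if persona == "paid") / len(tagged)


def paid_share(cause: str) -> float:
    """Fraction of a root cause's reviews that come from paid users."""
    row = matrix[cause]
    return row["paid"] / sum(row.values())


SKEW_FACTOR = 1.5  # assumption: how far above the baseline counts as skewed
paid_skewed = [c for c in matrix if paid_share(c) > SKEW_FACTOR * overall_paid]
print(paid_skewed)
```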

5. “Story behind the rating” reconstruction

For each of these 5 representative reviews, reconstruct the likely user story: what they were trying to do, where it broke, what they tried next, what made them rate 1 star. Mark each step with confidence level. This becomes empathy fuel for the team.

Reviews: {paste}

6. Severity scoring

For each root-cause cluster, score severity on 4 axes: (1) frequency of occurrence, (2) impact when it occurs (annoyance / blocker / data loss), (3) user segment affected, (4) recoverability. Output a 4-column severity table.

Clusters: {paste}
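
If you want a single number to sort clusters by, the four axes can be folded into one score. The weights and the multiplicative formula below are assumptions picked for this sketch, not an established scale; tune them to your own product.

```python
# Hypothetical ordinal weights for the three qualitative axes.
IMPACT = {"annoyance": 1, "blocker": 2, "data loss": 3}
SEGMENT = {"free": 1, "new": 2, "paid": 3}
RECOVERABILITY = {"self-recovers": 1, "needs support": 2, "unrecoverable": 3}


def severity(cluster: dict, total_reviews: int) -> float:
    """Frequency share multiplied by the three ordinal weights."""
    frequency = cluster["count"] / total_reviews
    score = (
        frequency
        * IMPACT[cluster["impact"]]
        * SEGMENT[cluster["segment"]]
        * RECOVERABILITY[cluster["recoverability"]]
    )
    return round(score, 2)


# Hypothetical clusters out of 100 reviews.
clusters = [
    {"name": "login loop", "count": 40, "impact": "blocker",
     "segment": "paid", "recoverability": "needs support"},
    {"name": "ad frequency", "count": 50, "impact": "annoyance",
     "segment": "free", "recoverability": "self-recovers"},
    {"name": "draft loss", "count": 10, "impact": "data loss",
     "segment": "paid", "recoverability": "unrecoverable"},
]
ranked = sorted(clusters, key=lambda c: severity(c, 100), reverse=True)
for c in ranked:
    print(c["name"], severity(c, 100))
```

Note how the most frequent cluster ranks last here: frequency alone does not decide severity once impact and recoverability are factored in.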

7. Fix-priority list (sprint-ready)

From this analysis of 1-2 star reviews, produce the 5 fixes most likely to lift the rating in 8 weeks. For each: estimated effort, expected rating impact, dependencies, success metric. Mark any "fix" that is actually a comms issue (not a real bug).

Analysis: {paste}

8. False-claim filter

Some of these reviews report bugs that are not real bugs (user error, feature exists). For each review: classify as real bug / user error / feature exists / unclear. For "user error" and "feature exists", suggest a help-center or in-product fix.

Reviews: {paste}

9. Rating-velocity dashboard

Design a 6-metric dashboard for rating velocity: avg rating last 7/30/90 days, % of reviews 1-2 star, time-to-respond, %-of-1-2-star with developer reply, % of repeat-complaint themes, post-release rating delta. Define each metric and its alarm threshold.
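
Three of these metrics can be computed directly from a review export. A minimal sketch, assuming each row has a date, a star rating, and a flag for whether a developer reply exists; the field names and sample data are hypothetical.

```python
from datetime import date, timedelta


def in_window(reviews: list[dict], today: date, days: int) -> list[dict]:
    """Reviews from the trailing `days` days, up to and including today."""
    cutoff = today - timedelta(days=days)
    return [r for r in reviews if cutoff < r["date"] <= today]


def avg_rating(reviews: list[dict]):
    return round(sum(r["stars"] for r in reviews) / len(reviews), 2) if reviews else None


def low_star_share(reviews: list[dict]):
    """Percent of reviews that are 1-2 stars."""
    low = [r for r in reviews if r["stars"] <= 2]
    return round(100 * len(low) / len(reviews), 1) if reviews else None


def low_star_reply_rate(reviews: list[dict]):
    """Percent of 1-2 star reviews that have a developer reply."""
    low = [r for r in reviews if r["stars"] <= 2]
    return round(100 * sum(r["replied"] for r in low) / len(low), 1) if low else None


today = date(2026, 6, 14)
reviews = [
    {"date": date(2026, 6, 12), "stars": 1, "replied": True},
    {"date": date(2026, 6, 10), "stars": 5, "replied": False},
    {"date": date(2026, 5, 30), "stars": 2, "replied": False},
    {"date": date(2026, 4, 1), "stars": 4, "replied": False},
    {"date": date(2026, 1, 15), "stars": 1, "replied": True},
]
for days in (7, 30, 90):
    print(days, avg_rating(in_window(reviews, today, days)))
last_90 = in_window(reviews, today, 90)
print(low_star_share(last_90), low_star_reply_rate(last_90))
```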

10. Chronic vs spike pattern detector

Below are 1-2 star reviews for the last 12 months. For each root cause cluster, classify as: chronic (consistent monthly), spike (concentrated weeks), seasonal (returns periodically). Recommend different response strategies for each pattern.

Reviews: {paste}
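
A quick numeric check on monthly counts can sanity-test the model's chronic/spike labels. The thresholds below (half of all reviews in one month means spike, activity in 9 or more of 12 months means chronic) are assumptions for this sketch. Seasonal patterns need more than one year of data to confirm, so anything else is left as "unclear" for a human to judge.

```python
def classify(monthly_counts: list[int]) -> str:
    """Label a cluster from its 12 monthly review counts."""
    total = sum(monthly_counts)
    if total == 0:
        return "no data"
    peak_share = max(monthly_counts) / total
    active_months = sum(1 for count in monthly_counts if count > 0)
    if peak_share >= 0.5:      # assumption: one month holds at least half
        return "spike"
    if active_months >= 9:     # assumption: present in most months
        return "chronic"
    return "unclear"


# Hypothetical clusters with counts for 12 consecutive months.
print(classify([0, 0, 0, 0, 30, 6, 1, 0, 0, 0, 0, 0]))
print(classify([4, 5, 3, 6, 4, 5, 4, 3, 5, 4, 6, 5]))
print(classify([0, 0, 5, 0, 0, 0, 0, 0, 4, 0, 0, 3]))
```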

11. Localization-skewed pain detection

Cluster these 1-2 star reviews by language / locale. For each locale: top 3 complaints. Highlight any locale where the dominant complaint is different from the global pattern — likely a localization or regional issue.

Reviews: {paste}
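
The locale comparison reduces to checking each locale's top complaint against the global top complaint. A minimal sketch with made-up rows; it assumes complaints have already been labeled, and it only looks at the single top complaint rather than the top 3.

```python
from collections import Counter, defaultdict

# Hypothetical (locale, complaint label) rows.
rows = [
    ("en-US", "crash"), ("en-US", "crash"), ("en-US", "crash"), ("en-US", "pricing"),
    ("de-DE", "crash"), ("de-DE", "crash"), ("de-DE", "translation"),
    ("ja-JP", "payment failed"), ("ja-JP", "payment failed"),
    ("ja-JP", "payment failed"), ("ja-JP", "crash"),
]

by_locale: dict[str, Counter] = defaultdict(Counter)
global_counts: Counter = Counter()
for locale, complaint in rows:
    by_locale[locale][complaint] += 1
    global_counts[complaint] += 1

global_top = global_counts.most_common(1)[0][0]

# Locales whose dominant complaint differs from the global one.
outliers = {}
for locale, counts in by_locale.items():
    local_top = counts.most_common(1)[0][0]
    if local_top != global_top:
        outliers[locale] = local_top

print(global_top, outliers)
```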

12. Competitor-trigger detection

Scan these 1-2 star reviews for mentions of competitor apps or "{competitor} is better at X". List each mention with context. Output: which competitors users compare us to, on what dimensions, with what frequency. This becomes positioning input.

Reviews: {paste}
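
A plain keyword scan is a useful cross-check on the frequencies the model reports, since counts are where language models tend to drift. The competitor names below are invented placeholders; the scan counts each competitor at most once per review and ignores case.

```python
import re
from collections import Counter

COMPETITORS = ["NoteNest", "PaperTrail"]  # hypothetical names; use your own list
pattern = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COMPETITORS) + r")\b",
    re.IGNORECASE,
)
canonical = {name.lower(): name for name in COMPETITORS}


def find_mentions(reviews: list[str]) -> tuple[Counter, list[tuple[list[str], str]]]:
    """Count reviews mentioning each competitor and keep the matching texts."""
    mentions: Counter = Counter()
    hits = []
    for text in reviews:
        found = {canonical[match.lower()] for match in pattern.findall(text)}
        mentions.update(found)
        if found:
            hits.append((sorted(found), text))
    return mentions, hits


reviews = [
    "Switching to NoteNest, their sync just works.",
    "notenest is better at search. NoteNest also has tags.",
    "Too many ads.",
    "PaperTrail exports to PDF, this app cannot.",
]
mentions, hits = find_mentions(reviews)
print(mentions.most_common())
```

The scan misses misspellings and indirect references ("the other notes app"), which is where the prompt above still earns its keep.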

13. Update-broke-things review pattern

Identify reviews complaining that an update made things worse. For each: which feature/flow they say regressed, when they noticed, whether they say they would downgrade if they could. Group by version. Recommend whether to roll back or fix forward.

Reviews: {paste}

14. Recovery-action checklist per cluster

For each root cause cluster from this analysis, produce a recovery checklist: (1) immediate fix, (2) prevention work, (3) user comms (review reply template, in-app message, email), (4) PR risk level, (5) owner. Output as a per-cluster card.

Clusters: {paste}

15. Quarterly rating retrospective

Write a quarterly retrospective: starting and ending rating, dominant 1-2 star themes per month, what we fixed, what we missed, what changed in rating velocity. End with 3 thematic bets for next quarter and 1 metric to declare them successful.

Quarter data: {paste}

Common mistakes

Clustering by topic (“login problems”) instead of root cause (“auth refactor broke OAuth refresh on iOS 17”).
Mistaking a review spike caused by one release for a chronic problem.
Treating user-error reports as bugs without verification.
Ignoring localization-skewed patterns hidden in global counts.
Acting on a single passionate 1-star review instead of the cluster.
Fixing the dominant complaint without checking if it is just the most VOCAL minority.
Skipping comms recovery — the fix matters but the public reply matters too.

Turning clusters into a rating recovery

Analysis is only half the job. A recovery loop that actually moves the number looks like this:

Find the cause fast. Run prompts 1 and 2; ship a hotfix inside 48 hours of the spike — that window is what separated a 6-week recovery (4.4 → 3.6 → 4.3) from apps that never came back.
Reply to a representative review per cluster the moment the fix ships. Apple gives you up to ~5,970 characters per reply and the reply appears within 24 hours; Google Play caps replies at 350 characters but they post immediately. Responding to reviews is associated with roughly a 0.7-star lift on Google Play.
Refresh the recent window. Both stores weight the last ~90 days most, and apps holding 4.5+ over the trailing 90 days convert at about 1.7x the rate of apps under 4.0. After a major version that fixes the complaints, you can opt into Apple’s rating reset to start the running average fresh.
Track velocity weekly during recovery, monthly once stable.
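
When replies are drafted in bulk, a length check against the limits quoted above avoids rejected or truncated posts. The sketch uses Python's `len`, which counts code points; the stores may count some characters (emoji, for example) differently, so treat the result as approximate. The store keys and function name are illustrative.

```python
# Character limits as quoted in the steps above.
REPLY_LIMITS = {"app_store": 5970, "google_play": 350}


def check_reply(store: str, reply: str) -> tuple[bool, int]:
    """Return (fits, characters remaining); remaining is negative when over."""
    limit = REPLY_LIMITS[store]
    return len(reply) <= limit, limit - len(reply)


draft = "Thanks for the report. Version 4.3.1 fixes the login loop. " * 6
for store in REPLY_LIMITS:
    fits, remaining = check_reply(store, draft)
    print(store, "ok" if fits else "too long", remaining)
```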

For the public-reply half, “How to reply to App Store reviews with AI” covers a per-reply workflow that respects both stores’ character limits. For deeper sourcing on review-management benchmarks, see AppFollow’s 2026 review-management guide.

Common workflow tips

Always pair review analysis with release-date mapping; most rating drops trace to a specific release.
Cluster by root cause, not topic — this is the single biggest lever.
Cross-reference review themes with support tickets; convergence raises your confidence.
Tag every cluster with severity AND frequency; both drive prioritization.
Compare global vs locale-specific patterns; regional issues hide inside global averages.

FAQ

How many reviews do I need to cluster? At least 50 for meaningful root-cause clusters. Under 50, read each one individually, because clustering noise looks like signal at small N.
Which AI model handles a big review export best? Claude (Opus 4.7 / Sonnet 4.6) and Gemini 3.1 Pro both carry 1M-token context, enough for a 5,000-8,000-review CSV in one pass. ChatGPT Plus caps in-app context near 320 pages, so batch large files there.
How do I tell a bug spike from a UX problem? Bug spikes correlate with release dates; UX problems persist across releases. Use prompt 2 to map reviews to releases.
Should I act on a single passionate review? Only if it describes a clear bug others might hit silently. Otherwise wait for the cluster to form.
Can AI predict which fix will lift the rating most? It can estimate, but the real signal is post-fix rating velocity in the 4 weeks after release. Verify; do not assume.
What if reviews contradict each other? Contradiction usually means a polarizing feature or a segment-specific issue. Use prompt 4 (persona matrix) to disentangle paid vs free or new vs power users.
