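Negative Review Analysis Prompts: Root-Cause Clustering Templates

15 AI prompts that cluster 1-2 star reviews by root cause, separate symptoms from the real problem, and surface the 3 fixes most likely to lift your rating. Updated June 2026.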

Published: 2026/05/19 Updated: 2026/06/14 Author: AI Productivity Guide Team

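A 1-2 star review is almost never about what it says on the surface. A user writes "keeps crashing on my phone", and the real root cause is often a login-flow regression in last week's release. The 15 prompts below cluster by root cause (not by topic), tie each cluster to a product module, separate one-off bug spikes from chronic patterns, and produce a fix priority list you can line up against the next sprint.

Why this is worth doing properly: the numbers are brutal. Most apps that fall below 4.0 stars never recover, and a 0.5-star drop can cut install conversion nearly in half. Yet apps that reply to 30%-50% of reviews average 3.77 stars, while apps with a reply rate under 1% average only 3.25 (AppFollow, 2026). In other words, public replies are half of rating recovery and analysis is the other half.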

TL;DR

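Paste raw 1-2 star reviews into the prompts below and the AI clusters them by root cause rather than surface topic.
Start with Prompt 1 (root-cause clustering) and Prompt 2 (release correlation): most rating drops trace back to a specific release.
Use Claude (Opus 4.7 / Sonnet 4.6, 1M-token context) or Gemini 3.1 Pro (1M) to take in the whole export at once; ChatGPT Plus has roughly 320 pages of in-app context, so large CSVs need batching.
Recent reviews carry the most weight: both stores weight roughly the last 90 days most heavily, so fix the root cause first, then drive new reviews to dilute the old ones.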

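Which model to run this on (June 2026)

Review analysis is a classic long-context job: you paste in hundreds to thousands of rows at once, then ask the model to cluster across all of them. Choose the model by how much you can fit into a single pass.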

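Model	Standard context	Best-fit review scenario	Notes
Claude Opus 4.7 / Sonnet 4.6	1M tokens	Whole CSV in one pass; nuanced root-cause clustering	Sonnet 4.6 is the cheaper workhorse; hand the hardest root-cause calls to Opus 4.7
Gemini 3.1 Pro	1M tokens	Large exports; included in Google AI Pro ($19.99/month)	Strong table output
ChatGPT (GPT-5.5, Plus $20/month)	~320 pages in-app	Small batches; quick interactive triage	Full 1M context only on the $200 Pro tier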

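Rule of thumb: a CSV of roughly 5,000-8,000 full-text reviews fits in a single 1M-token Claude or Gemini session, which spares you the "analyze the first 500 first" compromise. Anything larger should be split by month or by release window; Prompt 2 below is designed to run release by release anyway.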
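To make the rule of thumb concrete, here is a minimal Python sketch of the "one pass if it fits, otherwise split by month" decision. The 4-characters-per-token ratio and the function names are assumptions for illustration, not a real tokenizer or any vendor's API; actual token counts vary by model and language.

```python
from collections import defaultdict
from datetime import date

# Rough heuristic (assumption): about 4 characters per token.
# Real tokenizers differ by model and by language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def plan_batches(
    reviews: list[tuple[date, str]], budget_tokens: int
) -> dict[str, list[str]]:
    """Return a single batch if everything fits the budget,
    otherwise one batch per calendar month (keyed 'YYYY-MM')."""
    total = sum(estimate_tokens(text) for _, text in reviews)
    if total <= budget_tokens:
        return {"all": [text for _, text in reviews]}

    batches: dict[str, list[str]] = defaultdict(list)
    for day, text in reviews:
        batches[day.strftime("%Y-%m")].append(text)
    return dict(batches)
```

A single month can still exceed the budget after splitting, so it is worth re-checking each batch and falling back to release windows when one does. Leave headroom in `budget_tokens` for the prompt itself and for the model's answer.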

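Who this is for, and when not to use it

Written for mobile app PMs, support leads at app studios, growth teams watching rating velocity, and founders recovering from a bad release.

Do not use these prompts for an app with fewer than 50 reviews; read them one by one instead. Do not use them for one-off trolls or extortion reviews either: that is a job for moderation and reporting, not analysis.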

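Six elements every review-analysis prompt needs

Output quality depends on whether the prompt spells out these six items:

Role: who the AI should act as (senior PM / indie founder / product designer / indie developer / head of growth).
Context: stage (idea / MVP / growth / scale), team size, traffic or ARR, platform (web / iOS / Android), audience, constraints.
Goal: one concrete deliverable, such as a cluster table, a fix priority list, or a set of reply templates.
Constraints: timeline (this sprint / this quarter), scope to cut, flows that must not be touched (billing, login, compliance).
Output format: a table, a checklist, ticket-ready JSON, or labeled paragraphs that paste straight into Linear / Notion / Jira.
Examples / signals: 1-2 reviews whose root cause you already know, plus 1 you find ambiguous, so the model can calibrate its judgment.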
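If you assemble prompts in a script rather than by hand, the six elements map naturally onto a small data structure. The sketch below is one possible shape; the `ReviewPrompt` class and its field names are hypothetical and exist only to show that a prompt cannot be rendered until every element is filled in.

```python
from dataclasses import dataclass


@dataclass
class ReviewPrompt:
    """Hypothetical container for the six elements listed above."""
    role: str
    context: str
    goal: str
    constraints: str
    output_format: str
    examples: str

    def render(self, reviews: str) -> str:
        """Join the six labeled sections, then append the pasted reviews."""
        sections = [
            ("Role", self.role),
            ("Context", self.context),
            ("Goal", self.goal),
            ("Constraints", self.constraints),
            ("Output format", self.output_format),
            ("Examples", self.examples),
        ]
        body = "\n".join(f"{label}: {value}" for label, value in sections)
        return f"{body}\n\nReviews:\n{reviews}"
```

Because every field is required, leaving one out raises a `TypeError` at construction time, which is a cheap guard against sending an underspecified prompt.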

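When to run a full review sweep

Investigating a post-release rating drop (within 48 hours of the spike)
Quarterly rating-velocity review
Pre-launch risk assessment against competitors' recent 1-star themes
Deriving the roadmap from real user pain points
Prioritizing the burn-down of critical bugs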

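15 copy-ready prompt templates

1. Root-cause clustering (not topic clustering)

The core template. It forces grouping by cause and effect, not by surface wording.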

You are a product analyst. Below are {N} 1-2 star reviews of {app}. Cluster by ROOT CAUSE, not by topic. Same root cause may manifest as different complaints; same complaint may have different root causes. For each cluster: count, hypothesized root cause, 3 representative verbatim, suggested verification (logs, code area, recent release).

Reviews: {paste}

Variables to replace: N, reviews, app

Refinement: if the clusters still look like topic clusters, append: "Each cluster name must be a hypothesis containing a verb ('login flow regressed after auth refactor'), not a noun phrase ('login issues')."

2. Release impact correlation

Below are 1-2 star reviews for the last 90 days, with timestamps. Map them to our recent releases ({list with dates}). For each release: review count spike, dominant complaint, hypothesized regression. Identify any release that triggered a sustained spike.

Reviews: {paste}
Releases: {paste}
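Before pasting into Prompt 2, it can help to pre-bucket reviews by release yourself, so you have counts to compare against the model's answer. The following is a minimal sketch, assuming each review is attributed to the most recent release on or before its date. That attribution rule is an assumption: users on older builds will be miscounted, and the function names are illustrative.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date


def assign_release(review_date: date, releases: list[tuple[date, str]]):
    """Return the version live on review_date, or None if the review
    predates the first release. `releases` must be sorted by date."""
    release_dates = [d for d, _ in releases]
    idx = bisect_right(release_dates, review_date)
    return releases[idx - 1][1] if idx else None


def counts_per_release(
    review_dates: list[date], releases: list[tuple[date, str]]
) -> Counter:
    """Count 1-2 star reviews attributed to each release."""
    return Counter(assign_release(d, releases) for d in review_dates)
```

A release whose count jumps well above its neighbors is a candidate regression; a count that stays flat across releases points toward a chronic issue instead.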

3. Crash / missing feature / UX friction bucketing

Classify each of these 1-2 star reviews into: crash / data-loss, missing feature, UX friction, pricing complaint, support complaint, abuse / spam. For each bucket, count and % of total. Output a 6-row table with examples per bucket.

Reviews: {paste}

4. Persona × root-cause matrix

Below are reviews tagged with inferred persona (free / paid / new / power user). Cluster by root cause, then show distribution across personas. Highlight any root cause that disproportionately affects paid users — those move revenue.

Reviews: {paste}

5. Reconstructing the "story behind the rating"

For each of these 5 representative reviews, reconstruct the likely user story: what they were trying to do, where it broke, what they tried next, what made them rate 1 star. Mark each step with confidence level. This becomes empathy fuel for the team.

Reviews: {paste}

6. Severity scoring

For each root-cause cluster, score severity on 4 axes: (1) frequency of occurrence, (2) impact when it occurs (annoyance / blocker / data loss), (3) user segment affected, (4) recoverability. Output a 4-column severity table.

Clusters: {paste}
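If you want a single sortable number out of the four axes, one option is a simple multiplicative score. The sketch below is purely illustrative: the impact weights, the paid-user multiplier, and the 1.5x penalty for unrecoverable issues are assumptions, not values from any benchmark, and should be tuned against your own data.

```python
# Hypothetical weights for the impact axis (assumption).
IMPACT_WEIGHT = {"annoyance": 1, "blocker": 2, "data loss": 3}


def severity_score(
    frequency: float, impact: str, paid_share: float, recoverable: bool
) -> float:
    """Combine the four axes into one number.

    frequency:   share of 1-2 star reviews in this cluster (0 to 1)
    impact:      'annoyance', 'blocker', or 'data loss'
    paid_share:  share of affected users who are paying (0 to 1)
    recoverable: whether the user can get back to a working state
    """
    score = frequency * IMPACT_WEIGHT[impact] * (1 + paid_share)
    return score if recoverable else score * 1.5
```

The score is only useful for ranking clusters against each other; the absolute value carries no meaning.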

7. Fix prioritization (sprint-ready)

From this analysis of 1-2 star reviews, produce the 5 fixes most likely to lift the rating in 8 weeks. For each: estimated effort, expected rating impact, dependencies, success metric. Mark any "fix" that is actually a comms issue (not a real bug).

Analysis: {paste}

8. False-positive filtering

Some of these reviews report bugs that are not real bugs (user error, feature exists). For each review: classify as real bug / user error / feature exists / unclear. For "user error" and "feature exists", suggest a help-center or in-product fix.

Reviews: {paste}

9. Rating-velocity dashboard

Design a 6-metric dashboard for rating velocity: avg rating last 7/30/90 days, % of reviews 1-2 star, time-to-respond, %-of-1-2-star with developer reply, % of repeat-complaint themes, post-release rating delta. Define each metric and its alarm threshold.
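Two of the metrics in this prompt, the rolling average rating and the share of 1-2 star reviews, are easy to compute directly from a review export. Here is a minimal sketch, assuming each review is a `(date, stars)` pair; the function name and return shape are illustrative.

```python
from datetime import date, timedelta


def window_stats(reviews: list[tuple[date, int]], today: date, days: int):
    """Return (average rating, share of 1-2 star reviews) over the
    last `days` days up to and including `today`, or None if the
    window contains no reviews."""
    cutoff = today - timedelta(days=days)
    stars = [s for d, s in reviews if cutoff < d <= today]
    if not stars:
        return None
    average = sum(stars) / len(stars)
    low_share = sum(1 for s in stars if s <= 2) / len(stars)
    return average, low_share
```

Calling it with `days` set to 7, 30, and 90 gives the three windows named in the prompt; comparing the 7-day value against the 90-day value is a quick read on whether the rating is trending up or down.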

10. Chronic vs spike patterns

Below are 1-2 star reviews for the last 12 months. For each root cause cluster, classify as: chronic (consistent monthly), spike (concentrated weeks), seasonal (returns periodically). Recommend different response strategies for each pattern.

Reviews: {paste}
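As a sanity check on the model's chronic / spike labels, you can apply a crude rule to the monthly distribution of each cluster. The sketch below is a heuristic with assumed thresholds (half of all reviews in one month counts as a spike, presence in three quarters of the months counts as chronic). It does not attempt to detect seasonality, which needs more than a year of data to judge.

```python
from collections import Counter


def classify_pattern(
    months: list[str],
    total_months: int = 12,
    spike_share: float = 0.5,        # assumed threshold
    chronic_coverage: float = 0.75,  # assumed threshold
) -> str:
    """Classify one cluster from the 'YYYY-MM' month of each review."""
    counts = Counter(months)
    if not counts:
        return "no data"
    # Spike: a single month holds most of the cluster's reviews.
    if max(counts.values()) / len(months) >= spike_share:
        return "spike"
    # Chronic: the complaint shows up in most months of the period.
    if len(counts) / total_months >= chronic_coverage:
        return "chronic"
    return "review manually"
```

Where this rule and the model disagree, read the cluster's reviews yourself before choosing a response strategy.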

11. Locale skew detection

Cluster these 1-2 star reviews by language / locale. For each locale: top 3 complaints. Highlight any locale where the dominant complaint is different from the global pattern — likely a localization or regional issue.

Reviews: {paste}

12. Competitor trigger detection

Scan these 1-2 star reviews for mentions of competitor apps or "{competitor} is better at X". List each mention with context. Output: which competitors users compare us to, on what dimensions, with what frequency. This becomes positioning input.

Reviews: {paste}

13. "The update broke it" patterns

Identify reviews complaining that an update made things worse. For each: which feature/flow they say regressed, when they noticed, whether they will downgrade if possible. Group by version. Recommend whether to roll back or fast-forward.

Reviews: {paste}

14. Per-cluster recovery checklist

For each root cause cluster from this analysis, produce a recovery checklist: (1) immediate fix, (2) prevention work, (3) user comms (review reply template, in-app message, email), (4) PR risk level, (5) owner. Output as a per-cluster card.

Clusters: {paste}

15. Quarterly rating retrospective

Write a quarterly retrospective: starting and ending rating, dominant 1-2 star themes per month, what we fixed, what we missed, what changed in rating velocity. End with 3 thematic bets for next quarter and 1 metric to declare them successful.

Quarter data: {paste}

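Common pitfalls

Clustering by topic ("login issues") instead of root cause ("OAuth refresh fails on iOS 17 after the auth refactor").
Treating a spike caused by one release as a chronic problem.
Handling user error as a bug without verifying it.
Missing locale skew hidden inside global counts.
Acting on a single heated 1-star review while ignoring the cluster as a whole.
Fixing the loudest minority complaint without checking that it actually represents the majority.
Fixing without communicating: the fix matters, and so does the public reply.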

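Turning clusters into rating recovery

Analysis is only half the job. The recovery loop that actually pulls the number back up looks like this: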

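Locate the root cause fast. Run Prompt 1 and Prompt 2, and ship a hotfix within 48 hours of the spike. That window is what separates a 6-week recovery (4.4 → 3.6 → 4.3) from the apps that never come back.
Reply publicly to one representative review per cluster as soon as the fix ships. Apple caps each reply at about 5,970 characters and shows it within 24 hours; Google Play caps each reply at 350 characters but publishes it immediately. Replying to reviews on Google Play yields an average lift of about 0.7 stars.
Refresh the recent window. Both stores weight roughly the last 90 days most heavily, and apps holding 4.5+ over the last 90 days convert at about 1.7 times the rate of apps below 4.0. After a major release that fixes these complaints, you can optionally use Apple's rating reset so the average starts from scratch.
Track rating velocity weekly during recovery, then switch to monthly once it stabilizes.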
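Since the two stores have very different reply limits, a small guard before posting avoids replies being rejected or cut off mid-word. This sketch uses the limits quoted above (about 5,970 characters for Apple, 350 for Google Play). The store keys and the `fit_reply` helper are hypothetical, and counting with Python's `len` is an assumption, since each store may count characters slightly differently.

```python
# Limits as quoted in this guide; treat the Apple figure as approximate.
REPLY_LIMITS = {"app_store": 5970, "google_play": 350}


def fit_reply(text: str, store: str) -> str:
    """Return text unchanged if it fits the store's limit; otherwise
    trim at a word boundary and append an ellipsis."""
    limit = REPLY_LIMITS[store]
    if len(text) <= limit:
        return text
    # Reserve 3 characters for "..." and drop the trailing partial word.
    trimmed = text[: limit - 3].rsplit(" ", 1)[0]
    return trimmed + "..."
```

Trimming is a fallback, not a strategy: for Google Play it is usually better to ask the model for a reply under 350 characters in the first place and use this only as a final check.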

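For the public-reply half, the guide on replying to App Store reviews with AI walks through a reply-by-reply workflow that respects both stores' character limits. For more review-management benchmarks, see AppFollow's 2026 review management guide.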

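Workflow tips

Always pair review analysis with a release-date mapping: most rating drops trace back to a specific release.
Cluster by root cause, not by topic. This is the biggest lever.
Cross-check against support tickets; agreement between the two raises confidence.
Tag every cluster with both severity and frequency; together they determine priority.
Compare global against per-locale results: regional problems hide behind global averages.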

FAQ

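What is the minimum number of reviews for clustering? At least 50 for it to be meaningful. Below 50, read each one by hand; with a sample that small, noise gets mistaken for signal.
Which AI model is best for large review exports? Claude (Opus 4.7 / Sonnet 4.6) and Gemini 3.1 Pro both have a 1M-token context, enough to run a CSV of 5,000-8,000 reviews in one pass. ChatGPT Plus has roughly 320 pages of in-app context, so large files need batching.
How do I tell a bug spike from a UX problem? Bug spikes line up with release dates; UX problems persist across releases. Use Prompt 2 for the mapping.
Is a single heated review worth acting on? Only when it describes a clear bug that other users are probably hitting in silence. Otherwise, wait for a cluster to form.
Can AI predict which fix will lift the rating most? It can estimate, but the real signal is rating velocity in the 4 weeks after the fix. Verify; do not assume.
What if reviews contradict each other? Contradictions usually point to a polarizing feature or a problem specific to one segment. Use Prompt 4 (persona matrix) to split paid vs free and new vs power users.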