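User Retention Experiment Prompts: D1 / D7 / D30 Templates

15 retention experiment prompts: design single-variable D1 / D7 / D30 tests, size samples against 2026 industry benchmarks, separate real lift from novelty noise, and read out results with statistical honesty.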

Published: 2026/05/19 Updated: 2026/06/14 Author: AI Productivity Guide Team

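Most retention work mistakes busyness for progress: a release ships six changes, D7 rises two points, and nobody can say which change moved it. The 15 prompts below force the opposite discipline: test one variable at a time, use a cohort window that holds up, fix the minimum detectable effect before launch, and admit in the read-out which movements are just noise. They cover D1 activation, D7 habit formation, D30 sustained use, segment-level rescue, and the underrated "cut the feature" experiment.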

TL;DR

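Change one variable per experiment. Bundling changes means giving up attribution.
Size the test before launch: fix the baseline and the minimum detectable effect (MDE), then compute the sample with an A/B sample size calculator. Do not back-fill these after the test has run.
Anchor "good retention" to your category. As of June 2026, median app retention is roughly D1 26% / D7 13% / D30 7%, but a healthy D30 for a finance app and a healthy D30 for an e-commerce app are two different things.
Give the analysis prompts (read-out, pre-mortem, dependency graph) to a reasoning model such as GPT-5.5 Thinking, Claude Opus 4.7, or Gemini 3.1 Pro; a fast model is enough for drafting hypotheses.
Attach a kill criterion to every experiment, and replicate any "win" on a fresh cohort before a full rollout.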

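Who this is for

Growth PMs, retention squad leads, consumer app founders, and lifecycle marketers running in-app or email experiments.

When not to use this

Skip this if you have fewer than about 1,000 daily active users: the sample is too small to give a retention test adequate statistical power, and you will end up chasing phantoms. Skip it too for one-time-purchase or purely transactional products, where return visits were never the goal.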

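Retention benchmarks for anchoring targets (June 2026)

Set target lifts against your category, not the global average. The figures below are cross-app medians; your own historical retention curve is always the more reliable baseline.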

Category	D1	D7	D30
All apps (median)	~26%	~13%	~7%
Games	29-33%	~16%	~9%
Finance	22-30%	~18%	~9%
E-commerce	18-25%	~11%	~5%
News	up to ~36% (iOS D1)	n/a	n/a

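These figures are industry medians as of June 2026 and will drift over time, so treat them as a sanity check, not a target. The same 5% D30 would set off alarms on a finance team yet could be perfectly healthy for an e-commerce app, because each category fits into users' lives differently.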

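Which model to use for these prompts

Drafting hypotheses and the backlog in template 11: a fast model (GPT-5.5 Instant, Claude Sonnet 4.6) is enough, and cheaper.
Read-outs, pre-mortems, and dependency graphs (templates 9, 10, 15): use a reasoning model (GPT-5.5 Thinking, Claude Opus 4.7, or Gemini 3.1 Pro), because these call for chained statistical judgment, not pattern completion.
Pasting long retention exports directly: as of June 2026, Opus 4.7, Sonnet 4.6, and Gemini 3.1 Pro all offer a 1M-token context window, so a full cohort table can be pasted in without truncation. In the ChatGPT Plus app the context limit is about 320 pages; the full 1M window is on the $200 Pro tier.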

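Prompt structure

A retention experiment prompt usually carries these six elements:

Role: who the AI should act as (growth PM / solo founder / product analyst / lifecycle marketer).
Context: stage (MVP / growth / scale), DAU or ARR, platform (web / iOS / Android), audience, constraints.
Goal: one concrete deliverable, such as an experiment design, a read-out, or a quarterly backlog.
Constraints: timeline (this sprint / this quarter), flows that must not change (billing, compliance), banned moves (bundled changes).
Output format: a table, a checklist, or ticket-level JSON that pastes straight into Linear / Notion / Jira.
Examples / signals: 1-2 reference experiments you trust, plus 1 counter-example to avoid.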
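
The six elements above can be assembled mechanically, which helps keep every prompt complete. Below is a minimal sketch, assuming a hypothetical `RetentionPrompt` helper that is not part of any library; the field names and the sample values are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class RetentionPrompt:
    """Hypothetical container for the six prompt elements."""
    role: str
    context: str
    goal: str
    constraints: list[str]
    output_format: str
    examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        # One labeled line per element, so a missing element is easy to spot.
        parts = [
            f"You are a {self.role}.",
            f"Context: {self.context}",
            f"Goal: {self.goal}",
            "Constraints: " + "; ".join(self.constraints),
            f"Output format: {self.output_format}",
        ]
        if self.examples:
            parts.append("Examples / signals: " + "; ".join(self.examples))
        return "\n".join(parts)


# Illustrative values only; replace with your own product details.
prompt = RetentionPrompt(
    role="growth PM",
    context="growth stage, 20k DAU, iOS, consumer audience",
    goal="one D1 retention experiment design",
    constraints=["this sprint", "do not touch billing", "no bundled changes"],
    output_format="table",
    examples=["reference: onboarding checklist test", "avoid: six-change release"],
)
print(prompt.render())
```

Keeping the banned moves in `constraints` means the single-variable rule travels with every prompt instead of relying on memory.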

15 copy-ready prompt templates

1. Single-variable D1 lift

The default. It enforces single-variable discipline.

You are a growth PM. Design a D1 retention experiment for [product]: (1) hypothesis (specific behavior change), (2) single variable manipulated, (3) control vs variant, (4) target lift + minimum detectable effect, (5) sample size per arm, (6) duration, (7) primary metric (D1 retention), (8) 3 guardrail metrics, (9) kill criteria. Banned: bundling multiple changes.

Context: [product, current D1, segment, hypothesized cause]

Replaceable variables: product, current D1, segment, hypothesized cause

Tuning tip: if the hypothesis is vague, append: "Rewrite the hypothesis in the form: 'If we change X for users who Y, D1 retention will rise from A% to B% because Z.'"

2. D7 habit-loop experiment

Design a D7 retention experiment focused on habit formation. Hypothesis must name: trigger (what brings them back), action (what they do), reward (what they get), investment (what makes the loop sticky). Specify the variable changed in one layer of the loop, with metric definition and guardrails. Duration: at least 21 days.

3. D30 sustained engagement

Design a D30 retention experiment. Hypothesis: which user behavior in week 1 predicts D30 retention, and what nudge increases that behavior. Specify the cohort definition, the predictor metric, the intervention, the success threshold. Note: D30 tests need at least 6 weeks of data and large samples.

4. Cohort definition audit

Below is my proposed cohort for a retention test. Audit it: (1) is the cohort window correct (e.g., new users in the week of a fixed start date), (2) is the comparison cohort matched, (3) are external factors controlled (release dates, marketing campaigns), (4) is the cohort size sufficient for the target MDE. Recommend the smallest fix.

Cohort def: [paste]

5. Activation event redefinition

For [product], define the activation event that best predicts D7 retention. Steps: (1) list 5 candidate events, (2) describe how to test each as a predictor, (3) recommend the most predictive one with reasoning. End with the cohort split for the next test.

6. Segment-level retention rescue

D7 retention for [segment] is 30% below our global average. Design 3 segment-specific experiments to close the gap. For each: hypothesis, variable, expected lift, why it works for this segment specifically. Mark which one to run first.

7. Push frequency test

Design a notification-cadence experiment. Variants: 0 / 1 / 3 / 7 push notifications per week in the first 14 days. Define: variant assignment, primary metric (D14 retention), guardrails (opt-out rate, complaint volume, app-store rating), winner-call criteria.

8. Counter-test: remove an onboarding step

Design a counter-experiment where we REMOVE an onboarding step ([specific step]) for half the users. Hypothesis: completion rate rises, D1 retention rises, but [feature] adoption drops. Define how to measure each, and how long to wait before calling the result.

9. Read-out template (statistical honesty)

Below is the result of a retention experiment. Write the read-out: (1) hypothesis tested, (2) sample size achieved per arm, (3) result with 95% confidence interval, (4) whether it crossed the pre-set minimum detectable effect, (5) guardrail movement, (6) ship / kill / iterate decision, (7) what we learned even if it failed.

Result data: [paste]

10. Experiment pre-mortem

Before launch, run a pre-mortem on this experiment: 5 reasons it could produce a misleading result (selection bias, seasonality, contamination across arms, ceiling effect, novelty effect). For each: how to detect it, how to mitigate it. End with the kill criterion that should force an immediate stop.

11. Quarterly retention bets backlog

For [product] with this retention curve [paste], produce a backlog of 12 retention experiments for next quarter. For each: hypothesis, target metric (D1/D7/D30), estimated effort (S/M/L), expected lift (small/medium/large). Sort by impact / effort.

12. Email retention test

Design an email-based retention experiment for [product]: (1) trigger condition (e.g., 3 days since last login), (2) email variants (control = no email, variant A = soft nudge, variant B = personalized recommendation), (3) success metric (return rate within 7 days), (4) sample size per arm, (5) what would invalidate the result.

13. "Cut the feature" retention test

We suspect [feature] is hurting retention. Design a counter-test: for a random 5% of users, hide the feature entirely. Measure D7 / D14 retention vs control. Define the threshold at which we kill the feature for everyone.
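
One common way to pick the "random 5% of users" is deterministic hash bucketing, so a user stays in the same arm across sessions. The sketch below assumes string user IDs and a hypothetical experiment name used as a salt; `in_holdout` is an illustrative helper, not an API from any experimentation platform.

```python
import hashlib


def in_holdout(user_id: str, experiment: str, fraction: float = 0.05) -> bool:
    """Return True if this user falls in the holdout that hides the feature."""
    # Salting with the experiment name keeps buckets independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction


# Hypothetical IDs, used only to check that the split is close to 5%.
users = [f"user-{i}" for i in range(20000)]
share = sum(in_holdout(u, "hide-feature-test") for u in users) / len(users)
print(f"holdout share: {share:.3f}")
```

Because assignment depends only on the ID and the salt, it can be recomputed at analysis time without storing a separate assignment table.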

14. Retention curve diagnosis

Below is our retention curve (D0 to D60). Diagnose: where the steepest drop happens, what behavior change correlates with the cliff, which segment is most affected. Recommend the next experiment to test the diagnosis.

Curve: [paste]

15. Multi-experiment dependency graph

We have 5 retention experiments in flight. Identify: which can run in parallel safely, which contaminate each other, which require sequencing. Output a dependency graph and a recommended schedule for the next 8 weeks.

Experiments: [paste]

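How to size the test before launch

Sample size is determined by four inputs: baseline retention, the minimum detectable effect you care about (the smallest lift worth shipping), the significance level (typically 0.05), and power (typically 0.80). Before writing a line of code, plug them into Evan Miller's sample size calculator. A practical anchor: detecting a 5-point lift from a baseline of about 40% at 80% power takes well over 1,000 users per arm, around 1,500 under a standard two-sided test at 0.05 significance. Required sample grows with the inverse square of the MDE, so halving the MDE roughly quadruples both sample size and test duration. Be honest with yourself about the smallest lift that would actually change the roadmap.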
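
For a quick check of the four inputs, here is a minimal sketch using the textbook normal-approximation formula for a two-sided two-proportion test. It is an assumption that this matches your calculator of choice: online calculators use slightly different variants, so expect numbers that agree only approximately. The function names and the 500 new users per day are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Users per arm to detect an absolute lift of mde_abs over baseline."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def days_to_fill(n_per_arm, new_users_per_day, arms=2):
    """Days of enrollment needed, before adding the retention window itself."""
    return math.ceil(n_per_arm * arms / new_users_per_day)


n = sample_size_per_arm(baseline=0.40, mde_abs=0.05)
print("users per arm:", n)
print("enrollment days at 500 new users/day:", days_to_fill(n, 500))
print("users per arm for a 2-point MDE:", sample_size_per_arm(0.40, 0.02))
```

Enrollment time is only part of the duration: a D30 test still needs 30 more days after the last user enters the cohort.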

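Common pitfalls

Changing three or more variables in one experiment, when even two already make attribution impossible.
Declaring a "win" on a one-point move, with no confidence interval and no significance check.
Forgetting guardrails: D1 goes up but churn spikes, and the net result is a loss.
Cohort windows that are too short: D30 needs at least 6 weeks, and there is no shortcut.
Mixing new-user and existing-user cohorts in the same read-out.
Ignoring the novelty effect: a 4-week experiment can hide a lift that has already faded by week 6.
Making the call on a single experiment without replication.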
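
The novelty pitfall above can be screened for by tracking lift per weekly cohort. This is a rough sketch under stated assumptions: the weekly retention rates are made up, the 0.5 decay threshold is an arbitrary illustrative choice, and the check ignores sampling noise, so it complements a confidence interval and does not replace one.

```python
def weekly_lifts(control_rates, variant_rates):
    """Absolute lift (variant minus control) for each weekly cohort."""
    return [v - c for c, v in zip(control_rates, variant_rates)]


def looks_like_novelty(lifts, decay_threshold=0.5):
    """Flag when the average lift in the second half of the test falls below
    decay_threshold times the average lift in the first half."""
    if len(lifts) < 2:
        raise ValueError("need at least two weekly cohorts")
    half = len(lifts) // 2
    early = sum(lifts[:half]) / half
    late = sum(lifts[half:]) / (len(lifts) - half)
    return early > 0 and late < decay_threshold * early


# Hypothetical D7 retention by weekly cohort over a 6-week test.
control = [0.13, 0.13, 0.13, 0.13, 0.13, 0.13]
fading = [0.16, 0.155, 0.15, 0.14, 0.135, 0.13]
sustained = [0.15, 0.15, 0.15, 0.15, 0.15, 0.15]

print("fading variant flagged:", looks_like_novelty(weekly_lifts(control, fading)))
print("sustained variant flagged:", looks_like_novelty(weekly_lifts(control, sustained)))
```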

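How to push results one step further

Before the run, lock the hypothesis in the form "If X, then Y by Z%, because W."
Keep a never-touched control group of at least 30% to detect cross-contamination.
Report confidence intervals in the read-out, not just point estimates.
Attach a kill criterion to every experiment; otherwise it will run indefinitely.
Replicate any "win" on a fresh cohort before rolling out to 100%.
Set the minimum detectable effect in advance; setting it after the fact is statistical cheating.
Keep a running log of what was tested and what was learned. Most teams have forgotten last quarter's experiments by the next quarter.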
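
To make the confidence-interval and pre-set MDE items concrete, here is a minimal sketch that computes the absolute lift with a Wald (normal-approximation) interval. The counts are hypothetical, and the Wald interval is an assumption chosen for simplicity; it can be inaccurate for small samples or rates near 0% or 100%.

```python
import math
from statistics import NormalDist


def lift_with_ci(retained_c, n_c, retained_v, n_v, confidence=0.95):
    """Absolute lift (variant minus control) with a Wald confidence interval."""
    p_c = retained_c / n_c
    p_v = retained_v / n_v
    lift = p_v - p_c
    std_err = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return lift, lift - z * std_err, lift + z * std_err


# Hypothetical counts: 600 of 1,500 control users and 690 of 1,500 variant users retained.
lift, low, high = lift_with_ci(600, 1500, 690, 1500)
mde = 0.05  # fixed before launch, never adjusted after seeing the data

print(f"lift = {lift:+.1%}, 95% CI [{low:+.1%}, {high:+.1%}]")
print("CI excludes zero:", low > 0)
print("point estimate meets pre-set MDE:", lift >= mde)
```

Reporting the interval alongside the MDE check shows whether a result that clears zero is also large enough to matter.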

FAQ

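How large a sample do I need? It depends on your baseline and your minimum detectable effect. A common anchor: detecting a 5-point lift from a baseline of about 40% at 80% power starts at 1,000+ users per arm, and is closer to 1,500 under a standard two-sided test. Always run your own numbers through a sample size calculator.
What counts as good retention? It varies by category. As of June 2026 the cross-app median is about D1 26% / D7 13% / D30 7%, but games and finance run higher in the later days and e-commerce runs lower. Compare against your category and your own history, not the global average.
Does D7 or D30 matter more? Past PMF, D7 reliably predicts D30. Before PMF, D1 activation is the most useful signal and the fastest to read.
How do I tell a novelty effect from a real lift? Extend the test to about twice its original length. If the lift has faded by week 6, it was novelty, not retention.
Which AI model should I use for the read-out? Use a reasoning model (GPT-5.5 Thinking, Claude Opus 4.7, or Gemini 3.1 Pro), because the read-out (template 9) calls for chained statistical judgment. A fast model is fine for drafting hypotheses and the backlog.
What if my sample is too small? Either pool cohorts over time (risky when the product changes quickly), or commit in advance to a directional reading only, state the caveats clearly, and make no ship decision from it.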