User Retention Experiment Prompts for D1, D7, D30 Lifts

15 retention-experiment prompts that design single-variable D1/D7/D30 tests, size them against 2026 benchmarks, separate real lifts from novelty noise, and read out results with statistical honesty.

Published: May 19, 2026 | Updated: Jun 14, 2026 | Author: AI Productivity Guide Team

Most retention work confuses motion for progress. A team ships six changes in one release, watches D7 move two points, and cannot say what caused it. These 15 prompts force the opposite discipline: one variable per test, a cohort window that holds up, a minimum detectable effect set before launch, and a read-out that admits when a result is noise. Coverage spans D1 activation, D7 habit formation, D30 sustained use, segment rescue, and the under-used “kill the feature” test.

TL;DR

Run one variable per experiment. Bundled changes make attribution impossible.
Size the test before you launch, not after the fact: fix the baseline and the minimum detectable effect (MDE), then compute the required sample with an A/B sample-size calculator.
Anchor “good” to your category. Median app retention as of June 2026 sits near D1 26% / D7 13% / D30 7%, but a healthy fintech D30 looks nothing like a healthy e-commerce D30.
Route the analytical prompts (read-out, pre-mortem, dependency mapping) to a reasoning model — GPT-5.5 Thinking, Claude Opus 4.7, or Gemini 3.1 Pro. Use a fast model for drafting hypotheses.
Pair every experiment with a kill criterion and replicate any win on a fresh cohort before shipping to 100%.

Who this is for

Growth PMs, retention-squad leads, consumer-app founders, and lifecycle marketers running in-product or email experiments.

When not to use these prompts

Skip them under roughly 1,000 daily active users — small samples cannot give a retention test enough statistical power, and you will chase ghosts. Skip them too for one-time purchases or pure transactional products where return visits are not the point.

Retention benchmarks to anchor your targets (June 2026)

Set your target lift against your category, not a global average. These are cross-app medians; your own historical curve is always the better baseline.

Category	D1	D7	D30
All apps (median)	~26%	~13%	~7%
Gaming	29-33%	~16%	~9%
Fintech	22-30%	~18%	~9%
E-commerce	18-25%	~11%	~5%
News	up to ~36% (iOS D1)	—	—

Figures are pooled industry medians as of June 2026 and drift over time; treat them as sanity checks, not goals. A 5% D30 that would alarm a fintech team can be perfectly healthy for an e-commerce app, because each product fits into users’ lives differently.

Which model to run these on

Hypothesis drafting and template 11 backlogs: a fast tier (GPT-5.5 Instant, Claude Sonnet 4.6) is enough and cheaper.
Read-out, pre-mortem, dependency mapping (templates 9, 10, 15): use a reasoning model — GPT-5.5 Thinking, Claude Opus 4.7, or Gemini 3.1 Pro — because these require chained statistical judgment, not pattern completion.
Long retention exports pasted inline: Opus 4.7, Sonnet 4.6, and Gemini 3.1 Pro all carry 1M-token context as of June 2026, so you can drop a full cohort table in without truncation. ChatGPT Plus caps in-app context near 320 pages; the full 1M window is on the $200 Pro tier.

Prompt anatomy

A retention-experiment prompt should carry six elements:

Role: who the AI plays (growth PM / solo founder / product analyst / lifecycle marketer).
Context: stage (MVP / growth / scale), DAU or ARR, platform (web / iOS / Android), audience, constraints.
Goal: one concrete deliverable — one experiment design, one read-out, one quarterly backlog.
Constraints: timeline (this sprint / quarter), must-not-break flows (billing, compliance), banned moves (bundling).
Output format: table, checklist, or ticket-ready JSON you can paste into Linear / Notion / Jira.
Examples / signal: 1-2 reference experiments you trust, plus 1 anti-example to avoid.
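
To make the six elements concrete, here is a minimal Python sketch that assembles them into one prompt string. The class name `RetentionPrompt`, its field names, and the sample values (40k DAU, 24% D1) are illustrative assumptions, not part of any tool or template in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class RetentionPrompt:
    """Container for the six prompt elements (all names are illustrative)."""
    role: str
    context: str
    goal: str
    constraints: list[str]
    output_format: str
    examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the six elements into one labeled prompt string."""
        sections = [
            ("Role", self.role),
            ("Context", self.context),
            ("Goal", self.goal),
            ("Constraints", "; ".join(self.constraints)),
            ("Output format", self.output_format),
            ("Examples / signal", "; ".join(self.examples) or "none provided"),
        ]
        return "\n".join(f"{label}: {text}" for label, text in sections)


# Made-up values, only to show the shape of a filled-in prompt.
prompt = RetentionPrompt(
    role="growth PM",
    context="growth stage, 40k DAU, iOS, D1 currently 24%",
    goal="one single-variable D1 experiment design",
    constraints=["this sprint", "do not touch billing", "no bundled changes"],
    output_format="table",
    examples=["anti-example: a release that shipped six changes at once"],
)
print(prompt.render())
```

Keeping the elements as separate fields makes it easy to spot a prompt that is missing its constraints or its output format before you send it.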

15 copy-ready prompt templates

1. Single-variable D1 lift experiment

The default. Forces single-variable discipline.

You are a growth PM. Design a D1 retention experiment for [product]: (1) hypothesis (specific behavior change), (2) single variable manipulated, (3) control vs variant, (4) target lift + minimum detectable effect, (5) sample size per arm, (6) duration, (7) primary metric (D1 retention), (8) 3 guardrail metrics, (9) kill criteria. Banned: bundling multiple changes.

Context: [product, current D1, segment, hypothesized cause]

Variables to swap: product, current D1, segment, hypothesized cause

Optimization: If the hypothesis is vague, add: “Rewrite the hypothesis in the form: ‘If we change X for users who Y, D1 retention will rise from A% to B% because Z.’”

2. D7 habit-loop experiment

Design a D7 retention experiment focused on habit formation. Hypothesis must name: trigger (what brings them back), action (what they do), reward (what they get), investment (what makes the loop sticky). Specify the variable changed in one layer of the loop, with metric definition and guardrails. Duration: at least 21 days.

3. D30 sustained-engagement test

Design a D30 retention experiment. Hypothesis: which user behavior in week 1 predicts D30 retention, and what nudge increases that behavior. Specify the cohort definition, the predictor metric, the intervention, the success threshold. Note: D30 tests need at least 6 weeks of data and large samples.

4. Cohort definition audit

Below is my proposed cohort for a retention test. Audit it: (1) is the cohort window correct (e.g., new users in the week of a fixed start date), (2) is the comparison cohort matched, (3) are external factors controlled (release dates, marketing campaigns), (4) is the cohort size sufficient for the target MDE. Recommend the smallest fix.

Cohort def: [paste]

5. Activation event redefinition

For [product], define the activation event that best predicts D7 retention. Steps: (1) list 5 candidate events, (2) describe how to test each as a predictor, (3) recommend the most predictive one with reasoning. End with the cohort split for the next test.

6. Segmented retention rescue

D7 retention for [segment] is 30% below our global average. Design 3 segment-specific experiments to close the gap. For each: hypothesis, variable, expected lift, why it works for this segment specifically. Mark which one to run first.

7. Notification-cadence test

Design a notification-cadence experiment. Variants: 0 / 1 / 3 / 7 push notifications per week in the first 14 days. Define: variant assignment, primary metric (D14 retention), guardrails (opt-out rate, complaint volume, app-store rating), winner-call criteria.

8. Onboarding-step removal test

Design a counter-experiment where we REMOVE an onboarding step ([specific step]) for half the users. Hypothesis: completion rate rises, D1 retention rises, but [feature] adoption drops. Define how to measure each, and how long to wait before calling the result.

9. Read-out template (statistical honesty)

Below is the result of a retention experiment. Write the read-out: (1) hypothesis tested, (2) sample size achieved per arm, (3) result with 95% confidence interval, (4) whether it crossed the pre-set minimum detectable effect, (5) guardrail movement, (6) ship / kill / iterate decision, (7) what we learned even if it failed.

Result data: [paste]
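
If you want to check the model's arithmetic on items (3) and (4), the interval can be computed by hand. The sketch below assumes a normal-approximation (Wald) interval for the difference of two proportions, which is only one of several reasonable choices. The function name `lift_confidence_interval` and the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(control_retained, control_n,
                             variant_retained, variant_n, confidence=0.95):
    """Return (lift, lower, upper) for variant minus control retention.

    Uses the normal-approximation (Wald) interval for a difference of
    two proportions, which is reasonable when both arms are large.
    """
    p_c = control_retained / control_n
    p_v = variant_retained / variant_n
    se = sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lift = p_v - p_c
    return lift, lift - z * se, lift + z * se


# Made-up counts: 2,000 users per arm, D7 retained 260 vs 310.
lift, low, high = lift_confidence_interval(260, 2000, 310, 2000)
mde = 0.02  # pre-set minimum detectable effect, in absolute points (assumed)
print(f"lift={lift:+.3f}, 95% CI=({low:+.3f}, {high:+.3f})")
print("interval excludes zero:", low > 0 or high < 0)
print("point estimate meets MDE:", lift >= mde)
```

An interval that excludes zero but is wide relative to the MDE is usually an "iterate" or "replicate" result rather than a clean "ship".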

10. Pre-mortem of the experiment

Before launch, run a pre-mortem on this experiment: 5 reasons it could produce a misleading result (selection bias, seasonality, contamination across arms, ceiling effect, novelty effect). For each: how to detect it, how to mitigate it. End with the kill criterion that should force an immediate stop.

11. Backlog of retention bets (quarterly)

For [product] with this retention curve [paste], produce a backlog of 12 retention experiments for next quarter. For each: hypothesis, target metric (D1/D7/D30), estimated effort (S/M/L), expected lift (small/medium/large). Sort by impact / effort.

12. Email-based retention test

Design an email-based retention experiment for [product]: (1) trigger condition (e.g., 3 days since last login), (2) email variants (control = no email, variant A = soft nudge, variant B = personalized recommendation), (3) success metric (return rate within 7 days), (4) sample size per arm, (5) what would invalidate the result.

13. “Kill the feature” retention test

We suspect [feature] is hurting retention. Design a counter-test: for a random 5% of users, hide the feature entirely. Measure D7 / D14 retention vs control. Define the threshold at which we kill the feature for everyone.
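
One common way to pick the "random 5%" is deterministic hashing, so a user stays in the same arm across sessions. The sketch below is an assumption about how you might implement that, not a description of any particular experimentation platform. The function name `in_hidden_arm`, the experiment label, and the user IDs are hypothetical.

```python
import hashlib


def in_hidden_arm(user_id: str, experiment: str, fraction: float = 0.05) -> bool:
    """Deterministically place about `fraction` of users in the hidden-feature arm.

    Hashing the user ID together with the experiment name gives each user
    a stable bucket, and a different split for each experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < fraction


# Hypothetical IDs, only to check that the split lands near 5%.
users = [f"user-{i}" for i in range(100_000)]
hidden = sum(in_hidden_arm(u, "kill-feature-test") for u in users)
print(f"{hidden / len(users):.3%} of users have the feature hidden")
```

Including the experiment name in the hash keeps the same users from landing in the treated arm of every test you run.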

14. Retention curve diagnostic

Below is our retention curve (D0 to D60). Diagnose: where the steepest drop happens, what behavior change correlates with the cliff, which segment is most affected. Recommend the next experiment to test the diagnosis.

Curve: [paste]

15. Multi-experiment dependency mapper

We have 5 retention experiments in flight. Identify: which can run in parallel safely, which contaminate each other, which require sequencing. Output a dependency graph and a recommended schedule for the next 8 weeks.

Experiments: [paste]

How to size the test before you run it

Sample size is set by four inputs: baseline retention rate, the minimum detectable effect you care about (the smallest lift worth shipping), significance level (usually 0.05), and power (usually 0.80). Plug them into Evan Miller’s sample-size calculator before you write a single line of code. A practical anchor: detecting a 5-point lift on a ~40% base rate at 80% power and a two-sided 0.05 significance level needs roughly 1,500 users per arm. The required sample grows with the inverse square of the MDE, so halving the MDE roughly quadruples the sample and stretches the duration to match. Be honest about the smallest effect that would actually change your roadmap.
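
As a cross-check on whatever calculator you use, the four inputs can be plugged into the textbook normal-approximation formula for comparing two proportions. This is a sketch under that assumption; online calculators use slightly different variants and may return somewhat different numbers. The function name `sample_size_per_arm` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Users per arm to detect an absolute lift `mde` over `baseline`.

    Standard normal-approximation formula for comparing two proportions
    with a two-sided test; other calculators may differ slightly.
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


n_five_points = sample_size_per_arm(0.40, 0.05)    # 5-point lift on a 40% base
n_half_mde = sample_size_per_arm(0.40, 0.025)      # same base, half the MDE
print(n_five_points, n_half_mde)
```

Under these assumptions the first call lands at roughly 1,500 users per arm, and halving the MDE pushes the requirement to about four times that.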

Common mistakes

Bundling 3+ changes in one experiment — you cannot tell what caused the lift.
Calling a 1-point move a “win” without a confidence interval or significance check.
Forgetting guardrails — lifting D1 while spiking the opt-out rate is a net loss.
Cohort window too short — D30 tests need 6+ weeks, no shortcuts.
Mixing new-user and existing-user cohorts in the same read-out.
Ignoring novelty effects — a 4-week test can hide that the lift faded by week 6.
Shipping on a single experiment instead of replicating the result.
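
To see why a 1-point move is not automatically a win, here is a small sketch of a pooled two-proportion z-test, one standard significance check among several. The function name `two_proportion_p_value` and the counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(retained_a, n_a, retained_b, n_b):
    """Two-sided p-value for the difference between two retention rates."""
    p_a, p_b = retained_a / n_a, retained_b / n_b
    pooled = (retained_a + retained_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up example: D7 moves from 13% to 14% with 5,000 users per arm.
p_value = two_proportion_p_value(650, 5000, 700, 5000)
print(f"p-value = {p_value:.3f}")
print("significant at 0.05:", p_value < 0.05)
```

Even with 5,000 users per arm, this 1-point move does not clear the 0.05 threshold, which is why the read-out needs the check rather than the raw delta.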

How to push results further

State the hypothesis as “if X, then Y by Z%, because W” before launching.
Hold back at least a 30% never-touched control to detect cross-experiment contamination.
Read out with confidence intervals, not point estimates.
Pair every experiment with a kill criterion — without it, runs go on forever.
Replicate any “win” on a fresh cohort before shipping to 100%.
Pre-define the minimum detectable effect; deciding it after the fact is statistical cheating.
Keep a running log of what you tested and learned — most teams forget last quarter’s experiments by the next one.

FAQ

How big a sample do I need?: It depends on the baseline rate and your minimum detectable effect. A common anchor: roughly 1,500 users per arm to detect a 5-point lift on a ~40% base rate at 80% power and a two-sided 0.05 significance level. Verify with a sample-size calculator for your exact numbers.
What counts as good retention?: It is category-specific. As of June 2026 the cross-app medians sit near D1 26% / D7 13% / D30 7%, but gaming and fintech run higher on later days while e-commerce runs lower. Compare against your category and your own history.
D7 or D30 — which matters more?: D7 reliably predicts D30 once you have product-market fit. Before PMF, D1 activation is the most useful and fastest test to run.
How do I separate novelty effects from real lifts?: Extend the test to roughly 2x your original duration. If the lift fades by week 6, it was novelty, not retention.
Which AI model should I use for the read-out?: A reasoning model — GPT-5.5 Thinking, Claude Opus 4.7, or Gemini 3.1 Pro — because read-outs (template 9) chain statistical judgment. Fast tiers are fine for drafting hypotheses and backlogs.
What if my sample is too small?: Either combine cohorts over time (risky for a fast-changing product) or pre-commit to a directional read with explicit caveats and no ship decision.
