Retention experiments are where most teams confuse motion for progress. They ship 6 changes at once, see D7 move 2 points, and cannot tell what caused it. These 15 prompts design single-variable retention tests, define correct cohort windows, calibrate minimum detectable effect, and read out results with statistical honesty. Coverage: D1 activation, D7 habit-formation, D30 sustained-use, segmented retention, and the under-used “kill-feature” experiment.
Who this is for
Growth PMs, retention squad leads, consumer-app founders, and lifecycle marketers running in-product or email experiments.
When not to use these prompts
Skip when DAU is under 1,000 — experiments need power that small samples cannot give. Skip too for one-time purchases or pure transactional products where retention is not the goal.
Prompt anatomy / structure formula
A retention-experiment prompt should always carry six elements:
- Role: who the AI plays (senior PM / solo founder / product designer / indie dev / growth lead).
- Context: stage (idea / MVP / growth / scale), team size, traffic or ARR, platform (web / iOS / Android), audience, constraints.
- Goal: one concrete deliverable — one PRD section, one user-story set, one experiment design, one launch post.
- Constraints: timeline (this sprint / this quarter), scope cuts, must-not-break (existing flows, billing, compliance).
- Output format: table, checklist, ticket-ready JSON, or labeled blocks you can paste straight into Linear / Notion / Jira.
- Examples / signal: 1-2 reference docs or competitors you like, plus 1 anti-example you want to avoid.
Best for
- D1 activation lift design
- D7 habit-loop experiment
- D30 sustained-engagement bets
- Segmented retention rescue
- Quarterly retention bet planning
15 copy-ready prompt templates
1. Single-variable D1 lift experiment
The default. Forces single-variable discipline.
You are a growth PM. Design a D1 retention experiment for {product}: (1) hypothesis (specific behavior change), (2) single variable manipulated, (3) control vs variant, (4) target lift + minimum detectable effect, (5) sample size, (6) duration, (7) primary metric (D1 retention), (8) 3 guardrail metrics, (9) kill criteria. Banned: bundling multiple changes.
Context: {product, current D1, segment, hypothesized cause}
Variables to swap: product, current D1, segment, hypothesized cause
Optimization: If the hypothesis is vague, add: “Rewrite the hypothesis in the form: ‘If we change X for users who Y, D1 retention will increase from A% to B% because Z.’”
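Items (4) to (6) of this template can be sanity-checked before launch. The sketch below is a minimal illustration, not part of the prompt itself: it estimates users per arm with the standard normal approximation for a two-sided two-proportion test, then converts that into a run time. The function names, the 5% significance and 80% power defaults, and the example inputs are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Users needed per arm to detect an absolute lift of `mde` over `baseline`.

    Uses the unpooled normal approximation for a two-sided
    two-proportion z-test. `baseline` and `mde` are fractions (0.40, 0.05).
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


def days_to_fill(n_per_arm, arms, daily_new_users):
    """Days of enrollment needed to fill every arm (excludes the retention window)."""
    return math.ceil(n_per_arm * arms / daily_new_users)


if __name__ == "__main__":
    # Hypothetical inputs: 35% D1 baseline, 3-point MDE, 800 new users per day.
    n = sample_size_per_arm(baseline=0.35, mde=0.03)
    print("users per arm:", n)
    print("enrollment days:", days_to_fill(n, arms=2, daily_new_users=800))
```

The enrollment estimate covers only the time to fill the arms; the retention window itself (1 day for D1) comes on top. Halving the MDE roughly quadruples the required sample, which is why the target lift should be fixed before launch.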
2. D7 habit-loop experiment
Design a D7 retention experiment focused on habit formation. Hypothesis must name: trigger (what brings them back), action (what they do), reward (what they get), investment (what makes the loop sticky). Specify the variable changed in one layer of the loop, with metric definition and guardrails. Duration: at least 21 days.
3. D30 sustained-engagement test
Design a D30 retention experiment. Hypothesis: which user behavior in week 1 predicts D30 retention, and what nudge increases that behavior. Specify the cohort definition, the predictor metric, the intervention, and the success threshold. Note: D30 tests need at least 6 weeks of data and large samples.
4. Cohort definition audit
Below is my proposed cohort for a retention test. Audit it: (1) is the cohort window correct (e.g., new users in the week of Aug 5), (2) is the comparison cohort matched, (3) are external factors controlled (release dates, marketing campaigns), (4) is the cohort size sufficient. Recommend the smallest fix.
Cohort def: {paste}
5. Activation event redefinition
For {product}, define the activation event that best predicts D7 retention. Steps: (1) list 5 candidate events, (2) describe how to test each as predictor, (3) recommend the most predictive one with reasoning. End with the cohort split for the next test.
6. Segmented retention rescue
D7 retention for {segment} is 30% below the global average. Design 3 segment-specific experiments to close the gap. For each: hypothesis, variable, expected lift, why this works for this segment specifically. Mark which one to run first.
7. Notification-cadence test
Design a notification-cadence experiment. Variants: 0 / 1 / 3 / 7 push notifications per week in the first 14 days. Define: variant assignment, primary metric (D14 retention), guardrails (opt-out rate, complaint volume, app rating), winner-call criteria.
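Variant assignment for a multi-arm test like this is commonly done by hashing the user ID, so each user always lands in the same arm without a lookup table. The following is a minimal sketch under stated assumptions: user IDs are strings, and the experiment name and variant labels are hypothetical placeholders.

```python
import hashlib

# Hypothetical labels for the 0 / 1 / 3 / 7 pushes-per-week arms.
VARIANTS = ["push_0", "push_1", "push_3", "push_7"]


def assign_variant(user_id, experiment, variants=VARIANTS):
    """Deterministically map a user to one variant.

    Salting the hash with the experiment name keeps assignments
    independent across experiments that share the same user base.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    print(assign_variant("user-123", "push_cadence_test"))
```

Because the assignment depends only on the user ID and the experiment name, it can be recomputed at read-out time to verify that logged exposure matches intended exposure.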
8. Onboarding-step removal test
Design a counter-experiment where we REMOVE an onboarding step ({specific step}) for half the users. Hypothesis: completion rate rises, D1 retention rises, but {feature adoption} drops. Define how to measure each, and how long to wait before calling the result.
9. Read-out template (statistical honesty)
Below is the result of a retention experiment. Write the read-out: (1) hypothesis tested, (2) sample size achieved, (3) result with confidence interval, (4) whether it crossed the minimum detectable effect, (5) guardrail movement, (6) ship / kill / iterate decision, (7) what we learned even if it failed.
Result data: {paste}
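Items (3) and (4) of the read-out can be computed before handing the data to the prompt. The sketch below is one simple way to do it, assuming a plain Wald interval on the difference of two retention rates; the function names and example counts are illustrative, and other interval methods may suit small samples better.

```python
from statistics import NormalDist


def retention_lift_ci(control_retained, control_n, variant_retained, variant_n,
                      confidence=0.95):
    """Absolute lift (variant minus control) with a Wald confidence interval."""
    p_c = control_retained / control_n
    p_v = variant_retained / variant_n
    lift = p_v - p_c
    std_err = (p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return lift, lift - z * std_err, lift + z * std_err


def readout(control_retained, control_n, variant_retained, variant_n, mde):
    """Summarize the result against a pre-registered MDE (a fraction, e.g. 0.05)."""
    lift, low, high = retention_lift_ci(
        control_retained, control_n, variant_retained, variant_n
    )
    return {
        "lift": lift,
        "ci": (low, high),
        "ci_excludes_zero": low > 0 or high < 0,
        "point_estimate_meets_mde": lift >= mde,
    }


if __name__ == "__main__":
    # Hypothetical counts: 400/1000 retained in control, 460/1000 in variant.
    print(readout(400, 1000, 460, 1000, mde=0.05))
```

Reporting the interval alongside the MDE check keeps the two questions separate: whether the effect is distinguishable from zero, and whether it is large enough to matter.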
10. Pre-mortem of the experiment
Before launching this experiment, run a pre-mortem: 5 reasons it could produce a misleading result (selection bias, seasonality, contamination, ceiling effect, novelty effect). For each: how to detect, how to mitigate. End with the kill criterion that should force an immediate stop.
11. Backlog of retention bets (quarterly)
For {product} with current retention curve {paste}, produce a backlog of 12 retention experiments for next quarter. For each: hypothesis, target metric (D1/D7/D30), estimated effort (S/M/L), expected lift (small/medium/large). Sort by impact / effort.
12. Email-based retention test
Design an email-based retention experiment for {product}: (1) trigger condition (e.g., 3 days since last login), (2) email variants (control = no email, variant A = soft nudge, variant B = personalized recommendation), (3) success metric (return rate within 7 days), (4) sample size, (5) what would invalidate the result.
13. “Kill the feature” retention test
We suspect {feature} is hurting retention. Design a counter-test: for a random 5% of users, hide the feature entirely. Measure D7 / D14 retention vs control. Define the threshold at which we kill the feature for everyone.
14. Retention curve diagnostic
Below is our retention curve (D0 to D60). Diagnose: where does the steepest drop happen, what behavior change correlates with the cliff, what segment is most affected. Recommend the next experiment to test the diagnosis.
Curve: {paste}
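Locating the steepest drop is simple arithmetic that can be done before pasting the curve. A minimal sketch, assuming the curve is a mapping from day number to the fraction of the cohort still retained, and measuring steepness as absolute retention lost per day; the sample curve is invented for illustration.

```python
def steepest_drop(curve, start_day=0):
    """Return (from_day, to_day, loss_per_day) for the steepest segment.

    `curve` maps day number -> retained fraction. Set start_day=1 to
    ignore the D0-to-D1 drop, which usually dominates any curve.
    """
    days = sorted(d for d in curve if d >= start_day)
    worst = None
    for a, b in zip(days, days[1:]):
        loss_per_day = (curve[a] - curve[b]) / (b - a)
        if worst is None or loss_per_day > worst[2]:
            worst = (a, b, loss_per_day)
    return worst


if __name__ == "__main__":
    # Invented example curve, D0 to D60.
    curve = {0: 1.00, 1: 0.45, 3: 0.38, 7: 0.30, 14: 0.27, 30: 0.20, 60: 0.18}
    print(steepest_drop(curve))               # includes the D0-to-D1 drop
    print(steepest_drop(curve, start_day=1))  # steepest drop after D1
```

Dividing by the number of days between measurements matters when the curve is sampled unevenly; otherwise a long, slow decline can look steeper than a short cliff.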
15. Multi-experiment dependency mapper
We have 5 retention experiments in flight. Identify: which can run in parallel safely, which contaminate each other, which require sequencing. Output a dependency graph and a recommended schedule for the next 8 weeks.
Experiments: {paste}
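A rough first pass at the dependency map can be automated. The sketch below assumes, as a simplification, that two experiments contaminate each other exactly when they touch a shared surface, and it groups non-conflicting experiments into sequential waves with a greedy pass. The experiment names and surfaces are hypothetical, and real contamination (shared metrics, overlapping segments) needs human review.

```python
from itertools import combinations


def find_conflicts(experiments):
    """Pairs of experiments that touch at least one common surface.

    `experiments` maps experiment name -> set of surfaces it changes.
    """
    return {
        frozenset((a, b))
        for a, b in combinations(experiments, 2)
        if experiments[a] & experiments[b]
    }


def schedule_waves(experiments):
    """Greedily place each experiment in the first wave with no conflict."""
    conflicts = find_conflicts(experiments)
    waves = []
    for name in experiments:
        for wave in waves:
            if all(frozenset((name, other)) not in conflicts for other in wave):
                wave.append(name)
                break
        else:
            waves.append([name])  # conflicts with every existing wave
    return waves


if __name__ == "__main__":
    experiments = {
        "onboarding_copy": {"onboarding"},
        "push_cadence": {"push"},
        "checklist": {"onboarding", "home"},
        "email_nudge": {"email"},
        "home_feed": {"home"},
    }
    for i, wave in enumerate(schedule_waves(experiments), start=1):
        print(f"wave {i}: {wave}")
```

Experiments in the same wave can run in parallel; each later wave starts after the previous one reads out. The greedy pass depends on input order, so listing the highest-priority experiments first puts them in the earliest wave.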
Common mistakes
- Bundling 3+ changes in one experiment — cannot tell what caused the lift.
- Calling a 1-point lift a “win” without checking statistical significance.
- Forgetting guardrails — improving D1 while spiking opt-out rate is a loss.
- Cohort window too short — D30 tests need 6+ weeks, no shortcuts.
- Mixing new-user and existing-user cohorts in the same readout.
- Ignoring novelty effects — a 4-week test can hide that the lift faded by week 6.
- Acting on a single experiment instead of replicating the result.
How to push results further
- Always state the hypothesis as “if X, then Y by Z%, because W” before launching.
- Reserve at least 30% of users as a never-touched control group to detect cross-experiment contamination.
- Read out experiments with confidence intervals, not just point estimates.
- Pair every experiment with a kill criterion — without it, runs go on forever.
- Replicate any “win” on a fresh cohort before shipping to 100%.
- Pre-define your minimum detectable effect; deciding after the fact is statistical cheating.
- Track cumulative learnings in a /retention-experiments doc — most teams forget what they tested last quarter.
FAQ
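- How big a sample do I need?: Depends on the baseline rate and minimum detectable effect. As a reference point, detecting a 5-point lift on a 40% base rate at 80% power and 5% two-sided significance takes roughly 1,500 users per arm; 1,000 per arm leaves the test underpowered.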
- Can I run multiple retention tests at once?: Only if they target different surfaces and each test is randomized independently of the others. Use template 15 to map dependencies.
- D7 or D30 — which matters more?: D7 predicts D30 once you cross product-market fit. Before that, D1 is the most useful test.
- How do I tell novelty effects from real lifts?: Extend the test to 2x its original duration; if the lift fades in the second half, it was novelty.
- What if my sample is too small?: Either combine cohorts over time (risky for changing products) or pre-commit to a directional read with caveats.
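The novelty-effect check from the FAQ can be sketched as a comparison of the lift in the first and second halves of an extended test. This is an illustrative heuristic rather than a formal test: the function names, the per-week input format, and the 50% fade threshold are all assumptions.

```python
def weekly_lifts(control, variant):
    """Absolute lift for each week of the test.

    `control` and `variant` are equal-length lists of (retained, n)
    tuples, one tuple per week.
    """
    return [
        v_ret / v_n - c_ret / c_n
        for (c_ret, c_n), (v_ret, v_n) in zip(control, variant)
    ]


def lift_faded(lifts, fade_ratio=0.5):
    """True if the average lift in the later half fell below
    `fade_ratio` times the average lift in the earlier half."""
    if len(lifts) < 2:
        raise ValueError("need at least two weeks of lifts")
    half = len(lifts) // 2
    early = sum(lifts[:half]) / half
    late = sum(lifts[half:]) / (len(lifts) - half)
    return early > 0 and late < fade_ratio * early


if __name__ == "__main__":
    # Invented weekly counts for a 4-week test that was extended from 2 weeks.
    control = [(400, 1000)] * 4
    variant = [(460, 1000), (450, 1000), (415, 1000), (405, 1000)]
    print(lift_faded(weekly_lifts(control, variant)))
```

A fading lift is a reason to rerun on a fresh cohort before shipping, not proof of novelty on its own; weekly lifts are noisy, so the per-week samples still need to be large enough to read.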