You have a feature idea and three people on Slack arguing about how to test it. Stop. Before you ship the code, you need a one-page test plan that says exactly what you are measuring, when you will stop, and what answer you will accept. AI is great at first-draft test plans if you give it the right inputs.
The task
Produce a one-page A/B experiment plan: hypothesis, primary metric, guardrail metrics, MDE, sample-size sanity check, ramp plan, and stop conditions.
When this is the right job for AI
- You already know the feature change and the rough audience.
- You have a baseline number for the primary metric (current conversion, current D7, etc.).
- You can describe two or three guardrails in plain language.
- You want a brutal “is this even worth running” sanity check.
- You are NOT asking AI to invent statistical software — it sketches the plan, your stats engine runs it.
What to feed the AI
- The feature change in one sentence (“new onboarding step 3 with a goal-picker”)
- Primary metric and current baseline (“D7 retention, currently 22%”)
- 2-3 guardrails (“crash-free rate, IAP revenue per user, day-1 uninstall rate”)
- Traffic volume per day (“12k new installs/day, iOS only”)
- Your minimum detectable effect — or, if you don’t know it yet, ask for a sanity range
- Decision window and product calendar (“must call it inside 21 days, marketing launch on day 25”)
Copy-ready prompt
You are a senior product analyst writing a one-page A/B test plan.
Feature change: a new onboarding step 3 that asks users to pick a goal (sleep, focus, anxiety) before reaching the home screen. Current onboarding has no goal-picker.
Primary metric: D7 retention. Baseline: 22% on iOS.
Guardrails:
- Crash-free session rate (must not drop more than 0.2 pp)
- IAP revenue per new install in week 1 (must not drop more than 5%)
- Day-1 uninstall rate (must not rise more than 1 pp)
Audience: new iOS installs only. 12,000 new installs per day.
Decision window: 21 days max. Marketing launch on day 25, so we cannot extend.
Write the plan in this exact structure:
1. Hypothesis (one sentence, falsifiable). Form: "If we add X, then primary metric Y will move by Z, because mechanism W."
2. Primary metric definition. Include: what counts as a D7-retained user (returning session on calendar day 7 in user-local time, not server UTC). Mention the most likely measurement bug.
3. MDE check. Given baseline 22% and 21-day window with 12k installs/day, what is the smallest effect we can reliably detect at 80% power? Show the math (or estimate).
4. Guardrail thresholds and what we would do if each one trips. Each guardrail gets one sentence: trigger and action.
5. Ramp plan: day 1-3 at 10/10/80 (control/treat/holdout), day 4+ at 50/50 if no guardrail trips. Name the specific check to run before each ramp step.
6. Stop conditions: when do we kill, when do we extend, when do we ship.
7. The one thing this experiment will NOT answer (so we do not over-claim later).
Rules:
- No "consider" language. Each section makes a call.
- No invented numbers. If you need a number I did not give you, mark it [need from analytics].
- If the MDE is bigger than 1.5 pp, say "this experiment is likely underpowered" out loud.
- Max one page.
Sample output structure
Hypothesis. If we add a goal-picker at onboarding step 3, then D7 retention will rise from 22% to at least 24.5%, because users who self-select a goal anchor a return reason within the first session.
Primary metric. D7 retention = returning session on calendar day 7 in the user’s local timezone. Likely measurement bug: server-UTC day boundaries undercount Asia/Pacific users by ~3 pp. Confirm the analytics pipeline uses install-local day before launch.
MDE check. With baseline 22%, n ≈ 252k installs over 21 days (50/50 split = 126k per arm), MDE at 80% power and alpha 0.05 is ~0.8 pp. The target lift of 2.5 pp is well above MDE — experiment is adequately powered. [need from analytics: actual day-7 sample retention, since holdouts and slow-cohorts reduce usable n].
Guardrails. Crash-free below 99.6% → pause and inspect. IAP/new-install down more than 5% on day 3 → pause; the goal-picker may be siphoning attention from the paywall. Day-1 uninstall up more than 1 pp → kill; we are losing users at the new step.
Ramp. Day 1-3 at 10/10/80 to validate instrumentation and guardrails. Day 4 ramp to 50/50 only if crash-free and uninstall guardrails are clean. Day 14 interim check for early-call eligibility.
Stop. Ship if D7 lift greater than 1.5 pp with p less than 0.05 at day 14. Kill on any guardrail trip. Extend is not available — marketing locks on day 25.
Not answered. This test does not tell us whether the goal-picker improves week-4 retention or LTV. Plan a follow-up cohort readout at week 4.
How to refine
- Hypothesis is vague (“improves engagement”) → require “name the mechanism in one clause.”
- MDE skipped → demand “show MDE math or estimate, and call out underpowered explicitly.”
- Guardrails are decorative → require each one to have a numeric trigger and an action verb.
- Ramp plan has no checks → require “what check unlocks the next ramp step.”
- AI invents traffic numbers → repeat “no invented numbers; mark [need from analytics].”
Common mistakes
- Designing the test after you already shipped the feature flag — by then you cannot say no.
- Picking a primary metric that moves on a quarterly horizon (LTV) for a 21-day test.
- Five guardrails — each one increases the false-alarm rate. Three is plenty.
- No stop condition. Tests that run “until we feel good” never end.
FAQ
- What if my traffic is too low for the MDE I want? Pick a leading metric (D1 or activation) for the test, and queue the lagging metric for a cohort readout.
- Holdout group — yes or no? Yes for any feature you cannot easily reverse. The 5-10% holdout pays for itself the first time you need to compare against true baseline.
- Can AI run the stats? No. AI sketches the plan; your stats engine or experimentation platform runs significance.
- One-sided or two-sided test? Two-sided unless you have a written reason. One-sided makes the math easier and the conclusions weaker.
Related
- AI A/B Test Summary
- AI Experiment Interpretation
- AI Retention Cohort Readout
- AI Feature Prioritization
- AI Funnel Analysis Readout
- AI App Store ASO Keyword Research Without Guessing
- AI Crash Report Triage: From Stack Trace to Owner in One Pass
Tags: #AI writing #app-experiment #ab-testing #app-product-ops #Indie dev