AI A/B Test Plan: Draft a One-Page Experiment Spec in 10 Minutes

Use AI to draft an A/B test plan with a falsifiable hypothesis, MDE sanity check, sample-size math, ramp plan, and stop conditions. Verified June 2026.

Published: May 23, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team

You have a feature idea and three people in Slack arguing about how to test it. Before you ship the flag, you need a one-page plan that states exactly what you are measuring, when you will stop, and what result you will accept. AI writes a strong first draft of that plan in minutes, but only if you feed it the baseline, the traffic, and the decision window. It will not invent the statistics for you, and it should not.

TL;DR

Give the model your feature change, primary metric baseline, daily traffic, and decision deadline. Ask for a fixed 7-section plan: hypothesis, metric definition, MDE check, guardrails, ramp, stop conditions, and what the test will not answer. Use Claude Opus 4.7 or GPT-5.5 for the reasoning, then run the actual significance test in Statsig, GrowthBook, or your in-house stats engine. The AI sketches; the platform decides.

What this produces

A one-page A/B experiment spec with: a falsifiable hypothesis, a precise primary-metric definition, a minimum-detectable-effect (MDE) sanity check, two to three guardrail metrics, a ramp plan, and explicit stop conditions.

When AI is the right tool for this

You already know the feature change and the rough audience.
You have a baseline number for the primary metric (current conversion, current D7, etc.).
You can describe two or three guardrails in plain language.
You want a fast “is this even worth running” sanity check before you burn three weeks of traffic.
You are not asking AI to run the stats. It drafts the plan; your experimentation platform computes significance.

What to feed the model

Input	Example
Feature change (one sentence)	“New onboarding step 3 with a goal-picker”
Primary metric + baseline	“D7 retention, currently 22% on iOS”
Guardrails (2-3)	“Crash-free rate, IAP revenue per install, day-1 uninstall rate”
Daily traffic	“12,000 new iOS installs/day”
MDE (or ask for a range)	“Smallest lift worth shipping: 2 pp”
Decision window + calendar	“21 days max; marketing launch on day 25”

Which model to use

All three frontier models (as of June 2026) handle this task well because it is structured reasoning over numbers you supply, not heavy computation. The differences are small:

Model	API price (in/out per 1M tokens)	Why pick it here
Claude Opus 4.7	$5 / $25	Holds a strict 7-section structure best; least likely to invent numbers when told not to
GPT-5.5	$5 / $30	Most token-efficient output; strong at the MDE arithmetic
Gemini 3.1 Pro	$2 / $12	Cheapest, 1M-token context if you paste a long metrics export

For a single one-page plan the cost difference is a few cents at most, so pick whichever you already pay for: Claude Pro ($20/mo), ChatGPT Plus ($20/mo), or Google AI Pro ($19.99/mo).

Copy-ready prompt

You are a senior product analyst writing a one-page A/B test plan.

Feature change: a new onboarding step 3 that asks users to pick a goal
(sleep, focus, anxiety) before reaching the home screen. Current onboarding
has no goal-picker.

Primary metric: D7 retention. Baseline: 22% on iOS.
Guardrails:
- Crash-free session rate (must not drop more than 0.2 pp)
- IAP revenue per new install in week 1 (must not drop more than 5%)
- Day-1 uninstall rate (must not rise more than 1 pp)

Audience: new iOS installs only. 12,000 new installs per day.
Decision window: 21 days max. Marketing launch on day 25, so we cannot extend.

Write the plan in this exact structure:

1. Hypothesis (one sentence, falsifiable). Form: "If we add X, then primary
   metric Y will move by Z, because mechanism W."

2. Primary metric definition. Include: what counts as a D7-retained user
   (returning session on calendar day 7 in user-local time, not server UTC).
   Mention the most likely measurement bug.

3. MDE check. Given baseline 22% and a 21-day window with 12k installs/day,
   what is the smallest effect we can reliably detect at 80% power, alpha 0.05?
   Show the arithmetic (or a clear estimate).

4. Guardrail thresholds and the action if each trips. One sentence each:
   numeric trigger plus an action verb.

5. Ramp plan: day 1-3 at 10/10/80 (control/treat/holdout), day 4+ at 50/50 if
   no guardrail trips. Name the specific check before each ramp step.

6. Stop conditions: when do we kill, when do we extend, when do we ship.

7. The one thing this experiment will NOT answer (so we do not over-claim later).

Rules:
- No "consider" language. Each section makes a call.
- No invented numbers. If you need a number I did not give you, mark it
  [need from analytics].
- If the MDE is bigger than 1.5 pp, say "this experiment is likely
  underpowered" out loud.
- Max one page.

Sample output

Hypothesis. If we add a goal-picker at onboarding step 3, then D7 retention will rise from 22% to at least 24.5%, because users who self-select a goal anchor a return reason within the first session.

Primary metric. D7 retention = a returning session on calendar day 7 in the user’s local timezone. Likely measurement bug: server-UTC day boundaries undercount Asia/Pacific users by a few points. Confirm the analytics pipeline uses install-local day before launch.

MDE check. With baseline 22% and roughly 252k installs over 21 days (50/50 split ≈ 126k per arm), the MDE at 80% power and alpha 0.05 is about 0.5 pp. The target lift of 2.5 pp sits well above MDE, so the test is adequately powered. [need from analytics: actual day-7 sample size, since holdouts and slow cohorts shrink usable n].

Guardrails. Crash-free session rate down more than 0.2 pp from baseline [need from analytics: current crash-free rate] → pause and inspect. IAP per new install down more than 5% by day 3 → pause; the goal-picker may be siphoning attention from the paywall. Day-1 uninstall up more than 1 pp → kill; we are losing users at the new step.

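Ramp. Day 1-3 at 10/10/80 to validate instrumentation and guardrails. Day 4 ramp to 50/50 only if crash-free and uninstall guardrails are clean. Day 14 interim check for early-call eligibility, valid only under the platform's sequential-testing correction.

Stop. Ship if D7 lift exceeds 1.5 pp with p < 0.05 at the day-21 readout, or at day 14 if the sequentially corrected result already clears the same bar. Pause or kill on a guardrail trip, per the actions above. Extension is unavailable; marketing locks on day 25.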

Not answered. This test does not tell us whether the goal-picker improves week-4 retention or LTV. Plan a follow-up cohort readout at week 4.
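The bracketed caveat in the sample MDE check (holdouts and slow cohorts shrink usable n) can be estimated before launch. The sketch below is one rough way to do it, assuming installs arrive evenly across days and that an install on day d can only be scored for D7 on day d + 7. The function name and the ramp encoding are hypothetical, not part of any platform's API.

```python
def usable_installs_per_arm(daily_installs, ramp, window_days, maturation_days):
    """Count installs per arm whose metric can mature inside the window.

    ramp is a list of (first_day, last_day, share_per_arm) tuples, e.g.
    (1, 3, 0.10) means each arm gets 10% of installs on days 1 through 3.
    Assumption: traffic is flat, so every day brings daily_installs users.
    """
    # Installs after this day cannot reach their day-7 mark before readout.
    last_usable_day = window_days - maturation_days
    total = 0.0
    for first_day, last_day, share in ramp:
        last = min(last_day, last_usable_day)
        days = max(0, last - first_day + 1)
        total += days * daily_installs * share
    return round(total)


# Ramp from the prompt: 10/10/80 on days 1-3, then 50/50 from day 4.
ramp = [(1, 3, 0.10), (4, 21, 0.50)]
n = usable_installs_per_arm(12_000, ramp, window_days=21, maturation_days=7)
print(n)  # 69600
```

Under these assumptions the usable sample is closer to 70k per arm than the headline 126k, which is why the plan asks analytics for the real day-7 count instead of trusting the naive split.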

The MDE math, in plain terms

The required sample size scales with one over the square of the effect you want to catch, so halving the MDE roughly quadruples the sample you need. The textbook two-proportion formula is:

n per arm ≈ (Z_alpha/2 + Z_beta)^2 × [ p1(1-p1) + p2(1-p2) ] / (p2 - p1)^2

At the industry-standard 80% power and 95% confidence, Z_alpha/2 = 1.96 and Z_beta = 0.84. Lower baselines blow up the requirement fast: detecting the same relative lift on a 1% baseline takes roughly 5x more users than on a 5% baseline. You do not need to compute this by hand. Have the model estimate it, then confirm with a sample-size calculator before you commit traffic. Statsig and CXL both publish free ones.
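For a quick local cross-check of the model's estimate, the formula above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a replacement for your platform's calculator: the function names are illustrative, and the MDE is found by a simple bisection over the same formula.

```python
from math import ceil
from statistics import NormalDist


def _required_n(p1, p2, alpha=0.05, power=0.80):
    """Two-proportion sample size per arm (two-sided test), as a float."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at alpha 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2


def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Users needed in each arm to detect a move from p1 to p2."""
    return ceil(_required_n(p1, p2, alpha, power))


def minimum_detectable_effect(p1, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute lift over p1 detectable with n_per_arm users."""
    lo, hi = 1e-6, 1 - p1 - 1e-6
    for _ in range(60):  # bisection: required n falls as the lift grows
        mid = (lo + hi) / 2
        if _required_n(p1, p1 + mid, alpha, power) > n_per_arm:
            lo = mid
        else:
            hi = mid
    return hi


# Baseline 22%, smallest lift worth shipping 2 pp: roughly 7,000 per arm.
print(sample_size_per_arm(0.22, 0.24))
# Halving the lift to 1 pp roughly quadruples the requirement.
print(sample_size_per_arm(0.22, 0.23))
# MDE for a hypothetical 50,000 users per arm, as an absolute proportion.
print(minimum_detectable_effect(0.22, 50_000))
```

If the model's MDE estimate and this sketch disagree by more than rounding, treat that as a prompt to check the inputs before committing traffic.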

Run the actual stats on a real platform

AI drafts the plan; a real engine computes significance. As of June 2026:

Platform	Free tier	Paid entry	Best for
Statsig	Generous free events tier, no card	Pro from ~$150/mo	Technical teams that want sequential stats and feature flags together
GrowthBook	Free open-source self-host or free cloud (1 seat)	$20/seat/mo	Teams that own their warehouse and want SQL-defined metrics
Optimizely	None	$50k+/yr enterprise	Large orgs with multi-property personalization

For an indie or small team, GrowthBook (self-hosted, free) or Statsig’s free tier covers a goal-picker test without a contract.

How to refine the draft

Hypothesis is vague (“improves engagement”) → require “name the mechanism in one clause.”
MDE skipped → demand “show the MDE arithmetic or a clear estimate, and call out underpowered explicitly.”
Guardrails are decorative → require each one to have a numeric trigger and an action verb.
Ramp plan has no checks → require “what check unlocks the next ramp step.”
AI invents traffic numbers → repeat “no invented numbers; mark [need from analytics].”
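Some of these checks are mechanical enough to run before a human reads the draft. The sketch below is a hypothetical linter, assuming the draft uses the section labels from the sample output (Hypothesis, Primary metric, MDE check, Guardrails, Ramp, Stop, Not answered), optionally numbered. It only catches surface problems; it cannot judge whether a mechanism or threshold is sensible.

```python
import re

# Assumed section labels, taken from the sample output's wording.
REQUIRED_SECTIONS = [
    "Hypothesis", "Primary metric", "MDE check",
    "Guardrails", "Ramp", "Stop", "Not answered",
]
FLAGS = re.MULTILINE | re.IGNORECASE


def lint_plan(draft: str) -> list[str]:
    """Return a list of surface-level problems found in a plan draft."""
    problems = []
    for name in REQUIRED_SECTIONS:
        # A section starts a line, with an optional "3. " style number.
        pattern = rf"^\s*(?:\d+\.\s*)?{re.escape(name)}\b"
        if not re.search(pattern, draft, flags=FLAGS):
            problems.append(f"missing section: {name}")
    # The prompt bans "consider" language: each section must make a call.
    if re.search(r"\bconsider(s|ed|ing)?\b", draft, flags=re.IGNORECASE):
        problems.append('hedging: found "consider" language')
    # Guardrails need at least one number to count as a trigger.
    guardrails = re.search(r"^\s*(?:\d+\.\s*)?Guardrails\b(.*)$", draft, flags=FLAGS)
    if guardrails and not re.search(r"\d", guardrails.group(1)):
        problems.append("guardrails: no numeric trigger")
    return problems


draft = "Hypothesis. We should consider improving engagement.\nGuardrails. Watch crashes.\n"
for problem in lint_plan(draft):
    print(problem)
```

Each reported problem maps to one of the follow-up instructions above, so the list can be pasted straight back to the model as the next turn.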

Common mistakes

Designing the test after you already shipped the flag. By then you cannot honestly say no.
Picking a primary metric that only moves on a quarterly horizon (LTV) for a 21-day test.
Five guardrails. Each one raises the false-alarm rate; three is plenty.
No stop condition. Tests that run “until we feel good” never end.

FAQ

Can AI compute the significance for me? No. AI sketches the plan and estimates the MDE as a sanity check. Run the real significance test in Statsig, GrowthBook, Optimizely, or your in-house engine. Treat the AI number as a gut check, not a verdict.

My traffic is too low for the MDE I want. Now what? Switch the primary metric to a faster-moving leading indicator (day-1 retention or activation), and queue the lagging metric (D7, LTV) for a later cohort readout. A longer window will not save you if the math is already underpowered.

Do I need a holdout group? Yes, for any feature you cannot easily reverse. A 5-10% holdout pays for itself the first time you need a clean baseline to compare against weeks later.

Halfway through I realize the sample size was estimated wrong. Stop the test? No. Stopping because you peeked at the results is exactly how you break your own significance. Log a “sample-size recompute” note, run to the planned end, and correct the baseline for the next test of its kind.

One-sided or two-sided test? Two-sided unless you have a written reason. A one-sided test clears the significance bar more easily, but it cannot flag a change that hurts the metric, so its conclusions are weaker; reviewers will rightly distrust it.
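To see why the answer on stopping early is “no”, a small simulation helps. The sketch below runs many A/A tests (no true effect) and compares two policies: declare a win at any of several evenly spaced looks, or judge only at the planned end. It is a simplified model under stated assumptions: the z-statistic is treated as a normal random walk with equal-sized data batches, and the function name and defaults are illustrative.

```python
import random
from math import sqrt


def false_positive_rates(looks=5, trials=4000, z_crit=1.96, seed=7):
    """Simulate A/A tests and return (peeking_rate, final_only_rate)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(trials):
        running_sum = 0.0
        crossed_any_look = False
        z = 0.0
        for k in range(1, looks + 1):
            # Each batch adds one standard-normal increment under the null.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(k)  # z-statistic after k batches
            if abs(z) > z_crit:
                crossed_any_look = True
        if crossed_any_look:
            peeking_hits += 1   # would have stopped at some look
        if abs(z) > z_crit:
            final_hits += 1     # significant at the planned end only
    return peeking_hits / trials, final_hits / trials


peeking, final_only = false_positive_rates()
print(f"stop at any of 5 looks: {peeking:.1%} false positives")
print(f"planned end only:       {final_only:.1%} false positives")
```

With five uncorrected looks the false-positive rate lands well above the nominal 5%, which is the gap that sequential-testing corrections on real platforms are designed to close.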

Tags: #AI writing #app-experiment #ab-testing #app-product-ops #Indie dev
