The task
You ran an A/B test, the dashboard says variant B is +4% on the primary metric, and somebody in Slack already typed “ship it.” Before you ship, you need to know whether the lift is real, whether the sample is large enough, whether anything else changed during the experiment, and whether you measured the right thing. AI is useful as a structured second reviewer that asks the questions a senior analyst would.
When AI helps — and when it does not
AI is good at running the standard interpretation checklist: significance, effect size, sample-size adequacy, novelty effects, segment splits. It cannot know what else changed in your product during the test, because that requires context it does not have. Feed it the timeline of other launches; otherwise it cannot flag confounders.
What to feed the AI
- Experiment setup: hypothesis, control, treatment, randomisation unit
- Primary metric and guardrail metrics, with definitions
- Sample size per arm and how long it ran
- Pre-registered MDE (minimum detectable effect)
- Anything else that changed during the experiment window (other launches, marketing pushes, outages)
- Segment splits if you have them (new vs returning, mobile vs desktop, country)
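One way to keep these inputs together before prompting is a small record that also reports which gaps will limit the review. This is a minimal sketch: the class name `ExperimentPacket`, its fields, and the example values are all hypothetical, not part of any tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExperimentPacket:
    """Hypothetical container for the inputs listed above."""
    hypothesis: str
    randomisation_unit: str                 # "user", "session" or "device"
    primary_metric: str
    guardrail_metrics: List[str]
    sample_size_per_arm: Dict[str, int]
    duration_days: int
    mde_relative: Optional[float] = None    # e.g. 0.04 for a 4% relative lift
    other_changes: List[str] = field(default_factory=list)
    segments: List[str] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """Name the inputs whose absence limits what the AI can check."""
        found = []
        if self.mde_relative is None:
            found.append("mde_relative: no power check against a pre-registered MDE")
        if not self.other_changes:
            # Assumption: an empty list means "not provided".
            # If nothing else changed, say so explicitly, e.g. ["none"].
            found.append("other_changes: confounding launches cannot be flagged")
        if not self.segments:
            found.append("segments: no subgroup driver can be identified")
        return found


# Made-up example values, for illustration only.
packet = ExperimentPacket(
    hypothesis="Shorter checkout form raises conversion",
    randomisation_unit="user",
    primary_metric="checkout conversion (orders / visitors)",
    guardrail_metrics=["refund rate", "7-day retention"],
    sample_size_per_arm={"control": 10_000, "treatment": 10_000},
    duration_days=14,
    mde_relative=0.04,
)
for gap in packet.gaps():
    print("Missing", gap)
```

Here the packet has an MDE but no launch timeline and no segments, so two gaps are reported before anything is sent to the model.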
Copy-ready prompt
Interpret this A/B test result.
Hypothesis: <line>
Randomisation unit: <user / session / device>
Primary metric: <metric and definition>
Guardrail metrics: <list with definitions>
Sample size per arm: <numbers>
Duration: <days>
Pre-registered MDE: <X%>
Other changes in the window: <list>
Raw results:
"""
<paste numbers / table>
"""
Return:
1. Statistical assessment: p-value, confidence interval, effect size in business units
2. Power check: was the sample large enough to detect the pre-registered MDE?
3. Guardrail movement — any harmful trade-off?
4. Validity threats checklist: novelty, seasonality, confounding launches, SRM (sample ratio mismatch)
5. Segment splits — which subgroup drove the result, if any
6. Recommended next action: ship, ship to subset, extend, kill, redesign — with reasoning
7. The most likely wrong conclusion someone would draw, and why
Do not call a result significant on a small sample. If p < 0.05 but n is far below the sample size needed to detect the pre-registered MDE, flag it.
For surprising results: “Now write a 5-question follow-up plan to figure out whether this is real, including segment slices to inspect and a confirmation test design.”
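To sanity-check the p-value and confidence interval the AI returns for a conversion-style metric, the numbers can be recomputed independently. The sketch below assumes a binary outcome per randomisation unit and uses a normal-approximation two-proportion z-test; the function name and the example counts are illustrative, not taken from any real experiment.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for B vs A plus a confidence interval for the difference."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the test statistic (null: both rates are equal).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {
        "abs_lift": diff,
        "rel_lift": diff / p_a,
        "z": z,
        "p_value": p_value,
        "ci": ci,
    }


# Made-up counts: 10.0% vs 10.4% conversion, i.e. a +4% relative lift.
result = two_proportion_ztest(conv_a=1_000, n_a=10_000, conv_b=1_040, n_b=10_000)
print(result)
```

With these made-up counts, a +4% relative lift on 10,000 users per arm gives a p-value well above 0.05 and an interval that spans zero, which is the kind of result that should not be shipped on the dashboard number alone.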
Recommended output structure
A short header verdict (“ship / extend / redesign”), a statistical block with numbers, a validity threats list, segment splits in a small table, and a recommended next action. Avoid prose-only — readers will quote whichever line is shortest.
How to check the output is usable
- The verdict has reasoning, not just a tag
- Sample-size adequacy is checked against the sample required to detect the MDE, not judged from raw N alone
- Validity threats are named, not just “be careful”
- Segment splits identify a driver if one exists
- If shipping is recommended, the guardrails are explicitly clean
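The sample-size item in this checklist can be verified with a short calculation. The following is a sketch under stated assumptions: a binary metric, equal arm sizes, a two-sided test, and the usual normal-approximation formula; the defaults of 5% significance and 80% power are conventions, not values from this guide.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(baseline, relative_mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect a relative lift on a rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Illustrative inputs: 10% baseline conversion, 4% relative MDE.
needed = required_n_per_arm(baseline=0.10, relative_mde=0.04)
actual = 10_000  # hypothetical sample size per arm
print(f"needed per arm: {needed}, actual per arm: {actual}")
print("adequately powered" if actual >= needed else "underpowered for this MDE")
```

For a 10% baseline and a 4% relative MDE the approximation asks for roughly 90,000 users per arm, so a test with 10,000 per arm would be flagged as underpowered regardless of its p-value.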
Common mistakes
- Calling significance on tiny samples — p < 0.05 with n = 200 is not the same as n = 20,000
- Ignoring guardrails — primary metric up, retention down is a bad ship
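- No SRM check: a 53/47 split instead of 50/50 can be chance on a few hundred users but signals broken randomisation at ten thousand or more, so test the arm counts before reading any metric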
- Letting AI invent confidence intervals — provide them or ask for the formula, do not let it guess
- Acting on novelty effects in the first week — extend before deciding
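The SRM check in this list is a chi-square goodness-of-fit test on the arm counts. The sketch below assumes two arms; with one degree of freedom the chi-square p-value can be read from the normal distribution, so only the standard library is needed. The 0.001 threshold is a commonly used convention and an assumption here, not something this guide prescribes.

```python
from math import sqrt
from statistics import NormalDist


def srm_check(n_a, n_b, expected_share_a=0.5, threshold=0.001):
    """Chi-square test (1 degree of freedom) of observed vs planned traffic split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, chi2 is the square of a standard normal z.
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return {"chi2": chi2, "p_value": p_value, "srm": p_value < threshold}


# The same 53/47 split at two different scales (made-up counts).
small = srm_check(53, 47)
large = srm_check(5_300, 4_700)
print("n=100:   ", small)
print("n=10000: ", large)
```

The same 53/47 ratio is unremarkable at 100 users and a clear mismatch at 10,000, which is why the check needs the raw counts and not just the percentages.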
Practical depth notes
For A/B test interpretation, the difference between a usable AI result and a generic one is the input packet. Give the model the audience, the raw results or draft readout, the desired format, the decision you need to make, and two examples of what good and bad output look like. Ask it to preserve facts first, then improve structure or wording second.
After the first response, do a separate review pass. Look for missing constraints, invented details, weak calls to action, and language that sounds plausible but does not match the real situation. The best final output should be easy to use immediately: clear owner, clear next step, and no hidden assumption that someone else has to untangle.
A stronger version of this workflow also defines the handoff. Decide who will use the output, what they should do next, and what information would make them reject it. If the deliverable is copy, test whether it has a single clear action. If it is analysis, test whether it separates observation from recommendation. If it is planning, test whether dates, owners, and tradeoffs are explicit enough for someone else to execute.
FAQ
- What if I do not have a pre-registered MDE? Ask AI to estimate the smallest effect your sample size and metric variance could detect. Register one before the next test; future-you will thank you.
- Can AI run the stats? It can compute z-tests / t-tests with provided inputs. For Bayesian or sequential analysis, use a dedicated tool.
- One-sided or two-sided? Default to two-sided. One-sided is appropriate only if an effect in the opposite direction would lead to the same decision as no effect, which is rare.
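For the first question, the estimate can be reproduced by hand. The sketch below assumes a continuous metric with a known standard deviation, equal arm sizes, and a two-sided test at 80% power; the function name and inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mde_absolute(std_dev, n_per_arm, alpha=0.05, power=0.8):
    """Smallest absolute difference in means the test could reliably detect."""
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z_total * sqrt(2 * std_dev ** 2 / n_per_arm)


# Same metric spread, two sample sizes (made-up values).
mde_small = mde_absolute(std_dev=1.0, n_per_arm=200)
mde_large = mde_absolute(std_dev=1.0, n_per_arm=20_000)
print(f"n=200 per arm:    MDE = {mde_small:.3f}")
print(f"n=20000 per arm:  MDE = {mde_large:.3f}")
```

The detectable effect shrinks with the square root of the sample size, so going from 200 to 20,000 users per arm makes the test ten times more sensitive, not a hundred times.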
Related
- Growth experiment prompts — design the next experiment
- MVP validation experiment prompts — validate before building
- User retention experiment prompts — for retention-focused tests
- Funnel analysis readout — pre-experiment funnel diagnosis
- Retention cohort readout — same logic for cohort analysis
- ChatGPT data analysis workflow — broader analysis workflow