Interpret A/B Test Results With AI: Significance, SRM, Effect Size

Use AI as a critical analyst on A/B results: significance, sample-ratio mismatch, effect size, power, validity threats, and the right next action — with a copy-ready prompt.

Published: May 17, 2026 Updated: Jun 05, 2026 Author: AI Productivity Guide Team

TL;DR

Paste your raw A/B numbers into the prompt below and AI will run the same checklist a senior analyst runs: significance, sample-ratio mismatch (SRM), effect size in business units, power against your MDE, guardrail movement, and validity threats. Treat it as a second reviewer, not an oracle: it can compute a z-test or chi-square from numbers you give it, but it cannot know what else shipped in your product during the test, so feed it that timeline. As of June 2026, ChatGPT’s data-analysis mode (GPT-5.5) and Claude (Opus 4.7 / Sonnet 4.6) will execute the Python instead of guessing — use that, and never let the model invent a confidence interval.

The task

You ran an A/B test, the dashboard says variant B is +4% on the primary metric, and somebody in Slack already typed “ship it.” Before you ship, you need four answers: is the lift real (significant), is the sample big enough to trust a +4% read, did anything else change during the window, and did you measure the right thing? Most bad ships come from skipping one of these, not from bad math. AI is useful here as a structured reviewer that refuses to skip a step.

When AI helps — and when it does not

AI is reliable at the mechanical checklist: significance, effect size, sample-size adequacy against your MDE, novelty checks, SRM, and segment splits. When you give it the raw counts and use a tool that runs code (ChatGPT data analysis or Claude with code execution), it computes the actual z-test, t-test, or chi-square rather than approximating from memory.
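As a rough illustration of what that computation looks like for a conversion-rate test, here is a minimal sketch of a two-sided two-proportion z-test with a 95% confidence interval on the absolute difference. The function name and the counts are hypothetical, chosen only to mirror a +4% relative lift; the code a tool actually generates may differ.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates, plus a Wald CI."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error for the test statistic (null: both rates are equal).
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    p_value = 2 * norm.sf(abs(z))
    # Unpooled standard error for the confidence interval on the difference.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    crit = norm.ppf(1 - alpha / 2)
    return {
        "abs_lift": diff,             # effect in absolute rate points
        "rel_lift": diff / p_c,       # effect relative to control
        "z": z,
        "p_value": p_value,
        "ci": (diff - crit * se, diff + crit * se),
    }

# Hypothetical counts: 10.0% control vs 10.4% treatment (+4% relative).
result = two_proportion_ztest(conv_c=3000, n_c=30000, conv_t=3120, n_t=30000)
print(result)
```

With these made-up counts the p-value lands above 0.05 and the interval spans zero, which is the kind of outcome a "+4%" dashboard headline can hide.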

It is poor at the one thing that matters most: knowing what else changed in your product during the test. A pricing-page redesign that overlapped your checkout experiment, a marketing email that went out mid-test, or a payment outage on day three are all confounders the model has no way to see. Paste the timeline of other launches, or it cannot flag them.

What to feed the AI

Experiment setup: hypothesis, control, treatment, and randomisation unit (user / session / device)
Primary metric and guardrail metrics, with definitions
Sample size per arm and how long it ran (in days)
Pre-registered MDE (minimum detectable effect) and target power
Anything else that changed during the experiment window: other launches, marketing pushes, outages, seasonality
Segment splits if you have them: new vs returning, mobile vs desktop, country

Copy-ready prompt

Interpret this A/B test result. Run code to compute the statistics; do not estimate.

Hypothesis: [one line]
Randomisation unit: [user / session / device]
Primary metric: [metric and definition]
Guardrail metrics: [list with definitions]
Sample size per arm: [control n, treatment n]
Conversions per arm: [control, treatment]
Duration: [days]
Pre-registered MDE: [X%], target power: [0.80 / 0.90]
Other changes in the window: [list]

Raw results:
"""
[paste numbers / table]
"""

Return:
1. SRM check first: chi-square goodness-of-fit on the actual split vs expected. If p < 0.01, STOP and say the experiment is invalid.
2. Statistical assessment: p-value, 95% confidence interval, effect size in business units (not just relative %).
3. Power check: was n large enough to detect the pre-registered MDE at the target power? If n is far below the required n, flag it.
4. Guardrail movement — any harmful trade-off?
5. Validity threats checklist: novelty, seasonality, confounding launches, peeking.
6. Segment splits — which subgroup drove the result, if any.
7. Recommended next action: ship / ship to subset / extend / kill / redesign — with reasoning.
8. The most likely wrong conclusion someone would draw from this, and why.

Rules: do not call a result conclusive when n is far below the n required for the MDE.
Do not invent a confidence interval — compute it from the counts I gave you.

For a surprising result, follow up: “Write a 5-step confirmation plan to test whether this is real, including which segments to slice and a confirmation-test design.”

The five checks AI should always run

Check	What “pass” looks like	Common 2026 thresholds
SRM (sample ratio mismatch)	Observed split matches intended split	Chi-square p ≥ 0.01; SRM occurs in ~6% of tests, so check every time
Significance	p-value below alpha, with a CI that excludes 0	alpha = 0.05 (two-sided)
Power vs MDE	n meets the pre-registered MDE requirement	Power 0.80–0.90; MDE often 2–5% (lower for high-traffic sites)
Duration / novelty	Ran ≥ 2 weeks; effect stable in week 2+	2 weeks minimum, 6–8 weeks maximum
Guardrails	Retention, latency, refunds, support tickets all clean	No statistically significant harm on any guardrail

A useful sanity floor for conversion-rate tests: roughly 30,000 users and ~3,000 conversions per arm before a few-percent lift is trustworthy. Below that, a “+4%” headline is mostly noise.
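To see how that floor compares with a formal requirement, the sketch below estimates the per-arm sample size needed to detect a relative lift, using the standard normal-approximation formula for two proportions. The function name, the 10% baseline, and the 4% relative MDE are illustrative assumptions, not figures from any particular test.

```python
from math import ceil
from scipy.stats import norm

def required_n_per_arm(baseline, rel_mde, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + rel_mde)          # treatment rate if the MDE is real
    z_alpha = norm.ppf(1 - alpha / 2)      # critical value for two-sided alpha
    z_beta = norm.ppf(power)               # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Assumed inputs: 10% baseline conversion, 4% relative MDE, 80% power.
n = required_n_per_arm(baseline=0.10, rel_mde=0.04)
print(n)
```

Under these assumptions the requirement comes out well above 30,000 per arm, so the floor should be read as a minimum for taking a result seriously, not as proof of adequate power.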

Why SRM is the first check, not the last

Sample ratio mismatch means your traffic did not split the way you intended — a planned 50/50 came out 53/47. It looks minor, but it almost always means the randomisation or logging is broken (bot filtering applied to one arm only, a redirect that drops users, a caching bug), which biases every downstream number. The industry standard is a chi-square goodness-of-fit test with a strict p < 0.01 threshold, because you want high confidence before discarding a test. If SRM fails, no other statistic is interpretable — fix the cause and rerun. That is why the prompt puts it first and tells the model to stop on failure.
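A minimal sketch of that check, using `scipy.stats.chisquare` for the goodness-of-fit test. The helper name `srm_check` and the arm counts are hypothetical; the 0.01 threshold follows the text above.

```python
from scipy.stats import chisquare

def srm_check(n_control, n_treatment, expected_ratio=(0.5, 0.5), threshold=0.01):
    """Chi-square goodness-of-fit test of the observed split vs the intended one."""
    total = n_control + n_treatment
    expected = [total * r for r in expected_ratio]
    stat, p_value = chisquare([n_control, n_treatment], f_exp=expected)
    # srm=True means the split deviates enough that the test should be treated as invalid.
    return {"chi2": float(stat), "p_value": float(p_value), "srm": bool(p_value < threshold)}

# Hypothetical counts: a planned 50/50 split that came out 53/47 on 10,000 users.
check = srm_check(5300, 4700)
print(check)
```

The same 53/47 ratio on a few hundred users would not trip the threshold, which is why the test runs on raw counts and not on percentages.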

The peeking trap

If your team checked the dashboard 20 times and called the winner the moment it crossed p < 0.05, your real false-positive rate is not 5%. With 20 evenly spaced looks it is roughly 25%, and with enough looks repeated peeking can push it toward 40%. For a fixed-horizon test, decide the sample size up front and read the result once, at the end. If you genuinely need to monitor continuously, use a platform with sequential testing (for example mSPRT-style methods, offered by platforms such as Statsig, Eppo, or GrowthBook), which adjusts the thresholds so peeking stays valid. Ask the AI to flag whether the decision was made at a pre-committed stopping point or mid-flight.
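The inflation is easy to reproduce with a small simulation. The sketch below runs A/A tests (no true difference), checks a z-test at 20 evenly spaced looks, and compares "stop at the first significant look" with "read once at the end". All parameters (traffic, conversion rate, number of simulations) are assumptions picked for illustration.

```python
import numpy as np
from scipy.stats import norm

def peeking_false_positive_rate(n_sims=2000, n_per_arm=10000, looks=20,
                                rate=0.10, alpha=0.05, seed=0):
    """Simulate A/A tests and return (peeked, final) false-positive rates."""
    rng = np.random.default_rng(seed)
    crit = norm.ppf(1 - alpha / 2)
    chunk = n_per_arm // looks
    # Cumulative conversions per arm at each look; both arms share the same true rate.
    a = rng.binomial(chunk, rate, size=(n_sims, looks)).cumsum(axis=1)
    b = rng.binomial(chunk, rate, size=(n_sims, looks)).cumsum(axis=1)
    n = chunk * np.arange(1, looks + 1)          # users per arm at each look
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (b - a) / n / se
    peeked = (np.abs(z) > crit).any(axis=1).mean()   # "significant" at any look
    final = (np.abs(z[:, -1]) > crit).mean()         # significant at the last look only
    return float(peeked), float(final)

peeked, final = peeking_false_positive_rate()
print(f"any-look false positives: {peeked:.1%}, final-look only: {final:.1%}")
```

The final-look rate should sit near the nominal 5%, while the any-look rate should come out several times higher, even though no arm is actually better.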

How to check the output is usable

The verdict has reasoning, not just a “ship / kill” tag
SRM was checked before anything else and reported explicitly
Sample-size adequacy is computed against your MDE, not just “n looks big”
Validity threats are named (novelty, the specific overlapping launch), not a vague “be careful”
Confidence intervals were computed from your counts, not asserted
If shipping is recommended, every guardrail is explicitly clean

Common mistakes

Calling significance on tiny samples — p < 0.05 with n = 200 is not the same finding as n = 20,000
Skipping the SRM check, then trusting a number from a broken split
Ignoring guardrails — primary metric up, 7-day retention down is a bad ship
Letting AI invent a confidence interval instead of computing it from the counts
Acting on novelty effects in week one — extend to at least two full weeks before deciding
Reading a peeked p-value as if it came from a fixed-horizon test

FAQ

Q: What if I do not have a pre-registered MDE?

Ask the AI to back-calculate the MDE your sample could actually detect at 80% power, given your baseline rate and per-arm n. If the observed lift is below that, you were underpowered. Register an MDE before the next test so this is decided up front.
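A sketch of that back-calculation, assuming a two-sided test and approximating both arms' variance with the baseline rate. The function name and the inputs (10% baseline, 30,000 users per arm) are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def detectable_mde(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Smallest lift detectable at the given power (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    # Simplification: use the baseline rate for the variance of both arms.
    se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    abs_mde = (z_alpha + z_beta) * se
    return abs_mde, abs_mde / baseline   # absolute points, relative lift

abs_mde, rel_mde = detectable_mde(baseline=0.10, n_per_arm=30000)
print(f"detectable lift: {abs_mde:.4f} absolute, {rel_mde:.1%} relative")
```

If the relative figure printed here is larger than the lift you observed, the test could not reliably have detected an effect of that size.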

Q: Can AI run the stats itself, or does it just describe them?

With ChatGPT data analysis (GPT-5.5) or Claude with code execution (Opus 4.7 / Sonnet 4.6, as of June 2026), it writes and runs Python (`scipy.stats` for the z-test, t-test, and chi-square) and returns real numbers. In a plain chat without code execution it will estimate, which is exactly when it invents confidence intervals. Always confirm a tool ran the code.

Q: One-sided or two-sided test?

Default two-sided. A one-sided test is only appropriate when an effect in the opposite direction would lead to the same decision as no effect at all, which is rare. Using one-sided to make a borderline result “significant” is p-hacking.

Q: Should I trust a result after a sample ratio mismatch if the lift is huge?

No. SRM means the split is biased, so the “huge” lift may be an artifact of which users landed in which arm. Find and fix the cause, then rerun. A large effect on broken randomisation is not evidence.

Related

Growth experiment prompts — design the next experiment
MVP validation experiment prompts — validate before building
User retention experiment prompts — for retention-focused tests
Funnel analysis readout — pre-experiment funnel diagnosis
Retention cohort readout — same logic for cohort analysis
ChatGPT data analysis workflow — broader analysis workflow

Tags: #Data analysis #Workflow #Research
