Growth Experiment Prompts for Designing and Reading A/B Tests

12 prompts to design growth experiments that won't waste a quarter: falsifiable hypotheses, sample-size math, guardrail metrics, and honest reads of flat or negative results.

Published: May 17, 2026 | Updated: Jun 09, 2026 | Author: AI Productivity Guide Team

Growth experiments quietly waste quarters. A team ships a “winning” test that was called at 60% confidence after three days of peeking; another runs a variant that changed four things at once, then picks the metric that looks best after the fact. These 12 prompts force a clean design before the test goes live and an honest read after, including the result teams flinch from most: flat or significantly negative. Pair them with the feature prioritization prompts to decide what is worth testing in the first place.

TL;DR

Before launch: write a falsifiable hypothesis with an explicit null, lock the primary metric, and calculate sample size. Standard defaults are 95% confidence (p < 0.05) and 80% power.
Set guardrails: name 3-5 metrics that must NOT move (churn, support load, latency, error rate) with breach thresholds, before the variant ships.
Do not peek. Checking a running test repeatedly and stopping at the first significant reading inflates the false-positive rate from the nominal 5% to roughly 25% at around 20 looks. Run to your pre-calculated sample, or switch to a sequential test that corrects for repeated looks.
Read both kinds of significance: a result can be statistically significant (p < 0.05) and still business-trivial (a 0.1% lift that costs more to ship than it earns).
Which model: any current frontier model (GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro) reasons through experiment design well; for the result-reading prompts that involve arithmetic, use a reasoning mode (ChatGPT Thinking, Claude with extended thinking) and paste your real numbers.

Best for

SaaS pricing experiments
Onboarding flow tests
Landing page A/B tests
Email subject-line tests
Paid-ad creative tests

Default stats settings to lock before any test

The first three prompts below ask for these inputs. Decide them once and reuse them so every experiment is comparable.

Setting	Common default	What it controls
Significance level (alpha)	0.05 (95% confidence)	Tolerance for a false positive
Statistical power	80%	Chance of detecting a real effect that exists
Minimum detectable effect (MDE)	The smallest lift worth shipping	Smaller MDE needs much more traffic
Primary metric	Locked before launch	Prevents post-hoc metric shopping
Test type	Fixed-horizon or sequential	Sequential lets you monitor safely without a peeking penalty

If a 1% lift would not change a single decision, set your MDE to the smallest lift that would. The calculator will then tell you the honest traffic cost.
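To make the traffic cost concrete, here is a minimal sketch of the standard normal-approximation sample-size formula for comparing two conversion rates with a two-sided test. The function name and the example inputs (5% baseline, 10% relative MDE) are illustrative assumptions, not figures from this guide, and a dedicated calculator may use a slightly different formula and return a somewhat different number.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test.

    baseline: control conversion rate, e.g. 0.05
    relative_mde: smallest relative lift worth detecting, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, 10% relative MDE.
n_10 = sample_size_per_arm(0.05, 0.10)
# Halving the MDE to 5% relative roughly quadruples the traffic needed.
n_5 = sample_size_per_arm(0.05, 0.05)
print(n_10, n_5)
```

With these assumed inputs the first call lands at roughly 31,000 visitors per arm, and the second at close to four times that, which is why the MDE should be the smallest lift that would actually change a decision.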

1. Experiment hypothesis writer

For experiment idea {paste}, write a falsifiable hypothesis: "Changing {X} will cause {metric} to {direction + magnitude} because {mechanism}." Then state the null hypothesis explicitly. Output: how we will know we are wrong.

2. Sample-size and duration estimator

For test {paste hypothesis}, estimate sample size and duration. Inputs: baseline metric {value}, MDE I care about {%}, traffic per variant per day {N}, confidence level 95% (alpha 0.05), power 80%. Output: sample needed per arm, duration at current traffic, the smallest lift this traffic can actually detect, and the exact date to call it.

3. Single-variable test isolation

Below is my proposed A/B test. Audit whether the variant changes ONLY 1 variable from control. If multiple variables change, list each one and propose how to split into separate tests so each lift is attributable.

{paste}

4. Pre-test guardrail metrics

For test {paste hypothesis}, identify 5 guardrail metrics that should NOT move (or only within bounds): churn, support-ticket rate, page-load time, downstream conversion, error rate. For each, define the breach threshold that would auto-stop the test.
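A breach threshold is easiest to enforce when it is written down as data before launch. The sketch below is one possible shape for that check, not a method prescribed by this guide: the metric names, values, and 10% limits are hypothetical, and it only covers metrics where higher is worse (a metric such as downstream conversion, where a drop is the breach, would need the sign flipped). It also compares raw point estimates, so a real auto-stop rule would want a noise allowance on top.

```python
def breached_guardrails(control, variant, max_relative_increase):
    """Return (metric, relative_change) for every guardrail over its limit.

    All three arguments are dicts keyed by metric name. A limit of 0.10 means
    the variant may be at most 10% worse than control on that metric.
    """
    breaches = []
    for metric, limit in max_relative_increase.items():
        change = (variant[metric] - control[metric]) / control[metric]
        if change > limit:
            breaches.append((metric, round(change, 4)))
    return breaches


# Hypothetical readings and thresholds, fixed before the variant ships.
control = {"churn_rate": 0.020, "p95_latency_ms": 400}
variant = {"churn_rate": 0.021, "p95_latency_ms": 480}
limits = {"churn_rate": 0.10, "p95_latency_ms": 0.10}

print(breached_guardrails(control, variant, limits))
# [('p95_latency_ms', 0.2)]
```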

5. Post-test result reader

Below are experiment results. Read honestly: (a) is the lift statistically significant at 95% confidence, (b) is it practically/business significant given ship cost, (c) did any guardrail break, (d) did any segment effect differ from the average. Output a ship / kill / iterate verdict with the one number that decides it.

{paste results}

6. No-result test interpretation

My A/B test ran to full sample and showed no significant lift. Help me interpret honestly across four causes: the hypothesis was wrong, the test was underpowered, the variant was too small to matter, or the metric was wrong. Tell me which is most likely from the numbers, and suggest the next test.

{paste}

7. Negative-result decision

My test result was significantly negative: the variant performed worse than control. Below are details. Extract: (a) is this learning valuable and why, (b) what does it suggest about the underlying belief we held, (c) what else built on that belief now needs re-testing.

{paste}

8. Pricing A/B test design

I want to test {old price} vs {new price} for {product}. Output: hypothesis, sample plan, ethical considerations (grandfather existing customers, isolate new-signup segment), the metric that decides ship (conversion vs revenue per visitor vs LTV), and the unwind path if it kills LTV.

9. Onboarding-flow test design

I want to test a {variant} of my onboarding. Output: hypothesis, sample plan, the activation metric, the latency problem (activation may only show at 7 / 14 / 28 days, so a fast read can mislead), and how to avoid biasing the cohort by season or acquisition channel.

10. Ad-creative test design

I want to test 4 ad creatives for {product}. Output: a hypothesis per creative, sample plan per arm, the primary metric (CTR / CVR / CPA), and the secondary metrics that separate a "click magnet" (high CTR, low CVR) from a real "conversion driver".

11. Multi-arm prioritization

I have 8 experiment ideas and 1 traffic source. Below is each idea. Score by ICE (Impact x Confidence x Ease), show the math, and pick the 2 to run first with reasons. Flag any pair that can run in parallel without interfering, and any that would split traffic too thin to reach power.

{paste}
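The ICE arithmetic the prompt asks the model to show is simple enough to check by hand or in a few lines. This is a minimal sketch assuming each idea is scored 1 to 10 on impact, confidence, and ease; the idea names and scores are made up for illustration.

```python
def rank_by_ice(ideas, top_n=2):
    """Score each idea as impact * confidence * ease and return the top_n.

    ideas: list of (name, impact, confidence, ease) tuples.
    """
    scored = [
        (name, impact * confidence * ease)
        for name, impact, confidence, ease in ideas
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical ideas with hypothetical 1-10 scores.
ideas = [
    ("Shorter signup form", 6, 7, 8),   # 6 * 7 * 8 = 336
    ("Annual plan discount", 8, 5, 4),  # 8 * 5 * 4 = 160
    ("New hero headline", 4, 6, 9),     # 4 * 6 * 9 = 216
]

print(rank_by_ice(ideas))
# [('Shorter signup form', 336), ('New hero headline', 216)]
```

The score only ranks ideas. Whether two of them can share traffic without interfering, or would split it too thin to reach power, still needs the sample-size check from prompt 2.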

12. Experiment write-up template

My test just ended. Generate a 1-page write-up: hypothesis, design, sample, primary + guardrail metrics, result with confidence level, decision, the 1 surprise, and what we test next. Audience: company-wide. Keep it readable for non-data-science readers.

Common mistakes

Peeking. Calling a test early because the dashboard “looks decided” inflates the false-positive rate from the nominal 5% to around 25% if you check about 20 times. Run to the pre-calculated sample, or use a sequential test that is built to be checked continuously.
Multi-variable arms. Changing more than one thing per variant means a lift cannot be attributed to any single change. Split into single-variable tests (prompt 3).
No stated null. Without an explicit null hypothesis, any noisy result feels like confirmation.
No guardrails. The variant ships, support load doubles, and nobody notices for a week because no breach threshold was set.
Confusing the two significances. A statistically significant 0.1% lift is not automatically worth the engineering cost to ship it.
Metric shopping. Choosing the success metric after seeing the data instead of locking it before launch turns any test into a rationalization.
Underpowered tests read as “no effect.” “No lift” with too little traffic means you could not have seen the lift even if it existed; check power before concluding the hypothesis was wrong.

FAQ

What confidence and power should I use? The industry defaults are a 95% confidence level (alpha = 0.05, so p < 0.05 counts as significant) and 80% power. Raise confidence to 99% only when a false positive is expensive (pricing, irreversible changes); drop power below 80% only if you accept missing more real wins. The [sample-size estimator from Optimizely](https://www.optimizely.com/sample-size-calculator/) and similar tools take these as inputs.

Why is peeking such a big deal? Each time you look at a running test and consider stopping, you give yourself another chance to catch random noise above the threshold. Stopping the first time it crosses 95% can push your true false-positive rate from the nominal 5% to roughly 25% at around 20 looks, and higher with more. The fix is either a fixed horizon (only read once, at full sample) or a sequential test that mathematically corrects for repeated looks.
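The inflation is easy to reproduce in a small simulation. The sketch below is a simplified model rather than a full A/B pipeline: it assumes there is no real effect (an A/A test), that every look adds an equal-sized batch of data, and that the running result is summarized as a z-score checked against 1.96. The function name and parameters are illustrative.

```python
import math
import random


def false_positive_rate(looks, simulations=4000, z_crit=1.96, seed=0):
    """Share of no-effect experiments declared 'significant' at any look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # Each batch contributes a standard-normal increment under the null.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(k)  # z-score on all data so far
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team stops the test at the first "win"
    return false_positives / simulations


print("1 look:  ", false_positive_rate(1))
print("20 looks:", false_positive_rate(20))
```

Under these assumptions a single read stays near the nominal 5%, while 20 looks with stop-at-first-crossing lands in the neighborhood of 25%.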

Should I use Bayesian or frequentist analysis? Either works if applied honestly. Frequentist gives you a p-value and confidence interval and is the default in most tools. Bayesian gives the probability that B beats A directly, which is easier to explain, and handles continuous monitoring naturally. In practice the real speed gain for online tests comes from sequential methods, Bayesian or frequentist, not from the philosophy. Pick whatever your team and tooling already support and stay consistent.
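For a sense of what "the probability that B beats A" means in practice, here is a minimal Monte Carlo sketch for conversion data. It assumes a uniform Beta(1, 1) prior on each arm's conversion rate, which is a common default but still a choice your team should make deliberately; the function name and the counts in the example are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors with a Beta(1, 1) prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical counts: 500 of 10,000 convert on A, 600 of 10,000 on B.
print(prob_b_beats_a(500, 10_000, 600, 10_000))
```

A high probability here still says nothing about the size of the lift, so the business-significance check applies to Bayesian reads as much as to p-values.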

Can AI run the statistics for me? It can do the design reasoning and the arithmetic if you paste real numbers and use a reasoning mode (ChatGPT Thinking, Claude with extended thinking, Gemini 3.1 Pro). Verify any sample-size or p-value output against a dedicated calculator before you act, the same way you would with the prompts in our [business data analysis with AI guide](/en/articles/business-data-analysis-ai/). Treat the model as a fast second reader, not the source of truth.
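One way to cross-check a model's p-value is to recompute it yourself. The sketch below uses a pooled two-proportion z-test, which is one common frequentist choice and may differ from what your testing tool runs; the function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 500 of 10,000 convert on A, 600 of 10,000 on B.
z, p = two_proportion_z_test(500, 10_000, 600, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

If the model's answer and this recomputation disagree by more than rounding, trust neither until a dedicated calculator settles it.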

How long should a test run at minimum? Long enough to (a) reach the calculated sample size and (b) cover at least one full business cycle, usually one to two weeks, so weekday/weekend and payday effects average out. Hitting the sample number on a single anomalous Tuesday is not a finished test.
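Both conditions can be turned into a call date before launch. This is a minimal sketch under two assumptions that go beyond the text above: a floor of seven days stands in for "one full business cycle", and the duration is rounded up to whole weeks so every weekday is represented equally. The function name, start date, and traffic figures are hypothetical.

```python
import math
from datetime import date, timedelta


def call_date(start, sample_per_arm, daily_traffic_per_arm, min_days=7):
    """Earliest date to read the test: enough sample AND whole weeks covered."""
    days_for_sample = math.ceil(sample_per_arm / daily_traffic_per_arm)
    days = max(days_for_sample, min_days)
    full_weeks = math.ceil(days / 7)  # round up so weekday mix stays balanced
    return start + timedelta(days=full_weeks * 7)


# Hypothetical: 31,000 visitors needed per arm at 2,500 per arm per day
# takes 13 days of traffic, which rounds up to two full weeks.
print(call_date(date(2026, 6, 1), 31_000, 2_500))
# 2026-06-15
```

Adjust `min_days` if your business cycle is longer than a week, for example when payday or monthly billing effects matter.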

Tags: #Prompt #Product startup #KPI
