The task
The chart on your screen says 100% → 62% → 41% → 18% → 5%. Your head of growth is convinced step 4 → 5 (the 18% to 5% drop) is the problem because “look at that cliff.” The product manager is convinced it is step 3 → 4. You suspect it is actually step 2, but you cannot quite articulate why. You have one analyst-day this week, you have ~$3,000 in test budget, and you need to walk into Thursday’s meeting with a real recommendation — the single highest-ROI test to run next, plus an explanation of why the obvious “biggest cliff” is sometimes the wrong place to look.
Where AI helps — and where it does not
AI is good at applying the funnel analysis frameworks consistently — relative drop vs absolute drop, expected vs actual, conversion concentration vs distribution. It is also good at proposing hypotheses for each step’s drop and ranking which one is worth testing first. Where AI fails: knowing your industry’s actual benchmarks. The model has rough priors for “typical SaaS signup → activation conversion” but they are imprecise and often outdated. Feed it competitor numbers if you have them (from public case studies, podcasts, or trial accounts you measured yourself). The narrower the benchmark, the sharper the readout.
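The relative-vs-absolute distinction can be checked by hand before any prompt is written. The sketch below is illustrative only: it uses the cumulative percentages from the chart in the scenario, and the function name `step_drops` is an assumption made for this example.

```python
# Cumulative share of top-of-funnel users still present at each step,
# taken from the chart in the scenario (100% -> 62% -> 41% -> 18% -> 5%).
cumulative = [1.00, 0.62, 0.41, 0.18, 0.05]


def step_drops(cumulative):
    """Return one row per transition with the drop in points and in relative terms."""
    rows = []
    for i in range(1, len(cumulative)):
        before, after = cumulative[i - 1], cumulative[i]
        rows.append({
            "transition": f"{i} -> {i + 1}",
            # Percentage points of top-of-funnel users lost at this transition.
            "drop_pp": round((before - after) * 100, 1),
            # Share of users who reached the earlier step and then left.
            "relative_drop": round((before - after) / before, 3),
        })
    return rows


drops = step_drops(cumulative)
for row in drops:
    print(row)

biggest_pp = max(drops, key=lambda r: r["drop_pp"])
biggest_rel = max(drops, key=lambda r: r["relative_drop"])
print("largest drop in points:", biggest_pp["transition"])
print("largest relative drop:", biggest_rel["transition"])
```

On these numbers the two views disagree: the 18% to 5% cliff is the largest relative drop, while step 1 → 2 loses the most percentage points. Neither view says anything about benchmark, which is why both go into the prompt alongside benchmark rates.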
A common failure mode: the model defaults to “the biggest absolute drop is the problem,” which is sometimes right but more often wrong. A 50% drop at step 3 might be entirely industry-normal, while a “small” 8% drop at step 2 might be 40% worse than benchmark and the real lever. Force the model to compare each step to its benchmark, not to other steps.
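A minimal sketch of the benchmark comparison, assuming you have one benchmark rate per step. The actual rates are derived from the chart in the scenario; every benchmark value here is a made-up placeholder, not a real industry figure, and `rank_by_benchmark_gap` is a name invented for this example.

```python
# Step conversion rates. "actual" comes from the 100/62/41/18/5 chart;
# "benchmark" values are hypothetical placeholders, NOT real industry data.
steps = [
    {"transition": "1 -> 2", "actual": 0.62, "benchmark": 0.70},
    {"transition": "2 -> 3", "actual": 0.66, "benchmark": 0.65},
    {"transition": "3 -> 4", "actual": 0.44, "benchmark": 0.60},
    {"transition": "4 -> 5", "actual": 0.28, "benchmark": 0.27},
]


def rank_by_benchmark_gap(steps):
    """Sort steps so the one furthest below its benchmark comes first."""
    ranked = []
    for s in steps:
        diff = s["actual"] - s["benchmark"]
        ranked.append({
            **s,
            "gap_pp": round(diff * 100, 1),                    # gap in percentage points
            "relative_gap": round(diff / s["benchmark"], 3),   # gap as a share of benchmark
        })
    # Most negative relative gap first.
    return sorted(ranked, key=lambda r: r["relative_gap"])


ranked = rank_by_benchmark_gap(steps)
worst = ranked[0]
for row in ranked:
    print(row["transition"], row["gap_pp"], row["relative_gap"])
```

With these placeholder benchmarks, the visually dramatic 4 → 5 step lands slightly above benchmark and step 3 → 4 is the real outlier. Swap in your own benchmark numbers before drawing any conclusion.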
What to feed the AI
- Funnel steps with absolute numbers AND conversion rates between each (raw counts matter for sample-size math)
- Industry or competitor benchmark per step if known — even rough numbers help
- Recent changes to any step in the last 90 days (UI redesign, new copy, added requirement, removed friction)
- The traffic source mix at the top of funnel — paid vs organic vs referral usually have very different funnel shapes
- Segment breakdown if you have it — desktop vs mobile, new vs returning, plan tier
- Sample size per step: a step that converts 1,000 of 2,000 users can support an A/B test; a step that converts 20 of 40 cannot
- Time period covered — a 30-day funnel during a launch is different from a steady-state 90-day funnel
- Your goal in one sentence — “I want to know what to test next” produces different output than “I want to know if the funnel is healthy enough to scale ad spend”
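The first item on the list (raw counts plus conversion rates) is easy to get wrong when typed by hand. One possible way to generate it is sketched below; the step labels and counts are hypothetical, chosen only to match the 100 / 62 / 41 / 18 / 5 chart, and the function assumes every count is positive.

```python
def format_funnel(step_counts):
    """Render (name, count) pairs as paste-ready lines with step and cumulative rates."""
    lines = []
    top = step_counts[0][1]
    previous = None
    for name, count in step_counts:
        if previous is None:
            lines.append(f"{name}: {count:,} users")
        else:
            lines.append(
                f"{name}: {count:,} users "
                f"({count / previous:.1%} of previous step, {count / top:.1%} of top)"
            )
        previous = count
    return "\n".join(lines)


# Hypothetical counts; replace with your own export.
funnel = [
    ("Step 1", 10_000),
    ("Step 2", 6_200),
    ("Step 3", 4_100),
    ("Step 4", 1_800),
    ("Step 5", 500),
]

text = format_funnel(funnel)
print(text)
```

Keeping the raw counts next to the rates lets the model do sample-size math instead of guessing volume from percentages.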
Copy-ready prompt
Read out this funnel and recommend the next test.
Funnel steps (absolute counts + conversion rate between each): {paste}
Industry / competitor benchmark per step (rough is fine): {paste or "unknown"}
Recent changes to each step (last 90 days): {paste}
Traffic source mix at top of funnel: {paid % / organic % / referral %}
Segment breakdown if available: {desktop/mobile, new/returning, plan tier}
Time period covered: {dates, total days}
My goal: {what decision this readout supports}
Return:
1) The step with the biggest gap vs benchmark — NOT the biggest absolute drop. Show the math: actual rate, benchmark rate, gap. Call out if the biggest absolute drop is actually within benchmark range (i.e., not the real problem).
2) Hypothesis for the biggest-gap step in one sentence — what could be causing this gap given the recent changes and traffic mix.
3) The single test I should run first — name the change (copy, layout, requirement removed, friction added), the segment to target, the success metric, the sample size needed, and the expected runtime.
4) Tests NOT worth running here — name 1-2 obvious candidates that are wrong calls, and why (insufficient sample size, drop is benchmark-normal, downstream confound).
5) "Watch this number weekly" — the leading indicator that would tell me if a deeper problem is forming. Usually a step that is currently OK but trending the wrong direction.
Rules:
- If the sample size at a step is too small to A/B test (<100 conversions per variant in 4 weeks), say so and propose a qualitative test instead.
- Do not propose 5 tests; force-rank to 1 with reasoning. The reader will ask for #2 only if they reject #1.
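The first rule in the prompt can be sanity-checked yourself before trusting the model's answer. This is a rough sketch, assuming an even traffic split between two variants and a stable step rate; the function names and the example volumes are hypothetical.

```python
def conversions_per_variant(weekly_entrants, step_rate, weeks=4, variants=2):
    """Expected conversions each variant collects when traffic is split evenly."""
    return weekly_entrants * weeks * step_rate / variants


def recommend_test_type(weekly_entrants, step_rate, threshold=100):
    """Apply the prompt's rule: under 100 conversions per variant in 4 weeks means no A/B test."""
    expected = conversions_per_variant(weekly_entrants, step_rate)
    return "A/B test" if expected >= threshold else "qualitative test"


# Hypothetical volumes: a busy mid-funnel step and a thin late-funnel step.
busy_step = recommend_test_type(weekly_entrants=500, step_rate=0.44)
thin_step = recommend_test_type(weekly_entrants=40, step_rate=0.28)
print(busy_step, "|", thin_step)
```

The 100-conversion threshold is the prompt's own rule of thumb, not a statistical guarantee; a proper power calculation may ask for more.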
Shorter variant — segment cut audit
Below is the funnel cut by segment: {paste}
Identify the segment that distorts the overall funnel most — i.e., a segment that converts very differently from the average. Specifically:
- The segment driving the headline drop-off.
- A segment hidden under the average that is converting fine and would be lost if we rebuilt the funnel for the loud segment.
- The segment we should probably stop sending paid traffic to until the rest of the funnel improves.
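Before pasting a segment cut, it can help to see which segments pull the blended rate up or down. The sketch below is one way to do that for a single step; the segment names and counts are invented for illustration, and "conversions vs blended" is an ad hoc measure chosen for this example rather than a standard metric.

```python
# Hypothetical segment cut for one funnel step: segment -> (entered, converted).
segments = {
    "desktop / organic": (3_000, 270),
    "desktop / paid": (2_000, 140),
    "mobile / organic": (2_500, 75),
    "mobile / paid": (2_500, 15),
}


def segment_readout(segments):
    """Return the blended rate and per-segment rows, worst drag on the average first."""
    total_in = sum(entered for entered, _ in segments.values())
    total_out = sum(converted for _, converted in segments.values())
    blended = total_out / total_in
    rows = []
    for name, (entered, converted) in segments.items():
        rows.append({
            "segment": name,
            "rate": converted / entered,
            "share_of_entrants": entered / total_in,
            # Conversions gained or lost compared with converting at the blended rate.
            "conversions_vs_blended": converted - blended * entered,
        })
    rows.sort(key=lambda r: r["conversions_vs_blended"])
    return blended, rows


blended, rows = segment_readout(segments)
print(f"blended rate: {blended:.1%}")
for r in rows:
    print(f'{r["segment"]}: {r["rate"]:.1%}, {r["conversions_vs_blended"]:+.0f} vs blended')
```

In this made-up cut, the first row is the loud segment driving the headline drop, and the last row is the quiet segment that would be lost if the funnel were rebuilt around the loud one.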
Sample output
A useful readout: “The biggest gap vs benchmark is step 3 (signup → first action): you convert 44%, industry typical for similar B2B SaaS is 60%. Step 4 → 5 looks like the biggest cliff visually (18% → 5% of top-of-funnel users), but that is a step conversion of about 28%, which is within benchmark for trial-to-paid conversion in your category. Hypothesis for step 3: the empty-state has no clear next action, and the recent v2 onboarding rewrite removed the original ‘create your first X’ CTA. Test: revert the empty-state CTA from generic ‘explore’ back to specific ‘create your first X’ for new signups; target segment is the paid-traffic cohort (largest volume). Success metric: % of new signups completing first action within 24 hours, target 55%. Sample size needed: 1,400 per variant (~3 weeks at current volume). Watch weekly: trial-to-paid for the cohort that hits the new CTA, to confirm the lift carries downstream.”
A useful “not worth running” callout: “Not worth running: A/B test on step 5 (trial-to-paid copy variants). Only about 5% of top-of-funnel users reach step 5, which means 4 months to reach significance, by which time the v3 onboarding will have shipped and the test conditions will be invalid. Run a qualitative test instead: 8 interviews with trial users who did not convert.”
How to refine
- If AI picks the biggest absolute drop instead of the biggest gap vs benchmark: “Compare each step to benchmark, not to other steps. The biggest gap vs benchmark is the real lever; the biggest absolute drop is often industry-normal.”
- If 3+ tests are proposed: “Force-rank to 1. The reader will ask for #2 only if they reject #1. Multi-test recommendations dilute the call.”
- If sample size math is missing: “For the proposed test, calculate required sample size per variant at 80% power, 5% significance, expected effect size. If the runtime is >4 weeks, propose a qualitative test instead.”
- If segment cuts are missing: “Add segment analysis: redo the funnel for paid vs organic and for desktop vs mobile separately. Sometimes the headline drop is one segment dragging the average.”
- If the leading indicator is vague: “Name the specific weekly number to watch. ‘Engagement’ is not a leading indicator; ‘D1 retention for paid-traffic signups’ is.”
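The sample-size refinement above can be cross-checked with a short calculation. This sketch uses the common normal-approximation formula for a two-sided two-proportion z-test, which is an assumption on my part and not necessarily what the model or your testing tool uses; the baseline, target, and traffic figures are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, target, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect baseline -> target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (baseline + target) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(baseline * (1 - baseline) + target * (1 - target))
    ) ** 2
    return ceil(numerator / (target - baseline) ** 2)


def runtime_weeks(n_per_variant, weekly_entrants, variants=2):
    """Weeks needed to fill every variant at the given weekly volume."""
    return n_per_variant * variants / weekly_entrants


# Hypothetical step: 10% baseline, hoping for 12%, 1,500 entrants per week.
n = sample_size_per_variant(baseline=0.10, target=0.12)
weeks = runtime_weeks(n, weekly_entrants=1_500)
verdict = "A/B test" if weeks <= 4 else "qualitative test"
print(n, round(weeks, 1), verdict)
```

Small expected lifts dominate the result: halving the detectable effect roughly quadruples the required sample, which is why late-funnel steps so often fail the 4-week rule.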
Common mistakes
- Optimizing the biggest absolute drop without checking benchmark: sometimes that drop is industry-normal and the real lever lives 2 steps earlier with a small but unusual drop.
- Running A/B tests on late-funnel steps with insufficient sample size: 4 months to reach significance on a step that will change before the test ends; qualitative interviews are faster and cheaper.
- Ignoring step 1 because it is “100% by definition”: the real top-of-funnel is traffic → step 1, and step 1’s conversion rate from traffic is often the highest-leverage drop you forgot to measure.
- No segment cut: a 5% conversion that is 8% on desktop and 1% on mobile is not the same as 5% everywhere; the rollout strategy is different.
- Treating all funnel steps as equally fixable: late steps usually require product changes (engineering quarters), early steps often respond to copy / layout (one week); weight the test recommendation by fix cost.
- No traffic-source consideration: paid traffic, organic, and referral often have very different funnel shapes; one funnel chart for all three hides the real picture.
- Confusing leading and lagging indicators: “monthly revenue” lags everything; “D1 retention of new signups” tells you next month’s revenue this week.
- Not asking who will read the readout: leadership wants the test recommendation; product wants the hypothesis. The audience determines the shape.
FAQ
- Where do I get industry benchmarks?: Trade reports (industry-specific), the Mixpanel / Amplitude annual benchmarks, OpenView’s SaaS benchmarks, competitor case studies, and conference talks where founders share numbers. Take any single source with a grain of salt; cross-reference 2-3 if possible.
- How small can a funnel step be and still warrant A/B testing?: Calculate the required sample size for your expected effect size at 80% power. If you need 6 months to reach significance, run a qualitative test (5-8 user interviews) instead. The point of testing is to learn fast.
- What if the funnel keeps changing as we ship new features?: Baseline before each change, then compare. The funnel is most useful as a delta tracker; comparing this month’s absolute numbers to last year’s is rarely insightful if the product changed.
- Should the AI also predict what the funnel will look like after the test?: Use it for a sanity check on expected effect size (e.g., “this test would have to lift step 3 by at least 15pp to be worth it given the engineering cost”). Do not use it as a real forecast.
- What about non-linear funnels (branching, loops, multi-product)?: AI handles linear funnels well; for branching, feed each branch as its own funnel and ask the model to compare. For loops (re-engagement), the right framing is cohort retention, not a funnel; see the linked retention cohort article.
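The effect-size sanity check mentioned in the forecasting answer above can be approximated with simple break-even arithmetic. Everything in this sketch is a hypothetical assumption: the cost, volumes, downstream rate, customer value, and the 12-month horizon are placeholders, and the model ignores churn, discounting, and ramp-up.

```python
def break_even_lift_pp(test_cost, monthly_entrants, downstream_rate,
                       value_per_customer, horizon_months=12):
    """Percentage-point lift at a step needed for the extra revenue to cover the cost."""
    # Value of one extra conversion at this step, given how many go on to pay.
    value_per_step_conversion = downstream_rate * value_per_customer
    extra_conversions_needed = test_cost / value_per_step_conversion
    entrants_over_horizon = monthly_entrants * horizon_months
    return extra_conversions_needed / entrants_over_horizon * 100


# Hypothetical inputs: a $30,000 build, 1,000 users entering the step per month,
# 20% of step converters eventually paying, $1,000 value per paying customer.
lift_needed = break_even_lift_pp(
    test_cost=30_000,
    monthly_entrants=1_000,
    downstream_rate=0.20,
    value_per_customer=1_000,
)
print(f"break-even lift: {lift_needed:.2f} pp")
```

If the break-even lift is larger than anything the hypothesis could plausibly deliver, that is a reason to drop the test, not a forecast of what the test will achieve.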