
Read Out a Funnel Analysis With AI

Find the step with the biggest gap vs benchmark — not the biggest absolute drop — and surface the single highest-ROI test to run next, plus the tests not worth running.

Published: May 17, 2026 Updated: Jun 05, 2026 Author: AI Productivity Guide Team

TL;DR

Paste your funnel (absolute counts AND step conversion rates) plus any industry benchmark into ChatGPT (GPT-5.5), Claude (Opus 4.7), or Gemini 3.1 Pro and force one rule: rank steps by gap vs benchmark, not by absolute drop. The biggest visual cliff is often industry-normal; the real lever is usually a “small” drop that sits well below benchmark. Then make the model force-rank to one test, show the sample-size math, and name the tests that are not worth running. As of June 2026, the median B2B SaaS activation rate is ~34% (Amplitude), so a 44% step is fine and a 22% step is the emergency, even if a later step drops a bigger number of percentage points.

The task

The chart on your screen says 100% → 62% → 41% → 18% → 5%. Your head of growth is convinced step 4 → 5 (the 18% to 5% drop) is the problem because “look at that cliff.” The product manager is convinced it is step 3 → 4. You suspect it is actually step 2, but you cannot quite articulate why. You have one analyst-day this week, ~$3,000 in test budget, and you need to walk into Thursday’s meeting with a real recommendation: the single highest-ROI test to run next, plus an explanation of why the obvious “biggest cliff” is sometimes the wrong place to look.

Where AI helps — and where it does not

AI is good at applying funnel frameworks consistently: relative drop vs absolute drop, expected vs actual, conversion concentration vs distribution. It is good at proposing a hypothesis for each step’s drop and force-ranking which one to test first. Where it fails: knowing your industry’s actual benchmarks. The model has rough priors for “typical SaaS signup → activation” but they are imprecise and often a release or two out of date. Feed it real numbers — from public case studies, the Amplitude / Mixpanel annual reports, or trial accounts you measured yourself. The narrower the benchmark you supply, the sharper the readout.

The common failure mode: the model defaults to “the biggest absolute drop is the problem.” Sometimes that matches reality; more often it does not. A step that loses 50% of its users can be entirely industry-normal, while a step with a “small” 8-point drop on the chart might convert 40% below its benchmark and be the real lever. Force the model to compare each step to benchmark, not to the other steps.
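
To make the distinction concrete, here is a minimal sketch of ranking by gap vs benchmark instead of by raw drop. The step rates follow the funnel in the task above; every benchmark value and the `Step` / `rank_by_gap` names are illustrative assumptions, not published figures.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    actual: float     # step conversion rate, 0-1
    benchmark: float  # benchmark step conversion rate, 0-1 (hypothetical here)


def relative_gap(step: Step) -> float:
    """Negative means the step converts below its benchmark."""
    return (step.actual - step.benchmark) / step.benchmark


def rank_by_gap(steps: list[Step]) -> list[Step]:
    """Sort so the step furthest below its benchmark comes first."""
    return sorted(steps, key=relative_gap)


steps = [
    Step("step 1", 0.62, 0.60),
    Step("step 2", 0.66, 0.85),
    Step("step 3", 0.44, 0.45),
    Step("step 4", 0.28, 0.30),
]

biggest_drop = min(steps, key=lambda s: s.actual)  # the visual cliff
biggest_gap = rank_by_gap(steps)[0]                # the likely lever

print("Biggest drop:", biggest_drop.name)
print("Biggest gap vs benchmark:", biggest_gap.name, f"({relative_gap(biggest_gap):.0%})")
```

With these made-up benchmarks the cliff is step 4, but the step furthest below benchmark is step 2, which is the pattern you want the model to catch.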

Which model to use

Any of the three flagships handles this well as of June 2026. Pick by how big your paste is:

Model	Best for funnel readouts	Context
Claude Opus 4.7	Tightest force-ranking and sample-size reasoning	1M tokens
Gemini 3.1 Pro	Large multi-segment dumps (full BigQuery exports)	1M tokens
ChatGPT GPT-5.5 (Thinking)	Fast back-and-forth on a single funnel	~320 pages in-app on Plus; full 1M only on $200 Pro

For a single funnel of a few dozen rows, all three are interchangeable. If you are pasting a 12-segment cut with 90 days of daily counts, use Opus 4.7 or Gemini 3.1 Pro so nothing gets truncated. See the model comparison guide if you have not picked a default.

Benchmark anchors (as of June 2026)

Use these as a sanity check when you have no industry-specific source. They come from the 2026 Amplitude Product Intelligence Report and widely cited trial-conversion studies; treat them as priors, not gospel, and override them with your own vertical’s numbers whenever you have them.

Funnel stage	Median	Top quartile	Bottom quartile
B2B SaaS signup → activation	~34%	55–65%	<18%
Free-to-paid (no credit card)	~8%	15–25%	<3%
Free trial requiring a credit card	~30%	—	—

The practical takeaway: 24-hour activation correlates almost linearly with 30-day paid conversion, so a below-benchmark activation step is usually worth more than a below-benchmark trial-to-paid step — it moves everything downstream. That is exactly the logic you want the model to apply.
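
If you want a quick pre-check before pasting anything into a model, the activation row of the table can be encoded as a lookup. This is a rough sketch: the thresholds are read straight off the table, and using 55% (the lower edge of the 55–65% range) as the top-quartile cutoff is an assumption.

```python
# Thresholds for B2B SaaS signup -> activation, taken from the table above.
BOTTOM_QUARTILE = 0.18
MEDIAN = 0.34
TOP_QUARTILE = 0.55  # assumption: lower edge of the 55-65% range


def activation_band(rate: float) -> str:
    """Place an activation rate (0-1) in a rough benchmark band."""
    if rate < BOTTOM_QUARTILE:
        return "bottom quartile"
    if rate < MEDIAN:
        return "below median"
    if rate < TOP_QUARTILE:
        return "above median"
    return "top quartile"


for rate in (0.22, 0.44):
    print(f"{rate:.0%} activation: {activation_band(rate)}")
```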

What to feed the AI

Funnel steps with absolute numbers AND conversion rates between each (raw counts matter for sample-size math)
Industry or competitor benchmark per step if known — even rough numbers help
Recent changes to any step in the last 90 days (UI redesign, new copy, added requirement, removed friction)
The traffic source mix at the top of funnel — paid vs organic vs referral usually have very different funnel shapes
Segment breakdown if you have it — desktop vs mobile, new vs returning, plan tier
Sample size per step — a step that converts 1k/2k can A/B test; a step that converts 20/40 cannot
Time period covered — a 30-day funnel during a launch is different from a steady-state 90-day funnel
Your goal in one sentence — “I want to know what to test next” produces different output than “I want to know if the funnel is healthy enough to scale ad spend”
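
For the first item in the list, a short script can turn raw counts into the paste-ready form (absolute counts plus step and overall conversion). This is a sketch only: the counts are invented to match the 100% → 62% → 41% → 18% → 5% chart from the task, and `funnel_table` is a hypothetical helper name.

```python
def funnel_table(names: list[str], counts: list[int]) -> list[str]:
    """Format each step with its count, step conversion, and overall conversion."""
    lines = []
    for i, (name, count) in enumerate(zip(names, counts)):
        if i == 0:
            # The first step has no previous step to convert from.
            lines.append(f"{name}: {count:,}")
        else:
            step = count / counts[i - 1]  # conversion from the previous step
            overall = count / counts[0]   # conversion from the top of the funnel
            lines.append(f"{name}: {count:,} (step {step:.1%}, overall {overall:.1%})")
    return lines


names = ["Step 1", "Step 2", "Step 3", "Step 4", "Step 5"]
counts = [10_000, 6_200, 4_100, 1_800, 500]  # illustrative counts
lines = funnel_table(names, counts)
print("\n".join(lines))
```

Pasting both numbers per step matters because the chart shows overall conversion, while benchmarks are usually quoted as step conversion.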

Copy-ready prompt

Read out this funnel and recommend the next test.

Funnel steps (absolute counts + conversion rate between each): [paste]
Industry / competitor benchmark per step (rough is fine): [paste or "unknown"]
Recent changes to each step (last 90 days): [paste]
Traffic source mix at top of funnel: [paid % / organic % / referral %]
Segment breakdown if available: [desktop/mobile, new/returning, plan tier]
Time period covered: [dates, total days]
My goal: [what decision this readout supports]

Return:
1) The step with the biggest gap vs benchmark — NOT the biggest absolute drop. Show the math: actual rate, benchmark rate, gap. Call out if the biggest absolute drop is actually within benchmark range (i.e., not the real problem).
2) Hypothesis for the biggest-gap step in one sentence — what could be causing this gap given the recent changes and traffic mix.
3) The single test I should run first — name the change (copy, layout, requirement removed, friction added), the segment to target, the success metric, the sample size needed, and the expected runtime.
4) Tests NOT worth running here — name 1-2 obvious candidates that are wrong calls, and why (insufficient sample size, drop is benchmark-normal, downstream confound).
5) "Watch this number weekly" — the leading indicator that would tell me if a deeper problem is forming. Usually a step that is currently OK but trending the wrong direction.

Rules:
- If the sample size at a step is too small to A/B test (<100 conversions per variant in 4 weeks), say so and propose a qualitative test instead.
- Do not propose 5 tests; force-rank to 1 with reasoning. The reader will ask for #2 only if they reject #1.

Shorter variant — segment cut audit

Below is the funnel cut by segment: [paste]
Identify the segment that distorts the overall funnel most — i.e., a segment that converts very differently from the average. Specifically:
- The segment driving the headline drop-off.
- A segment hidden under the average that is converting fine and would be lost if we rebuilt the funnel for the loud segment.
- The segment we should probably stop sending paid traffic to until the rest of the funnel improves.
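
You can sanity-check the model's segment answer with a few lines of arithmetic. The sketch below assumes two hypothetical segments with made-up counts, and `segment_report` is an illustrative name; it ranks segments by how far their conversion sits from the blended rate.

```python
def segment_report(segments: dict[str, tuple[int, int]]):
    """segments maps name -> (entered, converted) for one funnel step."""
    total_in = sum(entered for entered, _ in segments.values())
    total_out = sum(converted for _, converted in segments.values())
    overall = total_out / total_in

    report = []
    for name, (entered, converted) in segments.items():
        rate = converted / entered
        # (segment, its rate, deviation from blended rate, share of traffic)
        report.append((name, rate, rate - overall, entered / total_in))

    # Largest deviation from the blended rate first.
    report.sort(key=lambda row: abs(row[2]), reverse=True)
    return overall, report


segments = {"desktop": (4_000, 320), "mobile": (3_000, 30)}  # illustrative counts
overall, report = segment_report(segments)

print(f"Blended: {overall:.1%}")
for name, rate, deviation, share in report:
    print(f"{name}: {rate:.1%} ({deviation:+.1%} vs blended, {share:.0%} of traffic)")
```

Here the blended 5% hides an 8% desktop rate and a 1% mobile rate, so a fix designed for the average would suit neither segment.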

Sample output

A useful readout: “Step 3 (signup → first action) is not your problem: you convert 44%, and the 2026 Amplitude median for B2B SaaS activation is 34%, so step 3 is above benchmark. The biggest gap vs benchmark is step 2 → 3: the blended rate is 41%, and the recent v2 onboarding rewrite should have lifted it, but for your paid-traffic cohort it runs 22%, well below the ~34% median. Step 4 → 5 looks like the biggest cliff visually (18% → 5%), but only 18% of users reach that step, and 5% paid conversion is within range of the ~8% no-credit-card median for your category. Hypothesis: the v2 empty-state removed the original ‘create your first X’ CTA, so paid signups land with no clear next action. Test: revert the empty-state CTA from generic ‘Explore’ to specific ‘Create your first X’ for new paid-traffic signups. Success metric: % completing first action within 24 hours, target 55%. Sample size: ~1,400 per variant (~3 weeks at current volume). Watch weekly: trial-to-paid for the cohort that hits the new CTA, to confirm the lift carries downstream.”

A useful “not worth running” callout: “Not worth running: an A/B test on step 5 (trial-to-paid copy variants). At 5% of step 4’s volume you would need roughly four months to reach significance, by which time the v3 onboarding will have shipped and the test conditions will be invalid. Run a qualitative test instead — 8 interviews with trial users who did not convert.”
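
The runtime claims in both callouts are simple division, so they are easy to verify yourself. A minimal sketch, assuming a two-variant test, a four-week cutoff, and a hypothetical 1,000 users entering the step each week (the function names and the 9,000 figure are also illustrative):

```python
from math import ceil


def runtime_weeks(n_per_variant: int, weekly_entrants: int, variants: int = 2) -> int:
    """Weeks needed to fill every variant at the current weekly volume."""
    return ceil(n_per_variant * variants / weekly_entrants)


def choose_method(n_per_variant: int, weekly_entrants: int, max_weeks: int = 4):
    """Return ('a/b test' or 'qualitative interviews', estimated weeks)."""
    weeks = runtime_weeks(n_per_variant, weekly_entrants)
    method = "a/b test" if weeks <= max_weeks else "qualitative interviews"
    return method, weeks


print(choose_method(1_400, 1_000))  # early step with enough volume
print(choose_method(9_000, 1_000))  # late step that needs a larger sample
```

If the model's stated runtime does not match this arithmetic for your real volume, ask it to show its work.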

How to refine

If AI picks the biggest absolute drop instead of the biggest gap vs benchmark: “Compare each step to benchmark, not to other steps. The biggest gap vs benchmark is the real lever; the biggest absolute drop is often industry-normal.”
If 3+ tests are proposed: “Force-rank to 1. The reader will ask for #2 only if they reject #1. Multi-test recommendations dilute the call.”
If sample size math is missing: “For the proposed test, calculate required sample size per variant at 80% power, 5% significance, and the expected effect size. Remember required sample scales with 1 / MDE² — halving the lift you want to detect quadruples the sample. If runtime is >4 weeks, propose a qualitative test instead.”
If segment cuts are missing: “Add segment analysis: redo the funnel for paid vs organic and for desktop vs mobile separately. Sometimes the headline drop is one segment dragging the average.”
If the leading indicator is vague: “Name the specific weekly number to watch. ‘Engagement’ is not a leading indicator; ‘D1 retention for paid-traffic signups’ is.”
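
To check the model's sample-size math, you can run the standard two-proportion approximation yourself. This sketch assumes a two-sided test with equal-sized variants and uses only the Python standard library; `sample_size_per_variant` is an illustrative name, and other calculators may use slightly different formulas and return somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors per variant to detect a relative lift in a conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_10 = sample_size_per_variant(0.05, 0.10)  # 5% baseline, 10% relative lift
n_05 = sample_size_per_variant(0.05, 0.05)  # same baseline, half the lift

print(f"10% lift: {n_10:,} per variant")
print(f" 5% lift: {n_05:,} per variant ({n_05 / n_10:.1f}x)")
```

For a 5% baseline and a 10% relative lift this lands at roughly 31,000 per variant, and halving the lift roughly quadruples the requirement, which is the 1 / MDE² scaling in practice.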

Common mistakes

Optimizing the biggest absolute drop without checking benchmark: sometimes that drop is industry-normal and the real lever lives 2 steps earlier with a small but unusual drop.
Running A/B tests on late-funnel steps with insufficient sample size: four months to reach significance on a step that will change before the test ends; qualitative interviews are faster and cheaper.
Ignoring step 1 because it is “100% by definition”: the real top-of-funnel is traffic → step 1, and step 1’s conversion rate from traffic is often the highest-leverage drop you forgot to measure.
No segment cut: a 5% conversion that is 8% on desktop and 1% on mobile is not the same as 5% everywhere; the rollout strategy is different.
Treating all funnel steps as equally fixable: late steps usually require product changes (engineering quarters), early steps often respond to copy / layout (one week); weight the test recommendation by fix cost.
No traffic-source consideration: paid traffic, organic, and referral often have very different funnel shapes; one funnel chart for all three hides the real picture.
Confusing leading and lagging indicators: “monthly revenue” lags everything; “D1 retention of new signups” tells you next month’s revenue this week.
Not asking who to share the readout with: leadership wants the test recommendation; product wants the hypothesis; the reader determines the shape.

FAQ

Where do I get industry benchmarks?

The [2026 Amplitude Product Intelligence Report](https://amplitude.com/), Mixpanel’s annual benchmarks, OpenView’s SaaS benchmarks, competitor case studies, and conference talks where founders share numbers. Both Amplitude (free up to 50K monthly tracked users) and Mixpanel (free up to 20M events/month, funnels included) let you measure your own funnel for free, which beats any external average. Cross-reference 2–3 sources before trusting a single one.

How small can a funnel step be and still warrant A/B testing?

Calculate the required sample size for your expected effect size at 80% power and 5% significance. For a 5% baseline and a 10% relative lift you need roughly 31,000 visitors per variant; required sample scales with 1 / MDE², so smaller lifts get expensive fast. If you need 6 months to reach significance, run a qualitative test (5–8 user interviews) instead.

Which model should I paste into?

As of June 2026, Claude Opus 4.7 and Gemini 3.1 Pro both carry 1M-token context, so they swallow large multi-segment exports without truncation; ChatGPT GPT-5.5 on Plus handles ~320 pages in-app (full 1M only on the $200 Pro tier). For one small funnel, any of the three is fine.

What if the funnel keeps changing as we ship new features?

Baseline before each change, then compare. The funnel is most useful as a delta tracker; comparing this month’s absolute numbers to last year’s is rarely insightful if the product changed.

Should the AI also predict what the funnel will look like after the test?

Use it for a sanity check on expected effect size (e.g., “this test would have to lift step 3 by at least 15pp to be worth it given the engineering cost”). Do not use it as a real forecast.

What about non-linear funnels (branching, loops, multi-product)?

AI handles linear funnels well; for branching, feed each branch as its own funnel and ask the model to compare. For loops (re-engagement), the right framing is cohort retention, not a funnel; see the linked retention cohort article.

Tags: #AI writing #Data analysis #Workflow #Funnel
