AI User Interview Question Generator That Avoids Leading

Updated for 2026: use AI to draft interview questions that surface real behavior instead of confirming what you already believe, with constraints that block the most common leading-question patterns.

Published: May 23, 2026 Updated: Jun 09, 2026 Author: AI Productivity Guide Team

Most interview question drafts fail in the same way. They ask “would you use this?” instead of “what did you do last Tuesday?” They smuggle in the answer. They invite users to be polite. AI is very good at producing this kind of bad question by default, and very good at producing the opposite if you give it the right rules and the right hypothesis to test.

TL;DR: Hand the AI three things — a one-sentence hypothesis, the exact user segment, and a list of banned words — and ask for behavior questions anchored to a specific past time window. The rules in the prompt below are drawn straight from Rob Fitzpatrick’s The Mom Test: talk about the user’s life, not your idea; ask about the past, not the future; never name the thing you are testing. As of June 2026, Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro all draft these well once constrained — but only if you keep the leash short.

The two failure patterns, side by side

The whole job is converting the left column into the right column. Give the AI both columns so it knows the bar.

Leading / opinion question (delete)	Behavioral rewrite (keep)
“Would you pay for unlimited habits?”	“Walk me through the last time you hit a wall and wanted more room.”
“How do you feel about the 50-habit limit?”	“Show me your list. When did you last delete something, and why?”
“What features would you love to see?”	“What did you try to track last month that you gave up on?”
“Do you like the new dashboard?”	“Open the app and narrate what you do first, before you think about it.”
“Would this save you time?”	“How long did your last weekly review actually take? Walk me through it.”

The pattern: every good question points at a specific past event the user can replay from memory. The bad ones ask for a prediction or an opinion, which is where politeness and bias live.

The task

You are planning 30-minute interviews with 6-10 users. You have a hypothesis you are trying to either kill or refine. You need a question list that surfaces actual behavior in the recent past, not opinions about an imagined future, and that does not telegraph what you want to hear.

When this is the right job for AI

You can write your hypothesis in one sentence. If you cannot, the interview is not ready.
You know who you are talking to and what their recent context is (last 2 weeks of relevant behavior).
You will read the draft critically and strip every question that contains a value-loaded word.
You can run the interview with the discipline to follow up on stories, not on opinions.

What to feed the AI

The hypothesis in one sentence (“power users who hit the 50-habit limit churn within 2 weeks”)
The user segment and how they were recruited (“paying users who have logged 40+ habits, recruited from in-app prompt”)
The decision the interview is informing (raise the limit? sell a tier? auto-archive?)
A list of taboo words you do not want in any question (“would,” “could,” “feature,” your product name as a noun)
2-3 examples of questions you already know are bad, so the AI can calibrate the bar

Copy-ready prompt

You are writing user interview questions for a 30-minute call.

Hypothesis: power users who hit the 50-habit limit churn within 2 weeks because the limit forces them to delete habits they were tracking, not because the limit itself bothers them.

User segment: paying users who have logged 40+ active habits in the last 30 days. Recruited from in-app prompt with $20 gift card. They know we want to talk about how they use the app.

Decision this informs: do we raise the limit, ship a paid tier with no limit, or auto-archive habits inactive 30+ days?

Taboo words (do not appear in any question):
- "would" / "could" / "if"
- "feature" / "feedback"
- the word "limit" itself (we want to see if they bring it up unprompted)
- product name as a noun ("how do you use [App]")
- "love" / "hate" / "like"

Bad question examples (so you know the bar):
- "Would you pay for unlimited habits?" (hypothetical + sells the answer)
- "How do you feel about the 50-habit limit?" (asks for opinion, names the thing)
- "What features would you love to see next?" (everything wrong in 8 words)

Write 12 questions. Structure:

1. 3 opening questions about last week — what they actually did, no opinions.
2. 4 questions that probe behavior around the suspected churn moment — without naming the limit or hinting we are interested in it.
3. 3 questions about what they did the LAST time they stopped using a habit-tracking tool (any tool, not just ours).
4. 2 closing questions that test whether the user brings up the limit on their own. If they do not, the hypothesis is weaker.

Each question must:
- Reference a specific past time window (last week, last month, the last time you did X).
- Avoid every taboo word above.
- Not embed an assumption about why they did something.

Mark any question where you had to bend a rule with [BEND: why] so I can review.
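
If you run several rounds, the prompt above can be assembled from its inputs instead of retyped each time. The following is a minimal Python sketch, not a fixed template: `InterviewBrief` and `build_prompt` are hypothetical names made up for illustration, only the standard library is used, and the four-part question structure is left out for brevity (append it by hand or as another section).

```python
from dataclasses import dataclass, field


@dataclass
class InterviewBrief:
    """The inputs listed under 'What to feed the AI' (hypothetical container)."""
    hypothesis: str
    segment: str
    decision: str
    taboo_words: list[str] = field(default_factory=list)
    bad_examples: list[str] = field(default_factory=list)


def build_prompt(brief: InterviewBrief, n_questions: int = 12) -> str:
    """Assemble a plain-text prompt from the brief.

    Raises ValueError when the hypothesis is empty: if it cannot be
    written down, the interview is not ready.
    """
    if not brief.hypothesis.strip():
        raise ValueError("Write the hypothesis in one sentence first.")

    taboo = "\n".join(f'- "{word}"' for word in brief.taboo_words)
    bad = "\n".join(f'- "{q}"' for q in brief.bad_examples)

    sections = [
        "You are writing user interview questions for a 30-minute call.",
        f"Hypothesis: {brief.hypothesis}",
        f"User segment: {brief.segment}",
        f"Decision this informs: {brief.decision}",
        f"Taboo words (do not appear in any question):\n{taboo}",
        f"Bad question examples (so you know the bar):\n{bad}",
        f"Write {n_questions} questions. Each question must reference a "
        "specific past time window, avoid every taboo word above, and not "
        "embed an assumption about why they did something.",
        "Mark any question where you had to bend a rule with [BEND: why] "
        "so I can review.",
    ]
    return "\n\n".join(sections)


# Example brief reusing the habit-tracker case from this article.
brief = InterviewBrief(
    hypothesis="power users who hit the 50-habit limit churn within 2 weeks",
    segment="paying users who have logged 40+ habits, recruited from in-app prompt",
    decision="raise the limit, sell a tier, or auto-archive",
    taboo_words=["would", "could", "feature"],
    bad_examples=["Would you pay for unlimited habits?"],
)
print(build_prompt(brief))
```

Keeping the taboo list and bad examples as data makes it easy to reuse the rules across rounds while swapping the hypothesis.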

Sample output structure

Opening (last week):

1. Walk me through last Tuesday: when did you open the app and what did you do?
2. Tell me about the last time you added something new to track. What was happening that day?
3. The last time you opened the app and then closed it without doing anything, what was going on?

Behavior near the suspected churn moment:

4. Show me your active list. Talk me through what each one is for, in 15 seconds each.
5. When was the last time you removed something from this list? Walk me through that decision.
6. The last time you wanted to add something and did not, what stopped you? (do not lead; wait through silence)
7. Tell me about a time in the last month when you felt the list got hard to manage.

Last time they stopped using ANY tracking tool:

8. Think about the last tracking tool, any tool, you stopped using. What was the last week before you stopped?
9. What were you trying to track in that final week that did not work out?
10. Did you go to another tool, or just stop?

Closing:

11. If you were redesigning your tracking setup from scratch tomorrow, walk me through the first 10 minutes.
12. What is the next thing you are going to start tracking, and where will you put it?

Which model to use

Any frontier model handles this once it has the rules. Differences are small and live in how hard each one fights the urge to be helpful (which here means leading):

Claude Opus 4.7 / Sonnet 4.6 — best at honoring the taboo-word list and the [BEND: why] flag, and least likely to slip a “would you” back in on a refinement pass. Sonnet 4.6 is the cheaper default; Opus 4.7 is worth it only for a tricky hypothesis.
GPT-5.5 (set the picker to Thinking) — strong structure, but check its closing questions; it tends to over-help and softly name the thing you said not to name.
Gemini 3.1 Pro — fine, and its 1M-token context (as of June 2026) is handy if you paste a long product spec or three prior transcripts as background.

Whatever you pick, the model is a drafting tool. The rules in the prompt do the real work.

How to refine

AI smuggled a taboo word in → re-prompt with “rewrite Q7 without ‘manage’; that is a value-loaded word here. Use a behavior verb.”
Question is hypothetical (“if you were to…”) → strict rule: “every question must reference a past time window. No ‘if’, no ‘would’.”
Question embeds the hypothesis → reject and ask: “rewrite Q6 so it does not assume the user wanted to add something; make it open to ‘wanted to remove’ or ‘wanted to change’.”
Too many questions, interview will overrun → cap at 10 and tag the 2 that get dropped first if time is tight.
AI did not surface the “what tool did they use before” angle → require a section on past tool exits.
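
The first two checks in the list above can be done mechanically before you read the draft. Here is a rough Python sketch under stated assumptions: `find_taboo` and `lint_questions` are hypothetical helper names, the word list mirrors the example prompt, and matching is whole-word with a simple optional plural. Treat hits as flags for review rather than automatic rejects, since a word such as "like" also has harmless uses ("a tool like this").

```python
import re

# Taboo words from the example prompt; adjust per hypothesis.
TABOO = ["would", "could", "if", "feature", "feedback",
         "limit", "love", "hate", "like"]


def find_taboo(question: str, taboo: list[str] = TABOO) -> list[str]:
    """Return the taboo words found in a question.

    Matches whole words, case-insensitively, with an optional trailing
    "s", so "features" is caught but "unlimited" does not match "limit".
    """
    hits = []
    for word in taboo:
        pattern = rf"\b{re.escape(word)}s?\b"
        if re.search(pattern, question, flags=re.IGNORECASE):
            hits.append(word)
    return hits


def lint_questions(questions: list[str]) -> dict[int, list[str]]:
    """Map 1-based question number -> taboo hits, skipping clean questions."""
    report = {}
    for number, question in enumerate(questions, start=1):
        hits = find_taboo(question)
        if hits:
            report[number] = hits
    return report


draft = [
    "When was the last time you removed something from this list?",
    "If you were redesigning your tracking setup from scratch tomorrow, "
    "walk me through the first 10 minutes.",
    "What features would you love to see next?",
]
for number, hits in lint_questions(draft).items():
    print(f"Q{number}: rewrite without {', '.join(hits)}")
```

A word list cannot catch a question that embeds the hypothesis or a context-specific loaded word such as "manage"; those still need a human read.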

Common mistakes

Asking opinion questions when behavior questions exist. “How do you feel about X” is a polite-answer machine.
Naming the thing you are testing in the question. If the user does not raise it unprompted, that is the signal.
Stacking two questions in one (“when did you last add a habit and how did that feel?”). Users only answer the easier half.
Using your own product vocabulary. Users do not call them “habits,” they call them “the workout thing.”
Skipping the “last time you quit a tool like this” question. The graveyard of prior tools is where the real reasons live.

FAQ

How many questions for 30 minutes? 10-12 prepared, expect to use 6-8. Follow-ups eat time and that is the point.
Should the user see the questions in advance? No. You want first-take answers, not rehearsed ones.
What if a user keeps giving opinions instead of stories? Redirect once: “tell me about the last time that happened.” If they cannot, that is data — they may not actually do it often.
Can AI run the interview? No. AI cannot read silence, lean in, or know when to wait through a pause that is about to produce the real answer. Use AI for prep and synthesis, not for the live conversation.
How do I synthesize 8 interviews? Transcribe first (Otter.ai or Granola, ~$10-18/month as of June 2026), then paste transcripts into Claude or a dedicated tool. Ask it to extract verb+noun pairs (“deleted habit,” “switched tool”), not “themes” — themes are your bias talking. Cluster by verb, then count: 3+ users on the same behavior is a strong signal, 1-2 is a lead to chase. Purpose-built repositories like Dovetail (from ~$20/user/month) do the tagging and semantic search natively if you run research often.
Should I just use the same questions every round? No. Each interview round tests a different hypothesis, so each needs its own question list. Reuse the rules (the taboo words, the past-tense constraint), not the questions.
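
The counting step in the synthesis answer is simple enough to script once the verb+noun pairs have been extracted. A minimal Python sketch follows, assuming you have already turned the model's output into `(user_id, verb, noun)` tuples; `score_behaviors` and the tuple layout are hypothetical, and the threshold of 3 comes from the rule of thumb above.

```python
from collections import defaultdict


def score_behaviors(observations, strong_at=3):
    """Cluster verb+noun pairs by verb and count distinct users per pair.

    observations: iterable of (user_id, verb, noun) tuples.
    Returns {verb: {noun: (distinct_user_count, label)}}.
    """
    # Collect the set of users behind each pair so that one talkative
    # user repeating the same story is counted once.
    users = defaultdict(set)
    for user_id, verb, noun in observations:
        users[(verb.lower(), noun.lower())].add(user_id)

    clusters = defaultdict(dict)
    for (verb, noun), ids in users.items():
        label = "strong signal" if len(ids) >= strong_at else "lead to chase"
        clusters[verb][noun] = (len(ids), label)
    return dict(clusters)


# Made-up observations for illustration only.
observations = [
    ("u1", "deleted", "habit"),
    ("u1", "deleted", "habit"),   # same user, same story: counted once
    ("u2", "deleted", "habit"),
    ("u3", "Deleted", "Habit"),   # case differences are normalized
    ("u4", "switched", "tool"),
]
for verb, nouns in score_behaviors(observations).items():
    for noun, (count, label) in nouns.items():
        print(f"{verb} {noun}: {count} user(s), {label}")
```

Counting distinct users rather than mentions keeps one vivid interview from looking like a pattern.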
