User Feedback Clustering Prompts for Themed Insights

15 feedback-clustering prompts that compress hundreds of support tickets, NPS comments and reviews into 8-12 actionable themes — with counts, severity, and a model-choice table for June 2026.

Published: May 19, 2026 Updated: Jun 14, 2026 Author: AI Productivity Guide Team

Raw feedback is hard to act on. The 500 verbatim responses you exported from Intercom, the App Store, Typeform and Linear bury the actual signal under noise. These 15 prompts cluster, count, label and prioritize that feedback into the small set of themes a roadmap can actually answer. The templates cover support tickets, NPS comments, churn-exit surveys, app reviews, and beta-test transcripts. Paste them into ChatGPT, Claude, or Gemini and adapt the bracketed placeholders.

TL;DR

Start with template 1 (open taxonomy), then re-run with template 2 once you know your categories.
Always demand a count per theme — without counts you cannot prioritize.
Run template 15 (hallucination guard) before any clustering goes to execs.
For 100-500 short responses, any frontier model works. For full quarters (thousands of verbatims at once), use a 1M-token model: Claude Sonnet 4.6, Gemini 3.1 Pro, or GPT-5.5 (in the ChatGPT app, the full 1M window is available only on Pro at $200/mo).
Strip PII (names, emails, account IDs) before pasting — these prompts assume clean input.

Which model and how much can it take at once

A clustering prompt lives or dies on how much raw feedback fits in one call. The constraint is no longer model quality — it is context discipline. As of June 2026, here is what to reach for:

Model	Context window	Strength for clustering	Notes
Claude Sonnet 4.6	1M tokens	Tight, low-hallucination summaries; good verbatim fidelity	Workhorse; bundled in Claude Pro $20/mo
Claude Opus 4.7	1M tokens	Best reasoning on ambiguous, mixed feedback	Pricier ($5/$25 per 1M tokens API); reserve for hard JTBD/severity passes
Gemini 3.1 Pro	1M tokens	Cheapest 1M-token frontier ($2/$12 API)	Strong on long single-call batches; in Google AI Pro $19.99/mo
GPT-5.5	1M tokens (API)	Solid default; picker has Instant/Thinking/Pro	In-app context ~320 pages on Plus $20/mo; full 1M only on Pro $200/mo

Practical rule: 1,000 tokens is roughly 750 words, so a 1M-token window holds about 750,000 words in total. Leave room for the prompt and the model's output and you can still fit on the order of 700,000 words of verbatims, more than most teams collect in a quarter. The real failure mode is asking for too much at once and getting shallow themes, not hitting the token ceiling. If a batch produces vague clusters, split by week or segment and re-cluster the summaries.
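To make the rule of thumb concrete, here is a minimal Python sketch that estimates token cost from word count and packs feedback into per-segment batches. The names `estimate_tokens` and `batch_by_segment`, the week labels, and the tiny budget are all illustrative assumptions; real tokenizers will give different counts, so treat the estimate as a planning number only.

```python
from collections import defaultdict

# Rule of thumb from the text: ~750 words per 1,000 tokens. Real tokenizers vary.
WORDS_PER_1K_TOKENS = 750


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return round(words * 1000 / WORDS_PER_1K_TOKENS)


def batch_by_segment(rows, budget_tokens):
    """Group (segment, text) rows by segment, then pack each segment into
    batches whose estimated size stays within budget_tokens.
    A single response larger than the budget still gets its own batch."""
    by_segment = defaultdict(list)
    for segment, text in rows:
        by_segment[segment].append(text)

    batches = []
    for segment, texts in by_segment.items():
        current, used = [], 0
        for text in texts:
            cost = estimate_tokens(text)
            if current and used + cost > budget_tokens:
                batches.append((segment, current))
                current, used = [], 0
            current.append(text)
            used += cost
        if current:
            batches.append((segment, current))
    return batches


# Hypothetical rows: (ISO week, verbatim). The budget is tiny to force a split.
rows = [
    ("2026-W14", "Export to CSV fails on large workspaces"),
    ("2026-W14", "Onboarding checklist never marks step three done"),
    ("2026-W15", "Pricing page does not explain seat limits"),
]
batches = batch_by_segment(rows, budget_tokens=12)
```

In practice you would set the budget well below the model's window, cluster each batch separately, and then feed the per-batch theme summaries into one final re-clustering call.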

For repeatable, team-wide synthesis at scale, dedicated platforms (Dovetail, Thematic) add verbatim traceability and sentiment scoring tied to NPS/CSAT. Prompts win on speed and flexibility; platforms win on auditability. Evaluate a dedicated platform when more than one team needs to query the same feedback corpus and trace every theme back to a source quote.

Who this is for

PMs and CX leads who own quarterly feedback synthesis, founders reading their own support inbox, research-ops teams running large surveys, and growth teams hunting the next experiment angle.

When not to use these prompts

Skip when n is under 30 — at that size, read every response by hand. Skip too when the feedback is highly technical (engineering logs, code review) and needs a domain taxonomy the model lacks. And never feed raw PII to a model: scrub names, emails, and account IDs first.
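A scrub pass can be partly automated before anything is pasted. The sketch below is a minimal example using Python's standard `re` module; the `ACC-` account-ID format and the `scrub` helper are made up for illustration, and pattern matching will miss free-form names, so a manual skim is still needed.

```python
import re

# Illustrative patterns only. The "ACC-12345" account-ID format is hypothetical;
# replace it with whatever your own system uses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ACCOUNT_RE = re.compile(r"\bACC-\d{4,}\b")


def scrub(text: str, known_names=()) -> str:
    """Replace emails, account IDs, and any known customer names with tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = ACCOUNT_RE.sub("[ACCOUNT_ID]", text)
    for name in known_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text


raw = "Dana Lee (dana.lee@example.com, ACC-10293) says export is broken."
clean = scrub(raw, known_names=["Dana Lee"])
```

Passing names from your CRM export as `known_names` catches the obvious cases; names that only appear inside the free text need a human or a dedicated PII tool.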

Prompt anatomy

A feedback-clustering prompt should carry six elements:

Role: who the AI plays (senior PM / solo founder / product designer / growth lead).
Context: stage (idea / MVP / growth / scale), team size, traffic or ARR, platform (web / iOS / Android), audience, constraints.
Goal: one concrete deliverable — one theme set, one ticket batch, one delta report.
Constraints: theme count, mutual exclusivity, severity scale, must-cite-from-source.
Output format: table, matrix, ticket-ready JSON, or labeled blocks you can paste into Linear / Notion / Jira.
Source guard: the raw verbatims to cite from, plus an instruction to flag any quote not in the source.
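If you run these prompts from a script rather than a chat window, the six elements map naturally onto a small data structure. This is one possible sketch, not a required format: the `ClusteringPrompt` class, its field names, and the rendered wording are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusteringPrompt:
    """Holds the six elements listed above and renders them as one prompt."""
    role: str
    context: str
    goal: str
    constraints: list[str]
    output_format: str
    source: list[str]  # raw verbatims, already scrubbed of PII

    def render(self) -> str:
        constraints = "\n".join(f"- {c}" for c in self.constraints)
        # Numbering the verbatims lets the model cite them by index.
        numbered = "\n".join(f"[{i}] {v}" for i, v in enumerate(self.source, start=1))
        return (
            f"Role: {self.role}\n"
            f"Context: {self.context}\n"
            f"Goal: {self.goal}\n"
            f"Constraints:\n{constraints}\n"
            f"Output format: {self.output_format}\n"
            "Source guard: quote only from the numbered feedback below and "
            "flag any quote you cannot find there.\n\n"
            f"Feedback:\n{numbered}"
        )


prompt = ClusteringPrompt(
    role="senior PM",
    context="growth-stage B2B web app, 12-person team",
    goal="one theme set for the quarterly review",
    constraints=["8-12 themes", "mutually exclusive", "count per theme"],
    output_format="table",
    source=["Export to CSV fails on large workspaces", "Seat limits are unclear"],
).render()
```

Numbering the source lines is a design choice: it makes it easier to ask the model for "quote plus index" and to check citations afterwards.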

15 copy-ready prompt templates

Bracketed tokens like [N] and [paste] are placeholders — replace them with your numbers and pasted feedback.

1. First-pass theme extraction

Start here. Let the AI propose its own taxonomy before you constrain it.

You are a senior PM clustering raw user feedback. Below are [N] verbatim responses. Group them into 8-12 themes. For each theme: (1) short label (3-5 words), (2) one-sentence description, (3) count of responses, (4) 2 representative verbatim quotes, (5) suggested severity (blocker / major / minor / nice-to-have).

Feedback: [paste]

Variables to swap: N, feedback corpus.

Optimization: If themes overlap, add: “Merge themes where 60%+ of underlying quotes could fit either. Keep clusters mutually exclusive.”

2. Constrained taxonomy clustering

Use this once you know the categories you care about.

Cluster the feedback below into these predefined categories: [bug, missing feature, pricing, onboarding, performance, integration, support quality, other]. For each: count, % of total, top 3 verbatims, single recommended action. If "other" exceeds 10% of the total, suggest a new category.

[paste]
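The counts and the 10% rule for "other" are easy to double-check outside the model. A minimal sketch, assuming you have copied the model's per-category counts into a dictionary (the numbers and helper names below are invented for illustration):

```python
def category_shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of total per category, rounded to one decimal place."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


def other_needs_split(counts: dict[str, int], threshold_pct: float = 10.0) -> bool:
    """True when the 'other' bucket exceeds the threshold share of the total."""
    return category_shares(counts).get("other", 0.0) > threshold_pct


# Hypothetical counts copied from a model response for N = 100 responses.
counts = {"bug": 40, "missing feature": 25, "pricing": 10, "onboarding": 8, "other": 17}
n_responses = 100

counts_add_up = sum(counts.values()) == n_responses
shares = category_shares(counts)
split_other = other_needs_split(counts)
```

If `counts_add_up` is false, the model either dropped responses or double-counted them, which is worth resolving before reading anything into the percentages.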

3. Sentiment + theme matrix

Cluster the feedback below into themes AND label each with sentiment (positive / mixed / negative). Output a matrix: theme by sentiment, with counts. Highlight any theme where the same feature gets equal positive and negative comments — that is where the actual disagreement lives.

[paste]

4. Churn-exit reason clustering

Below are [N] cancellation-flow survey responses. Cluster the reasons for leaving into 6-8 categories. For each: count, %, 2 verbatims, whether it is reversible (product fix vs life-circumstance), and one suggested counter-move.

[paste]

5. NPS comment clustering by score band

Cluster these NPS comments into themes, but split the output by score band: detractors (0-6), passives (7-8), promoters (9-10). For each band: top 5 themes with counts. Highlight any theme that appears in BOTH detractors and promoters — that is your polarizing feature.

[paste]
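The band boundaries in this template are fixed, so the split and the "appears in both detractors and promoters" check can be verified in code once each comment has a theme label. A small sketch, assuming the input is a list of (score, theme) pairs; the function names and sample data are illustrative:

```python
from collections import defaultdict


def nps_band(score: int) -> str:
    """Map a 0-10 NPS score to its band."""
    if not 0 <= score <= 10:
        raise ValueError(f"NPS score out of range: {score}")
    if score <= 6:
        return "detractor"
    if score <= 8:
        return "passive"
    return "promoter"


def polarizing_themes(tagged):
    """Return themes that show up among both detractors and promoters."""
    bands_by_theme = defaultdict(set)
    for score, theme in tagged:
        bands_by_theme[theme].add(nps_band(score))
    return sorted(
        theme
        for theme, bands in bands_by_theme.items()
        if {"detractor", "promoter"} <= bands
    )


# Hypothetical (score, theme) pairs after the model has labeled each comment.
tagged = [
    (3, "pricing"),
    (10, "pricing"),
    (9, "integrations"),
    (7, "onboarding"),
    (2, "onboarding"),
]
polarizing = polarizing_themes(tagged)
```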

6. Bug vs feature-request separation

Take this mixed feedback corpus and split it into two stacks: bugs (something broken vs expected behavior) and feature requests (something not built yet). For each stack, cluster into themes with counts. Flag any items where the line is unclear.

[paste]

7. Persona-aware clustering

Below is feedback tagged with user persona (free / paid / enterprise / new / power). Cluster themes BY persona — the same theme can appear in multiple personas. Output: theme by persona matrix with counts. Highlight which themes are concentrated in paid + power users — those move revenue.

[paste]

8. Jobs-to-be-done reframing

Re-cluster this feedback using JTBD framing instead of feature categories. Output 5-8 jobs in the form "When X, I want Y, so I can Z." For each: count of underlying feedback, the products / workarounds users currently use for that job, and where our product falls short.

[paste]

9. Severity + frequency 2x2

For each theme in this clustered feedback, place it on a severity (low/high) by frequency (low/high) 2x2. Output as a table. The high/high quadrant is the next sprint. The high-severity / low-frequency quadrant needs an audit (rare but bad).

[paste themes + counts]
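The 2x2 placement can also be reproduced deterministically as a cross-check. The sketch below makes two assumptions that are not in the template: "blocker" and "major" count as high severity, and the frequency cutoff is the median theme count. Adjust both to match your own scale.

```python
from statistics import median

# Assumption: the top two severity levels from template 1 count as "high".
HIGH_SEVERITY = {"blocker", "major"}


def place_on_2x2(themes):
    """Map each theme to a (severity, frequency) pair of 'high' / 'low'.
    themes: {label: (severity_level, count)}.
    Frequency is 'high' when the count is at or above the median count."""
    cutoff = median(count for _, count in themes.values())
    grid = {}
    for label, (severity, count) in themes.items():
        sev = "high" if severity in HIGH_SEVERITY else "low"
        freq = "high" if count >= cutoff else "low"
        grid[label] = (sev, freq)
    return grid


# Hypothetical themes with severity and counts from an earlier clustering pass.
themes = {
    "Export failures": ("blocker", 42),
    "Seat limit confusion": ("minor", 30),
    "SSO lockout": ("major", 3),
    "Dark mode": ("nice-to-have", 5),
}
grid = place_on_2x2(themes)
next_sprint = [t for t, cell in grid.items() if cell == ("high", "high")]
needs_audit = [t for t, cell in grid.items() if cell == ("high", "low")]
```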

10. Quote selection per theme

For each of these themes, pick the 3 most representative verbatim quotes for an internal share-out. Criteria: clarity, emotion, specificity. Exclude any quote with PII (names, emails, account IDs). Tag each quote with persona if known.

Themes + raw quotes: [paste]

11. Cross-channel reconciliation

I have 3 sources of feedback for the same quarter: app reviews, support tickets, sales loss reasons. Compare the top 5 themes from each. Output a table: theme by source. Highlight themes appearing in all 3 sources (highest confidence) and themes appearing in only 1 (channel-specific noise or hidden signal).

[paste 3 source summaries]
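Once theme labels are normalized to the same wording across channels, the "all 3 sources" versus "only 1 source" split is plain set arithmetic. A minimal sketch, assuming each source has already been reduced to a set of lowercase theme labels (the labels and the `reconcile` helper are illustrative):

```python
from collections import Counter


def reconcile(sources: dict[str, set[str]]):
    """Return (themes present in every source, themes present in exactly one)."""
    seen = Counter(theme for themes in sources.values() for theme in themes)
    in_all = sorted(t for t, n in seen.items() if n == len(sources))
    in_one = sorted(t for t, n in seen.items() if n == 1)
    return in_all, in_one


# Hypothetical top themes per channel, with labels already normalized.
sources = {
    "app_reviews": {"export failures", "dark mode", "pricing confusion"},
    "support_tickets": {"export failures", "sso lockout", "pricing confusion"},
    "sales_losses": {"export failures", "missing integrations"},
}
high_confidence, channel_specific = reconcile(sources)
```

The matching is exact-string, so "export failures" and "export bugs" would count as different themes; asking the model to reuse one shared label list across the three summaries avoids that.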

12. Action recommendation per theme

For each theme below, propose 1 product action, 1 GTM / messaging action, and 1 thing NOT to do. Each action should be testable in 2 weeks. Mark which action belongs to which team.

[paste themes]

13. Quarterly delta report

Compare last quarter's feedback themes to this quarter's. Output: themes that grew, themes that shrank, new themes, retired themes. Hypothesize 1 reason for each major delta. End with the 3 themes worth a deep-dive next quarter.

Last quarter: [paste]
This quarter: [paste]
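The four delta buckets can be computed directly from two theme-to-count mappings, which gives you a baseline to compare the model's narrative against. A sketch under the assumption that both quarters use the same theme labels; the data and the `quarterly_delta` helper are hypothetical:

```python
def quarterly_delta(last: dict[str, int], this: dict[str, int]):
    """Split themes into grew / shrank / new / retired by raw count."""
    shared = set(last) & set(this)
    grew = {t: this[t] - last[t] for t in shared if this[t] > last[t]}
    shrank = {t: this[t] - last[t] for t in shared if this[t] < last[t]}
    new = sorted(set(this) - set(last))
    retired = sorted(set(last) - set(this))
    return grew, shrank, new, retired


# Hypothetical theme counts for two consecutive quarters.
last_quarter = {"Export failures": 40, "Pricing confusion": 22, "Slow dashboard": 15}
this_quarter = {"Export failures": 18, "Pricing confusion": 31, "SSO lockout": 9}

grew, shrank, new, retired = quarterly_delta(last_quarter, this_quarter)
```

Raw counts are only comparable when the two quarters have similar response volume; if volume changed a lot, compare each theme's share of the total instead.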

14. Feedback-to-ticket converter

Convert the top 5 themes from this clustering into engineering / design tickets. Each ticket: title (under 12 words), problem (2 sentences from clustered evidence), proposed scope, success metric, linked verbatim quotes. Output as JSON for Linear / Jira import.

[paste themes + quotes]
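Before importing the JSON anywhere, it is worth validating it against the fields the prompt asked for. The sketch below assumes the model returns a JSON array of objects with the keys `title`, `problem`, `proposed_scope`, `success_metric`, and `quotes`; those key names are an assumption, not a Linear or Jira schema, so map them to your tracker's import format separately.

```python
import json

# Hypothetical key names mirroring the fields requested in the prompt.
REQUIRED_KEYS = {"title", "problem", "proposed_scope", "success_metric", "quotes"}


def validate_tickets(raw_json: str) -> list[str]:
    """Return a list of human-readable problems; empty means the batch passed."""
    tickets = json.loads(raw_json)
    problems = []
    for i, ticket in enumerate(tickets):
        missing = REQUIRED_KEYS - ticket.keys()
        if missing:
            problems.append(f"ticket {i}: missing {sorted(missing)}")
        # The prompt asks for titles under 12 words.
        if len(ticket.get("title", "").split()) >= 12:
            problems.append(f"ticket {i}: title is 12 words or longer")
        if "quotes" in ticket and not ticket["quotes"]:
            problems.append(f"ticket {i}: no linked verbatim quotes")
    return problems


good = json.dumps([{
    "title": "Fix CSV export on large workspaces",
    "problem": "Exports time out. Users lose work.",
    "proposed_scope": "Stream export in chunks",
    "success_metric": "Export failure rate under 1%",
    "quotes": ["Export to CSV fails on large workspaces"],
}])
bad = json.dumps([{
    "title": "one two three four five six seven eight nine ten eleven twelve thirteen",
    "problem": "x",
    "proposed_scope": "y",
    "quotes": ["q"],
}])

good_problems = validate_tickets(good)
bad_problems = validate_tickets(bad)
```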

15. Hallucination guard pass

This is the one non-negotiable template. Run it before any clustering reaches a decision-maker.

Audit this AI-generated clustering against the source feedback. For each theme: confirm the count by spot-checking, flag any verbatim quote that does not appear in source, flag any claimed pattern not supported by at least 3 quotes. Output: confirmed themes vs themes to redo.

Clustering: [paste]
Source: [paste 20 random verbatims]
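The "quote does not appear in source" check is the one part of this audit that does not need a model at all. As a complement to the prompt, a plain substring match against the full export catches invented quotes deterministically. This is a minimal sketch: `normalize` and `unverified_quotes` are illustrative helpers, and the matching is exact after whitespace, case, and curly-quote normalization, so a lightly paraphrased quote will be reported as unverified.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, straighten curly quotes, and collapse whitespace."""
    for curly, straight in (("\u2018", "'"), ("\u2019", "'"), ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_quotes(quotes, source):
    """Return the quotes that are not a substring of any source response."""
    haystack = [normalize(s) for s in source]
    missing = []
    for quote in quotes:
        needle = normalize(quote)
        # An empty quote would match everything, so treat it as unverified.
        if not needle or not any(needle in s for s in haystack):
            missing.append(quote)
    return missing


# Hypothetical source export and quotes copied from the model's clustering.
source = [
    "The export button does nothing on  Safari.",
    "I can\u2019t find where to change my plan.",
]
quotes = [
    "export button does nothing on Safari",
    "I can't find where to change my plan",
    "The app deleted all my projects",
]
invented = unverified_quotes(quotes, source)
```

Any quote this returns should be traced by hand before the theme it supports goes into a share-out.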

Common mistakes

Asking for 25 themes when 10 would do — that overfits noise.
No counts per theme — without counts you cannot prioritize.
Letting the model invent verbatim quotes — always pass the raw text and demand spot-checking (template 15).
Mixing bugs and feature requests in one cluster — different actions, different teams.
Ignoring sentiment — a popular theme split 50/50 positive vs negative is your most controversial issue.
Re-clustering without comparing to last quarter — drift is itself the signal.
Acting on a theme with under 5 supporting quotes — too small to spend a sprint on.

How to push results further

Strip PII before pasting feedback — names, emails, account IDs.
Run template 15 on any clustering before sharing it upward.
Pair every theme with at least 3 verbatim quotes; one quote is anecdote.
Use template 11 each quarter to separate real signal from channel noise.
Re-export from your source tools (Intercom, Linear, Typeform) rather than reusing old extracts — themes shift weekly.
When clusters look “balanced” with similar counts, push back; real signal usually has 2-3 dominant themes.
Assign every theme a single owner before the share-out, or nothing moves.

FAQ

How much feedback can AI cluster at once?: Any frontier model handles 100-500 short responses in one call cleanly. For a full quarter (thousands of verbatims), use a 1M-token model — Claude Sonnet 4.6, Gemini 3.1 Pro, or GPT-5.5 — which holds on the order of 700,000 words. Beyond that, batch by week or segment, then re-cluster the summaries.
Which model is best for this?: For everyday clustering, Claude Sonnet 4.6 gives tight, low-hallucination summaries. For the cheapest large-batch single call, Gemini 3.1 Pro ($2/$12 per 1M tokens API). For the hardest JTBD or severity reasoning, Claude Opus 4.7. All as of June 2026.
How do I trust the AI did not hallucinate?: Run template 15 and spot-check 10 random verbatims against source. If 2 or more are invented, redo the clustering.
Should I share the verbatims with the AI raw?: Yes, but strip PII first. Verbatim language is the highest-value input; paraphrasing loses signal.
How often should I cluster?: Monthly for active products, quarterly at minimum. The drift between cycles is itself a leading indicator.
What if themes feel too generic?: Add a constraint that every theme label must reference a feature, screen, or workflow — no abstract nouns like “experience” or “quality”.

Tags: #Prompt #Product startup #User story