AI Feature Prioritization for Indie Apps: From 30-Item Backlog to a 3-Item Quarter

Use AI to cut a sprawling feature backlog into the 3 things worth shipping this quarter — with a copy-ready prompt, a RICE scoring pass, and model picks as of June 2026.

Published: May 20, 2026 | Updated: Jun 04, 2026 | Author: AI Productivity Guide Team

TL;DR

Hand the AI the user signal behind each backlog item (not just titles), a single north-star outcome, and your real capacity in shippable feature-weeks. Ask it to cut the list to three, name the exact retention lever each one pulls, and flag the non-feature work you’re underestimating. Use Claude Opus 4.7 when you want the most rigorous step-by-step argument for each cut; use Gemini 3.1 Pro when you’re scoring 30+ items and care about cost. The prompt asks for a brutal cut, but the output usually comes back softer than the prompt demands, so the final cut is still yours to own.

The task

You’re an indie dev or small-team PM. Your backlog has 30+ feature ideas. You have one quarter and at most three features you can ship well. You need AI to force the brutal cut you keep avoiding.

The hard part isn’t generating ideas — it’s saying no to 27 of them with a reason you can defend in a standup. That’s where a model with strong long-context reasoning earns its keep: it can hold every item’s user signal at once and rank against a single outcome, which a tired human at 11pm cannot.

When this is the right job for AI

You can hand AI the user signal behind each feature — not just titles. Review quotes, support-ticket counts, churn-cohort evidence.
You have a north-star outcome for the quarter, expressed as a number (“week-4 retention from 22% to 30%”).
You will live with the cut. AI ranks; you commit.

If you only have titles and vibes, AI will produce confident-sounding nonsense. Feed it evidence or skip this.

What to feed the AI

20-30 backlog items: name + the user signal (review quote, support-ticket count, churn-cohort evidence)
North-star outcome for the quarter, as a number
Capacity: how many “shippable feature-weeks” you actually have after on-call, bugs, and App Review delays
2 features your gut already wants to drop
2 features competitors just shipped that you’re tempted to chase

On capacity, be honest about the App Review tax: as of June 2026, Apple approves roughly 90% of submissions within 24 hours and 98% within 48 hours, but apps with new in-app purchases or complex privacy declarations routinely sit in review for 2-5 days, and macOS builds run longer (live review times, Runway). Budget one slipped week per quarter for a rejection-and-resubmit cycle and you’ll rarely be surprised.
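The capacity arithmetic is worth writing down once so you stop doing it in your head. A minimal sketch using the numbers from the prompt in this article (13-week quarter, 2 weeks of on-call, 1 slipped week for App Review); the function name and parameters are illustrative, not taken from any tool.

```python
def shippable_feature_weeks(quarter_weeks: int, on_call_weeks: int,
                            app_review_weeks: int, other_weeks: int = 0) -> int:
    """Weeks left for feature work after known overheads (illustrative helper)."""
    remaining = quarter_weeks - on_call_weeks - app_review_weeks - other_weeks
    if remaining < 0:
        raise ValueError("overheads exceed the length of the quarter")
    return remaining


# Numbers from the example prompt: 13 - 2 - 1
capacity = shippable_feature_weeks(quarter_weeks=13, on_call_weeks=2, app_review_weeks=1)
print(capacity)  # 10
```

If you already know a bug-fix week is coming, pass it as other_weeks so the number you hand the AI is the one you can actually spend on features.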

The framework underneath the prompt

You don’t have to teach the AI a framework, but it helps to know which one you’re implicitly asking for. The three that matter for indie roadmaps:

Framework	Formula / shape	Best for	Failure mode
RICE	(Reach × Impact × Confidence) ÷ Effort	Ranking roadmap features against one shared metric	Collapses into theater when the inputs are internal conviction, not evidence
ICE	Impact × Confidence × Ease	Quick ranking of small experiments	With no Reach constraint, drifts toward whatever the scorer finds exciting
Kano	Classify: basic / performance / delight	Understanding which kind of value a feature is	It’s a satisfaction model, not a ranking model — it won’t order your backlog

RICE was built by Sean McBride on Intercom’s growth team to compare ideas against a single conversion goal, which is exactly the indie situation: one north star, a pile of competing bets (RICE scoring, Intercom). The prompt below is essentially a RICE pass with a human-language wrapper — it asks for reach (user signal), impact (which lever it pulls), confidence (how strong the signal is), and effort (weeks) without making you fill in a spreadsheet.
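If you want to check the AI's ranking by hand, the RICE formula from the table fits in a few lines. This is a rough sketch, not Intercom's implementation: the Item class, the impact scale, and every number in the example backlog are made-up assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    reach: float       # users affected this quarter (your own estimate)
    impact: float      # multiplier, e.g. 0.25 (minimal) up to 3 (massive); scale is an assumption
    confidence: float  # 0.0 to 1.0, how strong the evidence is
    effort: float      # weeks of work, must be greater than 0


def rice(item: Item) -> float:
    """(Reach x Impact x Confidence) / Effort, as in the table above."""
    if item.effort <= 0:
        raise ValueError("effort must be positive")
    return item.reach * item.impact * item.confidence / item.effort


def rank(items: list[Item]) -> list[Item]:
    """Highest RICE score first."""
    return sorted(items, key=rice, reverse=True)


# Hypothetical inputs; only the feature names come from the example backlog.
backlog = [
    Item("Apple Watch app", reach=0, impact=1, confidence=0.5, effort=6),
    Item("Streak freeze", reach=1500, impact=2, confidence=0.8, effort=2),
    Item("Onboarding rewrite", reach=4000, impact=2, confidence=0.8, effort=3),
]

for item in rank(backlog):
    print(f"{item.name}: {rice(item):.0f}")
```

The score is only a tiebreaker. If the reach and confidence numbers are guesses, the ranking is a guess with decimals.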

Copy-ready prompt

You are cutting a feature backlog from 25 items to 3 for the next quarter.

North-star outcome: week-4 retention from 22% -> 30%.
Capacity: 10 shippable feature-weeks (13-week quarter, -2 for on-call, -1 for App Review delays).

Backlog (name | user signal):
1. Streak freeze | 38 support tickets, 12 reviews mention "lost my streak"
2. Widgets v2 | 9 reviews, gut says it's cosmetic
3. Reminders by time-of-day | 1 review, 6 support tickets — power-user request
4. Apple Watch app | competitor just shipped, no user signal
5. Shareable progress card | 22 organic-share screenshots on X, no requests
6. Onboarding rewrite | analytics: 41% drop in step 3 of 4
7. Premium price test ($4.99 -> $6.99) | revenue model assumption
8. Dark mode v2 | 14 reviews, mostly "the contrast hurts"
9. iPad layout | analytics: 8% of MAU on iPad, current layout broken
10. Settings cleanup | engineering ask, no user signal
... [paste 15 more]

Gut wants to drop: Apple Watch app, settings cleanup.
Tempted to chase: competitor widgets, competitor watch app.

Output:
1. The 3 features I ship this quarter. For each: which retention lever it
   pulls, weeks of effort, and the one risk.
2. The 5 "no" features with one-line reasoning (specifically why NOT, not
   "lower priority").
3. The one feature I should explicitly NOT chase from competitor moves, and
   the one-line rebuttal to my own FOMO.
4. The non-feature work I am underestimating: bugs, perf, onboarding fixes
   that will move retention more than any feature.
5. What I should measure weekly to know if the 3 commits are working.

The -> arrows are written as plain text so you can paste this into any chat box without it being read as Markdown.
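If your backlog lives in a spreadsheet or an issue tracker export, you can assemble the top half of the prompt instead of retyping it. A minimal sketch, assuming you already have each item as a (name, user signal) pair; build_prompt and its parameters are hypothetical names, and the Output section from the prompt above still needs to be appended by hand.

```python
def build_prompt(north_star: str, capacity_weeks: int,
                 backlog: list[tuple[str, str]],
                 drop: list[str], chase: list[str]) -> str:
    """Assemble the header and backlog section of the prioritization prompt."""
    lines = [
        f"You are cutting a feature backlog from {len(backlog)} items to 3 for the next quarter.",
        "",
        f"North-star outcome: {north_star}.",
        f"Capacity: {capacity_weeks} shippable feature-weeks.",
        "",
        "Backlog (name | user signal):",
    ]
    # Number the items the same way the hand-written prompt does.
    for number, (name, signal) in enumerate(backlog, start=1):
        lines.append(f"{number}. {name} | {signal}")
    lines += [
        "",
        f"Gut wants to drop: {', '.join(drop)}.",
        f"Tempted to chase: {', '.join(chase)}.",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    north_star="week-4 retention from 22% -> 30%",
    capacity_weeks=10,
    backlog=[
        ("Streak freeze", '38 support tickets, 12 reviews mention "lost my streak"'),
        ("Apple Watch app", "competitor just shipped, no user signal"),
        ("Onboarding rewrite", "analytics: 41% drop in step 3 of 4"),
        ("Settings cleanup", "engineering ask, no user signal"),
    ],
    drop=["Apple Watch app", "settings cleanup"],
    chase=["competitor widgets", "competitor watch app"],
)
print(prompt)
```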

Which model to run it on

This is a long-context ranking-and-argument task, not a creative one. As of June 2026:

Model	Why pick it here	API price (USD per 1M tokens, input / output)
Claude Opus 4.7	Produces the most rigorous step-by-step reasoning per cut; best when “this is wrong” is expensive	$5 / $25
Gemini 3.1 Pro	Best price-to-performance for scoring 30+ items; ~70% of Opus’s depth at a fraction of the cost	$2 / $12
GPT-5.5	Fastest to a clean answer; good default if you’re iterating live	$5 / $30

For a one-off quarterly cut, any of the three flagship chat apps (Claude Pro at $20/mo, ChatGPT Plus at $20/mo, Google AI Pro at $19.99/mo) is plenty — you’re sending a few thousand tokens, not running a pipeline. Reach for the API and Gemini 3.1 Pro only if you’re batch-scoring many products or re-running the prompt dozens of times. If you live in an editor, Cursor can run the same prompt against Opus 4.7 or Gemini 3.1 Pro without leaving your repo.
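To see why the API price barely matters for a one-off run, estimate the cost of a single pass. A quick sketch using the per-1M-token prices from the table above; the token counts (4,000 in, 2,000 out) are an assumed ballpark for this prompt, not a measurement.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "GPT-5.5": (5.00, 30.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one API call at list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Assumed size of one prioritization run.
for model in PRICES:
    print(f"{model}: ${run_cost(model, 4_000, 2_000):.3f} per run")
```

At these assumed sizes every model costs well under a dollar per run, so cost only starts to matter once you re-run the prompt many times or across many products.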

Sample output structure

Ship this quarter:

Onboarding rewrite (3 weeks): direct lever. A 41% drop at step 3 means roughly two in five users who reach that step are gone before week 1. Risk: rewriting onboarding while changing pricing confounds the measurement, so don’t run both at once.

Streak freeze (2 weeks): 12 reviews + 38 support tickets is the loudest single signal. Risk: streak purists will complain freezes “cheapen” the mechanic; ship a one-tap off switch.

Reminders by time-of-day (2 weeks): power-user retention proxy; the support-ticket cohort retains 1.7x. Risk: notification UX needs care or you push users out instead of pulling them back.

Not this quarter (5 nos):

Apple Watch app: chasing a competitor with zero user signal. No.

Settings cleanup: zero user-visible retention impact.

Widgets v2: cosmetic; widgets v1 already meets the bar.

Dark mode v2: real complaint, but it doesn’t move week-4 retention.

iPad layout: 8% of MAU, genuinely important — but next quarter, not this one.

Don’t chase: Apple Watch. Rebuttal to your own FOMO: “watch apps add roughly 40% maintenance load, your audience wears a watch at ~12%, and your competitor’s ship is a marketing bet, not a retention bet.”

Underestimated non-feature work: the 6 P1 bugs in your TestFlight. Each one costs you a week-4-retention point in affected cohorts. Allocate 1 week.

Weekly measurement:

Onboarding funnel step-3 conversion (weekly).

Streak-related ticket count (weekly).

Week-4 cohort retention (rolling 6-week chart).
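Week-4 retention is easy to compute inconsistently, so pin the definition down before you chart it. A minimal sketch, assuming "retained" means the user was active at least once on days 22 to 28 after signup; that definition, the function name, and the data shapes are all assumptions, so match them to your own analytics.

```python
from datetime import date, timedelta


def week4_retention(signups: dict[str, date],
                    activity: dict[str, list[date]]) -> float:
    """Share of a signup cohort active on days 22-28 after signup (assumed definition)."""
    if not signups:
        return 0.0
    retained = 0
    for user, signup_day in signups.items():
        # Signup day counts as day 1, so week 4 starts 21 days later.
        start = signup_day + timedelta(days=21)
        end = signup_day + timedelta(days=27)
        if any(start <= day <= end for day in activity.get(user, [])):
            retained += 1
    return retained / len(signups)


# Tiny made-up cohort that signed up on the same day.
cohort = {user: date(2026, 4, 6) for user in ("a", "b", "c", "d")}
events = {
    "a": [date(2026, 4, 28)],  # inside week 4
    "b": [date(2026, 4, 10)],  # week 1 only
    "d": [date(2026, 5, 3)],   # last day of week 4
}
print(week4_retention(cohort, events))  # 0.5
```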

How to refine the output

Output too generic → require “each ship-feature names a specific retention mechanism, not ‘improves engagement.’”
AI scores everything equally → demand “the 3 features must include exactly one onboarding/funnel fix, not three power-user features.”
AI hedges → strict rule: “no ‘consider,’ no ‘it depends.’ Either ship or no.”
Skips the underestimated work → ask explicitly for “what’s missing from the backlog that is more important than any feature on it.”
You suspect the model is gaming its own scoring → add “treat any item whose only signal is ‘competitor shipped it’ as Reach = 0.” This is exactly the discipline ICE-style scoring loses without a Reach constraint.
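The Reach = 0 rule is simple enough to enforce yourself before any scoring happens. A small sketch, assuming you tag each backlog item with where its signal came from; the source labels and the function name are hypothetical, and treating "no signal at all" the same as "competitor only" is a choice made here, not something the rule requires.

```python
def effective_reach(reach: float, signal_sources: set[str]) -> float:
    """Zero out reach when no evidence comes from your own users."""
    user_sources = signal_sources - {"competitor"}
    if not user_sources:
        return 0.0
    return reach


print(effective_reach(500, {"competitor"}))             # 0.0
print(effective_reach(500, {"competitor", "reviews"}))  # 500
```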

Common mistakes

Equating “most requested” with “ships first.” Most-requested features often serve power users who already retain.
Letting competitor moves overweight your roadmap. Their bet isn’t your bet.
Not budgeting non-feature work. Bugs and onboarding fixes often dominate the retention math.
Picking 3 features that all need the same engineer. Capacity is per-engineer, not just per-team.
Trusting the score over the signal. RICE and ICE both collapse into theater when the inputs are conviction dressed up as numbers — the prompt is only as good as the evidence you paste in.
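The same-engineer mistake is the easiest one to catch mechanically. A rough sketch that sums planned weeks per engineer and reports anyone over their limit; the engineer names, the 5-week limits, and the assignments are invented for illustration, and only the feature names and week estimates come from the sample output.

```python
from collections import defaultdict


def overloaded_engineers(plan: list[tuple[str, str, float]],
                         capacity: dict[str, float]) -> dict[str, float]:
    """Return engineers whose planned weeks exceed their capacity.

    plan entries are (feature, engineer, weeks); capacity maps engineer to weeks.
    """
    load: dict[str, float] = defaultdict(float)
    for _feature, engineer, weeks in plan:
        load[engineer] += weeks
    return {name: weeks for name, weeks in load.items()
            if weeks > capacity.get(name, 0)}


capacity = {"ana": 5, "ben": 5}  # 10 feature-weeks split across two people

all_on_one = [
    ("Onboarding rewrite", "ana", 3),
    ("Streak freeze", "ana", 2),
    ("Reminders by time-of-day", "ana", 2),
]
split = [
    ("Onboarding rewrite", "ana", 3),
    ("Streak freeze", "ben", 2),
    ("Reminders by time-of-day", "ben", 2),
]

print(overloaded_engineers(all_on_one, capacity))  # {'ana': 7.0}
print(overloaded_engineers(split, capacity))       # {}
```

Both plans total 7 weeks against 10 of team capacity, but only the second one is shippable.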

FAQ

Should I include “tech debt” in the 3?

Only if it’s blocking a feature on the ship list. Otherwise it goes into the 1 week of underestimated non-feature work.

What if my north star isn’t retention?

Same template: swap “retention” for “activation” or “revenue per user.” The “what lever does this pull?” question is identical.

RICE, ICE, or Kano for this?

RICE for the ranking (one metric, competing bets). Use Kano first only if you can’t tell whether a feature is a basic expectation or a delight; it classifies, it doesn’t rank.

Can AI replace user research?

No. Feed it user signal; don’t ask it to invent signal. A model with no evidence will manufacture confident reasons, which is worse than no answer.

Which model should I actually use?

For a quarterly cut, Claude Opus 4.7 if you want the tightest argument per “no”; Gemini 3.1 Pro if you’re scoring many items and watching cost. GPT-5.5 if you’re iterating fast in chat.

Tags: #AI writing #Feature priority #Product #Roadmap #Prioritization
