Prompt Does Not State What Success Looks Like

When a prompt has no success criteria, "good" defaults to whatever the model thinks sounds confident. Here is the 5-line success block that ends revision purgatory.

Published: May 20, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You asked the model to “write a good summary of the meeting.” Two minutes later it returned 600 words. Is it good? You read it. It is fine. You re-prompt “make it better.” The next version is also fine, in slightly different ways. You spend 20 minutes evaluating versions, none of which feels clearly right. The problem is not the model. The problem is “good”: you never said what good looks like, so the model defaulted to “confident-sounding” and you defaulted to “I’ll know it when I see it.” Both definitions are wrong by construction. With no success criterion in the prompt, the model hill-climbs on confidence, not on usefulness.

Fastest fix: end the prompt with a 5-line ## Success criteria block that turns every adjective into a number, a required item, or a banned phrase (template in Step 1 below). Then ask the model to print a pass/fail checklist against that block before it stops. This is the same loop the vendors recommend — Anthropic tells builders to define specific, measurable success criteria before writing the prompt, and OpenAI’s GPT-5.5 guidance says to “describe the destination” (target outcome, success criteria, constraints, stop rules) rather than every step.

This page walks through why prompts without success criteria stay stuck in revision purgatory, and how to write a success block that makes “done” mechanical.

Common causes

1. Prompt focuses on the task, not the bar

You said what to do (“write a summary”) but not what makes it correct (length, must-include items, banned content). The model produces plausibly-shaped output and stops.

How to spot it: search your prompt for “should be”, “passes”, “must contain”. If none of these appear, you have no bar.
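
This search is easy to automate. A minimal sketch, assuming the three phrases above are a reasonable first-pass signal; `BAR_PHRASES`, `find_bar_phrases`, and `has_bar` are illustrative names, not part of any tool:

```python
# Phrases from the "How to spot it" tip. The tuple is a starting point
# (an assumption); extend it with the wording your own prompts use.
BAR_PHRASES = ("should be", "passes", "must contain")


def find_bar_phrases(prompt: str) -> list[str]:
    """Return the bar phrases that occur in the prompt, case-insensitively."""
    lowered = prompt.lower()
    return [phrase for phrase in BAR_PHRASES if phrase in lowered]


def has_bar(prompt: str) -> bool:
    """True when at least one bar phrase is present."""
    return bool(find_bar_phrases(prompt))


print(has_bar("Write a good summary of the meeting."))
print(find_bar_phrases("The summary must contain one owner and one deadline."))
```

A substring match is crude: it shows that a bar phrase exists, not that the bar behind it is measurable.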

2. Quality is assumed from context

“You know what I want” is a real internal sentence. The model does not know. It has none of the context your team has accumulated.

How to spot it: your prompt assumes shared taste with the model.

3. Multiple stakeholders, multiple definitions

Two reviewers disagree on what “done” looks like. The model averages and pleases neither.

How to spot it: reviewers reject the output for different reasons.

4. Subjective adjectives as stand-ins

“Good”, “clear”, “professional”, “useful” — all unmeasurable. The model interprets each against its training-distribution average.

How to spot it: criteria are adjectives, not numbers / rules / checklists.
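
One way to run this check mechanically is to flag the adjectives directly. A small sketch, assuming the four adjectives named above as the word list; `VAGUE_ADJECTIVES` and `find_vague_adjectives` are hypothetical names:

```python
import re

# The four adjectives named in this section; add your own repeat offenders.
VAGUE_ADJECTIVES = {"good", "clear", "professional", "useful"}


def find_vague_adjectives(prompt: str) -> list[str]:
    """Return each vague adjective once, in order of first appearance."""
    found: list[str] = []
    for word in re.findall(r"[a-z]+", prompt.lower()):
        if word in VAGUE_ADJECTIVES and word not in found:
            found.append(word)
    return found


print(find_vague_adjectives("Write a good, clear summary. Keep it professional."))
```

Each word it returns is a candidate for conversion into a number, a required item, or a banned phrase.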

5. Success looks different per input

“Good summary” of a 5-line message is different from “good summary” of a 50-page report. Criteria need to scale or be expressed as ratios.

How to spot it: same prompt works on small inputs, fails on large ones.

What a real success criterion looks like

Anthropic’s own builder guidance defines good success criteria with four properties (June 2026). Borrow them for any prompt, not just API work:

Property	Vague	Specific and testable
Specific	“good summary”	“captures every decision, owner, and deadline”
Measurable	“concise”	“80-120 words; no sentence over 25 words”
Achievable	“perfect”	“matches the held-out example I accepted last week”
Relevant	“professional”	“parseable by the downstream Slack bot: 3 numbered lines”

Anthropic’s canonical example replaces “the model should classify sentiments well” with “F1 score of at least 0.85 on a held-out test set of 10,000 posts.” You rarely need an F1 score for everyday prompting, but the move is the same: trade the adjective for a number, a required item, or a banned phrase you can check in seconds.
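
For reference, the F1 figure in that example is the harmonic mean of precision and recall. A short sketch of the arithmetic; the counts below are invented for illustration and do not come from Anthropic's documentation:

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


def meets_bar(score: float, threshold: float = 0.85) -> bool:
    """Compare a score with the success threshold (0.85 in the example above)."""
    return score >= threshold


# Hypothetical confusion counts for a sentiment classifier.
score = f1_score(true_pos=80, false_pos=20, false_neg=20)
print(score, meets_bar(score))
```

The point is the shape of the criterion: one number, one threshold, one yes/no answer.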

Before you change anything

- Write down what a perfect output would look like in 5 bullets.
- Identify which of those 5 are measurable and which are vibe.
- Find a past “perfect” output and reverse-engineer the criteria.
- Decide who is the audience and what they will do with the output.
- Plan for criteria that scale with input size if relevant.

Information to collect

- The current prompt.
- 2-3 outputs you accepted and 2-3 you rejected.
- The reason each was accepted / rejected (the implicit criteria).
- The downstream consumer of the output (human reader, parser, database).
- Model and any system prompt.

Shortest path to fix

Step 1: Add a measurable Success Criteria block

End the prompt with:

## Success criteria
- Length: 80-120 words
- Must include: 1 decision made, 1 owner, 1 deadline
- Banned: "circling back", "going forward", "let me know"
- Format: 3 numbered points + 1 follow-up question
- Tone: 2nd person, present tense, no hedging

Measurable, enforceable, fast to check.
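
If you reuse the block across prompts, it can help to keep the criteria as data and render the text from it. A minimal sketch under that assumption; the `SuccessCriteria` class and `with_criteria` helper are hypothetical, not part of any vendor SDK:

```python
from dataclasses import dataclass


@dataclass
class SuccessCriteria:
    min_words: int
    max_words: int
    must_include: list[str]
    banned: list[str]
    fmt: str
    tone: str

    def render(self) -> str:
        """Render the five-line block in the same layout as the template above."""
        banned = ", ".join(f'"{phrase}"' for phrase in self.banned)
        return "\n".join([
            "## Success criteria",
            f"- Length: {self.min_words}-{self.max_words} words",
            f"- Must include: {', '.join(self.must_include)}",
            f"- Banned: {banned}",
            f"- Format: {self.fmt}",
            f"- Tone: {self.tone}",
        ])


def with_criteria(task: str, criteria: SuccessCriteria) -> str:
    """Append the rendered block to the end of a task prompt."""
    return f"{task.rstrip()}\n\n{criteria.render()}"


meeting_summary = SuccessCriteria(
    min_words=80,
    max_words=120,
    must_include=["1 decision made", "1 owner", "1 deadline"],
    banned=["circling back", "going forward", "let me know"],
    fmt="3 numbered points + 1 follow-up question",
    tone="2nd person, present tense, no hedging",
)
print(with_criteria("Summarize the meeting notes below.", meeting_summary))
```

Keeping the numbers in one object also means a code-side checker can read the same limits the model was given.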

Step 2: Convert each adjective to a check

Adjective	Check
“Good summary”	“Captures 3 key decisions. Each decision in ≤25 words.”
“Clear writing”	“Each sentence ≤20 words. No nested clauses.”
“Useful analysis”	“At least 1 actionable next step with owner and date.”
“Professional tone”	“No exclamation marks. No emoji. No first-person.”
“Thorough review”	“Cites 3+ specific lines/files. Flags both pros and cons.”

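Several of these checks are simple enough to run in code. A rough sketch of two of them, assuming a naive sentence splitter and a short first-person word list (both are simplifications, and the emoji and nested-clause checks are left out); all function names are illustrative:

```python
import re


def sentences(text: str) -> list[str]:
    """Naive split after ., ! or ? followed by whitespace; abbreviations will confuse it."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def max_sentence_words(text: str) -> int:
    """Word count of the longest sentence, or 0 for empty text."""
    return max((len(s.split()) for s in sentences(text)), default=0)


def check_clear_writing(text: str, limit: int = 20) -> bool:
    """'Clear writing' row: each sentence has at most `limit` words."""
    return max_sentence_words(text) <= limit


def check_professional_tone(text: str) -> bool:
    """'Professional tone' row: no exclamation marks and no first-person pronouns."""
    first_person = re.search(r"\b(I|we|my|our|me|us)\b", text, re.IGNORECASE)
    return "!" not in text and first_person is None


draft = "Ship v2 on Friday. Alex owns the release."
print(check_clear_writing(draft), check_professional_tone(draft))
```
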
Step 3: Have the model self-check against the rubric

Append:

After writing, output a checklist:
- Length: [actual word count] / 80-120 → pass/fail
- Required items present: [list] → pass/fail
- Banned phrases used: [list, or "none"] → pass/fail
- If any fail, rewrite and re-check, then print the final pass/fail.

This catches mechanical failures (length, missing items, banned phrases) before a human reviewer sees the output. Two details make the self-check reliable, both straight from Anthropic’s LLM-grading guidance (June 2026):

Force a discrete verdict. Make the model output pass/fail or a 1-5 score, never a paragraph. “Purely qualitative evaluations are hard to assess quickly and at scale.”
Let it reason first, then decide. Asking the model to think before it scores raises grading accuracy on judgement-heavy tasks. On a reasoning model (GPT-5.5 Thinking, Claude Opus 4.7, Gemini 3.1 Pro), you can go further and tell it to build a 5-7 category rubric, score itself against it, and rewrite until it tops every category before answering — OpenAI’s GPT-5.5 cookbook recommends exactly this for high-stakes outputs.
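
The mechanical lines of that checklist can also be verified outside the model, which guards against a self-report that miscounts. A minimal sketch, assuming required items show up as the labels "decision", "owner", and "deadline" (a simplification; a real check may need to look for actual names and dates). `check_output` and `verdict` are hypothetical helpers:

```python
def check_output(
    text: str,
    *,
    min_words: int = 80,
    max_words: int = 120,
    required: tuple[str, ...] = ("decision", "owner", "deadline"),
    banned: tuple[str, ...] = ("circling back", "going forward", "let me know"),
) -> dict[str, bool]:
    """Re-run the Step 3 checklist in code; True means that line passes."""
    lowered = text.lower()
    word_count = len(text.split())
    return {
        "length": min_words <= word_count <= max_words,
        "required_items": all(label in lowered for label in required),
        "banned_phrases": not any(phrase in lowered for phrase in banned),
    }


def verdict(results: dict[str, bool]) -> str:
    """Collapse the per-line results into one discrete token."""
    return "pass" if all(results.values()) else "fail"


vague = (
    "The team aligned on shipping the new feature soon and will keep an eye "
    "on metrics. Sam will coordinate communications. Let me know if questions."
)
results = check_output(vague)
print(results, verdict(results))
```

If the model's own checklist and this one disagree, trust the code for countable items and the model only for the judgement calls.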

For anything you run repeatedly, move the grading into a real eval harness (a small set of accepted/rejected examples you re-score on every prompt change) instead of eyeballing each output. See the guide on ambiguous evaluation criteria for how to build that scoring set.
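
One reading of such a harness, sketched under assumptions: each example is an output paired with the human verdict, and `passes` is whatever criteria check you currently use. The `rescore` function and the toy check in the usage lines are hypothetical:

```python
from typing import Callable

# Each example pairs an output with the human verdict (True = accepted).
Example = tuple[str, bool]


def rescore(examples: list[Example], passes: Callable[[str], bool]) -> dict:
    """Compare a criteria check against human accept/reject labels."""
    false_passes = [text for text, accepted in examples if passes(text) and not accepted]
    false_fails = [text for text, accepted in examples if not passes(text) and accepted]
    agreed = len(examples) - len(false_passes) - len(false_fails)
    return {
        "agreement": agreed / len(examples) if examples else 0.0,
        "false_passes": false_passes,  # check said pass, human rejected: criteria too loose
        "false_fails": false_fails,    # check said fail, human accepted: criteria too strict
    }


examples = [
    ("Decision: ship. Owner: Alex.", True),
    ("We will sync soon.", False),
    ("Owner: nobody yet, maybe later.", False),
]
# Toy check for illustration only: an output passes if it names an owner.
print(rescore(examples, lambda text: "Owner:" in text))
```

Every entry in `false_passes` is a rejection reason the criteria do not yet encode.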

Step 4: Provide a “passes” and “fails” example

Passes the criteria (shortened for illustration; a real pass would also land in the 80-120 word range):
"1. Decision: ship v2 on Friday. Owner: Alex. Deadline: 2026-05-26.
2. Decision: hold the launch tweet. Owner: Sam. Deadline: TBD.
3. Decision: roll back if error rate > 2%. Owner: on-call. Deadline: continuous.
Follow-up: who owns the rollback playbook?"

Fails the criteria (too vague):
"The team aligned on shipping the new feature soon and will keep an eye
on metrics. Sam will coordinate communications. Let me know if questions."

A contrasting pair shows the model the bar more precisely than a description of it can.

Step 5: Scale criteria with input size if relevant

For tasks where input varies:

Success criteria (scaled):
- Length: min(input_word_count / 10, 200) words
- Must capture: at least 1 decision per 100 words of input
- ...
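
The two scaled lines above work out as follows for a few input sizes. A small sketch that implements them literally, assuming integer division for the rounding (the source does not say how to round); `scaled_criteria` is an illustrative name:

```python
def scaled_criteria(input_word_count: int) -> dict[str, int]:
    """Apply the two scaled rules from the block above to one input size."""
    return {
        # Length: min(input_word_count / 10, 200) words
        "length_words": min(input_word_count // 10, 200),
        # Must capture: at least 1 decision per 100 words of input
        "min_decisions": input_word_count // 100,
    }


for size in (50, 1000, 5000):
    print(size, scaled_criteria(size))
```

Tabulating the formula this way exposes its edges: a 50-word input rounds down to a 5-word summary with no required decisions, and a 5,000-word input asks for 50 decisions inside 200 words. Treat the ratios as a starting point and add a floor or cap where they stop making sense for your task.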

Step 6: Move stable criteria to project / system instructions

If you keep retyping the same criteria, lift them into a persistent instruction so every turn inherits them. As of June 2026 the places to put them are:

- ChatGPT: a Project’s instructions, or a custom GPT’s “Instructions” field.
- Claude: a Project’s custom instructions, or the API system prompt.
- Gemini: a Gem’s instructions, or “Saved info” under Settings.

This saves prompt space, keeps the bar consistent across turns, and means you only maintain the criteria in one place.

How to confirm the fix

- The model’s self-check passes on every output.
- Two reviewers looking at the same output reach the same accept/reject verdict.
- Running the same prompt 3 times produces 3 outputs that all pass criteria.
- You spend less than 60 seconds deciding whether an output is “done”.
- “Make it better” is no longer the dominant follow-up; specific fixes are.
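
Two of these confirmations, the repeat-run test and the reviewer-agreement test, can be scripted. A minimal sketch: `generate` is a stand-in for however you call your model (no real API is assumed), and `passes` is your own criteria check:

```python
from typing import Callable


def all_runs_pass(
    generate: Callable[[], str],
    passes: Callable[[str], bool],
    runs: int = 3,
) -> bool:
    """Generate up to `runs` outputs and stop at the first one that fails."""
    return all(passes(generate()) for _ in range(runs))


def reviewers_agree(verdicts: list[bool]) -> bool:
    """True when every reviewer reached the same accept/reject verdict."""
    return len(set(verdicts)) <= 1


# Canned outputs replace the model call so the sketch runs offline.
canned = iter(["Owner: Alex", "Owner: Sam", "No owner named"])
print(all_runs_pass(lambda: next(canned), lambda text: text.startswith("Owner:")))
print(reviewers_agree([True, True]))
```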

If it still fails

- Criteria may still be too vague. Write a “passes” example yourself; if your own example is borderline, the criteria are loose.
- Add 1-2 more pass/fail examples; concrete examples often pin down the bar better than additional rules.
- The task may genuinely have no single definition of success. Split it into sub-tasks, each with its own criteria.
- If reviewers disagree on outputs that pass the criteria, the criteria do not capture what you actually want. Revise them.

Prevention

- Default: every prompt ends with a measurable Success Criteria block.
- Build personal templates per task type with reusable criteria.
- For team work, agree on success criteria before delegating to AI.
- Audit accepted outputs monthly: are they passing your criteria or just your gut?
- Treat “make it better” as a smell. If you say this, the criteria are missing or wrong.
- When in doubt, ask the model to propose 3 candidate criteria; pick one.

FAQ

How many criteria should a success block have?

Three to six lines. Anthropic notes most real tasks need “multidimensional” criteria along several axes (length, required items, tone, format), but a block longer than about six lines usually means you are encoding taste you have not yet pinned down. Start with the 2-3 that actually cause rejections and add more only when an output passes the block but you still reject it.

Won’t a strict success block make outputs robotic or kill creativity?

No, if you constrain the bar, not the path. Modern models do best when you “describe the destination” (the outcome and how it is judged) and leave the route open — that is OpenAI’s explicit GPT-5.5 guidance. Banning “circling back” and requiring an owner per decision constrains form, not ideas. If outputs feel flat, your criteria are over-specifying wording rather than results; loosen the wording rules and keep the must-include items.

The model’s self-check says “pass” but the output is still wrong. Why?

The checklist is grading the wrong thing. A self-check only verifies the criteria you wrote, so a passing-but-wrong output means the criteria miss what you actually care about. Take that exact output, write down why you rejected it, and add that reason as a new line. This is the single fastest way to tighten a loose block. Also force the verdict to be a literal pass/fail token, not prose — a model asked for a paragraph will rationalize a pass.
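
Requiring a literal token also makes the checklist machine-readable. A sketch of a parser, assuming the model followed the Step 3 layout ("- Label: details → pass/fail"); the label names in `EXPECTED` are taken from that template, and the function names are hypothetical:

```python
import re

# A checklist line counts only if it ends in a literal pass/fail token.
VERDICT_AT_END = re.compile(r"→\s*(pass|fail)\s*$", re.IGNORECASE)
EXPECTED = ("Length", "Required items present", "Banned phrases used")


def parse_checklist(checklist: str) -> dict[str, str]:
    """Map each checklist label to its verdict token; prose lines are skipped."""
    verdicts: dict[str, str] = {}
    for line in checklist.splitlines():
        match = VERDICT_AT_END.search(line)
        if match:
            label = line.lstrip("- ").split(":", 1)[0].strip()
            verdicts[label] = match.group(1).lower()
    return verdicts


def overall(verdicts: dict[str, str], expected: tuple[str, ...] = EXPECTED) -> str:
    """Pass only if every expected label is present and marked pass."""
    return "pass" if all(verdicts.get(label) == "pass" for label in expected) else "fail"


reply = (
    "- Length: 97 / 80-120 → pass\n"
    "- Required items present: decision, owner, deadline → pass\n"
    "- Banned phrases used: none → pass"
)
print(overall(parse_checklist(reply)))
```

A missing or prose-only line counts as a fail here, so a rationalized paragraph cannot slip through as a pass.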

Can the AI write the success criteria for me?

Yes, and it is a good starting move. Paste 2-3 outputs you accepted and 2-3 you rejected and ask the model to infer the rules that separate them, then ask for 3 candidate success blocks. Edit the best one by hand. Do not ship criteria you have not read — the model will happily invent a plausible bar that is not yours.
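
The request itself can be assembled from your saved examples. A minimal sketch of a prompt builder; the wording of the instructions and the `build_inference_prompt` name are illustrative choices, not a recommended canonical prompt:

```python
def build_inference_prompt(accepted: list[str], rejected: list[str]) -> str:
    """Assemble a prompt asking the model to infer criteria from labeled outputs."""
    parts = ["Below are outputs I accepted and outputs I rejected for the same task."]
    for index, text in enumerate(accepted, start=1):
        parts.append(f"ACCEPTED {index}:\n{text}")
    for index, text in enumerate(rejected, start=1):
        parts.append(f"REJECTED {index}:\n{text}")
    parts.append(
        "Infer the rules that separate the accepted outputs from the rejected ones. "
        "Then propose 3 candidate success criteria blocks, each 3-6 lines, using only "
        "numbers, required items, and banned phrases."
    )
    return "\n\n".join(parts)


print(build_inference_prompt(
    accepted=["1. Decision: ship v2 on Friday. Owner: Alex."],
    rejected=["The team aligned on shipping soon."],
))
```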

My prompt works on short inputs but fails on long ones. Is that a criteria problem?

Usually yes. A fixed “80-120 words” rule is wrong for a 50-page report. Express the bar as a capped ratio or a floor (see Step 5): length = min(input_words / 10, 200) words, and “at least 1 decision captured per 100 words of input.” Criteria that do not scale with input size are a common reason a tuned prompt breaks on the next, bigger task.
