You asked the model to “write a good summary of the meeting.” Two minutes later it returned 600 words. Is it good? You read it. It is fine. You re-prompt with “make it better.” The next version is also fine, in slightly different ways. You spend 20 minutes evaluating versions, none of which feel clearly right. The problem is not the model. The problem is “good”: you never said what good looks like, so the model defaulted to “confident-sounding” and you defaulted to “I’ll know it when I see it.” Neither definition can be checked. Without a success criterion in the prompt, the model optimizes for sounding confident, not for being useful.
This page walks through why prompts without success criteria stay stuck in revision purgatory, and how to write a short success block that makes “done” mechanical.
Common causes
1. Prompt focuses on the task, not the bar
You said what to do (“write a summary”) but not what makes it correct (length, must-include items, banned content). The model produces plausibly-shaped output and stops.
How to spot it: search your prompt for “should be”, “passes”, “must contain”. If none of these appear, you have no bar.
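The search described above can be automated with a few lines of Python. This is a minimal sketch, not a definitive linter: the function name `find_bar_phrases` is hypothetical, and the phrase list contains only the three phrases named in the tip, so extend it for your own prompts.

```python
# Phrases taken from the "How to spot it" tip; the list is intentionally small.
BAR_PHRASES = ("should be", "passes", "must contain")


def find_bar_phrases(prompt: str) -> list[str]:
    """Return the bar phrases found in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [phrase for phrase in BAR_PHRASES if phrase in lowered]


if __name__ == "__main__":
    prompt = "Write a good summary of the meeting."
    found = find_bar_phrases(prompt)
    # An empty list suggests the prompt states a task but no bar.
    print(found if found else "No bar found in this prompt.")
```

An empty result does not prove the prompt lacks criteria (they may be phrased differently), so treat it as a hint to reread the prompt.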
2. Quality is assumed from context
“You know what I want” is something people genuinely think while prompting. The model does not know. It has none of the context your team has accumulated.
How to spot it: your prompt relies on preferences it never states, such as “the usual format” or “our style”.
3. Multiple stakeholders, multiple definitions
Two reviewers disagree on what “done” looks like. The model averages and pleases neither.
How to spot it: reviewers reject the output for different reasons.
4. Subjective adjectives as stand-ins
“Good”, “clear”, “professional”, “useful” — all unmeasurable. The model interprets each against its training-distribution average.
How to spot it: criteria are adjectives, not numbers / rules / checklists.
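One way to run this check is to flag the subjective adjectives directly. The sketch below assumes a small word list (the four adjectives named above) and a hypothetical helper name, `subjective_adjectives`; it matches whole words only, so “clearly” is not flagged.

```python
import re

# The four adjectives named in this section; add your own repeat offenders.
SUBJECTIVE = ("good", "clear", "professional", "useful")
_PATTERN = re.compile(r"\b(" + "|".join(SUBJECTIVE) + r")\b", re.IGNORECASE)


def subjective_adjectives(prompt: str) -> list[str]:
    """Return each subjective adjective found, once, in order of first appearance."""
    found: list[str] = []
    for match in _PATTERN.finditer(prompt):
        word = match.group(1).lower()
        if word not in found:
            found.append(word)
    return found


if __name__ == "__main__":
    print(subjective_adjectives("Write a good, clear summary. Keep it Good."))
```

Each flagged word is a candidate for replacement with a number, a rule, or a checklist item.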
5. Success looks different per input
“Good summary” of a 5-line message is different from “good summary” of a 50-page report. Criteria need to scale or be expressed as ratios.
How to spot it: same prompt works on small inputs, fails on large ones.
Before you change anything
- Write down what a perfect output would look like in 5 bullets.
- Identify which of those 5 are measurable and which are vibe.
- Find a past “perfect” output and reverse-engineer the criteria.
- Decide who is the audience and what they will do with the output.
- Plan for criteria that scale with input size if relevant.
Information to collect
- The current prompt.
- 2-3 outputs you accepted and 2-3 you rejected.
- The reason each was accepted / rejected (the implicit criteria).
- The downstream consumer of the output (human reader, parser, database).
- Model and any system prompt.
Shortest path to fix
Step 1: Add a measurable Success Criteria block
End the prompt with:
## Success criteria
- Length: 80-120 words
- Must include: 1 decision made, 1 owner, 1 deadline
- Banned: "circling back", "going forward", "let me know"
- Format: 3 numbered points + 1 followup question
- Tone: 2nd person, present tense, no hedging
Measurable, enforceable, fast to check.
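Because most of these criteria are mechanical, they can also be checked in code. The following is a rough sketch under stated assumptions: `check_output` is a hypothetical helper, words are counted by splitting on whitespace, and required items are detected by the labels “Decision:”, “Owner:”, and “Deadline:”, which only works if the output labels its items that way. The tone criterion is not checked here.

```python
import re

BANNED = ("circling back", "going forward", "let me know")
# Assumption: the output labels its required items with these prefixes.
REQUIRED_LABELS = ("decision:", "owner:", "deadline:")


def check_output(text: str, min_words: int = 80, max_words: int = 120) -> dict[str, bool]:
    """Check a draft against the length, required-item, banned-phrase, and format criteria."""
    lowered = text.lower()
    word_count = len(text.split())  # whitespace-separated tokens
    # Lines that start with "1. ", "2. ", and so on.
    numbered_points = re.findall(r"^[ \t]*\d+\.[ \t]", text, flags=re.MULTILINE)
    return {
        "length": min_words <= word_count <= max_words,
        "required_items": all(label in lowered for label in REQUIRED_LABELS),
        "no_banned_phrases": not any(phrase in lowered for phrase in BANNED),
        # Three numbered points, and the text ends with a question.
        "format": len(numbered_points) == 3 and text.rstrip().endswith("?"),
    }


if __name__ == "__main__":
    draft = "The team aligned on shipping soon. Let me know if questions."
    for name, passed in check_output(draft).items():
        print(f"{name}: {'pass' if passed else 'fail'}")
```

A dictionary of named results keeps the verdict explainable: a failed draft reports which criterion it missed.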
Step 2: Convert each adjective to a check
| Adjective | Check |
|---|---|
| "Good summary" | "Captures 3 key decisions. Each decision in ≤25 words." |
| "Clear writing" | "Each sentence ≤20 words. No nested clauses." |
| "Useful analysis" | "At least 1 actionable next step with owner and date." |
| "Professional tone" | "No exclamation marks. No emoji. No first-person." |
| "Thorough review" | "Cites 3+ specific lines/files. Flags both pros and cons." |
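Two rows of the table translate almost directly into code. This is an illustrative sketch only: the helper names are hypothetical, the sentence splitter is naive (abbreviations such as “e.g.” will confuse it), the first-person word list is a short assumption, and emoji detection is left out.

```python
import re

# Assumption: a short list of first-person pronouns is enough for a first pass.
# Case-insensitive matching means "US" is also flagged as "us".
FIRST_PERSON = re.compile(r"\b(I|me|my|we|us|our)\b", re.IGNORECASE)


def long_sentences(text: str, max_words: int = 20) -> list[str]:
    """'Clear writing' check: return sentences longer than max_words."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if len(s.split()) > max_words]


def professional_tone_issues(text: str) -> list[str]:
    """'Professional tone' check: exclamation marks and first person."""
    issues = []
    if "!" in text:
        issues.append("exclamation mark")
    if FIRST_PERSON.search(text):
        issues.append("first person")
    return issues


if __name__ == "__main__":
    print(professional_tone_issues("We shipped it!"))
    print(long_sentences("Each sentence is short. So is this one."))
```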
Step 3: Have the model self-check
Append:
After writing, output a checklist:
- Length: [actual word count] / 80-120 → pass/fail
- Required items present: [list] → pass/fail
- Banned phrases used: [list, or "none"] → pass/fail
- If any fail, rewrite and re-check.
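This catches many issues before a human reviewer sees the output. Treat the self-reported numbers as a first pass: models can miscount words, so spot-check the counts.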
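The same rewrite-and-re-check loop can be driven from code instead of relying on the model's own report. The sketch below is one possible shape, not a prescribed implementation: `generate` stands in for whatever function sends a prompt to your model and returns text, and `check` is any function that returns a list of failure descriptions. Both are hypothetical parameters supplied by the caller.

```python
from typing import Callable


def generate_until_pass(
    generate: Callable[[str], str],     # hypothetical: prompt in, model text out
    check: Callable[[str], list[str]],  # returns failure descriptions; empty means pass
    prompt: str,
    max_attempts: int = 3,
) -> tuple[str, list[str]]:
    """Regenerate with the failed checks appended until the draft passes or attempts run out."""
    output = generate(prompt)
    failures = check(output)
    attempts = 1
    while failures and attempts < max_attempts:
        feedback = "\n".join(f"- {failure}" for failure in failures)
        retry_prompt = f"{prompt}\n\nThe previous draft failed these checks:\n{feedback}"
        output = generate(retry_prompt)
        failures = check(output)
        attempts += 1
    # If failures is non-empty here, hand the draft to a human reviewer.
    return output, failures
```

Capping the attempts matters: if a draft still fails after a few rounds, the criteria or the task are more likely at fault than the wording of the draft.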
Step 4: Provide a “passes” and “fails” example
Passes the content and format criteria (shortened here; a real output would also need to meet the length range):
"1. Decision: ship v2 on Friday. Owner: Alex. Deadline: 2026-05-26.
2. Decision: hold the launch tweet. Owner: Sam. Deadline: TBD.
3. Decision: roll back if error rate > 2%. Owner: on-call. Deadline: continuous.
Follow-up: who owns the rollback playbook?"
Fails the criteria (too vague):
"The team aligned on shipping the new feature soon and will keep an eye
on metrics. Sam will coordinate communications. Let me know if questions."
A contrasting pair like this usually communicates the bar better than a description alone.
Step 5: Scale criteria with input size if relevant
For tasks where input varies:
Success criteria (scaled):
- Length: min(input_word_count / 10, 200) words
- Must capture: at least 1 decision per 100 words of input
- ...
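The two scaled rules above are simple arithmetic, which can be computed before the prompt is sent and pasted in as concrete numbers. This sketch assumes whitespace word counting and integer division; the floor of one decision for short inputs is an added assumption, not part of the rules above.

```python
def scaled_criteria(input_text: str) -> dict[str, int]:
    """Compute concrete targets from the input size."""
    input_words = len(input_text.split())
    return {
        # Length: min(input_word_count / 10, 200) words
        "target_words": min(input_words // 10, 200),
        # At least 1 decision per 100 words of input.
        # Assumption: never ask for fewer than 1, even on very short inputs.
        "min_decisions": max(1, input_words // 100),
    }


if __name__ == "__main__":
    report = " ".join(["word"] * 1000)  # stand-in for a 1,000-word input
    print(scaled_criteria(report))
```

Stating the computed numbers in the prompt (for example, “Length: 100 words”) avoids asking the model to do the arithmetic itself.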
Step 6: Move stable criteria to project / system prompt
If you keep writing the same criteria, lift them into a project instruction or system prompt. This saves prompt space and keeps the criteria consistent across turns.
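If the prompts are assembled in code, stable criteria can live in one reusable object rather than being retyped. The sketch below is one way to do that; the class `SuccessCriteria`, the constant `MEETING_SUMMARY`, and the function `build_prompt` are hypothetical names, and the rendered block mirrors the Step 1 example minus the tone line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    min_words: int
    max_words: int
    must_include: tuple[str, ...]
    banned: tuple[str, ...]
    output_format: str

    def render(self) -> str:
        """Render the criteria as a block that can end any prompt."""
        banned = ", ".join(f'"{phrase}"' for phrase in self.banned)
        return "\n".join([
            "## Success criteria",
            f"- Length: {self.min_words}-{self.max_words} words",
            f"- Must include: {', '.join(self.must_include)}",
            f"- Banned: {banned}",
            f"- Format: {self.output_format}",
        ])


# One shared definition per task type.
MEETING_SUMMARY = SuccessCriteria(
    min_words=80,
    max_words=120,
    must_include=("1 decision made", "1 owner", "1 deadline"),
    banned=("circling back", "going forward", "let me know"),
    output_format="3 numbered points + 1 followup question",
)


def build_prompt(task: str, criteria: SuccessCriteria) -> str:
    """Append the rendered criteria block to a task description."""
    return f"{task.strip()}\n\n{criteria.render()}"


if __name__ == "__main__":
    print(build_prompt("Summarize the meeting notes below.", MEETING_SUMMARY))
```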
How to confirm the fix
- The model’s self-check passes on every output.
- Two reviewers looking at the same output reach the same accept/reject verdict.
- Running the same prompt 3 times produces 3 outputs that all pass criteria.
- You spend less than 60 seconds deciding whether an output is “done”.
- “Make it better” is no longer the dominant follow-up — specific fixes are.
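Two of these confirmations are easy to measure: whether repeated runs all pass, and how often two reviewers reach the same verdict. The sketch below uses hypothetical helper names and assumes you already have a pass/fail `check` function and each reviewer's accept/reject verdicts recorded as booleans.

```python
from typing import Callable, Sequence


def all_pass(outputs: Sequence[str], check: Callable[[str], bool]) -> bool:
    """True if every output from repeated runs passes the check."""
    return all(check(output) for output in outputs)


def agreement_rate(reviewer_a: Sequence[bool], reviewer_b: Sequence[bool]) -> float:
    """Fraction of outputs on which two reviewers gave the same verdict."""
    if len(reviewer_a) != len(reviewer_b) or not reviewer_a:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(a == b for a, b in zip(reviewer_a, reviewer_b))
    return matches / len(reviewer_a)


if __name__ == "__main__":
    # Verdicts for the same four outputs: True = accept, False = reject.
    print(agreement_rate([True, False, True, True], [True, True, True, False]))
```

A low agreement rate on outputs that pass the criteria points back to the criteria themselves, not to the reviewers.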
If it still fails
- Criteria may still be too vague — write a “passes” example yourself; if your example is borderline, the criteria are loose.
- Add 1-2 more pass/fail examples; examples often steer the model more reliably than rules alone.
- The task may genuinely have no single definition of success. If so, split it into sub-tasks, each with its own criteria.
- If reviewers disagree on outputs that pass criteria, the criteria do not capture what you actually want — revise.
Prevention
- Default: every prompt ends with a measurable Success Criteria block.
- Build personal templates per task type with reusable criteria.
- For team work, agree on success criteria before delegating to AI.
- Audit accepted outputs monthly: are they passing your criteria or just your gut?
- Treat “make it better” as a smell — if you say this, the criteria are missing or wrong.
- When in doubt, ask the model to propose 3 candidate criteria; pick one.
Related reading
- Ambiguous evaluation criteria
- Output sounds polished but is not actionable
- Prompt asks for “best” without defining it
- No output format specified
- Conflicting instructions weaken output