Ambiguous Evaluation Criteria Cause Weak Answers

Your criteria are vague — "engaging", "professional", "innovative" — so the model interprets freely. Turn each adjective into a testable rule with a 10-second check.

Published: May 20, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

You wrote a careful prompt with criteria: “engaging, professional, and innovative”. The output is technically all three. It is also unusable: the “engaging” hook feels like a LinkedIn post, the “professional” tone reads as cold, and the “innovative” angle is the same one three competitors used last month. The model is not failing — your criteria are not criteria. They are vibes. The model interpreted each adjective against its training-distribution average, and the average is exactly what you do not want.

Fastest fix: for each criterion, write a 10-second test a stranger could run ("opens with a number or a concrete scene", not “engaging”), rank the criteria, and add one pass example plus one fail example for the hardest one. If you only have time for one change, swap every adjective for a checkable rule. This page walks through the full process and how to confirm it stuck.

Common causes

1. Criteria are adjectives, not rules

Words like “engaging”, “professional”, “innovative”, “natural”, and “polished” cannot be checked. Two reviewers will disagree, and the model has no anchor. This is the single most common cause, and the one that rubric guidance published with evaluation tooling keeps flagging: fuzzy terms like “good”, “useful”, or “high quality” produce disagreement unless they are defined operationally, because a model executes procedures, not adjectives.

How to spot it: read your criteria aloud. If you cannot describe a 10-second test for each one, it is an adjective, not a rule.

2. No pass / fail examples

If you say “professional” without showing a “yes, this is professional” and a “no, this is not”, the model defaults to its own definition.

How to spot it: your prompt has zero examples of acceptable output and zero rejected examples.

3. Criteria silently conflict

“Innovative but on-brand”, “engaging but professional”, “comprehensive but concise” — each pair has a tension you have not resolved. The model picks one side, often the wrong one.

How to spot it: pairs of criteria where pushing one direction up pushes the other down.

4. Criteria assume shared taste

You wrote “make it feel like our brand” without defining “our brand”. The model has never read your style guide.

How to spot it: criteria reference things only your team knows.

5. No ranking

“All criteria matter equally” is rarely true. When you do not rank, the model averages, and averages are mediocre.

How to spot it: when the model trades off, it sacrifices the criterion you cared about most.

6. The criterion has no length neutrality

When a criterion like “thorough” or “comprehensive” has no upper bound, the model treats longer as better — and so do most reviewers, including LLM judges, which carry a documented verbosity bias. You ask for quality and get word count.

How to spot it: your accepted outputs keep getting longer over time without getting better.
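
One rough way to watch for this symptom, assuming you keep accepted outputs in chronological order, is to compare the average word count of the oldest few against the newest few. The function name `length_drift` and the default window of 3 are hypothetical choices for this sketch. A ratio well above 1 only shows that outputs are growing; whether they are also getting better still needs a human or rubric check.

```python
from statistics import mean


def word_counts(outputs):
    """Whitespace-delimited word count for each output, in order."""
    return [len(text.split()) for text in outputs]


def length_drift(outputs, window=3):
    """Ratio of mean length of the newest `window` outputs to the oldest `window`.

    `outputs` is assumed to be sorted oldest first. A value near 1.0 means
    length is stable; a value of 2.0 means recent outputs are twice as long.
    """
    counts = word_counts(outputs)
    if len(counts) < 2 * window:
        raise ValueError("need at least 2 * window outputs")
    oldest = mean(counts[:window])
    newest = mean(counts[-window:])
    if oldest == 0:
        raise ValueError("oldest outputs are empty")
    return newest / oldest
```

If the ratio keeps climbing while reviewers report no quality gain, that points at an unbounded criterion such as “thorough”.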

Which bucket are you in

Symptom	Most likely cause	Go to
Output is “correct” but flat and generic	Adjectives, not rules (cause 1)	Step 1
Two people on your team disagree on whether it passes	No pass/fail examples (cause 2)	Step 2
Model nails one criterion and ignores another	Silent conflict / no ranking (causes 3, 5)	Step 3
“Make it sound like us” never lands	Shared-taste assumption (cause 4)	Step 5
Outputs keep getting longer, not better	No length neutrality (cause 6)	Step 1 (add a cap)

Before you change anything

List every criterion in your current prompt.
For each, draft a 10-second test a stranger could perform.
Find or write one “pass” and one “fail” example for the hardest two.
Identify which criteria conflict with which.
Decide the rank order before re-prompting.

Information to collect

Current prompt with all criteria.
An output you accept and an output you reject, both labeled.
The reasons each was accepted or rejected (so you can extract rules).
Model, temperature, system prompt.
Whether reviewers actually agree on each criterion (often they do not).

Shortest path to fix

Step 1: Operationalize each adjective

Convert every taste word into a testable rule. A strong rule is specific, observable, and bounded (so “thorough” cannot quietly mean “longer”):

Adjective	Testable rule
“Engaging”	“Opens with a question, a statistic, or a concrete scene. Not ‘In today’s world’.”
“Professional”	“No exclamation marks. No emoji. No first-person plural (‘we’). No contractions.”
“Innovative”	“Mentions at least one specific named technique, tool, or pattern not in the top-5 industry list.”
“Concise”	“Under 200 words. Each sentence under 20 words.”
“Natural”	“Sentence length variance: at least one under 10 words, at least one over 20.”
“Thorough”	“Covers all 4 listed sub-points. Equal quality at fewer words scores the same as more words.”

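A useful side effect of testable rules is that some of them can be checked by a script as well as by a stranger. The sketch below encodes the “Professional” and “Concise” rows from the table. The function names, the emoji ranges, and the contraction pattern are assumptions made for illustration, and the regexes are deliberately rough: they will miss some cases (for example “it's”) and should be treated as a first filter, not a verdict.

```python
import re

# Rough emoji ranges (pictographs, misc symbols, dingbats); not exhaustive.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Matches n't, 're, 've, 'll, 'd, 'm. The 's ending is skipped on purpose
# because it also marks possessives, so "it's" slips through this sketch.
CONTRACTION = re.compile(r"\b\w+['’](?:t|re|ve|ll|d|m)\b", re.IGNORECASE)
FIRST_PERSON_PLURAL = re.compile(r"\bwe\b", re.IGNORECASE)


def split_sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def check_professional(text):
    """Return a list of violations; an empty list means the rule passes."""
    violations = []
    if "!" in text:
        violations.append("exclamation mark")
    if EMOJI.search(text):
        violations.append("emoji")
    if FIRST_PERSON_PLURAL.search(text):
        violations.append("first-person plural")
    if CONTRACTION.search(text):
        violations.append("contraction")
    return violations


def check_concise(text, max_words=200, max_sentence_words=20):
    """'Under 200 words. Each sentence under 20 words.' Limits are exclusive."""
    violations = []
    total = len(text.split())
    if total >= max_words:
        violations.append(f"{total} words (limit: under {max_words})")
    for i, sentence in enumerate(split_sentences(text), start=1):
        n = len(sentence.split())
        if n >= max_sentence_words:
            violations.append(f"sentence {i} has {n} words (limit: under {max_sentence_words})")
    return violations
```

Rules such as “Innovative” or “concrete scene” do not reduce to a regex; those still need a human or a model check against the written rule.
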
Step 2: Provide one pass and one fail example

For the trickiest criterion, include:

Example of acceptable "engaging" opening:
  "73% of teams have already abandoned their first AI rollout. Here is what
   the survivors did differently."

Example of unacceptable "engaging" opening:
  "In today's rapidly evolving landscape of artificial intelligence,
   organizations are discovering new opportunities."

The acceptable one uses a specific number and a concrete frame. The
unacceptable one uses generic language and corporate buzz.

Examples beat 100 words of adjectives.
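
The same pass/fail pair can double as a regression test for any automated check you write. The sketch below is a hypothetical checker for the “engaging” opening rule: it accepts an opener that is a question or contains a number, and rejects the banned “In today's” opener. It cannot recognize a “concrete scene”, so that branch of the rule is an acknowledged gap that still needs a reader.

```python
import re

# Openers the rule explicitly rejects, lowercased for comparison.
BANNED_OPENERS = ("in today's",)


def first_sentence(text):
    """Everything up to the first sentence-ending punctuation mark."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def check_engaging_opening(text):
    """True if the first sentence is a question or contains a number,
    and does not start with a banned opener."""
    opener = first_sentence(text)
    normalized = opener.lower().replace("’", "'")
    if normalized.startswith(BANNED_OPENERS):
        return False
    is_question = opener.endswith("?")
    has_number = bool(re.search(r"\d", opener))
    return is_question or has_number


PASS_EXAMPLE = (
    "73% of teams have already abandoned their first AI rollout. "
    "Here is what the survivors did differently."
)
FAIL_EXAMPLE = (
    "In today's rapidly evolving landscape of artificial intelligence, "
    "organizations are discovering new opportunities."
)
```

If the checker ever accepts the fail example or rejects the pass example, either the rule or the checker has drifted.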

Step 3: Rank the criteria

Declare order of priority and what to sacrifice when they clash:

Priority order (drop from bottom up if you cannot satisfy all):
1. Factually correct (never violate)
2. Under 200 words
3. Operationalized "engaging" rules above
4. Operationalized "professional" rules above
5. Brand voice anchor

If "engaging" and "professional" conflict, prefer "engaging", because it ranks higher in the list above.
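
If you reuse a ranking across prompts, it can help to keep it as data and generate the priority block from it, so the list and any tie-break decision come from one place. This is a minimal sketch under that assumption; `Criterion`, `render_priority_block`, `winner`, and `next_to_drop` are hypothetical names, and the criteria are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    never_violate: bool = False


# Highest priority first; names copied from the priority list above.
RANKED = [
    Criterion("Factually correct", never_violate=True),
    Criterion("Under 200 words"),
    Criterion('Operationalized "engaging" rules above'),
    Criterion('Operationalized "professional" rules above'),
    Criterion("Brand voice anchor"),
]


def render_priority_block(ranked):
    """Build the numbered priority text to paste into a prompt."""
    lines = ["Priority order (drop from bottom up if you cannot satisfy all):"]
    for position, criterion in enumerate(ranked, start=1):
        suffix = " (never violate)" if criterion.never_violate else ""
        lines.append(f"{position}. {criterion.name}{suffix}")
    return "\n".join(lines)


def winner(ranked, name_a, name_b):
    """Of two clashing criteria, return the one that ranks higher."""
    order = [c.name for c in ranked]
    return min(name_a, name_b, key=order.index)


def next_to_drop(ranked, already_dropped=()):
    """Lowest-ranked criterion that may still be dropped, or None."""
    for criterion in reversed(ranked):
        if criterion.never_violate:
            continue
        if criterion.name not in already_dropped:
            return criterion.name
    return None
```

Calling `winner(RANKED, "Brand voice anchor", "Under 200 words")` returns `"Under 200 words"`, and any tie-break sentence written from that result cannot contradict the list.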

Step 4: Have the model self-audit (reason, then verdict)

Append a checklist, and force the model to state its reasoning before the yes/no so the verdict follows from the check rather than the other way around:

After writing, output a checklist. For each criterion, give the evidence
first, then the verdict:
- Criterion 1 (factually correct): evidence -> yes/no
- Criterion 2 (under 200 words): word count -> yes/no
- Criterion 3 (engaging rules): which rule was satisfied -> yes/no
- Criterion 4 (professional rules): any violations -> yes/no
If any are no, rewrite and re-check.

Use plain pass/fail per criterion rather than a 1-to-5 score. Asking for “3 out of 5” makes the model do two jobs at once — decide if it is good enough and pick a spot on an arbitrary scale — and the scale-placement step adds noise. Reserve a graded scale only for criteria with genuine gradations (coherence, empathy, pedagogical quality); use pass/fail for hard gates like factual correctness, length caps, and policy rules.
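
If the checklist is requested in a fixed line format, a script can read it back and decide whether a rewrite is needed. The sketch below assumes the model follows the `- Criterion N (name): evidence -> yes/no` shape from the prompt above, which real outputs will not always do; `parse_audit` and `needs_rewrite` are hypothetical helpers. Missing lines and empty evidence are treated as failures, on the view that a verdict without evidence should not be trusted.

```python
import re

AUDIT_LINE = re.compile(
    r"^- Criterion (?P<num>\d+) \((?P<name>[^)]*)\): "
    r"(?P<evidence>.*) -> (?P<verdict>yes|no)\s*$",
    re.IGNORECASE,
)


def parse_audit(text):
    """Extract one record per well-formed checklist line; skip the rest."""
    results = []
    for line in text.splitlines():
        match = AUDIT_LINE.match(line.strip())
        if match:
            results.append({
                "num": int(match["num"]),
                "name": match["name"],
                "evidence": match["evidence"].strip(),
                "passed": match["verdict"].lower() == "yes",
            })
    return results


def needs_rewrite(results, expected=4):
    """True if any criterion is missing, failed, or lacks evidence."""
    if len(results) < expected:
        return True
    return any(not r["passed"] or not r["evidence"] for r in results)


sample = """- Criterion 1 (factually correct): both figures match the source -> yes
- Criterion 2 (under 200 words): 184 words -> yes
- Criterion 3 (engaging rules): opens with a statistic -> yes
- Criterion 4 (professional rules): contains "we're" -> no"""
```

Here `needs_rewrite(parse_audit(sample))` is true because criterion 4 reports a violation. The self-audit is still the model grading itself, so it complements rather than replaces an external check.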

Step 5: Lock subjective ones with anchors

For criteria that are inherently subjective (“our brand voice”), give 2-3 sentences of canonical brand copy as a voice anchor. The model imitates anchors more reliably than it follows adjectives.

Step 6: Test with an edge-case input

Feed the model an input on which two criteria clearly conflict. If the output still tries to satisfy both, the priority ranking did not land. Restate the ranking more explicitly and test again.

How to confirm the fix

Two reviewers reading the criteria reach the same accept/reject verdict on the same output. (In LLM-eval practice, judge-versus-human agreement of roughly 75-90% is the bar that signals a rubric is reliable enough to trust; aim for the same with two humans.)
The model’s self-audit checklist shows all checks pass, with evidence cited for each.
Running the same prompt 3 times produces outputs that all pass the criteria.
A “deliberately bad” output (drafted by you) fails the rules — if it sneaks through, a rule is still too loose.
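
The first check can be made concrete by recording each reviewer's pass/fail verdicts per criterion and computing how often they match. This is a minimal sketch; the function names and the data layout (a dict from criterion name to a list of booleans, one per graded output) are assumptions, and the default threshold of 0.75 is simply the lower end of the range quoted above.

```python
def agreement_by_criterion(reviewer_a, reviewer_b):
    """Fraction of outputs on which both reviewers gave the same verdict,
    computed separately for each criterion."""
    rates = {}
    for criterion, verdicts_a in reviewer_a.items():
        verdicts_b = reviewer_b[criterion]
        if not verdicts_a or len(verdicts_a) != len(verdicts_b):
            raise ValueError(f"verdict lists for {criterion!r} must match and be non-empty")
        matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
        rates[criterion] = matches / len(verdicts_a)
    return rates


def loose_criteria(rates, threshold=0.75):
    """Criteria whose agreement falls below the threshold, sorted by name."""
    return sorted(name for name, rate in rates.items() if rate < threshold)


# Hypothetical verdicts on four outputs (True = pass).
reviewer_a = {
    "under 200 words": [True, True, False, True],
    "engaging rules": [True, False, True, True],
}
reviewer_b = {
    "under 200 words": [True, True, False, True],
    "engaging rules": [False, True, True, False],
}
```

With this made-up data the word-count rule gets full agreement and the “engaging” rule gets 25%, so `loose_criteria` returns only the latter. That is the criterion to tighten with a pass example and a fail example.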

If it still fails

The criteria may still be unmeasurable — try removing the worst one and see if quality improves.
Add more pass/fail examples for the criteria that are still loose.
Use few-shot: 3-5 examples of accepted outputs anchor the model far better than rules alone.
If the criteria genuinely conflict on the input you care about, the spec cannot be satisfied as written: change the input or change a criterion.

FAQ

How many criteria is too many? If a single prompt carries more than five or six, the model averages across them and none lands sharply. Rank them and let the bottom ones be “nice to have”. A short list of testable rules beats a long list of adjectives every time.

Should I score each criterion 1-5 or just pass/fail? Pass/fail for anything with a clear threshold (under 200 words, no emoji, factually correct). A 1-5 score only earns its keep when the quality genuinely has degrees — coherence, empathy, pedagogical clarity. Mixing arbitrary scale placement into a hard gate just adds variance.

Why does the model keep writing longer to seem “better”? Because “thorough” or “comprehensive” with no cap reads as “more is better”, and both human and LLM reviewers carry the same verbosity bias. Add an explicit length-neutrality line: equal correctness at fewer words scores the same as, or better than, more words.

My two reviewers still disagree after I added rules. Now what? The disagreement points at the exact rule that is still subjective. Pull that one criterion out, write one pass example and one fail example for it specifically, and have both reviewers re-grade. The example resolves what the rule could not.

Does this work the same across ChatGPT, Claude, and Gemini? Yes — the failure is in the spec, not the model. Operationalized criteria, ranking, and pass/fail examples improve output on GPT-5.5, Claude Opus 4.7 / Sonnet 4.6, and Gemini 3.1 Pro alike, because every model is interpreting the same vague adjective against its own average.

Prevention

Default rule: every criterion must be testable by a stranger in 10 seconds.
Maintain a library of operationalized criteria per task type (blog post, email, summary, code review).
For each recurring task, keep a “gold standard” output you can attach as a few-shot anchor.
Rank criteria explicitly in every prompt. “Equally important” is a tell that they were not actually thought through.
Audit accepted outputs monthly: are they passing your criteria or just your gut?

For deeper background on writing measurable rubrics that humans and AI judges both apply consistently, see Evidently AI’s guide to LLM-as-a-judge.

Tags: #Troubleshooting #Prompt #Prompt quality #Prompt engineering