Ambiguous Evaluation Criteria Cause Weak Answers

You gave criteria, but the criteria are themselves vague — "engaging", "professional", "innovative" — so the model interprets freely.

You wrote a careful prompt with criteria: “engaging, professional, and innovative”. The output is technically all three. It is also unusable: the “engaging” hook feels like a LinkedIn post, the “professional” tone reads as cold, and the “innovative” angle is the same angle three competitors used last month. The model is not failing — your criteria are not criteria. They are vibes. The model interpreted each adjective against its training-distribution average, and the average is exactly what you do not want.

This page walks through how to detect when your evaluation criteria are unmeasurable, and how to convert each one into a rule the model and a reviewer can both apply consistently.

Common causes

1. Criteria are adjectives, not rules

Words like “engaging”, “professional”, “innovative”, “natural”, “polished” cannot be checked. Two reviewers will disagree. The model has no anchor.

How to spot it: read your criteria aloud. If you cannot describe a 10-second test for each one, it is an adjective, not a rule.
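
A rough first pass at this check can be automated. The sketch below flags any criterion that contains a known taste word. The word list and the function name find_vague_criteria are illustrative assumptions, not a complete detector; a criterion can pass this scan and still be untestable.

```python
import re

# Hypothetical starter list of taste words; extend it for your own prompts.
VAGUE_WORDS = {"engaging", "professional", "innovative", "natural", "polished"}

def find_vague_criteria(criteria):
    """Return the criteria that contain at least one known taste word."""
    flagged = []
    for criterion in criteria:
        words = set(re.findall(r"[a-z]+", criterion.lower()))
        if words & VAGUE_WORDS:
            flagged.append(criterion)
    return flagged

criteria = ["Engaging and professional", "Under 200 words", "Feels polished"]
print(find_vague_criteria(criteria))
# prints ['Engaging and professional', 'Feels polished']
```

Anything the scan flags still needs the 10-second test written by hand; the scan only tells you where to look.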

2. No pass / fail examples

If you say “professional” without showing a “yes this is professional” and a “no this is not”, the model defaults to its own definition.

How to spot it: your prompt has zero examples of acceptable output and zero rejected examples.

3. Criteria silently conflict

“Innovative but on-brand”, “engaging but professional”, “comprehensive but concise” — each pair has a tension you have not resolved. The model picks one side, often the wrong one.

How to spot it: look for pairs of criteria where pushing harder on one necessarily weakens the other.

4. Criteria assume shared taste

You wrote “make it feel like our brand” without defining “our brand”. The model has never read your style guide.

How to spot it: criteria reference things only your team knows.

5. No ranking

“All criteria matter equally” is rarely true. When you do not rank, the model averages, and averages are mediocre.

How to spot it: when two criteria clash, the model sacrifices the one you cared about more.

Before you change anything

  • List every criterion in your current prompt.
  • For each, draft a 10-second test a stranger could perform.
  • Find or write one “pass” and one “fail” example for the hardest two.
  • Identify which criteria conflict with which.
  • Decide the rank order before re-prompting.

Information to collect

  • Current prompt with all criteria.
  • An output you accept and an output you reject, both labeled.
  • The reasons each was accepted or rejected (so you can extract rules).
  • Model, temperature, system prompt.
  • Whether reviewers actually agree on each criterion (often they do not).

Shortest path to fix

Step 1: Operationalize each adjective

Convert every taste word into a testable rule:

Adjective      | Testable rule
“Engaging”     | Opens with a question, a statistic, or a concrete scene. Not “In today’s world”.
“Professional” | No exclamation marks. No emoji. No first-person plural (“we”). No contractions.
“Innovative”   | Mentions at least one specific named technique, tool, or pattern not in the top-5 industry list.
“Concise”      | Under 200 words. Each sentence under 20 words.
“Natural”      | Sentence length variance: at least one under 10 words, at least one over 20.

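Once a rule is written this way, a few lines of code can apply it. The following is a minimal sketch of the “Professional”, “Concise”, and “Natural” rules, assuming a naive sentence splitter and a rough contraction heuristic; the function names are illustrative, and the emoji check is left out because it needs more than the standard library offers cleanly.

```python
import re

def split_sentences(text):
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_count(text):
    return len(text.split())

def is_professional(text):
    no_exclamation = "!" not in text
    no_we = re.search(r"\bwe\b", text, re.IGNORECASE) is None
    # Heuristic: letter + apostrophe + letter. This also matches possessives.
    no_contraction = re.search(r"[A-Za-z]['’][A-Za-z]", text) is None
    return no_exclamation and no_we and no_contraction

def is_concise(text):
    sentences = split_sentences(text)
    return word_count(text) < 200 and all(word_count(s) < 20 for s in sentences)

def is_natural(text):
    lengths = [word_count(s) for s in split_sentences(text)]
    return any(n < 10 for n in lengths) and any(n > 20 for n in lengths)

sample = "Most rollouts stall in month two. The cause is rarely the model."
print(is_professional(sample), is_concise(sample), is_natural(sample))
# prints True True False
```

Writing the rules as code also exposes a conflict: taken literally, “Concise” requires every sentence to be under 20 words while “Natural” requires at least one over 20, so no text can pass both. If you use both rows in one prompt, adjust one threshold or rank them.
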
Step 2: Provide one pass and one fail example

For the trickiest criterion, include:

Example of acceptable "engaging" opening:
  "73% of teams have already abandoned their first AI rollout. Here is what
   the survivors did differently."

Example of unacceptable "engaging" opening:
  "In today's rapidly evolving landscape of artificial intelligence,
   organizations are discovering new opportunities."

The acceptable one uses a specific number and a concrete frame. The
unacceptable one uses generic language and corporate buzzwords.

Examples beat 100 words of adjectives.
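
The same pass and fail pair can double as a regression test for the rule. Below is a minimal sketch of the “engaging” opening rule, assuming that a digit in the first sentence counts as a statistic. The “concrete scene” option is not checked because it cannot be tested mechanically, and the names are illustrative.

```python
import re

BANNED_OPENERS = ("in today's", "in today’s")

def opening_sentence(text):
    # Naive: everything up to the first ., ! or ? (a decimal point would fool it).
    match = re.search(r"[.!?]", text)
    return text[: match.end()] if match else text

def is_engaging_opening(text):
    first = opening_sentence(text.strip())
    if first.lower().startswith(BANNED_OPENERS):
        return False
    has_question = first.endswith("?")
    has_statistic = re.search(r"\d", first) is not None
    return has_question or has_statistic

PASS_EXAMPLE = ("73% of teams have already abandoned their first AI rollout. "
                "Here is what the survivors did differently.")
FAIL_EXAMPLE = ("In today's rapidly evolving landscape of artificial intelligence, "
                "organizations are discovering new opportunities.")

print(is_engaging_opening(PASS_EXAMPLE), is_engaging_opening(FAIL_EXAMPLE))
# prints True False
```

If the rule ever accepts the fail example or rejects the pass example, the rule is wrong, not the examples.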

Step 3: Rank the criteria

Declare order of priority and what to sacrifice when they clash:

Priority order (drop from bottom up if you cannot satisfy all):
1. Factually correct (never violate)
2. Under 200 words
3. Operationalized "engaging" rules above
4. Operationalized "professional" rules above
5. Brand voice anchor

If "engaging" and "professional" conflict, prefer "engaging", because it ranks higher in the list above.
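
Keeping the ranking in one ordered list makes it harder for the prompt text and the tie-break rule to drift apart. The sketch below assumes the priority list above is stored as a Python list; the short criterion labels and function names are illustrative.

```python
# Ordered from highest to lowest priority, mirroring the list above.
PRIORITY = [
    "factually correct",
    "under 200 words",
    "engaging",
    "professional",
    "brand voice",
]

def first_to_drop(conflicting, priority=PRIORITY):
    """Of the criteria in conflict, return the lowest-ranked one."""
    return max(conflicting, key=priority.index)

def render_priority_block(priority=PRIORITY):
    """Build the priority section of the prompt from the same list."""
    lines = ["Priority order (drop from bottom up if you cannot satisfy all):"]
    lines += [f"{rank}. {name}" for rank, name in enumerate(priority, start=1)]
    return "\n".join(lines)

print(first_to_drop(["under 200 words", "brand voice"]))
# prints brand voice
print(render_priority_block())
```

Because both the prompt text and the tie-break come from PRIORITY, re-ranking means editing one list.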

Step 4: Have the model self-audit

Append:

After writing, output a checklist:
- Criterion 1 (factually correct): yes/no + evidence
- Criterion 2 (under 200 words): yes/no + word count
- Criterion 3 (engaging rules): yes/no + which rule was satisfied
- Criterion 4 (professional rules): yes/no + any violations
If any are no, rewrite and re-check.
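
If the model follows that format, its checklist can be parsed to decide whether another rewrite is needed. This is a minimal sketch that assumes each line looks like "- Criterion N (name): yes/no ..."; real model output may deviate, so unparsed lines are simply skipped. The sample text is invented for illustration.

```python
import re

LINE = re.compile(r"^- Criterion (\d+) \((.+?)\): (yes|no)\b", re.IGNORECASE)

def parse_audit(checklist):
    """Map each criterion name to True (yes) or False (no)."""
    results = {}
    for line in checklist.splitlines():
        match = LINE.match(line.strip())
        if match:
            results[match.group(2)] = match.group(3).lower() == "yes"
    return results

def failed_criteria(checklist):
    return [name for name, ok in parse_audit(checklist).items() if not ok]

sample = """\
- Criterion 1 (factually correct): yes, figures match the brief
- Criterion 2 (under 200 words): no, 214 words
- Criterion 3 (engaging rules): yes, opens with a statistic
- Criterion 4 (professional rules): yes, no violations"""

print(failed_criteria(sample))
# prints ['under 200 words']
```

The checklist is the model's own claim. For mechanical rules such as word count, recomputing the value in code is more reliable than trusting the reported number.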

Step 5: Lock subjective ones with anchors

For criteria that are inherently subjective (“our brand voice”), give 2-3 sentences of canonical brand copy as a voice anchor. The model imitates anchors more reliably than it follows adjectives.

Step 6: Test with an edge-case input

Feed an input where two criteria conflict obviously. If the output still tries to satisfy both, your priority ranking did not land. Re-rank explicitly.

How to confirm the fix

  • Two reviewers reading the criteria reach the same accept/reject verdict on the same output.
  • The model’s self-audit checklist shows all checks pass.
  • Running the same prompt 3 times produces outputs that all pass the criteria.
  • A “deliberately bad” output (drafted by you) fails the rules.
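
The first, third, and fourth checks can be scripted. The sketch below assumes reviewer verdicts are recorded as lists of booleans (True for accept) and that each rule is a function from text to bool; the two lambda rules and the sample outputs are illustrative.

```python
def agreement_rate(verdicts_a, verdicts_b):
    """Fraction of outputs on which two reviewers gave the same verdict."""
    if len(verdicts_a) != len(verdicts_b) or not verdicts_a:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)

def all_runs_pass(outputs, checks):
    """True only if every output passes every check."""
    return all(check(output) for output in outputs for check in checks)

# Illustrative rules: under 200 words, no exclamation marks.
checks = [lambda t: len(t.split()) < 200, lambda t: "!" not in t]

reviewer_a = [True, True, False, True]
reviewer_b = [True, False, False, True]
print(agreement_rate(reviewer_a, reviewer_b))
# prints 0.75

runs = ["Run one text.", "Run two text.", "Run three text."]
print(all_runs_pass(runs, checks))
# prints True
print(all_runs_pass(["Amazing news!!!"], checks))
# prints False
```

An agreement rate below 1.0 on a criterion points back to that criterion's wording, not to the reviewers.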

If it still fails

  1. The criteria may still be unmeasurable — try removing the worst one and see if quality improves.
  2. Add more pass/fail examples for the criteria that are still loose.
  3. Use few-shot prompting: 3-5 accepted outputs often anchor the model better than rules alone.
  4. If criteria genuinely conflict for the input you care about, the spec is impossible to satisfy: change the input or change a criterion.

Prevention

  • Default rule: every criterion must be testable by a stranger in 10 seconds.
  • Maintain a library of operationalized criteria per task type (blog post, email, summary, code review).
  • For each recurring task, have a “gold standard” output you can attach as a few-shot anchor.
  • Rank criteria explicitly in every prompt. “Equally important” is a tell that the criteria have not been thought through.
  • Audit accepted outputs monthly: are they passing your criteria or just your gut?
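
A criteria library can be as simple as a dictionary keyed by task type. The sketch below is one possible shape; the task types come from the list above, but which rules belong to which task is a placeholder assumption that you would replace with your own.

```python
# Hypothetical library: task type -> testable rules, highest priority first.
CRITERIA_LIBRARY = {
    "blog post": [
        "Opens with a question, a statistic, or a concrete scene.",
        "Under 200 words. Each sentence under 20 words.",
    ],
    "email": [
        "No exclamation marks. No emoji.",
        "Under 200 words.",
    ],
}

def criteria_block(task_type, library=CRITERIA_LIBRARY):
    """Render the rules for one task type as a numbered prompt section."""
    rules = library[task_type]  # raises KeyError for an unknown task type
    lines = [f"Criteria for {task_type} (in priority order):"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    return "\n".join(lines)

print(criteria_block("email"))
```

Raising on an unknown task type is deliberate: a missing entry is a prompt about to be written with no testable criteria.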
