Few-Shot Examples Have Uneven Quality and Drag Output Down

You provided 5 few-shot examples. Two are great, three are mediocre. The model averages toward the mediocre ones. Here is why variance in example quality hurts, and how to curate.

You wrote a prompt with 5 few-shot examples to teach the model your output style. Two of the examples are excellent — sharp, specific, the exact register you want. Three were copied from older drafts and are mediocre — verbose, generic, slightly off-tone. You expect the model to learn from the good ones. It doesn’t. It produces output that averages all 5, leaning toward the mediocre majority. Sometimes worse — it mimics the mediocre examples’ bad habits while ignoring the good examples’ standout traits.

The model doesn’t grade your examples. It treats every example in the prompt as roughly equally authoritative. Mixed-quality few-shot is one of the most common reasons behind “I gave it examples and it still doesn’t get it”: the examples weren’t all good.

Common causes

1. Old examples never re-evaluated

You added examples 3 months ago when you were figuring out the task. You’ve since refined what good output looks like. The old examples no longer represent that bar.

How to spot it: Read each example. Would you ship it today as a great output? If no, it’s pulling the model down.

2. Length variance teaches inconsistency

Example 1 is 80 words. Example 2 is 200 words. Example 3 is 50 words. The model infers that length is variable and produces variable-length output, even when you want consistent length.

How to spot it: Word-count your examples. If max/min is more than 2x, length is inconsistent.
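The word-count check can be scripted. The sketch below is a minimal illustration, not a prescribed tool: the function name `length_spread` is hypothetical, words are counted by splitting on whitespace, and the sample outputs are placeholders sized to the 80/200/50-word case above.

```python
def length_spread(example_outputs):
    """Return (min_words, max_words, max/min ratio) for a list of output strings."""
    # Whitespace splitting is a rough word count; it is an assumption, not a standard.
    counts = [len(text.split()) for text in example_outputs]
    shortest, longest = min(counts), max(counts)
    ratio = longest / shortest if shortest else float("inf")
    return shortest, longest, ratio


# Placeholder outputs with 80, 200, and 50 words, as in the scenario above.
outputs = [" ".join(["word"] * n) for n in (80, 200, 50)]
shortest, longest, ratio = length_spread(outputs)
print(f"min={shortest} max={longest} ratio={ratio:.1f}x")
if ratio > 2:
    print("Length is inconsistent: max/min exceeds 2x")
```

Run it on the example outputs only, not the inputs, since the outputs are what the model imitates.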

3. Tone drift across examples

Example 1 is formal. Example 2 is casual. Example 3 has emoji. The model picks one (often whichever is most recent) or blends — neither matches what you want.

How to spot it: Read examples back-to-back. If you mentally code-switch between them, the model does too.

4. One example contains a subtle error

Among 5 examples, one has a typo, a factual mistake, or a formatting glitch. The model can learn to reproduce that category of error.

How to spot it: Audit each example as you would final output. Errors in the source poison the well.

5. Examples illustrate edge cases, not the common case

You picked tricky examples to “stress test” the prompt. Now the model thinks every input is an edge case and over-handles common inputs.

How to spot it: Are your examples 80% routine inputs or 80% weird inputs? They should reflect the actual distribution.

6. Output structure varies across examples

Example 1 uses bullet points. Example 2 uses a numbered list. Example 3 uses prose. The model alternates between formats unpredictably.

How to spot it: Compare the output structure of each example. If it differs between examples, pick one.
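For a larger example pool, a rough classifier can flag mixed structure. This is a heuristic sketch under stated assumptions: `detect_structure` is a hypothetical helper, it only distinguishes bullets, numbered lists, and prose, and the regexes will misjudge unusual formats such as JSON or tables.

```python
import re


def detect_structure(output):
    """Roughly classify an example output as 'bullets', 'numbered', or 'prose'."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return "empty"
    bullets = sum(bool(re.match(r"[-*•]\s+", line)) for line in lines)
    numbered = sum(bool(re.match(r"\d+[.)]\s+", line)) for line in lines)
    # Call it a list only if most non-empty lines look like list items.
    if bullets > len(lines) / 2:
        return "bullets"
    if numbered > len(lines) / 2:
        return "numbered"
    return "prose"


# Placeholder example outputs, one per format.
examples = [
    "- fast\n- cheap\n- reliable",
    "1. Install\n2. Configure\n3. Run",
    "The tool is fast, cheap, and reliable.",
]
formats = [detect_structure(text) for text in examples]
print(formats)
if len(set(formats)) > 1:
    print("Mixed output structures: pick one format")
```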

7. Examples are from a different task type

You’re now using the prompt for a new use case. The examples are from the old use case. The model carries over patterns that don’t apply.

How to spot it: Examples don’t match the current input distribution. They feel “off topic.”

Shortest path to fix

Step 1: Audit and re-grade every example

For each example in your prompt, score it 1-5 on:

  • Matches current output bar
  • Length consistent with target
  • Tone consistent with target
  • Free of errors
  • Structure matches target format

Drop any example that scores below 4 on any axis.
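The audit can be kept as a small table and filtered mechanically. The sketch below reads the rule as "keep an example only if every axis scores 4 or higher". The axis keys, the `keep_example` helper, and the scores themselves are hypothetical; the scores still come from a human reading each example.

```python
# Short keys for the five axes listed above (names are illustrative).
AXES = ("bar", "length", "tone", "errors", "structure")
MIN_SCORE = 4


def keep_example(scores):
    """Keep an example only if every axis scores at least MIN_SCORE."""
    return all(scores[axis] >= MIN_SCORE for axis in AXES)


# Hypothetical hand-assigned scores for five examples.
audit = {
    "ex1": {"bar": 5, "length": 5, "tone": 5, "errors": 5, "structure": 4},
    "ex2": {"bar": 5, "length": 4, "tone": 5, "errors": 5, "structure": 5},
    "ex3": {"bar": 3, "length": 4, "tone": 4, "errors": 5, "structure": 4},
    "ex4": {"bar": 4, "length": 2, "tone": 3, "errors": 5, "structure": 4},
    "ex5": {"bar": 4, "length": 4, "tone": 4, "errors": 3, "structure": 5},
}
kept = [name for name, scores in audit.items() if keep_example(scores)]
dropped = [name for name in audit if name not in kept]
print("keep:", kept)
print("drop:", dropped)
```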

Step 2: Replace dropped examples with curated ones

Better to have 2 stellar examples than 5 uneven ones. Models tend to pick up a pattern more reliably from a small, consistent set than from a large, varied one.

Input: [routine case]
Output: [exemplary output]

Input: [common variation]
Output: [exemplary output]

Input: [tricky case worth covering]
Output: [exemplary output]

3 examples is often enough.

Step 3: Normalize length

If your target output is ~100 words, every example output should be 80-120 words. Don’t include a 30-word example next to a 200-word one.

Step 4: Normalize structure

Pick one output format and use it across all examples. Bullets, numbered list, prose, JSON — whatever fits the task. Mixing teaches inconsistency.

Step 5: Order examples by similarity to expected input

Recency bias is real: the most recent example tends to shape output most. Put the example whose input most resembles the live input last.

Examples 1-2: general case
Example 3 (placed last): example whose input is closest to live input
---
Live input: [user's real input]
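If the live input changes per request, the ordering can be done at prompt-build time. The sketch below is one possible approach, not the only one: it uses token-set overlap (Jaccard similarity) as a stand-in for "closest input", which is an assumption; an embedding-based similarity could be swapped in. The function names and the sample inputs and outputs are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def order_for_prompt(examples, live_input):
    """Sort (input, output) pairs so the most similar input comes last."""
    return sorted(examples, key=lambda pair: jaccard(pair[0], live_input))


def build_prompt(examples, live_input):
    """Render ordered examples, then the live input after a separator."""
    ordered = order_for_prompt(examples, live_input)
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in ordered]
    return "\n\n".join(blocks) + f"\n---\nLive input: {live_input}"


# Placeholder examples; outputs are dummy strings.
examples = [
    ("refund request for damaged item", "OUTPUT A"),
    ("question about shipping times", "OUTPUT B"),
    ("password reset not working", "OUTPUT C"),
]
live = "refund request for late delivery"
print(build_prompt(examples, live))
```

Here the refund example shares the most tokens with the live input, so it lands directly above the separator.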

Step 6: Add brief commentary explaining what each example demonstrates

Some teams add 1-line commentary above each example:

Example 1 (concise, formal):
Input: ...
Output: ...

Example 2 (handles missing field):
Input: ...
Output: ...

This helps the model interpret which dimension to learn from each example.
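Keeping the commentary next to each example in data, rather than hand-typed into the prompt, makes it harder for labels and examples to drift apart. A minimal sketch, assuming examples are stored as (label, input, output) triples; `render_examples` and the sample content are hypothetical.

```python
def render_examples(examples):
    """Render (label, input, output) triples with a one-line label above each."""
    blocks = []
    for number, (label, example_input, example_output) in enumerate(examples, start=1):
        blocks.append(
            f"Example {number} ({label}):\n"
            f"Input: {example_input}\n"
            f"Output: {example_output}"
        )
    return "\n\n".join(blocks)


# Placeholder triples; real inputs and outputs would go here.
labeled = [
    ("concise, formal", "INPUT 1", "OUTPUT 1"),
    ("handles missing field", "INPUT 2", "OUTPUT 2"),
]
print(render_examples(labeled))
```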

Step 7: A/B test example sets

Generate 20 outputs with set A (5 mixed examples) and 20 with set B (3 curated examples). Score blindly. The curated set usually wins.
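Blind scoring is easier to keep honest when the shuffling and label bookkeeping are automated. The sketch below is illustrative only: `blind_ab` is a hypothetical helper, and `toy_score` is a stand-in scorer that rewards outputs near 100 words so the code runs end to end. In practice the scorer would be a human grader or your own rubric, and each set would hold 20 outputs rather than 4.

```python
import random
from statistics import mean


def blind_ab(outputs_a, outputs_b, score_fn, seed=0):
    """Shuffle both sets together, score without labels, return mean score per set."""
    pool = [("A", text) for text in outputs_a] + [("B", text) for text in outputs_b]
    random.Random(seed).shuffle(pool)
    scores = {"A": [], "B": []}
    for label, text in pool:
        # score_fn sees only the text, never which set it came from.
        scores[label].append(score_fn(text))
    return {label: mean(values) for label, values in scores.items()}


def toy_score(text):
    """Stand-in scorer (assumption): 5 near 100 words, lower as length drifts."""
    return max(0, 5 - abs(len(text.split()) - 100) // 20)


# Placeholder outputs: set A varies in length, set B stays near 100 words.
set_a = [" ".join(["w"] * n) for n in (50, 200, 80, 150)]
set_b = [" ".join(["w"] * n) for n in (95, 100, 110, 105)]
result = blind_ab(set_a, set_b, toy_score)
print(result)
```

The numbers from this toy data say nothing about real prompts; the point is the shape of the procedure: pool, shuffle, score blind, then compare means.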

When this is not on you

Some tasks genuinely have high variance, such as open-ended creative writing, where examples that span styles are intentional. There, a mixed set is okay if it’s deliberate. The bug is when you didn’t intend the variance.

Easy to misdiagnose as

“Model just isn’t good at this task.” Often the model is fine; the examples were noisy. Curate before concluding the model is bad.

Prevention

  • Re-audit few-shot examples quarterly; drop or update stale ones.
  • Aim for 3 high-quality examples over 5+ mixed ones.
  • Normalize length, tone, and structure across examples.
  • Add per-example commentary if examples cover different scenarios.
  • A/B test new example sets before shipping; measure win rate.
  • Treat the example pool as production code — version control, code review.

FAQ

  • Should few-shot examples be from real outputs or synthesized? Quality is what matters, not origin; synthesized examples can be just as effective if they hit the target.
  • Does the order of examples matter? Yes — recency bias means the last example influences output most. Put your best/most-relevant example last.

Tags: #Prompt engineering #Troubleshooting #llm-output #few-shot #examples #in-context-learning