You wrote a prompt with 5 few-shot examples to teach the model your output style. Two of the examples are excellent — sharp, specific, the exact register you want. Three were copied from older drafts and are mediocre — verbose, generic, slightly off-tone. You expect the model to learn from the good ones. It doesn’t. It produces output that averages all 5, leaning toward the mediocre majority. Sometimes worse — it mimics the mediocre examples’ bad habits while ignoring the good examples’ standout traits.
The model doesn’t grade your examples. It treats every example in the prompt as equally authoritative. Mixed-quality few-shot is one of the most common causes behind “I gave it examples and it still doesn’t get it”: the examples weren’t all good.
Common causes
1. Old examples never re-evaluated
You added examples 3 months ago when you were figuring out the task. You’ve since refined what good output looks like. The old examples no longer represent that bar.
How to spot it: Read each example. Would you ship it today as a great output? If no, it’s pulling the model down.
2. Length variance teaches inconsistency
Example 1 is 80 words. Example 2 is 200 words. Example 3 is 50 words. The model infers that length is variable and produces variable-length output, even when you want consistent length.
How to spot it: Word-count your examples. If max/min is more than 2x, length is inconsistent.
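The word-count check can be scripted. This is a minimal sketch, assuming each example output is a plain string and that whitespace-separated tokens are a good enough word count; the function names and the `max_ratio` parameter are illustrative, not part of any library.

```python
def length_ratio(example_outputs: list[str]) -> float:
    """Return the longest word count divided by the shortest."""
    counts = [len(text.split()) for text in example_outputs]
    if not counts or min(counts) == 0:
        raise ValueError("need at least one non-empty example output")
    return max(counts) / min(counts)


def has_inconsistent_length(example_outputs: list[str], max_ratio: float = 2.0) -> bool:
    """Flag the set when max/min word count exceeds the 2x rule of thumb."""
    return length_ratio(example_outputs) > max_ratio
```

With outputs of 80, 200, and 50 words the ratio is 4.0, so the set is flagged.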
3. Tone drift across examples
Example 1 is formal. Example 2 is casual. Example 3 has emoji. The model picks one (often whichever is most recent) or blends — neither matches what you want.
How to spot it: Read examples back-to-back. If you mentally code-switch between them, the model does too.
4. One example contains a subtle error
Among 5 examples, one has a typo, a factual mistake, or a formatting glitch. The model can learn to reproduce that category of error.
How to spot it: Audit each example as you would final output. Errors in the source poison the well.
5. Examples illustrate edge cases, not the common case
You picked tricky examples to “stress test” the prompt. Now the model thinks every input is an edge case and over-handles common inputs.
How to spot it: Are your examples 80% routine inputs or 80% weird inputs? They should reflect the actual distribution.
6. Output structure varies across examples
Example 1 uses bullet points. Example 2 uses a numbered list. Example 3 uses prose. The model switches between formats unpredictably.
How to spot it: Compare the layout of each example output. If they differ, pick one format.
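A rough way to automate the structure check is to classify each example output and see how many distinct formats appear. The sketch below is a crude heuristic under stated assumptions: it only recognizes bullet lists, numbered lists, and "everything else", and the function names are hypothetical.

```python
import re


def detect_structure(output: str) -> str:
    """Classify an example output as bullets, numbered, prose, or empty."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    if all(re.match(r"^[-*•]\s+", ln) for ln in lines):
        return "bullets"
    if all(re.match(r"^\d+[.)]\s+", ln) for ln in lines):
        return "numbered"
    # Fallback: plain prose, or a mixed layout this heuristic cannot name.
    return "prose"


def structures_used(example_outputs: list[str]) -> set[str]:
    """Return the set of formats found; more than one means the set is mixed."""
    return {detect_structure(o) for o in example_outputs}
```

If `structures_used` returns more than one label, the examples disagree on format.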
7. Examples are from a different task type
You’re now using the prompt for a new use case. The examples are from the old use case. The model carries over patterns that don’t apply.
How to spot it: Examples don’t match the current input distribution. They feel “off topic.”
Shortest path to fix
Step 1: Audit and re-grade every example
For each example in your prompt, score it 1-5 on:
- Matches current output bar
- Length consistent with target
- Tone consistent with target
- Free of errors
- Structure matches target format
Drop any example that scores below 4 on any axis.
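The audit can be kept as a small filter. This sketch assumes the rule is that an example survives only if it scores at least 4 on every axis, and that scores come from a human reviewer; the axis keys and the `GradedExample` class are illustrative names.

```python
from dataclasses import dataclass

# One key per audit axis listed above (names are hypothetical).
AXES = ("quality_bar", "length", "tone", "error_free", "structure")


@dataclass
class GradedExample:
    name: str
    scores: dict[str, int]  # axis -> 1..5, assigned by a reviewer


def keep_example(example: GradedExample, threshold: int = 4) -> bool:
    """Keep an example only if every axis meets the threshold."""
    return all(example.scores[axis] >= threshold for axis in AXES)


def split_by_audit(examples: list[GradedExample], threshold: int = 4):
    """Return (kept, dropped) lists, preserving the original order."""
    kept = [e for e in examples if keep_example(e, threshold)]
    dropped = [e for e in examples if not keep_example(e, threshold)]
    return kept, dropped
```

An example that scores 5 everywhere except a 3 on tone is dropped, which is the point: one weak axis is enough to pull output down.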
Step 2: Replace dropped examples with curated ones
Two stellar examples beat five uneven ones. Models pick up a pattern more reliably from a small, consistent set than from a large, varied one.
Input: [routine case]
Output: [exemplary output]
Input: [common variation]
Output: [exemplary output]
Input: [tricky case worth covering]
Output: [exemplary output]
3 examples is often enough.
Step 3: Normalize length
If your target output is ~100 words, every example output should be 80-120 words. Don’t include a 30-word example next to a 200-word one.
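To enforce the band, list the examples that fall outside it. This is a minimal sketch assuming an 80-120 word band for a ~100-word target and whitespace word counting; the function name and parameters are illustrative.

```python
def out_of_band(example_outputs: list[str], min_words: int = 80, max_words: int = 120):
    """Return (index, word_count) for each output outside the target band."""
    flagged = []
    for index, text in enumerate(example_outputs):
        count = len(text.split())
        if not min_words <= count <= max_words:
            flagged.append((index, count))
    return flagged
```

Rewrite or replace whatever it flags before the prompt ships.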
Step 4: Normalize structure
Pick one output format and use it across all examples. Bullets, numbered list, prose, JSON — whatever fits the task. Mixing teaches inconsistency.
Step 5: Order examples by similarity to expected input
Recency bias is real: the most recent example often shapes output most. Put the example whose input most resembles the live input last.
Examples 1-2: general case
Example 3 (placed last): example whose input is closest to live input
---
Live input: [user's real input]
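If the live input changes per request, the ordering can be done at prompt-build time. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; that choice is an assumption, and an embedding-based similarity may suit your inputs better. Examples are assumed to be `(input, output)` pairs.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a simple stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def order_for_prompt(examples: list[tuple[str, str]], live_input: str):
    """Sort ascending by similarity so the closest example lands last."""
    return sorted(examples, key=lambda ex: similarity(ex[0], live_input))
```

Because the sort is ascending, the last element is the example most similar to the live input, which is where recency works in your favor.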
Step 6: Add brief commentary explaining what each example demonstrates
Some teams add 1-line commentary above each example:
Example 1 (concise, formal):
Input: ...
Output: ...
Example 2 (handles missing field):
Input: ...
Output: ...
This helps the model interpret which dimension to learn from each example.
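Rendering the commentary from data keeps the labels consistent. This sketch assumes each example is a dict with hypothetical `note`, `input`, and `output` keys and reproduces the layout shown above.

```python
def render_examples(examples: list[dict[str, str]]) -> str:
    """Render examples as 'Example N (note):' blocks separated by blank lines."""
    blocks = []
    for number, ex in enumerate(examples, start=1):
        blocks.append(
            f"Example {number} ({ex['note']}):\n"
            f"Input: {ex['input']}\n"
            f"Output: {ex['output']}"
        )
    return "\n\n".join(blocks)
```

Keeping the note next to the example in your data also makes it harder for the commentary to go stale when an example is replaced.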
Step 7: A/B test example sets
Generate 20 outputs with set A (5 mixed examples) and 20 with set B (3 curated examples). Score blindly. The curated set usually wins.
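Blind scoring is easy to get wrong by hand, so it helps to script the shuffle and the tally. This is a sketch under stated assumptions: outputs are already generated, a human assigns one numeric score per shuffled item, and the function names are illustrative.

```python
import random


def blind_shuffle(outputs_a: list[str], outputs_b: list[str], seed: int = 0):
    """Mix both sets; show the grader only the text, never the label."""
    labeled = [("A", o) for o in outputs_a] + [("B", o) for o in outputs_b]
    random.Random(seed).shuffle(labeled)
    return labeled


def mean_score_by_set(labeled: list[tuple[str, str]], scores: list[float]) -> dict[str, float]:
    """Average the grader's scores per set; scores align with `labeled` order."""
    buckets: dict[str, list[float]] = {"A": [], "B": []}
    for (label, _text), score in zip(labeled, scores):
        buckets[label].append(score)
    return {label: sum(vals) / len(vals) for label, vals in buckets.items() if vals}
```

Reveal the labels only after every item has a score, then compare the two means.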
When this is not on you
Some tasks genuinely have high variance, such as open-ended creative writing, where examples that span styles are intentional. There, mixed examples are fine if the variety is deliberate. The bug is when you didn’t intend the variance.
Easy to misdiagnose as
“Model just isn’t good at this task.” Often the model is fine; the examples were noisy. Curate before concluding the model is bad.
Prevention
- Re-audit few-shot examples quarterly; drop or update stale ones.
- Aim for 3 high-quality examples over 5+ mixed ones.
- Normalize length, tone, and structure across examples.
- Add per-example commentary if examples cover different scenarios.
- A/B test new example sets before shipping; measure win rate.
- Treat the example pool as production code — version control, code review.
FAQ
- Should few-shot examples be from real outputs or synthesized? Quality is what matters, not provenance; synthesized examples can be just as effective if they hit the target.
- Does the order of examples matter? Yes — recency bias means the last example influences output most. Put your best/most-relevant example last.
Related
- Missing examples cause output drift
- Too many examples overwhelm the model
- Mixed-tone instructions
- Conflicting instructions weaken output
- AI output style drift
- Style vs format conflict
- Prompt copied from another task
- Latest sentence overrides earlier instructions
- Prompt lacks context hierarchy
- No success criteria specified
Tags: #Prompt engineering #Troubleshooting #llm-output #few-shot #examples #in-context-learning