Few-Shot Examples Have Uneven Quality and Drag Output Down

You gave the model 5 examples; 2 are great, 3 are mediocre, and it averages toward the mediocre ones. Why example variance hurts and how to curate down to 3-5 consistent ones.

Published: May 24, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You wrote a prompt with 5 few-shot examples to teach the model your output style. Two are excellent: sharp, specific, the exact register you want. Three were copied from older drafts and are mediocre: verbose, generic, slightly off-tone. You expect the model to learn from the good ones. It doesn’t. It produces output that averages all 5, leaning toward the mediocre majority. Sometimes worse: it mimics the mediocre examples’ bad habits while ignoring the good examples’ standout traits.

Fastest fix: delete the mediocre examples instead of trying to outweigh them with instructions. Cut your set to the 2-3 examples you would happily ship today, make their length, tone, and structure consistent, and put the example whose input most resembles the live request last. A model learns faster from 3 consistent examples than from 5 uneven ones, and guidance from both Anthropic and OpenAI converges on roughly 3-5 well-chosen examples as the sweet spot (as of June 2026).

The reason this happens: the model doesn’t grade your examples. In-context learning treats every demonstration in the prompt as equally authoritative, then leans on the statistical pattern across all of them. Mixed-quality few-shot sets are one of the most common reasons behind “I gave it examples and it still doesn’t get it”: the examples weren’t all good. Two known biases amplify the damage: recency bias (the example nearest the live input pulls hardest) and majority-label bias (whatever pattern shows up most often wins). Research on in-context example ordering finds accuracy can swing by as much as ~30% purely from which examples you include and the order you place them in.

Common causes

1. Old examples never re-evaluated

You added examples months ago while figuring out the task. You’ve since refined what good output looks like. The old examples no longer represent that bar.

How to spot it: Read each example. Would you ship it today as a great output? If no, it’s pulling the model down.

2. Length variance teaches inconsistency

Example 1 is 80 words. Example 2 is 200 words. Example 3 is 50 words. The model infers length is variable and produces variable-length output, even when you want consistent length.

How to spot it: Word-count your example outputs. If the longest is more than ~2x the shortest, length is inconsistent.
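The word-count check above can be scripted. This is a minimal sketch: the function name `length_spread` and the sample outputs are hypothetical, and the 2x threshold is the rule of thumb from the text rather than a hard limit.

```python
def length_spread(outputs):
    """Return (shortest, longest, ratio) measured in whitespace-separated words."""
    counts = [len(text.split()) for text in outputs]
    shortest, longest = min(counts), max(counts)
    # Guard against an empty output so the ratio never divides by zero.
    return shortest, longest, longest / max(shortest, 1)


# Hypothetical example outputs (80, 200, and 50 words); substitute your own.
example_outputs = ["word " * 80, "word " * 200, "word " * 50]

shortest, longest, ratio = length_spread(example_outputs)
if ratio > 2:
    print(f"Length is inconsistent: {shortest} to {longest} words ({ratio:.1f}x)")
```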

3. Tone drift across examples

Example 1 is formal. Example 2 is casual. Example 3 has emoji. The model picks one (often whichever is last) or blends them, and neither matches what you want.

How to spot it: Read examples back-to-back. If you mentally code-switch between them, the model does too.

4. One example contains a subtle error

Among 5 examples, one has a typo, a factual mistake, or a formatting glitch. The model learns to reproduce that error category.

How to spot it: Audit each example as you would a final output. Errors in the source poison the well.

5. Examples illustrate edge cases, not the common case

You picked tricky examples to stress-test the prompt. Now the model thinks every input is an edge case and over-handles routine inputs. This is majority-label bias working against you: if 4 of 5 examples are weird, “weird handling” becomes the default.

How to spot it: Are your examples roughly 80% routine inputs or 80% odd inputs? They should mirror the actual input distribution.
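A rough way to compare your example mix against real traffic, assuming you tag each example by hand as routine or edge. The `kind` field, the function name, and the 80% traffic figure are illustrative assumptions, not something a library provides.

```python
from collections import Counter


def kind_shares(examples):
    """Return the share of examples per hand-assigned 'kind' tag."""
    counts = Counter(ex["kind"] for ex in examples)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()}


# Hypothetical tags: label each example "routine" or "edge" yourself.
examples = [
    {"input": "...", "kind": "edge"},
    {"input": "...", "kind": "edge"},
    {"input": "...", "kind": "routine"},
    {"input": "...", "kind": "edge"},
    {"input": "...", "kind": "edge"},
]

shares = kind_shares(examples)
# Assumed for illustration: about 80% of real inputs are routine.
expected_routine_share = 0.8
if shares.get("routine", 0.0) < expected_routine_share / 2:
    print("Examples skew toward edge cases:", shares)
```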

6. Output structure varies across examples

Example 1 uses bullet points. Example 2 uses a numbered list. Example 3 uses prose. The model alternates between formats at random.

How to spot it: Compare the output format of each example side by side. If they differ, pick one and convert the rest.

7. Examples are from a different task type

You’re now using the prompt for a new use case, but the examples are from the old one. The model carries over patterns that don’t apply.

How to spot it: Examples don’t match the current input distribution. They feel off-topic.

Which bucket are you in

Symptom you observe	Most likely cause	Go to
Output is fine but generic, never as sharp as your best example	Mediocre examples averaging down the good ones	Steps 1-2
Output length is all over the place	Length variance across examples	Step 3
Tone flips between formal and casual run to run	Tone drift across examples	Step 1, Step 6
Output format alternates (bullets one time, prose the next)	Structure varies across examples	Step 4
Model over-explains or over-handles simple inputs	Examples skew to edge cases	Cause 5, Step 1
One specific mistake keeps reappearing	A poisoned example	Cause 4, Step 1
The closest-matching example doesn’t seem to “win”	Recency/order, best example not placed last	Step 5

Shortest path to fix

Step 1: Audit and re-grade every example

For each example in your prompt, score it 1-5 on five axes:

Matches the current output bar
Length consistent with the target
Tone consistent with the target
Free of errors
Structure matches the target format

Drop anything that scores below 4 on any axis. Don’t try to “balance out” a weak example by adding a counter-example; that just adds noise.
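The drop rule can be written down so it is applied the same way every time. A minimal sketch, assuming you record the five scores by hand; the axis keys, example names, and scores below are made up for illustration. Note that it checks the minimum, not the average, so one weak axis is enough to drop an example.

```python
# Short keys for the five axes listed above (names are illustrative).
AXES = ("bar", "length", "tone", "errors", "structure")


def keep_example(scores, minimum=4):
    """True only if every axis scores at or above the minimum."""
    return all(scores[axis] >= minimum for axis in AXES)


# Hypothetical audit results on the 1-5 scale.
audit = {
    "example_1": {"bar": 5, "length": 5, "tone": 4, "errors": 5, "structure": 5},
    "example_2": {"bar": 3, "length": 4, "tone": 4, "errors": 5, "structure": 4},
    "example_3": {"bar": 4, "length": 4, "tone": 5, "errors": 4, "structure": 4},
    "example_4": {"bar": 5, "length": 5, "tone": 5, "errors": 2, "structure": 5},
}

kept = [name for name, scores in audit.items() if keep_example(scores)]
print(kept)
```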

Step 2: Replace dropped examples with curated ones

Two stellar examples beat five uneven ones. Models learn faster from a small, consistent set than from a large, varied one, and example count hits diminishing returns fast: the big gains come from the first two or three demonstrations, after which extra examples mostly add token cost without proportional accuracy. Aim for 3-5 high-quality examples.

Input: [routine case]
Output: [exemplary output]

Input: [common variation]
Output: [exemplary output]

Input: [tricky case worth covering]
Output: [exemplary output]

Three examples is often enough.

Step 3: Normalize length

If your target output is ~100 words, every example output should be roughly 80-120 words. Don’t include a 30-word example next to a 200-word one.
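To find which examples fall outside the band, something like the following could work. It is a sketch under assumptions: `outside_band` is a hypothetical helper, and the default 20% tolerance simply mirrors the 80-120 range for a 100-word target.

```python
def outside_band(outputs, target_words, tolerance=0.2):
    """Return (index, word_count) for each output outside target +/- tolerance."""
    low = target_words * (1 - tolerance)
    high = target_words * (1 + tolerance)
    flagged = []
    for index, text in enumerate(outputs):
        count = len(text.split())
        if not low <= count <= high:
            flagged.append((index, count))
    return flagged


# Hypothetical example outputs of 95, 30, and 200 words.
example_outputs = ["word " * 95, "word " * 30, "word " * 200]

for index, count in outside_band(example_outputs, target_words=100):
    print(f"Example {index + 1} has {count} words; rewrite or drop it")
```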

Step 4: Normalize structure

Pick one output format and use it across all examples: bullets, numbered list, prose, or JSON, whatever fits the task. Mixing formats teaches inconsistency.
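A quick heuristic can flag mixed formats before you read every example by eye. This sketch guesses the format from simple line patterns; the function name and the patterns are assumptions, and real outputs may need looser rules.

```python
import json
import re


def output_format(text):
    """Rough guess at an output's format: json, bullets, numbered, or prose."""
    stripped = text.strip()
    # Only treat text as JSON if it looks like an object or array and parses.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    lines = [line.strip() for line in stripped.splitlines() if line.strip()]
    if lines and all(re.match(r"[-*•]\s", line) for line in lines):
        return "bullets"
    if lines and all(re.match(r"\d+[.)]\s", line) for line in lines):
        return "numbered"
    return "prose"


# Hypothetical example outputs in three different formats.
example_outputs = [
    "- Fast setup\n- Low cost",
    "1. Install\n2. Run",
    "Setup is fast and cheap.",
]

formats = [output_format(text) for text in example_outputs]
if len(set(formats)) > 1:
    print("Mixed formats across examples:", formats)
```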

Step 5: Order examples by similarity to the expected input

Recency bias is real and documented: the example placed last (nearest the live input) shapes the output most. Put the example whose input shape most resembles the live input last, and avoid ending on an outlier.

Examples 1-2: general case
Example 3 (placed last): input closest to the live input
---
Live input: [user's real input]

If your examples carry labels or categories (e.g. positive/negative, accept/reject), keep them roughly balanced rather than clustering one label at the end. Balanced counts guard against majority-label bias, and a mixed order keeps recency bias from favoring whichever label comes last.
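If you assemble the prompt per request, the ordering can be automated. The sketch below uses character-level similarity from Python's standard `difflib` as a stand-in for whatever similarity measure suits your task (embeddings are a common alternative); `order_for_prompt` and the sample inputs are hypothetical.

```python
from difflib import SequenceMatcher


def order_for_prompt(examples, live_input):
    """Sort examples so the one most similar to the live input comes last."""
    def similarity(example):
        matcher = SequenceMatcher(None, example["input"].lower(), live_input.lower())
        return matcher.ratio()

    # sorted() is ascending, so the highest-similarity example ends up last.
    return sorted(examples, key=similarity)


# Hypothetical example pool and live request.
examples = [
    {"input": "Refund request for a damaged laptop", "output": "..."},
    {"input": "Summarize this quarterly sales report", "output": "..."},
    {"input": "Translate the menu into French", "output": "..."},
]
live_input = "Refund request for a damaged phone"

ordered = order_for_prompt(examples, live_input)
print(ordered[-1]["input"])
```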

Step 6: Wrap and label each example

On the API, give each example clear boundaries so the model reads them as demonstrations, not instructions. Anthropic’s prompt-engineering guidance recommends wrapping each example in <example> tags (and the whole block in <examples> tags). A one-line note at the top of each example tells the model what dimension to learn from it.

<examples>
<example>
Note: concise, formal
Input: ...
Output: ...
</example>
<example>
Note: handles a missing field
Input: ...
Output: ...
</example>
</examples>
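If the examples live in code or a data file, a small helper can render the block above consistently. A minimal sketch: `render_examples` and the dictionary keys are assumptions, and it does no escaping, so example text that itself contains an </example> tag would break the boundaries.

```python
def render_examples(examples):
    """Render examples as an <examples> block with one <example> per item."""
    parts = ["<examples>"]
    for example in examples:
        parts += [
            "<example>",
            f"Note: {example['note']}",
            f"Input: {example['input']}",
            f"Output: {example['output']}",
            "</example>",
        ]
    parts.append("</examples>")
    return "\n".join(parts)


# Hypothetical examples; the "..." placeholders stand in for real text.
examples = [
    {"note": "concise, formal", "input": "...", "output": "..."},
    {"note": "handles a missing field", "input": "...", "output": "..."},
]

rendered = render_examples(examples)
print(rendered)
```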

Step 7: A/B test example sets

Generate 20 outputs with set A (5 mixed examples) and 20 with set B (3 curated examples). Score them blind against your rubric. The curated set usually wins. This is also the cleanest way to settle whether a borderline example earns its slot.
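One way to keep the scoring blind and compute a win rate, sketched under the assumption that the same inputs are run through both sets so the scores can be paired. The function names and the sample scores are hypothetical; the rubric scores themselves still come from a human or a separate grader.

```python
import random


def blind_order(outputs_a, outputs_b, seed=0):
    """Shuffle outputs from both sets so the scorer cannot see which set made which."""
    items = [("A", text) for text in outputs_a] + [("B", text) for text in outputs_b]
    random.Random(seed).shuffle(items)
    return items


def win_rate(scores_a, scores_b):
    """Share of paired inputs where set B scored strictly higher than set A."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Scores must be paired: one per input for each set")
    wins = sum(b > a for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)


# Hypothetical rubric scores for the same five inputs under each set.
scores_a = [3, 4, 2, 3, 4]  # set A: mixed examples
scores_b = [4, 4, 5, 4, 3]  # set B: curated examples

print(f"Curated set wins on {win_rate(scores_a, scores_b):.0%} of inputs")
```

Ties count as non-wins here, which makes the measure conservative; with only 20 outputs per set, treat a narrow margin as inconclusive.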

How to confirm it’s fixed

Run 10-20 fresh inputs through the curated prompt and check that output length, tone, and structure stay within your target range across the whole batch, not just on cherry-picked cases.
Re-run the 2-3 inputs that previously produced the off-tone or wrong-length output. They should now land on target.
If you A/B tested in Step 7, the curated set’s win rate against your rubric should be clearly higher; if it isn’t, the problem may be the prompt instructions rather than the examples.

When this is not on you

Some tasks genuinely have high variance, like open-ended creative writing, where examples that span several styles are intentional. There, variation in style is fine if it’s deliberate. The bug is when you didn’t intend the variance.

Easy to misdiagnose as

“The model just isn’t good at this task.” Often the model is fine and the examples were noisy. Curate before concluding the model is bad. The finding that accuracy can swing by ~30% from example selection and ordering alone is a good reminder that the prompt, not the model, is usually the variable you control.

Prevention

Re-audit few-shot examples on a schedule (quarterly is a reasonable default); drop or update stale ones.
Aim for 3-5 high-quality examples over a larger mixed set.
Normalize length, tone, and structure across examples.
Add a per-example note when examples cover different scenarios.
Keep labels/categories roughly balanced and avoid ending on an outlier.
A/B test new example sets before shipping; measure win rate.
Treat the example pool as production code: version control and review changes to it.

FAQ

Should few-shot examples be real outputs or synthesized?

Real-world quality is what matters, not provenance. Synthesized examples can be just as effective if they hit the target bar, and many teams ask the model itself to draft candidate examples and then hand-edit them.

Does the order of examples really matter?

Yes. Recency bias means the last example influences output most, and research on in-context example ordering shows accuracy can move by as much as ~30% from order and selection alone. Put your best, most-relevant example last.

How many examples is too many?

Past roughly 3-5, you usually get diminishing returns: token cost keeps rising but accuracy flattens, and more examples increase the odds of introducing a weak or conflicting one. Add examples only when a real failure case demands one.

Should I add a counter-example to fix a bad habit?

Usually no. Removing the example that taught the habit is more reliable than adding a “don’t do this” demonstration, which can confuse the model about which pattern to imitate.

My examples look good but output is still off; now what?

A/B test (Step 7). If a curated, consistent set still underperforms, the issue is likely in the instructions, the success criteria, or a conflict between style and format rather than the examples.

Tags: #Prompt engineering #Troubleshooting #llm-output #few-shot #examples #in-context-learning