You loaded your prompt with 8 examples to “cover every case”. The output now copies whichever example most resembles your input, even when the task needs reasoning. Or worse: the output is a mosaic of all 8 examples, blending phrases that should not coexist.
More examples sound like they should produce more accurate outputs. Past a certain point, they do the opposite: they crowd out the instruction, drift apart in style, introduce edge cases that distort handling of the common case, and consume tokens better spent on the actual task. There is a sweet spot, and it is usually 1-3 examples.
This page walks through why more examples are not always better and how to choose 1-3 representative ones that actually pin down the output.
Common causes
1. Stacking to “cover every case”
You added an example for edge case A, then one for edge case B, then “just in case” examples C through H. The collection now spans a wider style range than your real inputs.
How to spot it: 5+ examples, each handling a different edge.
2. Examples drift in style or structure
Example 1 is concise. Example 4 is verbose. Example 7 has bullets. The model averages across them, so the output is inconsistent from one run to the next.
How to spot it: your examples differ in length, register, or structure.
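A rough way to make this check concrete is to measure the examples instead of eyeballing them. The sketch below is only a heuristic under stated assumptions: `style_report` is a hypothetical helper, word count stands in for length, and a leading bullet marker stands in for structure. It does not measure register.

```python
def style_report(outputs):
    """Summarise length and structure across example outputs."""
    lengths = [len(o.split()) for o in outputs]
    # An example counts as bulleted if any of its lines starts with a bullet marker.
    has_bullets = [
        any(line.lstrip().startswith(("- ", "* ", "• ")) for line in o.splitlines())
        for o in outputs
    ]
    return {
        "word_counts": lengths,
        # Longest example divided by shortest; 1.0 means identical length.
        "length_ratio": max(lengths) / max(min(lengths), 1),
        # True when some examples use bullets and others do not.
        "mixed_structure": len(set(has_bullets)) > 1,
    }


# Illustrative example outputs (made up for this sketch).
examples = [
    "Refund approved. Expect it in 3 days.",
    "Thank you so much for reaching out to us today. We have reviewed your "
    "request in detail and are pleased to confirm that the refund has been approved.",
    "- Refund approved\n- Arrives in 3 days",
]
report = style_report(examples)
print(report)
```

A large `length_ratio` or a true `mixed_structure` suggests the examples disagree with each other before the model ever sees them.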
3. Examples include edge cases that mislead
You added an edge-case example. The model now treats that edge as the typical case and applies its handling to everyday input.
How to spot it: outputs handle edge cases well but mishandle the common case.
4. Examples occupy more tokens than the task
In a 2000-word prompt with 1600 words of examples, the task gets the remaining 400 words, or 20% of the prompt. The model anchors on the dominant content, the examples, and treats the task as a footnote.
How to spot it: token count of examples > task + constraints + output spec combined.
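The ratio above can be checked with a few lines of code. This is a minimal sketch that uses whitespace-separated words as a stand-in for tokens, which is an assumption; a real tokenizer will give different absolute numbers, though the proportion is usually similar. `example_share` is a hypothetical helper name.

```python
def example_share(examples, other_parts):
    """Return the fraction of prompt words taken up by examples."""
    example_words = sum(len(e.split()) for e in examples)
    other_words = sum(len(p.split()) for p in other_parts)
    total = example_words + other_words
    return example_words / total if total else 0.0


# Synthetic stand-in for the 2000-word prompt described above.
examples = ["word " * 200] * 8   # 8 examples of 200 words each = 1600 words
other_parts = ["word " * 400]    # task + constraints + output spec = 400 words

share = example_share(examples, other_parts)
# Examples dominate when they outweigh everything else combined.
examples_dominate = share > 0.5
print(f"examples take {share:.0%} of the prompt")
```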
5. No clear instruction after the examples
You ended on the last example with no “now do X for the input below”. The model assumes you want example 9 in the same vein and produces another example, not the task answer.
How to spot it: prompt ends on an example, not on the task verb.
Before you change anything
- Count your examples. If over 4, you are likely past the sweet spot.
- Identify the typical case vs the edge cases among your examples.
- Decide which 1-3 examples are truly representative.
- Confirm your examples agree on style and structure.
- Plan to put the task instruction immediately after the examples.
Information to collect
- Current prompt with all examples.
- Output that drifted toward an example or blended several.
- Your typical input vs your edge cases.
- Token count of examples vs task.
- Model and any system prompt.
Shortest path to fix
Step 1: Cut to 1-3 representative examples
Pick examples that:
- Cover the typical case (not edge cases)
- Agree on style and structure
- Demonstrate the output shape clearly
For each remaining example, justify its inclusion in one sentence. If you cannot, cut it.
Step 2: Order examples by representativeness
Put the most typical example first, because the first example anchors hardest. The last example also carries extra weight through recency, so keep it within the typical range as well.
Example 1 (most typical): ...
Example 2 (variant within typical range): ...
Example 3 (boundary of typical): ...
Step 3: Label examples explicitly
EXAMPLE 1:
Input: <input 1>
Output: <output 1>
EXAMPLE 2:
Input: <input 2>
Output: <output 2>
NOW DO THIS FOR THE INPUT BELOW:
Input: <real input>
Output:
The explicit “NOW DO THIS” label marks where the examples end and the real task begins, which makes the model less likely to continue the example sequence.
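The labeled layout above can be assembled programmatically so that every prompt ends the same way. The following is a minimal sketch, not a definitive implementation: `build_prompt` is a hypothetical helper, and the sentiment examples are made up for illustration.

```python
def build_prompt(examples, real_input,
                 task_line="NOW DO THIS FOR THE INPUT BELOW:"):
    """Assemble a few-shot prompt that ends on the task, not on an example."""
    parts = []
    for i, (ex_in, ex_out) in enumerate(examples, start=1):
        parts += [f"EXAMPLE {i}:", f"Input: {ex_in}", f"Output: {ex_out}", ""]
    # The task label and the empty Output slot come last, after all examples.
    parts += [task_line, f"Input: {real_input}", "Output:"]
    return "\n".join(parts)


prompt = build_prompt(
    [("Great battery life.", "positive"),
     ("Stopped working in a week.", "negative")],
    "Arrived late but works fine.",
)
print(prompt)
```

Because the function always appends the task label and an empty `Output:` slot last, the prompt cannot accidentally end on a completed example.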
Step 4: Put the task instruction immediately after examples
The task verb should be the last thing the model reads before generating. Recency works in your favor.
Step 5: Handle edge cases separately
If you really need edge-case coverage, do not stuff edge-case examples into the main few-shot prompt. Use a router:
Main prompt: handle typical case (with 1-3 typical examples).
If input matches <edge condition>, route to a separate sub-prompt
with edge-case examples and edge-case rules.
Edge handling stays separate from the common path.
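A router of this kind can be very small. The sketch below assumes a purely hypothetical edge condition (the message mentions legal action) and placeholder prompt strings; in practice the condition might be a regex, a classifier, or a cheap model call.

```python
import re


def is_edge_case(text):
    """Hypothetical edge condition: the message mentions legal action."""
    return re.search(r"\b(lawyer|lawsuit|sue)\b", text, re.IGNORECASE) is not None


# Placeholder prompts; the real ones would hold the actual examples and rules.
PROMPTS = {
    "typical": "<main prompt with 1-3 typical examples>",
    "edge": "<sub-prompt with edge-case examples and edge-case rules>",
}


def route(text):
    """Return the prompt key and prompt text to use for this input."""
    key = "edge" if is_edge_case(text) else "typical"
    return key, PROMPTS[key]


print(route("Where is my order?")[0])
print(route("I will sue if this is not fixed.")[0])
```

The main prompt never sees the edge-case examples, so they cannot pull typical inputs toward edge-case handling.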
Step 6: Audit example budget
If your examples are over 1000 tokens, prune them. Each example should be the minimum needed to demonstrate the pattern, not a full real-world specimen.
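The budget check can be automated. This sketch assumes a crude estimate of about 4 characters per token for English text, which is only a rule of thumb; swap in your model's tokenizer for real counts. `estimate_tokens` and `audit_examples` are hypothetical helper names, and the default budget is the 1000-token threshold from the text.

```python
def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def audit_examples(examples, budget=1000):
    """Report total example size and which examples to trim first."""
    sizes = [(estimate_tokens(e), i) for i, e in enumerate(examples, start=1)]
    total = sum(size for size, _ in sizes)
    return {
        "total_tokens": total,
        "over_budget": total > budget,
        # Example numbers, largest first: the best candidates for pruning.
        "largest_first": [i for _, i in sorted(sizes, reverse=True)],
    }


# Synthetic examples of 400, 4000, and 1200 characters.
examples = ["a" * 400, "b" * 4000, "c" * 1200]
audit = audit_examples(examples)
print(audit)
```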
How to confirm the fix
- Number of examples is between 1 and 3.
- All examples agree on style, structure, and tone.
- The most typical example is first.
- “NOW DO THIS” or equivalent label appears immediately before the real input.
- Output handles the typical case well, and edge cases go through a separate path.
- Running the same prompt 3 times produces consistent outputs.
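The last check, running the prompt 3 times, can be scored rather than judged by eye. In this sketch `generate` is an assumed callable that wraps whatever model client you use; `fake_generate` is a stand-in so the code runs offline. Similarity is plain character-level matching from the standard library, which is a blunt measure and only one possible choice.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency(generate, prompt, runs=3):
    """Run the prompt several times; return the lowest pairwise similarity."""
    outputs = [generate(prompt) for _ in range(runs)]
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    ]
    return min(scores), outputs


# Stand-in for a real model call (hypothetical), so the sketch runs offline.
def fake_generate(prompt):
    return "positive"


score, outputs = consistency(fake_generate, "placeholder prompt")
print(score, outputs)
```

A score near 1.0 means the runs agree almost verbatim; what counts as acceptable depends on how much variation the task allows.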
If it still fails
- Your typical case may not be well-defined — write 1 ideal example and use only that.
- Examples may still drift — audit them line by line for consistency.
- The task may not benefit from examples at all — try removing them entirely.
- For very diverse inputs, switch to retrieval (RAG) that pulls relevant examples per input.
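Per-input example retrieval, the last option above, can be prototyped without any infrastructure. The sketch below uses word-overlap (Jaccard) similarity as a deliberately simple stand-in for embedding-based retrieval; the example library and the `select_examples` helper are hypothetical.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def select_examples(query, library, k=2):
    """Return the k (input, output) pairs whose input overlaps most with the query."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    q = tokens(query)
    ranked = sorted(library, key=lambda ex: jaccard(q, tokens(ex[0])), reverse=True)
    return ranked[:k]


# Made-up example library of (input, output) pairs.
library = [
    ("reset my password", "account"),
    ("refund for damaged item", "billing"),
    ("password reset link expired", "account"),
    ("change delivery address", "shipping"),
]
picked = select_examples("cannot reset password", library, k=2)
print(picked)
```

The prompt then carries only the few examples closest to the current input, instead of a static set that tries to cover everything.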
Prevention
- Default: 1-3 examples max for most tasks.
- Audit example libraries quarterly; cut examples that drift from the standard.
- Keep example budget under 30% of total prompt tokens.
- Place task instruction immediately after examples, never before.
- For diverse inputs, prefer dynamic example selection (retrieval) over static many-shot.
- When in doubt about example count, A/B test with 1, 3, and 5 examples, then pick the lowest count that meets your quality bar.
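The selection rule in the last bullet is easy to encode once you have a quality score per example count. The scores below are hypothetical placeholders, not measured results, and `lowest_count_meeting` is an illustrative helper name.

```python
def lowest_count_meeting(scores, threshold):
    """Return the smallest example count whose score meets the threshold, else None."""
    for count in sorted(scores):
        if scores[count] >= threshold:
            return count
    return None


# Hypothetical quality scores from your own evaluation set, keyed by example count.
scores = {1: 0.78, 3: 0.91, 5: 0.92}
best = lowest_count_meeting(scores, threshold=0.90)
print(best)
```

Here 5 examples score slightly higher than 3, but 3 already clears the bar, so the smaller prompt wins.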
Related reading
- Missing examples output drift
- Long prompt degrades output
- Prompt lacks source hierarchy
- Prompt lacks context hierarchy
- Role instruction alone not enough