Prompt Lacks Context Hierarchy: Why the Model Answers the Wrong Part

You pasted everything as a flat block, so the model can't tell critical lines from background. Add labels, tag your sources, and put the task where attention is highest.

Published: May 20, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team 🌐 查看中文版本

You pasted 1,200 words: a meeting transcript, three Slack messages, two paragraphs of background, four bullets of requirements, and a question at the end. Everything sits at the same visual weight, so the model has no map of what matters. With nothing to anchor on, it falls back on length — and the longest block was the transcript, so it summarized the transcript and ignored your requirements.

This is a hierarchy problem, not a capability problem. The model does not know which lines are load-bearing because you never marked them. It is also a known attention problem: language models attend most to the start and end of a prompt and least to the middle (the “lost in the middle” effect), so anything important buried mid-prompt gets the least weight.

Fastest fix: put the task and your hard requirements in labeled sections, move the long pasted material into tagged blocks, and place the actual question at the very bottom of the prompt. For long inputs (roughly 20k+ tokens / 30+ pages), Anthropic’s own guidance is to put the documents first and the question last — queries at the end improve response quality by up to 30% in their tests (as of June 2026).

What “lost in the middle” actually means

Research on long-context retrieval found a U-shaped attention curve: accuracy on multi-document question answering drops by 30 percentage points or more when the relevant document moves from position 1 to position 10 in a 20-document context. The model is not broken; it weights the edges of the window more than the center. So hierarchy is not cosmetic. Where a fact sits in the prompt changes how much the model “sees” it.

A famous Anthropic example: on a long-context needle-in-a-haystack test, adding a single line — Here is the most relevant sentence in the context: — before the answer span pushed Claude 2.1’s accuracy from 27% to 98%. The fix was not a better model; it was telling the model where to look.

Which bucket are you in

Symptom	Likely cause	Go to
Model summarizes the longest block instead of answering	No section labels; defaults to length	Step 1
Output blends code, transcript, and spec into mush	Mixed input types with no separators	Step 2
A “must” requirement silently dropped	Hard rule written as soft prose	Step 3
Right facts, wrong emphasis	Task buried in the middle of the prompt	Step 4
Model cites the wrong source as authoritative	Pasted material has no provenance	Step 2
Long reference ignored entirely	Section too long, no summary anchor	Step 5

Common causes

1. No section labels

A wall of prose forces the model to infer where context ends and the task begins. That inference picks the dominant theme, usually the background.

How to spot it: no ## headings, no Task: or Background: markers, no list of requirements.

2. Multiple input types pasted together

Code, a transcript, requirements, and screenshots-pasted-as-text in one block. The model averages across them and returns a mongrel response.

How to spot it: two or more input types with no separator (no tags, no fences, no headings) between them.

3. Hard requirements buried in soft prose

It would be nice to have X and We absolutely need Y appear at the same visual weight. The model reads “absolutely” but the soft modal context around it dilutes the signal.

How to spot it: non-negotiable rules written inside paragraphs instead of a labeled, capitalized list.

4. Reference material has no provenance

You pasted text from three sources without labeling which is which. The model treats them as equally authoritative even when one is a rough draft and one is the signed-off spec.

How to spot it: pasted content with no attribution header (<source>, From: …, or a heading naming the document).

5. Sections out of priority order

For a short prompt, the task ends up in the middle, where attention is lowest. For a long prompt, the question is at the very top, before the documents the model needs to read to answer it.

How to spot it: structural inversion — background dominates the high-attention positions and the task does not.

Before you change anything

List each input type in your current prompt (transcript, spec, code, email, etc.).
Mark which lines are hard requirements versus background context.
Decide a priority order: what should the model attend to most?
Note whether each pasted source has clear provenance.
Check the rough token size. Under ~2,000 words behaves differently from a 30-page paste (see Step 4).

Shortest path to fix

Step 1: Label every section

## Task
<one imperative sentence>

## Hard requirements (non-negotiable)
- Requirement 1
- Requirement 2

## Soft preferences (drop if in conflict)
- Preference 1

## Background context
<reference material>

## Output format
<schema>

Visible structure beats inferred structure. Anthropic’s prompting guide frames this with a golden rule: show your prompt to a colleague with no context and ask them to follow it. If they would be confused about what to do, so will the model.

Step 2: Tag mixed inputs with their source

Wrap each pasted block in a tag that names what it is and where it came from. This is the format Anthropic recommends for multi-document inputs:

<documents>
  <document index="1">
    <source>standup-2026-05-21</source>
    <document_content>
      ... transcript text ...
    </document_content>
  </document>
  <document index="2">
    <source>product-spec-v3</source>
    <document_content>
      ... requirements ...
    </document_content>
  </document>
</documents>

A lighter inline form works too when you have only a few blocks:

<transcript source="standup-2026-05-21">
... transcript text ...
</transcript>

<requirements source="product-spec-v3">
... requirements ...
</requirements>

XML-style tags parse cleanly because the model can tell instructions from data by the tag, not by guessing. Markdown fences also separate blocks, but tags additionally carry provenance, which fixes cause #4.

Step 3: Mark hard rules with ALL-CAPS keywords

## Hard requirements

MUST: every output includes the customer's order number.
MUST NOT: reveal internal employee names.
MUST: return valid JSON.

MUST / MUST NOT read as stronger than “should” because they map to RFC 2119, the IETF convention models have absorbed from years of specs and documentation. One detail that matters: in RFC 2119 the normative weight applies only when the keyword is in ALL CAPITALS. Lowercase must reads as ordinary prose, so capitalize the keywords you actually mean as non-negotiable.

Step 4: Put the task where attention is highest

Position depends on how long the prompt is.

Short prompts (under ~2,000 words): the start and end of the prompt get the most attention, so use both. Lead with the task and hard requirements; restate the non-negotiables at the very bottom.

[TOP] Task + non-negotiables
[MID] Background, transcripts, references
[BOT] Output format + restate the non-negotiables

Long prompts (roughly 20k+ tokens / 30+ pages of pasted material): invert it. Anthropic’s long-context guidance is to place the long documents at the top, above the query, and put the actual question and instructions at the very bottom. In their tests this ordering improves response quality by up to 30% on complex, multi-document inputs (as of June 2026).

[TOP] <documents> ... long pasted material ... </documents>
[BOT] Task + instructions + output format

The reason both rules work is the same U-shaped curve: keep the thing you most need the model to act on out of the dead middle.

Step 5: Summarize long references and ground answers in quotes

If a section runs past ~200 words, prepend a two-line summary so the model has a map before the evidence:

<reference source="customer-email">
<summary>Customer is angry about billing; key claim is they were
charged twice for the same period.</summary>
<full>
... 400 words of email ...
</full>
</reference>

For long-document tasks, add one more instruction that consistently lifts accuracy: ask the model to quote the relevant parts first, then answer. Anthropic recommends this explicitly to help the model “cut through the noise” of the surrounding text:

Find the quotes from the documents above that are relevant to the
question, and put them in <quotes> tags. Then answer using only those
quotes. If nothing is relevant, say so.

This forces the retrieval step to happen before the reasoning step, which is exactly the failure mode that flat prompts create.

Step 6: Reuse a template

For recurring task types, save the structured skeleton (Task / Hard requirements / Background / Output format) and fill the slots. Filling a template is faster than rebuilding structure each time and prevents the layout from drifting back into a flat wall.

How to confirm the fix

A stranger reading the prompt can name the task, the requirements, and the background in about 30 seconds.
The output addresses the actual task, not the longest section.
Every hard requirement appears in the output.
Provenance holds: when the model cites a fact, you can trace it to a <source> tag.
Re-running the same prompt produces outputs with the same structure, not a different shape each time.

If it still fails

The prompt may simply be too long. Cut any background that does not change the answer before adding more structure.
You may have too many hard requirements. Rank them and drop the weakest; ten “must”s read like zero.
Move the durable rules out of the message entirely — into the system prompt, a Claude/ChatGPT Project’s custom instructions, or a Cursor rules file — so they persist across turns instead of being re-pasted.
If the input genuinely cannot be shorter, move to a model with a larger, better-attended context window. As of June 2026, Claude Opus 4.7 and Sonnet 4.6 and Gemini 3.1 Pro carry 1M-token windows; ChatGPT Plus exposes roughly 320 pages in-app, with the full 1M context reserved for the $200 Pro tier.

FAQ

Does the model really read the middle of my prompt less carefully?

Yes, and it is measurable. Multi-document QA accuracy drops by 30 points or more when the answer moves from the first position to the tenth in a 20-document context. Putting the critical instruction at the top or bottom — not the middle — directly counters this.

Should the task go at the top or the bottom of the prompt?

For a short prompt, lead with the task and restate the non-negotiables at the end. For a long prompt with big pasted documents (~20k+ tokens), put the documents first and the task last — Anthropic measures up to a 30% quality gain from ending with the query.

Do I have to use XML tags, or is Markdown fine?

Either separates blocks. XML-style tags have one advantage: a tag like <source>product-spec-v3</source> carries provenance, so the model can tell which paste is authoritative. Use tags when source attribution matters; fences are fine when it does not.

Why capitalize MUST and MUST NOT?

Because the RFC 2119 convention only attaches normative weight to the ALL-CAPS forms. Lowercase must reads as ordinary prose. Models trained on specs treat MUST NOT as a hard prohibition, which is what you want for a non-negotiable rule.

My requirements are still being ignored after all this. Now what?

Cut requirement count first (rank and drop the weakest), then add the quote-grounding instruction from Step 5 so the model retrieves before it reasons. If it still slips, move the rules into the system prompt or project instructions so they are not competing with the pasted material for attention.

Prevention

Default rule: any prompt over ~200 words uses labeled sections.
Keep one template per recurring task type with the section skeleton already in place.
Tag every paste with its source whenever you combine multiple sources.
Use MUST / MUST NOT (capitalized) for hard rules; reserve “should” for preferences.
For long inputs, default to documents-first, question-last ordering.
On a team, agree on a shared section taxonomy so everyone’s prompts read the same way.

External references: Anthropic long context prompting tips and the RFC 2119 requirement-keyword definitions.

Tags: #Troubleshooting #Prompt #Prompt quality #Prompt engineering