ChatGPT Skims a Large Document: How to Force Full Analysis

Uploaded a 200-page report, but the summary reads like it only saw 30 pages. ChatGPT runs retrieval, not full reading. Force depth with chunked, page-anchored queries.

Published: May 17, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

Fastest fix: stop asking ChatGPT to “summarize the whole report.” Ask it to list the section headings with page ranges first, then query one chapter at a time and demand a verbatim quote plus page number for every claim. That alone fixes most cases, because it forces ChatGPT to retrieve each section deliberately instead of guessing from whatever it grabbed.

Here’s why the default fails: ChatGPT doesn’t read a large upload top-to-bottom. It builds a private semantic index from the file, then on each turn it retrieves the chunks that look most relevant to your prompt and injects only those into the model’s context. As of June 2026 the retrieval pass inserts roughly 110,000 tokens from the index per query (OpenAI Help Center, “File Uploads FAQ”), and a single file is indexed up to 2,000,000 tokens of text; anything past that is silently dropped. A 200-page PDF answered with one “summarize this” call may put only 8-15 chunks (a handful of pages) in front of the model. You think it read the whole thing; it answered from the slice that happened to be retrieved. That is by design, not a bug, and structured prompts (chunk by section, name pages, query tables individually) push the whole document through.
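The effect of top-k retrieval can be seen in a toy sketch. This is not ChatGPT’s actual pipeline: plain word-overlap cosine similarity stands in for the semantic index, and the 200 one-page chunks, the value k=10, and the text on page 142 are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(chunks, prompt, k):
    """Return the page numbers of the k chunks that score highest against the prompt."""
    query = Counter(tokenize(prompt))
    scored = [(cosine(Counter(tokenize(text)), query), page) for page, text in chunks.items()]
    # Highest score first; ties fall back to page order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [page for _, page in scored[:k]]


# Hypothetical 200-page report, one chunk per page.
chunks = {page: f"filler text about operations on page {page}" for page in range(1, 201)}
chunks[142] = "Chapter 7 risk factors: currency exposure rose to 18 percent of revenue"

vague = retrieve_top_k(chunks, "summarize this report", k=10)
pinned = retrieve_top_k(chunks, "quote the currency exposure figure in chapter 7 risk factors", k=10)

print("page 142 retrieved by vague prompt:", 142 in vague)
print("page 142 retrieved by pinned prompt:", 142 in pinned)
print(len(vague), "of", len(chunks), "pages reached the model")
```

The vague prompt shares no vocabulary with page 142, so that page never makes the cut. The prompt that names the chapter and topic pulls it in.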

Which bucket are you in

Symptom	Likely cause	Jump to
Whole-doc summary is all themes, no numbers	Retrieval grabbed a few chunks; rest never loaded	Cause 1, Step 1-2
It can’t quote a page you know exists	That page was never retrieved or never extracted	Cause 1 / 3, Step 1
One chapter reads as “mostly figures / little content”	Extraction failed on image or table pages	Cause 3, Step 5-6
Same prompt gives different numbers each run	Retrieval re-ranks per wording	Cause 4, Step 2
First and last chapters fine, middle wrong or blank	Lost in the middle	Cause 6, Step 2
Huge file, only first pages seem present	Hit a size or token ceiling	Cause 5, Step 5

Common causes

Ordered by hit rate, highest first.

1. Retrieval limited to top-k chunks; the full doc never lands together

The most common failure. One “summarize this report” call retrieves only a handful of chunks of a few hundred words each, so maybe 5-10 pages of actual content reach the model. The other 190 pages were never read.

How to spot it: Ask it to “quote page 142, first paragraph verbatim,” then check the quote against the PDF. If it can’t quote the page, or the quote doesn’t match, that page was never retrieved.

2. Large input triggers “highlights” mode

The model sees “summarize 200-page report” and defaults to summarization rather than close reading — smoothing over specific numbers in favor of themes.

How to spot it: Lots of meta language in the answer (“the report covers…, focuses on…, emphasizes…”) with no concrete numbers / names / dates.

3. Some pages failed extraction (image pages, scans, complex layouts)

The PDF parser silently fails on image-only pages, multi-column mixed layouts, and table-heavy pages — the file uploads fine but those pages never enter the retrieval pool.

How to spot it: Have it list every chapter’s page count and compare to the real TOC. Chapters where it says “very little content” or “mostly figures” = extraction failure.

4. Same question gives different answers on different turns

Each retrieval re-ranks chunks based on your prompt’s exact wording; tiny rewording shifts the ranking. Two identical-looking questions can return different details.

How to spot it: Ask the same prompt twice, 5 minutes apart. Key numbers / list items differ between the two answers = retrieval randomness.
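Comparing two runs by eye is error-prone, so a small script can do it. The sketch below assumes you paste both answers in as strings; the two sample answers are invented, and the regular expression only catches plain numbers, thousands separators, decimals, and percentages.

```python
import re

# Digits with optional thousands separators, decimals, and a trailing percent sign.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def extract_numbers(answer):
    """Collect the distinct numeric tokens that appear in one answer."""
    return set(NUMBER.findall(answer))


def diff_runs(run_a, run_b):
    """Report which numbers appear in only one of two answers to the same prompt."""
    a, b = extract_numbers(run_a), extract_numbers(run_b)
    return {
        "only_in_first": sorted(a - b),
        "only_in_second": sorted(b - a),
        "shared": sorted(a & b),
    }


# Made-up answers from two runs of the same prompt.
run_1 = "Revenue grew 12.5% to 1,200 units in 2025."
run_2 = "Revenue grew 11% to 1,200 units in 2025."

result = diff_runs(run_1, run_2)
print(result)
```

Anything listed under only_in_first or only_in_second is a figure to re-ask with the page or section pinned in the prompt.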

5. File hits a hard size or token ceiling

As of June 2026, ChatGPT rejects any single file over 512 MB, and it only indexes the first 2,000,000 tokens of text per file (roughly a 3,000-page text document); anything past that is silently ignored during search. Very large files can also fall back to a lighter processing path that surfaces mostly the opening pages.

How to spot it: With Advanced Data Analysis active, have it run a tiny script and print os.path.getsize(path) plus len(pdf_pages). Reported page count matches the real file = fully loaded; a mismatch (or a number capped well below the real total) = you hit a ceiling, so split the file.
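A minimal sketch of that check follows. It assumes the pypdf package is importable in the sandbox and that you know the real page count; the helper names and the commented-out path are hypothetical. The 512 MB figure is the limit quoted above.

```python
import os

MAX_FILE_BYTES = 512 * 1024 * 1024  # 512 MB single-file limit quoted in this article


def count_pdf_pages(path):
    """Count pages with pypdf (assumption: pypdf is available where this runs)."""
    from pypdf import PdfReader
    return len(PdfReader(path).pages)


def ceiling_verdict(size_bytes, pages_seen, pages_expected):
    """Compare what the sandbox sees with what you know about the real file."""
    if size_bytes > MAX_FILE_BYTES:
        return "over the 512 MB file limit: split the file"
    if pages_seen != pages_expected:
        return f"sees {pages_seen} pages, expected {pages_expected}: split the file"
    return "fully loaded"


def check_upload(path, pages_expected):
    """Run both measurements on a real file and return the verdict."""
    return ceiling_verdict(os.path.getsize(path), count_pdf_pages(path), pages_expected)


# Dry run with made-up numbers, no file needed:
print(ceiling_verdict(48_000_000, 200, 200))
print(ceiling_verdict(48_000_000, 60, 200))

# With a real upload (hypothetical path):
# print(check_upload("/mnt/data/report.pdf", pages_expected=200))
```

This only confirms that the file arrived whole in the code sandbox. It does not prove that every page was indexed for retrieval, so the chapter-listing check in Step 1 is still worth running.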

6. Context window is big but attention is diluted

In ChatGPT, GPT-5.5 Instant runs a 32K-token context on Plus and Business; GPT-5.5 Thinking opens up to 256K, but only when you pick it manually; the full 1M-token context is available in-app only on the $200 Pro tier. Even when the window can theoretically hold a 200-page PDF, attention is uneven: the well-documented “lost in the middle” effect (Liu et al., 2023) makes mid-document content much easier to miss, with accuracy following a U-shaped curve that favors the start and end of the context.

How to spot it: First few and last few chapters answered accurately; middle chapters missing or wrong = lost in the middle. Switching to GPT-5.5 Thinking often recovers the middle on its own.

Before you start

Confirm whether this happens in a plain chat, a Project, or a Custom GPT — Custom GPT Knowledge uses a different retrieval path than ad-hoc uploads.
Duplicate the chat before retesting so history doesn’t pollute the next diagnostic.
Confirm your plan: Free, Go ($8), Plus ($20), Pro ($100/$200), Business, and Enterprise differ noticeably in context window and model access. Only paid tiers expose the model picker that lets you switch to GPT-5.5 Thinking.

Info to collect

File type, size (MB), total pages / rows, number of chapters.
Whether the PDF is scanned, and whether it contains non-ASCII text, formulas, many figures, or multi-column layouts.
Upload route: dragged into chat, Project Files, Custom GPT Knowledge.
Full prompt text + reply screenshot; identify which specific chapters / pages are obviously missing.
Current model + subscription tier.

Shortest fix path

Ordered by ROI. The first two solve ~70% of cases.

Step 1: List chapter headings first, compare to TOC

Don’t open with “summarize the whole thing.” Turn 1:

List every section heading in the uploaded document, with page ranges.
Format as a numbered list. Do not summarize content yet.

Compare output against the actual TOC. Missing chapters = extraction / retrieval failure for that section — handle separately.
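The comparison can be scripted once you have typed in the real TOC. In this sketch the (heading, first page, last page) tuples are a hypothetical format, and every entry is invented except the Chapter 3 range used later in this guide.

```python
def find_gaps(real_toc, reported):
    """Return (heading, problem) pairs for sections the model missed or mis-paged.

    Both arguments are lists of (heading, first_page, last_page) tuples.
    Headings are matched case-insensitively after trimming whitespace.
    """
    reported_by_title = {
        title.strip().lower(): (first, last) for title, first, last in reported
    }
    problems = []
    for title, first, last in real_toc:
        seen = reported_by_title.get(title.strip().lower())
        if seen is None:
            problems.append((title, "missing"))
        elif seen != (first, last):
            problems.append((title, f"page range {seen[0]}-{seen[1]}, expected {first}-{last}"))
    return problems


# Real TOC, typed in by hand (hypothetical values).
real_toc = [("Chapter 1", 1, 20), ("Chapter 2", 21, 46), ("Chapter 3", 47, 72)]
# What ChatGPT listed in response to the Step 1 prompt (hypothetical values).
reported = [("Chapter 1", 1, 20), ("chapter 3", 47, 70)]

gaps = find_gaps(real_toc, reported)
for title, problem in gaps:
    print(f"{title}: {problem}")
```

Sections flagged as missing are the ones to split out or OCR in Steps 5 and 6. A wrong page range usually means only part of the section was extracted.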

Step 2: Chapter-by-chapter prompts + force evidence

Don’t ask for one whole-doc summary. One chapter at a time:

Summarize Chapter 3 (pages 47-72). Output:
- 3 concrete facts from this chapter, each with a 1-sentence quote
  + page number
- 2 numerical data points (with units)
- 1 sentence on what this chapter does NOT cover

Do not generalize. If you cannot find 3 facts, return fewer.

Per-chapter retrieval concentrates attention in a smaller window — depth returns immediately.
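For a long report it is easier to generate these prompts from the TOC than to type them. A minimal sketch: the template wording is the prompt above, and the TOC entries other than Chapter 3 are hypothetical.

```python
PROMPT_TEMPLATE = """Summarize {title} (pages {first}-{last}). Output:
- 3 concrete facts from this chapter, each with a 1-sentence quote + page number
- 2 numerical data points (with units)
- 1 sentence on what this chapter does NOT cover

Do not generalize. If you cannot find 3 facts, return fewer."""


def chapter_prompts(toc):
    """Build one evidence-forcing prompt per (heading, first_page, last_page) entry."""
    return [
        PROMPT_TEMPLATE.format(title=title, first=first, last=last)
        for title, first, last in toc
    ]


# Hypothetical TOC; paste each generated prompt into ChatGPT as its own turn.
toc = [("Chapter 1", 1, 20), ("Chapter 2", 21, 46), ("Chapter 3", 47, 72)]
prompts = chapter_prompts(toc)

print(prompts[2])
```

Sending each prompt as a separate turn matters. Batching them into one message brings back the whole-document retrieval problem.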

Step 3: Query tables / numbers separately

Data-dense chapters get a dedicated pass:

The document contains tables on pages 18, 34, 67, and 102.
For each table:
- Title / caption
- Row count + column count
- Quote the headers verbatim
- Quote the first 3 data rows verbatim

Per-table queries, never “summarize all the tables.”

Step 4: Aggregate yourself, don’t ask the model to “summarize the chapter summaries”

After collecting 10 chapter answers, don’t ask “based on the above, summarize the report’s themes” — it will re-smooth.

Aggregate on your side:

Put each chapter’s key facts in a Markdown table.
Eyeball patterns, contradictions, cross-chapter conclusions yourself.
For specific cross-chapter comparisons, ask targeted questions (“chapter 3 says X, chapter 7 says Y, reconcile”).
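One way to build that table on your side is sketched below. The field names (chapter, fact, quote, page) are an assumed layout, and the two sample rows are placeholders, not data from any real report.

```python
COLUMNS = ("chapter", "fact", "quote", "page")


def facts_table(rows):
    """Render collected chapter facts as a Markdown table, one fact per row."""
    lines = [
        "| Chapter | Fact | Quote | Page |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        # Escape pipes and flatten newlines so a quote cannot break the table.
        cells = [str(row[key]).replace("|", "\\|").replace("\n", " ") for key in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder rows copied from per-chapter answers (made up for illustration).
rows = [
    {"chapter": "Chapter 3", "fact": "Example fact A", "quote": "example quote A", "page": 52},
    {"chapter": "Chapter 7", "fact": "Example fact B", "quote": "example | quote B", "page": 141},
]

table = facts_table(rows)
print(table)
```

Keeping the quote and page next to each fact makes contradictions between chapters easy to spot and easy to verify against the PDF.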

Step 5: If extraction is shaky, split the PDF

If Step 1 returns an incomplete chapter list, split locally:

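# By chapter: read the page ranges off the bookmarks or TOC first
pdftk full.pdf dump_data | grep -A 2 BookmarkTitle   # lists each bookmark title, level, and page number
# Name outputs by page range so citations map back to the original
pdftk full.pdf cat 1-46 output pages-001-046.pdf
pdftk full.pdf cat 47-72 output chapter3-pages-047-072.pdf
# Or fixed-size split, 30 pages per output file
qpdf --split-pages=30 full.pdf part-%d.pdf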

Upload each chunk of 30-50 pages or fewer; per-file extraction coverage improves noticeably.
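If neither command-line tool is installed, the same fixed-size split can be sketched in Python. This assumes the pypdf package is available; the function names and output file pattern are hypothetical.

```python
def split_ranges(total_pages, chunk_size=30):
    """Return 1-based inclusive (first, last) page ranges that cover the document."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def write_chunks(source_path, ranges, prefix="part"):
    """Write one PDF per page range (assumption: pypdf is installed)."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source_path)
    for first, last in ranges:
        writer = PdfWriter()
        for index in range(first - 1, last):  # pypdf pages are 0-indexed
            writer.add_page(reader.pages[index])
        # Original page numbers go into the filename for later citation checks.
        with open(f"{prefix}-{first:03d}-{last:03d}.pdf", "wb") as handle:
            writer.write(handle)


ranges = split_ranges(200, chunk_size=30)
print(ranges)

# With a real file (hypothetical path):
# write_chunks("full.pdf", ranges)
```

Page numbers restart at 1 inside each split file, so a citation from a chunk has to be offset by the first page in its filename before you check it against the original.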

Step 6: For scanned / image-heavy PDFs, OCR + convert to Markdown

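# OCR: add a searchable text layer to scanned pages
brew install ocrmypdf
# chi_sim needs the extra Tesseract language data (on Homebrew: brew install tesseract-lang)
ocrmypdf --language eng+chi_sim input.pdf output.pdf

# Higher-quality structured extraction to Markdown
pip install marker-pdf
# Flags differ between marker-pdf releases (newer ones use --output_dir and --page_range); check marker_single --help
marker_single input.pdf ./out --max_pages 300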

Upload the resulting Markdown; retrieval works noticeably better on clean Markdown than on ChatGPT’s own PDF parsing.

Step 7: For recurring report types, build a Custom GPT

For quarterly / weekly reports you analyze repeatedly:

Knowledge: current report + 3-4 historical comparisons.
Instructions: bake in the TOC — “This doc type always has sections: Executive Summary, Financials, Risk Factors…”
Replace the Knowledge file each cycle; reuse your prompt templates.

How to confirm the fix

Open a fresh chat, upload the same file, run Step 2’s chapter-by-chapter prompt — every chapter produces concrete numbers + page citations = truly fixed.
Pick one quote, Ctrl+F in the PDF — found at the cited page = actually read.
Have a colleague run the same chunked prompts — consistent coverage = stable process, not luck.
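The Ctrl+F check can be batched when you have many citations. The sketch below assumes you already have the text of each page in a dictionary (for example from pypdf’s per-page extract_text, which is an assumption about your tooling). The sample pages and quotes are made up.

```python
import re

# Map curly quotes to straight ones so typographic differences do not cause misses.
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def normalize(text):
    """Straighten quotes, collapse whitespace, and lowercase for tolerant matching."""
    return re.sub(r"\s+", " ", text.translate(QUOTE_MAP)).strip().lower()


def check_citations(page_texts, citations):
    """Check each (quote, page) pair against a {page_number: text} mapping."""
    results = []
    for quote, page in citations:
        needle = normalize(quote)
        if needle in normalize(page_texts.get(page, "")):
            status = "found on cited page"
        else:
            elsewhere = [p for p, text in sorted(page_texts.items()) if needle in normalize(text)]
            status = f"found on page {elsewhere[0]} instead" if elsewhere else "not found"
        results.append((quote, page, status))
    return results


# Made-up page text and citations for illustration.
page_texts = {47: "Net revenue rose\nto 4.2 million.", 48: "Headcount was flat."}
citations = [
    ("Net revenue rose to 4.2 million", 47),
    ("Headcount was flat", 47),
    ("Margins doubled", 48),
]

report = check_citations(page_texts, citations)
for quote, page, status in report:
    print(f"p.{page}: {status}: {quote}")
```

A quote that is found on a different page usually means the chunk was read but mislabeled. A quote that is not found anywhere should be treated as not read.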

If still broken

Cut to minimum: keep only the 5-10 pages you care most about, see if depth comes back.
Swap extraction tool: PDF → Marker Markdown → upload Markdown directly, bypassing ChatGPT’s PDF parser.
Switch model: open the picker and move from GPT-5.5 Instant to GPT-5.5 Thinking (or Pro on a $200 plan). The reasoning modes handle long-document cross-synthesis and the middle of the document noticeably better.
Package the source file plus your chunked prompt plus the list of missing chapters, and file a ticket at help.openai.com.

Prevention

Mental model for large docs: ChatGPT is a “chapter-level research assistant,” not a “one-click summarizer.”
Always list TOC first, then ask per chapter — never “summarize the whole thing.”
Every chapter request must demand quote + page; treat unsupported claims as not-read.
For recurring report types, build a Custom GPT with the TOC baked into Instructions.
For decision-grade summaries, have ChatGPT output raw facts; write the conclusion yourself.

FAQ

Does a bigger ChatGPT plan make it read the whole document? Partly. A bigger plan widens the window (GPT-5.5 Thinking reaches 256K, and the $200 Pro tier gets the full 1M in-app context), so more of the file can sit in context at once. But the retrieval ceiling and “lost in the middle” effect still apply, so chunked, page-anchored prompts beat a bigger plan for accuracy on a 200-page file.

Why can it quote page 12 but not page 150? Because it never retrieved page 150. ChatGPT pulls only the chunks that match your prompt, around 110,000 tokens per pass as of June 2026. Naming the page or section in your prompt is what forces that chunk into context.

Is uploading a PDF or pasting the text better? For text-heavy reports, converting to clean Markdown and uploading that beats a raw PDF, because you skip ChatGPT’s PDF parser, which silently fails on scans, multi-column layouts, and table-heavy pages. Pasting works only for short excerpts that fit the visible context.

How big a file is too big? A single file is rejected over 512 MB, and only the first 2,000,000 tokens of text (roughly 3,000 text pages) are indexed; the rest is ignored during search. Long before those hard limits, retrieval accuracy degrades, so splitting into 30-50 page chunks helps well under the ceiling.

Why do I get different numbers when I ask the same thing twice? Each turn re-ranks chunks against your exact wording, so a tiny rewording changes which pages get retrieved. Pin the page or section in the prompt and demand a verbatim quote to make answers reproducible.
