ChatGPT Skims a Large Document — How to Force Full Analysis

You upload a 200-page report, but the summary reads as if the model only read 30 pages. The model is "skimming"; force depth with chunked queries.

ChatGPT doesn’t read large documents top to bottom. It runs retrieval: the file is split into chunks (usually 500-1500 tokens each), the chunks are ranked by relevance to your prompt, and only the top-k chunks land in context. For a 200-page PDF, that can mean only 8-15 chunks reach the model. You think it read the whole thing; it actually answered from the slice that happened to be retrieved. That’s by design, not a bug, but structured prompts (chunk by section, name pages, query tables individually) can walk the whole document through retrieval one piece at a time.
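
To make the chunk-then-rank step concrete, here is a toy sketch. It is not ChatGPT's actual pipeline: the word-overlap score, the 5-word chunk size, and the names `chunk_words`, `score`, and `top_k_chunks` are all illustrative assumptions. The only point is that whatever ranks below the top k never reaches the model.

```python
import re

def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` words (toy chunker)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(chunk: str, query: str) -> int:
    """Toy relevance: how many distinct query words appear in the chunk."""
    chunk_terms = set(re.findall(r"\w+", chunk.lower()))
    query_terms = set(re.findall(r"\w+", query.lower()))
    return len(chunk_terms & query_terms)

def top_k_chunks(chunks: list[str], query: str, k: int) -> list[int]:
    """Return the indices of the k best-scoring chunks, best first."""
    ranked = sorted(range(len(chunks)),
                    key=lambda i: score(chunks[i], query),
                    reverse=True)
    return ranked[:k]

# Made-up 20-word "document" that splits into four 5-word chunks.
document = (
    "revenue grew in first quarter "
    "risk factors include supply delays "
    "revenue forecast for next year "
    "appendix lists all office addresses"
)
chunks = chunk_words(document, size=5)
selected = top_k_chunks(chunks, "revenue forecast", k=2)
print(f"{len(selected)} of {len(chunks)} chunks reach the model: {selected}")
```

With k=2, the risk-factor and appendix chunks are never seen, no matter how important they are to the report.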

Common causes

Ordered by hit rate, highest first.

1. Retrieval limited to top-k chunks; the full doc never lands together

The most common failure. One “summarize this report” call may retrieve only around 10 chunks of a few hundred words each, so perhaps 5-10 pages of actual content reach the model. The other 190-odd pages were never read.

How to spot it: Ask it to “quote page 142, first paragraph verbatim.” If it can’t, that page was never retrieved.
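
The arithmetic behind that estimate can be sketched as follows. The conversion factors (about 0.75 words per token, about 500 words per page) are rough assumptions, not measured values, and `pages_covered` is a hypothetical helper name.

```python
def pages_covered(num_chunks: int, tokens_per_chunk: int,
                  words_per_token: float = 0.75,
                  words_per_page: int = 500) -> float:
    """Estimate how many pages of text a set of retrieved chunks represents."""
    words = num_chunks * tokens_per_chunk * words_per_token
    return words / words_per_page

total_pages = 200
low = pages_covered(num_chunks=10, tokens_per_chunk=500)    # 7.5 pages
high = pages_covered(num_chunks=15, tokens_per_chunk=1500)  # 33.75 pages
print(f"Roughly {low:.1f} to {high:.1f} of {total_pages} pages reach the model")
```

Even the generous end of the range leaves most of a 200-page report unread in a single call.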

2. Large input triggers “highlights” mode

The model sees “summarize 200-page report” and defaults to summarization rather than close reading — smoothing over specific numbers in favor of themes.

How to spot it: Lots of meta language in the answer (“the report covers…, focuses on…, emphasizes…”) with no concrete numbers / names / dates.
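
This check can be roughed out as a heuristic: count meta phrases and count numbers. The sketch below is an assumption-laden illustration, not a reliable classifier. The phrase list only contains the examples quoted above, and `skim_signals` / `looks_skimmed` are hypothetical names.

```python
import re

# Illustrative phrase list; extend it for your own report type.
META_PHRASES = ("the report covers", "focuses on", "emphasizes")

def skim_signals(answer: str) -> dict[str, int]:
    """Count meta phrases and numeric tokens in a model answer."""
    lowered = answer.lower()
    meta_hits = sum(lowered.count(phrase) for phrase in META_PHRASES)
    numbers = len(re.findall(r"\d+(?:[.,]\d+)*", answer))
    return {"meta_phrases": meta_hits, "numbers": numbers}

def looks_skimmed(answer: str) -> bool:
    """Flag answers that use meta language but contain no numbers at all."""
    signals = skim_signals(answer)
    return signals["meta_phrases"] > 0 and signals["numbers"] == 0

# Made-up example answers.
vague = "The report covers market trends and focuses on growth."
concrete = "Revenue rose 12.5% to $4.2 million in 2023 (page 47)."
print(looks_skimmed(vague), looks_skimmed(concrete))
```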

3. Some pages failed extraction (image pages, scans, complex layouts)

The PDF parser silently fails on image-only pages, multi-column mixed layouts, and table-heavy pages — the file uploads fine but those pages never enter the retrieval pool.

How to spot it: Have it list every chapter’s page count and compare to the real TOC. Chapters where it says “very little content” or “mostly figures” = extraction failure.
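
If you can extract the text locally with any PDF tool, the same failure shows up as pages with almost no text. A minimal sketch, assuming you already have one string per page; the 200-character threshold and the name `sparse_pages` are arbitrary assumptions:

```python
def sparse_pages(page_texts: list[str], min_chars: int = 200) -> list[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [number
            for number, text in enumerate(page_texts, start=1)
            if len(text.strip()) < min_chars]

# Made-up extraction result: page 2 is a scan, page 3 is a figure with a caption.
pages = ["x" * 500, "", "Figure 3", "y" * 300]
print(sparse_pages(pages))  # [2, 3]
```

Pages flagged this way are candidates for the OCR route in Step 6.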

4. Same question gives different answers on different turns

Each retrieval re-ranks chunks based on your prompt’s exact wording; tiny rewording shifts the ranking. Two identical-looking questions can return different details.

How to spot it: Send the exact same prompt twice (5 minutes apart). If key numbers / list items differ between the two answers, you are seeing retrieval and sampling variance.
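
Comparing the two answers by eye is error-prone; a small diff of the numbers each one contains makes the drift obvious. This is a sketch with hypothetical helper names (`extract_numbers`, `number_drift`), and it only catches numeric differences, not reworded list items.

```python
import re

def extract_numbers(answer: str) -> set[str]:
    """Collect the integer and decimal tokens that appear in an answer."""
    return set(re.findall(r"\d+(?:\.\d+)?", answer))

def number_drift(first: str, second: str) -> dict[str, list[str]]:
    """List numbers that appear in only one of the two answers."""
    a, b = extract_numbers(first), extract_numbers(second)
    return {"only_in_first": sorted(a - b), "only_in_second": sorted(b - a)}

# Made-up answers to the same prompt.
run_1 = "Revenue was 4.2 million across 3 regions."
run_2 = "Revenue was 4.5 million across 3 regions."
print(number_drift(run_1, run_2))
```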

5. File near / over a format size threshold

PDF > 50MB, CSV > 100MB, etc. trigger different processing paths — some go through a “lite” mode (metadata + first few pages only).

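How to spot it: After upload, ask it to run Python on the file and print os.path.getsize(path) plus the page count it can see (for example, the length of the page list from a PDF library). Reported pages match actual = fully loaded; mismatch = lite path.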
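
One way to phrase that check as code you can paste into the chat and ask the model to run on the uploaded file. This is a sketch under assumptions: `file_report` is a hypothetical helper, and the page count relies on the third-party `pypdf` package, which may or may not be available in the model's sandbox. If it is missing, the page count stays `None`.

```python
import os
import tempfile

def file_report(path: str) -> dict:
    """Report file size in MiB and, for PDFs, the page count if pypdf is usable."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    report = {"size_mb": round(size_mb, 2), "pages": None}
    if path.lower().endswith(".pdf"):
        try:
            from pypdf import PdfReader  # third-party; assumed, not guaranteed
            report["pages"] = len(PdfReader(path).pages)
        except Exception:
            # pypdf is missing or the file could not be parsed; keep pages=None.
            pass
    return report

# Demo on a throwaway 2 MiB file; replace with the real upload path.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
demo = file_report(tmp.name)
os.remove(tmp.name)
print(demo)
```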

6. Context window looks big but attention is diluted

GPT-5.5 / 5 have 128K-200K-token windows that can theoretically hold a 200-page PDF, but attention over that much text is uneven: the “lost in the middle” effect means mid-document content is recalled far less reliably than the beginning and end.

How to spot it: First few + last few chapters answered accurately; middle chapters missing or wrong = lost in the middle.

Before you start

  • Confirm whether this happens in a plain chat, a Project, or a Custom GPT — Custom GPT Knowledge uses a different retrieval path than ad-hoc uploads.
  • Retest in a fresh chat (or in a duplicate branched before the failed attempt) so earlier turns don’t pollute the next diagnostic.
  • Confirm your plan: Free / Plus / Team / Enterprise differ noticeably in context window and available models.

Info to collect

  • File type, size (MB), total pages / rows, number of chapters.
  • Whether the PDF is scanned; whether it has non-ASCII text / formulas / many figures / multi-column layouts.
  • Upload route: dragged into chat, Project Files, Custom GPT Knowledge.
  • Full prompt text + reply screenshot; identify which specific chapters / pages are obviously missing.
  • Current model + subscription tier.

Shortest fix path

Ordered by ROI. The first two solve ~70% of cases.

Step 1: List chapter headings first, compare to TOC

Don’t open with “summarize the whole thing.” Turn 1:

List every section heading in the uploaded document, with page ranges.
Format as a numbered list. Do not summarize content yet.

Compare output against the actual TOC. Missing chapters = extraction / retrieval failure for that section — handle separately.
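
For a long TOC, the comparison is easier as a set difference. A minimal sketch, assuming you paste both lists in by hand; `missing_sections` is a hypothetical helper, and matching is by case- and whitespace-insensitive exact title only.

```python
def normalize(heading: str) -> str:
    """Lowercase a heading and collapse runs of whitespace."""
    return " ".join(heading.lower().split())

def missing_sections(toc: list[str], reported: list[str]) -> list[str]:
    """Return TOC headings that the model did not list, in TOC order."""
    seen = {normalize(heading) for heading in reported}
    return [heading for heading in toc if normalize(heading) not in seen]

# Made-up example lists.
real_toc = ["Executive Summary", "Financials", "Risk Factors"]
model_list = ["executive  summary", "Risk Factors"]
print(missing_sections(real_toc, model_list))  # ['Financials']
```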

Step 2: Chapter-by-chapter prompts + force evidence

Don’t ask for one whole-doc summary. One chapter at a time:

Summarize Chapter 3 (pages 47-72). Output:
- 3 concrete facts from this chapter, each with a 1-sentence quote
  + page number
- 2 numerical data points (with units)
- 1 sentence on what this chapter does NOT cover

Do not generalize. If you cannot find 3 facts, return fewer.

Per-chapter retrieval concentrates attention in a smaller window — depth returns immediately.
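
With many chapters, the prompts can be generated from the TOC rather than retyped. The sketch below fills the template above from a list of page ranges; the `Chapter` type and `chapter_prompts` function are hypothetical names, and the page numbers are examples.

```python
from typing import NamedTuple

class Chapter(NamedTuple):
    number: int
    first_page: int
    last_page: int

PROMPT_TEMPLATE = (
    "Summarize Chapter {number} (pages {first_page}-{last_page}). Output:\n"
    "- 3 concrete facts from this chapter, each with a 1-sentence quote + page number\n"
    "- 2 numerical data points (with units)\n"
    "- 1 sentence on what this chapter does NOT cover\n"
    "\n"
    "Do not generalize. If you cannot find 3 facts, return fewer."
)

def chapter_prompts(chapters: list[Chapter]) -> list[str]:
    """Build one evidence-demanding prompt per chapter."""
    return [PROMPT_TEMPLATE.format(**chapter._asdict()) for chapter in chapters]

prompts = chapter_prompts([Chapter(3, 47, 72), Chapter(4, 73, 98)])
print(prompts[0])
```

Send each prompt as its own turn so every chapter gets its own retrieval pass.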

Step 3: Query tables / numbers separately

Data-dense chapters get a dedicated pass:

The document contains tables on pages 18, 34, 67, and 102.
For each table:
- Title / caption
- Row count + column count
- Quote the headers verbatim
- Quote the first 3 data rows verbatim

Query tables one at a time; never ask it to “summarize all the tables.”

Step 4: Aggregate yourself, don’t ask the model to “summarize the chapter summaries”

After collecting 10 chapter answers, don’t ask “based on the above, summarize the report’s themes” — it will re-smooth.

Aggregate on your side:

  • Put each chapter’s key facts in a Markdown table.
  • Eyeball patterns, contradictions, cross-chapter conclusions yourself.
  • For specific cross-chapter comparisons, ask targeted questions (“chapter 3 says X, chapter 7 says Y, reconcile”).
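
The Markdown table can be assembled with a few lines of code once the per-chapter facts are collected. This is a sketch: the column set, the dictionary keys, and the name `facts_to_markdown` are assumptions, and the row below is made-up sample data.

```python
def facts_to_markdown(rows: list[dict[str, str]]) -> str:
    """Render collected chapter facts as a Markdown table."""
    lines = ["| Chapter | Fact | Quote | Page |",
             "| --- | --- | --- | --- |"]
    for row in rows:
        cells = [row["chapter"], row["fact"], row["quote"], row["page"]]
        # Escape pipes so a quote cannot break the table layout.
        cells = [str(cell).replace("|", "\\|") for cell in cells]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)

# Made-up sample row.
table = facts_to_markdown([
    {"chapter": "3", "fact": "Revenue rose 12%",
     "quote": "Revenue rose 12% year over year.", "page": "51"},
])
print(table)
```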

Step 5: If extraction is shaky, split the PDF

If Step 1 returns an incomplete chapter list, split locally:

# By chapter (take the page ranges from the bookmarks or TOC)
pdftk full.pdf cat 1-46 output pages-1-46.pdf
pdftk full.pdf cat 47-72 output chapter3.pdf
# Or fixed-size split (30 pages per output file)
qpdf --split-pages=30 full.pdf part-%d.pdf

Upload each chunk of at most 30-50 pages; per-file extraction coverage usually improves noticeably.
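
If you prefer explicit page ranges over an automatic split, the ranges and matching pdftk commands can be generated. A sketch under assumptions: `page_ranges` and `pdftk_commands` are hypothetical helpers, the output file naming is arbitrary, and the commands use the same `cat A-B output` form shown above.

```python
import shlex

def page_ranges(total_pages: int, pages_per_part: int) -> list[tuple[int, int]]:
    """Split 1..total_pages into consecutive inclusive ranges."""
    if total_pages < 1 or pages_per_part < 1:
        raise ValueError("total_pages and pages_per_part must be positive")
    return [(start, min(start + pages_per_part - 1, total_pages))
            for start in range(1, total_pages + 1, pages_per_part)]

def pdftk_commands(source: str, total_pages: int, pages_per_part: int) -> list[str]:
    """Build one pdftk command line per page range."""
    return [
        f"pdftk {shlex.quote(source)} cat {first}-{last} "
        f"output part-{first:03d}-{last:03d}.pdf"
        for first, last in page_ranges(total_pages, pages_per_part)
    ]

commands = pdftk_commands("full.pdf", total_pages=200, pages_per_part=30)
print("\n".join(commands))
```

For a 200-page file at 30 pages per part this yields seven commands, the last covering pages 181-200.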

Step 6: For scanned / image-heavy PDFs, OCR + convert to Markdown

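# OCR (Homebrew shown; chi_sim also needs Tesseract language data, e.g. brew install tesseract-lang)
brew install ocrmypdf
ocrmypdf input.pdf output.pdf --language eng+chi_sim

# Higher-quality structured extraction
# (flags differ between marker versions; check marker_single --help)
pip install marker-pdf
marker_single input.pdf ./out --max_pages 300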

Upload the resulting Markdown; retrieval quality is usually much better than with ChatGPT’s own PDF parsing.

Step 7: For recurring report types, build a Custom GPT

For quarterly / weekly reports you analyze repeatedly:

  • Knowledge: current report + 3-4 historical comparisons.
  • Instructions: bake in the TOC — “This doc type always has sections: Executive Summary, Financials, Risk Factors…”
  • Replace the Knowledge file each cycle; reuse your prompt templates.

How to confirm the fix

  • Open a fresh chat, upload the same file, run Step 2’s chapter-by-chapter prompt — every chapter produces concrete numbers + page citations = truly fixed.
  • Pick one quote, Ctrl+F in the PDF — found at the cited page = actually read.
  • Have a colleague run the same chunked prompts — consistent coverage = stable process, not luck.
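
The Ctrl+F check can also be scripted when there are many quotes to verify. A minimal sketch, assuming you already have the text of the cited page as a string from any extractor; `quote_on_page` is a hypothetical helper, and it only tolerates differences in case, whitespace, line breaks, and curly quotes.

```python
import re

def normalize_text(text: str) -> str:
    """Lowercase, straighten curly quotes, and collapse whitespace."""
    text = (text.replace("\u2019", "'")
                .replace("\u201c", '"')
                .replace("\u201d", '"'))
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_on_page(quote: str, page_text: str) -> bool:
    """Check whether the quote appears on the page after normalization."""
    return normalize_text(quote) in normalize_text(page_text)

# Made-up page text with line breaks, as extractors often produce.
page_51 = "Net revenue\nrose 12% in\n2023."
print(quote_on_page("net revenue rose 12% in 2023.", page_51))  # True
print(quote_on_page("net revenue fell 12% in 2023.", page_51))  # False
```

A quote that fails this check is either paraphrased or attributed to the wrong page; treat both as not-read.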

If still broken

  • Cut to minimum: keep only the 5-10 pages you care most about, see if depth comes back.
  • Swap extraction tool: PDF → Marker Markdown → upload Markdown directly, bypassing ChatGPT’s PDF parser.
  • Switch model: 4o → o3 / GPT-5; reasoning models handle long-document cross-synthesis better.
  • Package source file + chunked prompt + list of missing chapters, file a ticket at help.openai.com.

Prevention

  • Mental model for large docs: ChatGPT is a “chapter-level research assistant,” not a “one-click summarizer.”
  • Always list TOC first, then ask per chapter — never “summarize the whole thing.”
  • Every chapter request must demand quote + page; treat unsupported claims as not-read.
  • For recurring report types, build a Custom GPT with the TOC baked into Instructions.
  • For decision-grade summaries, have ChatGPT output raw facts; write the conclusion yourself.
