ChatGPT usually doesn’t read a large document top to bottom. It runs retrieval: the file is split into chunks (typically 500-1500 tokens each), the chunks are ranked by relevance to your prompt, and only the top-k land in context. For a 200-page PDF, that can mean only 8-15 chunks reach the model. You think it read the whole thing; it actually answered from the slice that happened to be retrieved. That’s by design, not a bug, but structured prompts (chunk by section, name pages, query tables individually) can walk the whole document through the model piece by piece.
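To make the chunk, rank, top-k pipeline concrete, here is a toy sketch. It is not ChatGPT's actual retrieval code: the word-overlap scoring, the function names, and the sample text are all assumptions chosen for illustration (real systems typically rank with embeddings).

```python
import re

def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k_chunks(chunks: list[str], query: str, k: int) -> list[int]:
    """Return indices of the k chunks sharing the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(chunk)), index) for index, chunk in enumerate(chunks)]
    # Highest overlap first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]

# Hypothetical sample document, for illustration only.
document = (
    "Revenue grew twelve percent in the third quarter. "
    "The audit committee met four times. "
    "Risk factors include currency exposure and supplier concentration. "
    "Headcount rose to 840 employees. "
)
chunks = chunk_words(document, chunk_size=8)
selected = top_k_chunks(chunks, "What are the risk factors?", k=1)
never_seen = [i for i in range(len(chunks)) if i not in selected]
```

With k=1, only one of the four chunks reaches the "model"; everything in `never_seen` is invisible to the answer, which is the failure mode this article describes.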
Common causes
Ordered by hit rate, highest first.
1. Retrieval limited to top-k chunks; the full doc never lands together
The most common failure. One “summarize this report” call retrieves 5-10 chunks of a few hundred words each — maybe 5-10 pages of actual content reach the model. The other 190 pages were never read.
How to spot it: Ask it to “quote page 142, first paragraph verbatim.” If it can’t, that page was never retrieved.
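The arithmetic behind "the other 190 pages were never read" can be sketched as follows. The words-per-page and words-per-chunk figures are assumptions for illustration, not measured values.

```python
def coverage_fraction(total_pages: int, chunks_retrieved: int,
                      words_per_chunk: int, words_per_page: int) -> float:
    """Estimate the share of a document that retrieved chunks can cover."""
    if min(total_pages, words_per_page) < 1:
        raise ValueError("total_pages and words_per_page must be positive")
    words_read = chunks_retrieved * words_per_chunk
    total_words = total_pages * words_per_page
    # Cap at 1.0: retrieval cannot cover more than the whole document.
    return min(1.0, words_read / total_words)

# Assumed figures: 200 pages, 10 chunks, about 400 words per chunk and per page.
estimate = coverage_fraction(total_pages=200, chunks_retrieved=10,
                             words_per_chunk=400, words_per_page=400)
```

Under these assumptions the estimate is 0.05, meaning roughly 5% of the document was available to the model for that answer.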
2. Large input triggers “highlights” mode
The model sees “summarize 200-page report” and defaults to summarization rather than close reading — smoothing over specific numbers in favor of themes.
How to spot it: Lots of meta language in the answer (“the report covers…, focuses on…, emphasizes…”) with no concrete numbers / names / dates.
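A rough way to automate this check is to count meta phrases against concrete numbers in the reply. The phrase list and the threshold of three numbers below are arbitrary assumptions; tune them to your own reports.

```python
import re

# Hypothetical list of "meta language" markers; extend as needed.
META_PHRASES = ("the report covers", "focuses on", "emphasizes", "discusses", "highlights")

def looks_like_highlights(answer: str, min_numbers: int = 3) -> bool:
    """Flag answers that use meta language but contain few concrete numbers."""
    lowered = answer.lower()
    meta_hits = sum(lowered.count(phrase) for phrase in META_PHRASES)
    numbers = re.findall(r"\d+(?:[.,]\d+)*", answer)
    return meta_hits > 0 and len(numbers) < min_numbers

vague = "The report covers market trends and focuses on growth. It emphasizes risk."
specific = "Revenue was 4.2 million in 2023, up 12% from 3.75 million (page 18)."
```

This is only a heuristic: it cannot tell whether the numbers are correct, only whether the answer contains any.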
3. Some pages failed extraction (image pages, scans, complex layouts)
The PDF parser silently fails on image-only pages, multi-column mixed layouts, and table-heavy pages — the file uploads fine but those pages never enter the retrieval pool.
How to spot it: Have it list every chapter’s page count and compare to the real TOC. Chapters where it says “very little content” or “mostly figures” = extraction failure.
4. Same question gives different answers on different turns
Each retrieval re-ranks chunks based on your prompt’s exact wording; tiny rewording shifts the ranking. Two identical-looking questions can return different details.
How to spot it: Run the same prompt twice, about 5 minutes apart. If key numbers or list items differ between the two answers, you are seeing retrieval variability.
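Comparing two replies by eye is error-prone, so a small diff of the numbers they contain can help. This is a minimal sketch; the regular expression only handles plain integers, decimals, and percentages, and the sample replies are made up.

```python
import re

def extract_numbers(answer: str) -> set[str]:
    """Collect integers, decimals, and percentages that appear in an answer."""
    return set(re.findall(r"\d+(?:\.\d+)?%?", answer))

def number_drift(first: str, second: str) -> list[str]:
    """Return the numbers that appear in exactly one of the two answers."""
    return sorted(extract_numbers(first) ^ extract_numbers(second))

# Hypothetical replies to the same prompt.
run_a = "Revenue rose 12% to 4.2 million across 3 regions."
run_b = "Revenue rose 15% to 4.2 million across 3 regions."
drift = number_drift(run_a, run_b)
```

An empty result does not prove the answers agree in substance, but a non-empty one tells you exactly which figures to verify in the source.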
5. File near / over a format size threshold
PDF > 50MB, CSV > 100MB, etc. trigger different processing paths — some go through a “lite” mode (metadata + first few pages only).
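How to spot it: After upload, ask it to run Python on the file and print the size (os.path.getsize) and the number of pages it can actually open (for example len(reader.pages) with pypdf, if that library is available in its sandbox). If the reported page count matches the real one, the file loaded fully; a mismatch points to the lite path.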
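You can also check the size locally before uploading. The sketch below uses the thresholds quoted in this article (50MB for PDF, 100MB for CSV); treat them as assumptions rather than documented limits, and note that the function names are illustrative.

```python
import os

# Thresholds as stated in this article; assumptions, not documented limits.
SIZE_LIMITS_MB = {".pdf": 50, ".csv": 100}

def exceeds_limit(size_bytes: int, extension: str) -> bool:
    """Return True if a file of this size and type is over its assumed threshold."""
    limit_mb = SIZE_LIMITS_MB.get(extension.lower())
    if limit_mb is None:
        return False  # no threshold assumed for this file type
    return size_bytes > limit_mb * 1024 * 1024

def file_exceeds_limit(path: str) -> bool:
    """Same check, reading the size and extension from a file on disk."""
    extension = os.path.splitext(path)[1]
    return exceeds_limit(os.path.getsize(path), extension)
```

Calling `file_exceeds_limit("full.pdf")` before upload tells you whether splitting the file first is worth the effort.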
6. Context window looks big but attention is diluted
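GPT-5.5 and GPT-5 have context windows in the 128K-200K-token range, which can in theory hold a 200-page PDF, but attention is spread thin across that much text: the “lost in the middle” effect means mid-document content is used far less reliably than the beginning and end.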
How to spot it: First few + last few chapters answered accurately; middle chapters missing or wrong = lost in middle.
Before you start
- Confirm whether this happens in a plain chat, a Project, or a Custom GPT — Custom GPT Knowledge uses a different retrieval path than ad-hoc uploads.
- Duplicate the chat before retesting so history doesn’t pollute the next diagnostic.
- Confirm your plan: Free / Plus / Team / Enterprise differ noticeably in context window and available models.
Info to collect
- File type, size (MB), total pages / rows, number of chapters.
- Whether the PDF is scanned, and whether it contains non-ASCII text, formulas, many figures, or multi-column layouts.
- Upload route: dragged into chat, Project Files, Custom GPT Knowledge.
- Full prompt text + reply screenshot; identify which specific chapters / pages are obviously missing.
- Current model + subscription tier.
Shortest fix path
Ordered by ROI. The first two solve ~70% of cases.
Step 1: List chapter headings first, compare to TOC
Don’t open with “summarize the whole thing.” Turn 1:
List every section heading in the uploaded document, with page ranges.
Format as a numbered list. Do not summarize content yet.
Compare output against the actual TOC. Missing chapters = extraction / retrieval failure for that section — handle separately.
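The comparison against the real TOC can be done mechanically once you have both lists. A minimal sketch, assuming you paste the headings in by hand; the section names are hypothetical.

```python
def normalize(heading: str) -> str:
    """Lowercase a heading and collapse runs of whitespace."""
    return " ".join(heading.lower().split())

def missing_sections(actual_toc: list[str], model_listed: list[str]) -> list[str]:
    """Return TOC headings that the model did not list, in TOC order."""
    seen = {normalize(heading) for heading in model_listed}
    return [heading for heading in actual_toc if normalize(heading) not in seen]

# Hypothetical headings for illustration.
actual = ["Executive Summary", "Financials", "Risk Factors"]
listed = ["executive  summary", "Risk Factors"]
gaps = missing_sections(actual, listed)
```

Exact matching after normalization is deliberately strict; if the model paraphrases headings, you may need a looser comparison.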
Step 2: Chapter-by-chapter prompts + force evidence
Don’t ask for one whole-doc summary. One chapter at a time:
Summarize Chapter 3 (pages 47-72). Output:
- 3 concrete facts from this chapter, each with a 1-sentence quote + page number
- 2 numerical data points (with units)
- 1 sentence on what this chapter does NOT cover
Do not generalize. If you cannot find 3 facts, return fewer.
Per-chapter retrieval concentrates attention in a smaller window — depth returns immediately.
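If the document has many chapters, generating the prompts from the TOC saves retyping. This sketch fills the template above from a list of chapter numbers and page ranges; the ranges shown are placeholders, not taken from a real report.

```python
def chapter_prompt(number: int, first_page: int, last_page: int) -> str:
    """Build the evidence-forcing prompt for one chapter."""
    return (
        f"Summarize Chapter {number} (pages {first_page}-{last_page}). Output:\n"
        "- 3 concrete facts from this chapter, each with a 1-sentence quote + page number\n"
        "- 2 numerical data points (with units)\n"
        "- 1 sentence on what this chapter does NOT cover\n"
        "Do not generalize. If you cannot find 3 facts, return fewer."
    )

# Hypothetical TOC rows: (chapter number, first page, last page).
toc = [(1, 1, 20), (2, 21, 46), (3, 47, 72)]
prompts = [chapter_prompt(*row) for row in toc]
```

Send the prompts one per turn rather than all at once, so each retrieval pass stays focused on a single chapter.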
Step 3: Query tables / numbers separately
Data-dense chapters get a dedicated pass:
The document contains tables on pages 18, 34, 67, and 102.
For each table:
- Title / caption
- Row count + column count
- Quote the headers verbatim
- Quote the first 3 data rows verbatim
Per-table queries, never “summarize all the tables.”
Step 4: Aggregate yourself, don’t ask the model to “summarize the chapter summaries”
After collecting 10 chapter answers, don’t ask “based on the above, summarize the report’s themes” — it will re-smooth.
Aggregate on your side:
- Put each chapter’s key facts in a Markdown table.
- Eyeball patterns, contradictions, cross-chapter conclusions yourself.
- For specific cross-chapter comparisons, ask targeted questions (“chapter 3 says X, chapter 7 says Y, reconcile”).
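Building the Markdown table by hand gets tedious after a few chapters. Here is a minimal sketch, assuming you have copied each fact into a (chapter, fact, page) tuple; the sample rows are invented.

```python
def facts_table(rows: list[tuple[int, str, int]]) -> str:
    """Render (chapter, fact, page) rows as a Markdown table."""
    lines = ["| Chapter | Fact | Page |", "| --- | --- | --- |"]
    for chapter, fact, page in rows:
        # Escape pipes so a fact cannot break the table layout.
        safe_fact = fact.replace("|", "\\|")
        lines.append(f"| {chapter} | {safe_fact} | {page} |")
    return "\n".join(lines)

# Hypothetical facts collected from chapter-level answers.
rows = [(3, "Revenue rose 12%", 51), (7, "Churn | retention split", 130)]
table = facts_table(rows)
```

Keeping the page column makes it easy to spot-check any row against the PDF later.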
Step 5: If extraction is shaky, split the PDF
If Step 1 returns an incomplete chapter list, split locally:
# By chapter: take page ranges from the TOC or bookmarks (pdftk full.pdf dump_data lists them)
pdftk full.pdf cat 1-46 output pages-1-46.pdf
pdftk full.pdf cat 47-72 output chapter3.pdf
# Or fixed-size split
qpdf --split-pages=30 full.pdf part-%d.pdf
Upload each part separately, keeping parts to roughly 30-50 pages; smaller files tend to be extracted more completely.
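If you prefer to control the part boundaries yourself, the page ranges for a fixed-size split can be computed and turned into pdftk commands. This sketch only builds the command strings and does not run them; the output file naming is an assumption.

```python
def page_ranges(total_pages: int, pages_per_part: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges covering the document."""
    if total_pages < 1 or pages_per_part < 1:
        raise ValueError("total_pages and pages_per_part must be positive")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]

def pdftk_commands(source: str, total_pages: int, pages_per_part: int) -> list[str]:
    """Build one 'pdftk ... cat A-B output ...' command per part."""
    return [
        f"pdftk {source} cat {first}-{last} output part-{index:02d}.pdf"
        for index, (first, last) in enumerate(page_ranges(total_pages, pages_per_part), start=1)
    ]

ranges = page_ranges(200, 30)
commands = pdftk_commands("full.pdf", 200, 30)
```

For a 200-page file split at 30 pages this yields seven parts, the last one covering pages 181-200.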
Step 6: For scanned / image-heavy PDFs, OCR + convert to Markdown
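# OCR (Homebrew shown; on Linux use your package manager or pip)
brew install ocrmypdf
# eng+chi_sim needs the matching Tesseract language data installed
ocrmypdf input.pdf output.pdf --language eng+chi_sim
# Higher-quality structured extraction
pip install marker-pdf
# Arguments differ between marker versions; confirm with marker_single --help
marker_single input.pdf ./out --max_pages 300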
Upload the resulting Markdown — retrieval quality jumps over PDF parsing.
Step 7: For recurring report types, build a Custom GPT
For quarterly / weekly reports you analyze repeatedly:
- Knowledge: current report + 3-4 historical comparisons.
- Instructions: bake in the TOC — “This doc type always has sections: Executive Summary, Financials, Risk Factors…”
- Replace the Knowledge file each cycle; reuse your prompt templates.
How to confirm the fix
- Open a fresh chat, upload the same file, run Step 2’s chapter-by-chapter prompt — every chapter produces concrete numbers + page citations = truly fixed.
- Pick one quote, Ctrl+F in the PDF — found at the cited page = actually read.
- Have a colleague run the same chunked prompts — consistent coverage = stable process, not luck.
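The quote check in the list above can be scripted if you already have per-page text (for example from the Markdown conversion in Step 6). This is a sketch under that assumption; the `page_texts` dictionary and its contents are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_on_page(quote: str, page_texts: dict[int, str], page: int) -> bool:
    """Return True if the quote appears in the text of the cited page."""
    return normalize(quote) in normalize(page_texts.get(page, ""))

# Hypothetical extracted text keyed by page number.
page_texts = {142: "Net  revenue\nrose to 4.2 million."}
found = quote_on_page("net revenue rose", page_texts, 142)
wrong_page = quote_on_page("net revenue rose", page_texts, 143)
```

A quote that is found on a different page than cited is still worth flagging: the content was read, but the citation is unreliable.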
If still broken
- Cut to minimum: keep only the 5-10 pages you care most about, see if depth comes back.
- Swap extraction tool: PDF → Marker Markdown → upload Markdown directly, bypassing ChatGPT’s PDF parser.
- Switch model: 4o → o3 / GPT-5; reasoning models handle long-document cross-synthesis better.
- Package source file + chunked prompt + list of missing chapters, file a ticket at help.openai.com.
Prevention
- Mental model for large docs: ChatGPT is a “chapter-level research assistant,” not a “one-click summarizer.”
- Always list TOC first, then ask per chapter — never “summarize the whole thing.”
- Every chapter request must demand quote + page; treat unsupported claims as not-read.
- For recurring report types, build a Custom GPT with the TOC baked into Instructions.
- For decision-grade summaries, have ChatGPT output raw facts; write the conclusion yourself.