Uploaded PDF Not Analyzed Correctly in ChatGPT

PDF uploaded fine but ChatGPT misses tables, skips pages, or invents numbers. The fix is almost always the extraction layer, not the model. Step-by-step diagnosis with current limits.

Published: May 17, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

ChatGPT does not “see” your PDF the way you do. On the Free, Go, Plus, and Pro plans, PDF ingestion is text-only: a parser pulls out the selectable text layer, builds a private semantic index, and the model answers from retrieved fragments of that index. If the text layer is bad — scanned pages with no text at all, multi-column layouts extracted out of order, tables flattened into a run-on number stream, embedded subset fonts that decode to garbage, image-only pages — the model is summarizing noise, and the answer is wrong. Figures and charts are dropped entirely (only ChatGPT Enterprise’s “Visual Retrieval with PDFs” feature reads them in-chat, as of June 2026).

That single fact reorders your whole debugging process: most “ChatGPT analyzed my PDF wrong” reports are extraction failures, not reasoning failures. Verify the extracted text first; only touch the model and the prompt after the input is known-good.

TL;DR

ChatGPT reads the text layer, not the page. No text layer (scanned/photo PDF) means no content reaches the model.
It does not reliably auto-OCR scans, and it ignores embedded images/charts on Free/Go/Plus/Pro. Fix that before upload.
Large files: the cap is 512 MB and 2 million tokens per file (as of June 2026). Beyond ~2M tokens the tail is silently dropped, so a 3,000-page report can look “half-read.”
The two highest-ROI fixes — OCR the scan, and convert complex tables to Markdown locally — clear roughly 70% of cases.
No-CLI 30-second fix to try first: open the PDF in Chrome or Edge, Print -> Save as PDF, re-upload. The browser rewrites the file with a clean text layer and often unsticks a PDF that ChatGPT refused with No text could be extracted from this file.
Confirm a fix by making ChatGPT quote one sentence verbatim and Ctrl+F-ing it in the source PDF.

Common causes

Ordered by hit rate, highest first.

1. Scanned pages have no text layer

The most common failure. PDFs made from phone photos or paper scans look normal, but each page is one big image. There is no text to extract, so the model receives an effectively empty page and answers “no content on that page.”

How to spot it: In a PDF reader (Preview / Adobe), try mouse-selecting text on the problem page. Can’t select = no text layer = scanned page.

2. Multi-column / complex layout extraction order is wrong

Academic papers and two-column magazine PDFs: the extractor often reads straight across the page (left to right, top to bottom) and ignores the column gutter, so text from the left column is interleaved with text from the right column and the sentences come out scrambled.

How to spot it: Ask it to “quote page 3 paragraph 1 verbatim” — output reads as oddly stitched sentences / no clear subject-verb chain = order is wrong.
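To make the failure concrete, here is a minimal sketch of why a row-major reading order scrambles a two-column page. The word coordinates, the `gutter_x` value, and both function names are hypothetical; real extractors work on richer layout data, so treat this as an illustration rather than how ChatGPT's parser is implemented.

```python
from typing import List, Tuple

# Each word is (x, y, text): x = distance from the left edge, y = distance from the top.
Word = Tuple[float, float, str]


def row_major_order(words: List[Word]) -> str:
    """Naive order: top to bottom, then left to right across the whole page."""
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    return " ".join(text for _, _, text in ordered)


def column_aware_order(words: List[Word], gutter_x: float) -> str:
    """Read everything left of the gutter first, then everything right of it."""
    left = sorted((w for w in words if w[0] < gutter_x), key=lambda w: (w[1], w[0]))
    right = sorted((w for w in words if w[0] >= gutter_x), key=lambda w: (w[1], w[0]))
    return " ".join(text for _, _, text in left + right)


# Hypothetical two-column page: left column near x=50, right column near x=350.
page = [
    (50, 100, "Revenue"), (120, 100, "grew"),
    (50, 120, "in"), (80, 120, "Q1."),
    (350, 100, "Costs"), (410, 100, "fell"),
    (350, 120, "in"), (380, 120, "Q2."),
]

print(row_major_order(page))                   # columns interleaved line by line
print(column_aware_order(page, gutter_x=300))  # each column read in full
```

The first line printed is the kind of "oddly stitched" sentence described above; layout-aware extractors (Step 3) effectively do the second ordering for you.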

3. Tables flattened, rows / columns lost

Tables in PDFs are just drawn lines + text boxes; the extractor often loses the row/column structure and outputs a flat stream of numbers. “Q1 revenue 100, Q2 200” can come out as “Q1 revenue Q2 100 200.”

How to spot it: Table-related questions wrong / numbers off-by-one column = this case. Asking “extract Table 2 row by row” produces a mess = confirms.
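The row/column loss can be sketched in a few lines. Table cells are positioned text boxes, and cells in the same visual row rarely share an identical baseline, so an extractor that sorts on exact coordinates can emit them in the wrong order. The sketch below groups words into rows using a vertical tolerance; the coordinates, `y_tolerance`, and the function name are assumptions chosen for illustration.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (x, y, text), with y measured from the top


def group_into_rows(words: List[Word], y_tolerance: float = 3.0) -> List[List[str]]:
    """Cluster words whose y positions are within y_tolerance, then sort each row by x."""
    rows: List[List[Tuple[float, str]]] = []
    current: List[Tuple[float, str]] = []
    current_y = None
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        # Start a new row when the vertical jump exceeds the tolerance.
        if current_y is None or abs(y - current_y) > y_tolerance:
            if current:
                rows.append(current)
            current = []
            current_y = y
        current.append((x, text))
    if current:
        rows.append(current)
    return [[text for _, text in sorted(row)] for row in rows]


# Hypothetical table whose cells sit on slightly different baselines.
cells = [
    (50, 100, "Quarter"), (200, 101, "Revenue"),
    (50, 120, "Q1"), (200, 119, "100"),
    (50, 140, "Q2"), (200, 141, "200"),
]

print(group_into_rows(cells))
```

Sorting the same cells on exact `(y, x)` alone would put "100" before "Q1", which is the off-by-one-column symptom described above.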

4. Embedded fonts / non-standard CIDs produce garbled output

CJK PDFs / math formulas / unusual fonts often use embedded subset fonts with non-standard CID maps. Extraction yields replacement characters (“��”) or random Latin-letter substitutions in place of the real characters.

How to spot it: Ask it to quote a passage and the output is obviously garbled / character-substituted = font issue.
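If you have already extracted the text locally, a rough check for this failure is to measure how much of it consists of characters that usually signal a failed decode. The sketch below is a heuristic under stated assumptions: it counts the Unicode replacement character plus private-use, unassigned, and control characters, and the function name is hypothetical. It will not catch the "random letter substitution" variant, which still needs a visual spot-check.

```python
import unicodedata


def garble_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like failed decoding."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = 0
    for c in chars:
        category = unicodedata.category(c)
        # U+FFFD is the replacement character; "Co" = private use,
        # "Cn" = unassigned, "Cc" = control character.
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            suspicious += 1
    return suspicious / len(chars)


print(garble_ratio("Revenue grew in Q1."))
print(garble_ratio("Rev\ufffd\ufffdue"))
```

A ratio noticeably above zero on a page that looks fine in a PDF reader points at the font mapping rather than at the model.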

5. Huge PDF exceeds the 2M-token index, so the tail is dropped

ChatGPT does not hold a big PDF in context. It builds a semantic index (capped at 2 million tokens per file, as of June 2026) and pulls only the relevant fragments — roughly 110k tokens per retrieval pass — into the model. Anything past the 2M-token ceiling is silently ignored. A 3,000-page legal brief can sit well under the 512 MB size cap yet have its last third never indexed, so the model confidently answers about the first chapters and blanks on the rest.

How to spot it: Ask it to “list each chapter and its page count” against the real table of contents. Clearly missing tail chapters = the index was truncated.
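A back-of-the-envelope estimate can tell you whether a file is anywhere near the ceiling before you upload it. The sketch below assumes roughly 4 characters per token, which is only a common rule of thumb for English text (CJK and dense numeric tables tokenize differently); the function names are hypothetical and the 2,000,000 figure is the cap quoted in this article.

```python
INDEX_CAP_TOKENS = 2_000_000  # per-file index cap quoted in this article


def estimate_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed average, not a measured value."""
    return int(char_count / chars_per_token)


def indexed_fraction(char_count: int, chars_per_token: float = 4.0) -> float:
    """Approximate share of the document that fits under the index cap."""
    tokens = estimate_tokens(char_count, chars_per_token)
    if tokens <= INDEX_CAP_TOKENS:
        return 1.0
    return INDEX_CAP_TOKENS / tokens


# Example: 12 million extracted characters is about 3M tokens under this assumption.
print(estimate_tokens(12_000_000))
print(round(indexed_fraction(12_000_000), 2))
```

Feed it the total character count from your local extraction (Step 1); a fraction well below 1.0 is a strong hint to split the file (Step 5).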

6. Encrypted / DRM-protected PDF can’t be extracted

Password-protected or DRM PDFs upload but extraction fails. Model returns “I cannot access this file’s content.”

How to spot it: Run qpdf --is-encrypted your.pdf locally. The command prints nothing; an exit status of 0 means the file is encrypted, 2 means it is not.
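Because `qpdf --is-encrypted` reports its result only through the exit status, it is easy to wrap in a script. The sketch below assumes a reasonably recent qpdf on your PATH and the documented status codes (0 = encrypted, 2 = not encrypted); the function names are hypothetical, and any other status is treated as an error.

```python
import subprocess


def interpret_qpdf_exit(code: int) -> str:
    """Map the exit status of `qpdf --is-encrypted` to a readable label."""
    if code == 0:
        return "encrypted"
    if code == 2:
        return "not encrypted"
    return "error"  # e.g. unreadable or corrupt file


def check_encryption(path: str) -> str:
    """Run qpdf on a local file; raises FileNotFoundError if qpdf is not installed."""
    result = subprocess.run(["qpdf", "--is-encrypted", path], capture_output=True)
    return interpret_qpdf_exit(result.returncode)
```

Calling `check_encryption("your.pdf")` before uploading separates "encrypted" from "corrupt", which need different fixes.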

7. Images / figures completely ignored

On Free, Go, Plus, and Pro, PDF ingestion is text-only — charts, flowcharts, and scanned figures are not read at all. The model never sees them, so it cannot reference their content. Only ChatGPT Enterprise’s “Visual Retrieval with PDFs” feature (Enterprise-only as of June 2026; not on Free, Plus, Pro, Team, or Edu) interprets in-chart numbers and diagrams during a normal upload.

How to spot it: A chart obviously contains numbers, but the answer never mentions any, and asking “what does Figure 3 show?” yields a guess or a refusal = the figure was never read.

Current limits worth knowing (June 2026)

These are the numbers that quietly cause “half-read” or “rejected” PDFs. All figures as of June 2026.

Limit	Value	What breaks when you hit it
Max file size	512 MB per file	Upload rejected outright
Index cap (text/PDF)	2,000,000 tokens per file	Tail beyond ~2M tokens silently dropped
Retrieval per pass	~110,000 tokens fetched per query	Long files answered from fragments, not the whole
Reliable-processing size	keep under ~25 MB	Larger files upload but flake during indexing
Files per message	10 (Plus and above)	11th attachment ignored
Upload rate, Free	~3 files/day	Further uploads blocked
Upload rate, Plus	up to 80 files / 3 hours	Throttled after the cap
Upload rate, Team/Enterprise	up to 160 files / 3 hours	Throttled after the cap

Sources: OpenAI File Uploads FAQ and OpenAI’s published per-plan caps.
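The size and attachment rows of the table can be turned into a quick pre-upload check. This is a sketch, not an official validator: the thresholds are copied from the table above, "MB" is assumed to mean 2**20 bytes, and the function name is hypothetical.

```python
from typing import List

MB = 1024 * 1024  # assumption: the table's "MB" is treated as 2**20 bytes

MAX_FILE_BYTES = 512 * MB      # hard cap per file
RELIABLE_BYTES = 25 * MB       # "reliable processing" guideline
MAX_FILES_PER_MESSAGE = 10     # attachments per message


def preflight(file_sizes: List[int]) -> List[str]:
    """Return human-readable problems for a planned upload (sizes in bytes)."""
    problems = []
    if len(file_sizes) > MAX_FILES_PER_MESSAGE:
        problems.append(
            f"{len(file_sizes)} files: only the first {MAX_FILES_PER_MESSAGE} are read"
        )
    for index, size in enumerate(file_sizes, start=1):
        if size > MAX_FILE_BYTES:
            problems.append(f"file {index}: over 512 MB, upload rejected")
        elif size > RELIABLE_BYTES:
            problems.append(f"file {index}: over ~25 MB, indexing may be unreliable")
    return problems


print(preflight([30 * MB, 2 * MB]))
```

An empty list means the upload clears these particular limits; it says nothing about the token cap or the quality of the text layer.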

Before you start

Confirm where it happened: a Project, a Custom GPT, or plain chat. PDF handling is the same text-only pipeline, but Projects persist files across the whole project, which changes what is actually attached to a given message.
Duplicate the chat before retesting so prior history doesn’t pollute the next diagnostic.
Confirm your plan. Free, Go, Plus, Pro, Team, and Enterprise share the 512 MB / 2M-token caps, but only Enterprise reads figures in-chat.

Info to collect

PDF type (digitally native / scanned / mixed), page count, file size (MB).
Whether it has non-ASCII / math formulas / tables / figures / multi-column layout.
Full prompt text + bad-answer screenshot; specifically which pages / which table / which numbers were wrong.
Current model + upload route (chat / Project / Custom GPT).

Shortest fix path

Ordered by ROI. Two of these steps, OCR (Step 2) and local Markdown conversion (Step 3), solve roughly 70% of cases.

Step 0: The 30-second no-CLI fix

Before installing anything, try the browser rewrite. Open the PDF in Chrome or Edge, hit Print, choose Save as PDF (or Microsoft Print to PDF), save, and upload the new file. The browser re-renders the document with a clean, single-stream text layer, which frequently fixes a PDF that ChatGPT rejected with No text could be extracted from this file or that came back scrambled. This only works when the original has some real text layer (it cannot conjure text for a pure scan — for that, go to Step 2), but it costs nothing to try and clears a surprising share of layout/encoding cases.

Step 1: Verify text-selectability page by page

Open in a local reader (macOS Preview / Adobe / browser):

Go to the pages with bad content.
Try mouse-selecting the text.
Selects = text layer present; doesn’t = scanned / bitmap.

Or batch-check via CLI:

# Count extracted characters per page (pdftotext separates pages with a form feed)
pdftotext -layout your.pdf - | awk 'BEGIN { RS = "\f" } { print "page " NR ": " length($0) " chars" }'
# Or use Python
pip install pdfplumber
python -c "import pdfplumber; pdf = pdfplumber.open('your.pdf'); print([len(p.extract_text() or '') for p in pdf.pages])"

Per-page char counts all 0 / single digits = scanned or broken extraction.
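Once you have per-page character counts (for example, the list printed by the pdfplumber one-liner), a few lines of Python can turn them into a page list and a document type. The threshold of 10 characters mirrors the "0 / single digits" rule of thumb above; the function names and labels are assumptions for illustration.

```python
from typing import List


def flag_sparse_pages(char_counts: List[int], min_chars: int = 10) -> List[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [page for page, count in enumerate(char_counts, start=1) if count < min_chars]


def classify_pdf(char_counts: List[int], min_chars: int = 10) -> str:
    """Label the document as digitally native, scanned, or mixed."""
    if not char_counts:
        return "empty"
    sparse = flag_sparse_pages(char_counts, min_chars)
    if not sparse:
        return "digitally native"
    if len(sparse) == len(char_counts):
        return "scanned"
    return "mixed"


counts = [1840, 2011, 0, 3, 1975]  # hypothetical output for a 5-page PDF
print(flag_sparse_pages(counts))
print(classify_pdf(counts))
```

The flagged page numbers are the ones to OCR in Step 2; a "mixed" result is common when a scanned appendix was appended to a native report.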

Step 2: OCR scanned pages yourself

ChatGPT does not reliably OCR scans on Free/Go/Plus/Pro — it expects a text layer. Add one locally with ocrmypdf (Tesseract-based). Scan quality matters: aim for 300 DPI or higher, high-contrast, deskewed pages.

# Install ocrmypdf
brew install ocrmypdf  # macOS
sudo apt install ocrmypdf   # Ubuntu/Debian

# OCR (English + Simplified Chinese; the matching Tesseract language data must be installed)
ocrmypdf input.pdf output.pdf --language eng+chi_sim

# If a partial/broken text layer already exists, force a clean redo
ocrmypdf input.pdf output.pdf --force-ocr --deskew --language eng+chi_sim

Upload output.pdf; the “I can’t see any content on that page” answer should disappear once every page has a text layer.

Step 3: High-quality structured extraction to Markdown

Complex tables, math, and multi-column layouts need a layout-aware extractor, not raw pdftotext. These produce Markdown that preserves table structure and reading order:

# Marker (CPU or GPU) — strong tables, formulas, and section structure
pip install marker-pdf
marker_single input.pdf --output_dir ./output --max_pages 300
# Outputs output/input/input.md

# Or Docling (IBM) — clean Markdown/HTML, best-in-class table fidelity
pip install docling
docling input.pdf --to md --output ./output

Upload the resulting Markdown instead of the PDF. Retrieval quality and table accuracy both jump because the model now indexes clean, structured text instead of a scrambled stream.

Which extractor when (as of June 2026):

Tool	Best for	Notes
`ocrmypdf` (Tesseract)	Scanned / image-only pages	Adds a text layer in place; keeps it a PDF
Docling (IBM)	Dense tables, multi-column papers	Its TableFormer model leads on table accuracy (~88% F1 in 2026 benchmarks); pure-Python install; Markdown/HTML/JSON out
Marker (`marker-pdf`)	Math, formulas, high-fidelity layout	Markdown out; GPU much faster but CPU works; strongest on equations
MinerU	Complex/HTML-rendered tables, mixed CJK docs	Newer (2026) open-source option; fast, handles complex tables well
`pdfplumber`	Targeted single-table pull to CSV	Scriptable; precise table extraction
Adobe Acrobat OCR	One-off, no command line	GUI; solid OCR for paper scans

For a RAG-style “answer from these tables” job, Docling is the safest default; reach for Marker when equations or layout fidelity matter most. All of these still mis-detect multi-level headings occasionally, so spot-check the section order before you trust a long conversion.
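If you pull a single table yourself, for instance as a list of rows from pdfplumber's table extraction, you still need to serialize it before upload. Below is a minimal sketch of that last step, assuming the table arrives as a list of rows with `None` for empty cells and that the first row is the header; the function name is hypothetical.

```python
from typing import List, Optional, Sequence


def to_markdown_table(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        text = "" if cell is None else str(cell)
        # Keep each cell on one line and escape pipes so columns stay aligned.
        return text.replace("\n", " ").replace("|", "\\|").strip()

    padded: List[List[str]] = [
        [clean(cell) for cell in row] + [""] * (width - len(row)) for row in rows
    ]
    lines = ["| " + " | ".join(padded[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in padded[1:])
    return "\n".join(lines)


table = [["Quarter", "Revenue"], ["Q1", "100"], ["Q2", None]]
print(to_markdown_table(table))
```

Pasting the result into the chat, or saving it as a .md file, keeps every value attached to its column header.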

Step 4: Extract tables explicitly

Direct ChatGPT prompt:

The document contains a table on page 12. Extract that table:
- Quote the exact title / caption
- Quote every column header in order
- Quote each data row verbatim, row by row
- Format as Markdown table

Forcing structured extraction → model can’t flatten into prose.

Step 5: Split huge PDFs into 30-50 page chunks

If you suspect the 2M-token index ceiling (cause 5 above), split the document so each part is fully indexed and retrieval has fewer fragments to confuse:

# Fixed-size split by pages
qpdf --split-pages=30 large.pdf part-%d.pdf

# By chapter (take each chapter's page range from the bookmarks or table of contents)
pdftk full.pdf cat 1-46 output ch1.pdf
pdftk full.pdf cat 47-92 output ch2.pdf

Upload each chunk separately and query it separately. This guarantees the tail chapters are actually indexed instead of being dropped past the 2M-token cap.
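For documents without usable bookmarks, the page ranges can be computed rather than typed by hand. The sketch below generates 1-based inclusive ranges and the matching `pdftk ... cat ... output ...` command strings in the form used above; the function names and the `part-N.pdf` naming are assumptions, and the commands are only printed, not executed.

```python
import shlex
from typing import List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int = 30) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (first, last) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def pdftk_commands(source: str, total_pages: int, chunk_size: int = 30) -> List[str]:
    """Build one pdftk command per chunk; the source path is shell-quoted."""
    return [
        f"pdftk {shlex.quote(source)} cat {first}-{last} output part-{index}.pdf"
        for index, (first, last) in enumerate(chunk_ranges(total_pages, chunk_size), start=1)
    ]


for command in pdftk_commands("full.pdf", total_pages=100, chunk_size=30):
    print(command)
```

Review the printed commands before running them; the last chunk is simply whatever pages remain.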

Step 6: Decrypt before upload

# If you have the password
qpdf --decrypt --password=YOURPASSWORD encrypted.pdf decrypted.pdf

Upload decrypted.pdf.

Step 7: Critical figure pages → upload as images for vision

If extraction skips figures, convert those pages to PNG / JPG and let vision read them:

# pdftoppm (PDF → images)
pdftoppm -r 200 -png mypdf.pdf page

# Outputs page-1.png, page-2.png, ... (numbers are zero-padded, e.g. page-01.png, when the PDF has 10 or more pages)

Upload critical pages as images and ask vision to read directly — better than fighting bad extraction.

How to confirm the fix

Open a fresh chat, upload the OCR’d / Markdown version, ask the same question — quotes + page numbers Ctrl+F-able in the source PDF = truly fixed.
Have it quote one sentence verbatim from page 50 — exact match against the source PDF = lossless extraction.
Have a colleague run the same OCR / conversion pipeline and upload — same results = stable process.
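The verbatim-quote check can also be scripted when Ctrl+F is impractical, for example across many pages of locally extracted text. The sketch below compares a quote against source text after light normalization (Unicode NFKC, which also folds ligatures such as "ﬁ", straightened quotes, collapsed whitespace). The normalization choices and function names are assumptions; a stricter or looser match may suit your documents better.

```python
import re
import unicodedata

# Map curly quotes to straight ones so typography differences do not cause misses.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text: str) -> str:
    """Fold ligatures and quote styles, then collapse all whitespace runs."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip()


def quote_in_source(quote: str, source_text: str) -> bool:
    """True if the normalized quote appears in the normalized source text."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(source_text)


source = "Large files: the cap\n  is 512 MB and 2 million tokens per \ufb01le."
print(quote_in_source("the cap is 512 MB", source))
print(quote_in_source("the cap is 256 MB", source))
```

A quote that fails this check was either paraphrased by the model or drawn from a garbled extraction, and both cases mean the fix is not yet confirmed.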

If still broken

Cut PDF to minimum: keep just the 1-3 problem pages, convert to Markdown, upload — see if depth comes back.
Swap extraction tool: Marker, then Docling, then Adobe Acrobat — compare the Markdown each produces and upload the cleanest.
Switch platform for high-stakes work. Claude (Opus 4.7 / Sonnet 4.6) has a 1M-token context at standard pricing, so a moderately large PDF can sit in context rather than behind retrieval; Gemini 3.1 Pro also offers 1M tokens; Google’s NotebookLM is purpose-built for grounded document Q&A with citations.
Package the source PDF + extracted Markdown + a bad-answer screenshot + the expected content, and file a ticket at help.openai.com.

Prevention

Always sanity-check before relying: ask ChatGPT to quote one sentence from page 3. If it can’t, fix extraction first.
Prefer digitally native PDFs (export directly from Word / LaTeX / Pages); avoid print-and-rescan chains.
For high-stakes analysis (finance / compliance / contracts), always convert to Markdown locally before upload — never raw PDF.
Table / number-dense pages → pre-convert to CSV / Markdown, because values stay attached to their row and column and accuracy improves markedly.
If you only need one chapter of a 200-page report, split that chapter out locally — don’t upload the whole thing.

FAQ

How do I fix “No text could be extracted from this file”?

That error means the parser found no usable text layer. Two fast fixes in order: (1) open the PDF in Chrome or Edge and re-save with Print -> Save as PDF (Step 0) — this rebuilds a clean text layer and often clears it instantly; (2) if it is a true scan, run ocrmypdf locally (Step 2) to add a real text layer, then re-upload. If both fail, the file is likely encrypted (qpdf --is-encrypted) or corrupt.

Why does ChatGPT say there’s no content on a page that clearly has text?

That page is almost certainly a scanned image with no text layer. ChatGPT reads the extracted text, not the pixels, and on Free/Go/Plus/Pro it does not reliably OCR scans. Run ocrmypdf locally (Step 2) and re-upload.

What is the maximum PDF size ChatGPT accepts?

512 MB per file, and a separate 2,000,000-token index cap per file (as of June 2026). A PDF can pass the size limit but exceed the token cap, in which case the tail is silently dropped. For reliable indexing, keep files under roughly 25 MB and split very long documents.

Why does ChatGPT ignore the charts and figures in my PDF?

On Free, Go, Plus, and Pro, PDF ingestion is text-only — embedded images and charts are not read at all. Only ChatGPT Enterprise’s “Visual Retrieval with PDFs” feature interprets figures during a normal upload. To get a chart read on a consumer plan, export that page to PNG with pdftoppm and upload it as an image so vision reads it directly (Step 7).

Why are the numbers in tables wrong even though the PDF looks fine?

PDF tables are just drawn lines plus loose text boxes. The extractor often loses the row/column grid and flattens cells into a single stream, so values land in the wrong column. Convert the table to Markdown or CSV locally (Marker, Docling, or pdfplumber) and upload that instead of the raw PDF.

Should I just use a different tool for big or scanned PDFs?

For high-stakes documents, yes. Claude (Opus 4.7 / Sonnet 4.6) and Gemini 3.1 Pro both offer a 1M-token context window, so a moderately large file fits in context instead of behind a retrieval index, and NotebookLM is built specifically for cited document Q&A. ChatGPT is fine once you feed it clean, structured text.
