ChatGPT doesn’t “see” your PDF; it reads text extracted by a PDF parser. Scanned pages with no text layer, multi-column layouts extracted in the wrong order, tables flattened into prose, and embedded fonts that decode to garbled characters all turn the original into noisy text. The model then summarizes that bad text, and the answer is wrong. Most PDF analysis failures aren’t model failures; they happen in the extraction layer. Fix order: verify extraction first, then deal with the model.
Common causes
Ordered by hit rate, highest first.
1. Scanned pages have no text layer
The most common failure. PDFs made from phone photos / paper scans look normal, but each page is one big image. There is no text to extract, so the model gets an empty page or an “[image]” placeholder and answers “no content on that page.”
How to spot it: In a PDF reader (Preview / Adobe), try mouse-selecting text on the problem page. Can’t select = no text layer = scanned page.
2. Multi-column / complex layout extraction order is wrong
Academic papers / two-column magazine PDFs: the extractor often reads straight across the page (left to right, top to bottom) instead of column by column, so it interleaves lines from the left column with lines from the right column and the sentences come out scrambled.
How to spot it: Ask it to “quote page 3 paragraph 1 verbatim” — output reads as oddly stitched sentences / no clear subject-verb chain = order is wrong.
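To see why reading straight across the page scrambles two-column text, here is a minimal Python sketch. The `Word` tuple, the coordinates, and the `gutter_x` split point are all made up for illustration; real extractors work from much richer layout data and have to detect the columns themselves.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge, in points from the left of the page
    y: float  # top edge, in points from the top of the page


def row_major(words: list[Word]) -> str:
    """Read straight across the page: top to bottom, then left to right."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_aware(words: list[Word], gutter_x: float) -> str:
    """Read the whole left column first, then the whole right column."""
    left = [w for w in words if w.x < gutter_x]
    right = [w for w in words if w.x >= gutter_x]
    return row_major(left) + " " + row_major(right)


# Hypothetical two-column page: left column starts at x=50, right at x=320.
page = [
    Word("Revenue", 50, 100), Word("grew", 110, 100),
    Word("Costs", 320, 100), Word("fell", 370, 100),
    Word("in", 50, 115), Word("Q2.", 70, 115),
    Word("in", 320, 115), Word("Q3.", 340, 115),
]

print(row_major(page))          # Revenue grew Costs fell in Q2. in Q3.
print(column_aware(page, 300))  # Revenue grew in Q2. Costs fell in Q3.
```

The first line of output is the "oddly stitched" text the model ends up summarizing; the second is what a layout-aware extractor would hand it.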
3. Tables flattened, rows / columns lost
Tables in PDFs are just drawn lines + text boxes; the extractor often loses the row/column structure and outputs a flat stream of numbers. “Q1 revenue 100, Q2 200” can come out as “Q1 revenue Q2 100 200.”
How to spot it: Table-related questions wrong / numbers off-by-one column = this case. Asking “extract Table 2 row by row” produces a mess = confirms.
4. Embedded fonts / non-standard CIDs produce garbled output
CJK PDFs / math formulas / unusual fonts often use embedded subset fonts with non-standard CID maps. Extraction yields “���” or random English letter substitutions in place of characters.
How to spot it: Ask it to quote a passage and the output is obviously garbled / character-substituted = font issue.
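If you have the extracted text locally, a rough check is to measure how much of it consists of replacement, private-use, unassigned, or control characters. The sketch below is a heuristic under that assumption, not a definitive detector: `garble_ratio` is a hypothetical helper, and it will not catch the case where glyphs are silently swapped for valid letters.

```python
import unicodedata


def garble_ratio(text: str) -> float:
    """Fraction of non-space characters that look like extraction debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = 0
    for c in chars:
        category = unicodedata.category(c)
        # U+FFFD is the Unicode replacement character. "Co" is private use,
        # "Cn" is unassigned, "Cc" is a control character.
        if c == "\ufffd" or category in {"Co", "Cn", "Cc"}:
            suspicious += 1
    return suspicious / len(chars)


print(garble_ratio("Quarterly revenue grew."))  # 0.0
print(garble_ratio("ab\ufffd\ue000"))           # 0.5
```

A page whose ratio is well above zero is a candidate for OCR or for a different extraction tool; the cutoff you pick is a judgment call.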
5. Huge PDF triggers “skim” mode
> 50MB / > 500 pages may trigger a different extraction path — only first pages + metadata are read, deep content missing.
How to spot it: Ask it to “list each chapter’s page count” against the real TOC; clearly missing tail chapters = large-file path engaged.
6. Encrypted / DRM-protected PDF can’t be extracted
Password-protected or DRM PDFs upload but extraction fails. Model returns “I cannot access this file’s content.”
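How to spot it: Run qpdf --is-encrypted your.pdf locally. The command prints nothing; exit status 0 means the file is encrypted, 2 means it is not. qpdf --show-encryption your.pdf prints the details.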
7. Images / figures completely ignored
Images / charts / flowcharts in a PDF — the extraction layer only sees “[image]” placeholders. The model doesn’t reference figure content because it never saw it.
How to spot it: A chart obviously contains numbers but the answer never mentions any = extraction never read the figure.
Before you start
- Confirm whether this is happening in a Project, a Custom GPT, or plain chat; PDF handling differs slightly across the three.
- Duplicate the chat before retesting so history doesn’t pollute the next diagnostic.
- Confirm your plan: Free / Plus / Team / Enterprise differ in per-file size caps.
Info to collect
- PDF type (digitally native / scanned / mixed), page count, file size (MB).
- Whether it has non-ASCII / math formulas / tables / figures / multi-column layout.
- Full prompt text + bad-answer screenshot; specifically which pages / which table / which numbers were wrong.
- Current model + upload route (chat / Project / Custom GPT).
Shortest fix path
Ordered by ROI. The first two solve ~70% of cases.
Step 1: Verify text-selectability page by page
Open in a local reader (macOS Preview / Adobe / browser):
- Go to the pages with bad content.
- Try mouse-selecting the text.
- Selects = text layer present; doesn’t = scanned / bitmap.
Or batch-check via CLI:
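# Count extracted characters per page (pdfinfo and pdftotext ship with poppler-utils)
pages=$(pdfinfo your.pdf | awk '/^Pages:/ {print $2}')
for i in $(seq 1 "$pages"); do printf 'page %s: ' "$i"; pdftotext -f "$i" -l "$i" your.pdf - | wc -m; done
# Or use Python
pip install pdfplumber
python -c "import pdfplumber; pdf = pdfplumber.open('your.pdf'); print([len(p.extract_text() or '') for p in pdf.pages])"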
Per-page char counts all 0 / single digits = scanned or broken extraction.
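Once you have the per-page character counts (for example, the list printed by the pdfplumber one-liner), a few lines of Python can turn them into the page numbers worth inspecting. This is a minimal sketch: `flag_suspect_pages` is a hypothetical helper, the 10-character threshold is an assumption you should tune, and the sample counts are invented.

```python
def flag_suspect_pages(char_counts: list[int], min_chars: int = 10) -> list[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [
        page
        for page, count in enumerate(char_counts, start=1)
        if count < min_chars
    ]


counts = [1840, 0, 3, 2210, 0]     # hypothetical per-page character counts
print(flag_suspect_pages(counts))  # [2, 3, 5]
```

Pages flagged here are the ones to try selecting text on, and the ones OCR is most likely to help.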
Step 2: OCR scanned pages
# Install ocrmypdf (Tesseract-based)
brew install ocrmypdf # macOS
apt install ocrmypdf # Ubuntu
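# OCR (English + Simplified Chinese; needs the matching Tesseract language packs)
ocrmypdf input.pdf output.pdf --language eng+chi_sim
# Mixed PDF: OCR only the pages that have no text layer
ocrmypdf input.pdf output.pdf --skip-text --language eng+chi_sim
# Discard any existing text layer and redo OCR on every page
ocrmypdf input.pdf output.pdf --force-ocr --language eng+chi_sim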
Upload the output PDF. If a missing text layer was the cause, the “I can’t see content” answers should stop.
Step 3: High-quality structured extraction to Markdown
Complex tables / formulas / multi-column needs dedicated tooling:
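# Marker (CPU or GPU): usually extracts much better than ChatGPT's built-in parser
pip install marker-pdf
# Syntax for older Marker releases; Marker 1.x uses --output_dir and --page_range instead (check marker_single --help)
marker_single input.pdf ./output --max_pages 300
# Outputs output/input/input.md (with tables, formulas, sections)
# Or Docling (IBM); --to sets the format, --output sets the output directory
pip install docling
docling input.pdf --to md --output ./output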
Upload the Markdown to ChatGPT — retrieval quality + table accuracy both jump significantly.
Step 4: Extract tables explicitly
Direct ChatGPT prompt:
The document contains a table on page 12. Extract that table:
- Quote the exact title / caption
- Quote every column header in order
- Quote each data row verbatim, row by row
- Format as Markdown table
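Forcing structured output makes it much harder for the model to flatten the table into prose, and a messy result tells you the extracted text itself has lost the structure.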
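A quick way to sanity-check the Markdown table that comes back is to confirm every data row has as many cells as the header. The sketch below assumes a simple pipe-delimited table with no escaped pipes inside cells; `parse_markdown_table` and `ragged_rows` are hypothetical helper names, and the sample table is invented.

```python
import re


def parse_markdown_table(markdown: str) -> list[list[str]]:
    """Split a pipe-delimited Markdown table into rows of cell strings."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Drop the outer pipes, then split on the inner ones.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the |---|---| delimiter row under the header.
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue
        rows.append(cells)
    return rows


def ragged_rows(rows: list[list[str]]) -> list[int]:
    """Return indexes of data rows whose cell count differs from the header's."""
    if not rows:
        return []
    width = len(rows[0])
    return [i for i, row in enumerate(rows[1:], start=1) if len(row) != width]


reply = """
| Quarter | Revenue |
|---------|---------|
| Q1      | 100     |
| Q2 |
"""
rows = parse_markdown_table(reply)
print(rows)               # [['Quarter', 'Revenue'], ['Q1', '100'], ['Q2']]
print(ragged_rows(rows))  # [2]
```

A non-empty result from `ragged_rows` means a row lost or gained a column somewhere, which is the off-by-one-column symptom described earlier.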
Step 5: Split huge PDFs into 30-50 page chunks
# Fixed-size split by pages
qpdf --split-pages=30 large.pdf part-%d.pdf
# By chapter (read each chapter's page range from the bookmarks / TOC; pdftk full.pdf dump_data lists bookmark page numbers)
pdftk full.pdf cat 1-46 output ch1.pdf
pdftk full.pdf cat 47-92 output ch2.pdf
Upload each chunk separately + query separately — bypasses the large-file skim path.
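If you want explicit control over chunk boundaries, the page ranges are simple to compute. The sketch below builds one qpdf command per chunk using qpdf's `--empty --pages ... --` page-selection syntax; `chunk_ranges`, `qpdf_commands`, and the output file naming are hypothetical choices for illustration, and the commands are only built, not executed.

```python
def chunk_ranges(total_pages: int, chunk_size: int = 30) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def qpdf_commands(pdf: str, total_pages: int, chunk_size: int = 30) -> list[list[str]]:
    """Build one qpdf argument list per chunk (pass each to subprocess.run)."""
    return [
        ["qpdf", "--empty", "--pages", pdf, f"{first}-{last}", "--",
         f"part-{first:04d}-{last:04d}.pdf"]
        for first, last in chunk_ranges(total_pages, chunk_size)
    ]


print(chunk_ranges(100, 30))
# [(1, 30), (31, 60), (61, 90), (91, 100)]
for command in qpdf_commands("large.pdf", 45, 30):
    print(" ".join(command))
```

Zero-padding the page numbers in the file names keeps the parts in order when sorted alphabetically, which makes it easier to upload and query them in sequence.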
Step 6: Decrypt before upload
# If you have the password
qpdf --decrypt --password=YOURPASSWORD encrypted.pdf decrypted.pdf
Upload decrypted.pdf.
Step 7: Critical figure pages → upload as images for vision
If extraction skips figures, convert those pages to PNG / JPG and let vision read them:
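# pdftoppm (PDF → images); add -f 12 -l 14 to convert only pages 12-14
pdftoppm -r 200 -png mypdf.pdf page
# Outputs page-1.png, page-2.png, ... (numbers are zero-padded in longer documents, e.g. page-01.png)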
Upload critical pages as images and ask vision to read directly — better than fighting bad extraction.
How to confirm the fix
- Open a fresh chat, upload the OCR’d / Markdown version, ask the same question — quotes + page numbers Ctrl+F-able in the source PDF = truly fixed.
- Have it quote one sentence verbatim from page 50 — exact match against the source PDF = lossless extraction.
- Have a colleague run the same OCR / conversion pipeline and upload — same results = stable process.
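The verbatim-quote check can be automated against locally extracted text. The sketch below is one possible approach, not a definitive matcher: `normalize` and `quote_in_source` are hypothetical helpers, and the normalization choices (folding ligatures, straightening curly quotes, rejoining words hyphenated at line ends, ignoring case) are assumptions. Rejoining hyphens also removes legitimate hyphens that happen to fall at a line break.

```python
import re
import unicodedata

# Map curly quotes to their straight equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text: str) -> str:
    """Fold ligatures, curly quotes, line-end hyphens, whitespace, and case."""
    text = unicodedata.normalize("NFKC", text)  # e.g. the "fi" ligature becomes "fi"
    text = text.translate(QUOTES)
    text = re.sub(r"-\s*\n\s*", "", text)       # rejoin words split at line ends
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_source(quote: str, source_text: str) -> bool:
    """True if the quote appears in the source once layout noise is ignored."""
    return normalize(quote) in normalize(source_text)


source = "The \ufb01rst quarter\nshowed \u201cstrong\u201d growth in reve-\nnue across   regions."
print(quote_in_source('The first quarter showed "strong" growth in revenue', source))  # True
print(quote_in_source("The second quarter showed growth", source))                     # False
```

A quote that fails even this forgiving comparison was most likely paraphrased or invented, which points back to an extraction problem.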
If still broken
- Cut PDF to minimum: keep just the 1-3 problem pages, convert to Markdown, upload — see if depth comes back.
- Swap extraction tool: Marker → Docling → Adobe Acrobat — compare outputs.
- Switch platform: for high-stakes PDF analysis you can also try Claude Projects, Gemini Workspace, or dedicated RAG tools (NotebookLM).
- Package source PDF + extracted Markdown + bad-answer screenshot + expected content, file a ticket at help.openai.com.
Prevention
- Always sanity-check before relying: ask ChatGPT to quote one sentence from page 3. If it can’t, fix extraction first.
- Prefer digitally native PDFs (export directly from Word / LaTeX / Pages); avoid print-and-rescan chains.
- For high-stakes analysis (finance / compliance / contracts), always convert to Markdown locally before upload — never raw PDF.
- Table / number-dense pages → pre-convert to CSV / Markdown; accuracy on numeric questions usually improves noticeably.
- If you only need one chapter of a 200-page report, split that chapter out locally — don’t upload the whole thing.