ChatGPT: 'No Text Could Be Extracted From This File' (Scanned / Handwritten PDF)

Scanned or handwritten PDFs return 'No text could be extracted from this file' or hallucinated content because the extractor only reads the text layer. Fastest fix: upload page images, not the PDF.

Published: May 24, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

You upload a scanned contract or a photo of handwritten notes saved as PDF, and ChatGPT replies “No text could be extracted from this file”, returns an empty extraction, or, worse, hallucinates plausible-sounding content that does not match the page. The cause: the default PDF extractor (pdfplumber / PyPDF inside Code Interpreter / Advanced Data Analysis) reads only the document’s text layer. A pure-image PDF has no text layer, so the extractor gets an empty string.

Fastest fix (works in under a minute): don’t upload the PDF. Convert each page to a PNG or JPG and upload the images instead. ChatGPT routes images through its vision pipeline (GPT-5.5 as of June 2026), which OCRs printed and handwritten text directly. Users consistently report images extract “perfectly” where the same content as a PDF fails.

If you need a searchable PDF (for an archive, a recurring pipeline, or a colleague), OCR the file upstream first — steps below.

Has this changed? (June 2026)

Newer models (GPT-5.4 / GPT-5.5) sometimes auto-apply vision OCR to image-only PDFs in the web app, so an occasional scanned PDF now reads when it would have failed a year ago. But the behavior is inconsistent — it depends on your tier and which backend pipeline handles the upload, and Free/older paths still treat image-only PDFs as blank. Treat auto-OCR as a bonus, not a guarantee. The reliable play is still: upload images, or OCR before upload, and always verify the result (see “How to confirm the fix”).

Which bucket are you in?

Symptom	Likely cause	Go to
`No text could be extracted from this file`, every page blank	No text layer at all (pure scan / photo)	Step 1, then Step 2 or 5
Typed text extracts, handwriting comes back empty	OCR skips handwriting	Step 5 (vision)
Garbled output like `~~%@#`	Wrong OCR language pack	Step 3 (set `-l`)
Reads but invents content not on the page	Model hallucinating over an empty extraction	Step 1 to prove it’s empty, then Step 5
Error on a digital PDF that has selectable text	Malformed / nonstandard PDF structure	Step 6 (flatten)

Common causes

1. PDF has no text layer at all

Scanned-to-PDF output from a printer, photos saved as PDF, and “Print to PDF” of an image all lack a text layer, so pdfplumber’s page.extract_text() returns an empty string (the Step 1 probe also guards against None). ChatGPT then either declines (“No text could be extracted from this file”) or, if you pushed it to “just summarize,” invents content.

How to spot it: open the PDF in any reader and try to select text with your cursor. If you cannot select it (the cursor stays an arrow, not a text caret), there is no text layer.

2. Handwriting where typed OCR fails

Even OCR-preprocessed PDFs often skip handwriting. Acrobat’s OCR is tuned for printed type. Handwritten cursive in margins, signatures, and form fills are typically dropped. Open-source OCR is worse: as of 2026, Tesseract is effectively unusable on real handwriting (benchmarks put it well below a usable threshold on cursive). For handwriting, go straight to vision (Step 5).

How to spot it: typed sections extract fine; handwritten sections come back blank.

3. Built-in extraction kicks in before vision

When you attach a PDF, Code Interpreter reaches for pdfplumber first. Even if the model could use vision, the PDF path does not reliably auto-fall-back to it. Attaching images instead sends the content straight to the vision pipeline — which is why Step 5 works.

4. Low-quality scan: skew, low DPI, JPEG artifacts

A scan below 200 DPI, skewed more than 5 degrees, or saved as a heavily compressed JPEG-in-PDF will confound most OCR engines. Tesseract wants at least 300 DPI for printed text (where it hits 95%+ accuracy); below ~200 DPI accuracy collapses.
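To check a scan against these thresholds, the effective resolution can be worked out from the embedded image's pixel width and the PDF page width. The sketch below assumes you already know both numbers (for example from your PDF viewer's properties dialog); the function names and the "marginal" bucket are illustrative, not part of any library.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points (1/72 inch)


def effective_dpi(image_width_px, page_width_pt):
    """Resolution of a scanned page image given the PDF page width in points."""
    return image_width_px / (page_width_pt / POINTS_PER_INCH)


def scan_quality(dpi):
    """Bucket a DPI value using the 200 / 300 DPI thresholds quoted above."""
    if dpi >= 300:
        return "ok"
    if dpi >= 200:
        return "marginal"  # assumed label for the 200-299 DPI range
    return "rescan"


# A US Letter page is 612 pt wide (8.5 in). A 1275 px wide scan is 150 DPI.
print(scan_quality(effective_dpi(1275, 612)))  # prints rescan
```

A 2550 px wide image on the same page works out to 300 DPI, which is the resolution to aim for when rescanning.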

5. Non-English handwriting / mixed scripts

OCR engines need a language pack matching the script. A Chinese handwritten note OCR’d with English-only Tesseract returns garbage. Acrobat Pro defaults to your locale and does not auto-detect mixed scripts.
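Before running OCR, it can help to confirm that the language packs you need are actually installed. The sketch below parses the text printed by `tesseract --list-langs`; it assumes the usual layout of one header line followed by one language code per line, and the helper names and sample text are hypothetical.

```python
def installed_langs(list_langs_output):
    """Parse the text printed by `tesseract --list-langs`.

    Assumes the usual layout: one header line, then one language code per line.
    """
    lines = [ln.strip() for ln in list_langs_output.splitlines() if ln.strip()]
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def missing_langs(list_langs_output, wanted):
    """Return the wanted language codes that are not installed, in order."""
    have = installed_langs(list_langs_output)
    return [code for code in wanted if code not in have]


# Illustrative output only; the real list depends on the local install.
sample = "List of available languages (3):\neng\nosd\nfra\n"
print(missing_langs(sample, ["eng", "chi_sim"]))  # prints ['chi_sim']
```

To feed it real data, capture the command's output with `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)` and pass `stdout` (check `stderr` too if `stdout` comes back empty).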

Shortest path to fix

Step 1: Check whether the PDF has a text layer

Ask ChatGPT to run this in Code Interpreter (or run it locally):

import pdfplumber

with pdfplumber.open("/mnt/data/file.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""
        print(f"page {i+1}: {len(text)} chars")

If every page reports 0 chars, you need OCR or vision. If some pages have text and some don’t, the doc is mixed — OCR only the missing pages. (This also proves whether any “summary” the model gave you was real or hallucinated: 0 chars means it had nothing to summarize.)
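For a mixed document, the same probe can be extended to list exactly which pages need OCR. This is a minimal sketch: the `min_chars` cutoff of 20 and both helper names are assumptions for illustration, not something pdfplumber prescribes.

```python
def pages_needing_ocr(char_counts, min_chars=20):
    """Return 1-based page numbers whose extracted text is too short to trust.

    min_chars is an assumed cutoff: a scan can carry a few stray characters
    (a stamp, a page number) without having a real text layer.
    """
    return [i + 1 for i, n in enumerate(char_counts) if n < min_chars]


def count_chars(pdf_path):
    """Per-page character counts from the text layer (requires pdfplumber)."""
    import pdfplumber  # imported lazily so the helper above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        return [len(page.extract_text() or "") for page in pdf.pages]


# Example with made-up counts: pages 1 and 3 have no usable text layer.
print(pages_needing_ocr([0, 1500, 3, 800]))  # prints [1, 3]
```

On a real file, call `pages_needing_ocr(count_chars("/mnt/data/file.pdf"))` and OCR only the pages it returns.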

Step 2: Upload page images instead of the PDF (fastest, no install)

The quickest reliable fix. Export each page as an image and attach the images to the chat:

Already have the file open? In most PDF viewers: export/save each page as PNG or JPG, or screenshot at high zoom.
Scripted, from the PDF:

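# pdf2image wraps Poppler, so the Poppler utilities must be installed
from pdf2image import convert_from_path

# Render every page at 300 DPI and save one PNG per page
images = convert_from_path("/mnt/data/file.pdf", dpi=300)
for i, img in enumerate(images):
    img.save(f"/tmp/page_{i+1}.png")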

Then click the paperclip in ChatGPT, attach page_1.png, page_2.png, …, and say: “Transcribe the text on each image, including any handwriting, then summarize.” Vision handles printed and handwritten text in one pass.

For long documents, attach in batches of 5–10 pages so the chat doesn’t run out of context.
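Batching is easy to get wrong because a plain alphabetical sort puts page_10.png before page_2.png. The sketch below orders files by page number and splits them into groups; it assumes file names end in a number before the extension, as in the export script above, and the helper names are illustrative.

```python
import re


def page_number(path):
    """Extract the page number from a name like 'page_12.png' (0 if absent)."""
    match = re.search(r"(\d+)(?=\.\w+$)", path)
    return int(match.group(1)) if match else 0


def batches(paths, size=10):
    """Sort paths by page number, then split them into upload-sized groups."""
    ordered = sorted(paths, key=page_number)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


pages = [f"/tmp/page_{n}.png" for n in range(1, 24)]
for group in batches(pages, size=10):
    print(len(group), group[0], "...", group[-1])
```

For a 23-page document this yields groups of 10, 10, and 3 pages, each in reading order.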

Step 3: OCR with Tesseract (free, scriptable, for a searchable file)

When you need a searchable text file or PDF — e.g. for a recurring pipeline:

# macOS install
brew install tesseract tesseract-lang
brew install poppler  # provides pdftoppm

# Convert PDF pages to images at 300 DPI, then OCR
pdftoppm -r 300 input.pdf page -png
for f in page-*.png; do
  tesseract "$f" "${f%.png}" -l eng
done

# Concatenate the OCR text
cat page-*.txt > ocr-output.txt

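Use -l eng+chi_sim for mixed English / Simplified Chinese (any installed pack works: -l deu, -l fra, etc.). Upload ocr-output.txt to ChatGPT instead of the original PDF. If you need a searchable PDF rather than plain text, append pdf to the command (tesseract "$f" "${f%.png}" -l eng pdf) to get one PDF with a text layer per page. Note: Tesseract is strong on clean printed scans (95%+ at 300 DPI) but poor on handwriting; for handwriting use Step 5 instead.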
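If the rest of your pipeline is in Python, the same Tesseract call can be driven from a script. This is a sketch under the assumption that the tesseract binary is on your PATH; `build_tesseract_cmd` and `ocr_pages` are hypothetical helper names, and the trailing `pdf` argument is Tesseract's config for searchable-PDF output.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, langs=("eng",), make_pdf=False):
    """Build the tesseract command line for one page image.

    The output base is the image path without its extension, so
    page-1.png produces page-1.txt (or page-1.pdf when make_pdf is True).
    """
    out_base = str(Path(image_path).with_suffix(""))
    cmd = ["tesseract", str(image_path), out_base, "-l", "+".join(langs)]
    if make_pdf:
        cmd.append("pdf")  # config name that requests searchable PDF output
    return cmd


def ocr_pages(image_paths, langs=("eng",), make_pdf=False):
    """Run tesseract on each page; raises CalledProcessError on failure."""
    for path in image_paths:
        subprocess.run(build_tesseract_cmd(path, langs, make_pdf), check=True)


print(build_tesseract_cmd("page-1.png", langs=("eng", "chi_sim")))
# prints ['tesseract', 'page-1.png', 'page-1', '-l', 'eng+chi_sim']
```

Keeping command construction separate from execution makes it easy to check the exact arguments before running OCR over a whole folder.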

Step 4: Adobe Acrobat Pro OCR (best printed-scan quality, GUI)

In Acrobat Pro: Tools → Scan and OCR → Recognize Text → In This File. Pick the correct language, then Save. The new PDF has a selectable text layer; re-upload to ChatGPT and the built-in extractor works. Acrobat handles skew correction and low-DPI cleanup automatically — best results with the least effort if you have a Pro license. For recurring work, record an Action Wizard action so the whole folder OCRs in one click.

Step 5: For handwriting, send page images to vision (most accurate)

When OCR genuinely fails (cursive, faded ink, non-Latin handwriting), bypass OCR entirely and let ChatGPT’s vision read the images you exported in Step 2:

Transcribe the handwriting on each image verbatim. If a word is illegible, mark it [illegible] instead of guessing. Then give me a clean summary.

The [illegible] instruction is important: it stops the model from inventing words to fill gaps, a common source of plausible-but-wrong transcriptions. Vision models handle cursive noticeably better than open-source OCR engines such as Tesseract.
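Once the transcription comes back, the [illegible] markers also give a rough measure of how much of the page the model could not read. The sketch below counts them; the helper name is hypothetical, and what share is "too high" is a judgment call the source does not specify.

```python
import re

MARKER = re.compile(r"\[illegible\]", re.IGNORECASE)


def illegible_report(transcript):
    """Count [illegible] markers and express them as a share of all words."""
    words = transcript.split()
    marks = len(MARKER.findall(transcript))
    share = marks / len(words) if words else 0.0
    return marks, share


marks, share = illegible_report("Dear [illegible], send the [illegible] by Friday.")
print(marks, f"{share:.0%}")
```

A high share suggests rephotographing the page with better lighting before asking for a summary.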

Step 6: If it’s a digital PDF that still errors — flatten it

If the PDF does have selectable text but still throws “No text could be extracted from this file”, the file structure is likely nonstandard (some third-party exporters produce these). Open it in Chrome or Edge, choose Print → Save as PDF (or “Microsoft Print to PDF”), and re-upload the flattened copy. This rebuilds a clean structure ChatGPT can parse. Alternatively, select the text in your PDF reader and paste it straight into the chat box.

Alternative: Google Docs auto-OCR for a one-off file (no install)

Upload the PDF to Google Drive, right-click → Open with → Google Docs. Drive auto-OCRs and converts. Copy the resulting text into ChatGPT, or export as a TXT/DOCX and upload. Works decently on clear, consistent handwriting. Free, nothing to install.

How to confirm the fix

After OCR, re-run the text-layer probe before trusting any summary:

import pdfplumber
with pdfplumber.open("/mnt/data/ocred.pdf") as pdf:
    print("pages:", len(pdf.pages))
    print("first page chars:", len(pdf.pages[0].extract_text() or ""))
    print("sample:", (pdf.pages[0].extract_text() or "")[:200])

Non-zero char counts and a recognizable sample = OCR succeeded. If the sample looks like ~~%@#, the OCR engine matched the wrong language pack — re-run with the correct -l argument. If you uploaded images instead of a PDF, the verification is simpler: spot-check the model’s transcription against two or three lines you can read on the original page.
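The "recognizable sample" check can be roughly automated when you are processing many files. The heuristic below flags text that is mostly symbols; the 0.6 cutoff and the function name are assumptions rather than calibrated values, so treat a flag as a prompt to look at the page, not a verdict.

```python
def looks_garbled(sample, min_ratio=0.6):
    """Heuristic: flag text whose visible characters are mostly symbols.

    min_ratio is an assumed cutoff, not a calibrated value. str.isalnum()
    is Unicode-aware, so CJK and accented letters count as text.
    """
    visible = [ch for ch in sample if not ch.isspace()]
    if not visible:
        return True  # nothing extracted at all
    ratio = sum(ch.isalnum() for ch in visible) / len(visible)
    return ratio < min_ratio


print(looks_garbled("~~%@# ^^|| }{"))                      # prints True
print(looks_garbled("This agreement is made on 12 May."))  # prints False
```

Pass it the same 200-character sample the probe prints; an empty sample is also reported as garbled because it means nothing was extracted.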

Prevention

Before sharing a PDF with ChatGPT, verify you can select text with your cursor. If not, upload images or OCR first.
For handwritten notes, photograph individual pages with good lighting and upload them as images, not as PDF — vision handles images natively and skips the broken PDF path.
For recurring work (invoice processing, contract review), build the OCR step once (Tesseract script or Acrobat Action Wizard) and feed only the OCR output to ChatGPT.
Scan at 300 DPI minimum and save with an embedded text layer (Acrobat does this by default with OCR on).
For mixed-language documents, pass every language the page contains (for example -l eng+chi_sim); if separate sections use entirely different scripts, OCR each section with its own language pack.

FAQ

Why does ChatGPT say “No text could be extracted from this file”? The PDF is image-only (a scan or photo) with no text layer, so the extractor reads zero characters. Upload the pages as images instead, or OCR the file first.

Can ChatGPT OCR a scanned PDF by itself now? Sometimes. As of June 2026, GPT-5.4/5.5 may auto-apply vision OCR to image PDFs, but it’s inconsistent across tiers and backends and Free/older paths still fail. Don’t rely on it — upload images or OCR upstream for predictable results.

Why does it make up content that isn’t in my scan? When extraction returns an empty string and you ask for a summary anyway, the model fills the void with plausible guesses. Run the Step 1 probe to confirm whether it actually read anything before trusting the output.

Images or PDF — which should I upload? Images. Attaching PNG/JPG pages routes content straight through vision (handles printed and handwritten text); a PDF goes through text-layer extraction first, which fails on scans.

What’s the best free option for handwriting? ChatGPT’s own vision on uploaded page images, or Google Docs auto-OCR for clear writing. Tesseract is strong on printed scans but unreliable on handwriting as of 2026.

My OCR output is garbled symbols — what’s wrong? The OCR ran with the wrong language pack. Re-run Tesseract with the matching -l code (e.g. -l chi_sim for Simplified Chinese, -l eng+chi_sim for mixed), or set the correct language in Acrobat before recognizing text.
