You upload a scanned contract or a photo of handwritten notes saved as PDF, and ChatGPT says “I can’t read this,” returns an empty extraction, or worse, hallucinates plausible-sounding content that does not match the page. The cause is that the default PDF extractor (PyPDF / pdfplumber inside Code Interpreter) reads only the text layer. A pure-image PDF has no text layer. The model’s vision capability can read images, but PDFs are not automatically routed through vision. The fix is to run OCR before upload, or convert each page to an image and feed it through vision explicitly.
Common causes
1. PDF has no text layer at all
Scans saved to PDF by a printer, photos saved as PDF, and “Print to PDF” output from an image all lack a text layer. pdfplumber’s page.extract_text() returns an empty string (None in some older versions), so ChatGPT has nothing to read: it either reports that the file has no text or invents content.
How to spot it: Open the PDF in any reader and try to select text with your cursor. If you cannot select (cursor stays as an arrow, not a text caret), there’s no text layer.
2. Handwriting where typed OCR fails
Even a PDF that has already been through OCR often lacks the handwritten parts. Acrobat’s OCR is tuned for printed text. Handwritten cursive in margins, signatures, and form fills are typically dropped.
How to spot it: Typed sections extract fine, handwritten sections come back blank.
3. Built-in extraction kicks in before vision
Code Interpreter sees a PDF and reaches for pdfplumber first. Even if the model could use vision, it doesn’t auto-fall-back. You need to explicitly say “convert each page to an image and look at it.”
4. Low-quality scan: skew, low DPI, JPEG artifacts
A scan below 200 DPI, skewed by more than 5 degrees, or saved as a heavily compressed JPEG-in-PDF will confound most OCR engines. Tesseract’s accuracy can drop below 50% on such input.
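The 200 DPI floor can be checked with simple arithmetic before running OCR. The sketch below assumes you already know the pixel width of the embedded page image and the page width in PDF points (72 points per inch); the function names and the `min_dpi` default are illustrative, not part of any library.

```python
def effective_dpi(pixel_width: int, page_width_pt: float) -> float:
    """Return the scan resolution implied by an image's pixel width.

    PDF page sizes are measured in points, and 72 points equal one inch.
    """
    if page_width_pt <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / (page_width_pt / 72.0)


def needs_rescan(pixel_width: int, page_width_pt: float, min_dpi: float = 200.0) -> bool:
    """True when the implied resolution is below min_dpi (assumed floor: 200)."""
    return effective_dpi(pixel_width, page_width_pt) < min_dpi


# US Letter is 612 points (8.5 inches) wide.
print(effective_dpi(1275, 612))  # 150.0
print(needs_rescan(1275, 612))   # True
print(needs_rescan(2550, 612))   # False
```

This only covers resolution. Skew and JPEG artifacts need a visual check or an image-processing step.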
5. Non-English handwriting / mixed scripts
OCR engines need a language pack matching the script. A Chinese handwritten note OCR’d with English Tesseract returns garbage. Acrobat Pro defaults to your locale but does not auto-detect mixed scripts.
Shortest path to fix
Step 1: Check whether the PDF has a text layer
import pdfplumber
with pdfplumber.open("/mnt/data/file.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""
        print(f"page {i+1}: {len(text)} chars")
If every page reports 0 chars, you need OCR. If some have text and some don’t, the doc is mixed — OCR only the missing pages.
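For a mixed document, the per-page counts can be turned into a list of pages to OCR. This is a minimal sketch: `pages_needing_ocr`, `count_chars_per_page`, and the `min_chars` threshold are hypothetical names and values chosen for illustration. A small non-zero threshold is an assumption that helps when a scanned page carries only a stamped page number or similar stray text.

```python
def pages_needing_ocr(char_counts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [i + 1 for i, n in enumerate(char_counts) if n < min_chars]


def count_chars_per_page(pdf_path):
    """Collect per-page character counts with pdfplumber (imported lazily)."""
    import pdfplumber  # third-party; only needed when reading a real file

    with pdfplumber.open(pdf_path) as pdf:
        return [len(page.extract_text() or "") for page in pdf.pages]


# Example with made-up counts: pages 2 and 4 are scanned images.
print(pages_needing_ocr([1843, 0, 2210, 7]))  # [2, 4]
```

With a real file, `pages_needing_ocr(count_chars_per_page("/mnt/data/file.pdf"))` would give the pages to send through OCR.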
Step 2: OCR locally with Adobe Acrobat Pro
In Acrobat Pro: Tools > Scan and OCR > Recognize Text > In This File. Pick the correct language. Save. The new PDF has a selectable text layer. Re-upload to ChatGPT and the built-in extractor works.
Acrobat handles skew correction and low-DPI cleanup automatically — best results with the least effort if you have a Pro license.
Step 3: OCR with Tesseract for free / scripted workflows
# macOS install
brew install tesseract tesseract-lang
brew install poppler  # for pdftoppm

# Convert PDF pages to images, then OCR
pdftoppm -r 300 input.pdf page -png
for f in page-*.png; do
    tesseract "$f" "${f%.png}" -l eng
done

# Concatenate the OCR text
cat page-*.txt > ocr-output.txt
Use -l eng+chi_sim for mixed English / Simplified Chinese. Upload ocr-output.txt to ChatGPT instead of the original PDF.
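For a scripted workflow, the same loop can be driven from Python. The sketch below assumes the `tesseract` binary is on your PATH and that the page images were produced by the pdftoppm command above, so sorting the file names gives page order. The helper names `tesseract_command` and `ocr_pages` are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, languages):
    """Build the argument list for one Tesseract run.

    The output base is the image path without its extension, so
    page-01.png produces page-01.txt next to it.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    image = Path(image_path)
    output_base = image.with_suffix("")
    # Tesseract joins multiple language packs with "+", e.g. eng+chi_sim.
    return ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]


def ocr_pages(directory, languages):
    """Run Tesseract over page-*.png in sorted order and return the combined text."""
    chunks = []
    for image in sorted(Path(directory).glob("page-*.png")):
        subprocess.run(tesseract_command(image, languages), check=True)
        chunks.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(chunks)


print(tesseract_command("page-01.png", ["eng", "chi_sim"]))
# ['tesseract', 'page-01.png', 'page-01', '-l', 'eng+chi_sim']
```

Passing the languages as a list keeps the `-l` argument in one place, which makes it easier to re-run with a different language pack when the output looks wrong.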
Step 4: Use Google Docs auto-OCR for one-off files
Upload the PDF to Google Drive. Right-click > Open with > Google Docs. Drive auto-OCRs and converts. Copy the resulting text. Paste into ChatGPT or export as a TXT file and upload.
Works decently on handwriting when the writing is clear and consistent. Free, no install.
Step 5: For handwriting, send page images to vision
When OCR genuinely fails (cursive, faded ink), bypass OCR and use ChatGPT’s vision:
from pdf2image import convert_from_path  # needs poppler installed

# Render each page at 300 DPI and save it as a PNG
images = convert_from_path("/mnt/data/file.pdf", dpi=300)
for i, img in enumerate(images):
    img.save(f"/tmp/page_{i+1}.png")
Then attach page_1.png, page_2.png, and so on directly in the chat and say: “Transcribe the handwriting on each image, then summarize.” On cursive, vision models typically do noticeably better than conventional OCR engines.
For very long documents, do this in batches of 5-10 pages so the chat doesn’t run out of context.
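Splitting the page images into batches is a small slicing exercise. A minimal sketch, assuming the images are named as in the snippet above; `batch_pages` is a hypothetical helper and the batch size of 5 is just the low end of the suggested range.

```python
def batch_pages(paths, batch_size=5):
    """Split a list of page image paths into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


# A 12-page document becomes batches of 5, 5, and 2 pages.
pages = [f"page_{n}.png" for n in range(1, 13)]
for number, group in enumerate(batch_pages(pages, 5), start=1):
    print(f"batch {number}: {group[0]} to {group[-1]} ({len(group)} pages)")
```

Upload one batch per message and ask for the transcription of that batch before moving on to the next.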
How to confirm the fix
After OCR, re-run the text-layer probe before trusting any summary:
import pdfplumber
with pdfplumber.open("/mnt/data/ocred.pdf") as pdf:
    first_page_text = pdf.pages[0].extract_text() or ""
    print("pages:", len(pdf.pages))
    print("first page chars:", len(first_page_text))
    print("sample:", first_page_text[:200])
Non-zero char counts and a recognizable sample = OCR succeeded. If sample looks like ~~%@#, the OCR engine matched the wrong language pack — re-run OCR with the correct -l argument.
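The "recognizable sample" check can be roughly automated by measuring how much of the text is letters or digits. This is a heuristic sketch only: `readable_ratio`, `looks_like_garbage`, and the 0.6 cutoff are assumptions made for illustration, and the check will not catch output that is made of real letters but wrong words.

```python
def readable_ratio(text):
    """Fraction of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalnum() for ch in visible) / len(visible)


def looks_like_garbage(text, min_ratio=0.6):
    """Flag OCR output dominated by symbols (min_ratio is an assumed cutoff)."""
    return readable_ratio(text) < min_ratio


print(looks_like_garbage("~~%@# |_| ^^ ))"))                    # True
print(looks_like_garbage("This Agreement is made on 4 March"))  # False
```

Because `str.isalnum` is Unicode-aware, correctly recognized Chinese or other non-Latin text also counts as readable. Treat a flagged page as a prompt to read the sample yourself, not as a verdict.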
Prevention
- Before sharing a PDF with ChatGPT, always verify you can select text with your cursor. If not, OCR first.
- For workflows that recur (invoice processing, contract review), build an OCR pipeline once with Tesseract or Adobe Acrobat Pro Action Wizard and feed only the OCR output to ChatGPT.
- For handwritten notes, photograph individual pages with good lighting and upload as images, not as PDF — vision handles images natively.
- Scan at 300 DPI minimum, save as PDF with embedded text layer (Acrobat does this by default with OCR).
- For mixed-language documents, OCR each section with the correct language pack rather than running one pass over the whole file.