You upload a 200-page PDF to Gemini 2.5 Pro, ask “summarize Chapter 3 with exact page citations,” and get one of three failures: no page numbers at all, citations like “[page 5]” that don’t match the real page 5, or hallucinated page numbers that don’t exist in the doc.
PDF citation accuracy is the single biggest gap between Gemini’s marketed long-context capability and how researchers actually use it. The fix is almost always: improve the OCR layer, prompt for citations more aggressively, and verify with quoted text rather than trusting page numbers alone.
Common causes
Listed roughly from most to least frequent:
1. PDF is scanned, OCR is weak (most common)
Scanned PDFs (from a copier, mobile scanner app, or older archive) usually have no text layer, or only a poor OCR text layer. Gemini’s vision can read the pages but loses page-number context: the model sees images, not “page 47 of 200.”
How to judge: try to select text in the PDF in a reader. If you can’t, there’s no text layer. If you can but it has typos, OCR is weak.
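The same check can be roughed out in code once you have the text of each page as a string. The sketch below is only a heuristic under stated assumptions: `classify_text_layer` is a hypothetical helper, `page_texts` is whatever your PDF text extractor returns (one string per page), and both thresholds are illustrative guesses rather than tuned values.

```python
import re

# A token is "word-like" if it is letters/digits (optionally joined by ' or -)
# with punctuation only at its edges. OCR noise such as "t#e" fails this.
WORDLIKE_RE = re.compile(r"[\W_]*[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*[\W_]*")


def classify_text_layer(page_texts, min_chars_per_page=50, min_wordlike_ratio=0.8):
    """Return 'none', 'weak', or 'ok' for a list of per-page text strings.

    Thresholds are assumptions for illustration, not measured values.
    """
    total_chars = sum(len(t.strip()) for t in page_texts)
    # Almost no extractable text means there is effectively no text layer.
    if not page_texts or total_chars < min_chars_per_page * len(page_texts):
        return "none"
    tokens = " ".join(page_texts).split()
    wordlike = sum(1 for t in tokens if WORDLIKE_RE.fullmatch(t))
    ratio = wordlike / len(tokens)
    return "ok" if ratio >= min_wordlike_ratio else "weak"
```

A result of "none" or "weak" suggests running OCR before uploading; "ok" only means the text looks plausible, not that it is correct.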
2. Page numbers in the PDF don’t match logical page numbers
A book PDF might have 12 unnumbered front-matter pages before “page 1” prints. Gemini’s reported “page 47” might be physical page 47 (= page 35 in the book’s own numbering) or vice versa.
How to judge: open the PDF in a viewer and check whether printed page numbers match PDF page indices.
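For the example above (12 unnumbered front-matter pages), the conversion is plain arithmetic. This is a minimal sketch with hypothetical helper names; it assumes 1-based page numbers and a single numbered run after the front matter, so it does not cover books with roman-numeral front matter or restarted numbering.

```python
def pdf_index_to_printed(pdf_page, front_matter_pages):
    """Convert a physical PDF page index to the book's printed page number.

    Returns None for pages inside the unnumbered front matter.
    """
    printed = pdf_page - front_matter_pages
    return printed if printed >= 1 else None


def printed_to_pdf_index(printed_page, front_matter_pages):
    """Convert a printed page number back to the physical PDF page index."""
    return printed_page + front_matter_pages


# Physical page 47 with 12 front-matter pages is printed page 35.
print(pdf_index_to_printed(47, 12))
```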
3. Prompt didn’t demand citations
Gemini by default summarizes without citations. Asking “summarize Chapter 3” gives a summary; asking “summarize Chapter 3, citing exact page number and a quoted phrase for every claim” gives citations.
How to judge: if re-prompting with an explicit citation requirement produces citations, the prompt was the problem.
4. Long doc + small output budget = citations dropped first
When the output cap is tight, the model trims citations as “extras” and keeps the prose. With a small output budget (for example, a limit around 8K tokens), a multi-chapter summary tends to lose its page numbers first.
5. Hallucinated citations on edge cases
If Gemini wasn’t certain which page a claim came from, it sometimes invents a plausible-looking number. This is a known long-context failure mode.
6. Wrong surface — gemini.google.com less reliable than API
The consumer app has a lighter retrieval pipeline than AI Studio or the API. For citation-grade work, it is the weakest surface.
Shortest path to fix
Step 1: Run OCR before uploading
If your PDF is scanned, run OCR first. Best options:
- Adobe Acrobat: Tools → Scan & OCR → Recognize Text → In This File. Highest quality, especially for tables.
- ABBYY FineReader: Best for complex layouts, multi-column.
- macOS Preview (built-in OCR via Quick Actions): free, decent for clean scans.
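- Google Drive: upload PDF, right-click → Open with Google Docs (Google OCRs automatically). Note that the result is a Google Doc rather than a PDF with a text layer, so the original page boundaries may not survive.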
- Adobe Acrobat web free tier handles small docs.
Verify: open the OCR’d PDF, try to select text. Selection should match what you see.
Step 2: Prompt explicitly for page-cited quotes
Instead of “summarize chapter 3,” use:
Summarize Chapter 3.
For EVERY claim, you must:
1. Cite the exact PDF page number (page X)
2. Include a short quoted phrase (5-15 words) from that page
3. If you cannot find a quoted phrase to support a claim, OMIT the claim
If a page number is uncertain, say "page uncertain" rather than guessing.
This pattern forces the model to attach evidence and prevents the worst hallucination mode (inventing numbers).
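If you reuse this prompt often, it may help to keep the rules in one place and attach them to any task. A minimal sketch, where `CITATION_RULES` and `build_citation_prompt` are hypothetical names and the rule text is copied from the prompt above:

```python
# The citation rules from the prompt above, kept as one reusable constant.
CITATION_RULES = """For EVERY claim, you must:
1. Cite the exact PDF page number (page X)
2. Include a short quoted phrase (5-15 words) from that page
3. If you cannot find a quoted phrase to support a claim, OMIT the claim
If a page number is uncertain, say "page uncertain" rather than guessing."""


def build_citation_prompt(task):
    """Append the citation rules to a plain task description."""
    return f"{task.strip()}\n{CITATION_RULES}"


print(build_citation_prompt("Summarize Chapter 3."))
```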
Step 3: Verify by spot-check
After getting citations, open the PDF and check 2-3 random ones. If the page numbers are off by a consistent amount, that is the front-matter offset (see cause 2), which is easy to correct mentally. If the page numbers look random, either the OCR is bad or the model hallucinated; go back to step 1.
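The spot-check logic can be written down as a small sketch. `diagnose_offsets` is a hypothetical helper; it assumes you have recorded, for a few citations, the page Gemini cited and the PDF page where you actually found the quote.

```python
def diagnose_offsets(checks):
    """Classify spot-check results.

    checks: list of (cited_page, actual_page) pairs from manual verification.
    Returns (verdict, offset), where offset is actual minus cited.
    """
    offsets = [actual - cited for cited, actual in checks]
    if not offsets:
        return ("no data", None)
    if len(set(offsets)) == 1:
        # Every citation is off by the same amount (possibly zero).
        off = offsets[0]
        return ("exact", 0) if off == 0 else ("consistent offset", off)
    # Different offsets per citation: likely weak OCR or hallucinated pages.
    return ("inconsistent", None)


print(diagnose_offsets([(35, 47), (40, 52), (61, 73)]))
```

A consistent offset points to front-matter numbering; an inconsistent result is the signal to revisit OCR.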
Step 4: For citation-grade work, use the API
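from google import genai
from google.genai import types

# Assumes the google-genai SDK is installed; replace the placeholder key.
client = genai.Client(api_key="YOUR_API_KEY")

# Read the PDF as raw bytes so it can be sent inline with the prompt.
with open("paper.pdf", "rb") as f:
    pdf_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        "For each main argument, cite the exact PDF page and a verbatim quote.",
    ],
    # A generous output budget so citations are not trimmed (see cause 4).
    config=types.GenerateContentConfig(max_output_tokens=32768),
)
print(response.text)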
API + Gemini 2.5 Pro is the highest-quality surface for citation work.
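Once you have the response text, pulling the citations into a structured list makes them easier to check. The sketch below is an assumption-heavy illustration: `extract_citations` is a hypothetical helper, and it assumes the model followed a format where each cited line contains exactly one "(page N)" marker and at least one double-quoted phrase. Real output may vary, so lines that do not fit are simply skipped.

```python
import re

PAGE_RE = re.compile(r"\(page (\d+)\)")
# Accept straight or curly double quotes around the quoted phrase.
QUOTE_RE = re.compile(r'["\u201c]([^"\u201d]+)["\u201d]')


def extract_citations(response_text):
    """Return a list of (page_number, quote) pairs found line by line."""
    citations = []
    for line in response_text.splitlines():
        pages = PAGE_RE.findall(line)
        quotes = QUOTE_RE.findall(line)
        # Keep only unambiguous lines: one page marker plus a quoted phrase.
        if len(pages) == 1 and quotes:
            citations.append((int(pages[0]), quotes[0]))
    return citations
```

Lines marked "page uncertain" contain no page marker, so they are left out and can be reviewed by hand.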
Step 5: Split very long docs into sections first
For PDFs over 100 pages, processing one chapter at a time gives sharper citations than asking for whole-book analysis. Use a separate API call per chapter and reassemble.
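Planning the per-chapter calls starts with page ranges. The sketch below uses hypothetical helpers and assumes you know each chapter's first PDF page (1-based); the actual PDF splitting is left to whatever tool you prefer. One detail worth keeping in mind: if each chapter is uploaded as its own file, the model's page numbers are relative to that file, so they need shifting back.

```python
def chapter_page_ranges(chapter_starts, total_pages):
    """Turn {chapter_name: first_pdf_page} into (name, first, last) tuples."""
    ordered = sorted(chapter_starts.items(), key=lambda kv: kv[1])
    ranges = []
    for i, (name, first) in enumerate(ordered):
        # A chapter ends one page before the next one starts.
        last = ordered[i + 1][1] - 1 if i + 1 < len(ordered) else total_pages
        ranges.append((name, first, last))
    return ranges


def to_original_page(page_in_chunk, chunk_first_page):
    """Map a page number cited within a chapter file back to the full PDF."""
    return chunk_first_page + page_in_chunk - 1


print(chapter_page_ranges({"Chapter 1": 13, "Chapter 2": 58, "Chapter 3": 121}, 200))
```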
Step 6: Cross-check against the source
For research / legal / academic use, never trust a single-model citation. Verify by opening the page in a PDF viewer and confirming the quote exists. This step takes 30 seconds per citation and saves embarrassment.
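Part of this check can be automated when the PDF has a usable text layer. A minimal sketch, assuming `page_texts` holds the extracted text of each page (index 0 is PDF page 1) and using a hypothetical `check_citation` helper. It only does normalized substring matching, so it will miss quotes that the model paraphrased or that OCR garbled, and it does not replace reading the page yourself.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_citation(page_texts, cited_page, quote, window=1):
    """Return ('match', page), ('nearby', page), or ('missing', None)."""
    needle = _normalize(quote)
    if not needle:
        return ("missing", None)

    def has(page):
        # Pages are 1-based; out-of-range pages never match.
        return 1 <= page <= len(page_texts) and needle in _normalize(page_texts[page - 1])

    if has(cited_page):
        return ("match", cited_page)
    # Look at neighbouring pages in case the number is slightly off.
    for delta in range(1, window + 1):
        for page in (cited_page - delta, cited_page + delta):
            if has(page):
                return ("nearby", page)
    return ("missing", None)
```

A "nearby" result usually means the quote is real but the page number drifted; "missing" is the case to treat as a possible hallucination.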
Prevention
- Pre-OCR every scanned PDF before any Gemini analysis — set it as a habit
- Keep a standard “cite-with-quote” prompt snippet you reuse for all citation work
- For multi-chapter docs, process one chapter per turn rather than the whole doc — citations stay sharper
- Always spot-check at least 3 random citations before quoting Gemini’s output anywhere external
- If you’re doing this routinely, set up an API workflow rather than fighting the consumer app — citation quality is materially better