Gemini PDF Citations Missing or Wrong Page Numbers

Asked Gemini for page-numbered PDF citations and got vague refs, wrong pages, or invented numbers? It's almost always OCR quality, weak prompting, or the wrong surface. Verified fixes for June 2026.

Published: May 24, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You upload a 200-page PDF to Gemini 3.1 Pro, ask “summarize Chapter 3 with exact page citations,” and get one of three failures: no page numbers at all, citations like [page 5] that don’t match the real page 5, or invented page numbers that don’t exist in the doc.

Fastest fix: make sure the PDF has a real, selectable text layer (OCR it if it’s a scan), then prompt for a verbatim quote alongside every page number. If you need citations you can actually trust at scale, stop using the chat app and use the API’s File Search tool, which returns a real page_number for every grounded chunk. Details below.

PDF citation accuracy is the single biggest gap between Gemini’s marketed long-context capability and how researchers actually use it. The model can read the words; what it loses is the link between a sentence and the page it sat on.

Which bucket are you in?

Symptom	Most likely cause	Jump to
No page numbers at all	Prompt didn’t demand them, or output got truncated	Step 2 / cause 3; Step 4 / cause 4
Page numbers off by a fixed amount (e.g. always +12)	Front-matter offset (printed page vs PDF index)	Step 3 / cause 2
Page numbers random / quotes don’t exist on that page	Weak OCR text layer, or hallucination	Step 1 / cause 1; Step 2 / cause 5
Citations vanish only on long, multi-chapter requests	Output budget too small, citations trimmed first	Step 4 / cause 4
Works in API but not in the app	Consumer app is the weakest surface for this	Step 5 / cause 6

Common causes

By frequency:

1. PDF is scanned, OCR is weak (most common)

Scanned PDFs (from a copier, a phone scanner app, or an older archive) usually have no text layer, or a poor one. Gemini’s vision can read the page image, but it sees pixels, not “page 47 of 200,” so page attribution gets shaky.

How to judge: open the PDF in any reader and try to select a sentence with your cursor. If nothing highlights, there’s no text layer. If text highlights but is full of typos when you copy it, the OCR layer is bad.
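If you have a folder of PDFs, the same select-and-copy test can be roughly automated. The sketch below is one way to do it, not a calibrated tool: `classify_text_layer` and its thresholds are illustrative assumptions, the "clean token" heuristic is tuned for alphabetic languages, and `extract_page_texts` assumes the third-party `pypdf` package is installed.

```python
import string

def extract_page_texts(pdf_path):
    """Return the extracted text of each page (assumes `pip install pypdf`)."""
    from pypdf import PdfReader  # third-party, imported lazily
    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]

def _is_clean(token):
    # Strip surrounding punctuation, ignore inner apostrophes and hyphens,
    # then require a purely alphabetic or purely numeric core.
    core = token.strip(string.punctuation + "“”‘’")
    core = core.replace("'", "").replace("’", "").replace("-", "")
    return core.isalpha() or core.isdigit()

def classify_text_layer(page_texts, min_chars=50, min_clean_ratio=0.8):
    """Classify a PDF's text layer as 'none', 'poor', or 'ok'.

    min_chars and min_clean_ratio are illustrative guesses, not measured values.
    """
    pages_with_text = [t for t in page_texts if len(t.strip()) >= min_chars]
    if not pages_with_text:
        return "none"  # nothing selectable: the PDF is image-only
    tokens = " ".join(pages_with_text).split()
    clean = sum(1 for tok in tokens if _is_clean(tok))
    return "ok" if clean / len(tokens) >= min_clean_ratio else "poor"

# Usage with text you already extracted (no file needed):
print(classify_text_layer(["", "  "]))
```

A result of "none" or "poor" means the PDF should go through OCR before you upload it.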

2. Printed page numbers don’t match PDF page indices

A book PDF might have 12 unnumbered front-matter pages before “page 1” ever prints. So Gemini’s “page 47” might be physical PDF page 47 (which prints as page 35 in the book), or vice versa. This produces a consistent offset, which is the good news — it’s correctable.

How to judge: open the PDF in a viewer that shows the index (e.g. 47 / 200) and compare it to the number printed on the page.
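Once you know how many unnumbered front-matter pages the file has, converting between the two numbering schemes is simple arithmetic. A minimal sketch, assuming the body numbering starts at 1 and has no gaps (the function names are hypothetical):

```python
def printed_to_pdf_index(printed_page, front_matter_pages):
    """Printed page number -> 1-based PDF page index."""
    return printed_page + front_matter_pages

def pdf_index_to_printed(pdf_index, front_matter_pages):
    """1-based PDF page index -> printed page number.

    Returns None for front-matter pages, which carry no printed number
    under the assumption stated above.
    """
    printed = pdf_index - front_matter_pages
    return printed if printed >= 1 else None

# The example from the text: 12 front-matter pages, PDF page 47.
print(pdf_index_to_printed(47, 12))
```

Books that number their front matter in roman numerals, or restart numbering per chapter, need a lookup table instead of a single offset.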

3. The prompt didn’t demand citations

Gemini summarizes without citations by default. “Summarize Chapter 3” gives prose. “Summarize Chapter 3, and for every claim give the PDF page number and a quoted phrase” gives evidence-backed citations.

How to judge: re-prompt with an explicit citation requirement (Step 2). If citations appear, this was it.

4. Long doc plus a small output budget drops citations first

When the output cap is tight, the model treats page numbers and quotes as trimmable “extras” and keeps the prose. A multi-chapter summary under a small max_output_tokens will quietly lose its citations.
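One way to spot this pattern in an answer you already have is to check which paragraphs still carry a citation. The sketch below is a heuristic only: it assumes citations are written as "page N" and that paragraphs are separated by blank lines, and both function names are hypothetical.

```python
import re

PAGE_CITATION = re.compile(r"\bpage\s+\d+\b", re.IGNORECASE)

def _paragraphs(answer_text):
    return [p.strip() for p in re.split(r"\n\s*\n", answer_text) if p.strip()]

def uncited_paragraphs(answer_text):
    """Return 0-based indices of paragraphs with no 'page N' citation."""
    return [i for i, p in enumerate(_paragraphs(answer_text))
            if not PAGE_CITATION.search(p)]

def citations_trail_off(answer_text):
    """True when citations are present early and missing from every later paragraph."""
    flags = [bool(PAGE_CITATION.search(p)) for p in _paragraphs(answer_text)]
    if not any(flags) or all(flags):
        return False
    last_cited = max(i for i, cited in enumerate(flags) if cited)
    return last_cited < len(flags) - 1 and all(flags[: last_cited + 1])

answer = (
    'Claim A (page 3, "first quote").\n\n'
    'Claim B (page 9, "second quote").\n\n'
    "Claim C with no source.\n\n"
    "Claim D with no source."
)
print(uncited_paragraphs(answer), citations_trail_off(answer))
```

If citations trail off toward the end of a long answer, suspect the output cap before you suspect the prompt.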

5. Hallucinated citations on edge cases

When Gemini isn’t sure which page a claim came from, it sometimes invents a plausible-looking number. This is a known long-context failure mode and is exactly why you cite a quote, not just a number — a fake quote is far easier to catch than a fake page number.

6. Wrong surface: the Gemini app is weaker than the API

The consumer app at gemini.google.com runs a lighter retrieval pipeline. It will show clickable page references when you upload a PDF, but for citation-grade work it is the weakest of the three surfaces (app < AI Studio < API File Search).

Shortest path to fix

Step 1: Give the PDF a real text layer (OCR scans first)

If selection failed in cause 1, OCR before you upload. As of June 2026:

Adobe Acrobat (Pro): Tools -> Scan & OCR -> Recognize Text, choose language and pages, then run. Highest quality, especially for tables and multi-column layouts.
ABBYY FineReader: best for complex layouts and dense academic typesetting.
macOS Preview (built-in): open the scan, then File -> Export... and tick “Embed Text” (this is Apple’s OCR; available since macOS Sonoma and still present in Tahoe). Free, fine for clean scans.
Google Drive: upload the PDF, right-click -> Open with -> Google Docs. Drive OCRs it on conversion; export back to PDF.
Acrobat web (free tier): handles small docs without a desktop install.

Verify the OCR: open the new PDF and select a paragraph. The highlighted selection should match what your eyes see, with no garbled characters.

Step 2: Prompt explicitly for page-cited quotes

Don’t write “summarize Chapter 3.” Write:

Summarize Chapter 3.

For EVERY claim, you must:
1. Cite the exact PDF page number (page X).
2. Include a short verbatim quote (5-15 words) from that page.
3. If you cannot find a supporting quote, OMIT the claim.

If a page number is uncertain, write "page uncertain" instead of guessing.

Demanding a verbatim quote is what defeats the hallucination mode in cause 5: a quote either exists on the page or it doesn’t, and you can confirm it in seconds.
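The quote check itself is easy to script once you have the text of the cited page. The following is a minimal sketch with hypothetical helper names; the normalization steps are assumptions about what typically differs between a model's quote and extracted PDF text (whitespace, curly quotes, ligatures, line-break hyphenation).

```python
import re
import unicodedata

_QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})

def normalize(text):
    """Reduce text to a form where harmless extraction differences disappear."""
    text = unicodedata.normalize("NFKC", text)   # e.g. the 'fi' ligature becomes 'fi'
    text = text.replace("\u00ad", "")            # drop soft hyphens
    text = re.sub(r"-\s*\n\s*", "", text)        # rejoin words split across lines
    text = text.translate(_QUOTE_MAP)            # straighten curly quotes
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_on_page(quote, page_text):
    """True if the quote appears verbatim (after normalization) in the page text."""
    return normalize(quote) in normalize(page_text)

page = "The model can read the words; what it loses is the link\nbetween a sentence and the page."
print(quote_on_page("loses is the  link between a sentence", page))
```

The hyphen rule also merges genuinely hyphenated words that happen to break at a line end, so treat a failed match as a prompt to look at the page, not as proof of a fabricated quote.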

Step 3: Spot-check, and detect the offset pattern

Open the PDF and check 2-3 random citations against the source.

If every page is off by the same amount, that’s the front-matter offset from cause 2. Note the offset (e.g. “PDF index = printed page + 12”) and correct mentally, or tell Gemini “cite the printed page number shown on the page, not the PDF index.”
If pages are randomly wrong or quotes don’t appear on the cited page, the OCR is bad or the model hallucinated. Go back to Step 1, or move to the File Search path in Step 5.
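The two outcomes above can be told apart mechanically from your spot-check notes. This sketch assumes you recorded, for each checked citation, the page Gemini cited and the PDF index where you actually found the quote; `diagnose_offsets` is a hypothetical name.

```python
def diagnose_offsets(spot_checks):
    """Classify spot-check results.

    spot_checks: list of (cited_page, actual_pdf_index) pairs.
    Returns ("correct", 0), ("constant_offset", n), or ("inconsistent", None).
    """
    if len(spot_checks) < 2:
        raise ValueError("need at least two spot checks to see a pattern")
    diffs = [actual - cited for cited, actual in spot_checks]
    if all(d == 0 for d in diffs):
        return ("correct", 0)
    if len(set(diffs)) == 1:
        return ("constant_offset", diffs[0])
    return ("inconsistent", None)

# Three citations, each found 12 pages later than cited.
print(diagnose_offsets([(35, 47), (88, 100), (120, 132)]))
```

A "constant_offset" result points to front matter; "inconsistent" points to OCR problems or hallucination.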

Step 4: Raise the output budget on long summaries

If citations only vanish on big requests (cause 4), the output cap is the culprit. In the API, set a generous max_output_tokens (32768 is comfortable for a long chapter). In the app, ask for one section at a time so the response never approaches the limit. Splitting a 100+ page PDF into per-chapter requests reliably produces sharper citations than asking for whole-book analysis in one shot.
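Planning the per-chapter requests is a small bookkeeping task. A minimal sketch, assuming you know the PDF index where each chapter starts (for example from the table of contents plus the front-matter offset); the function names and prompt wording are illustrative, not a prescribed format.

```python
def chapter_ranges(chapter_starts, total_pages):
    """Turn chapter start pages into inclusive (start, end) PDF page ranges."""
    starts = sorted(set(chapter_starts))
    if not starts or starts[0] < 1 or starts[-1] > total_pages:
        raise ValueError("chapter starts must lie within 1..total_pages")
    ranges = []
    for i, start in enumerate(starts):
        # Each chapter ends one page before the next one starts.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((start, end))
    return ranges

def section_prompt(start, end):
    """Build one request's prompt, scoped to a single page range."""
    return (
        f"Summarize PDF pages {start}-{end} only. For EVERY claim, cite the "
        "exact PDF page number and include a short verbatim quote."
    )

for start, end in chapter_ranges([13, 41, 78], total_pages=120):
    print(section_prompt(start, end))
```

Sending one prompt per range keeps each response well under the output cap, so citations are less likely to be trimmed.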

Step 5: For citation-grade work, use the API File Search tool

This is the biggest change since this article was first written. The Gemini API now ships a built-in RAG tool, File Search, that returns a real, verifiable page_number for every grounded chunk instead of asking the model to recall the page from memory. This is the most reliable way to get trustworthy citations as of June 2026.

from google import genai
from google.genai import types
import time

client = genai.Client(api_key="YOUR_API_KEY")

# 1. Create a File Search store
store = client.file_search_stores.create(
    config={"display_name": "research-pdfs"}
)

# 2. Upload the PDF and import it into the store
uploaded = client.files.upload(file="paper.pdf")
op = client.file_search_stores.import_file(
    file_search_store_name=store.name,
    file_name=uploaded.name,
)
while not op.done:
    time.sleep(5)
    op = client.operations.get(op)

# 3. Ask, grounded against the store
response = client.models.generate_content(
    model="gemini-3.1-pro",
    contents="For each main argument, give a verbatim quote.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(
            file_search=types.FileSearch(
                file_search_store_names=[store.name]
            )
        )],
    ),
)

print(response.text)

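# 4. Read the real page numbers from grounding metadata.
# Guard against missing metadata: it can be absent when nothing was retrieved.
meta = response.candidates[0].grounding_metadata
chunks = (meta.grounding_chunks or []) if meta else []
for chunk in chunks:
    ctx = chunk.retrieved_context
    if ctx is not None:
        print(ctx.page_number, "-", ctx.title)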

The page numbers come from grounding_metadata.grounding_chunks[].retrieved_context.page_number, so they are tied to the indexed document rather than the model’s guess. If you don’t need RAG and just want a one-off pass, plain generate_content with an inline PDF still works (types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf")), but it relies on prompting (Step 2) for citations.
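For a report, it helps to collapse the grounding chunks into one sorted page list per document. The sketch below works on any object shaped like the metadata described above; the field names (`grounding_chunks`, `retrieved_context`, `page_number`, `title`) follow this article's snippet and should be checked against the SDK version you run. `cited_pages` is a hypothetical helper, and `SimpleNamespace` stands in for a real response so the example runs offline.

```python
from types import SimpleNamespace

def cited_pages(grounding_metadata):
    """Return {document title: sorted unique page numbers} from grounding chunks."""
    pages = {}
    chunks = getattr(grounding_metadata, "grounding_chunks", None) or []
    for chunk in chunks:
        ctx = getattr(chunk, "retrieved_context", None)
        page = getattr(ctx, "page_number", None)
        if page is None:
            continue  # chunk without page information: skip rather than guess
        title = getattr(ctx, "title", None) or "(untitled)"
        pages.setdefault(title, set()).add(page)
    return {title: sorted(nums) for title, nums in pages.items()}

# Stand-in for response.candidates[0].grounding_metadata
fake_meta = SimpleNamespace(grounding_chunks=[
    SimpleNamespace(retrieved_context=SimpleNamespace(page_number=12, title="paper.pdf")),
    SimpleNamespace(retrieved_context=SimpleNamespace(page_number=3, title="paper.pdf")),
    SimpleNamespace(retrieved_context=None),
    SimpleNamespace(retrieved_context=SimpleNamespace(page_number=12, title="paper.pdf")),
])
print(cited_pages(fake_meta))
```

An empty result on a grounded request is itself a signal: the answer was generated without retrieved support, so treat any page numbers in the prose as unverified.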

A few API facts worth knowing (June 2026):

A single request accepts a PDF up to 1000 pages / 50 MB; beyond that, split the document.
Natively embedded PDF text is extracted and not billed as tokens — another reason to OCR scans into real text rather than leaving them as images (image pages cost ~258 tokens each).
Set media_resolution in generationConfig if dense tables or small print are being misread; higher resolution tokenizes pages at finer detail.
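The limits above are easy to check before you upload. A minimal sketch using the figures stated in this article (1000 pages, 50 MB, about 258 tokens per image page); treating "50 MB" as 50 MiB is an assumption, and the helper names are hypothetical.

```python
import math

MAX_PAGES = 1000
MAX_BYTES = 50 * 1024 * 1024   # assumption: "50 MB" read as 50 MiB
TOKENS_PER_IMAGE_PAGE = 258    # approximate figure quoted in the article

def preflight(num_pages, size_bytes):
    """Return a list of human-readable problems; empty means the PDF fits one request."""
    problems = []
    if num_pages > MAX_PAGES:
        parts = math.ceil(num_pages / MAX_PAGES)
        problems.append(f"{num_pages} pages exceeds {MAX_PAGES}; split into at least {parts} parts")
    if size_bytes > MAX_BYTES:
        problems.append(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    return problems

def image_page_tokens(num_image_pages):
    """Rough token cost of pages that are processed as images."""
    return num_image_pages * TOKENS_PER_IMAGE_PAGE

print(preflight(200, 10_000_000), image_page_tokens(200))
```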

Use Gemini 3.1 Pro (gemini-3.1-pro) rather than a Flash variant for citation work — the extra reasoning budget reduces both dropped and invented citations.

Step 6: Cross-check before you quote it anywhere

For research, legal, or academic use, never trust a single-model citation, even from File Search. Open the cited page in a PDF viewer and confirm the quote exists verbatim. It takes about 30 seconds per citation and saves a retraction.

How to confirm it’s fixed

You’re done when all three hold:

You can select clean text in the PDF (text layer is good).
Every claim in Gemini’s answer carries a page number and a quote, and a 3-citation spot-check matches the source exactly.
If you used File Search, grounding_chunks[].retrieved_context.page_number is populated and points at the right page.

If page numbers are off by a fixed offset, that’s the harmless front-matter case — not a failure.

Prevention

Pre-OCR every scanned PDF before any Gemini analysis. Make it a reflex.
Keep a reusable “cite-with-quote” prompt snippet for all citation work.
Process one chapter per turn for long docs rather than the whole book at once.
Always spot-check at least 3 random citations before quoting Gemini’s output anywhere external.
If you do this routinely, build a File Search workflow instead of fighting the consumer app. The citation quality difference is material.
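To keep the spot-check honest, let a random generator choose which citations you verify instead of picking the ones that look plausible. A small sketch, assuming your citations are held as a list of (page, quote) tuples; `pick_spot_checks` is a hypothetical name.

```python
import random

def pick_spot_checks(citations, k=3, seed=None):
    """Return up to k citations chosen uniformly at random, without repeats.

    Pass a seed only when you want the selection to be reproducible.
    """
    citations = list(citations)
    if len(citations) <= k:
        return citations  # too few to sample: check all of them
    return random.Random(seed).sample(citations, k)

citations = [(3, "quote a"), (17, "quote b"), (42, "quote c"),
             (58, "quote d"), (91, "quote e")]
for page, quote in pick_spot_checks(citations):
    print(f"Check page {page}: {quote!r}")
```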

FAQ

Why does Gemini cite “page 5” when the real text is on page 17? Almost always a front-matter offset (cause 2) or a weak OCR layer (cause 1). If the gap is constant across citations, it’s the offset; fix it by telling Gemini to use the printed page number. If the gap is random, OCR the PDF or switch to File Search.

Can the Gemini app at gemini.google.com give reliable page citations? It will show clickable references to PDF pages, which is fine for casual reading, but it runs a lighter retrieval pipeline than the API. For anything you’ll quote externally, use the API File Search tool and verify against the source.

Does OCR really matter if Gemini can “see” the page? Yes. Vision lets Gemini read the words, but a real text layer gives it stable page structure, makes native text free (not billed as tokens), and sharply reduces wrong-page attribution. Image-only PDFs are the number-one cause of bad citations.

What’s the best model for PDF citations? Gemini 3.1 Pro (gemini-3.1-pro) with its 1M-token context. It handles a 1000-page PDF in one request and, paired with File Search, returns verifiable page_number metadata. Flash variants are faster and cheaper but drop and invent citations more often.

How do I stop Gemini from inventing page numbers entirely? Require a verbatim quote for every claim and tell it to omit any claim it can’t quote (Step 2). A fake quote is trivial to catch on a spot-check; a fake page number alone is not. For zero guessing, use File Search, where page numbers come from the index rather than the model.

Tags: #Gemini #Troubleshooting #PDF