You upload an 80-page 10-K filing, ask “summarize”, and get back five paragraphs that miss the key numbers, skip the risk factors, and read like a press release. Watered-down summaries are the default LLM behavior on long documents: when you don’t say what you care about, the model surfaces the obvious section headings and skips everything else.
To get a usable summary, tell it what to extract and how to structure the output, not just “summarize”.
Common causes
Ranked by how generic each one makes the output:
1. Prompt is just “summarize” (most common)
“Summarize this document” → the model doesn’t know whether you want decisions, risks, numbers, or strategy, so it defaults to a non-specific overview.
How to judge: is your prompt a single verb (just “summarize”)?
2. Long doc triggers token compression, middle sections skipped
PDFs of around 100K tokens appear to get compressed in the Web UI; in practice the model often covers roughly the first 20% and the last 10% and skips the middle. Critical data buried mid-document or in appendices gets ignored.
How to judge: are middle-of-doc topics completely absent from the summary?
3. Tables / figures / code blocks lost
Complex tables are flattened or dropped during PDF parsing, so Gemini never sees their structure and the summary lacks specific numbers.
How to judge: source has key tables (financials / comparison) but summary cites no specific numbers from them.
4. No output structure specified
Free-form prose → hard to locate the info you actually want.
5. Document itself is low-information
Marketing / SEO content is generic to begin with, and the summary inherits that.
6. Used Flash / Lite
Flash is noticeably weaker than Pro on long-doc summarization.
Shortest path to fix
Step 1: Ask for an outline first, then drill down
Round 1:
Read this document. Give me ONLY a section-by-section outline (no summary yet):
- Section title
- Section length (pages)
- Key claim / topic (one sentence)
Round 2: once the outline comes back, pick the 5-10 sections that matter, then send:
Now give me a detailed summary of these sections only:
{section names}
For each:
- Key facts (with numbers)
- Decisions / recommendations
- Risks mentioned
- Direct quotes for critical claims
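If you run this two-round flow often, the round-2 prompt can be assembled from the section names you picked. The sketch below is only a string builder: `OUTLINE_PROMPT`, `DRILL_FIELDS`, and `build_drill_prompt` are hypothetical names, and sending the prompt to Gemini is left to whatever interface you use.

```python
# Round-1 prompt, copied from the text above.
OUTLINE_PROMPT = (
    "Read this document. Give me ONLY a section-by-section outline (no summary yet):\n"
    "- Section title\n"
    "- Section length (pages)\n"
    "- Key claim / topic (one sentence)"
)

# The four items requested for every section in round 2.
DRILL_FIELDS = [
    "Key facts (with numbers)",
    "Decisions / recommendations",
    "Risks mentioned",
    "Direct quotes for critical claims",
]


def build_drill_prompt(section_names):
    """Build the round-2 prompt for the sections chosen from the outline."""
    if not section_names:
        raise ValueError("pick at least one section from the outline")
    sections = "\n".join(f"- {name}" for name in section_names)
    fields = "\n".join(f"- {field}" for field in DRILL_FIELDS)
    return (
        "Now give me a detailed summary of these sections only:\n"
        f"{sections}\n"
        "For each:\n"
        f"{fields}"
    )
```

Calling `build_drill_prompt(["Risk Factors", "Liquidity"])` returns the round-2 text with those two names filled in where `{section names}` appears in the template.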
Step 2: Structured output template
Don’t let Gemini write prose. Give it tables / bullets:
Summarize this 10-K filing using this exact structure:
## Business Overview
- Main revenue segments + % of total
- Geographic mix
## Financial Highlights
| Metric | This year | Last year | YoY change |
|---|---|---|---|
| Revenue | | | |
| Operating margin | | | |
| Free cash flow | | | |
| Headcount | | | |
## Risk Factors (top 5)
1. ... (with page reference)
## Strategic Initiatives
- ...
## Management Tone Indicators
- Words used more / less than last year's filing
Explicit format + table slots + numeric requirements: the model can’t quietly skip the gaps.
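A template also makes the result checkable. As a rough sketch, assuming the model returns markdown that reuses the headings and table layout from the template above, the two hypothetical helpers below report which headings are missing and which table rows still have blank cells.

```python
import re

# Headings from the 10-K template above; startswith() lets "(top 5)" vary.
REQUIRED_HEADINGS = [
    "## Business Overview",
    "## Financial Highlights",
    "## Risk Factors",
    "## Strategic Initiatives",
    "## Management Tone Indicators",
]


def _table_cells(line):
    """Split one markdown table row into stripped cell strings."""
    inner = line[1:-1] if line.endswith("|") else line[1:]
    return [cell.strip() for cell in inner.split("|")]


def missing_headings(summary):
    """Return the template headings that do not appear in the summary."""
    lines = [line.strip() for line in summary.splitlines()]
    return [h for h in REQUIRED_HEADINGS
            if not any(line.startswith(h) for line in lines)]


def empty_table_cells(summary):
    """Return the first-column label of every table row with a blank cell."""
    blanks = []
    for raw in summary.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        cells = _table_cells(line)
        # Skip the |---|---| separator row.
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue
        if any(cell == "" for cell in cells[1:]):
            blanks.append(cells[0])
    return blanks
```

If either function returns a non-empty list, ask again and name the missing pieces rather than accepting the partial summary.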
Step 3: Chunk long docs into 30-40 page pieces
For an 80-page doc:
- Split into pages 1-30, 31-60, 61-80
- Upload each chunk separately and request a summary using the Step 2 template
- Finally merge: ask Gemini to consolidate the three chunk summaries and extract cross-chunk themes
This sidesteps the compression that skips the middle of a long document.
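The page split is simple arithmetic. Here is a minimal sketch of it; `page_ranges` is a hypothetical helper, and the actual PDF splitting is left to whatever tool you already use.

```python
def page_ranges(total_pages, chunk_size=30):
    """Return 1-based inclusive (first_page, last_page) tuples covering the doc."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


print(page_ranges(80))  # [(1, 30), (31, 60), (61, 80)]
```

A fixed chunk size can cut a section in half; if the outline from Step 1 gives you section boundaries, nudging the split points to those boundaries is usually worth the extra effort.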
Step 4: Extract tables separately first
If key info lives in tables:
Extract every table from this document.
For each:
- Table title
- Headers (row + column)
- All cell values as markdown
- Page number
Extract tables before semantic summary so numbers survive.
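Once the tables come back as markdown, they can be turned into rows for spot checks or for pasting into a spreadsheet. This is a sketch that assumes a simple pipe-delimited table with one header row; `parse_markdown_table` is a hypothetical helper, not part of any Gemini tooling.

```python
def parse_markdown_table(block):
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    rows = []
    for raw in block.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the separator row made only of dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # zip() silently drops cells beyond the header width.
    return [dict(zip(header, row)) for row in body]
```

Each returned dict maps a column header to the cell text, so a value can be looked up by name and compared against the source page.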
Step 5: Switch to Gemini 2.5 Pro
Top model picker → Gemini 2.5 Pro (not Flash / Lite)
Pro goes noticeably deeper than Flash on long-doc summaries (very roughly 40% more detail; treat that as an impression, not a benchmark).
Step 6: Use the “decisions / risks / numbers” combo
Universal template:
Extract from this document:
DECISIONS: What did the author decide or recommend?
RISKS: What risks are mentioned? Use original phrasing.
NUMBERS: All quantitative claims (dates, percentages, dollar amounts) with surrounding context.
GAPS: What questions does the document raise but not answer?
Works for ~90% of business documents.
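To confirm the reply covers all four headings, the response can be split on the labels. This sketch assumes the model echoes each label at the start of a line followed by a colon, which is not guaranteed; `split_labeled_sections` and `missing_labels` are hypothetical helpers.

```python
import re

LABELS = ("DECISIONS", "RISKS", "NUMBERS", "GAPS")
# Matches a label at the start of a line, e.g. "RISKS:".
LABEL_RE = re.compile(r"^(%s):\s*" % "|".join(LABELS), re.MULTILINE)


def split_labeled_sections(response):
    """Return {label: text} for every label found in the response."""
    matches = list(LABEL_RE.finditer(response))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next label or the end of the response.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        sections[match.group(1)] = response[match.end():end].strip()
    return sections


def missing_labels(response):
    """Return labels that are absent or have no text under them."""
    sections = split_labeled_sections(response)
    return [label for label in LABELS if not sections.get(label)]
```

An empty GAPS section is a common sign that the model summarized instead of reading critically, so it is worth asking for that part again on its own.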
Step 7: Verify critical numbers
LLMs occasionally hallucinate numbers in long-doc summaries (under 5% of the time, but check closely):
- Pick 5 critical numeric claims
- Ctrl+F each one in the source
- Mismatches → ask Gemini to re-extract: “Number X is wrong, find the actual value on page Y”
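If you have the source as plain text, the Ctrl+F pass can be roughed out in code. The sketch below lists numbers that appear in the summary but nowhere in the source. It is an exact-match check after stripping thousands separators, so it will not notice a real number attached to the wrong metric, and it will flag figures the model legitimately computed (such as a YoY change). `unverified_numbers` is a hypothetical helper.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _normalize(token):
    """Drop commas so '1,234.5' and '1234.5' compare equal."""
    return token.replace(",", "")


def unverified_numbers(summary, source):
    """Return numbers found in the summary that never appear in the source."""
    source_numbers = {_normalize(t) for t in NUMBER_RE.findall(source)}
    flagged = []
    for token in NUMBER_RE.findall(summary):
        number = _normalize(token)
        if number not in source_numbers and number not in flagged:
            flagged.append(number)
    return flagged
```

Anything this returns is a candidate for the manual check above, not proof of a hallucination.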
Prevention
- Always outline-first, drill-second; never single-pass “summarize”
- Structured output template (tables + bullets) prevents prose drift
- Chunk long docs (> 50 pages) into 30-40 page batches
- Extract tables separately before semantic summary so numbers don’t vanish
- Verify critical numbers yourself; long-doc summaries occasionally hallucinate figures (under 5% of the time)
Tags: #Gemini #Debug #Troubleshooting