You drop a PDF into Claude, ask it to extract a table, and the result is technically a table, but the columns are shifted by one, rows are merged where they should be separate, or numeric values land under the wrong header. It looks plausible until you cross-check against the source. This is one of the most common failure modes for PDF work: most PDFs encode tables as positioned text runs without explicit row or column metadata, so Claude (like other PDF tools) has to reconstruct the grid from coordinates, and small layout quirks throw it off. There are reliable workarounds on both the source side and the prompt side.
Common causes
Ordered by how often each is the underlying problem.
1. PDF stores text as positioned runs, not a real table
Most PDFs, especially those exported from Word or PowerPoint or produced by OCR on a scan, place text on the page by x/y coordinates. There is no “row 3, column 2” marker. Claude infers the grid from spacing, and gets it wrong when spacing is irregular.
How to judge: Open the PDF in Preview or Acrobat and try to select a single cell. If the selection drags across whole lines instead of staying inside the cell, the PDF has no table structure.
2. Merged cells and multi-line cells confuse row detection
A table with cells spanning two rows, or a single cell containing a wrapped paragraph, makes row inference ambiguous. Claude either merges adjacent rows or splits one row into two.
How to judge: Count rows in the source vs Claude’s output. Off-by-one or off-by-two is the signature.
3. Header row visually styled but not structurally tagged
Bold-formatted text at the top of a table reads as a header to humans, but the PDF stores it identically to other rows. Claude may treat the header as a data row or drop it entirely.
How to judge: Is the first row in Claude’s output the column headers or actual data?
4. Scanned PDF with OCR errors
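Image-based PDFs have to go through OCR before extraction. OCR mistakes (1 vs l, 0 vs O, decimal point vs comma) propagate directly into the table. Misaligned OCR also breaks column inference.
How to judge: Is the PDF a scan or a born-digital file? Search for a word you can see on the page. If search finds nothing, the file is an image-only scan with no text layer, and every value will come from OCR. If search works on a file that is visibly a scan, an OCR text layer was already added and may carry errors.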
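One way to catch character confusions like these before they reach a total is to flag cells in numeric columns that contain characters a number should never have. This is a minimal sketch, assuming the extracted table is already in memory as a list of dicts; the function name `flag_ocr_suspects` and the column names are hypothetical. It does not catch a swapped decimal point and comma, since both are valid in a number.

```python
import re

# Characters allowed in a cleanly extracted numeric cell: digits, sign,
# separators, common currency symbols, parentheses, percent, whitespace.
NUMERIC_CELL = re.compile(r"^[\d\s.,()$€£%+-]+$")

def flag_ocr_suspects(rows, numeric_columns):
    """Return (row_index, column, value) for numeric cells with unexpected characters."""
    suspects = []
    for index, row in enumerate(rows):
        for column in numeric_columns:
            value = row.get(column, "")
            if value and not NUMERIC_CELL.match(value):
                suspects.append((index, column, value))
    return suspects

rows = [
    {"Name": "Acme", "Revenue": "1,204.50"},
    {"Name": "Birch", "Revenue": "l,204.50"},  # lowercase L where a 1 should be
    {"Name": "Cedar", "Revenue": "(3O0)"},     # capital O where a 0 should be
]
suspects = flag_ocr_suspects(rows, ["Revenue"])
print(suspects)
```

Flagging rather than auto-correcting is deliberate here: a flagged cell still has to be checked against the source page.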
5. Multi-column page layout misread as one wide table
Some reports use two-column page layouts. Claude may read across the page width and merge unrelated columns.
How to judge: Open the PDF and look at the page layout. If you see two newspaper-style columns separated by a gap, this is your cause.
6. Numeric formatting (parentheses for negatives, currency symbols)
Accounting PDFs often use (1,234) for negative and $1,234.56 for currency. Claude may strip symbols inconsistently or misinterpret negatives as positives.
How to judge: Spot-check numbers against the source. Sign errors and missing currency are the tell.
Before you start
- Decide what shape you actually need: CSV, JSON, Markdown table, plain prose. Each prompts differently.
- Use the original Excel or CSV source if one exists. Extracting from PDF is always lossy.
- Open the PDF in Acrobat or Preview to inspect structure before prompting.
Information to collect
- Number of pages and tables in the PDF.
- Whether the PDF is born-digital or scanned.
- Column count, row count, presence of merged cells.
- Sample of correct rows and a sample of wrong rows from Claude’s output.
- Page layout (single column, multi-column, landscape).
- Any tools used to produce the PDF (Word, LaTeX, scanner).
Step-by-step fix
Step 1: Inspect the PDF source structure
Open in Acrobat or Preview. Try to select one table cell. If selection drags whole lines, there is no table structure and extraction will rely on coordinate inference. This sets your expectations.
Step 2: Reframe the prompt with structure hints
Tell Claude the table shape up front: “The table on page 4 has 5 columns: Name, Date, Region, Revenue, Margin. Extract all rows as a CSV with that header order. Numbers may use parentheses for negatives — preserve as negative.” Explicit structure dramatically improves accuracy.
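If you write this kind of prompt often, it can be assembled from the column list so the stated column count never drifts from the names. This is a small sketch; `build_extraction_prompt` is a hypothetical helper, not part of any Claude SDK, and the wording simply mirrors the example prompt above.

```python
def build_extraction_prompt(page, columns, notes=()):
    """Assemble a table-extraction prompt that states the expected shape up front."""
    sentences = [
        f"The table on page {page} has {len(columns)} columns: {', '.join(columns)}.",
        "Extract all rows as a CSV with that header order.",
    ]
    sentences.extend(notes)  # optional extra rules, e.g. numeric formatting
    return " ".join(sentences)

prompt = build_extraction_prompt(
    page=4,
    columns=["Name", "Date", "Region", "Revenue", "Margin"],
    notes=["Numbers may use parentheses for negatives; output them as negative numbers."],
)
print(prompt)
```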
Step 3: Extract one row at a time as a verification
Ask: “Read row 7 of the table on page 4 and output each cell separately on its own line, prefixed by column name.” This forces Claude to anchor on the structure rather than scan loosely.
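The reply to a prompt like this can be checked mechanically. The sketch below assumes Claude answers with one `Column: value` line per cell (the exact reply format is an assumption, so adjust the separator to what you actually get); `parse_row_dump` is a hypothetical helper that reports which expected columns are missing.

```python
def parse_row_dump(text, columns):
    """Parse 'Column: value' lines; report expected columns that are missing or extra."""
    cells = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so values such as times keep their colons.
        name, separator, value = line.partition(":")
        if separator:
            cells[name.strip()] = value.strip()
    missing = [column for column in columns if column not in cells]
    unexpected = [name for name in cells if name not in columns]
    return cells, missing, unexpected

reply = """Name: Acme Corp
Date: 2024-03-01
Region: EMEA
Revenue: (1,234)
"""
columns = ["Name", "Date", "Region", "Revenue", "Margin"]
cells, missing, unexpected = parse_row_dump(reply, columns)
print(missing, unexpected)
```

A missing column on a single row is a strong hint that the same column is being dropped or merged across the whole table.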
Step 4: Use a dedicated extraction tool first, then ask Claude to clean
For complex tables, pre-extract with a tool like Tabula, Camelot, or AWS Textract. Paste the CSV into Claude and ask it to clean column alignment and verify numeric formatting. This is far more reliable than asking Claude to extract from raw PDF.
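Before pasting a pre-extracted CSV into Claude, a quick width check shows which rows the tool already got wrong. This sketch uses only Python's standard `csv` module and assumes the tool's output has been saved or copied as CSV text; `find_ragged_rows` is a hypothetical name.

```python
import csv
import io

def find_ragged_rows(csv_text):
    """Return (line_number, cell_count) for each row whose width differs from the header's."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != len(header)
    ]

csv_text = (
    "Name,Region,Revenue\n"
    "Acme,EMEA,1200\n"
    "Birch,APAC\n"           # one cell short
    "Cedar,AMER,900,USD\n"   # one cell too many
)
ragged = find_ragged_rows(csv_text)
print(ragged)
```

Passing the flagged line numbers to Claude along with the CSV gives it a concrete place to start cleaning.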
Step 5: Split multi-column pages first
If the page is two-column, convert it to single column first. Acrobat: Export PDF, Single column layout. Or split the page in half visually before sending to Claude.
Step 6: Spot-check 3 random rows against the source
Pick rows from different parts of the table, for example row 3, row 15, and the last row. Compare each cell to the source PDF. Patterns of errors (every row off by one, headers swapped) point to a systematic issue you can fix in the prompt.
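When a spot-checked row fails, it helps to know whether the values are wrong or merely shifted. The sketch below compares a row you typed from the source against the extracted row and reports a consistent column offset if there is one. `detect_column_shift` is a hypothetical helper, and it assumes both rows are lists of strings.

```python
def detect_column_shift(source_row, extracted_row, max_shift=2):
    """Return k if extracted_row[i + k] == source_row[i] for all overlapping cells, else None.

    0 means the rows already line up; a positive k means values moved k columns right.
    """
    # Try the smallest offsets first so an exact match wins over a shifted one.
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        pairs = [
            (source_row[i], extracted_row[i + shift])
            for i in range(len(source_row))
            if 0 <= i + shift < len(extracted_row)
        ]
        if pairs and all(expected == actual for expected, actual in pairs):
            return shift
    return None

source = ["Acme", "2024-03-01", "EMEA", "1200", "0.31"]
shifted = ["", "Acme", "2024-03-01", "EMEA", "1200"]
print(detect_column_shift(source, shifted))
```

If several spot-checked rows report the same non-zero shift, the problem is systematic and usually traces back to the header or the first column.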
Step 7: For accounting PDFs, lock numeric formatting in the prompt
Add: “Treat (1,234) as -1234. Treat $1,234.56 as the number 1234.56 and the currency USD. Output two columns: amount, currency.” This eliminates ambiguous number handling.
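The same rules can be applied in code to double-check Claude's output or to normalize a pre-extracted CSV. This is a minimal sketch under stated assumptions: commas are thousands separators (US-style formatting), `$` means USD as in the prompt above, and `parse_amount` is a hypothetical helper name.

```python
from decimal import Decimal

# Assumption taken from the prompt above: "$" means USD. Extend as needed.
CURRENCY_SYMBOLS = {"$": "USD"}

def parse_amount(text):
    """Convert an accounting-style string to (Decimal amount, currency code or None)."""
    cleaned = text.strip()
    # Parentheses around the whole value mark a negative number.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1].strip()
    if cleaned.startswith("-"):
        negative = True
        cleaned = cleaned[1:].strip()
    currency = None
    for symbol, code in CURRENCY_SYMBOLS.items():
        if symbol in cleaned:
            currency = code
            cleaned = cleaned.replace(symbol, "").strip()
    # Raises decimal.InvalidOperation if anything non-numeric is left over.
    amount = Decimal(cleaned.replace(",", ""))
    return (-amount if negative else amount), currency

print(parse_amount("(1,234)"))
print(parse_amount("$1,234.56"))
```

`Decimal` is used instead of `float` so that amounts compare exactly when you reconcile totals.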
Verify
- Header row in the output matches the source headers exactly.
- Row count in the output matches the row count in the source.
- Three random spot-checked rows are correct cell-by-cell.
- Numeric totals (if any) reconcile with the source totals.
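The header, row-count, and totals checks above can be scripted so they run the same way every time. This sketch assumes the output is CSV text whose numeric cells are already plain numbers (no parentheses or currency symbols); `verify_extraction` and its parameters are hypothetical names. The cell-by-cell spot-check still has to be done by eye against the PDF.

```python
import csv
import io
from decimal import Decimal

def verify_extraction(csv_text, expected_header, expected_rows,
                      total_column=None, expected_total=None):
    """Return a list of problems found; an empty list means every check passed."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["no rows found"]
    header, data = rows[0], rows[1:]
    problems = []
    if header != expected_header:
        problems.append(f"header mismatch: {header}")
    if len(data) != expected_rows:
        problems.append(f"row count {len(data)} != expected {expected_rows}")
    if total_column is not None and total_column in header:
        index = header.index(total_column)
        total = sum(Decimal(row[index]) for row in data if len(row) > index)
        if total != expected_total:
            problems.append(f"{total_column} total {total} != expected {expected_total}")
    return problems

csv_text = "Name,Revenue\nAcme,1200.50\nBirch,-300.25\n"
problems = verify_extraction(
    csv_text, ["Name", "Revenue"], expected_rows=2,
    total_column="Revenue", expected_total=Decimal("900.25"),
)
print(problems)
```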
Long-term prevention
- For internal reports, publish CSV alongside the PDF so consumers can skip extraction entirely.
- For scanned PDFs, run a high-quality OCR pass (Acrobat Pro, ABBYY) before Claude sees them.
- Avoid two-column layouts for any document that might be machine-read.
- In Projects that handle finance or research data, add a custom instruction: “When extracting tables, always echo the assumed header row first for user confirmation.”
- For repeating workflows, build a Tabula or Camelot pre-extraction step and pipe clean CSV into Claude.
Common pitfalls
- Trusting a Claude-extracted table without spot-checking. Errors are usually systematic, not random.
- Asking “give me the table” without specifying columns. Claude infers and often infers wrong.
- Pasting a scan and expecting OCR-quality results. Pre-OCR first.
- Forgetting that merged cells and footnote rows confuse extraction.
- Not telling Claude how to parse negatives written in parentheses, so (1,234) comes back as a positive number.
FAQ
- Why are columns shifted by one? Usually a header cell that wraps to two lines, or an unusually wide first column that throws coordinate inference.
- Can Claude read scanned PDFs? Yes. Claude can read text from page images, but quality depends on scan resolution. Pre-OCR with a dedicated tool for anything critical.
- What is the most reliable extraction format to request? CSV with explicit headers in the prompt. Markdown tables are nice for reading but lossy for re-import.
- Does the model version matter? Yes. Newer models generally handle layout inference better. Use the latest Opus or Sonnet for table-heavy work.
- Should I split a multi-page table into one prompt per page? Yes, for very wide or long tables. Reassemble in Excel.
- Is there a hard size limit? Soft cap around 100 pages per PDF on web upload. Past that, split before uploading.
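For the one-prompt-per-page approach mentioned above, the pieces can also be reassembled in code rather than in Excel. This sketch assumes each page's output is CSV text that repeats the same header row; `merge_page_csvs` is a hypothetical helper that keeps the header once and stops if a page's header disagrees.

```python
import csv
import io

def merge_page_csvs(pages):
    """Concatenate per-page CSV strings, keeping the header once."""
    merged = []
    header = None
    for number, text in enumerate(pages, start=1):
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
        if not rows:
            continue  # skip pages that produced no output
        if header is None:
            header = rows[0]
            merged.append(header)
        elif rows[0] != header:
            raise ValueError(f"page {number} header {rows[0]} does not match {header}")
        merged.extend(rows[1:])
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(merged)
    return out.getvalue()

pages = [
    "Name,Revenue\nAcme,1200\n",
    "Name,Revenue\nBirch,900\n",
]
combined = merge_page_csvs(pages)
print(combined)
```

A header mismatch between pages is worth stopping on, because it usually means one page was extracted with shifted or renamed columns.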
Related
- Claude attachment preview not rendering
- Claude file upload stuck on processing
- Claude inaccurate answers
- Claude long context unstable
- Claude citations broken links
- Claude conversation export broken