Claude PDF Table Extraction Pulls Cells Into Wrong Columns

Claude reads a PDF but table cells land in the wrong columns, rows merge, or numbers shift one header over. Here is the fastest fix plus a cause-by-cause repair for both the source file and the prompt.

Published: May 24, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

You drop a PDF into Claude, ask it to extract a table, and the result is technically a table, but the columns are shifted by one, rows are merged where they should be separate, or numeric values land under the wrong header. It looks plausible until you cross-check against the source.

Fastest fix: tell Claude the exact column headers and their order up front, ask for CSV, then verify three rows. A prompt like “The table on page 4 has 5 columns in this order: Name, Date, Region, Revenue, Margin. Output every row as CSV with that header. Negatives may appear as (1,234); keep them negative.” fixes the large majority of “wrong column” cases on the first retry, because it removes the guesswork Claude was doing about the grid.

This is one of the most common PDF failures because most PDFs encode tables as positioned text runs with no row or column metadata. What happens under the hood changes how you fix it: as of June 2026, when you send a PDF, Claude converts each page to an image and extracts the text layer, then reasons over both together (per Anthropic’s PDF support docs). So it is neither pure OCR nor pure coordinate parsing; it is vision plus text. If the visual layer is degraded (low-res scan, tiny fonts, a page rotated sideways) or the text layer is unhelpful (positioned runs with no grid), Claude’s reconstruction of the table drifts. Fix the weaker of the two layers and accuracy usually jumps.

Which bucket are you in

Run this 20-second triage before changing anything. Open the PDF in Preview (macOS) or Acrobat and try the checks.

What you see	Most likely cause	Jump to
Selecting “one cell” drags the whole line	No table structure; coordinate inference	Cause 1, Step 2
Output rows off by one or two vs source	Merged / multi-line cells	Cause 2, Step 3
Header row treated as data, or dropped	Header styled but not tagged	Cause 3, Step 2
Search finds no text in the file	Scanned image with no text layer	Cause 4, Step 4
Two newspaper columns read as one wide table	Multi-column page layout	Cause 5, Step 5
Negatives flip sign, currency vanishes	Accounting number formatting	Cause 6, Step 7
Whole document looks roughly right but fuzzy	Dense/large PDF filling context	Cause 7, Step 6

Common causes

Ordered by how often each is the underlying problem.

1. PDF stores text as positioned runs, not a real table

Most PDFs, especially those exported from Word, PowerPoint, or a scanner, lay text down by xy-coordinates. There is no “row 3, column 2” marker. Claude infers the grid from spacing and the page image, and gets it wrong when spacing is irregular.

How to judge: open the PDF in Preview or Acrobat and try to select a single cell. If selecting drags whole lines instead of cell content, the PDF has no table structure.

2. Merged cells and multi-line cells confuse row detection

A cell spanning two rows, or a single cell containing a wrapped paragraph, makes row inference ambiguous. Claude either merges adjacent rows or splits one row into two.

How to judge: count rows in the source versus Claude’s output. Off-by-one or off-by-two is the signature.

3. Header row visually styled but not structurally tagged

Bold text at the top of a table reads as a header to humans, but the PDF stores it identically to other rows. Claude may treat the header as a data row or drop it entirely.

How to judge: is the first row in Claude’s output the column headers, or actual data?

4. Scanned PDF with OCR errors

Image-only PDFs have no text layer, so Claude relies almost entirely on the page image. OCR-style mistakes (1 vs l, 0 vs O, decimal vs comma) and misalignment propagate into the table.

How to judge: search for any word in the document (Cmd/Ctrl+F). If search returns nothing, it is a scan with no real text layer.
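
The same check can be scripted when you have many files to triage. The following is a minimal sketch, assuming the third-party pypdf package is installed; the function names and the 20-character threshold are illustrative choices, not part of any official tooling.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if at least one page yields a meaningful amount of text.

    min_chars is an arbitrary threshold (an assumption) that ignores stray
    characters such as a lone page number.
    """
    return any(len(text.strip()) >= min_chars for text in page_texts)


def read_page_texts(pdf_path):
    """Extract the text of every page with pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # Image-only pages produce an empty string here.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling has_text_layer(read_page_texts("report.pdf")) on a hypothetical file returns False for an image-only scan, which is the signal to run OCR before sending the file to Claude.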

5. Multi-column page layout misread as one wide table

Some reports use two-column page layouts. Claude may read across the page width and merge unrelated columns.

How to judge: look at the page. Two newspaper-style columns with a gutter is your cause.

6. Numeric formatting (parentheses for negatives, currency symbols)

Accounting PDFs often use (1,234) for negative and $1,234.56 for currency. Claude may strip symbols inconsistently or read a parenthesized negative as positive.

How to judge: spot-check numbers against the source. Sign errors and missing currency are the tell.

7. The PDF is dense or large and fills the context before the page limit

Anthropic’s docs note that dense PDFs (many small-font pages, complex tables, heavy graphics) can fill the context window before reaching the page limit, and large files can fail even via the Files API. When that happens, later pages get a degraded read and tables on them drift.

How to judge: does accuracy fall off only on later pages, or only on the densest pages? That points here rather than at the table itself.

Before you start

Decide the shape you actually need: CSV, JSON, Markdown table, or plain prose. Each calls for a different prompt. CSV with explicit headers is the most reliable to re-import.
If you have the original Excel or CSV, use it. Extracting from PDF is always lossy.
Open the PDF in Acrobat or Preview and inspect structure before prompting.
Collect: page and table count, born-digital vs scanned, column/row count, presence of merged cells, page layout (single/multi-column, landscape), and a sample correct row plus a sample wrong row from Claude’s output.

Step-by-step fix

Step 1: Inspect the PDF source structure

Open in Acrobat or Preview and try to select one table cell. If selection drags whole lines, there is no table structure and extraction relies on coordinate plus image inference. This sets your expectations and tells you whether to fix the file (Steps 4-5) or just the prompt (Steps 2-3).

Step 2: Reframe the prompt with explicit structure

State the table shape up front: “The table on page 4 has 5 columns in this order: Name, Date, Region, Revenue, Margin. Extract all rows as CSV with that exact header. Numbers may use parentheses for negatives; preserve them as negative.” Naming the headers and their order removes the grid-guessing that causes shifted columns. Anthropic’s own guidance is to use the logical page numbers shown in your PDF viewer when you reference pages, so Claude anchors to the right page.
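
If you extract tables regularly, it can help to generate this prompt from a header list instead of retyping it. The sketch below is one possible helper; the function name and parameters are hypothetical, and it only builds the prompt text without calling any API.

```python
def build_extraction_prompt(page, headers, negatives_in_parens=True):
    """Build a table-extraction prompt that names the headers and their order."""
    header_list = ", ".join(headers)
    parts = [
        f"The table on page {page} has {len(headers)} columns "
        f"in this order: {header_list}.",
        "Extract all rows as CSV with that exact header.",
    ]
    if negatives_in_parens:
        parts.append(
            "Numbers may use parentheses for negatives; preserve them as negative."
        )
    return " ".join(parts)


# Example with the headers used in this guide.
prompt = build_extraction_prompt(4, ["Name", "Date", "Region", "Revenue", "Margin"])
print(prompt)
```

Because the column count is derived from the list, the number stated in the prompt cannot drift out of sync with the headers you name.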

Step 3: Extract one row at a time as a verification probe

Ask: “Read row 7 of the table on page 4 and output each cell on its own line, prefixed by its column name.” This forces Claude to anchor on the structure rather than scan loosely, and it exposes exactly where an off-by-one starts so you can correct the header list and re-run the full extract.
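
The probe reply can also be checked mechanically. This sketch assumes Claude answers with one "Column: value" line per cell (the colon separator is an assumption, since the prompt only asks for a column-name prefix); the sample reply is made up for illustration.

```python
def parse_probe(reply):
    """Split 'Column: value' lines into (column, value) pairs."""
    pairs = []
    for line in reply.splitlines():
        if ":" not in line:
            continue  # skip blank lines or commentary
        column, value = line.split(":", 1)
        pairs.append((column.strip(), value.strip()))
    return pairs


def first_mismatch(pairs, expected_headers):
    """Return the index where the reply diverges from the expected order, or None."""
    for i, expected in enumerate(expected_headers):
        if i >= len(pairs) or pairs[i][0] != expected:
            return i
    if len(pairs) > len(expected_headers):
        return len(expected_headers)  # the reply contains extra cells
    return None


headers = ["Name", "Date", "Region", "Revenue", "Margin"]
# Made-up reply in which the Region cell was skipped.
reply = "Name: Acme Corp\nDate: 2026-03-31\nRevenue: 1,200\nMargin: 12%"
print(first_mismatch(parse_probe(reply), headers))  # prints 2
```

The returned index points at the first column where the shift begins, which is usually the header to re-examine in the source.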

Step 4: For scans, pre-OCR; for complex tables, pre-extract then have Claude clean

If the file is a scan, run a high-quality OCR pass first (Acrobat Pro’s “Recognize Text,” or ABBYY FineReader) so a real text layer exists before Claude sees it. For complex tables with borders and merged cells, pre-extract with a dedicated tool, then paste the CSV into Claude and ask it to fix alignment and number formatting. This is far more reliable than asking Claude to extract from the raw PDF.

Camelot (Python, free) is the usual workhorse as of June 2026: use flavor="lattice" for tables with visible ruled lines between cells and flavor="stream" for whitespace-separated tables. Newer releases add further flavors, so check the documentation for your installed version before relying on anything beyond these two. Tabula and AWS Textract are alternatives; Textract is worth it for noisy scans where you want a managed OCR plus table model.
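
A pre-extraction step with Camelot might look like the sketch below. It assumes camelot-py is installed and uses only the two long-standing flavors; the helper names and the output file naming are hypothetical.

```python
def pick_flavor(has_ruled_lines):
    """Choose a Camelot flavor from what you saw when inspecting the page."""
    return "lattice" if has_ruled_lines else "stream"


def extract_tables_to_csv(pdf_path, pages, has_ruled_lines, out_prefix="table"):
    """Run Camelot on the given pages and write one CSV per detected table."""
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor=pick_flavor(has_ruled_lines)
    )
    written = []
    for i, table in enumerate(tables):
        out_path = f"{out_prefix}_{i}.csv"
        # table.df is a pandas DataFrame; the header is just its first row here.
        table.df.to_csv(out_path, index=False, header=False)
        # parsing_report carries Camelot's own accuracy and whitespace estimates.
        print(out_path, table.parsing_report)
        written.append(out_path)
    return written
```

A call such as extract_tables_to_csv("report.pdf", "4", has_ruled_lines=True) on a hypothetical file writes table_0.csv and so on; a low accuracy value in the parsing report is a hint to try the other flavor before handing the CSV to Claude.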

Step 5: Split multi-column pages before sending

If the page is two-column, convert to single column first so Claude does not read across the gutter. In Acrobat, export the page to a reflowable format (or use a split/crop tool) so each column becomes its own block, then send. For one-off pages, cropping the page in half and sending each half separately works.

Step 6: For dense or long PDFs, split and use a large-context model

If accuracy degrades on later or denser pages, split the document into sections and extract one section per prompt, then reassemble in Excel. Use the latest large-context model for table-heavy work — Claude Opus 4.7 and Sonnet 4.6 carry a 1M-token context as of June 2026, which holds far more dense pages before the read degrades. On the API, a single request allows up to 600 pages and 32 MB total (100 pages on the older 200k-context tier); in the web app, very large PDFs are best split before upload.
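
If you prefer to script the split and the reassembly rather than do it in Excel, the logic is small. This is a sketch using only the standard library; the chunk size is something you pick by trial, and the function names are illustrative.

```python
import csv
import io


def page_ranges(total_pages, chunk_size):
    """Return inclusive 1-based (start, end) page ranges covering the document."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def merge_csv_chunks(chunks):
    """Concatenate per-section CSV strings, keeping the header only once."""
    merged = []
    header = None
    for chunk in chunks:
        rows = [row for row in csv.reader(io.StringIO(chunk.strip())) if row]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
        elif rows[0] != header:
            # A differing header usually means one section was extracted
            # with a different column list; stop rather than merge silently.
            raise ValueError(f"header mismatch: {rows[0]} != {header}")
        merged.extend(rows[1:])
    return merged


print(page_ranges(25, 10))  # prints [(1, 10), (11, 20), (21, 25)]
```

Raising on a header mismatch is a deliberate choice: it surfaces a section that drifted instead of hiding it inside the merged table.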

Step 7: For accounting PDFs, lock numeric formatting in the prompt

Add: “Treat (1,234) as -1234. Treat $1,234.56 as the number 1234.56 with currency USD. Output two columns: amount, currency.” This eliminates ambiguous number handling and stops sign flips.
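
The same rules can be applied in code as a cross-check on Claude’s output, or to normalize cells from a pre-extraction tool. The sketch below is one way to do it under the assumptions stated in the prompt above ("$" means USD, parentheses mean negative); it does not handle other currencies or European decimal commas.

```python
import re
from decimal import Decimal


def parse_amount(cell, default_currency=None):
    """Convert an accounting-style cell into (amount, currency)."""
    text = cell.strip()
    # (1,234) means negative in accounting notation.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    if text.startswith("-"):
        negative = True
        text = text[1:].strip()
    currency = default_currency
    if text.startswith("$"):
        currency = "USD"  # assumption: a dollar sign always means USD
        text = text[1:].strip()
    digits = text.replace(",", "")  # drop thousands separators
    if not re.fullmatch(r"\d+(\.\d+)?", digits):
        raise ValueError(f"unrecognized amount: {cell!r}")
    amount = Decimal(digits)
    return (-amount if negative else amount, currency)


print(parse_amount("(1,234)"))
print(parse_amount("$1,234.56"))
```

Decimal is used instead of float so that totals reconcile exactly when you sum the column.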

How to confirm it’s fixed

The header row in the output matches the source headers exactly, in order.
The output row count equals the source row count (the off-by-one is gone).
Three random spot-checked rows (e.g. row 3, row 15, the last row) match the source cell-by-cell.
Any numeric totals reconcile with the source totals; negatives kept their sign.

Errors here are almost always systematic, not random — if one check fails, the same defect usually repeats, and fixing the header list or number rule in the prompt clears the whole table.
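
The first three checks above are easy to automate once you have typed in the expected headers, the source row count, and a few rows by hand. This is a minimal sketch with the standard library; the function name and the shape of spot_checks (1-based data row number mapped to the expected cells) are illustrative choices.

```python
import csv
import io


def verify_extraction(csv_text, expected_headers, expected_rows, spot_checks=None):
    """Return a list of problems; an empty list means every check passed."""
    rows = [row for row in csv.reader(io.StringIO(csv_text.strip())) if row]
    problems = []
    if not rows or rows[0] != list(expected_headers):
        problems.append(f"header mismatch: got {rows[0] if rows else None}")
    data = rows[1:]
    if len(data) != expected_rows:
        problems.append(f"row count: expected {expected_rows}, got {len(data)}")
    for row_number, expected in (spot_checks or {}).items():
        # row_number is 1-based and counts data rows only (header excluded).
        actual = data[row_number - 1] if 0 < row_number <= len(data) else None
        if actual != list(expected):
            problems.append(
                f"row {row_number}: expected {list(expected)}, got {actual}"
            )
    return problems
```

An empty list is the pass condition; anything else tells you which check failed, and because these errors tend to be systematic, the first reported problem is usually the one to fix in the prompt.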

Long-term prevention

For internal reports, publish a CSV alongside the PDF so downstream consumers skip extraction entirely.
Pre-OCR scans (Acrobat Pro, ABBYY) before Claude sees them.
Avoid two-column layouts for any document that might be machine-read.
In a Claude Project that handles finance or research data, add a custom instruction: “When extracting tables, first echo the assumed header row for user confirmation, then output the data.”
For repeating workflows, build a Camelot or Tabula pre-extraction step and pipe clean CSV into Claude.
If Claude seems to ignore charts and table visuals entirely, check Settings, then Feature Preview, and confirm visual PDF handling is on — older accounts had a “Visual PDFs” toggle that gates the image layer.

Common pitfalls

Trusting a Claude-extracted table without spot-checking. The errors are systematic, so a few checks catch them.
Asking “give me the table” without naming the columns. Claude infers the grid and often infers wrong.
Pasting a scan and expecting clean results without a real text layer. Pre-OCR first.
Forgetting that merged cells and footnote rows confuse row detection.
Treating negative-in-parentheses numbers as positive without telling Claude how to parse them.

FAQ

Why are columns shifted by exactly one? Usually a header cell that wraps to two lines, or an unusually wide first column that throws off grid inference. Naming the headers explicitly in the prompt (Step 2) fixes it.
Can Claude read scanned PDFs? Yes — it analyzes each page as an image plus any extracted text. Quality tracks scan resolution. For anything critical, run a dedicated OCR pass first so a real text layer exists.
What is the most reliable format to request? CSV with explicit headers named in the prompt. Markdown tables read nicely but are lossier to re-import into a spreadsheet.
Does the model version matter? Yes. Newer models handle layout inference noticeably better; use the latest Claude Opus 4.7 or Sonnet 4.6 (1M-token context as of June 2026) for table-heavy and long documents.
Should I split a wide or long table across prompts? Yes — extract per page or per section, then reassemble in Excel. This avoids context degradation on dense pages.
What are the size limits? On the API, one request allows up to 600 pages and 32 MB total (100 pages on older 200k-context models). For web/app uploads, split very large PDFs before uploading.

Tags: #Claude #Troubleshooting #PDF