ChatGPT Misreads Your CSV / Excel Data File

Numbers come back wrong, headers misaligned, dates flipped. Force the analysis tool, declare your locale, and verify row counts — accurate results in three steps.

Published: May 17, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

For data files, ChatGPT has two routes: treat the file as plain text and “eyeball” a summary, or invoke the analysis tool (Code Interpreter / Advanced Data Analysis) and run real Python. Which route it picks dominates accuracy: the text route is fine for vibes but unreliable for sums, counts, and pivots, because nothing is actually computed. Locale, encoding, and merged cells can make even the Python route fail, so “force code” alone is not enough.

Fastest fix: open a new chat, attach the file, and start your prompt with “Use the analysis tool.” followed by “Print df.shape, df.dtypes, and df.head() before computing anything.” If the reply still has no Analyzed block, you are on the text route; that is cause #1 below. If it ran Python but the numbers are wrong, you are almost always in cause #2 (locale) or #3 (encoding).

Which bucket are you in

Symptom	Most likely cause	Jump to
No `Analyzed` / `python` block in the reply	Text route, never ran Python	Cause 1
Dates off by a fixed swap (month/day)	Locale (DD/MM vs MM/DD)	Cause 2
`??`, mojibake, or a stray column name	Encoding / BOM / quoting	Cause 3
`NaN` holes, missing rows, wrong sheet	Merged cells / hidden rows / multi-sheet	Cause 4
Totals too low on a big file	Sampling before aggregation	Cause 5
Math runs but the answer is plain wrong	Model wrote bad pandas code	Cause 6

Common causes

Ordered by hit rate, highest first.

1. Model used text extraction, never ran Python

The most common failure. You upload a CSV / XLSX, the model reads the first few rows and summarizes — no python code block appears, no Analyzed tag, no output file. Looks like analysis, is actually guessing.

How to spot it: No collapsible Analyzed block in the reply (click it to expand the Python). If that block is absent, Code Interpreter never ran. Ask it to “rerun using the analysis tool” and compare.

2. Locale mismatch on dates / decimals

03/04/2026 is March 4 in the US, April 3 in Europe. 1,234.56 is one thousand two hundred thirty-four and 56 hundredths in English-speaking countries; much of Europe writes the same number as 1.234,56. The model defaults to US format, so a European CSV can get every date and amount wrong.

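There is a sharper, specific trap here: pandas date inference is per-column but order-dependent. In pandas versions before 2.0, each value is parsed on its own: a row that is a valid US date (01/02/2026) is read as MM/DD, while a later row that only parses as DD/MM (13/02/2026) is silently read day-first, and the earlier rows are never re-checked. You end up with a column where some rows are parsed one way and the rest another. From pandas 2.0 on, the format is inferred from the first value and applied to the whole column, so the same file raises an error or leaves the column as unparsed strings instead. Either way, the fix is to never let pandas guess: pass dayfirst=True or, more strictly, an explicit format= (dayfirst=True is a preference, not a hard rule).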

How to spot it: Ask it to “print the first 5 raw values of the Date column alongside the parsed datetime.” Years match but month/day swapped, or only some rows swapped, confirms this is the cause.
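The side-by-side check can be sketched as follows. This is a minimal illustration, assuming a column named Date in DD/MM/YYYY; the sample values are made up.

```python
import pandas as pd

# Hypothetical raw values, kept as strings exactly as they appear in the file.
raw = pd.Series(["01/02/2026", "13/02/2026", "03/04/2026"], name="Date")

# Explicit format: every row is parsed as day/month/year, nothing is inferred.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

# Raw string next to the parsed value, one row per record.
check = pd.DataFrame({"raw": raw, "parsed": parsed})
print(check)

# Rows where both leading fields are <= 12 parse under either convention,
# so these are the ones worth verifying by hand against the source.
parts = raw.str.split("/", expand=True).astype(int)
ambiguous = (parts[0] <= 12) & (parts[1] <= 12)
print(raw[ambiguous].tolist())
```

If a file contains no row with a first field above 12, the data alone cannot tell you which convention it uses; ask whoever exported it.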

3. CSV encoding / quoting / BOM issues

Excel’s plain “Save as CSV” default is your system code page (GBK on a Chinese Windows, Windows-1252 in the US/EU), not UTF-8. Chinese characters or euro signs come through as ?? or mojibake. Fields containing commas without surrounding quotes shift every later column. A BOM (the invisible character U+FEFF, often displayed as ï»¿) at the file head turns the first column name into \ufeffname, so df["name"] raises KeyError.

How to spot it: Have it “print the columns list and the raw bytes of the first row.” ? characters, a column name prefixed with \ufeff (or ï»¿), or a wrong column count all point to encoding/quoting.
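A rough way to run the same check locally, using only the standard library. The function name sniff_encoding and the sample bytes are illustrative assumptions, and the heuristic covers only a BOM test and a UTF-8 trial decode; it is not a full detector.

```python
import codecs

def sniff_encoding(first_bytes: bytes) -> str:
    """Guess an encoding from a file's leading bytes (BOM check, then UTF-8 trial)."""
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with BOM; this codec strips the marker on decode
    if first_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        first_bytes.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown (likely a legacy code page such as cp1252 or gbk)"

# Hypothetical file contents as they would sit on disk.
with_bom = codecs.BOM_UTF8 + b"name,amount\n"
legacy = "name,price\nCaf\u00e9,\u20ac5\n".encode("cp1252")

print(sniff_encoding(with_bom))
print(sniff_encoding(legacy))

# The same bytes decoded two ways: the first header keeps the BOM, the second does not.
print(repr(with_bom.decode("utf-8").split(",")[0]))      # '\ufeffname'
print(repr(with_bom.decode("utf-8-sig").split(",")[0]))  # 'name'
```

When a BOM is present, passing encoding="utf-8-sig" to the reader is the usual remedy.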

4. Excel hidden rows, merged cells, multiple sheets

Merged cells collapse the value into the top-left cell and leave the rest as NaN. pandas reads hidden rows by default (so they are still in your totals even though Excel hides them). Multi-sheet workbooks default to the first sheet if sheet_name is not specified, so you may be analyzing the wrong tab entirely.

How to spot it: Ask it to run pd.read_excel(file, sheet_name=None).keys() to list every sheet, then print df.shape and compare to the row/column count Excel shows.
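A sketch of both checks, assuming a workbook path you supply yourself. The helper name sheet_shapes and the sample frame are hypothetical; the frame only imitates what a merged Region cell usually looks like once loaded.

```python
import pandas as pd

def sheet_shapes(path: str) -> dict:
    """Return {sheet name: (rows, columns)} for every sheet in a workbook."""
    # sheet_name=None loads all sheets into a dict keyed by sheet name.
    sheets = pd.read_excel(path, sheet_name=None)
    return {name: frame.shape for name, frame in sheets.items()}

# Simulated result of loading a sheet with merged "Region" cells:
# the label sits in the first row of each merge, the rest are missing.
df = pd.DataFrame({
    "Region": ["North", None, None, "South", None],
    "Amount": [10, 20, 30, 40, 50],
})

# Forward-fill restores the repeated label so a groupby sees every row.
df["Region"] = df["Region"].ffill()
totals = df.groupby("Region")["Amount"].sum()
print(totals)
```

Forward-filling is only safe for columns that were genuinely merged; applied to a column with real gaps it invents values.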

5. Large file got sampled before aggregation

The analysis tool caps spreadsheets/CSVs at roughly 50MB per file (as of June 2026 — high-density files with thousands of columns can choke well below that), and the container can run out of memory on millions of rows. When that happens the model sometimes reads only the first N rows and starts computing without telling you it sampled.

How to spot it: Have it print len(df) and compare to the file’s actual row count (wc -l locally, minus one for the header line, or a row estimate from file size). A shortfall means it sampled or truncated.
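Line counts and record counts differ when a quoted field contains a line break, so a record-aware count is a safer baseline. A minimal sketch with the standard csv module; count_data_rows is a made-up helper name and the sample text is invented.

```python
import csv
import io

def count_data_rows(handle) -> int:
    """Count CSV records, excluding the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header record
    return sum(1 for _ in reader)

# Four physical lines, but only two data records:
# the first record has a line break inside a quoted field.
sample = 'id,note\n1,"line one\nline two"\n2,plain\n'
print(count_data_rows(io.StringIO(sample)))
```

For a real file, open it with open(path, newline="", encoding="utf-8") and pass the handle in.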

6. The model’s Python code itself is wrong

Even when Python runs, the model wrote it. df.groupby('region').sum() looks right, but if the amount column came in as a string (because of cause #3), the sum is empty or concatenates text. The model also occasionally emits deprecated pandas calls (e.g. infer_datetime_format=True, deprecated in pandas 2.0, where it no longer has any effect) that warn, error, or behave unexpectedly.

How to spot it: For high-stakes tasks, ask it to paste the code and hand-check one row of the output against the raw data.
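The string-column failure is easy to reproduce. The sketch below uses invented data and forces object dtype to imitate a column that arrived as text; the string sum typically comes back as concatenated text rather than a number.

```python
import pandas as pd

# Amounts held as text, as they arrive when a column fails to parse as numeric.
df = pd.DataFrame(
    {"region": ["EU", "EU", "US"], "amount": ["10.5", "2.5", "7"]},
    dtype=object,
)

# Summing text joins the strings instead of adding the numbers.
print(df.groupby("region")["amount"].sum())

# Convert first; errors="coerce" turns unparseable cells into NaN
# so they can be counted instead of silently skewing the result.
df["amount_num"] = pd.to_numeric(df["amount"], errors="coerce")
bad_cells = int(df["amount_num"].isna().sum())
numeric_totals = df.groupby("region")["amount_num"].sum()
print(bad_cells)
print(numeric_totals)
```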

Before you start

Confirm whether this happens in a plain chat, a Project, or a Custom GPT — the analysis tool’s availability and limits differ across the three, and Free users have very tight quotas.
Pick the right model in the picker. As of June 2026 the picker is Instant / Thinking / Pro (all GPT-5.5; o3 and the GPT-5.2 line were retired in ChatGPT around mid-2026). Thinking writes more reliable analysis code and supports every tool; Instant can silently take the text route on borderline prompts.
Duplicate the chat before retesting so prior history does not pollute the next diagnostic.
Confirm your plan: Free / Go / Plus / Business / Enterprise differ in analysis-tool quota, max file size, and execution time.

Info to collect

File type (csv / xlsx / tsv / json), size (MB), total rows, columns, presence of non-ASCII / euro / dates.
Encoding: run file -I data.csv locally (macOS; on Linux the flag is file -i) to see utf-8 / utf-16 / windows-1252.
Full prompt text plus the ChatGPT reply; especially note “did the reply contain an Analyzed block?”
Current model (Instant / Thinking / Pro) and whether the analysis tool is on (it is on by default for Thinking; Instant supports it but may skip it).
One concrete example of the error: expected X, got Y, true value in the raw data is Z.

Shortest fix path

Ordered by ROI. The first three solve roughly 80% of cases.

Step 1: Force the analysis tool, inspect schema first

Open a new chat, switch the model to Thinking, and use this prompt template:

Use the analysis tool. Load `data.csv` into pandas, reading every column
as a string first (dtype=str) so nothing is silently coerced.
Print:
1. df.shape
2. df.dtypes
3. df.head()
4. For any date column, parse it with an explicit format and print the
   first 3 parsed values next to the raw strings.

After confirming the schema, compute: <your real question>

Reading every column as a string first prevents pandas from guessing types before you have seen the data. This single step surfaces most format problems and stops you running downstream math on the wrong dtypes.
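What the prompt asks for looks roughly like this in pandas. The inline text stands in for data.csv and is invented; keep_default_na=False is an extra assumption on top of the prompt, added so that blank cells and strings such as "NA" also stay as text.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for data.csv.
csv_text = (
    "Date,Amount,Region\n"
    '03/04/2026,"1.234,56",EU\n'
    '13/02/2026,"99,10",EU\n'
)

# dtype=str keeps every cell as text, so nothing is coerced before inspection.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(df.shape)
print(df.dtypes)
print(df.head())
```

With everything still text, the European amount "1.234,56" is visible as written instead of being misread as a number.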

Step 2: Declare date / decimal locale explicitly

The Date column is DD/MM/YYYY (European format) — parse with dayfirst=True
or format="%d/%m/%Y", and do NOT let pandas auto-infer (it switches format
mid-column and corrupts earlier rows).
The Amount column uses a comma as the decimal separator (e.g. "1.234,56" =
1234.56) — strip thousands separators before converting to float.

Do not skip this — locale errors are silent. The output looks normal but every number is wrong.
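For reference, one way the declared locale translates into pandas arguments. This is a sketch on made-up data, assuming a semicolon-delimited export (common where the comma is the decimal mark) and the column names Date and Amount.

```python
import io
import pandas as pd

# Hypothetical European-style export: ";" delimiter, "," decimal, "." thousands.
csv_text = "Date;Amount\n03/04/2026;1.234,56\n13/02/2026;99,10\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    decimal=",",          # "," is the decimal mark
    thousands=".",        # "." groups thousands
    dtype={"Date": str},  # keep dates as text until parsed explicitly
)

# Explicit day-first format, so no inference takes place.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
print(df)
```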

Step 3: Convert Excel to CSV, kill merged cells

Pre-process before upload:

Select all merged cells, unmerge, and fill in the repeated values.
Unhide all hidden rows and columns.
Save As -> CSV UTF-8 (Comma delimited) — not the plain “CSV” option, which uses your system code page.
One header row, no blank rows.

Or pre-process in Python locally:

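import pandas as pd
# Name the sheet explicitly so the wrong tab is never picked up (.xlsx needs openpyxl installed).
df = pd.read_excel("source.xlsx", sheet_name="Sheet1")
# Write UTF-8 and drop the pandas index so the CSV has only the real columns.
df.to_csv("clean.csv", index=False, encoding="utf-8")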

Step 4: Verify row counts and spot-check

Append to every analysis prompt:

Print:
- Total rows read: len(df)
- Non-null count per column
- Sanity check: pick one row from the result, find it in the raw data,
  and confirm the math matches.

If len(df) is less than the file’s actual row count (wc -l minus the header line), the model sampled or truncated. Ask it to re-read the file fully, or split the file (Step 6).
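The spot check can also be done locally by recomputing one figure without pandas. A minimal sketch on invented data, assuming columns named region and amount:

```python
import csv
import io
from decimal import Decimal

import pandas as pd

# Hypothetical raw data standing in for the uploaded file.
csv_text = "region,amount\nEU,10.50\nUS,7.00\nEU,2.25\n"

df = pd.read_csv(io.StringIO(csv_text))
pandas_total = float(df.groupby("region")["amount"].sum()["EU"])

# Independent recomputation straight from the raw text, no pandas involved.
manual_total = sum(
    Decimal(row["amount"])
    for row in csv.DictReader(io.StringIO(csv_text))
    if row["region"] == "EU"
)

print(len(df), pandas_total, manual_total)
# The two routes should agree to within rounding.
assert abs(Decimal(str(pandas_total)) - manual_total) < Decimal("0.005")
```

Two independent code paths agreeing is stronger evidence than the same code run twice.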

Step 5: Paste the code and read it

For high-stakes work (financials, A/B test conclusions, anything customer-facing):

Show me the exact pandas code you used, with comments.

Read the groupby columns, the aggregation function, and the filter conditions. This is far more reliable than asking the model to “double-check.”

Step 6: Pre-slice huge files locally

For files near or above the ~50MB CSV ceiling (or over a million rows), split locally before upload:

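# Split CSV by rows (only the first part, part_aa, keeps the header row;
# prepend the header to the other parts before uploading them)
split -l 100000 large.csv part_
# Or take a column subset (csvcut is part of csvkit)
csvcut -c "date,amount,region" large.csv > slim.csv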

Process each chunk as an independent task and aggregate the results yourself.
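Aggregating the per-chunk results yourself can be sketched as below. The data is invented and the chunk size is tiny for illustration; the same pattern applies whether the chunks come from separate files or from read_csv with chunksize.

```python
import io
import pandas as pd

# Hypothetical data standing in for a large file.
csv_text = "region,amount\nEU,10\nUS,7\nEU,2\nUS,1\nEU,5\n"

partials = []
# chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partials.append(chunk.groupby("region")["amount"].sum())

# Sums combine cleanly: add up the partial sums for each key.
total = pd.concat(partials).groupby(level=0).sum()
print(total)
```

This works for sums and counts. A mean cannot be rebuilt by averaging chunk means; carry the sum and the count per chunk and divide at the end.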

How to confirm the fix

Open a fresh chat, upload the same file, ask the same question — confirm the answer is stable (not a lucky guess last time).
Have ChatGPT print one total / mean / group result and hand-check it against your manual math or an Excel pivot — every digit must match.
Have a colleague run the same prompt in their account — confirms it is not just your session that got fixed.

If still broken

If you get a Code Interpreter session expired message, the sandbox timed out (roughly after 15-30 minutes of inactivity, or on a usage cap) and your uploaded file is gone. Re-upload and re-issue the prompt in the same turn.
Cut the file to the absolute minimum: 100 rows, only the columns involved — see whether the smallest case works.
Swap formats: xlsx -> csv, csv -> tsv, csv -> parquet, to rule out a format-specific parser bug.
Switch model from Instant to Thinking; the reasoning model writes more reliable analysis code and is less likely to skip the tool.
Package the source file, prompt, model, and a subscription screenshot, then file a ticket at help.openai.com.

Prevention

Standardize files: UTF-8, ISO dates (YYYY-MM-DD), a period as the decimal mark, one header row, no merged cells, no blank rows.
For every data task, start with print(df.shape, df.dtypes, df.head()) — schema bugs surface immediately.
Always force “use the analysis tool” for numerical work; never trust the model’s mental math.
Build a “double-check” habit for high-stakes tasks: code paste, a manual spot check, and an Excel-pivot comparison.
For recurring report shapes, bake the cleaning code into a Custom GPT’s instructions so schema handling stays consistent.
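As an illustration of the first point, a small export helper might look like this. The name to_standard_csv and the sample frame are hypothetical, and it assumes the date columns have already been parsed to datetimes.

```python
import pandas as pd

def to_standard_csv(df: pd.DataFrame, date_cols: list[str]) -> str:
    """Render a frame as CSV text with ISO dates and "." decimals, no index."""
    out = df.copy()
    for col in date_cols:
        # ISO 8601 dates read the same under every locale.
        out[col] = out[col].dt.strftime("%Y-%m-%d")
    # With no path given, to_csv returns the CSV as a string.
    return out.to_csv(index=False)

frame = pd.DataFrame({
    "date": pd.to_datetime(["03/04/2026"], format="%d/%m/%Y"),
    "amount": [1234.56],
})
text = to_standard_csv(frame, ["date"])
print(text)
```

To write a file instead, pass a path and encoding="utf-8" to to_csv.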

FAQ

Why does ChatGPT say it analyzed my file when it clearly did not? Because the text route produces a plausible-sounding summary from the first few rows without running any code. The tell is the missing Analyzed block. If you do not see one, no Python ran — force the analysis tool and re-ask.

My dates are off by a month for only some rows. Why? That is the pandas order-dependent inference trap (cause #2): it locked in US format on early rows, then switched to European mid-column without re-checking. Force dayfirst=True or an explicit format= so nothing is auto-guessed.

What is the actual file-size limit for CSV uploads? Roughly 50MB per spreadsheet/CSV as of June 2026, and high-density files with thousands of columns can fail well under that. Text-heavy documents are limited by tokens (about 2M per file), not raw size. Split large CSVs locally before uploading.

Code Interpreter session expired — did I lose my data? The sandbox is temporary and clears after a period of inactivity or on a usage cap; your uploaded file and any in-memory DataFrame are gone with it. Re-upload the file and re-issue your instruction in a single message.

Should I use Instant or Thinking for data work? Thinking. It writes more reliable analysis code, supports every tool, and is less likely to silently take the text route. Reserve Instant for quick, non-numeric questions.
