For data files, ChatGPT has two routes: treat the file as plain text and “eyeball” a summary, or invoke Code Interpreter / Advanced Data Analysis and run Python. Which route it picks dominates accuracy: the text route is fine for vibes but unreliable for sums, counts, and pivots, because the model is estimating rather than computing. Locale, encoding, and merged cells can make even the Python route fail, so “force code” alone isn’t enough.
Common causes
Ordered by hit rate, highest first.
1. Model used text extraction, never ran Python
The most common failure. You upload a CSV / XLSX, the model reads the first few rows and summarizes — no python code block appears, no “Analyzed” tag, no output file. Looks like analysis, is actually guessing.
How to spot it: No grey python code block in the reply, no “Analyzed” indicator — it didn’t use Code Interpreter. Ask it to “rerun using the analysis tool” and compare.
2. Locale mismatch on dates / decimals
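03/04/2026 is March 4 in the US, April 3 in Europe. 1,234.56 means one thousand two hundred thirty-four point five six in English-speaking countries; parts of Europe write the same amount as 1.234,56, which a parser expecting US format reads as text or as 1.23456. The model defaults to US format; if the CSV is European, every date with a day of 12 or less silently swaps month and day, and amounts are misread.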
How to spot it: Ask it to “print the first 5 raw values of the Date column alongside the parsed datetime.” Years match but month/day swapped means this is the cause.
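The swap is easy to reproduce locally. The sketch below uses only the Python standard library. `guess_date_order` is a hypothetical helper written for this illustration (it is not part of pandas or ChatGPT), and it assumes slash-separated NN/NN/YYYY strings.

```python
from datetime import datetime

raw = "03/04/2026"
as_us = datetime.strptime(raw, "%m/%d/%Y")  # month first: March 4
as_eu = datetime.strptime(raw, "%d/%m/%Y")  # day first: April 3


def guess_date_order(values):
    """Classify NN/NN/YYYY strings as 'day-first', 'month-first',
    'ambiguous' (no field exceeds 12), or 'mixed' (both fields do)."""
    first_over_12 = second_over_12 = False
    for value in values:
        first, second, _year = value.strip().split("/")
        first_over_12 = first_over_12 or int(first) > 12
        second_over_12 = second_over_12 or int(second) > 12
    if first_over_12 and second_over_12:
        return "mixed"
    if first_over_12:
        return "day-first"
    if second_over_12:
        return "month-first"
    return "ambiguous"


print(as_us.date(), as_eu.date())
print(guess_date_order(["03/04/2026", "25/04/2026"]))
```

A column where no value has a field above 12 stays ambiguous, which is exactly the case where you have to tell the model the format yourself.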
3. CSV encoding / quoting / BOM issues
Excel’s plain “Save as CSV” writes the system’s legacy code page (for example Windows-1252 or GBK), not UTF-8. Chinese characters or euro signs come through as ?? or as mojibake. Fields that contain commas but lack quotes shift columns. A BOM (bytes EF BB BF) at the file head turns the first column name into \ufeffname, often displayed as ï»¿name.
How to spot it: Have it “print the columns list + raw bytes of the first row.” ? characters, column names prefixed with \ufeff or ï»¿, or a wrong column count = encoding/quoting issue.
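To see what a BOM and a wrong code page do to the header, here is a small standard-library sketch. The sample bytes and the `read_header` helper are made up for illustration; `ascii()` is used when printing so that invisible characters show up as escapes.

```python
import csv
import io

# Bytes as a "CSV UTF-8" export would write them: "utf-8-sig" prepends a BOM.
raw = "name,city\nZoë,Köln\n".encode("utf-8-sig")


def read_header(raw_bytes, encoding):
    """Decode the bytes with the given encoding and return the header row."""
    text = raw_bytes.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


print(raw[:3])                               # the three BOM bytes
print(ascii(read_header(raw, "utf-8")))      # BOM survives in the first name
print(ascii(read_header(raw, "cp1252")))     # wrong code page: mojibake
print(ascii(read_header(raw, "utf-8-sig")))  # BOM stripped
```

Reading with `utf-8-sig` is harmless when no BOM is present, so it is a reasonable default to request for files that came out of Excel.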
4. Excel hidden rows, merged cells, multiple sheets
Merged cells collapse the value into the top-left and leave the rest as NaN. pandas reads hidden rows by default (so you think they’re gone, they’re not). Multi-sheet workbooks default to the first sheet if sheet_name isn’t specified.
How to spot it: Ask it to run pd.read_excel(file, sheet_name=None).keys() to see all sheets; print df.shape and compare to what Excel shows.
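The effect of merged cells on a group total can be sketched without an actual workbook. The DataFrame below is hypothetical: it mimics what a merged "North" label spanning two rows typically looks like after loading, with only the top-left cell filled.

```python
import pandas as pd

# Hypothetical frame mimicking read_excel output when "North" spans two
# merged rows: only the first of those rows carries the label.
df = pd.DataFrame(
    {"region": ["North", None, "South"], "amount": [10, 20, 5]}
)

before = df.groupby("region")["amount"].sum()  # the unlabeled row is dropped

df["region"] = df["region"].ffill()            # copy the label downward
after = df.groupby("region")["amount"].sum()

print(before.to_dict())
print(after.to_dict())
```

Forward-filling is only safe on columns where blanks really mean "same as above"; applied to a column with genuinely missing values it invents data.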
5. Large file got sampled before aggregation
Above Code Interpreter’s memory threshold (typically ~100MB / a few hundred thousand rows), the model sometimes reads only the first N rows and starts computing without telling you it sampled.
How to spot it: Have it print len(df) and compare to the file’s actual row count (wc -l minus one for the header, or a file-size estimate). Mismatch = sampling.
6. The model’s Python code itself is wrong
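Even when Python runs, the model wrote it. df.groupby('region').sum() looks right, but if the amount column was read as strings instead of numbers, the values are concatenated as text (or the column is dropped, depending on the pandas version) instead of being added.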
How to spot it: For high-stakes tasks, ask it to paste the code + hand-check one row of the output against the raw data.
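One way to make this failure loud instead of silent is to convert the column explicitly before aggregating. The following is a minimal sketch, not the code ChatGPT generates; `safe_group_sum` and the sample data are hypothetical.

```python
import pandas as pd


def safe_group_sum(df, key, value):
    """Sum `value` per `key`, raising if any non-empty cell is not a number."""
    numeric = pd.to_numeric(df[value], errors="coerce")
    # Cells that were present but failed to convert (e.g. "1,234" or "n/a").
    unparsed = numeric.isna() & df[value].notna()
    if unparsed.any():
        bad = df.loc[unparsed, value].head(3).tolist()
        raise ValueError(f"{value!r} has non-numeric values, e.g. {bad}")
    return numeric.groupby(df[key]).sum()


# Hypothetical data: amounts arrived as text, as they often do from CSV.
df = pd.DataFrame({"region": ["N", "N", "S"], "amount": ["10", "20", "5"]})
totals = safe_group_sum(df, "region", "amount")
print(totals.to_dict())
```

When reading the model's code, the equivalent thing to look for is an explicit numeric conversion (or a dtype check) somewhere before the groupby.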
Before you start
- Confirm whether this happens in a plain chat, a Project, or a Custom GPT — Code Interpreter availability differs across the three (Free users have very tight Code Interpreter quotas).
- Duplicate the chat before retesting so history doesn’t pollute the next diagnostic.
- Confirm your plan: Free / Plus / Team / Enterprise differ in Code Interpreter quota, max file size, and execution time.
Info to collect
- File type (csv / xlsx / tsv / json), size (MB), total rows, columns, presence of non-ASCII / euro / dates.
- Encoding: run file -I data.csv to see utf-8 / utf-16 / windows-1252.
- Full prompt text + ChatGPT reply screenshot; especially note “did the reply contain a python code block?”
- Current model (GPT-5.5 / GPT-5 / o3) and whether Code Interpreter is enabled (check Tools settings).
- One concrete example of the error: expected X, got Y, true value in raw data is Z.
Shortest fix path
Ordered by ROI. The first three solve ~80% of cases.
Step 1: Force Code Interpreter, inspect schema first
Open a new chat and use this prompt template:
Use the analysis tool. Load `data.csv` into pandas with UTF-8 encoding.
Print:
1. df.shape
2. df.dtypes
3. df.head()
4. For any date column, parse it and print the first 3 parsed values
next to the raw strings.
After confirming schema, compute: <your real question>
This single step surfaces most format problems and stops you from running downstream math on wrong dtypes.
Step 2: Declare date / decimal locale explicitly
The Date column is DD/MM/YYYY (European format).
The Amount column uses comma as decimal separator (e.g. "1.234,56" = 1234.56).
Parse accordingly.
Don’t skip this — locale errors are silent. Output looks normal but every number is wrong.
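For reference, this is roughly what correct parsing of such a file looks like in pandas. It is a sketch under assumptions: the inline sample data and the semicolon delimiter are made up, and the code ChatGPT writes may differ.

```python
import io

import pandas as pd

# Hypothetical European-style export: semicolon delimiter, DD/MM/YYYY dates,
# "." as thousands separator and "," as decimal separator.
csv_text = "Date;Amount\n03/04/2026;1.234,56\n25/12/2026;99,90\n"

df = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",", thousands=".")
# An explicit format avoids any guessing about day/month order.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

print(df.dtypes)
print(df)
```

If the generated code has no `decimal`, `thousands`, or explicit date format for a European file, that is a strong hint the locale declaration was ignored.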
Step 3: Convert Excel to CSV, kill merged cells
Pre-process before upload:
- Select all merged cells → unmerge → fill repeated values.
- Unhide all hidden rows / columns.
- Save As → CSV UTF-8 (Comma delimited) — not the default “CSV” which is Windows-1252.
- One header row, no blank rows.
Or pre-process in Python locally:
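import pandas as pd
# Read one named sheet; sheet_name=None would return every sheet as a dict
df = pd.read_excel("source.xlsx", sheet_name="Sheet1")
# Write plain UTF-8 without the pandas index column
df.to_csv("clean.csv", index=False, encoding="utf-8")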
Step 4: Verify row counts + spot-check
Append to every analysis prompt:
Print:
- Total rows read: len(df)
- Non-null count per column
- Sanity check: pick one row from the result, find it in the raw data,
confirm the math matches.
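If len(df) is less than the file’s actual row count (wc -l minus one for the header line), the model sampled. Ask it to re-read fully (or split the file). Quoted fields that contain line breaks inflate wc -l, so a small gap is not always sampling.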
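A line count and a record count can disagree for innocent reasons, so it helps to have a local reference number that counts records the way a CSV parser does. The sketch below uses only the standard library; `count_data_rows` is a hypothetical helper and the sample file is created on the fly.

```python
import csv
import os
import tempfile


def count_data_rows(path, encoding="utf-8-sig"):
    """Count CSV records after the header, honoring quoted line breaks."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)                      # skip the header row
        return sum(1 for row in reader if row)  # ignore blank lines


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write('id,note\n1,"line one\nline two"\n2,plain\n')
    with open(path, encoding="utf-8") as f:
        physical_lines = sum(1 for _ in f)      # what wc -l would report
    records = count_data_rows(path)

print(physical_lines, records)
```

Compare the record count, not the physical line count, against the len(df) that ChatGPT prints.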
Step 5: Paste the code and read it
For high-stakes work (financials, A/B test conclusions, anything customer-facing):
Show me the exact pandas code you used, with comments.
Read the groupby columns, aggregation function, filter conditions. More reliable than asking the model to “double-check.”
Step 6: Pre-slice huge files locally
For files > 200MB or > 1M rows: split locally before upload.
# Split CSV by rows (only the first part keeps the header row)
split -l 100000 large.csv part_
# Or take a column subset (csvcut is part of csvkit)
csvcut -c "date,amount,region" large.csv > slim.csv
Process each chunk as an independent task and aggregate the results yourself.
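If every chunk needs to be loadable on its own, each one needs the header row. A minimal standard-library sketch of a header-preserving split follows; `split_csv`, the `part_0000.csv` naming, and the tiny demo file are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def split_csv(src, out_dir, rows_per_chunk=100_000, encoding="utf-8"):
    """Split src into part_0000.csv, part_0001.csv, ... each with the header."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    try:
        with open(src, newline="", encoding=encoding) as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:
                return paths  # empty input: nothing to write
            rows_in_chunk = 0
            for row in reader:
                if handle is None or rows_in_chunk == rows_per_chunk:
                    if handle is not None:
                        handle.close()
                    path = out_dir / f"part_{len(paths):04d}.csv"
                    handle = open(path, "w", newline="", encoding=encoding)
                    writer = csv.writer(handle)
                    writer.writerow(header)  # repeat the header in each chunk
                    paths.append(path)
                    rows_in_chunk = 0
                writer.writerow(row)
                rows_in_chunk += 1
    finally:
        if handle is not None:
            handle.close()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "large.csv"
    src.write_text(
        "id,amount\n" + "".join(f"{i},{i * 10}\n" for i in range(5)),
        encoding="utf-8",
    )
    parts = split_csv(src, Path(tmp) / "parts", rows_per_chunk=2)
    sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in parts]

print(sizes)
```

When combining results yourself, sums and counts add up across chunks, but means and medians do not: carry the per-chunk sum and count and divide at the end.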
How to confirm the fix
- Open a fresh chat, upload the same file, ask the same question — confirm the answer is stable (not a lucky guess last time).
- Have ChatGPT print one total / mean / group result and hand-check it against your manual math or an Excel pivot — every digit must match.
- Have a colleague run the same prompt in their account — confirms it’s not just your session that got fixed.
If still broken
- Cut the file to the absolute minimum: 100 rows, only the columns involved — see if the smallest case works.
- Swap formats: xlsx → csv, csv → tsv, csv → Parquet, to rule out a format-specific parser bug.
- Switch model: GPT-5.5 → o3 / GPT-5; reasoning models tend to write more reliable analysis code.
- Package source file + prompt + model + subscription screenshot, file a ticket at help.openai.com.
Prevention
- Standardize files: UTF-8, ISO dates (YYYY-MM-DD), period as decimal, one header row, no merged cells, no blank rows.
- For every data task, start with print(df.shape, df.dtypes, df.head()) so schema bugs surface immediately.
- Always force “use the analysis tool” for numerical work; don’t trust the model’s mental math.
- Build a “double-check” habit for high-stakes tasks: code paste + manual spot check + Excel pivot comparison.
- For recurring report shapes, bake the cleaning code into a Custom GPT’s instructions so schema handling stays consistent.