ChatGPT Truncates a Large CSV to the First 1000 Rows

Upload a big spreadsheet, ask 'how many rows are there,' and ChatGPT says 1000 — that's sampling, not counting. Here's how to force a full-file analysis on every row.

Published: May 24, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

You upload a 50,000-row CSV, ask “what is the average revenue across all rows,” and ChatGPT answers from the first 1000, so the number is obviously wrong. Fastest fix: make sure the analysis actually runs Python, then phrase the prompt as an aggregation, not a summary. Paste this as your first message after the upload: “Load the file with pandas, print df.shape, then compute the totals I ask for. Run the code, do not sample.” If df.shape matches your real row count, the file loaded in full and the aggregations that follow run on every row.

Two different failure modes produce the same wrong total. In plain chat (no Python), ChatGPT only pulls a slice of the file text into context and extrapolates from it. In Code Interpreter / Advanced Data Analysis (ADA), the whole file is on disk, but a vague prompt makes the model print df.head() and narrate instead of running aggregations. The fix is to force pandas execution on the full DataFrame, stream the file in chunks if it’s near the sandbox memory ceiling, or pre-aggregate locally and upload only the rollup.

Which bucket are you in?

Symptom in the transcript	Likely cause	Go to
No Python/analysis tool ran; row count is suspiciously round (1000, 2000)	Plain chat sampled the file	Step 1 + Step 2
Code cell shows `df.head()` / `df.dtypes` but never `df.shape` or an aggregation	Vague “summarize” prompt	Step 2
`MemoryError`, “kernel restarted”, or a silent `nrows=1000` retry	Sandbox memory ceiling (~1 GB)	Step 3 / Step 4
Final number only reflects part of the data after a `chunksize` loop	Chunks not accumulated	Step 3
Model claims columns “don’t exist”	Display truncated wide output	Step 1 (print `df.columns`)

Common causes

1. Plain chat samples the file, not loads it

Without Code Interpreter, ChatGPT reads a spreadsheet by including a portion of its text in the model’s context, then reasons over that slice. Any “analysis” past the sampled rows is extrapolation, not computation. Spreadsheets are exempt from the 2M-token text cap, but that doesn’t mean every row enters context — the model still works from a preview unless Python runs.

How to spot it: No Python tool icon or “Analyzing” code cell during the answer. The reported row count is a suspiciously round number (1000, 2000).

2. Code Interpreter loads it all but only prints `df.head()`

In ADA, pandas can load the whole CSV. But if the prompt is vague (“summarize this file”), the model defaults to printing the first 5 rows plus column info, then narrating. It never runs df.shape or df.describe() on the full frame.

How to spot it: The code cell shows df.head() and df.dtypes but never df.shape or any aggregation. The narrative says “based on the sample…”

3. Sandbox memory ceiling on truly large files

The Code Interpreter container runs with roughly 1 GB of memory (as of June 2026; this is fixed inside the ChatGPT app and can’t be raised there). The per-file upload cap is 512 MB, and CSV/Excel files are effectively limited to about 50 MB depending on row width. Even a file that uploads fine can blow past 1 GB once pandas parses it into a DataFrame — string-heavy columns balloon in memory. pd.read_csv then raises MemoryError or the kernel restarts, and the model often retries with nrows=1000 to “make it work” and reports that subset as the answer.

How to spot it: The code trace shows a MemoryError, a kernel restart, or a quiet re-read with nrows= after a failure.
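To see how much room a parsed frame really takes, it can help to measure it directly. The sketch below uses synthetic data (the column names and values are assumptions, not taken from any real upload) and shows one common way to shrink a string-heavy column: converting repetitive text to a categorical.

```python
import pandas as pd

# Synthetic stand-in for a string-heavy CSV (hypothetical data).
df = pd.DataFrame({
    "region": ["north", "south", "east", "west"] * 25_000,
    "revenue": [10.0, 20.0, 30.0, 40.0] * 25_000,
})

# deep=True measures the actual string data, not just the references to it.
mem_before = df.memory_usage(deep=True).sum()

# A low-cardinality text column is stored as small integer codes
# plus one copy of each distinct label.
df["region"] = df["region"].astype("category")
mem_after = df.memory_usage(deep=True).sum()

print(f"before: {mem_before / 1e6:.1f} MB")
print(f"after:  {mem_after / 1e6:.1f} MB")
```

The same effect is available at load time by passing dtype={"region": "category"} to pd.read_csv, and usecols= skips columns you do not need at all. How much this saves depends on how repetitive the column is, so treat the numbers as specific to your file.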

4. Truncation by chunked reading

If the code reads the file with chunksize= and the model forgets to accumulate across chunks, only the last chunk’s result reaches the final answer. This is common with custom aggregation prompts.
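A minimal reproduction of this failure, using a tiny in-memory CSV as a stand-in for the uploaded file (the data is made up for illustration). The first loop overwrites its result on every pass; the second accumulates.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the upload: revenue values 1..10.
csv_text = "revenue\n" + "\n".join(str(v) for v in range(1, 11))

# Buggy pattern: the variable is reassigned on every chunk,
# so only the final chunk survives the loop.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    last_only = chunk["revenue"].sum()

# Correct pattern: add each chunk's contribution to a running total.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["revenue"].sum()

print("last chunk only:", last_only)
print("accumulated:", total)
```

If a code trace looks like the first loop (plain assignment inside the for body, no += and no list of partial results), the reported number covers only the tail of the file.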

5. Column-wise truncation in the display

pandas’ display.max_columns option defaults to a low value (typically 20). A 50-column file prints as ... in the middle, and the model sometimes claims those hidden columns don’t exist.
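The elision only affects how the frame is printed, not what is loaded. As a rough illustration on a made-up 50-column frame (the col_N names are hypothetical), listing the column index or rendering with to_string() shows everything:

```python
import pandas as pd

# Hypothetical wide frame: 50 columns named col_0 .. col_49.
df = pd.DataFrame([[0] * 50], columns=[f"col_{i}" for i in range(50)])

# The column index itself is complete regardless of display settings.
all_columns = df.columns.tolist()
print(len(all_columns), "columns:", all_columns)

# to_string() renders every column instead of collapsing the middle.
full_text = df.head(1).to_string()
print(full_text)
```

For a whole session, pd.set_option("display.max_columns", None) removes the column limit on printed frames.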

Shortest path to fix

Step 1: Confirm the full file is loaded

Always start with a shape check so you’re not reasoning over a slice:

import pandas as pd
df = pd.read_csv("/mnt/data/big.csv")
print("rows:", df.shape[0])
print("cols:", df.shape[1])
print("columns:", list(df.columns))
print("memory MB:", round(df.memory_usage(deep=True).sum() / 1e6, 1))

If rows matches what you expect, the file loaded fully and you can trust subsequent aggregations. Printing columns also kills cause 5 — you’ll see every column name even when the display would otherwise hide them.
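If you do not know the expected row count in advance, one option is to compare the DataFrame against a raw line count of the file. This is a sketch under assumptions: the file has exactly one header line, and no field contains an embedded newline (quoted multi-line fields would make the raw count too high). The helper name count_data_lines and the sample file are hypothetical.

```python
import os
import tempfile
import pandas as pd

def count_data_lines(path):
    """Count lines after the header without parsing the file."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh) - 1

# Hypothetical small file; in the sandbox this would be the uploaded path.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
pd.DataFrame({"revenue": range(2500)}).to_csv(path, index=False)

df = pd.read_csv(path)
raw_rows = count_data_lines(path)
loaded_fully = raw_rows == df.shape[0]

print("lines in file:", raw_rows, "| rows in DataFrame:", df.shape[0])
print("loaded fully:", loaded_fully)
```

A mismatch does not always mean truncation (blank lines and multi-line fields also cause one), but a DataFrame with exactly 1000 rows against a file with tens of thousands of lines is a clear sign of a partial read.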

Step 2: Force aggregations on the full frame

Write the prompt as an aggregation request, not a summary request. Tell the model explicitly to run the code:

Load /mnt/data/big.csv with pandas. Print df.shape first.
Then compute and print the actual numbers, no sampling:
  - df["revenue"].sum()
  - df.groupby("region")["revenue"].sum()
  - df["date"].min(), df["date"].max()
Run the Python; do not summarize from df.head().

ADA executes each line and the answer is the real total. The phrase “do not summarize from df.head()” is what reliably stops cause 2.
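For reference, this is roughly the code such a prompt should produce in the transcript. The three-row sample and its values are invented for illustration; only the column names come from the prompt above. Parsing the date column is an added assumption so that min and max compare as dates, not as text.

```python
import io
import pandas as pd

# Hypothetical sample using the column names from the prompt.
csv_text = """date,region,revenue
2026-01-05,EU,100.0
2026-02-10,US,250.0
2026-03-15,EU,50.0
"""

# parse_dates makes min/max compare calendar dates, not strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.shape)

total_revenue = df["revenue"].sum()
by_region = df.groupby("region")["revenue"].sum()
date_range = (df["date"].min(), df["date"].max())

print(total_revenue)
print(by_region)
print(date_range)
```

When the code cell in your transcript has this shape (a shape print followed by explicit aggregation calls), the reported figures come from the full frame.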

Step 3: Stream large files with chunksize

If the file is near the memory ceiling, process it in chunks and accumulate — this is the fix for both cause 3 and cause 4:

import pandas as pd

total = 0
n = 0
for chunk in pd.read_csv("/mnt/data/huge.csv", chunksize=200_000):
    total += chunk["revenue"].sum()
    n += len(chunk)

print("rows processed:", n)
print("total revenue:", total)

Works for sums, counts, and most groupby aggregations. For exact percentiles or medians you need to keep partial state across chunks (or use the DuckDB approach below). The printed rows processed should equal your full row count — if it doesn’t, a chunk was dropped.
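The same accumulate pattern extends to a groupby sum: compute a partial result per chunk, then combine the partials at the end. A minimal sketch on invented data, assuming region and revenue columns:

```python
import io
import pandas as pd

# Hypothetical five-row file, read two rows at a time.
csv_text = """region,revenue
EU,100
US,250
EU,50
APAC,75
US,25
"""

partials = []
rows_seen = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Per-chunk partial sums, indexed by region.
    partials.append(chunk.groupby("region")["revenue"].sum())
    rows_seen += len(chunk)

# A region can appear in several chunks, so sum the partials by index.
by_region = pd.concat(partials).groupby(level=0).sum()

print("rows processed:", rows_seen)
print(by_region)
```

This works because sums (and counts) of parts add up to the whole. A mean does not combine this way; accumulate the sum and the count separately and divide once at the end.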

A leaner alternative inside the sandbox is DuckDB, which streams from disk and never materializes the whole file in RAM:

import duckdb
duckdb.sql("""
  SELECT region, sum(revenue) AS rev, count(*) AS rows
  FROM '/mnt/data/huge.csv'
  GROUP BY region
""").show()

Step 4: Pre-aggregate locally before upload

If memory is tight or you only need a rollup, do the aggregation on your machine and upload the small result:

# locally
import pandas as pd
df = pd.read_csv("huge.csv")
rollup = df.groupby(["region","quarter"]).agg(
    rev_sum=("revenue","sum"),
    rev_avg=("revenue","mean"),
    rows=("revenue","count"),
).reset_index()
rollup.to_csv("rollup.csv", index=False)

Upload rollup.csv (a few hundred rows). ChatGPT analyzes it instantly and your raw data stays on your machine.
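One thing to watch when working from a rollup: an overall average has to be rebuilt from the sums and counts, not by averaging the per-group averages. The sketch below uses an invented two-row rollup with the same column names as the script above.

```python
import pandas as pd

# Hypothetical rollup in the shape written by the local script.
rollup = pd.DataFrame({
    "region": ["EU", "US"],
    "quarter": ["Q1", "Q1"],
    "rev_sum": [300.0, 500.0],
    "rev_avg": [100.0, 500.0],
    "rows": [3, 1],
})

# Correct: total revenue divided by total row count.
overall_avg = rollup["rev_sum"].sum() / rollup["rows"].sum()

# Misleading: treats a 1-row group and a 3-row group as equal weight.
mean_of_means = rollup["rev_avg"].mean()

print("overall average:", overall_avg)
print("mean of group means:", mean_of_means)
```

Because "count" skips missing revenue values, the rows column matches the denominator that "mean" used, so the rebuilt average is consistent with the per-group averages. Asking for the overall figure explicitly as rev_sum total over rows total keeps the model from taking the shortcut.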

Step 5: Split into N files for parallel analysis

For files where you do need row-level access but can’t fit them all at once, split first:

# Note: only the first part (part_aa) keeps the header row; later parts start with data
split -l 100000 huge.csv part_

Upload one part at a time per question, or all parts in a Project where ChatGPT can iterate across them. Keep the file count under 10 for reliable retrieval — and remember the quota: as of June 2026 Plus is 80 files per rolling 3-hour window, Free is about 3 files per day, and Team/Enterprise is 160 files per 3 hours.
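A line-based split leaves the header row only in the first part, so later parts load with a data row as their column names. If that is a problem, a pandas-based split can write the header into every part. This is a sketch, not a drop-in replacement: split_csv, the part_000.csv naming, and the 250-row demo file are all hypothetical, and a round trip through pandas may reformat some values (for example, float formatting).

```python
import os
import tempfile
import pandas as pd

def split_csv(path, out_dir, rows_per_part):
    """Write the CSV at path into numbered parts, each with its own header."""
    part_paths = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_part)):
        part_path = os.path.join(out_dir, f"part_{i:03d}.csv")
        chunk.to_csv(part_path, index=False)  # to_csv writes the header by default
        part_paths.append(part_path)
    return part_paths

# Demo on a hypothetical 250-row file.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "huge.csv")
pd.DataFrame({"id": range(250), "revenue": [1.5] * 250}).to_csv(source, index=False)

parts = split_csv(source, work_dir, rows_per_part=100)
print([os.path.basename(p) for p in parts])
```

Reading with chunksize keeps only one part in memory at a time, so the split itself stays within the same limits as Step 3.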

How to confirm it’s fixed

The transcript shows a Python/analysis code cell (not just prose).
df.shape[0] (or rows processed) equals your real row count.
The headline numbers come from an aggregation line (.sum(), .groupby(), a DuckDB SELECT), not from df.head().
Re-ask the same total once more — a correct, computed answer is stable across re-runs; a sampled estimate usually drifts.

Prevention

For repeated analysis of large data, build the pipeline once locally with pandas or DuckDB and upload only the aggregated output.
When uploading, always include df.shape and df.columns in the first prompt so you can verify the full file loaded.
Prefer Parquet over CSV when files exceed 100 MB: typically 5-10x smaller, faster to read, and it preserves dtypes (so numbers and dates do not have to be re-inferred from text on load).
For ad-hoc exploration, sample deliberately: df.sample(10_000, random_state=42) gives a reproducible subset that fits cleanly in any prompt.
For dashboards and recurring reports, use Code Interpreter with the full file but always write the prompt as explicit aggregations, never as “summarize.”

FAQ

Why does ChatGPT say my file has exactly 1000 rows when it has 50,000? Because no Python ran. Plain chat sampled a slice of the file into context and reported the size of that slice. Re-ask with “run pandas and print df.shape” so the count comes from a real read.

What’s the actual file-size limit for a CSV upload? The hard cap is 512 MB per file, but CSV/Excel files are effectively limited to roughly 50 MB depending on row width, and the parsed DataFrame still has to fit inside the ~1 GB sandbox (as of June 2026). If yours is bigger, stream it (Step 3) or pre-aggregate locally (Step 4).

Can I raise the Code Interpreter memory limit? Not inside the ChatGPT app — the container memory is fixed (~1 GB). Higher memory is only configurable through the OpenAI API’s Code Interpreter tool, not the consumer app. In the app, chunked reading or DuckDB is the workaround.

It worked for one question, then the next answer was wrong again. Why? Each new analysis turn can start a fresh code state, and a vague follow-up (“now break it down by month”) may trigger sampling again. Re-state the aggregation explicitly each time, or keep all the computations in one prompt.

How do I get an exact median or 95th percentile over a chunked file? Per-chunk percentiles don’t combine into a correct global value. Either load the single column you need (pd.read_csv(path, usecols=["revenue"])) so only that column lives in memory, or run it in DuckDB with quantile_cont(revenue, 0.95), which computes it over the whole file without loading it all into RAM.
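To make the difference concrete, here is a small comparison on invented data between the median of per-chunk medians and the exact median computed from the single column loaded with usecols:

```python
import io
import pandas as pd

# Hypothetical seven-row file with a skewed revenue column.
values = [1, 2, 3, 4, 100, 200, 300]
csv_text = "region,revenue\n" + "\n".join(f"EU,{v}" for v in values)

# Naive: take a median per chunk, then the median of those medians.
chunk_medians = [
    chunk["revenue"].median()
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)
]
naive_median = pd.Series(chunk_medians).median()

# Exact: load only the column needed, then compute over all rows.
revenue = pd.read_csv(io.StringIO(csv_text), usecols=["revenue"])["revenue"]
exact_median = revenue.median()
exact_p95 = revenue.quantile(0.95)

print("median of chunk medians:", naive_median)
print("exact median:", exact_median)
print("exact 95th percentile:", exact_p95)
```

The naive figure depends on where the chunk boundaries happen to fall, which is why it can land far from the true value.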

Tags: #ChatGPT #ChatGPT files #Troubleshooting #Debug #large-file

Related Articles

ChatGPT Reads CSV But Reports Wrong Column Names or Merged Columns

ChatGPT Silently Rejects Password-Protected PDFs

ChatGPT Reads Excel but Ignores Formulas (Returns Them as Strings)

ChatGPT Still Uses the Old File Version After You Re-Uploaded

ChatGPT: 'No Text Could Be Extracted From This File' (Scanned / Handwritten PDF)

ChatGPT Treats Uploaded JSON as Plain Text Instead of Structured Data
