ChatGPT Truncates a 50k-Row CSV to the First 1000 Rows

Upload a big spreadsheet, ask 'how many rows are there,' and ChatGPT confidently says 1000. The usual cause is token-budget truncation or an analysis run on a preview of the data. Here is how to analyze the whole file.

You upload a 50,000-row CSV, ask “what is the average revenue across all rows,” and ChatGPT answers from the first 1000, so the totals are wrong. Two failure modes account for most cases: in plain chat, ChatGPT samples the file into context and the rest is dropped; in Code Interpreter (Advanced Data Analysis, ADA), the file is fully on disk but the model summarizes df.head() instead of running aggregations. The fix is to force pandas execution on the full DataFrame, chunk the file if it exceeds sandbox memory, or pre-aggregate locally and upload the rollup.

Common causes

1. Plain chat samples the file, not loads it

Without Code Interpreter, ChatGPT reads a file by including a portion of its text in context. The token budget caps that portion, typically at roughly the first 1000-3000 rows depending on how wide the rows are. Any “analysis” past that point is extrapolation, not computation.

How to spot it: No Python tool icon during analysis. The reported row count is suspiciously round (1000, 2000).

2. Code Interpreter loads it all but only prints df.head()

In ADA, pandas can load the whole CSV. But if the prompt is vague (“summarize this file”), the model defaults to printing the first 5 rows plus column info, then narrating. It never runs df.shape or df.describe() on the full frame.

How to spot it: The code cell shows df.head() and df.dtypes but never df.shape or any aggregation. The narrative says “based on the sample…”

3. Sandbox memory ceiling on truly huge files

A 2GB CSV with 20M rows fills the sandbox memory and pd.read_csv either raises MemoryError or kills the kernel. The model retries with nrows=1000 to “make it work” and reports that subset as the answer.

How to spot it: Code Interpreter shows a kernel-restart or memory error in the trace.
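
Before loading a very large file, it can help to estimate whether it will fit. The sketch below extrapolates the in-memory size from the first rows; the function name estimate_full_memory_mb is hypothetical, and the estimate assumes later rows look like the first ones, so treat the result as a rough guide only.

```python
import os
import pandas as pd

def estimate_full_memory_mb(path, sample_rows=10_000):
    """Roughly extrapolate the DataFrame size of a CSV from its first rows."""
    sample = pd.read_csv(path, nrows=sample_rows)
    if len(sample) == 0:
        return 0.0
    # Memory the sample occupies once parsed (deep=True includes string contents).
    sample_mem = sample.memory_usage(deep=True).sum()
    # Approximate bytes the sample occupies on disk by re-serializing it.
    sample_disk = len(sample.to_csv(index=False).encode("utf-8"))
    # Scale by how many times larger the whole file is than the sample.
    scale = os.path.getsize(path) / sample_disk
    return sample_mem * scale / 1e6
```

If the estimate is close to or above the memory available in the sandbox, stream the file in chunks or pre-aggregate locally instead of calling pd.read_csv on the whole file.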

4. Truncation by chunked reading

If the generated code reads with chunksize= but overwrites its result on each iteration instead of accumulating, only the last chunk’s results appear in the final answer. This is common with custom aggregation prompts.
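
A minimal sketch of the accumulation that is needed, assuming the file has region and revenue columns (as in the examples later in this article). The function name revenue_by_region is illustrative.

```python
import io
import pandas as pd

def revenue_by_region(csv_source, chunksize=200_000):
    """Sum revenue per region across every chunk, not just the last one."""
    totals = None
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        partial = chunk.groupby("region")["revenue"].sum()
        # The bug is writing `totals = partial` here, which discards earlier chunks.
        # fill_value=0 handles regions that appear in only some chunks.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals

# Tiny in-memory stand-in for a large file, read two rows at a time.
demo = io.StringIO("region,revenue\nEU,10\nUS,5\nEU,1\nUS,2\nAPAC,7\n")
print(revenue_by_region(demo, chunksize=2))
```

When reviewing generated code, look for a variable that is updated inside the loop and read after it; a plain assignment inside the loop is the warning sign.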

5. Column-wise truncation in the display

The pandas display.max_columns option defaults to a low value (typically 20). A 50-column file prints with ... in the middle, and the model sometimes claims the hidden columns don’t exist.
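
One way to rule this out is to lift the display limit and read the column list directly instead of from the printed frame. The 50-column frame below is a made-up stand-in for the uploaded file.

```python
import io
import pandas as pd

# Hypothetical 50-column file built in memory for illustration.
header = ",".join(f"col_{i}" for i in range(50))
row = ",".join(str(i) for i in range(50))
df = pd.read_csv(io.StringIO(header + "\n" + row + "\n"))

# None removes the column limit, so printed frames show every column.
pd.set_option("display.max_columns", None)

# Independent of any display setting: the full column list and its length.
all_columns = df.columns.tolist()
print(len(all_columns), "columns")
print(all_columns)
```

Asking ChatGPT to print df.columns.tolist() in the first prompt gives it the complete column list regardless of display settings.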

Shortest path to fix

Step 1: Confirm the full file is loaded

Always start with:

import pandas as pd
df = pd.read_csv("/mnt/data/big.csv")
print("rows:", df.shape[0])
print("cols:", df.shape[1])
print("memory MB:", df.memory_usage(deep=True).sum() / 1e6)

If rows is what you expect, the file loaded fully and you can trust subsequent aggregations.
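
If you do not know the expected row count in advance, an independent count from the file itself gives something to compare against. This is a sketch using the standard csv module; the helper names are hypothetical, and it assumes a single header row and UTF-8 encoding.

```python
import csv
import pandas as pd

def count_csv_records(path, encoding="utf-8"):
    """Count data records with the csv module, excluding the header row."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header
        # csv.reader keeps quoted multi-line fields together as one record;
        # blank lines come back as empty lists and are ignored, as pandas does.
        return sum(1 for record in reader if record)

def check_fully_loaded(df, path):
    """Return (matches, expected_rows) for a DataFrame loaded from path."""
    expected = count_csv_records(path)
    return df.shape[0] == expected, expected
```

A mismatch usually means the frame was loaded with nrows= or from a truncated copy of the file.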

Step 2: Force aggregations on the full frame

Write the prompt as an aggregation request, not a summary request:

Load /mnt/data/big.csv with pandas. Print df.shape.
Then compute:
  - df["revenue"].sum()
  - df.groupby("region")["revenue"].sum()
  - df["date"].min(), df["date"].max()
Show the actual computed numbers, no sampling.

ADA then executes each line, so the answer is computed from the full data rather than from a preview.

Step 3: Stream large files with chunksize

If the file is too big to fit in memory:

import pandas as pd

total = 0
n = 0
for chunk in pd.read_csv("/mnt/data/huge.csv", chunksize=200_000):
    total += chunk["revenue"].sum()
    n += len(chunk)

print("rows processed:", n)
print("total revenue:", total)

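This works for sums, counts, and groupby aggregations that can be combined from per-chunk partial results. A mean needs both a running sum and a running count. Exact percentiles cannot be merged from per-chunk percentiles; they need the whole column in memory or an approximate method.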
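
As a sketch of both cases: the mean below is built from a running sum and count, while the median reads only the one column it needs, on the assumption that a single column fits in memory even when the whole file does not. Function names are illustrative.

```python
import io
import pandas as pd

def chunked_mean(csv_source, column, chunksize=200_000):
    """Mean of one column computed from a running sum and non-null count."""
    total, count = 0.0, 0
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()    # sum() skips missing values
        count += chunk[column].count()  # count() counts non-null values only
    return total / count if count else float("nan")

def exact_median(csv_source, column):
    """Exact median: load just this column rather than the full frame."""
    values = pd.read_csv(csv_source, usecols=[column])[column]
    return values.median()

data = "region,revenue\nEU,10\nUS,5\nEU,1\nUS,2\nAPAC,7\n"
print(chunked_mean(io.StringIO(data), "revenue", chunksize=2))
print(exact_median(io.StringIO(data), "revenue"))
```

Averaging the per-chunk means instead would weight a short final chunk the same as a full one, which is why the sum and count are tracked separately.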

Step 4: Pre-aggregate locally before upload

If memory is tight or you only need a rollup, do the aggregation on your machine:

# Run this locally, not in ChatGPT.
import pandas as pd

df = pd.read_csv("huge.csv")
# One output row per (region, quarter) pair.
rollup = df.groupby(["region", "quarter"]).agg(
    rev_sum=("revenue", "sum"),
    rev_avg=("revenue", "mean"),
    rows=("revenue", "count"),  # count skips rows where revenue is missing
).reset_index()
rollup.to_csv("rollup.csv", index=False)

Upload rollup.csv (a few hundred rows). ChatGPT can analyze it instantly and you keep raw data private.
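
One thing to watch when analyzing a rollup: the overall average is not the average of the rev_avg column, because groups have different sizes. The sketch below uses a made-up rollup with the same columns (rev_sum, rev_avg, rows) to show the difference; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical rollup with the columns produced by the local aggregation.
rollup = pd.DataFrame({
    "region": ["EU", "EU", "US"],
    "quarter": ["Q1", "Q2", "Q1"],
    "rev_sum": [300.0, 100.0, 600.0],
    "rev_avg": [100.0, 100.0, 60.0],
    "rows": [3, 1, 10],
})

# Correct overall mean: total revenue divided by total non-null rows.
overall_avg = rollup["rev_sum"].sum() / rollup["rows"].sum()

# Misleading: an unweighted mean of group means ignores group sizes.
naive_avg = rollup["rev_avg"].mean()

# Re-aggregating to a coarser level works the same way: sum first, then divide.
by_region = rollup.groupby("region")[["rev_sum", "rows"]].sum()
by_region["rev_avg"] = by_region["rev_sum"] / by_region["rows"]

print(overall_avg, naive_avg)
print(by_region)
```

Keeping both the sum and the count in the rollup is what makes this possible; a rollup that stored only averages could not be re-aggregated correctly.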

Step 5: Split into N files for parallel analysis

For files where you do need row-level access but can’t fit them all at once, split:

split -l 100000 huge.csv part_

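Note that split copies the header row only into the first part and can cut a quoted field that contains a line break, so add the header to the other parts (or state the column names in the prompt) before uploading. Upload one part at a time per question, or all parts in a Project where ChatGPT can iterate across them. Keep file count under 10 for reliable retrieval.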
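
A header-preserving alternative to split -l can be sketched in Python with the standard csv module. The function name split_csv_with_header and the part_000.csv naming are assumptions for illustration; the csv module also keeps quoted multi-line fields intact, which a line-based split does not.

```python
import csv
import os

def split_csv_with_header(path, out_dir, rows_per_part=100_000, encoding="utf-8"):
    """Split a CSV into parts of at most rows_per_part records, each with the header."""
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out, writer, rows_in_part = None, None, 0
    try:
        with open(path, newline="", encoding=encoding) as src:
            reader = csv.reader(src)
            header = next(reader, None)
            if header is None:
                return part_paths  # empty input file
            for record in reader:
                # Start a new part on the first record and whenever the current one is full.
                if writer is None or rows_in_part >= rows_per_part:
                    if out is not None:
                        out.close()
                    part_path = os.path.join(out_dir, f"part_{len(part_paths):03d}.csv")
                    out = open(part_path, "w", newline="", encoding=encoding)
                    writer = csv.writer(out)
                    writer.writerow(header)  # every part gets the header row
                    part_paths.append(part_path)
                    rows_in_part = 0
                writer.writerow(record)
                rows_in_part += 1
    finally:
        if out is not None:
            out.close()
    return part_paths
```

For example, split_csv_with_header("huge.csv", "parts") would write parts/part_000.csv, parts/part_001.csv, and so on, each loadable on its own with pd.read_csv.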

Prevention

  • For repeated analysis of large data, build the pipeline once locally with pandas / DuckDB and upload only the aggregated output to ChatGPT.
  • When uploading, always ask for df.shape and df.head() in the first prompt so you can verify the full file loaded.
  • Prefer Parquet over CSV when files exceed 100MB: Parquet files are often 5-10x smaller, much faster to read, and preserve dtypes.
  • For ad-hoc exploration, sample deliberately: df.sample(10_000, random_state=42) gives a reproducible subset that loads quickly. Treat results from it as estimates, not totals.
  • For dashboards and recurring reports, use Code Interpreter with the full file but always write the prompt as explicit aggregations, never as “summarize.”
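
To illustrate the deliberate-sampling point above, here is a small sketch on synthetic data (the revenue values are randomly generated, not real). It shows that a fixed random_state reproduces the same rows, and that a total estimated from a sample has to be scaled up by the sampling fraction.

```python
import numpy as np
import pandas as pd

# Synthetic 50,000-row frame standing in for the real file.
rng = np.random.default_rng(0)
df = pd.DataFrame({"revenue": rng.uniform(0, 100, size=50_000)})

# The same seed selects the same rows on every run.
sample_a = df.sample(10_000, random_state=42)
sample_b = df.sample(10_000, random_state=42)
same_rows = sample_a.index.equals(sample_b.index)

# A sample mean estimates the full mean directly,
# but a sample sum must be divided by the sampling fraction.
fraction = len(sample_a) / len(df)
estimated_total = sample_a["revenue"].sum() / fraction

print("same rows:", same_rows)
print("estimated total:", estimated_total)
print("actual total:", df["revenue"].sum())
```

The estimate will be close to the actual total but not equal to it, which is the reason to run final numbers on the full frame.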
