Codex Agent Skips Files Containing Binary Data

Codex silently skips PNGs, PDFs, SQLite databases, and compiled artifacts. The fix is to allowlist text extensions, hash binaries instead of reading them, and give Codex a binary-aware tool.

You ask Codex to enumerate every asset under public/ or audit every file in data/. The agent returns a list that quietly omits the PNG sprites, the sqlite seed DB, the pre-compiled .wasm blob, the embedded font binaries. Worse, it gives no warning — the response reads as if those files do not exist. Then a reviewer notices a missing item, and now you do not know how many other audits silently dropped binaries.

Codex’s read tools default to UTF-8 text. A file with non-text bytes either errors quietly, is described as “binary file, skipped”, or is summarized as “[redacted: 4MB binary]” — and the agent treats it as unimportant. The fix is not to force binary reads (that produces garbage), but to teach the agent to acknowledge binaries via different tools: hashing, metadata, structured probes.

Common causes

Ordered by hit rate.

1. Read tool errors on non-UTF-8 bytes and the agent treats it as “not there”

Most agent read tools throw or return an empty body for non-text bytes. The agent does not retry with a different tool — it just moves on, and the file disappears from its mental model.

How to spot it: Ask the agent to re-list with explicit ls -la shell output. If the file appears in ls but not in the agent’s previous summary, it was silently dropped.

2. The file extension is unknown to the agent’s heuristics

.parquet, .arrow, .msgpack, .protobuf, custom .bin — these are not on most agents’ “text” allowlists. Even a JSON file with a .dat extension can get treated as binary.

How to spot it: Compare extensions against the actually-text files the agent did pick up. Anything outside .md .txt .json .yaml .ts .js .py .go .rs .toml may be silently skipped.

3. File contains a BOM or mixed encoding

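A UTF-16 LE file or a Latin-1 file fails a strict UTF-8 decode. A UTF-8 file with a BOM is valid UTF-8, but the leading marker can still trip readers or parsers that do not expect it. The agent skips the file as “encoding error” without surfacing the issue.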

How to spot it: file path/to/file shows UTF-16 or ISO-8859. Codex’s reader probably refused it.
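The same check can be sketched in Python for cases where the file command is not available. This is a minimal illustration, not what Codex's reader does: the function name sniff_encoding is hypothetical, and it only recognizes BOM-marked files and plain UTF-8. Latin-1 cannot be detected reliably from bytes alone, so it is reported as None here.

```python
import codecs
from typing import Optional

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes) -> Optional[str]:
    """Return a Python codec name likely to decode `data`, or None if unknown."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be Latin-1, a BOM-less UTF-16 file, or a real binary.
        return None
```

A None result is the signal to record the file as "needs transcoding or probing" rather than to drop it.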

4. File is technically text but exceeds the read tool’s size cap

A 30 MB CSV is text, but many agent read tools cap reads somewhere in the 1-5 MB range and bail. The bail looks identical to the binary case: a silent skip.

How to spot it: wc -c file returns > 5_000_000 and the file is missing from the agent’s output.

5. Symlinked binary or LFS pointer

Git LFS stores a small text pointer (roughly 130 bytes) in the repo and the actual binary on the LFS server. Codex reads the pointer, sees a tiny non-content blob, and treats it as junk.

How to spot it: cat file.psd shows version https://git-lfs.github.com/spec/v1 followed by an OID. The real binary was never fetched.
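A pointer check like the one above is easy to automate before the agent runs. The sketch below assumes only what the pointer format shows (the version line at the start of the file); the helper name is_lfs_pointer and the 1024-byte cutoff are choices made for this example.

```python
from pathlib import Path

LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """True if the file looks like an unfetched Git LFS pointer."""
    # Pointers are tiny, so skip anything large without reading it.
    if path.stat().st_size > 1024:
        return False
    return path.read_bytes().startswith(LFS_HEADER)
```

Running this over the asset directories gives a list of files that need git lfs pull before any audit is meaningful.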

6. .gitignore / .codexignore silently filters the file

Project ignore files mask binaries from agent tooling on purpose. But when you forget the ignore is in effect, you assume the agent saw the file.

How to spot it: git check-ignore -v path/to/file returns a matching rule. The agent obeyed the ignore file.

Before you start

  • Run find . -size +1M -type f | head to list what is plausibly being skipped.
  • Run file path/to/each to confirm which are actually binary vs misclassified.
  • Decide what the agent needs to know about each binary: existence + hash, dimensions, schema, or full byte read.

Information to collect

  • Full output of ls -la <dir> vs the agent’s perceived file list — diff to find skipped entries.
  • For each skipped file: extension, file output, size, whether it is an LFS pointer.
  • The agent transcript section where the file should have appeared but did not.
  • Project .gitignore, .codexignore, and any custom ignore configs.

Step-by-step fix

Ordered by ROI.

Step 1: Give the agent a binary-aware probe tool first

Replace “read this file” with “probe this file”:

For each file in data/:
1. Run: file -b <path>  (brief type description, without the filename prefix)
2. Run: wc -c <path>  (get size in bytes)
3. Run: sha256sum <path>  (get content hash)
4. If the file -b description contains "ASCII text", "UTF-8 text", or "JSON" → read content
5. Otherwise → record {path, type, size, sha256} only

The agent now knows binaries exist, what they are, and their fingerprint — without choking on bytes.
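If you would rather expose the probe as a single tool than as three shell commands, it could look roughly like the sketch below. The names probe and looks_like_text are hypothetical, and the text check (no NUL bytes, sample decodes as UTF-8) is a simple heuristic standing in for the file command, not a reimplementation of it.

```python
import codecs
import hashlib
from pathlib import Path


def looks_like_text(head: bytes) -> bool:
    """Heuristic: no NUL bytes and the sample decodes as UTF-8."""
    if b"\x00" in head:
        return False
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True


def probe(path: Path, sniff_bytes: int = 8192) -> dict:
    """Return path, kind, size and sha256 without decoding the whole file."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        head = fh.read(sniff_bytes)
        digest.update(head)
        size = len(head)
        # Stream the rest in chunks so large files never sit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
            size += len(chunk)
    kind = "text" if looks_like_text(head) else "binary"
    return {"path": str(path), "kind": kind, "size": size, "sha256": digest.hexdigest()}
```

Only entries with kind "text" would then go on to a content read; everything else stays in the output as a record.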

Step 2: Allowlist extensions explicitly

In the task prompt:

Text extensions to read: .md .txt .json .jsonl .yaml .yml .toml .ini
  .ts .tsx .js .jsx .mjs .cjs .py .rb .go .rs .java .kt .swift
  .html .css .scss .sql .sh .dockerfile .env.example

All other extensions: record path + size + sha256 only.
DO NOT attempt to read.
DO NOT omit from output. Include with a "binary" marker.

This eliminates silent drops — every file appears in output, with content or with a binary marker.
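The allowlist can also be enforced in code before the prompt is built, so the agent receives a manifest that is already complete. A minimal sketch, assuming the suffix list from the prompt above; classify and manifest are hypothetical helper names, and matching is done on the end of the file name so that multi-dot entries such as .env.example work.

```python
import os

# Suffixes copied from the prompt above; anything else is treated as binary.
TEXT_SUFFIXES = (
    ".md", ".txt", ".json", ".jsonl", ".yaml", ".yml", ".toml", ".ini",
    ".ts", ".tsx", ".js", ".jsx", ".mjs", ".cjs", ".py", ".rb", ".go",
    ".rs", ".java", ".kt", ".swift", ".html", ".css", ".scss", ".sql",
    ".sh", ".dockerfile", ".env.example",
)


def classify(path: str) -> str:
    """Return "text" for allowlisted names, "binary" for everything else."""
    name = os.path.basename(path).lower()
    return "text" if name.endswith(TEXT_SUFFIXES) else "binary"


def manifest(paths):
    """One entry per input path, so nothing can be omitted from the output."""
    return [{"path": p, "marker": classify(p)} for p in paths]
```

Because manifest returns exactly one entry per input, a missing file can only come from the directory listing, not from the classification step.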

Step 3: Convert binaries to text representations the agent can read

For files where content matters:

Binary type | Conversion command | Output
PNG / JPG | identify -verbose img.png | dimensions, colorspace, metadata
PDF | pdftotext file.pdf - | extracted text
sqlite | sqlite3 db .schema | schema DDL
parquet | parquet-tools schema file.parquet | column schema
wasm | wasm-objdump -h file.wasm | section headers
zip / tar | unzip -l file.zip (or tar -tf file.tar) | manifest

Feed the converted text into the agent. It now reasons about content, not bytes.
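One way to keep these conversions in a single place is a small lookup keyed by extension. The sketch below only builds the command; it does not run it. The function name conversion_command and the extension-to-tool mapping (for example treating .db as sqlite) are assumptions for illustration, and each tool must be installed separately.

```python
import os
from typing import Callable, Dict, List, Optional

# Extension -> function that builds the argv for the matching converter.
CONVERTERS: Dict[str, Callable[[str], List[str]]] = {
    ".png": lambda p: ["identify", "-verbose", p],
    ".jpg": lambda p: ["identify", "-verbose", p],
    ".pdf": lambda p: ["pdftotext", p, "-"],
    ".sqlite": lambda p: ["sqlite3", p, ".schema"],
    ".db": lambda p: ["sqlite3", p, ".schema"],  # assumption: .db means sqlite
    ".parquet": lambda p: ["parquet-tools", "schema", p],
    ".wasm": lambda p: ["wasm-objdump", "-h", p],
    ".zip": lambda p: ["unzip", "-l", p],
    ".tar": lambda p: ["tar", "-tf", p],
}


def conversion_command(path: str) -> Optional[List[str]]:
    """Return the argv that turns `path` into text, or None if no converter is known."""
    ext = os.path.splitext(path)[1].lower()
    build = CONVERTERS.get(ext)
    return build(path) if build else None
```

The returned list can be passed to subprocess.run(cmd, capture_output=True, text=True). A None result means the file falls back to the probe-only record.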

Step 4: Resolve LFS pointers before agent runs

If .gitattributes lists LFS-tracked extensions:

git lfs install
git lfs pull

Verify with file my-asset.psd — should now say Photoshop image instead of ASCII text. Without this, the agent only ever sees pointers.

Step 5: Split large text files before reading

If wc -c data.csv exceeds 5 MB:

head -200 data.csv > data.head.csv
tail -200 data.csv > data.tail.csv
shuf -n 200 data.csv > data.sample.csv

Have the agent read head + tail + sample. For a 30 MB CSV, this is 600 lines instead of 30 MB and answers most structure questions. Only data.head.csv is guaranteed to start with the header row, so have the agent read it first.
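Where shuf is not available (it is a GNU coreutils tool), the same three slices can be taken in one pass in Python. This sketch swaps shuf for reservoir sampling (Algorithm R), which also yields a uniform sample without loading the file. The function name sample_lines and the fixed seed are choices made for this example.

```python
import random
from collections import deque


def sample_lines(path, n=200, seed=0):
    """Return (head, tail, sample) line lists from a single pass over the file."""
    rng = random.Random(seed)
    head, tail, sample = [], deque(maxlen=n), []
    with open(path, encoding="utf-8", errors="replace", newline="") as fh:
        for i, line in enumerate(fh):
            if i < n:
                head.append(line)
            tail.append(line)  # deque keeps only the last n lines
            # Reservoir sampling: line i replaces a slot with probability n/(i+1).
            if len(sample) < n:
                sample.append(line)
            else:
                j = rng.randint(0, i)
                if j < n:
                    sample[j] = line
    return head, list(tail), sample
```

Memory use stays proportional to n regardless of file size, which is the point when the input is tens of megabytes.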

Step 6: Surface skipped files in the agent’s output schema

Force a structured output that cannot omit binaries:

Output JSON: { read_text: [...], skipped_binary: [...], skipped_too_large: [...], skipped_ignored: [...] }

Every file in the input directory must appear in exactly one array.
Mismatched total = error. Re-run.

The schema does not prevent a drop by itself, but together with the count check it makes every drop detectable: a file missing from all four arrays shows up as a total mismatch.
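The "exactly one array" rule is mechanical enough to check in code instead of by eye. A minimal sketch, assuming each array entry is either a path string or an object with a "path" key (the schema above does not specify the entry shape); check_report is a hypothetical name.

```python
from collections import Counter

BUCKETS = ("read_text", "skipped_binary", "skipped_too_large", "skipped_ignored")


def _path(entry):
    # Entries may be plain strings or objects such as {"path": ..., "sha256": ...}.
    return entry["path"] if isinstance(entry, dict) else entry


def check_report(report: dict, expected_paths: set) -> list:
    """Return a list of problems; an empty list means every file is in exactly one bucket."""
    counts = Counter(_path(e) for b in BUCKETS for e in report.get(b, []))
    seen = set(counts)
    problems = ["missing: %s" % p for p in sorted(expected_paths - seen)]
    problems += ["unexpected: %s" % p for p in sorted(seen - expected_paths)]
    problems += ["duplicated: %s" % p for p, c in sorted(counts.items()) if c > 1]
    return problems
```

A non-empty result is the "Mismatched total = error" case: feed the problem list back to the agent and re-run.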

Verify

  • Diff the agent’s reported file list against find <dir> -type f. No entry should be missing.
  • For each skipped_binary entry, confirm the recorded sha256 matches sha256sum <path>.
  • Try the same task with a fresh binary added to the directory — confirm it surfaces in skipped_binary rather than disappearing.
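The first two checks can be scripted. The sketch below assumes skipped_binary entries carry "path" (relative to the surveyed directory) and "sha256" keys, which the source implies but does not spell out; list_files and stale_hashes are hypothetical names.

```python
import hashlib
from pathlib import Path


def list_files(root) -> set:
    """Relative POSIX paths of every regular file under root, similar to find -type f."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def stale_hashes(root, skipped_binary) -> list:
    """Return paths whose recorded sha256 does not match the file on disk."""
    bad = []
    for entry in skipped_binary:
        # read_bytes loads the whole file; fine for a sketch, stream for very large assets.
        actual = hashlib.sha256((Path(root) / entry["path"]).read_bytes()).hexdigest()
        if actual != entry["sha256"]:
            bad.append(entry["path"])
    return bad
```

Comparing list_files(root) with the paths in the agent's report covers the first bullet; an empty stale_hashes result covers the second.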

Long-term prevention

  • Default every agent file-survey prompt to the read_text + skipped_binary schema; never let “implicit drop” be an option.
  • Add a probe step (file + wc -c + sha256sum) before any read step in agent templates.
  • Maintain a project list of “binary-but-useful” formats with their conversion commands (PDF → pdftotext, sqlite → schema, etc.).
  • Run git lfs pull in any sandbox where agents will examine assets.
  • Audit .codexignore / .gitignore quarterly so you know what the agent is forbidden from seeing.
  • For LLM training / RAG indexing pipelines, add an explicit “binaries vs text” sorting step before the LLM ever sees the corpus.

Common pitfalls

  • Forcing the agent to base64 a 4 MB binary into the prompt to “let it see”: this wastes context and gives no real insight, because the model cannot reason over raw bytes.
  • Assuming .json is always safe to read — a 50 MB JSON dump exceeds the read cap and gets silently skipped.
  • Forgetting that .env files are routinely ignored — agent never sees them, so audits miss env-driven config.
  • Treating UTF-16 files as “binary” forever; a quick iconv -f UTF-16 -t UTF-8 fixes it.
  • Adding a giant LFS pull into the agent loop instead of pre-pulling in CI.

FAQ

Q: Codex says “file not found” for a file I can cat in shell. Why?

Most likely it tried to read the file, the encoding tripped the read tool, and the failure surfaced as “not found”. Run file on it; if the result is not UTF-8 or ASCII, transcode first.

Q: My agent reads a .png file as huge garbled output. How do I stop that?

You probably have a permissive read tool that base64-encodes binaries. Allowlist text extensions explicitly and route binaries to the probe tool.

Q: Why not just hash + describe every file regardless of type?

You can. It is fast and safe. Most templates do not because reading text content is more useful for source code, but for public/, data/, and assets/, probe-only is the right default.

Q: Does setting --max-file-size higher help?

For large text files yes (cause #4). For non-text binaries (causes #1, #2, #3, #5), no — the tool still cannot parse the bytes.
