Codex Agent Silently Skips Files Containing Binary Data

Codex quietly drops PNGs, PDFs, sqlite DBs, and compiled artifacts from audits. Fix it by probing binaries (file + wc -c + sha256sum), allowlisting text extensions, and forcing a no-omit output schema.

Published: May 23, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

You ask Codex to enumerate every asset under public/ or audit every file in data/. The agent returns a list that quietly omits the PNG sprites, the sqlite seed DB, the pre-compiled .wasm blob, and the embedded font binaries. Worse, it gives no warning: the response reads as if those files do not exist. A reviewer then notices a missing item, and now you do not know how many other audits silently dropped binaries.

Fastest fix: stop telling Codex to read the directory and tell it to probe it. Have it run file, wc -c, and sha256sum on every entry first, read only the ones file reports as text, and emit a structured output where every input file must land in exactly one bucket (read_text, skipped_binary, skipped_too_large, skipped_ignored). That single change turns any silent drop into a visible count mismatch. The rest of this guide explains why the drops happen and how to convert each binary type into something the agent can actually reason about.

Two things drive the behavior, and both are working as designed as of June 2026:

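Codex’s shell enumeration leans on ripgrep, which ships inside the CLI. Ripgrep honors .gitignore / .ignore / .rgignore and skips hidden files by default, and its content search gives up on any file that contains NUL bytes. So rg --files can leave ignored paths out of a listing, and a content search such as rg <pattern> passes over binaries, all before the model ever sees them and typically with no message.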
Codex truncates command output hard: roughly 10 KiB or 256 lines, whichever comes first (it keeps the first 128 and last 128 lines and drops the middle). A “text” file that is genuinely large therefore looks half-read or skipped, which is easy to mistake for the binary case.

The fix is not to force binary reads (that produces garbage the model cannot reason over). It is to teach the agent to acknowledge binaries through different tools: hashing, metadata, and structured probes.

Common causes

Ordered by hit rate.

1. The enumeration command (rg / grep) auto-skips binaries

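This is the most common and most invisible cause. Ripgrep, which Codex uses for file listing and content search, treats any file containing NUL bytes as binary and stops searching it by default. If your prompt told Codex to “grep the directory” and report what it found, binaries were filtered out before the model saw them, with no message. A plain rg --files listing is a related trap: it does not inspect file contents, but it does leave out hidden files and anything matched by an ignore rule (see cause 7).

How to spot it: ask the agent to re-list with ls -la (or find <dir> -type f) instead of rg. If the file appears under ls/find but not in the agent’s previous summary, the search tool dropped it. To prove ripgrep's binary detection is the culprit, repeat the agent’s content search with --binary added and compare the matched files; if binaries come back, that was the filter. If the file is missing even from rg --files, compare against rg --files --no-ignore --hidden to see whether an ignore rule or the hidden-file filter removed it.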

2. The read tool errors on non-UTF-8 bytes and the agent treats it as “not there”

When the model does try to read a binary directly, the read returns an error or an empty body for non-text bytes. The agent does not retry with a different tool; it just moves on, and the file disappears from its mental model.

How to spot it: if a file shows up in ls/find but the agent claims it “could not be read” or never mentions it after an explicit read attempt, you are in this bucket. Run file <path> to confirm it is non-text.

3. The file extension is unknown to the agent’s text heuristics

Formats such as .parquet, .arrow, .msgpack, .protobuf, and custom .bin are not on most agents’ “text” allowlists. Even a JSON file saved as .dat can get treated as binary and skipped.

How to spot it: compare extensions against the text files the agent did pick up. Anything outside the usual .md .txt .json .yaml .ts .js .py .go .rs .toml set may have been silently skipped.

4. File has a BOM or non-UTF-8 encoding

A UTF-16 LE file, a Latin-1 file, or a UTF-8-with-BOM file can trip strict UTF-8 readers. The agent skips it as an encoding error without surfacing the issue. (Note: recent Codex builds now warn on invalid UTF-8 specifically when loading AGENTS.md instruction files instead of dropping them silently, but that warning does not extend to arbitrary files you ask it to read.)

How to spot it: file path/to/file shows UTF-16 or ISO-8859. The strict reader probably refused it.
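When the file utility is not available, the same check can be approximated in a few lines. The sketch below is a minimal stand-in, not what Codex does internally: sniff_encoding and its labels are hypothetical names, and it only recognizes BOMs, NUL bytes, and strict UTF-8 validity. Pass it the whole file (or a sample that does not end mid-character), since a cut multi-byte sequence would be reported as non-UTF-8.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Rough classification of a byte string (hypothetical labels).

    Returns one of: "utf-16", "utf-8-sig", "binary", "utf-8", "unknown-8bit".
    """
    # A UTF-16 BOM must be checked before the NUL test, because UTF-16 text
    # is full of NUL bytes. (A UTF-32 LE BOM also starts with FF FE and
    # would be reported as "utf-16" by this simple check.)
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # NUL bytes are the usual "this is binary" heuristic.
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1, Windows-1252, and similar single-byte encodings land here.
        return "unknown-8bit"
```

Files reported as "utf-16" or "unknown-8bit" are candidates for transcoding with iconv rather than for the binary bucket.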

5. File is text but exceeds the output truncation cap

A 30 MB CSV is text, but Codex truncates a single command’s output to about 10 KiB / 256 lines (first 128 + last 128 lines). cat data.csv returns the head and tail with the middle elided, so the agent reasons over a fraction of the file and may conclude the file is “mostly empty” or skip it entirely.

How to spot it: wc -c file returns a large number and the agent’s summary covers only the top and bottom of the file, or omits it. There is no native --max-file-size knob in the stable CLI as of June 2026; a configurable per-command output limit was requested (issues #5913 and #6426) but has not shipped, so you control this by slicing the file yourself (Step 5).
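To see what the agent is left with after truncation, the head-plus-tail behavior described above can be simulated. This is only a sketch under the article's stated numbers (about 10 KiB, 256 lines, first 128 plus last 128); the function names and the placeholder marker line are hypothetical, and the real CLI's marker text may differ.

```python
def needs_slicing(text: str, max_bytes: int = 10 * 1024, max_lines: int = 256) -> bool:
    """True if the text would exceed either cap quoted in this guide."""
    return len(text.encode("utf-8")) > max_bytes or len(text.splitlines()) > max_lines


def head_tail_view(text: str, keep: int = 128) -> list[str]:
    """Approximate what survives: the first and last `keep` lines."""
    lines = text.splitlines()
    if len(lines) <= 2 * keep:
        return lines
    omitted = len(lines) - 2 * keep
    # Hypothetical marker; it only stands in for the elided middle.
    marker = f"... {omitted} lines omitted ..."
    return lines[:keep] + [marker] + lines[-keep:]
```

Running head_tail_view on a sample of your own data shows quickly whether the rows that matter sit in the elided middle.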

6. Symlinked binary or Git LFS pointer

Git LFS stores a small text pointer (around 130 bytes) in the repo and keeps the real binary on the LFS server. Codex reads the pointer, sees a tiny non-content blob, and treats it as junk.

How to spot it: cat file.psd shows version https://git-lfs.github.com/spec/v1 followed by an oid sha256: line. The real binary was never fetched.
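The same check can be scripted across a whole asset folder. The following is a minimal sketch: is_lfs_pointer is a hypothetical helper, and the 1024-byte ceiling is an assumption based on the Git LFS pointer format keeping pointer files small.

```python
import os

# First line of every Git LFS pointer file.
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: str, max_pointer_bytes: int = 1024) -> bool:
    """True if `path` looks like an unfetched Git LFS pointer."""
    # Real assets are almost always larger than a pointer, so skip them cheaply.
    if os.path.getsize(path) > max_pointer_bytes:
        return False
    with open(path, "rb") as fh:
        head = fh.read(max_pointer_bytes)
    return head.startswith(LFS_SPEC_LINE) and b"\noid sha256:" in head
```

Any path for which this returns True should be reported as an unfetched pointer rather than as the asset itself.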

7. `.gitignore` filters the file out of the agent’s view

Because the search path runs through ripgrep, anything matched by .gitignore, .ignore, or .rgignore is invisible to rg-based enumeration. If you forget that an ignore rule is in effect, you will assume the agent saw a file it never did.

How to spot it: git check-ignore -v path/to/file returns the matching rule. Note that as of June 2026 .codexignore is not honored by Codex (open issues #6530, #2847, and the still-open feature request #24993) — only the ripgrep ignore files and .gitignore actually hide a file, so do not rely on .codexignore to either expose or hide anything.

Which bucket am I in?

Symptom	Likely cause	First check
File in `ls`/`find` but never in agent output, no read attempt	Search tool auto-skip (#1) or ignore rule (#7)	`rg --files <dir>` vs `find <dir> -type f`; `git check-ignore -v <path>`
Agent says “could not read” / read returned empty	Non-UTF-8 read error (#2) or odd encoding (#4)	`file <path>`
Only certain extensions vanish	Extension not on text allowlist (#3)	Diff skipped vs read extensions
Big text file half-summarized or dropped	Output truncation (#5)	`wc -c <path>`
Tiny “asset” with a URL inside	LFS pointer (#6)	`cat <path>` shows `git-lfs` spec line

Before you start

Run find . -size +1M -type f | head to list what is plausibly being skipped.
Run file path/to/each to confirm which are actually binary versus misclassified text.
Run rg --files <dir> | wc -l and find <dir> -type f | wc -l and compare. A gap is the number of files ripgrep is hiding through ignore rules or its hidden-file filter.
Decide what the agent actually needs about each binary: existence + hash, dimensions, schema, or a full byte read (rarely).

Information to collect

Full ls -la <dir> (or find <dir> -type f) output versus the agent’s perceived file list — diff to find the dropped entries.
For each skipped file: extension, file output, byte size, and whether it is an LFS pointer.
The agent transcript section where the file should have appeared but did not.
Project .gitignore, .ignore, .rgignore, and .gitattributes (the last tells you which extensions are LFS-tracked).
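The first item, diffing the real file list against what the agent reported, is easy to automate. Here is a minimal sketch; list_all_files and dropped_entries are hypothetical helper names, and agent_reported is assumed to be a list of paths relative to the directory, written with forward slashes.

```python
from pathlib import Path


def list_all_files(root: str) -> set[str]:
    """Every regular file under root, roughly like `find <root> -type f`.

    No ignore rules and no binary filter are applied; symlinks are left out.
    """
    base = Path(root)
    return {
        p.relative_to(base).as_posix()
        for p in base.rglob("*")
        if p.is_file() and not p.is_symlink()
    }


def dropped_entries(root: str, agent_reported: list[str]) -> list[str]:
    """Files that exist on disk but never appeared in the agent's list."""
    return sorted(list_all_files(root) - set(agent_reported))
```

An empty result means the agent's list is complete; anything else is the set of files to run file against next.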

Step-by-step fix

Ordered by ROI.

Step 1: Give the agent a binary-aware probe tool first

Replace “read this file” with “probe this file.” Put this in your prompt or AGENTS.md:

For each file under data/ (enumerate with: find data -type f):
1. Run: file <path>        # MIME-ish description
2. Run: wc -c <path>       # size in bytes
3. Run: sha256sum <path>   # content hash
4. If `file` output contains "ASCII text", "UTF-8", "JSON", or "Unicode text" → read content
5. Otherwise → record {path, type, size, sha256} only; do not read bytes

Enumerating with find (not rg --files) is deliberate: find does not auto-skip binaries or honor ignore files. The agent now knows binaries exist, what they are, and their fingerprint, without choking on bytes.
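If you would rather hand the agent one script than three commands per file, the probe can be sketched in Python. This is an illustration under assumptions, not a Codex feature: probe and probe_tree are hypothetical names, and a NUL-byte check on the first 8 KiB stands in for the file command, so UTF-16 text would be marked "binary" here.

```python
import hashlib
import os


def probe(path: str, sample_size: int = 8192) -> dict:
    """Record size, sha256, and a text/binary guess without printing any bytes."""
    sha = hashlib.sha256()
    size = 0
    sample = b""
    with open(path, "rb") as fh:
        # Stream in chunks so multi-gigabyte assets never sit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            if not sample:
                sample = chunk[:sample_size]
            sha.update(chunk)
            size += len(chunk)
    kind = "binary" if b"\x00" in sample else "text"
    return {"path": path, "size": size, "sha256": sha.hexdigest(), "kind": kind}


def probe_tree(root: str) -> list[dict]:
    """Probe every file under root; os.walk applies no ignore rules."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            records.append(probe(os.path.join(dirpath, name)))
    return records
```

The returned records map directly onto the {path, type, size, sha256} entries the prompt asks for.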

Step 2: Allowlist extensions explicitly

In the task prompt or AGENTS.md:

Text extensions to read: .md .txt .json .jsonl .yaml .yml .toml .ini
  .ts .tsx .js .jsx .mjs .cjs .py .rb .go .rs .java .kt .swift
  .html .css .scss .sql .sh .dockerfile .env.example

All other extensions: record path + size + sha256 only.
DO NOT attempt to read them.
DO NOT omit them from the output. Include each with a "binary" marker.

This eliminates silent drops — every file appears in the output, either with content or with a binary marker.
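The allowlist rule can also be expressed as a tiny routing function, which is handy for checking the agent's output afterwards. This is a sketch only: bucket_for is a hypothetical name, the extension list is copied from the prompt above, and matching is by filename suffix, so an extensionless file such as Dockerfile would fall into the binary bucket unless you add it.

```python
import os

# Extensions copied from the allowlist prompt above.
TEXT_EXTENSIONS = (
    ".md", ".txt", ".json", ".jsonl", ".yaml", ".yml", ".toml", ".ini",
    ".ts", ".tsx", ".js", ".jsx", ".mjs", ".cjs", ".py", ".rb", ".go",
    ".rs", ".java", ".kt", ".swift", ".html", ".css", ".scss", ".sql",
    ".sh", ".dockerfile", ".env.example",
)


def bucket_for(path: str) -> str:
    """Route a path to "read_text" or "skipped_binary" by suffix alone."""
    name = os.path.basename(path).lower()
    # endswith handles multi-part suffixes such as ".env.example".
    return "read_text" if name.endswith(TEXT_EXTENSIONS) else "skipped_binary"
```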

Step 3: Convert binaries to text representations the agent can read

For files where the content matters, transcode to text first and feed that in:

Binary type	Conversion command	Output
PNG / JPG	`identify -verbose img.png`	dimensions, colorspace, metadata
PDF	`pdftotext file.pdf -`	extracted text
sqlite	`sqlite3 db .schema`	schema DDL
parquet	`python -c "import pyarrow.parquet as pq; print(pq.read_schema('file.parquet'))"`	column schema
wasm	`wasm-objdump -h file.wasm`	section headers
zip / tar	`unzip -l file.zip` or `tar -tf file.tar`	archive manifest

Keep each conversion under the ~10 KiB output cap (for example pdftotext file.pdf - | head -200) so Codex does not truncate the result mid-stream. The agent then reasons about content, not bytes.
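Two of the conversions in the table need no extra tools if Python is available, since sqlite3 and zipfile ship with the standard library. A minimal sketch, with sqlite_schema and zip_manifest as hypothetical helper names:

```python
import sqlite3
import zipfile
from pathlib import Path


def sqlite_schema(db_path: str) -> list[str]:
    """Return the CREATE statements of a sqlite file, opened read-only."""
    # mode=ro keeps the probe from creating or modifying the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]


def zip_manifest(zip_path: str) -> list[tuple[str, int]]:
    """Return (member name, uncompressed size) pairs without extracting."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

Both return short, line-oriented text, which makes it easier to stay under the output cap than dumping the raw tool output.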

Step 4: Resolve LFS pointers before the agent runs

If .gitattributes lists LFS-tracked extensions:

git lfs install
git lfs pull

Verify with file my-asset.psd — it should now say Adobe Photoshop Image instead of ASCII text. Without this, the agent only ever sees pointers. Pre-pull in CI rather than inside the agent loop so a multi-gigabyte fetch does not blow the session budget.

Step 5: Slice large text files before reading

If wc -c data.csv is large (anything past roughly 10 KiB or 256 lines will already be truncated by the output cap):

head -200 data.csv > data.head.csv
tail -200 data.csv > data.tail.csv
shuf -n 200 data.csv > data.sample.csv

Have the agent read the head, tail, and sample. For a 30 MB CSV this is 600 lines instead of 30 MB and usually answers any structure question without tripping truncation.

Step 6: Force a no-omit output schema

Make Codex emit a structure that physically cannot drop a file:

Output JSON:
{ "read_text": [...], "skipped_binary": [...], "skipped_too_large": [...], "skipped_ignored": [...] }

Every file from the input directory must appear in exactly one array.
If the total across arrays != the count from `find <dir> -type f | wc -l`, that is an error: re-run.

The schema makes silent drops impossible. Any drop becomes a visible count mismatch the agent must reconcile.
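The reconciliation does not have to be left to the agent; the check is mechanical. Below is a minimal sketch that assumes each array holds plain path strings (if your schema stores objects, compare their path fields instead). check_no_omit is a hypothetical name, and expected_files is the set of paths produced by find <dir> -type f.

```python
BUCKETS = ("read_text", "skipped_binary", "skipped_too_large", "skipped_ignored")


def check_no_omit(report: dict, expected_files: set[str]) -> list[str]:
    """Return a list of problems; an empty list means nothing was dropped."""
    problems = []
    seen: dict[str, str] = {}  # path -> first bucket it appeared in
    for bucket in BUCKETS:
        for path in report.get(bucket, []):
            if path in seen:
                problems.append(f"duplicate: {path} in {seen[path]} and {bucket}")
            else:
                seen[path] = bucket
    # Files on disk that the agent never placed in any bucket.
    for path in sorted(expected_files - seen.keys()):
        problems.append(f"missing: {path}")
    # Paths the agent reported that do not exist on disk.
    for path in sorted(seen.keys() - expected_files):
        problems.append(f"unexpected: {path}")
    return problems
```

Feeding the problem list back to the agent as its re-run instruction closes the loop without a human diffing the output.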

How to confirm it’s fixed

Diff the agent’s reported file list against find <dir> -type f. No entry should be missing.
Confirm the array total matches find <dir> -type f | wc -l exactly.
For each skipped_binary entry, confirm the recorded sha256 matches sha256sum <path>.
Add a fresh binary (e.g. head -c 4096 /dev/urandom > <dir>/probe.bin) and re-run — it must surface under skipped_binary, not disappear.

Long-term prevention

Default every file-survey prompt to the read_text + skipped_binary + skipped_too_large + skipped_ignored schema; never let an implicit drop be an option.
Add a probe step (file + wc -c + sha256sum) before any read step in your AGENTS.md templates.
Enumerate with find -type f, not rg --files, when you need every file to show up; reserve rg for text-only content search.
Maintain a project list of “binary-but-useful” formats with their conversion commands (PDF → pdftotext, sqlite → schema, etc.).
Run git lfs pull in any sandbox where agents will examine assets.
Audit .gitignore / .ignore / .rgignore quarterly so you know what ripgrep (and therefore the agent) is forbidden from seeing — and remember .codexignore does nothing today.
For RAG indexing or fine-tuning pipelines, add an explicit “binaries versus text” sorting step before the model ever sees the corpus.

Common pitfalls

Base64-ing a 4 MB binary into the prompt to “let it see” — wastes context, and the model cannot reason over raw bytes anyway.
Assuming .json is always safe to read — a 50 MB JSON dump blows past the 10 KiB output cap and gets truncated to head + tail.
Relying on .codexignore to hide secrets or noise from the agent — Codex does not honor it yet (issue #6530); put the rule in .gitignore or .ignore instead.
Forgetting that anything in .gitignore is invisible to rg-based enumeration, so .env, dist/, and build artifacts never reach the agent.
Treating UTF-16 files as “binary” forever — a quick iconv -f UTF-16 -t UTF-8 file fixes it.
Putting a giant git lfs pull inside the agent loop instead of pre-pulling in CI.

FAQ

Q: Codex says “file not found” for a file I can cat in the shell. Why?

Usually it tried to read the file and the encoding tripped the read tool, which surfaced as “not found” or an empty body. Run file <path> — if it is non-UTF-8, transcode with iconv first, then read.

Q: My agent enumerated a directory with rg --files and the binaries are missing. Is that a bug?

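No, that is ripgrep working as designed. Its content search skips binary files, and rg --files leaves out hidden files and anything matched by .gitignore, .ignore, or .rgignore. Re-enumerate with find <dir> -type f (or rg --files --no-ignore --hidden) when you need every file listed.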

Q: My agent reads a .png and prints a wall of garbled output. How do I stop that?

You have a permissive read path that dumped the raw bytes. Allowlist text extensions explicitly (Step 2) and route everything else to the probe tool (Step 1).

Q: Why not just hash and describe every file regardless of type?

You can — it is fast and safe. Most templates still read text content because source code is more useful read than hashed. But for public/, data/, and assets/, probe-only is the right default.

Q: Does Codex have a --max-file-size flag to raise the limit?

Not in the stable CLI as of June 2026. Output is capped at roughly 10 KiB / 256 lines per command (first 128 + last 128 lines), and a configurable per-command output limit has been requested (issues #5913 and #6426) but has not shipped. For now, slice large text files yourself (Step 5).

Q: Will .codexignore keep the agent away from a folder?

No. As of June 2026 Codex does not read .codexignore (issues #6530 / #2847). Use .gitignore or .ignore (which ripgrep honors) to hide a path, or list the exclusion in AGENTS.md as a soft instruction.

Tags: #Codex #agent #Troubleshooting #binary-files #encoding