You ask Codex to “refactor the billing module.” It returns a PR touching 50 files and 1,500 lines. The diff is plausible and every change is defensible, but nobody on your team will review 1,500 lines of agent-written code. The PR sits open for a week, conflicts with main, and eventually gets closed with “let’s break this up.”
Large PRs from Codex aren’t a model problem. They’re a scope problem: the task description authorized everything Codex touched. The fix is bounding diff size at three layers: the prompt (one concern per task), AGENTS.md (per-PR budget), and the verifier (CI gate that fails over-large diffs).
Common causes
Ordered by hit rate, highest first.
1. The task was open-ended (“refactor X”)
“Refactor the billing module” / “improve performance” / “clean up the auth layer” — these have no natural stopping point. Codex keeps editing until it runs out of obvious improvements, not when it hits a meaningful chunk.
How to spot it: Re-read your prompt. If it contains “refactor”, “improve”, “clean up”, “modernize”, or any other open verb without a scope boundary, you authorized infinity.
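The re-read can be roughed out as a lint. This is a minimal sketch, assuming the prompt is available as plain text on stdin; `has_open_verb` and its word list are illustrative names, not part of Codex or any existing tool. A hit only means an open verb is present; whether the prompt also sets a scope boundary still needs a human read.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every open-ended verb found in a task prompt
# read from stdin. The verb list mirrors the examples above; extend it.
has_open_verb() {
  # -E extended regex, -i ignore case, -o print only matches, -w whole words
  grep -Eiow 'refactor|improve|clean up|modernize'
}

# Exit status 0 means at least one open-ended verb was found.
if printf '%s\n' 'Refactor the billing module' | has_open_verb >/dev/null; then
  echo 'open-ended prompt: add a scope boundary'
fi
```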
2. Codex did opportunistic adjacent cleanup
Original task: fix a typo in billing.ts. Codex finished in 4 lines, looked around, noticed inconsistent imports in the same directory, and “helpfully” normalized them across 30 files.
How to spot it: The diff has a small core change and a large halo of “while I was here” edits. Halo files don’t touch the bug; they were edited for consistency.
3. AGENTS.md doesn’t cap diff size
Without an explicit “max diff per PR” rule, Codex treats every plausible edit as in-scope. Most teams have an implicit norm; Codex can only follow what is written down.
How to spot it: grep -i "diff\|pr size\|scope" AGENTS.md — empty means no cap.
4. The refactor mixed independent concerns
The PR renames a function, adds new tests, fixes 3 unrelated bugs found along the way, and updates docs. Each piece is small; together they’re a 1,500-line review burden.
How to spot it: The PR description (or commit list) has 4+ independent bullets. Each could be its own PR.
5. Codex didn’t propose a plan first
When asked for a big change directly, Codex jumps to code. No plan = no opportunity for you to say “do steps 1-3 first, hold off on 4-7.”
How to spot it: Session history shows no plan or breakdown before the code. Just “task” → “1,500 lines of code.”
6. The “fix one bug” cascaded by accident
A “fix” to formatDate required updating 14 callers (legitimately). But Codex bundled it all into one PR instead of a chain: PR1 adds the new formatDate signature alongside the old one, then follow-up PRs migrate the callers a few at a time.
How to spot it: Diff has a small “real” change + many call-site updates. Could have shipped as a chain of PRs.
Shortest path to fix
Ordered by ROI. Steps 1 and 2 prevent oversize PRs before they happen.
Step 1: Demand a plan before code
For any non-trivial task, prompt:
Before writing any code, propose a plan:
1. Numbered list of discrete steps, each shippable as its own PR.
2. For each step: target files (count + paths), estimated diff size, dependencies on other steps.
3. Identify the smallest step that delivers visible value — start there.
Wait for my approval before generating code.
You’ll see the 50-file plan before it happens and can say “ship step 1 only.”
Step 2: Cap diff size in the prompt + AGENTS.md
In the task:
Constraints:
- Touch at most 5 files.
- Diff under 200 lines (excluding tests and generated code).
- Do not edit files outside the task scope, even for consistency.
- If the task can't fit, STOP and propose splitting.
In AGENTS.md (durable rule):
## PR sizing
- Default cap: 200 lines added/removed per PR (excluding tests/generated).
- Hard cap: 500 lines; anything larger requires explicit approval.
- Adjacent "helpful cleanups" must ship as separate PRs, not bundled.
- Refactors → propose a plan and ship as a chain of small PRs.
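The caps can also be checked locally before a PR is opened. The sketch below is one possible way to do that, assuming `git diff --numstat` output is piped in; `pr_budget` is a hypothetical helper name, and the 200/500 thresholds are simply the ones from the rule above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a diff against the 200/500-line caps.
# Reads `git diff --numstat` output on stdin (columns: added, removed, path).
pr_budget() {
  awk -v soft=200 -v hard=500 '
    # Binary files show "-" in both count columns; skip them.
    $1 != "-" { total += $1 + $2; files++ }
    END {
      if (total > hard)      verdict = "over hard cap: needs explicit approval"
      else if (total > soft) verdict = "over default cap: propose a split"
      else                   verdict = "within budget"
      printf "%d files, %d lines changed: %s\n", files, total, verdict
    }'
}

# Typical use (the base branch name is an assumption):
#   git diff --numstat origin/main...HEAD | pr_budget
```

This sketch counts every text file. To exclude tests or generated code as the rule allows, pass exclusion pathspecs to `git diff` before piping.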
Step 3: Disable opportunistic edits
In the prompt:
DO NOT make any changes outside the immediate task:
- No "while I was here" import normalization.
- No formatting changes unless they're in the lines you're editing.
- No renaming for consistency.
- If you see other issues, list them at the end as TODOs — don't fix them.
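Whether Codex honored these constraints can be verified after the fact. The following is a rough sketch, assuming you keep a plain-text allowlist of the exact paths the task was allowed to touch; `out_of_scope` and the `scope.txt` file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list changed files that fall outside the task scope.
# $1    = file with one allowed path per line (exact paths, by assumption)
# stdin = `git diff --name-only` output
out_of_scope() {
  # -v invert match, -x whole-line match, -F fixed strings, -f patterns file.
  # grep exits 1 when nothing is out of scope, so mask that status.
  grep -vxFf "$1" || true
}

# Typical use (the base branch name is an assumption):
#   git diff --name-only origin/main...HEAD | out_of_scope scope.txt
```

Any path it prints is a candidate “while I was here” edit to move into a separate PR.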
Step 4: For received oversize PRs, split with Codex’s help
If the PR already exists and you don’t want to discard the work:
This PR has 50 files. Split it into 4 logical PRs:
1. Read the diff: `gh pr diff <number>`.
2. Propose a 4-PR split based on independent concerns.
3. For each PR, list files + line count + commit message.
I'll cherry-pick into separate branches after.
Then in git:
# Branch per PR
git checkout main
git checkout -b pr1-rename-helpers
git cherry-pick <commits-for-pr1>
git checkout main
git checkout -b pr2-add-tests
git cherry-pick <commits-for-pr2>
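# ... etc
# If the original commits mix concerns, cherry-pick cannot separate them.
# Check out the listed files from the PR branch instead, then commit:
# git checkout <oversize-pr-branch> -- <files-for-pr1>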
Step 5: Add a CI check that fails over-large PRs
# .github/workflows/pr-size.yml
name: PR size
on: pull_request
jobs:
  size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - name: Check PR size
        run: |
          # "|| true" keeps the step alive when grep finds no insertions
          # (deletion-only PRs); the default shell runs with -e.
          ADDED=$(git diff --shortstat origin/${{ github.base_ref }}...HEAD -- \
            ':!**/*.snap' ':!**/__generated__/**' ':!pnpm-lock.yaml' \
            | grep -oP '\d+(?= insertion)' || true)
          if [ "${ADDED:-0}" -gt 500 ]; then
            echo "::error::PR adds ${ADDED} lines (max 500). Split it."
            exit 1
          fi
The check excludes lockfiles, snapshots, and generated code, so only reviewable lines are gated. Note that it counts added lines only, while the AGENTS.md cap counts lines added and removed.
Step 6: For genuine large refactors, ship behind a feature flag
If the change looks like it has to land atomically (e.g., migrating an API), stage it behind a flag instead:
Stage 1: Add the new code path alongside the old (small PR, flagged off).
Stage 2: Migrate callers one at a time (small PRs, flag still off).
Stage 3: Flip the flag (one-line PR).
Stage 4: Remove the old code (small PR).
This turns one 1,500-line PR into a chain of small PRs, each independently reviewable.
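The first stage of that sequence might look like the sketch below. Everything in it is hypothetical: the `format_date` functions, their argument order, and the `NEW_DATE_FORMAT` environment variable stand in for whatever flag mechanism your codebase already uses.

```shell
#!/usr/bin/env bash
# Hypothetical example: old and new code paths side by side behind a flag.
# NEW_DATE_FORMAT is an illustrative environment variable, off by default.
format_date_old() { printf '%s/%s/%s\n' "$2" "$3" "$1"; }   # MM/DD/YYYY
format_date_new() { printf '%s-%s-%s\n' "$1" "$2" "$3"; }   # YYYY-MM-DD

# Callers use this wrapper, so flipping the flag later is a one-line change.
format_date() {
  if [ "${NEW_DATE_FORMAT:-0}" = "1" ]; then
    format_date_new "$@"
  else
    format_date_old "$@"
  fi
}

format_date 2024 03 09                         # prints 03/09/2024
( NEW_DATE_FORMAT=1; format_date 2024 03 09 )  # prints 2024-03-09
```

Because the wrapper defaults to the old path, the PR that introduces it changes no behavior, and the final cleanup PR only deletes `format_date_old` and the branch.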
Prevention
- Plan-first rule in AGENTS.md: any non-trivial task starts with a numbered plan that you approve before any code is written
- Hard PR-size cap in AGENTS.md (e.g., 200 lines default, 500 hard); written rules are the only norms Codex can see
- Disable opportunistic adjacent edits in every prompt — “fix only the named issue”
- CI gate on PR size so oversize PRs can’t reach review
- For refactors, ship behind a feature flag in a chain of small PRs, not one big atomic PR
- When Codex proposes a 50-file change, push back (“ship step 1 only”) instead of accepting the whole plan