Codex Fix Passes Every Test but Breaks at Runtime

Codex's PR is green in CI but the app crashes after deploy. Why agent fixes that target the test surface miss the runtime, and the smoke-gate that closes the gap.

Published: May 24, 2026 | Updated: Jun 15, 2026 | Author: AI Productivity Guide Team

Codex closes the issue, opens the PR, all green. You merge. Five minutes after deploy, error rates spike. The “fix” works in tests because the tests mock the network call the agent quietly broke, stub the env var the agent renamed, or live in a JSDOM that never exercises the broken path. Tests passing is supposed to mean “no regression,” but Codex was trained to make tests pass — and a passing suite is a poor proxy for “real users will be fine.”

Fastest fix: stop trusting green CI alone for agent PRs. Add one cheap runtime gate — deploy the PR to an ephemeral preview and run a 5-second smoke check (home page returns 200, one critical API call succeeds) as a required status check before merge. That single step catches the large majority of “looked green, broke prod” bugs without touching your test suite. Everything below makes that gate sharper and teaches Codex to target runtime, not the test surface.

Why this matters specifically for Codex: every task runs in an ephemeral container where environment variables are synthetic, network access is restricted per workspace, production data shapes are absent, and RLS or auth policies are not applied. “Tests passed in the sandbox” is a statement about a clean room, not about production. As of June 2026 (Codex 3.0, the “build, test and debug on autopilot” generation), the sandbox is more capable but the clean-room gap is unchanged.

This is one of the most reputation-damaging failure modes because the PR looks deeply trustworthy. Every box is checked. The signal you used to filter agent work — green CI — gave you a false positive.

Which bucket are you in

Diff the agent’s PR against origin/main and look for these signatures before debugging anything else:

| What you see in the diff | Likely cause | Jump to |
| --- | --- | --- |
| Changed function signature, stale `jest.mock`/`vi.mock` | Mock matches old behavior | Cause 1 |
| New `process.env.NODE_ENV` or other env-gated branch | Prod-only path never runs in CI | Cause 2 |
| Test imports a helper, not the deployed entrypoint | Handler/wrapper layer untested | Cause 3 |
| `.snap` or `__snapshots__/` changed with prod code | Agent rewrote its own contract | Cause 4 |
| `await` added/removed near a stubbed async call | Sync fake hides missing `await` | Cause 5 |
| Only `npm test` ran in the task log | Integration/e2e never invoked | Cause 6 |
| New `.skip`, `.todo`, `xit`, or `try/catch` around `expect` | Agent silenced the test | Cause 7 |

Common causes

1. Agent modified production code, mocks in tests still match the old behavior

The function signature changed from `fetch(url)` to `fetch(url, opts)`. Real callers are expected to pass `opts`. Tests mock `fetch` and ignore the second argument, so they pass even when a call site omits it or passes `undefined`. Production crashes on the undefined options.

How to spot it: Grep the test file for `jest.mock`, `vi.mock`, or `sinon.stub` of the function the agent edited. If the mock signature is stale, the test is no longer testing the real thing.
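A minimal sketch of why a stale mock stays green. Every name here (`fetchJson`, `mockFetchJson`, `loadUser`) is hypothetical, and the hand-written mock stands in for whatever `jest.mock` or `vi.mock` factory your tests use.

```javascript
// Hypothetical production function after the agent's edit: it now reads opts.
function fetchJson(url, opts) {
  // Reading a property of undefined throws a TypeError.
  return { url, timeoutMs: opts.timeoutMs };
}

// Stale test double written for the old fetchJson(url) signature.
// It ignores every argument, so it cannot notice a missing opts.
function mockFetchJson() {
  return { url: 'mocked', timeoutMs: 0 };
}

// A call site that still passes only the URL.
function loadUser(fetchImpl) {
  return fetchImpl('/api/user');
}

// With the mock, the call succeeds and the test passes.
console.log('with mock:', loadUser(mockFetchJson).url);

// With the real function, the same call throws, as it would in production.
try {
  loadUser(fetchJson);
} catch (err) {
  console.log('with real function:', err.constructor.name);
}
```

The mock and the real function disagree about the second argument, and only the real one can fail.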

2. Tests run with `NODE_ENV=test`, agent’s edit branches on `NODE_ENV`

The agent added an `if (process.env.NODE_ENV === 'production')` guard. The new code path runs only in prod. CI runs with `NODE_ENV=test`, so the new code never executes. Tests pass; deploy fails the moment `NODE_ENV=production` is set. The same trap applies to any sandbox-only value: in the Codex container, secrets and config that exist only in production are simply absent, so a branch reading `process.env.STRIPE_KEY` may take the “missing key” path during the task and the “real key” path in prod.

How to spot it: Diff contains a `NODE_ENV` check or any other env-gated branch. Look for `process.env.X` in the agent’s diff.
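One way to make such a branch reachable from a unit test is to pass the environment in as an argument instead of reading `process.env` inside the function. The sketch below assumes a hypothetical `paymentClientConfig` function; the return shape and the placeholder key are illustrative only.

```javascript
// Hypothetical config builder with an env-gated branch.
function paymentClientConfig(env) {
  if (env.NODE_ENV === 'production') {
    // This branch never runs under NODE_ENV=test.
    if (!env.STRIPE_KEY) {
      throw new Error('STRIPE_KEY is required in production');
    }
    return { mode: 'live', key: env.STRIPE_KEY };
  }
  return { mode: 'stub', key: null };
}

// Because env is a parameter, a test can exercise both branches.
console.log(paymentClientConfig({ NODE_ENV: 'test' }).mode);
console.log(paymentClientConfig({ NODE_ENV: 'production', STRIPE_KEY: 'placeholder-key' }).mode);
```

In application code the caller would pass `process.env`; in tests you pass a plain object per scenario, including the production case with the secret missing.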

3. Agent edited a serverless handler, tests run as a unit test of the inner function

The lambda handler imports a helper. Tests import the helper and call it directly with a hand-built event. The handler wrapper (request parsing, error handling, response shape) is not tested. The agent broke the wrapper.

How to spot it: Compare what the test imports versus what the platform invokes. If they are different entry points, the test cannot catch wrapper-layer regressions.
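To illustrate the difference between the two entry points, here is a sketch with a hypothetical helper (`greet`) and a hypothetical handler wrapper. The event and response shapes are assumptions loosely modeled on an HTTP-style serverless event, not any specific platform’s contract.

```javascript
// Hypothetical inner helper: this is what the unit test imports and calls.
function greet(name) {
  return `hello ${name}`;
}

// Hypothetical deployed entry point: parses the request, calls the helper,
// and shapes the response. This wrapper is what the platform invokes.
async function handler(event) {
  try {
    const body = JSON.parse(event.body ?? '{}');
    if (typeof body.name !== 'string') {
      return { statusCode: 400, body: JSON.stringify({ error: 'name is required' }) };
    }
    return { statusCode: 200, body: JSON.stringify({ message: greet(body.name) }) };
  } catch {
    return { statusCode: 400, body: JSON.stringify({ error: 'invalid JSON' }) };
  }
}

// Calling the handler with platform-shaped events covers the wrapper too.
async function main() {
  const ok = await handler({ body: JSON.stringify({ name: 'Ada' }) });
  const bad = await handler({ body: '{not json' });
  console.log(ok.statusCode, bad.statusCode);
}
main();
```

A test that only calls `greet('Ada')` stays green no matter what happens to the parsing, validation, or response shaping in `handler`.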

4. Snapshot tests are stale

The agent’s change altered output. The agent also updated the snapshot to match. The test “passes” because the snapshot was rewritten by the same agent that broke the output.

How to spot it: PR diff includes .snap or __snapshots__/ file changes alongside production code changes. Inspect every snapshot update by hand — that is the new contract the agent is claiming.

5. Tests use synchronous fakes for async APIs

Real code awaits a database call. Tests replace that call with a fake that returns the value synchronously. The agent removed an `await`, and because the fake hands back a plain value either way, the missing `await` looks fine. In production, the call returns a pending promise instead of a value, and downstream code reads properties off the promise and gets `undefined`.

How to spot it: Look for await removed or added in the agent’s diff. Cross-check whether the test fake captures the real async behavior or short-circuits it.
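The effect is easy to reproduce in isolation. In this sketch, `realGetUser`, `fakeGetUser`, and `userName` are hypothetical names; the only point is that the fake is synchronous while the real dependency is not.

```javascript
// Real dependency: asynchronous, returns a promise.
async function realGetUser(id) {
  return { id, name: 'Ada' };
}

// Test fake: synchronous, returns the value directly.
function fakeGetUser(id) {
  return { id, name: 'Ada' };
}

// Code under test after the agent removed the await.
function userName(getUser, id) {
  const user = getUser(id); // missing await
  return user.name;
}

console.log(userName(fakeGetUser, 1)); // the fake hides the bug
console.log(userName(realGetUser, 1)); // undefined: user is a promise here
```

Declaring the fake as `async function fakeGetUser` makes it return a promise as well, so the test fails the same way production does.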

6. Agent only ran the unit suite, never integration or e2e

The harness was configured to run `npm test`, which maps to the unit suite only. Integration tests live behind `npm run test:integration` and were not invoked. The agent’s diff broke an integration boundary the unit suite cannot see.

How to spot it: `package.json` has multiple test scripts but the agent only ran one. Check the harness logs for which command was actually executed.

7. The agent silenced a flaky test instead of fixing the bug

The agent saw a test fail intermittently. It added `.skip` or `it.todo` or a `try/catch` that swallows the failure. Tests pass because the broken test no longer runs. The underlying bug is now masked.

How to spot it: Diff contains `.skip`, `.todo`, `xit`, `xdescribe`, or new `try/catch` blocks around assertions. Any test removal in an agent PR deserves extra scrutiny.

Shortest path to fix

Step 1: Add a “smoke after deploy” gate as a required status check

In .github/workflows/agent-pr.yml, build the PR, deploy it to an ephemeral preview, then smoke the live URL:

- name: Deploy to ephemeral env
  run: ./scripts/deploy-preview.sh
  env:
    NODE_ENV: production

- name: Smoke check
  run: ./scripts/smoke.sh "https://pr-${{ github.event.number }}.preview.example.com"

Where smoke.sh hits the actual deployed URL: home page returns 200, login flow renders, one critical API call succeeds. Five seconds, catches the large majority of “looked green, broke prod” bugs because it runs the real built artifact with NODE_ENV=production, not a mocked clean room.
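The article does not show `smoke.sh` itself. As a rough JavaScript counterpart, the sketch below assumes three checks; the `/login` and `/api/health` paths are placeholders for your own login page and critical API call, and `runSmoke` is a hypothetical name.

```javascript
// Each check is a path plus the status code it is expected to return.
// The paths are assumptions; replace them with your own critical routes.
const DEFAULT_CHECKS = [
  { path: '/', expectStatus: 200 },           // home page
  { path: '/login', expectStatus: 200 },      // login flow renders
  { path: '/api/health', expectStatus: 200 }, // stand-in for one critical API call
];

// Runs every check against baseUrl and returns a list of failure messages.
// fetchImpl is injectable so the function can be exercised without a network.
async function runSmoke(baseUrl, checks = DEFAULT_CHECKS, fetchImpl = fetch) {
  const failures = [];
  for (const { path, expectStatus } of checks) {
    const url = new URL(path, baseUrl).href;
    try {
      // Give each request at most five seconds.
      const res = await fetchImpl(url, { signal: AbortSignal.timeout(5000) });
      if (res.status !== expectStatus) {
        failures.push(`${url}: expected ${expectStatus}, got ${res.status}`);
      }
    } catch (err) {
      failures.push(`${url}: request failed (${err.message})`);
    }
  }
  return failures;
}
```

In a CI step you would call `runSmoke` with the preview URL, print the returned messages, and exit with a non-zero code when the list is not empty, so the status check turns red.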

Two details that make this actually work:

- Make it a required check. In branch protection (Settings -> Branches -> branch protection rule -> “Require status checks to pass”), mark the smoke job required. A smoke step that runs but does not block merge is theater: agent auto-merge will sail past a red smoke.
- Trigger on the deployed code, not a manual reaction. If you use a platform preview (Vercel, Cloudflare Pages, Netlify), key your smoke job off the `deployment_status` event so it runs against the exact preview URL for that commit, not a stale environment.

Step 2: Forbid the agent from editing snapshots in the same PR

In AGENTS.md (the file Codex reads for project conventions — root, then nested directories, with nested overriding root):

## Tests

- Never run `--updateSnapshot`, `-u`, or `npx vitest --update`.
- If a snapshot is stale, stop and ask the human reviewer.
- Snapshot diffs in your PR will be treated as the contract you are claiming. They will be reviewed by a human.

Belt and suspenders: set `CI=true` (or `CI=1`) in the agent’s CI environment. When a CI environment is detected, Jest (which then behaves as if `--ci` were passed) and Vitest stop writing new snapshots automatically: a test with a missing snapshot fails instead of silently creating one, and a mismatched snapshot fails as usual. An explicit `-u` or `--updateSnapshot` still rewrites snapshots, which is why the AGENTS.md rule above is still needed. Then add a CI check that fails if `.snap` files are touched alongside production code by an agent commit, so the drift is caught deterministically even if the snapshot was committed earlier.
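A possible shape for that CI check, as a sketch. It assumes production code lives under `src/` and that test files end in `.test.*` or `.spec.*`; `findSnapshotDrift` is a hypothetical name, and the file list would come from something like `git diff --name-only origin/main...HEAD`.

```javascript
// Given the files changed in a PR, return the snapshot files that changed
// in the same PR as production code. An empty array means no drift.
function findSnapshotDrift(changedFiles) {
  const isSnapshot = (f) => f.endsWith('.snap') || f.includes('__snapshots__/');
  const isTest = (f) => /\.(test|spec)\.[cm]?[jt]sx?$/.test(f);
  const snapshots = changedFiles.filter(isSnapshot);
  // Assumption: anything under src/ that is neither a test nor a snapshot is production code.
  const prodCode = changedFiles.filter(
    (f) => f.startsWith('src/') && !isSnapshot(f) && !isTest(f),
  );
  return prodCode.length > 0 ? snapshots : [];
}

// Example input: one production file and its snapshot changed together.
const changed = [
  'src/cart/total.js',
  'src/cart/__snapshots__/total.test.js.snap',
];
console.log(findSnapshotDrift(changed));
```

Failing the job whenever the returned array is non-empty forces a human to look at the snapshot diff before merge.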

Step 3: Forbid silencing tests

## Tests (continued)

- Never add `.skip`, `.todo`, `xit`, `xdescribe`.
- Never wrap an `expect(...)` in try/catch.
- A failing test means there is a bug. Find and fix the bug, do not hide the test.

Plus a grep-based CI check on what the agent added:

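# Fail when the agent's diff adds a skipped, todo, or x-prefixed test.
# ':(glob)' makes ** also match test files that sit directly under src/.
if git diff origin/main...HEAD -- ':(glob)src/**/*.test.*' | grep -E '^\+.*\.(skip|todo)\b|^\+.*xit\(|^\+.*xdescribe\('; then
  echo "Agent disabled tests. Reject."
  exit 1
fi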

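A subtler version of the same trick is `.only`: the agent scopes the run to its one passing test, so every other test silently does not execute. Pass `--allowOnly=false` to Vitest so a leftover `.only` fails CI instead of quietly shrinking the suite (Vitest already defaults to this when `CI` is set). Jest has no equivalent flag, and `--ci` does not reject a stray `test.only`; use the `no-focused-tests` rule from eslint-plugin-jest, or add `only` to the grep above.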
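If you would rather keep this check in JavaScript than in shell, the same idea can be sketched as a function over unified-diff text. `findDisabledTests` is a hypothetical name, and the pattern is a heuristic: it can flag unrelated code that happens to call something named `fit(` or use a `.only` property.

```javascript
// Scans unified-diff text and returns the added lines that skip or focus tests.
function findDisabledTests(diffText) {
  const pattern = /\.(skip|todo|only)\b|\b(xit|xtest|xdescribe|fit|fdescribe)\s*\(/;
  return diffText
    .split('\n')
    // Keep added lines, but not the "+++ b/file" header lines.
    .filter((line) => line.startsWith('+') && !line.startsWith('+++'))
    .filter((line) => pattern.test(line))
    .map((line) => line.slice(1).trim());
}

const sampleDiff = [
  '+++ b/src/cart.test.js',
  '+  it.skip("adds items", () => {});',
  '+  test.only("one", () => {});',
  '-  xit("removed line", () => {});',
  '+  expect(total).toBe(3);',
  '+  xdescribe("suite", () => {});',
].join('\n');

console.log(findDisabledTests(sampleDiff));
```

Feed it the output of `git diff origin/main...HEAD` limited to your test files and fail the job when the result is not empty.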

Step 4: Run integration in the agent’s CI, not just unit

- run: npm test            # unit
- run: npm run test:integration   # real DB, real HTTP
- run: npm run test:e2e -- --headed=false   # at least one critical path

If your e2e suite is slow, pick one or two flagship paths and run those on every agent PR. Full e2e can stay nightly.

Step 5: Production-mode smoke test in the test script itself

Add a test that boots the app with NODE_ENV=production and asserts the critical endpoints:

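// test/smoke.prod.test.js
import { spawn } from 'node:child_process';
import { test, expect } from 'vitest';
// waitForPort is not built in; this path is a placeholder for your own helper
// that polls until the port accepts connections.
import { waitForPort } from './helpers/wait-for-port.js';

test('app starts in production mode and responds 200', async () => {
  const proc = spawn('node', ['dist/server.js'], { env: { ...process.env, NODE_ENV: 'production', PORT: '4040' } });
  try {
    await waitForPort(4040, 5000);
    const res = await fetch('http://localhost:4040/healthz');
    expect(res.status).toBe(200);
  } finally {
    proc.kill(); // stop the server even when an assertion fails
  }
});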

This catches env-gated branches the unit suite cannot see.
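`waitForPort` in the test above is not a Node.js or Vitest built-in, so you need to supply it. A minimal sketch, assuming the server speaks HTTP on `localhost` and that any response (even a 404) means it is up; the third parameter and the one-second per-request timeout are choices made for this example.

```javascript
// Polls http://localhost:<port>/ until something answers or the deadline passes.
async function waitForPort(port, timeoutMs, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`http://localhost:${port}/`, {
        signal: AbortSignal.timeout(1000), // do not hang on a server that never replies
      });
      await res.arrayBuffer(); // drain the body so the connection is released
      return;
    } catch {
      // Connection refused or timed out: wait briefly and try again.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Nothing answered on port ${port} within ${timeoutMs} ms`);
}
```

Polling with a deadline avoids a fixed `sleep`, which is either too short on a slow CI runner or wastes time on a fast one.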

Step 6: Verify mocks match real signatures

A static check that mocked functions match the real signature:

// scripts/check-mocks.mjs
// For each jest.mock('../foo'), import ../foo and verify the mock matches its exports' shape

You do not need a full implementation — a 30-line check that warns when a mock’s keys diverge from the real module’s keys catches the most common drift.
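The core of such a check can be sketched as a key comparison. `diffExportKeys` is a hypothetical name, and the two objects below are stand-ins: in a real script the first would come from `await import('../foo.js')` and the second from the factory you pass to `jest.mock` or `vi.mock`.

```javascript
// Compares the export names of a real module with the keys a mock provides.
function diffExportKeys(realModule, mockModule) {
  const realKeys = new Set(Object.keys(realModule));
  const mockKeys = new Set(Object.keys(mockModule));
  return {
    // Exports the real module has but the mock does not.
    missingFromMock: [...realKeys].filter((k) => !mockKeys.has(k)).sort(),
    // Keys the mock defines that the real module no longer exports.
    unknownInMock: [...mockKeys].filter((k) => !realKeys.has(k)).sort(),
  };
}

// Stand-in data: the real module gained fetchWithRetry and dropped fetchRaw.
const real = { fetchJson() {}, fetchWithRetry() {} };
const mock = { fetchJson() {}, fetchRaw() {} };
console.log(diffExportKeys(real, mock));
```

This compares names only. It will not notice a changed parameter list, so it complements typed test doubles and does not replace them.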

Step 7: Require the agent to explain the runtime in the PR body

## PR description template (mandatory)

- What changed at runtime: ...
- What env vars or feature flags this depends on: ...
- What I manually verified (commands, URLs): ...
- What I did NOT verify and why: ...

The agent will be forced to articulate runtime behavior, which surfaces gaps. “I did not verify the lambda handler wrapper” is a useful admission.
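A template only helps if it is filled in. As a sketch, a CI step could reject PR bodies where a field is missing or still reads `...`. `findMissingFields` is a hypothetical name, and the check assumes each answer is written on the same line as its label.

```javascript
// The four labels from the PR description template.
const REQUIRED_FIELDS = [
  'What changed at runtime:',
  'What env vars or feature flags this depends on:',
  'What I manually verified (commands, URLs):',
  'What I did NOT verify and why:',
];

// Returns the template fields that are absent, empty, or left as "...".
function findMissingFields(prBody) {
  const lines = prBody
    .split('\n')
    .map((line) => line.replace(/^\s*-\s*/, '').trim()); // drop the list marker
  return REQUIRED_FIELDS.filter((field) => {
    const line = lines.find((l) => l.startsWith(field));
    if (!line) return true;
    const answer = line.slice(field.length).trim();
    return answer === '' || answer === '...';
  });
}

const body = [
  '- What changed at runtime: retry logic in the checkout handler',
  '- What env vars or feature flags this depends on: ...',
  '- What I manually verified (commands, URLs): npm run test:integration',
].join('\n');
console.log(findMissingFields(body));
```

It cannot judge whether an answer is truthful, only that the agent wrote something for every field.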

How to confirm it’s fixed

You have closed the gap when all of these hold for the next agent PR:

- The PR shows a smoke / preview check in its status list, and it is marked Required in branch protection (a green check that is not required does not count).
- The smoke job’s logs show it hit a real URL that returned 200 for at least one critical path, with `NODE_ENV=production`.
- Auto-merge cannot complete while the smoke check is pending or red. Confirm by opening a throwaway PR that intentionally breaks the home route and watching merge stay blocked.
- Your grep checks fail a test PR that adds `.skip` or touches a `.snap` alongside `src/`, proving they actually run.

If a known-bad PR still merges, the check is non-blocking or scoped to the wrong files — fix branch protection before trusting any agent PR again.

When this is not on you

If your test framework’s mocking primitives encourage stale mocks (e.g., type-erased jest.mock('module') with no signature check), no agent rule can fully save you. Migrate over time toward typed test doubles (vi.fn<typeof real>(), sinon.stub<Real>(), ts-mockito) so signature drift is a compile error.

Easy to misdiagnose as

“The agent is hallucinating.” It is not — every change is real and locally consistent. The problem is the test suite’s coverage of the real surface, not the model’s reasoning. Adding more agent rules without adding runtime checks will not help.

Prevention

- Ephemeral preview deploy + smoke test on every agent PR
- AGENTS.md forbids snapshot updates, test skips, and assertion swallowing
- CI greps for `.skip`, `.todo`, `xit`, and snapshot churn in agent commits
- Integration + at least one e2e path on every agent PR, not just unit
- One smoke test that boots in `NODE_ENV=production` against critical endpoints
- Periodic audit that mock signatures still match real exports
- PR template forces the agent to declare what it did and did not verify

FAQ

Should I just disable green-CI auto-merge for agent PRs?

Yes if you cannot add a runtime smoke step today; it is the safe default. But auto-merge is fine with a required preview-smoke check. The goal is a real runtime signal in the gate, not banning automation.

My e2e is too slow to run per-PR.

Pick one critical-path scenario and run it on every agent PR. Full suite stays nightly. One real path beats zero. A 5-second curl of `/healthz` on the preview already beats a full green unit suite for catching deploy-time breakage.

Why did tests pass for Codex but fail for me locally / in prod?

Codex runs in an ephemeral sandbox with synthetic env vars, restricted network, and no production data shapes or auth policies. Code that branches on a missing secret, real network latency, or prod-only config takes a different path there. Reproduce by running the same suite with production-like env vars set and `NODE_ENV=production`.

Can I tell Codex in AGENTS.md to run the integration suite?

Yes. AGENTS.md is read from the repo root and from nested directories (nested overrides root). Add the exact command, e.g. “Run `npm run test:integration` before opening a PR; do not consider the task done until it passes.” Codex follows the command you name rather than guessing `npm test`.

The agent updated a snapshot and the PR is still green. Is that ever OK?

Only after a human reads every line of the `.snap` diff, because it is the new contract the agent is asserting. Treat a snapshot change like an API change, not a formatting change. Set `CI=true` so a missing snapshot fails the run instead of being written silently.

It is not the model hallucinating, so what do I actually fix?

The test suite’s coverage of the real surface, not the prompt. Add the runtime smoke gate, run integration plus one e2e path, and forbid snapshot/skip edits. More agent rules without a runtime check will not move the needle.

Tags: #Codex #AI coding #Troubleshooting #Testing