The AI wrote a migration to add a NOT NULL column to users. On your laptop it ran in 80ms against a near-empty dev database. In staging it ran in 12s. In production it hung for 9 minutes, then threw null value in column "country" violates not-null constraint. The migration assumes a world that does not match prod: every existing row has the column populated, no long-running transactions hold locks, the table has 50 rows not 50 million. AI-generated migrations are uniquely dangerous because they read clean and confident, and the dev environment lies to you. Production schemas have legacy data, partial null backfills, constraints from old migrations, and table sizes that turn a 0.1s op into a 30-minute incident.
Common causes
Ordered by frequency in real incidents.
1. Adding a NOT NULL column with no default and no backfill
AI emits:
ALTER TABLE users ADD COLUMN country text NOT NULL;
Dev: empty table, no rows to violate the constraint. Prod: 8M existing rows, every one of them now has NULL in country, the migration aborts.
How to spot it: Error message contains null value in column "X" violates not-null constraint during migration.
2. Long-held exclusive lock on a hot table
ALTER TABLE on Postgres takes an ACCESS EXCLUSIVE lock. If anything else is writing to the table, your migration waits for them; they wait for your migration; reads back up.
How to spot it: Migration “hangs” for minutes with no progress, then app latency spikes site-wide. Check pg_locks for blocked / blocking pids.
3. Index creation without CONCURRENTLY
AI writes CREATE INDEX idx_users_email ON users(email);. On a 10M-row table, this locks the table for the duration. CONCURRENTLY is required for online index creation.
How to spot it: App writes time out during the index build; migration takes 30+ minutes.
4. Constraint added on column with violating data
ALTER TABLE orders ADD CONSTRAINT chk_amount_positive CHECK (amount > 0); succeeds in dev where no rows have amount <= 0. Prod has 47 historical rows with amount = 0. Migration fails.
How to spot it: check constraint "X" is violated by some row on the migration step.
5. Foreign key added without index on referencing column
AI adds a FK on orders.user_id referencing users.id. Without an index on orders.user_id, every delete on users does a full table scan on orders. Migration succeeds; the next prod DELETE FROM users WHERE id = ? takes 4 minutes.
How to spot it: After migration ships, queries that touch the parent table get dramatically slower. EXPLAIN shows sequential scans on the child.
6. Column rename in a single migration with running app
ALTER TABLE users RENAME COLUMN name TO full_name; works in dev because no one is hitting it. In prod, the old code is still running, references users.name, and breaks the moment the migration commits. AI did not generate the safe two-phase rename.
How to spot it: Migration completes successfully, app immediately starts throwing column "name" does not exist.
7. Migration assumes a default collation, timezone, or encoding
Dev DB is en_US.UTF-8, prod is C or some legacy locale. Sort orders, case-insensitive comparisons, or text-vs-varchar handling differ. AI generated a query that depends on dev’s locale behavior.
How to spot it: Migration runs but a unique index on a text column produces duplicate-key errors that “should not happen”.
Before you start
- Confirm exact prod schema:
pg_dump --schema-onlyor your DB’s equivalent. Do not trust the ORM model. - Get a row count for any table the migration touches:
SELECT count(*) FROM <table>. - Check active sessions / locks on prod for the target table.
- Ensure you have a tested rollback plan — every forward migration needs a down migration or recovery script.
- Run the migration against a production clone, not a fresh dev DB. Anonymize the data if needed, but keep the row count and constraint history.
Information to collect
- The full migration SQL the AI generated.
- Production schema for affected tables:
pg_dump --schema-only -t <table>. - Row counts and any NULL distribution:
SELECT count(*), count(<column>) FROM <table>. - Current locks during business hours:
SELECT * FROM pg_locks WHERE not granted. - Existing constraints / indexes / triggers on the target table.
- Whether the app does zero-downtime deploys or stops-the-world during migration.
Step-by-step fix
Ordered: prevent the immediate incident first, then harden.
Step 1: Run the migration against a prod clone
pg_dump prod > prod.dump
createdb prod_clone
pg_restore -d prod_clone prod.dump
psql prod_clone < migrations/2026_05_add_country.sql
If it fails here, you caught it before prod. Most AI migration bugs surface immediately on real data volume + constraints.
Step 2: Convert NOT NULL adds to a two-step safe pattern
Replace:
ALTER TABLE users ADD COLUMN country text NOT NULL;
With:
-- Phase 1: nullable add with default for new rows
ALTER TABLE users ADD COLUMN country text;
-- Phase 2: backfill in chunks
UPDATE users SET country = 'US' WHERE country IS NULL AND id BETWEEN 1 AND 100000;
-- (repeat in batches; use \watch or a job)
-- Phase 3: enforce NOT NULL only after backfill complete
ALTER TABLE users ALTER COLUMN country SET NOT NULL;
This pattern is required on any table with more than ~10k rows.
Step 3: Use CONCURRENTLY for index creation on large tables
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CONCURRENTLY does not lock writes. It takes longer and cannot run in a transaction — split the migration if needed.
Step 4: Validate constraints against existing data before adding
Before:
ALTER TABLE orders ADD CONSTRAINT chk_amount_positive CHECK (amount > 0);
Run:
SELECT count(*) FROM orders WHERE NOT (amount > 0);
If non-zero, decide: fix the data, exempt with a NOT VALID constraint, or scope the check to new rows only.
ALTER TABLE orders ADD CONSTRAINT chk_amount_positive CHECK (amount > 0) NOT VALID;
-- Backfill / repair offending rows
ALTER TABLE orders VALIDATE CONSTRAINT chk_amount_positive;
NOT VALID skips the initial check; VALIDATE runs it later without blocking writes.
Step 5: Add an index before adding a foreign key
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
ALTER TABLE orders ADD CONSTRAINT fk_orders_user FOREIGN KEY (user_id) REFERENCES users(id);
Always check whether the referencing column has a usable index. AI almost never adds this step.
Step 6: For renames, use the expand-and-contract pattern
Never do RENAME COLUMN while old code is running. Use three deploys:
-- Deploy 1: add new column, dual-write at app layer
ALTER TABLE users ADD COLUMN full_name text;
-- App code writes to both `name` and `full_name`
-- Deploy 2 (later): read from full_name, stop writing to `name`
-- Deploy 3 (later): drop old column
ALTER TABLE users DROP COLUMN name;
Tell the AI to generate the expand-and-contract version explicitly; otherwise it defaults to a single destructive rename.
Step 7: Set a statement timeout on the migration
SET statement_timeout = '5min';
SET lock_timeout = '30s';
If the migration cannot acquire the lock within 30 seconds, abort and retry — better than holding up the app for 9 minutes. Many migration frameworks (Rails, Prisma, Flyway) have this built in.
Verify
- Migration runs end-to-end against a prod clone with realistic data volume.
- No new constraint or index leaves the database in a degraded query plan — check
EXPLAINon the top-N affected queries. - No
NULLrows in newly-required columns; verify withSELECT count(*) FROM <table> WHERE <col> IS NULL. - App health metrics during migration: latency, error rate, lock waits stay in normal range.
- Down-migration / rollback path is tested and works.
Long-term prevention
- Maintain a “migration safety checklist” in
CLAUDE.mdand require AI to walk through it before emitting any migration. - Use a tool like
squawk,pgroll, orgh-ost(MySQL) that statically checks migrations for known dangerous patterns. - Always run migrations against a recent prod clone in CI. Catching prod-only failures locally is the highest-ROI gate.
- Enforce a
statement_timeoutandlock_timeoutat the database role level for the migration user. - Adopt the expand-and-contract pattern as a project-wide rule. AI follows it once it sees it in your existing migration history.
- Keep migrations small and idempotent: one logical change per file. AI tends to merge multiple changes; reject that in review.
Common pitfalls
- Running the AI’s migration against an empty dev DB and declaring it “tested”.
- Trusting that the AI knows about
CONCURRENTLY,NOT VALID, or expand-and-contract — it usually does not unless you instruct it explicitly. - Adding a NOT NULL column with a
DEFAULTthinking it is safe — on older Postgres (pre-11),DEFAULTonALTER TABLE ADDrewrites the whole table. - Skipping the index-before-FK step because “the FK constraint already creates an index” — it does not on Postgres, only on the referenced PK side.
- Combining schema change + data backfill in one migration on a large table. Split them.
- Ignoring
lock_timeoutand watching the migration block 1000 queries before someone kills it.
For related issues see AI hallucinated a file, AI removed working logic, and AI code broke the build.
FAQ
Q: The AI wrote a migration that “passed locally”. Why is that not enough?
Local dev DBs have tiny row counts, no concurrent traffic, and no legacy data. Most migration failures are about scale, concurrency, or historical NULLs. Always test against a prod clone.
Q: Can I just ask the AI to “make this migration safe for production”?
Sometimes. You will get better results if you list the specific rules: CONCURRENTLY for indexes, NOT VALID + VALIDATE for constraints, two-phase NOT NULL adds, expand-and-contract for renames. The AI then follows the named recipes.
Q: Migration is hung at “waiting for AccessExclusiveLock”. What now?
Either kill it (SELECT pg_cancel_backend(pid)) and retry off-hours, or set lock_timeout and accept a clean abort. Do not let it block reads/writes for many minutes — outages compound.
Q: My ORM generated the migration, not the AI. Does this still apply?
Yes. Most ORMs default to “convenient” not “safe at scale”. The same expand-and-contract, NOT VALID, and CONCURRENTLY rules apply to ORM-generated migrations.
Tags: #Troubleshooting #AI coding #migrations #Database #Schema