AI-Generated SQL Locks a Hot Table for Minutes

AI wrote a query or update that holds long locks on a production table, blocking reads and writes. Identify the pattern and rewrite it to be lock-friendly.

Your AdSense events table is taking writes at 800/sec. The AI wrote a “quick cleanup script” — a single UPDATE events SET status = 'archived' WHERE created_at < now() - interval '90 days'. You run it. Forty seconds later your pager fires: write latency on events jumped from 5ms to 12 seconds, the ingest queue is backing up, customer dashboards are timing out. The query is still running, holding row-level locks on three million rows, and blocking every concurrent writer. AI-generated SQL reads cleanly because it expresses intent. What it fails to model is concurrency, lock granularity, and the cost of holding a single transaction open against a hot table.

Common causes

Ordered by how often each shows up in real outages.

1. Unbounded UPDATE or DELETE in a single transaction

The AI writes one statement that touches millions of rows. Postgres takes a row lock per row and an EXCLUSIVE lock per page. Any concurrent writer to the same rows waits.

How to spot it: SELECT * FROM pg_stat_activity WHERE state = 'active' shows your query running for 30s+; pg_locks shows hundreds of blocked sessions.

2. SELECT ... FOR UPDATE without LIMIT

The AI thinks FOR UPDATE makes the read “safe”. Without a LIMIT or WHERE-key clause, it locks every row matching the predicate, often the whole table.

How to spot it: Reads on the table look fine but UPDATEs queue up. pg_locks shows row-exclusive locks held by your transaction.

3. Implicit table scan due to missing index

UPDATE orders SET ... WHERE customer_email = ? with no index on customer_email does a full table scan, locking every row in the process even if only 3 rows match.

How to spot it: EXPLAIN ANALYZE of the same UPDATE shows Seq Scan. The lock count is way larger than the affected row count.

4. Long-running transaction wrapping the query

The AI wraps everything in a BEGIN/COMMIT block: read data, process in app, write back. If the “process in app” step takes 90 seconds, locks are held for 90 seconds.

How to spot it: SELECT now() - xact_start FROM pg_stat_activity WHERE state = 'idle in transaction' returns large intervals.

5. Lock-escalating predicate on enums / low-cardinality columns

UPDATE events SET status = 'archived' WHERE status = 'pending' with only two distinct status values. Half the table matches; the database may upgrade to a table-level lock.

How to spot it: pg_locks shows a ShareUpdateExclusive or Exclusive lock at the relation level rather than tuple level.

6. CTE that the planner materializes and locks defensively

WITH old AS (DELETE FROM events WHERE ... RETURNING *) INSERT INTO archive SELECT * FROM old — Postgres holds locks across both operations. On large result sets, archive writes block too.

How to spot it: The query is structured as a writable CTE; both source and destination tables show locks.

7. VACUUM FULL or REINDEX (non-CONCURRENT) generated by AI for “cleanup”

VACUUM FULL takes an ACCESS EXCLUSIVE lock — it blocks everything including reads. AI sometimes suggests this as a “performance fix”.

How to spot it: Migration / maintenance script contains VACUUM FULL <table>. App-wide outage for the duration.

Before you start

  • Check whether the query is actually still running. If it is, decide quickly whether to let it finish or kill it.
  • Identify the table’s current write traffic. A 10-second lock on a write-once table is fine; a 1-second lock on a 1000/sec table is an incident.
  • Confirm you have a non-prod environment with realistic data volume to test the rewritten version.
  • Have your pg_cancel_backend / pg_terminate_backend privileges ready before running anything risky in prod.

Information to collect

  • The exact SQL the AI generated.
  • Output of EXPLAIN (ANALYZE, BUFFERS) for the query, ideally against prod-like data.
  • Output of SELECT pid, state, wait_event, query, now() - query_start AS dur FROM pg_stat_activity WHERE datname = current_database() ORDER BY dur DESC.
  • Output of SELECT * FROM pg_locks WHERE NOT granted during the incident.
  • Table size and index list: \d+ <table>.
  • Concurrent write rate on the table.

Step-by-step fix

Ordered: stop the bleeding, then re-architect.

Step 1: Decide whether to kill the running query

If app latency is up and the query is the cause:

SELECT pid, query, now() - query_start AS dur
  FROM pg_stat_activity
  WHERE query LIKE 'UPDATE events%' AND state = 'active';

SELECT pg_cancel_backend(<pid>);
-- If cancel does not work within 10s:
SELECT pg_terminate_backend(<pid>);

pg_cancel_backend is graceful (SIGINT); pg_terminate_backend is forceful (SIGTERM). Use cancel first.

Step 2: Rewrite bulk updates to chunked batches

Replace:

UPDATE events SET status = 'archived'
  WHERE created_at < now() - interval '90 days';

With:

DO $$
DECLARE
  rows_affected int := 1;
BEGIN
  WHILE rows_affected > 0 LOOP
    WITH batch AS (
      SELECT id FROM events
        WHERE created_at < now() - interval '90 days'
          AND status <> 'archived'
        ORDER BY id
        LIMIT 5000
        FOR UPDATE SKIP LOCKED
    )
    UPDATE events SET status = 'archived'
      WHERE id IN (SELECT id FROM batch);
    GET DIAGNOSTICS rows_affected = ROW_COUNT;
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;

FOR UPDATE SKIP LOCKED lets concurrent writers proceed by skipping rows you have temporarily locked. The pg_sleep gives the planner / vacuum / replication slack.

Step 3: Add the missing index before the next big update

If EXPLAIN shows a sequential scan:

CREATE INDEX CONCURRENTLY idx_events_created_at_status
  ON events(created_at) WHERE status <> 'archived';

Partial indexes are excellent for cleanup-style queries — small index, fast lookups, vanishes when no work remains.

Step 4: Bound FOR UPDATE queries

Replace:

SELECT * FROM orders WHERE status = 'pending' FOR UPDATE;

With:

SELECT * FROM orders WHERE status = 'pending'
  ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED;

You process 100 at a time, skip rows other workers grabbed, and never accidentally lock the whole pending set.

Step 5: Move long processing out of the transaction

Replace:

BEGIN;
SELECT * FROM orders WHERE id = ? FOR UPDATE;
-- 90 seconds of app-side work
UPDATE orders SET ... WHERE id = ?;
COMMIT;

With:

BEGIN;
UPDATE orders SET processing_started_at = now() WHERE id = ? AND processing_started_at IS NULL;
COMMIT;
-- 90 seconds of app-side work, no transaction held
BEGIN;
UPDATE orders SET ... WHERE id = ?;
COMMIT;

Optimistic claiming via processing_started_at replaces the long FOR UPDATE.

Step 6: Set timeouts on the session

Before any AI-generated bulk operation in prod:

SET statement_timeout = '30s';
SET lock_timeout = '5s';
SET idle_in_transaction_session_timeout = '60s';

If the query exceeds the budget it aborts cleanly. Far better than a 10-minute lock.

Step 7: Replace VACUUM FULL with pg_repack or table rewrites

If the AI suggested VACUUM FULL:

# Online table reorganization without ACCESS EXCLUSIVE lock
pg_repack -t events -d mydb

pg_repack rebuilds the table online. For most maintenance, a regular autovacuum tune is enough — VACUUM FULL is almost never the right answer in prod.

Verify

  • EXPLAIN ANALYZE on the rewritten query shows index usage and bounded work.
  • A test run on a prod clone with concurrent write load completes without write-latency spikes.
  • During the real run, pg_stat_activity never shows your query running longer than the chunk budget.
  • No idle in transaction sessions from your job persist beyond a few seconds.
  • App latency dashboards stay flat during the operation.

Long-term prevention

  • Add to CLAUDE.md / .cursorrules: “Never generate bulk UPDATE / DELETE without LIMIT + loop. Always assume the target table has concurrent writes.”
  • Use a job framework (Sidekiq, BullMQ, pgboss) to manage chunked bulk operations rather than ad-hoc SQL scripts.
  • Set statement_timeout, lock_timeout, and idle_in_transaction_session_timeout at the role level so even a careless AI query cannot hold locks forever.
  • Add a wrapper script for “any DML hitting more than N rows must use the batch helper” — enforce it in code review.
  • Keep an internal cookbook of safe SQL patterns the AI can reference: chunked update, SKIP LOCKED queue, partial index for cleanup.
  • For Postgres, enable log_lock_waits = on and a low deadlock_timeout, then alert on slow log entries.

Common pitfalls

  • Running the AI’s “quick cleanup query” during peak traffic because it looks small.
  • Assuming WHERE with no index “is fast on a small table” — small in dev, huge in prod.
  • Killing a query mid-run on a replica setup without realizing the lock had already replicated.
  • Setting statement_timeout only on the migration script but not on ad-hoc psql sessions where AI suggestions get pasted.
  • Trusting that BEGIN; ... COMMIT; is “safer” — it actually extends lock duration.
  • Letting AI generate a CTE with writable subqueries on a hot table without reviewing the lock implications.

For related issues see AI migration fails on prod schema, AI removed working logic, and AI tests pass but feature is broken.

FAQ

Q: The AI’s query looks like every example on Stack Overflow. Why is it bad here?

Stack Overflow examples are usually toy databases. Production tables have 100x the rows, concurrent writers, and replication. The same SQL is safe at one scale and catastrophic at another.

Q: Can I just add a LIMIT 1000 and re-run repeatedly?

That is exactly the chunked-batch approach above, plus a loop and SKIP LOCKED. Manually re-running is fine for one-offs; scripted loops are required for cleanup jobs.

Q: Why is FOR UPDATE SKIP LOCKED important?

Without SKIP LOCKED, a chunked worker blocks behind whatever locked the same rows last. SKIP LOCKED makes workers walk around contended rows and keep moving — essential for parallel jobs.

Q: My ORM generated the bulk query, not the AI. Same fix?

Yes. ActiveRecord update_all, Sequelize bulkUpdate, Prisma updateMany all default to a single-statement bulk write. Configure batching at the ORM level or rewrite the call.

Tags: #Troubleshooting #AI coding #sql #Database #locks