Postgres Autovacuum Stalled by a Long-Running Transaction

Dead tuples climb, the table bloats, and VACUUM logs say "dead but not yet removable". One forgotten transaction is pinning the xmin horizon, so vacuum reclaims nothing. Find and kill it.

Published: May 24, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

Your Postgres dashboard shows a steadily climbing dead-tuple count. The largest table has tripled in disk size over two weeks but row count is flat. You ran VACUUM ANALYZE manually and it returned in seconds without complaining. Autovacuum logs say it ran. Yet pg_stat_user_tables.n_dead_tup keeps rising and p99 query latency on that table has crept from 40 ms to 900 ms.

The culprit is almost never autovacuum itself. It is one stuck transaction somewhere in the cluster pinning the global xmin horizon, so vacuum cannot actually reclaim any rows even when it appears to run successfully.

TL;DR — the fastest fix

Run VACUUM (VERBOSE) public.your_table;. If the output reports a non-zero count of dead rows it could not remove, vacuum is being blocked by the xmin horizon, not failing. The wording depends on the version: PostgreSQL 14 and earlier print a line such as "1165 dead row versions cannot be removed yet, oldest xmin: 108204095", while PostgreSQL 15 and later print "1165 are dead but not yet removable" followed by a "removable cutoff: 108204095" line. The oldest xmin (or removable cutoff) value is the transaction ID at which the horizon is pinned. Find the oldest transaction holding that horizon:

SELECT pid, datname, usename, state,
       backend_xmin,
       now() - xact_start AS xact_age,
       now() - state_change AS idle_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

The top row is almost always the offender. Terminate it with SELECT pg_terminate_backend(<pid>);, then re-run VACUUM. If pg_stat_activity shows nothing old, the holder is a replication slot, a prepared transaction, or a standby feeding its xmin back to the primary (see causes 2, 4, and 5). Most incidents are resolved in under five minutes once you stop blaming autovacuum.

Common causes

Ordered by hit rate, highest first. Use this table to jump straight to your bucket — every row holds back the same xmin horizon, but you diagnose and clear each one differently.

What you see	Likely cause	Where to look
A session stuck in `idle in transaction` for minutes/hours	Forgotten open transaction (cause 1)	`pg_stat_activity.state`
A `SELECT` that has been `active` for hours	Long analytics query or `pg_dump` snapshot (cause 3)	`pg_stat_activity` + `backend_xmin`
No old session at all, yet horizon won’t move	Abandoned replication slot (cause 2)	`pg_replication_slots` (`active = false`)
Holder survives reconnects and restarts	Prepared transaction from 2PC (cause 4)	`pg_prepared_xacts`
Horizon held back but nothing local is old	Standby feeding xmin back (cause 5)	`pg_stat_replication.backend_xmin`
Logs show `canceling autovacuum task` repeatedly	Autovacuum keeps yielding to a lock (cause 6)	`pg_stat_user_tables.autovacuum_count`
`age(datfrozenxid)` climbing toward 200M+	Anti-wraparound vacuum stuck (cause 7)	`pg_database`

1. An idle-in-transaction session held open for hours

Some worker opened a transaction with BEGIN, ran one query, and then went to sleep waiting on an external API. The connection is alive, but the transaction never committed or rolled back.

How to spot it: Run SELECT pid, state, xact_start, now() - xact_start AS age, query FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_start; Any row with age over a few minutes is a smoking gun.

2. A replication slot with no consumer

A logical replication slot was created for a CDC pipeline that died, was deleted, or fell behind. With no consumer advancing it, the slot's xmin or catalog_xmin stays pinned at an old transaction ID, so vacuum cannot remove rows that became dead after that point.

How to spot it: SELECT slot_name, active, restart_lsn, confirmed_flush_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag FROM pg_replication_slots; A slot with active = false and growing lag is the culprit. On PostgreSQL 17 and later the pg_replication_slots view also exposes invalidation_reason and inactive_since columns, so you can see why a slot was invalidated and how long it has been idle. PostgreSQL 18 (released September 25, 2025; current minor as of June 2026 is 18.4) adds idle_replication_slot_timeout to auto-invalidate inactive slots, and a slot invalidated that way reads idle_timeout in invalidation_reason. The setting defaults to 0 (disabled), and even when set it only takes effect at the next checkpoint; see Prevention below.
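The query above shows WAL lag, but the value that actually blocks vacuum is the slot's xmin or catalog_xmin. A minimal sketch that puts both side by side, assuming it is run on the primary (pg_current_wal_lsn() raises an error during recovery):

```sql
-- Sketch: replication slots ordered by how far back they pin the horizon.
-- age() returns the number of transaction IDs between the given xid and now.
SELECT slot_name,
       slot_type,
       active,
       age(xmin)         AS xmin_age_xids,
       age(catalog_xmin) AS catalog_xmin_age_xids,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC NULLS LAST;
```

An inactive slot near the top with an age in the millions is a strong candidate for the holder.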

3. A long analytics query running for hours

A BI tool or a pg_dump started a REPEATABLE READ snapshot for an export. Until it finishes, no row visible to that snapshot can be cleaned up.

How to spot it: Look for long-running SELECT in pg_stat_activity with state = 'active' and a low backend_xmin.
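The check described above can be sketched as a single query. The 15-minute cutoff is an assumed threshold for illustration, not a recommendation from this guide; tune it to what counts as "long" for your workload.

```sql
-- Sketch: active statements that have run for a while and still hold a snapshot.
SELECT pid,
       usename,
       application_name,
       now() - query_start AS query_age,
       age(backend_xmin)   AS xmin_age_xids,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_xmin IS NOT NULL
  AND query_start < now() - interval '15 minutes'  -- assumed cutoff
  AND pid <> pg_backend_pid()                      -- skip this diagnostic session
ORDER BY age(backend_xmin) DESC;
```

The larger xmin_age_xids is, the further back that session's snapshot holds the horizon.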

4. Prepared transaction left behind by 2PC

A two-phase-commit transaction got into the prepared state and nobody ever called COMMIT PREPARED or ROLLBACK PREPARED. It survives reconnects and restarts.

How to spot it: SELECT * FROM pg_prepared_xacts; Any row older than a few minutes is almost certainly the culprit. Prepared transactions persist until someone explicitly commits or rolls them back.
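To see which prepared transaction is oldest and where it lives, a small sketch along these lines may help. It only reads columns of pg_prepared_xacts; nothing here resolves the transaction.

```sql
-- Sketch: prepared transactions with their age in wall-clock time and in xids.
SELECT gid,
       database,
       owner,
       prepared,
       now() - prepared AS prepared_for,
       age(transaction) AS xid_age
FROM pg_prepared_xacts
ORDER BY prepared;
```

The database and owner columns matter when you clean up: COMMIT PREPARED and ROLLBACK PREPARED have to be issued while connected to that database, by the original user or a superuser.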

5. A standby with `hot_standby_feedback = on` and a slow query

A replica is running a long query and feeding its xmin back to the primary. The primary’s vacuum horizon is held back by what is happening on the replica.

How to spot it: On the primary, check SELECT application_name, backend_xmin FROM pg_stat_replication; and compare with txid_current(). If backend_xmin is far behind, your replica is the brake.
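A hedged variant of the check above uses age() instead of a manual comparison with txid_current(), which avoids assigning a new transaction ID just to run a diagnostic. One assumption worth verifying on your version: when the standby streams through a physical replication slot, its feedback is generally recorded in pg_replication_slots.xmin and not on the walsender row, so a NULL here does not rule the standby out.

```sql
-- Sketch: run on the primary. Rows with the largest xmin_age_xids are the
-- standbys holding the horizon back the most.
SELECT application_name,
       client_addr,
       state,
       backend_xmin,
       age(backend_xmin) AS xmin_age_xids
FROM pg_stat_replication
ORDER BY age(backend_xmin) DESC NULLS LAST;
```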

6. Autovacuum is running but getting cancelled

On heavily-locked tables, autovacuum keeps starting and then yielding when a session asks for a conflicting lock. It looks like it ran in logs but actually completed zero work.

How to spot it: Check pg_stat_user_tables.autovacuum_count and last_autovacuum. Both advance only when an autovacuum run completes, so if they stay flat while n_dead_tup keeps climbing, vacuums are being cancelled before they finish. The Postgres log will show canceling autovacuum task messages.
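To find out which sessions autovacuum keeps yielding to, it can help to list everything holding or waiting for a lock on the table. This is a sketch that uses public.orders, the example table from this guide, as a stand-in for your own table.

```sql
-- Sketch: locks held or requested on one table in the current database.
-- Rows with granted = false are waiters; the granted rows ahead of them
-- show who they are waiting on.
SELECT l.pid,
       l.mode,
       l.granted,
       a.state,
       now() - a.query_start AS running_for,
       left(a.query, 60)     AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'relation'
  AND l.relation = 'public.orders'::regclass
  AND l.database = (SELECT oid FROM pg_database WHERE datname = current_database())
ORDER BY l.granted, a.query_start;
```

Frequent DDL, explicit LOCK TABLE statements, or migrations that show up here repeatedly are the usual reason autovacuum never finishes.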

7. `autovacuum_freeze_max_age` reached and an anti-wraparound vacuum is stuck

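Once a table's transaction-ID age passes autovacuum_freeze_max_age, autovacuum launches an aggressive anti-wraparound vacuum that does not yield to conflicting lock requests. If that vacuum is itself blocked, or the pinned horizon stops it from freezing old rows, the age keeps climbing toward wraparound.

How to spot it: SELECT datname, age(datfrozenxid) FROM pg_database; Values above autovacuum_freeze_max_age (200 million by default) mean anti-wraparound vacuums are already being forced. The closer the age gets to 2 billion, the closer the cluster is to refusing new writes.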
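The database-level number tells you there is a problem but not which table is responsible. A minimal sketch that drills down to the tables with the oldest relfrozenxid in the current database:

```sql
-- Sketch: the ten relations closest to (or past) the forced-vacuum threshold.
-- relkind r = table, m = materialized view, t = TOAST table.
SELECT c.oid::regclass                                  AS relation,
       age(c.relfrozenxid)                              AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
```

Any relation whose xid_age exceeds freeze_max_age is one that autovacuum is trying, and so far failing, to freeze.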

Shortest path to fix

Step 1: Find the oldest xmin holder

This is the single most useful query. Run it first.

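SELECT 'session' AS source, pid::text AS id,
       greatest(age(backend_xid), age(backend_xmin)) AS xid_age,
       coalesce(state, 'background') || ': ' || left(query, 60) AS detail
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL OR backend_xmin IS NOT NULL
UNION ALL
SELECT 'prepared_xact', gid, age(transaction), 'prepared at ' || prepared::text
FROM pg_prepared_xacts
UNION ALL
SELECT 'repl_slot', slot_name::text,
       greatest(age(xmin), age(catalog_xmin)),
       CASE WHEN active THEN 'active slot' ELSE 'inactive slot' END
FROM pg_replication_slots
WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL
ORDER BY xid_age DESC;

Every source is measured on the same scale, the age of its transaction ID, so the rows sort correctly against each other. The top row, the one with the largest xid_age, is what is holding back vacuum.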

Step 2: Kill or unblock the offender

For an idle-in-transaction session, ask owners first, then terminate.

SELECT pg_terminate_backend(12345);

For an orphan replication slot (the slot must be inactive before it can be dropped):

SELECT pg_drop_replication_slot('dead_cdc_slot');

For a prepared transaction:

ROLLBACK PREPARED 'transaction-gid-here';

Step 3: Verify the xmin horizon moved

-- PostgreSQL 13+ (xid8 functions, preferred):
SELECT now(), pg_current_xact_id(),
       pg_snapshot_xmin(pg_current_snapshot());
-- pre-13 (legacy txid_* equivalents, which return bigint):
-- SELECT now(), txid_current(), txid_snapshot_xmin(txid_current_snapshot());

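The snapshot xmin should now be close to the current xid. If it is still far behind, you have a second holder; repeat step 1. Note that the snapshot xmin reflects only open and prepared transactions, so re-check pg_replication_slots separately if a slot or standby was involved.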

Step 4: Manually vacuum the bloated table aggressively

Once the horizon is unblocked, force a vacuum that actually reclaims pages.

VACUUM (VERBOSE, ANALYZE) public.orders;
-- only if you can afford an ACCESS EXCLUSIVE lock for the whole rewrite and you want disk back:
VACUUM (FULL, VERBOSE) public.orders;

VACUUM FULL rewrites the table and takes an ACCESS EXCLUSIVE lock — do this in a maintenance window or use pg_repack for online rebuilds.

How to confirm it’s fixed

After the vacuum, you want two things to be true: dead tuples dropped, and the “not yet removable” message is gone.

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'orders';

n_dead_tup should fall sharply (toward zero on a freshly vacuumed table). Re-run VACUUM (VERBOSE) public.orders; and confirm the output no longer reports rows that cannot be removed yet. If n_dead_tup is still high but VACUUM VERBOSE now removes rows cleanly, you simply have a backlog — let autovacuum catch up or repeat the manual vacuum. If the “not yet removable” line persists, a second xmin holder remains; go back to Step 1.

Step 5: Lower the idle-in-transaction timeout cluster-wide

ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
ALTER SYSTEM SET statement_timeout = '30min';  -- applies to every session; scope it per role or per session if long jobs are legitimate
SELECT pg_reload_conf();

This terminates forgotten transactions automatically rather than waiting for a human.
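If a cluster-wide limit is too blunt, the same settings can be scoped per role. The sketch below assumes two hypothetical roles, app_worker and reporting, which are not part of this guide; substitute your own. Role-level settings apply to sessions started after the change.

```sql
-- Sketch: tighter limits for the application, looser ones for reporting.
-- app_worker and reporting are hypothetical role names.
ALTER ROLE app_worker SET idle_in_transaction_session_timeout = '1min';
ALTER ROLE reporting  SET statement_timeout = '2h';

-- Confirm what is configured per role.
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolconfig IS NOT NULL;
```

This keeps forgotten application transactions short-lived without killing a legitimately long export.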

Step 6: Add an autovacuum monitor query

-- alert when oldest xact age > 30 min
SELECT max(extract(epoch FROM now() - xact_start)) AS oldest_tx_seconds
FROM pg_stat_activity
WHERE xact_start IS NOT NULL;

Wire this into Prometheus or whatever you use; page on > 1800.
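The query above only sees sessions. A hedged extension that also covers prepared transactions, which have no row in pg_stat_activity, might look like this; the output column name is kept so the same alert threshold applies.

```sql
-- Sketch: oldest open or prepared transaction, in seconds.
-- greatest() ignores NULL inputs, so an empty pg_prepared_xacts is harmless.
SELECT greatest(
         (SELECT max(extract(epoch FROM now() - xact_start))
            FROM pg_stat_activity
           WHERE xact_start IS NOT NULL
             AND pid <> pg_backend_pid()),   -- skip the monitoring session itself
         (SELECT max(extract(epoch FROM now() - prepared))
            FROM pg_prepared_xacts)
       ) AS oldest_tx_seconds;
```

Replication slots carry no timestamp comparable to xact_start, so they are better alerted on separately by lag or inactivity.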

Step 7: Tune autovacuum on the hot table

Default autovacuum thresholds are conservative for write-heavy tables. For a 50M-row table that gets 1M updates a day, lower the scale factor.

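ALTER TABLE public.orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_cost_limit = 2000
);
-- autovacuum_naptime is a server-wide setting, not a table storage parameter;
-- change it with ALTER SYSTEM or in postgresql.conf if workers wake too rarely.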
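To see why the scale factor matters, it helps to work the numbers. Autovacuum triggers on a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * row count. The sketch below plugs in the 50M-row, 1M-updates-a-day example from this step and assumes the default base threshold of 50; it reads no catalog data, so it runs anywhere.

```sql
-- Sketch: dead tuples needed to trigger autovacuum, and roughly how many
-- days of updates that represents, for the default and the tuned scale factor.
SELECT scale_factor,
       50 + scale_factor * 50000000                           AS dead_tuples_to_trigger,
       round((50 + scale_factor * 50000000) / 1000000.0, 1)   AS days_between_vacuums
FROM (VALUES (0.2), (0.02)) AS t(scale_factor);
```

At the default 0.2 the table accumulates about 10 million dead tuples, roughly ten days of updates, before autovacuum starts. At 0.02 it starts after about 1 million, roughly once a day.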

When this is not on you

A managed provider (RDS, Cloud SQL, Aurora) may run its own internal long transactions for backups or major-version upgrades. These hold xmin briefly and you cannot kill them. If pg_stat_activity shows a session owned by rdsadmin or cloudsqladmin as the only old transaction, just wait it out.

Easy to misdiagnose as

A slow disk. The symptom is identical: queries get slower, IOPS climb, dashboards look saturated. People reach for bigger instances. The actual problem is that the table and its indexes are full of dead tuples Postgres cannot remove, so every scan touches several times more pages than it should. Bigger disks make this last longer but not better.

Another common misdiagnosis: blaming the query planner. The plans got worse because statistics are stale, but the statistics are stale because analyze cannot do its job when the table is full of dead rows.

Prevention

Set idle_in_transaction_session_timeout cluster-wide (available since PostgreSQL 9.6). There is almost never a legitimate reason to hold a transaction idle for hours.
On PostgreSQL 18, set idle_replication_slot_timeout (default 0 = disabled) so abandoned slots get invalidated automatically instead of silently pinning the horizon forever. Invalidation fires at the next checkpoint, so the real delay is your value plus up to one checkpoint_timeout; it does not apply to slots that do not reserve WAL or to standby slots synced from a primary. On PostgreSQL 17 and earlier, alert on slot inactivity instead — there is no auto-cleanup.
Alert on oldest_tx_seconds, the n_dead_tup / n_live_tup ratio, and pg_replication_slots lag.
Never create a replication slot without a matching cleanup story; treat orphan slots like memory leaks.
For tables over 10M rows with significant update load, override autovacuum_vacuum_scale_factor to 0.01-0.02 instead of relying on the global default of 0.2.
Run a weekly SELECT * FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20; as a health check.
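The dead-tuple ratio mentioned in the alerting item can be computed directly from pg_stat_user_tables. This is a sketch; the 10,000-row floor is an assumed cutoff to keep tiny tables out of the report.

```sql
-- Sketch: tables ranked by the share of their tuples that are dead.
SELECT schemaname,
       relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000          -- assumed floor, adjust to taste
ORDER BY dead_pct DESC NULLS LAST
LIMIT 20;
```

A table that stays near the top across several runs, with last_autovacuum recent, points back to a pinned horizon and not to a lazy autovacuum.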

FAQ

Will killing a backend cause data loss? No. Postgres rolls back the transaction cleanly. The application sees a connection error and should retry. The committed data of every other transaction is untouched.
Should I use VACUUM FULL regularly? No. It is an ACCESS EXCLUSIVE lock rewrite that blocks all reads and writes on the table and needs free disk roughly equal to the table size. Use pg_repack for routine online bloat cleanup; reserve VACUUM FULL for emergencies in a maintenance window.
Why did regular VACUUM return success but n_dead_tup didn’t drop? Because the dead rows were still visible to the oldest open transaction (the xmin horizon). VACUUM ran, found those rows “dead but not yet removable,” and left them. It only reclaims them after the blocking transaction or slot is gone.
I killed the long transaction but disk space hasn’t returned. Plain VACUUM marks pages reusable but does not shrink the file on disk; the table stops growing rather than shrinking. To physically return space to the OS you need VACUUM FULL or pg_repack.
Is this an autovacuum bug I should report? Almost never. Autovacuum is correctly refusing to remove rows that an active snapshot can still see — that is MVCC working as designed. Fix the blocking transaction, not the vacuum settings.
On a managed database (RDS, Cloud SQL, Aurora) can I still kill the backend? Yes, pg_terminate_backend works on your own sessions. You cannot kill provider-owned sessions such as rdsadmin or cloudsqladmin; those hold xmin only briefly during backups or upgrades, so wait them out.

External references

PostgreSQL docs: Routine Vacuuming — the xmin horizon and why dead tuples stay
PostgreSQL docs: VACUUM — exact options including FULL, VERBOSE, ANALYZE
PostgreSQL docs: pg_replication_slots — slot state and the invalidation_reason column

Tags: #Backend #Troubleshooting #infra #postgres #Database #autovacuum #bloat