Message Queue Dead-Letter Queue Building Up

Your DLQ in SQS, RabbitMQ, or Kafka is growing without bound. Sample and classify the failures, fix the poison-message root cause, cap the retry budget, then redrive the safe messages.

Published: May 23, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

Your SQS dead-letter queue had 3 messages last week and 8000 today. Or your RabbitMQ dead-letter exchange (DLX) is showing a growing backlog. Or your Kafka dead-letter topic is unbounded. Each spike is one class of message your consumer cannot process: a schema mismatch, a downstream timeout, a malformed payload, or a logic bug. Letting the DLQ grow silently means you are dropping work and, once messages cross the queue retention limit (SQS default 4 days, max 14), losing it for good.

Fastest path to stable: pull 10 sample messages without deleting them, classify the failure (schema / timeout / malformed / business-logic), fix the one root cause, cap maxReceiveCount to 5, then redrive the safe messages back. Do not bulk-replay before you know why they failed — they will bounce straight back to the DLQ.

The rest of this guide walks each class in detail, but the order is always the same: sample first, fix the root cause, only then replay.

First, find which bucket you are in

Sample the messages before you touch anything. The fix is completely different per class, and most DLQ floods are a single class.

Signal in the sample	Likely cause	Fix section
Every failed message is the same schema version with one new/missing field	Schema drift between producer and consumer	Step 2
Failure metadata mentions `timeout`, `ETIMEDOUT`, `503`; growth tracks downstream latency	Downstream API slow or down	Step 3
Same payload hash repeats; consumer crash-loops; `ApproximateReceiveCount` near max	Single poison message	Step 6
Mixed ages, `maxReceiveCount` over 10, real failures surface days late	Retry budget too generous	Step 4
DLQ grew for weeks with no page	No alerting	Step 7
DLQ growth lines up with consumer scale-in events	Consumer scaled down during an incident	Cause 6 below, then Step 5

The six causes, ranked by hit rate

Schema drift between producer and consumer. Producer adds a new required field; the consumer parser throws on the old code path. Every new message fails until the consumer ships. Spot it: all DLQ samples are the same producer version with one field the consumer cannot parse.
Downstream API timing out. The consumer fans out to a third party that started slowing down. Each message times out, retries N times, then lands in the DLQ. Spot it: DLQ growth correlates with downstream latency; failure reason in metadata mentions a timeout.
A single poison message. One malformed message crashes the consumer process; it restarts, picks the same message, crashes again. Spot it: consumer logs show repeated crashes on the same payload hash; SQS ApproximateReceiveCount is near max.
Retry budget too generous, hiding real failures. maxReceiveCount = 100 means a broken message stays in the queue for 100 attempts before reaching the DLQ, so real failures surface days late. SQS default is 10; the valid range is 1 to 1000 (as of June 2026). Anything over 10 is usually a smell.
No DLQ monitoring or alerting. The DLQ grew for weeks but nobody was paged. By the time anyone notices there are 50000 messages.
Consumer scaled down during an incident. The autoscaler dropped consumer count on low CPU; the main queue grew, the visibility timeout expired, messages re-enqueued, and eventually crossed the DLQ threshold. Spot it: DLQ growth timestamps line up with consumer scale-in events. Fix the scaling policy (scale on queue depth / ApproximateNumberOfMessagesVisible, not CPU) and the redrive in Step 5 handles the backlog.
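
The scaling fix in the last cause can be sketched as a backlog calculation: size the pool from how many messages are waiting, not from CPU. This is a minimal sketch, not an autoscaler API; `desiredConsumers` and every field of `ScalingInput` are hypothetical names, and the throughput figure has to come from your own measurements.

```typescript
// Sketch: size the consumer pool from queue depth instead of CPU.
// All names and numbers here are illustrative assumptions.
interface ScalingInput {
  visibleMessages: number;       // e.g. ApproximateNumberOfMessagesVisible
  perConsumerRatePerSec: number; // measured throughput of one consumer
  targetDrainSeconds: number;    // how quickly the backlog should clear
  minConsumers: number;
  maxConsumers: number;
}

function desiredConsumers(input: ScalingInput): number {
  const { visibleMessages, perConsumerRatePerSec, targetDrainSeconds, minConsumers, maxConsumers } = input;
  if (perConsumerRatePerSec <= 0 || targetDrainSeconds <= 0) {
    throw new Error('rate and drain target must be positive');
  }
  // Messages one consumer can clear within the drain target.
  const capacityPerConsumer = perConsumerRatePerSec * targetDrainSeconds;
  const needed = Math.ceil(visibleMessages / capacityPerConsumer);
  // Clamp so an empty queue keeps a floor and a flood cannot scale without bound.
  return Math.min(maxConsumers, Math.max(minConsumers, needed));
}
```

With 12000 visible messages, 20 messages per second per consumer, and a 60-second drain target, this asks for 10 consumers. A CPU-based policy could still be scaling in at that point, because consumers blocked on a slow downstream use very little CPU.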

Before you start

Snapshot DLQ depth and growth rate; capture 10 sample messages.
Identify which consumer service owns the DLQ.
Check whether processing is idempotent (safe to replay) or not (risky to replay).
Tag messages by failure class before replaying.
Coordinate with the producer team if schema fixes are needed.

Information to collect

DLQ size, growth rate, age of the oldest message.
10 to 20 sample message bodies with metadata (ApproximateReceiveCount, ApproximateFirstReceiveTimestamp).
Consumer logs from the time DLQ growth started.
Producer recent deploys.
Downstream service health during the period.

Step-by-step fix

Step 1: Sample and classify

Receive without deleting so the messages stay in the DLQ while you inspect them. Set the visibility timeout long enough to read all 10, but short enough that they reappear if you walk away.

# SQS: receive without deleting
aws sqs receive-message \
  --queue-url "$DLQ_URL" \
  --max-number-of-messages 10 \
  --visibility-timeout 30 \
  --attribute-names All \
  --message-attribute-names All > sample.json

Categorize each sample:

Schema mismatch (specific producer version, missing or extra field)
Downstream timeout (mentioned in error metadata)
Malformed JSON or encoding (parse error)
Business-logic failure (validation rejected the message on purpose)
Unknown (read the payload)

Each class needs a different fix. Receiving from the DLQ does increment each message's ApproximateReceiveCount, but that is harmless as long as the DLQ has no redrive policy of its own: maxReceiveCount is enforced on the source queue, so inspecting samples here does not move them anywhere.
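
The classification pass can be roughed out in code once the samples are on disk. The sketch below assumes JSON bodies and assumes your consumer recorded an error string somewhere you can read back; the `errorReason` field and the keyword patterns are hypothetical and should be adapted to what your failures actually look like.

```typescript
// Sketch: bucket sampled DLQ messages by failure class.
// Assumes (hypothetically) that the consumer recorded an error string per message.
type FailureClass = 'schema' | 'timeout' | 'malformed' | 'business-logic' | 'unknown';

interface DlqSample {
  body: string;
  errorReason?: string; // hypothetical field; adapt to wherever your consumer stores it
}

function classifySample(sample: DlqSample): FailureClass {
  // A body that is not valid JSON is malformed regardless of the recorded reason.
  try {
    JSON.parse(sample.body);
  } catch {
    return 'malformed';
  }
  const reason = (sample.errorReason ?? '').toLowerCase();
  if (/timeout|etimedout|\b503\b/.test(reason)) return 'timeout';
  if (/schema|required|missing field|unexpected field/.test(reason)) return 'schema';
  if (/validation|rejected/.test(reason)) return 'business-logic';
  return 'unknown';
}

function countByClass(samples: DlqSample[]): Record<FailureClass, number> {
  const counts: Record<FailureClass, number> = {
    schema: 0, timeout: 0, malformed: 0, 'business-logic': 0, unknown: 0,
  };
  for (const s of samples) counts[classifySample(s)] += 1;
  return counts;
}
```

If one class dominates the counts, start with its step. Anything that lands in `unknown` still needs a human to read the payload.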

Step 2: Fix the schema mismatch

Be liberal in what you accept. Mark new fields optional with sensible defaults so old and new producer versions both parse.

import { z } from 'zod';

const MessageSchema = z.object({
  id: z.string(),
  userId: z.string(),
  // New field: optional with a sensible default
  source: z.string().optional().default('unknown'),
  // Old field: kept for back-compat
  type: z.string(),
});

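function parse(raw: string) {
  try {
    return MessageSchema.parse(JSON.parse(raw));
  } catch (err) {
    // `metrics` is your application's metrics client. Keep the label low-cardinality:
    // a full error string per message would create one time series per distinct error.
    const reason = err instanceof z.ZodError ? 'schema' : 'malformed';
    metrics.counter('mq_parse_failure').inc({ reason });
    throw err;
  }
}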

Deploy the consumer first when adding fields. Mark all new fields optional initially, then make them required only once every consumer instance is on the new version.

Step 3: Add a per-message timeout for downstream calls

Without a per-call timeout, one slow downstream blocks your whole consumer pool and turns a latency blip into a DLQ flood.

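async function processMessage(msg: Message) {
  const controller = new AbortController();
  // Abort the downstream call after 10 s instead of waiting indefinitely.
  const timer = setTimeout(() => controller.abort(), 10000);

  try {
    const res = await fetch(downstream, { signal: controller.signal });
    // fetch resolves on HTTP errors, so surface 4xx/5xx responses as failures explicitly.
    if (!res.ok) throw new Error(`downstream responded ${res.status}`);
  } finally {
    clearTimeout(timer);
  }
}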

Keep the per-call timeout comfortably below the queue visibility timeout, or the broker will redeliver the message while the first attempt is still running and you will double-process.
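
A quick way to check that relationship is to compare worst-case processing time against the visibility timeout. This is a sketch under stated assumptions: `fitsVisibilityTimeout` is a hypothetical helper, it assumes downstream calls run sequentially, and the 0.8 safety factor is an arbitrary margin rather than anything the broker requires.

```typescript
// Sketch: check that worst-case processing time fits inside the visibility timeout.
// The 0.8 safety factor is an arbitrary assumption, not a broker requirement.
function fitsVisibilityTimeout(
  perCallTimeoutMs: number,
  callsPerMessage: number, // sequential downstream calls made for one message
  visibilityTimeoutSec: number,
  safetyFactor = 0.8,
): boolean {
  const worstCaseMs = perCallTimeoutMs * callsPerMessage;
  return worstCaseMs <= visibilityTimeoutSec * 1000 * safetyFactor;
}
```

One 10-second call fits comfortably in a 60-second visibility timeout. Five sequential 10-second calls do not, so that consumer needs a shorter per-call timeout or a longer visibility timeout.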

Step 4: Cap the retry budget

Tighten the redrive policy so a broken message reaches the DLQ in a handful of attempts, not a hundred. Each broker has a different knob and a different default (as of June 2026):

Broker	Retry-budget setting	Default	Recommended
SQS	`maxReceiveCount` in the source queue’s `RedrivePolicy`	`10`	`3` to `5`
RabbitMQ (quorum)	`x-delivery-limit` queue argument	`20` (since 4.0)	`5`
Kafka (Spring)	`attempts` in `@RetryableTopic`	`3`	`3` to `5`

SQS settings live on the source queue, not the DLQ.

# SQS: tighten the redrive policy on the SOURCE queue
aws sqs set-queue-attributes \
  --queue-url "$MAIN_URL" \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"<dlq-arn>\",\"maxReceiveCount\":\"5\"}",
    "VisibilityTimeout": "60"
  }'

5 attempts is a sane default. After 5 failures, DLQ-and-alert is the right move. Remember maxReceiveCount counts receives, not wall-clock retries, so it interacts with your visibility timeout.
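
To see how the two settings interact, multiply them. The sketch below is a rough estimate only: it assumes every attempt fails by letting the visibility timeout run out and that the message is picked up again immediately. `approxSecondsUntilDlq` is a hypothetical helper name.

```typescript
// Sketch: roughly how long a failing message lingers before dead-lettering.
// Assumes each attempt runs out the full visibility timeout and the message
// is received again as soon as it becomes visible.
function approxSecondsUntilDlq(maxReceiveCount: number, visibilityTimeoutSec: number): number {
  return maxReceiveCount * visibilityTimeoutSec;
}

// 5 receives at a 60 s visibility timeout: 300 s, about 5 minutes.
const tightBudget = approxSecondsUntilDlq(5, 60);
// 100 receives at a 30-minute visibility timeout: 180000 s, about 50 hours.
const generousBudget = approxSecondsUntilDlq(100, 1800);
```

Consumers that fail fast and release the message early will reach the DLQ sooner than this estimate. Consumers that hang until the timeout expires will match it.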

For RabbitMQ quorum queues: they enforce a delivery limit (default 20 as of RabbitMQ 4.x, June 2026). Once a message's delivery count exceeds it, the message is dead-lettered if a DLX is configured and dropped otherwise. Set the limit explicitly and lower:

channel.assertQueue('main', {
  durable: true,
  arguments: {
    'x-queue-type': 'quorum',
    'x-delivery-limit': 5,
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'failed',
  },
});

If you need the broker to retry dead-lettering until the DLX target confirms receipt, set x-dead-letter-strategy to at-least-once (the default is at-most-once, which can silently lose dead-lettered messages). At-least-once requires overflow set to reject-publish on the source queue; with the default drop-head overflow it silently falls back to at-most-once. Setting a max-length limit on the source queue is also recommended, so unconfirmed dead-letters cannot pile up without bound.

For Kafka, there is no built-in DLQ. Use the retry-topic plus dead-letter-topic pattern (for example Spring Kafka @RetryableTopic with a @DltHandler): with the default attempts = 3 (one original delivery plus two retries), failed records pass through retry topics such as orders-retry-0 and orders-retry-1, then land in orders-dlt. The exact topic suffixes and delays depend on the configured backoff and suffixing strategy; Spring Kafka's default is a fixed delay, so exponential backoff has to be configured explicitly. Use exponential backoff with jitter so a downstream blip does not trigger a synchronized retry storm.
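
Exponential backoff with jitter is broker-agnostic, so here is a minimal sketch of the "full jitter" variant. It is not Spring Kafka's API; `backoffDelayMs` and its default values are assumptions for illustration, and the random source is injectable so the behaviour can be tested deterministically.

```typescript
// Sketch: exponential backoff with "full jitter".
// Parameter names and default values are illustrative assumptions.
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  baseMs = 1000,
  maxMs = 60000,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  // Exponential ceiling, capped so late retries do not wait unboundedly.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so retries spread out in time.
  return Math.floor(random() * ceiling);
}
```

Without the jitter term, every consumer that failed at the same moment retries at the same moment, which is the synchronized storm the paragraph above warns about.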

Step 5: Replay safe messages after the root-cause fix

On SQS, do not hand-roll a copy loop unless you need to transform messages. Use native DLQ redrive, which moves messages from the DLQ back to the source queue (or a custom destination of the same type) with a rate cap.

In the SQS console: open the queue you configured as a dead-letter queue, choose Start DLQ redrive, under Message destination pick Redrive to source queue(s) (or Redrive to custom destination with an ARN), set Velocity control to System optimized or Custom max velocity (max 500 messages/second), then choose Redrive messages. You can stop it with Cancel DLQ redrive.

From the CLI / SDK use StartMessageMoveTask (track with ListMessageMoveTasks, stop with CancelMessageMoveTask):

# SQS native redrive: DLQ -> its source queue, capped at 50 msg/s
aws sqs start-message-move-task \
  --source-arn "<dlq-arn>" \
  --max-number-of-messages-per-second 50

Caveats as of June 2026 (per the SQS DLQ redrive docs): a redrive task runs at most 36 hours; you can have at most 100 active redrive tasks per account; the custom max velocity caps at 500 messages/second; redriven messages get a new messageID and enqueueTime and their retention period resets; SQS cannot filter or modify messages during a redrive. Start with a low velocity and ramp up while watching the source queue.
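
Before starting a large redrive, it helps to check whether the chosen velocity can finish inside the task time limit. The sketch below uses the 36-hour and 500 messages/second figures quoted above and treats the configured velocity as if it were sustained, which is a best case because the setting is a cap. `redrivePlan` is a hypothetical helper, not part of the AWS SDK.

```typescript
// Sketch: will a redrive at a given velocity finish inside the task time limit?
// The 36-hour and 500 msg/s figures are the limits quoted in the caveats.
const MAX_REDRIVE_HOURS = 36;
const MAX_VELOCITY_PER_SEC = 500;

function redrivePlan(dlqDepth: number, velocityPerSec: number) {
  if (velocityPerSec < 1 || velocityPerSec > MAX_VELOCITY_PER_SEC) {
    throw new Error(`velocity must be between 1 and ${MAX_VELOCITY_PER_SEC}`);
  }
  // Best case: the task sustains the configured cap for its whole run.
  const seconds = Math.ceil(dlqDepth / velocityPerSec);
  return {
    seconds,
    fitsInOneTask: seconds <= MAX_REDRIVE_HOURS * 3600,
  };
}
```

For the 8000-message DLQ from the introduction, 50 messages per second drains it in about 160 seconds. A backlog of 10 million messages at the same velocity would need about 200000 seconds, which is beyond one 36-hour task.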

If you do need to transform or filter while replaying (or you are not on SQS), drain the DLQ yourself and verify a small sample first:

import { SQSClient, ReceiveMessageCommand, SendMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';

async function replayDLQ(client: SQSClient, dlqUrl: string, mainUrl: string) {
  while (true) {
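    const { Messages } = await client.send(new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 5,
      // Without this, message attributes are not returned and would be lost on replay.
      MessageAttributeNames: ['All'],
    }));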
    if (!Messages || Messages.length === 0) break;

    for (const m of Messages) {
      await client.send(new SendMessageCommand({
        QueueUrl: mainUrl,
        MessageBody: m.Body!,
        MessageAttributes: m.MessageAttributes,
      }));
      await client.send(new DeleteMessageCommand({
        QueueUrl: dlqUrl,
        ReceiptHandle: m.ReceiptHandle!,
      }));
    }
  }
}

Only replay after confirming the root cause is fixed. Redrive 10 messages first and watch them succeed before redriving the rest.

Step 6: Quarantine poison messages

For deterministic failures (a payload that will always crash the parser), do not loop them. Route them to a separate quarantine queue after a couple of receives so they stop poisoning the live path.

// Track receive count; send to permanent quarantine after N
if (Number(msg.attributes.ApproximateReceiveCount) > 3) {
  await sendToQuarantine(msg);
  return;
}

The quarantine queue does not auto-retry. An operator inspects it and decides. This is the right home for the “Unknown” bucket from Step 1.

Step 7: Add a DLQ depth alert

A DLQ should never silently fill. Even one stuck message means something needs investigation.

# CloudWatch alarm
alarm_name: dlq-depth-high
metric: ApproximateNumberOfMessagesVisible
queue: my-service-dlq
threshold: 0
period: 300
comparison: GreaterThanThreshold

Page on-call when the DLQ holds any message for 5 minutes. (Some teams raise the threshold to a small number like 5 to absorb known-transient noise, but treat > 0 as the goal.) For RabbitMQ use a management alert on the DLX target queue; for Kafka alert on the consumer-group lag and message count of the *-dlt topic.
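
The alarm definition above is shorthand. As a rough sketch, the equivalent CloudWatch PutMetricAlarm parameters could be built like this. The field names follow the PutMetricAlarm API, while `dlqAlarmParams`, the queue name, and the SNS topic ARN are placeholders you supply. The object is meant to be passed to whichever CloudWatch client you use; no AWS call is made here.

```typescript
// Sketch: parameters for a CloudWatch PutMetricAlarm call on DLQ depth.
// The queue name and SNS topic ARN are placeholders supplied by the caller.
interface DlqAlarmParams {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Statistic: string;
  Period: number;
  EvaluationPeriods: number;
  Threshold: number;
  ComparisonOperator: string;
  TreatMissingData: string;
  AlarmActions: string[];
}

function dlqAlarmParams(queueName: string, snsTopicArn: string, threshold = 0): DlqAlarmParams {
  return {
    AlarmName: `${queueName}-depth-high`,
    Namespace: 'AWS/SQS',
    MetricName: 'ApproximateNumberOfMessagesVisible',
    Dimensions: [{ Name: 'QueueName', Value: queueName }],
    Statistic: 'Maximum',        // any non-empty sample in the period counts
    Period: 300,                 // seconds
    EvaluationPeriods: 1,
    Threshold: threshold,        // 0 means "alarm on any visible message"
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'notBreaching',
    AlarmActions: [snsTopicArn], // where the page is sent
  };
}
```

Treating missing data as not breaching is a design choice in this sketch, made so that a quiet queue that reports no datapoints does not alarm. Check that it matches how your account reports the metric.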

How to confirm it is fixed

DLQ depth returns to baseline (typically zero) after the redrive.
The redrive task reaches status COMPLETED (aws sqs list-message-move-tasks --source-arn <dlq-arn>).
Consumer error rate over 24 hours stays under 0.1 percent.
Sample 10 freshly processed messages and confirm the parse path works for both old and new producer schemas.
Downstream latency p99 stays under the per-call timeout you set in Step 3.

Long-term prevention

Make DLQ alerts mandatory for every queue in production, with the threshold at or near zero.
Consumer schemas use optional fields by default; producers deploy after consumers.
Standard retry budget across the org: 3 to 5 attempts (maxReceiveCount / x-delivery-limit), then DLQ.
Quarterly DLQ review: any DLQ over 0 needs investigation.
Idempotent processing by default (use an idempotency key based on a producer-assigned ID in the payload, not the broker's messageID, which an SQS redrive replaces) so replays are always safe.
Scale consumers on queue depth, not CPU, so an incident does not silently starve the consumer pool.
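
The idempotency item in the list above can be sketched as a small wrapper that refuses to run the handler twice for the same key. This is an illustration under assumptions: `IdempotentProcessor` is a hypothetical class, the in-memory Set stands in for a durable store, and the key is taken from the payload because an SQS redrive assigns a new messageID.

```typescript
// Sketch: skip work that has already been done for the same idempotency key.
// An in-memory Set stands in for a durable store (database, cache, ...).
class IdempotentProcessor<T> {
  private readonly seen = new Set<string>();

  constructor(
    private readonly keyOf: (msg: T) => string,
    private readonly handle: (msg: T) => void,
  ) {}

  /** Returns true if the handler ran, false if the message was a duplicate. */
  process(msg: T): boolean {
    const key = this.keyOf(msg);
    if (this.seen.has(key)) return false;
    this.handle(msg);
    // Record the key only after the handler succeeds, so a thrown error
    // leaves the message eligible for retry.
    this.seen.add(key);
    return true;
  }
}

// Usage: key on a producer-assigned order ID carried in the payload.
interface OrderMessage { orderId: string; amount: number }
const charged: string[] = [];
const processor = new IdempotentProcessor<OrderMessage>(
  (m) => m.orderId,
  (m) => { charged.push(m.orderId); },
);
const firstRun = processor.process({ orderId: 'o-1', amount: 10 });
const replayRun = processor.process({ orderId: 'o-1', amount: 10 });
```

In production the check and the write would need to be atomic in the shared store, since two consumers can receive the same message concurrently. The single-process Set here does not cover that case.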

Common pitfalls

Increasing maxReceiveCount to “buy time” — it just delays seeing real failures.
Replaying the DLQ without fixing the root cause — the same messages return immediately.
No idempotency keys, so replays double-process.
Forgetting that an SQS redrive resets messageID and the retention clock, breaking dedupe assumptions downstream.
On RabbitMQ, expecting at-least-once dead-lettering to work with the default drop-head overflow — it silently degrades to at-most-once.
Ignoring the DLQ until end of quarter; messages hit the retention limit and are lost.

FAQ

Should I delete DLQ messages I do not understand? No. Move them to a quarantine queue for inspection. Deletion loses the signal you need to find the bug.

How often should I drain the DLQ? Ideally never — a DLQ at zero is the goal. If it grows, fix the cause that day and redrive once the fix is live.

Can I auto-replay the DLQ? Yes, with care: only for failure classes you have already root-caused, such as a downstream outage that has cleared. Never auto-replay unknown failures, and never without idempotent processing.

What is the SQS maxReceiveCount default and range? The default is 10 and the valid range is 1 to 1000 (as of June 2026). For most services 3 to 5 is the right value.

Why do redriven SQS messages look brand new? A native DLQ redrive assigns each message a new messageID and enqueueTime and resets its retention period; SQS treats them as new messages. Account for that if anything downstream keys off message age or ID.

Does increasing the visibility timeout fix DLQ buildup? Only if the cause is genuinely “the consumer needs more time per message.” If the cause is a poison message or schema drift, a longer visibility timeout just slows the bleed. Diagnose first using the table above.

What is the RabbitMQ or Kafka equivalent of SQS redrive? Neither has a one-click redrive. On RabbitMQ, use the Shovel plugin (or rabbitmqadmin) to move messages from the DLX target queue back to the main exchange. On Kafka, run a small consumer/producer that reads *-dlt and re-publishes to the source topic — the Step 5 transform-loop pattern, adapted to your client. In both cases, fix the root cause first.

How do I know the producer/consumer schema mismatch is really gone? After deploying the consumer fix, redrive 10 messages and confirm they leave the DLQ and do not return within a few minutes. Then sample 10 freshly produced messages and check the parse path accepts both the old and the new schema version (Step 2’s optional-field approach is what makes both pass).

Tags: #Backend #Troubleshooting #message-queue
