Message Queue Dead-Letter Queue Building Up

DLQ growing in SQS / RabbitMQ / Kafka without bound. Fix by classifying failures, fixing the root cause of poison messages, and adding retry-with-backoff.

Your SQS dead-letter queue had 3 messages last week and 8000 today. Or your RabbitMQ DLX is showing a growing backlog. Or your Kafka retry or dead-letter topic keeps growing without bound. Each spike is a class of message your consumer cannot process: schema mismatch, downstream timeout, malformed payload, or a logic bug. Letting the DLQ grow silently means you are dropping work and possibly violating SLAs, and messages that outlive the DLQ's retention period are lost for good. Fix it by classifying the failures, fixing the root cause of poison messages, and adding retry-with-backoff with a budget.

Common causes

Ordered by hit rate.

1. Schema drift between producer and consumer

The producer adds a new required field and the consumer's parser, still on the old code path, throws. Every new message fails until the consumer fix ships.

How to spot it: Sample a few DLQ messages. All from same producer version with one new field that the consumer cannot parse.

2. Downstream API timing out

Consumer fans out to a third-party API that started slowing down. Each message times out after 30s, then retries N times, then DLQ.

How to spot it: DLQ growth correlates with downstream latency. Failure reason in metadata mentions timeout.

3. Single bad message poisoning the queue

One malformed message crashes the consumer process; consumer restarts, picks the same message, crashes again. Loop.

How to spot it: Consumer logs show repeated crashes with the same payload hash. SQS ApproximateReceiveCount is at or near maxReceiveCount.

4. Retry budget too generous, hiding real failures

maxReceiveCount = 100 means each broken message stays in the queue for 100 attempts before moving to DLQ. Real failures show up days late.

How to spot it: Check queue policy. maxReceiveCount over 10 = retry budget too high.
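To see why a generous budget delays detection, it helps to do the arithmetic. The sketch below is a rough estimate only: it assumes every failed attempt holds the message for the full visibility timeout and that the message is picked up again immediately. The function name and the 30-minute timeout are illustrative.

```typescript
// Rough estimate of how long a permanently failing message circulates
// before it reaches the DLQ, assuming each attempt holds the message for
// the full visibility timeout and it is re-received right away.
function estimatedHoursToDlq(maxReceiveCount: number, visibilityTimeoutSeconds: number): number {
  return (maxReceiveCount * visibilityTimeoutSeconds) / 3600;
}

console.log(estimatedHoursToDlq(100, 1800)); // 50 hours: more than two days
console.log(estimatedHoursToDlq(5, 60));     // about 0.083 hours: 5 minutes
```

Under these assumptions, the same broken message surfaces in minutes with a budget of 5 and only after days with a budget of 100.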

5. No DLQ monitoring or alerting

DLQ has been growing for weeks but nobody is paged. By the time anyone notices, there are 50000 messages.

How to spot it: No CloudWatch / Datadog alert on DLQ depth.

6. Consumer scaled down during incident

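The auto-scaler dropped consumer count during low CPU and the main queue grew. Messages that had been received but not finished hit their visibility timeout, returned to the queue with a higher receive count, and eventually reached maxReceiveCount and landed in the DLQ without ever being fully processed.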

How to spot it: Compare DLQ growth timestamps with consumer scale-in events.

Before you start

  • Snapshot DLQ depth and growth rate; capture 10 sample messages.
  • Identify which consumer service owns the DLQ.
  • Check whether processing is idempotent (safe to replay) or not (risky to replay).
  • Tag messages by failure class before replaying.
  • Coordinate with the producer team if schema fixes are needed.

Information to collect

  • DLQ size, growth rate, age of oldest message.
  • 10 to 20 sample message bodies with metadata (ApproximateReceiveCount, ApproximateFirstReceiveTimestamp).
  • Consumer logs from the time DLQ growth started.
  • Producer recent deploys.
  • Downstream service health during the period.
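Growth rate is the depth difference between two snapshots divided by the elapsed time. A small sketch using the figures from the introduction; the `DepthSnapshot` type, the function name, and the dates are illustrative.

```typescript
interface DepthSnapshot {
  depth: number; // e.g. ApproximateNumberOfMessagesVisible at that moment
  at: Date;
}

// Messages per hour between two snapshots; negative means the DLQ is draining.
function growthPerHour(earlier: DepthSnapshot, later: DepthSnapshot): number {
  const hours = (later.at.getTime() - earlier.at.getTime()) / (60 * 60 * 1000);
  if (hours <= 0) throw new Error('snapshots must be in chronological order');
  return (later.depth - earlier.depth) / hours;
}

// Illustrative dates, one week apart.
const lastWeek: DepthSnapshot = { depth: 3, at: new Date('2024-01-01T00:00:00Z') };
const today: DepthSnapshot = { depth: 8000, at: new Date('2024-01-08T00:00:00Z') };
console.log(growthPerHour(lastWeek, today).toFixed(1)); // "47.6" messages per hour
```

Taking a third snapshot an hour later shows whether the growth is still ongoing or was a one-off burst.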

Step-by-step fix

Step 1: Sample and classify

# SQS: receive without deleting (samples become visible again after 30 s)
aws sqs receive-message \
  --queue-url "$DLQ_URL" \
  --max-number-of-messages 10 \
  --visibility-timeout 30 \
  --attribute-names All \
  --message-attribute-names All > sample.json

Categorize each sample:

  • Schema mismatch (specific producer version, missing field)
  • Downstream timeout (mentioned in error metadata)
  • Malformed JSON / encoding (parse error)
  • Business logic failure (validation rejected)
  • Unknown (read the payload)

Each class needs a different fix.
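With more than a handful of samples, a first pass of the classification can be automated. A minimal sketch, assuming JSON payloads and assuming each sample has been paired with an error string taken from your consumer logs. SQS does not record a failure reason on the message by itself, so the `DlqSample` shape, the `errorReason` field, and the matching patterns are all hypothetical and need adapting to your own log messages.

```typescript
// Hypothetical shape: body from the DLQ message, errorReason from your consumer logs.
interface DlqSample {
  body: string;
  errorReason?: string;
}

type FailureClass =
  | 'schema_mismatch'
  | 'downstream_timeout'
  | 'malformed_payload'
  | 'business_logic'
  | 'unknown';

function classifyFailure(sample: DlqSample): FailureClass {
  // A body that is not valid JSON is malformed regardless of the logged reason.
  try {
    JSON.parse(sample.body);
  } catch {
    return 'malformed_payload';
  }
  const reason = (sample.errorReason ?? '').toLowerCase();
  if (/timeout|timed out|etimedout|abort/.test(reason)) return 'downstream_timeout';
  if (/required|missing field|unrecognized key|invalid type|schema/.test(reason)) return 'schema_mismatch';
  if (/validation|rejected|not allowed/.test(reason)) return 'business_logic';
  return 'unknown';
}

// Count samples per class so the largest class gets fixed first.
function tally(samples: DlqSample[]): Record<FailureClass, number> {
  const counts: Record<FailureClass, number> = {
    schema_mismatch: 0,
    downstream_timeout: 0,
    malformed_payload: 0,
    business_logic: 0,
    unknown: 0,
  };
  for (const s of samples) counts[classifyFailure(s)] += 1;
  return counts;
}
```

Anything that lands in `unknown` still needs a human to read the payload.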

Step 2: Fix the schema mismatch

// Be liberal in what you accept
import { z } from 'zod';

const MessageSchema = z.object({
  id: z.string(),
  userId: z.string(),
  // New field: .default() also makes it optional, so old messages still parse
  source: z.string().default('unknown'),
  // Old field, kept for back-compat
  type: z.string(),
});

function parse(raw: string) {
  try {
    return MessageSchema.parse(JSON.parse(raw));
  } catch (err) {
    // Use a low-cardinality label; raw error messages make poor metric labels.
    // `metrics` is your own metrics client.
    const reason = err instanceof z.ZodError ? 'schema' : 'malformed_json';
    metrics.counter('mq_parse_failure').inc({ reason });
    throw err;
  }
}

Deploy consumer-first when adding fields. Mark all new fields optional initially.

Step 3: Add per-message timeout for downstream calls

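async function processMessage(msg: Message) {
  const controller = new AbortController();
  // Abort the downstream call after 10 s
  const timer = setTimeout(() => controller.abort(), 10000);

  try {
    const res = await fetch(downstream, { signal: controller.signal });
    // fetch only rejects on network errors or abort, so check the status too
    if (!res.ok) throw new Error(`downstream returned ${res.status}`);
  } finally {
    clearTimeout(timer);
  }
}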

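Without a per-call timeout, one slow downstream blocks your whole consumer pool. Keep the timeout well below the queue's visibility timeout, or the message is redelivered while the first attempt is still running.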

Step 4: Cap retry budget

# SQS: tighten redrive policy
aws sqs set-queue-attributes \
  --queue-url $MAIN_URL \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"<dlq-arn>\",\"maxReceiveCount\":\"5\"}",
    "VisibilityTimeout": "60"
  }'

5 attempts is a common default. After 5 failures, DLQ-and-alert is the right move.

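For RabbitMQ, messages are dead-lettered when the consumer rejects or nacks them with requeue=false, when they expire, or when the queue exceeds its length limit. The TTL below is not a retry cap: it dead-letters any message that waits more than 5 minutes in the queue. Quorum queues can cap redeliveries with x-delivery-limit.

channel.assertQueue('main', {
  arguments: {
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'failed',
    // Dead-letter messages that sit unconsumed for 5 minutes
    'x-message-ttl': 300000,
  },
});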
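Capping attempts covers the budget, but the delay between attempts matters too. One common approach is exponential backoff with full jitter, computed from the receive count. The sketch below is an illustration under assumptions: the function names, the 2-second base, and the 300-second cap are not from any library, and the random source is injectable only to make the behaviour testable.

```typescript
// Delay before the next attempt: exponential in the attempt number, capped,
// then "full jitter" picks a random point below the ceiling to avoid
// all consumers retrying at the same moment.
function backoffSeconds(
  receiveCount: number,
  baseSeconds = 2,
  capSeconds = 300,
  rng: () => number = Math.random,
): number {
  const attempt = Math.max(1, Math.floor(receiveCount));
  // Bound the exponent so very large receive counts stay well-behaved.
  const ceiling = Math.min(capSeconds, baseSeconds * 2 ** Math.min(attempt - 1, 30));
  return Math.floor(rng() * ceiling);
}

// Retry budget check: stop retrying once the attempt count reaches the budget.
function shouldRetry(receiveCount: number, maxAttempts = 5): boolean {
  return receiveCount < maxAttempts;
}
```

On SQS, a failed handler could apply this value with ChangeMessageVisibility (which accepts 0 to 43200 seconds) instead of waiting out the fixed visibility timeout. On RabbitMQ, the same number could drive a per-message delay before republishing.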

Step 5: Replay safe messages after root-cause fix

// SQS: replay DLQ back to main (standard queues; FIFO queues also need MessageGroupId)
import { SQSClient, ReceiveMessageCommand, SendMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';

async function replayDLQ(client: SQSClient, dlqUrl: string, mainUrl: string) {
  while (true) {
    const { Messages } = await client.send(new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 5,
      // Message attributes are only returned when requested explicitly
      MessageAttributeNames: ['All'],
    }));
    if (!Messages || Messages.length === 0) break;

    for (const m of Messages) {
      await client.send(new SendMessageCommand({
        QueueUrl: mainUrl,
        MessageBody: m.Body!,
        MessageAttributes: m.MessageAttributes,
      }));
      // Delete from the DLQ only after the send succeeded
      await client.send(new DeleteMessageCommand({
        QueueUrl: dlqUrl,
        ReceiptHandle: m.ReceiptHandle!,
      }));
    }
  }
}

Only replay after confirming root cause is fixed. Verify a small sample first.
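Replay is only safe when processing is idempotent. A minimal sketch of a dedup guard keyed on a message id follows. The `ProcessedStore` interface and the in-memory implementation are hypothetical stand-ins for a durable store, such as a database table with a unique key.

```typescript
// Hypothetical store of already-processed message ids.
interface ProcessedStore {
  has(id: string): Promise<boolean>;
  add(id: string): Promise<void>;
}

// In-memory stand-in for illustration only; it forgets everything on restart.
class InMemoryStore implements ProcessedStore {
  private seen = new Set<string>();
  async has(id: string): Promise<boolean> { return this.seen.has(id); }
  async add(id: string): Promise<void> { this.seen.add(id); }
}

// Wraps a handler so a replayed message with a known id is skipped.
// Returns true if the handler ran, false if the message was a duplicate.
async function processOnce<T extends { id: string }>(
  msg: T,
  store: ProcessedStore,
  handler: (msg: T) => Promise<void>,
): Promise<boolean> {
  if (await store.has(msg.id)) return false;
  await handler(msg);
  // Record the id only after success so a failed attempt can be retried.
  await store.add(msg.id);
  return true;
}
```

The check-then-add sequence here is not atomic, so two consumers handling the same message at the same moment could both run the handler. A real store would typically close that gap with a unique constraint or a conditional write.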

Step 6: Quarantine poison messages

// Track receive count, send to permanent quarantine after N
// SQS returns attribute values as strings, so convert before comparing
if (Number(msg.attributes.ApproximateReceiveCount) > 3) {
  await sendToQuarantine(msg);
  return;
}

Quarantine queue does not auto-retry. Operators inspect and decide.

Step 7: Add DLQ depth alert

# CloudWatch alarm
alarm_name: dlq-depth-high
metric: ApproximateNumberOfMessagesVisible
queue: my-service-dlq
threshold: 10
period: 300
comparison: GreaterThanThreshold

Anything over 10 in DLQ for 5 minutes = page on-call.
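The pseudo-config above maps onto the input of CloudWatch's PutMetricAlarm API. The sketch below builds that input as a plain object, so it can be handed to whichever SDK or infrastructure tool you use. The function name, the derived alarm name, the `Maximum` statistic, and `TreatMissingData: 'notBreaching'` are assumptions chosen for illustration, not settings taken from the config above.

```typescript
// Subset of the PutMetricAlarm input fields needed for a DLQ depth alarm.
interface DlqAlarmInput {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Statistic: 'Maximum';
  Period: number; // seconds
  EvaluationPeriods: number;
  Threshold: number;
  ComparisonOperator: 'GreaterThanThreshold';
  TreatMissingData: 'notBreaching';
}

function dlqDepthAlarm(queueName: string, threshold = 10, periodSeconds = 300): DlqAlarmInput {
  return {
    AlarmName: `${queueName}-depth-high`,
    Namespace: 'AWS/SQS',
    MetricName: 'ApproximateNumberOfMessagesVisible',
    Dimensions: [{ Name: 'QueueName', Value: queueName }],
    Statistic: 'Maximum',
    Period: periodSeconds,
    EvaluationPeriods: 1,
    Threshold: threshold,
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'notBreaching',
  };
}

const alarm = dlqDepthAlarm('my-service-dlq');
```

The object deliberately leaves out AlarmActions. To page on-call, add the ARN of your own notification target there.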

Verify

  • DLQ depth returns to baseline (typically zero) after replay.
  • Consumer error rate over 24 hours stays under 0.1 percent.
  • Sample 10 successful messages and confirm parse path works for both old and new producer schemas.
  • Downstream latency p99 stays under timeout threshold.

Long-term prevention

  • Make DLQ alerts mandatory for every queue in production.
  • Consumer schemas use optional fields by default; producer deploys after consumer.
  • Standard retry budget across the org: 3 to 5 attempts, then DLQ.
  • Quarterly DLQ review: any DLQ over 0 needs investigation.
  • Idempotent processing as default; replays should be safe.

Common pitfalls

  • Increasing maxReceiveCount to “buy time” — it just delays seeing real failures.
  • Replaying DLQ without fixing root cause — same messages return immediately.
  • No idempotency keys, so replays double-process.
  • Ignoring DLQ until end of quarter; messages expire and are lost.

FAQ

Should I delete DLQ messages I do not understand? No. Move to a quarantine queue for inspection. Deletion loses signal.

How often should I drain DLQ? Ideally never — DLQ at zero is the goal. If it grows, fix the cause that day.

Can I auto-replay DLQ? Yes, with care: only for failure classes you have already root-caused (downstream outage that has cleared). Never for unknown failures.

Tags: #Backend #Troubleshooting #message-queue