Kafka Consumer Lag Keeps Growing Even After Scaling Consumers

You added more consumer pods. Lag is still going up. The bottleneck is almost never "not enough consumers" — it is partition count, poison messages, or commit-offset drift.

A topic is producing 50k messages per second. Your consumer group used to keep up. Then traffic doubled, lag started climbing, and you scaled the consumer Deployment from 8 pods to 24. Lag kept climbing anyway. Some pods now sit at 0% CPU. Kafka UI shows the group is “Stable” but the end-to-end latency from produce to commit has gone from 200 ms to 14 minutes and keeps growing.

The instinct to add consumers is right when the consumers are CPU-bound and the topic still has more partitions than consumers. It is the wrong instinct in almost every other case. This guide walks through the common bottlenecks and how to tell them apart.

Common causes

Ordered by hit rate, highest first.

1. More consumers than partitions

Within a consumer group, a partition is assigned to at most one consumer at a time. If the topic has 12 partitions and you have 24 consumers, 12 of them are idle. Scaling further does nothing.

How to spot it: kafka-consumer-groups.sh --describe --group orders-consumer lists each partition with its CURRENT-OFFSET, LAG, and CONSUMER-ID. Adding --members shows how many partitions each member holds. If several members hold zero partitions, partition count is the cap.
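The arithmetic behind the cap can be sketched in a few lines. The helper below is purely illustrative (the function name is invented here, not part of any Kafka client) and assumes a single topic with every consumer in the group subscribed to it:

```javascript
// Illustrative helper, not part of any Kafka client.
// Assumes one topic and one consumer group subscribed to it.
function consumerUtilization(partitions, consumers) {
  const active = Math.min(partitions, consumers);   // at most one consumer per partition
  const idle = Math.max(0, consumers - partitions); // the rest receive no assignment
  return { active, idle };
}

console.log(consumerUtilization(12, 24)); // { active: 12, idle: 12 }
```

Going from 24 to 48 pods in this example only raises the idle count; the active count stays at 12 until the topic gets more partitions.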

2. A poison message blocking a partition

One message in a partition fails to process. The consumer keeps retrying it forever (or for a long backoff), never committing past it. All later messages in that partition wait. The other partitions look fine — only one is stuck.

How to spot it: Lag on the group is concentrated on one or two partitions while others are at 0. Logs show the same offset being retried.

3. Commit happens after slow downstream write

The consumer reads quickly but processing each message does a synchronous write to a slow downstream (a Postgres insert, an external API, an embedding model call). End-to-end throughput equals downstream throughput, not Kafka throughput. CPU on consumer pods stays low.

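How to spot it: Consumer CPU under 30%, and downstream p99 latency well above the per-message time budget. With one message in flight per partition, that budget is roughly num_partitions / messages_per_second seconds. For example, 48 partitions at 100,000 messages per second leaves about 0.5 ms per message.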
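A rough ceiling makes the mismatch concrete. This is a back-of-the-envelope sketch, assuming each consumer handles one message at a time and every message costs exactly one synchronous downstream call; the function name and the numbers are made up for illustration:

```javascript
// Illustrative estimate only. Assumes strictly sequential processing per
// consumer and one synchronous downstream call per message.
function downstreamCeilingPerSecond(activeConsumers, downstreamLatencyMs) {
  return activeConsumers * (1000 / downstreamLatencyMs);
}

// 24 active consumers, 20 ms per downstream write:
console.log(downstreamCeilingPerSecond(24, 20)); // 1200
```

If the topic produces tens of thousands of messages per second, a ceiling in the low thousands explains growing lag on nearly idle CPUs.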

4. Rebalances thrashing the group

Every time you scale or a pod restarts, the group rebalances. Under the eager protocol used by the range and round-robin assignors, every member stops, gives up its partitions, and waits for a new assignment. With a short max.poll.interval.ms or long-running message handlers, rebalances trigger constantly, and the group spends more time rebalancing than consuming.

How to spot it: Look for repeated Attempt to heartbeat failed or Member ... sending LeaveGroup in consumer logs. kafka-consumer-groups.sh shows the group flickering between Stable and PreparingRebalance.

5. max.poll.records too high

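The consumer polls 5000 messages at once, then takes 90 seconds to process them. If that is longer than max.poll.interval.ms, the consumer is treated as stalled and dropped from the group. (In the Java client, heartbeats come from a background thread, so what counts is the gap between poll() calls, not missed heartbeats.) The group rebalances, and the work has to be redone. Lag oscillates wildly.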

How to spot it: Repeating pattern of lag drop then spike. Logs say Auto-offset-commit failed or This consumer instance is no longer part of the group.

6. Producer skew — all messages on one partition

Producer is using a key that hashes to a small set of partitions, or no key at all with a sticky partitioner under bursty load. One partition gets 80% of traffic. No matter how many consumers you have, that one partition is consumed by exactly one of them.

How to spot it: kafka-topics.sh --describe --topic orders plus per-partition produce metrics. If one partition’s produce rate is 10x the others, you have skew.
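One way to turn "10x the others" into a number is to compare the busiest partition with the median. The sketch below assumes you can already export a per-partition produce rate from your metrics system; the function name and sample rates are invented for illustration:

```javascript
// Illustrative check: busiest partition relative to the median partition.
// ratesByPartition holds messages/second, one entry per partition.
function skewRatio(ratesByPartition) {
  if (ratesByPartition.length === 0) return 0;
  const sorted = [...ratesByPartition].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  const max = sorted[sorted.length - 1];
  if (median === 0) return max === 0 ? 1 : Infinity; // all idle vs. one hot partition
  return max / median;
}

console.log(skewRatio([100, 110, 95, 1050])); // 10
```

A ratio near 1 means load is spread evenly. A ratio of 10 means a single consumer is being asked to do ten times the work of its peers.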

7. Compression mismatch between producer and consumer

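Producer sends large zstd-compressed batches. The consumer’s fetch.max.bytes and max.partition.fetch.bytes apply to the compressed batches as stored on the broker. If they are small relative to the producer’s batch size, each fetch returns little more than a single batch, and the consumer alternates between fetching, decompressing, and processing. Throughput collapses.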

How to spot it: Network throughput at the consumer is far below what the partitions are producing, and consumer CPU is dominated by decompression.

Shortest path to fix

Step 1: Measure where the lag actually lives

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group orders-consumer

You want the per-partition LAG column. If one partition has 99% of the lag, you have a poison message or producer skew. If lag is even across all partitions, you have a throughput problem.
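The "concentrated or even" question can be answered mechanically. The following is a rough sketch that parses the text printed by the command above; it assumes a header row containing PARTITION and LAG columns, and the sample output is invented for illustration (real layouts can differ between Kafka versions):

```javascript
// Sketch: summarize how lag is distributed across partitions, given the text
// output of `kafka-consumer-groups.sh --describe`.
function lagDistribution(describeOutput) {
  const lines = describeOutput.split('\n').map((l) => l.trim()).filter(Boolean);
  const headerIndex = lines.findIndex((l) => /\bPARTITION\b/.test(l) && /\bLAG\b/.test(l));
  if (headerIndex === -1) throw new Error('no header row with PARTITION and LAG found');
  const header = lines[headerIndex].split(/\s+/);
  const partitionCol = header.indexOf('PARTITION');
  const lagCol = header.indexOf('LAG');

  const rows = lines.slice(headerIndex + 1)
    .map((l) => l.split(/\s+/))
    .map((cols) => ({ partition: Number(cols[partitionCol]), lag: Number(cols[lagCol]) }))
    .filter((r) => Number.isFinite(r.partition) && Number.isFinite(r.lag)); // drops "-" rows

  const total = rows.reduce((sum, r) => sum + r.lag, 0);
  const worst = rows.reduce((a, b) => (b.lag > a.lag ? b : a), { partition: -1, lag: 0 });
  return {
    total,
    worstPartition: worst.partition,
    worstShare: total === 0 ? 0 : worst.lag / total, // 0..1
  };
}

// Invented sample output, for illustration only.
const sample = `
GROUP           TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG     CONSUMER-ID  HOST       CLIENT-ID
orders-consumer orders 0          1000            1010            10      c-1          /10.0.0.1  app
orders-consumer orders 1          1000            991000          990000  c-2          /10.0.0.2  app
`;
console.log(lagDistribution(sample).worstPartition); // 1
```

A worstShare close to 1 points at a poison message or producer skew on that partition. A worstShare close to 1 / partition count points at a general throughput problem.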

Step 2: Check partition count versus consumer count

kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic orders

If partitions < consumers, scaling consumers further is wasted. Increase partitions:

kafka-topics.sh --bootstrap-server kafka:9092 \
  --alter --topic orders --partitions 48

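Partition count can only go up, never down, and adding partitions changes which partition a given key maps to, so per-key ordering is not preserved across the change. Pick a number that gives you 2-4x headroom for scaling.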

Step 3: Handle poison messages with a dead-letter pattern

In your consumer, set a retry budget per message. After N failures, send the message to a DLQ topic and commit forward.

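// kafkajs-style sketch; runs inside eachMessage({ topic, partition, message }) with autoCommit off.
const MAX_ATTEMPTS = 3;
let lastError = null;
for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
  try {
    await processMessage(message);
    lastError = null;
    break;
  } catch (err) {
    // Count attempts in memory: headers on an already-written message never
    // change, so a header-based counter would reset on every redelivery.
    lastError = err;
  }
}
if (lastError) {
  // Retry budget spent: park the message on the DLQ so the partition can move on.
  await producer.send({
    topic: 'orders.dlq',
    messages: [{
      key: message.key,
      value: message.value,
      headers: { ...message.headers, attempts: String(MAX_ATTEMPTS), lastError: String(lastError.message) },
    }],
  });
}
// Offsets are strings in kafkajs, so do the +1 numerically.
await consumer.commitOffsets([{ topic, partition, offset: (BigInt(message.offset) + 1n).toString() }]);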

Never block a partition forever on a single bad message.
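To see what this does for the messages queued behind the bad one, here is a broker-free sketch of the same retry-budget idea. The handler and sendToDlq arguments are stand-ins (assumptions of this example) for your processing function and a producer send to the DLQ topic:

```javascript
// Kafka-free sketch: process an ordered list of messages from one partition,
// parking any message that exhausts its retry budget.
async function consumeWithDlq(messages, handler, sendToDlq, maxAttempts = 3) {
  let committed = null; // offset of the last message we moved past
  for (const message of messages) {
    let lastError = null;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        await handler(message);
        lastError = null;
        break;
      } catch (err) {
        lastError = err;
      }
    }
    if (lastError) await sendToDlq(message, lastError); // budget spent: park it
    committed = message.offset; // either way, the partition moves forward
  }
  return committed;
}
```

With three messages where the middle one always fails, the first and third are processed normally, the middle one lands on the DLQ after three attempts, and the returned offset is that of the third message.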

Step 4: Tune max.poll.records and max.poll.interval.ms together

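The relationship is: max.poll.records * worst_case_processing_time_ms < max.poll.interval.ms, with margin to spare. Use a high percentile rather than the average, because one slow batch is enough to trigger a rebalance. With the values below, a full batch of 500 overruns the 5-minute interval once records take more than 600 ms each.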

max.poll.records: 500
max.poll.interval.ms: 300000   # 5 minutes
session.timeout.ms: 45000
heartbeat.interval.ms: 3000

Smaller batches commit more often and survive slow handlers without getting kicked out.
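The inequality can be turned around to pick a batch size. A minimal sketch, assuming you have measured a worst-case (p99 or worse) per-record processing time yourself; the function name and the 0.5 safety factor are arbitrary choices made here:

```javascript
// Sketch: largest max.poll.records that keeps one batch well inside
// max.poll.interval.ms. safetyFactor is the fraction of the interval a full
// batch is allowed to use.
function maxSafePollRecords(maxPollIntervalMs, worstCaseMsPerRecord, safetyFactor = 0.5) {
  return Math.max(1, Math.floor((maxPollIntervalMs * safetyFactor) / worstCaseMsPerRecord));
}

console.log(maxSafePollRecords(300000, 200)); // 750
```

At 200 ms per record and a 5-minute interval, 750 records uses half the interval, so the 500 in the config above leaves comfortable room.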

Step 5: If downstream is the bottleneck, batch the writes

Instead of one DB insert per message, accumulate 200 messages and do one bulk insert. Commit after the bulk insert succeeds.

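// kafkajs-style sketch; db.bulkInsert and parse stand in for your own code.
await consumer.run({
  eachBatchAutoResolve: false,  // only offsets we resolve explicitly can be committed
  eachBatch: async ({ batch, resolveOffset, heartbeat, commitOffsetsIfNecessary }) => {
    for (let i = 0; i < batch.messages.length; i += 200) {
      const chunk = batch.messages.slice(i, i + 200);
      await db.bulkInsert(chunk.map(parse));          // one write per chunk of up to 200
      resolveOffset(chunk[chunk.length - 1].offset);  // mark the chunk done only after the write succeeds
      await commitOffsetsIfNecessary();               // commits resolved offsets, subject to autoCommit settings
      await heartbeat();                              // stay in the group during long batches
    }
  },
});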

This is often the single largest win.

Step 6: Fix producer skew with a better partition key

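If your key is userId and 0.1% of users generate 50% of events, your partitions will skew no matter what. Either pick a different key (event id, or a composite key that spreads load) or shard hot users explicitly. Sharding a user across several keys gives up ordering between that user's events, so only do it where per-user ordering is not required.

// isHotUser and randomShard are your own helpers; randomShard returns a small integer.
const key = isHotUser(userId) ? `${userId}:${randomShard()}` : userId;
await producer.send({ topic, messages: [{ key, value }] });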
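The snippet above leaves isHotUser and randomShard undefined. One possible shape for them is sketched below; the hot-user set and the shard count are placeholders invented for this example, and in practice both would come from your own traffic data:

```javascript
// Hypothetical helpers for the key-sharding snippet. HOT_USERS and
// SHARD_COUNT are illustrative values, not recommendations.
const HOT_USERS = new Set(['user-42']);
const SHARD_COUNT = 8;

function isHotUser(userId) {
  return HOT_USERS.has(userId);
}

function randomShard() {
  return Math.floor(Math.random() * SHARD_COUNT); // integer in [0, SHARD_COUNT)
}

function partitionKeyFor(userId) {
  // Hot users are spread over SHARD_COUNT keys; everyone else keeps one key.
  return isHotUser(userId) ? `${userId}:${randomShard()}` : userId;
}
```

Keeping SHARD_COUNT small limits how many partitions a hot user's events can land on, which matters if a downstream reader has to reassemble them.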

Step 7: Switch to cooperative-sticky assignor to cut rebalance pain

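The range and round-robin assignors stop the world on every rebalance: each member gives up all of its partitions and waits for a new assignment. Cooperative-sticky revokes only the partitions that have to move, so most assignments stay put.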

partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor

A scaling event no longer pauses the entire group.

When this is not on you

Broker-side throttling will cap a consumer group regardless of how you tune it. If the cluster has consumer_byte_rate quotas set per-client and you are hitting them, no amount of consumer-side tuning helps. Check kafka.server:type=Fetch and kafka.server:type=ClientQuotaManager JMX metrics, or ask whoever runs the cluster.

Cluster under-provisioning is also a real cause: brokers maxing out disk or network mean fetches are slow regardless of consumer count.

Easy to misdiagnose as

“We need more consumers.” This was true the first three times you scaled. After that, you have probably hit the partition cap or shifted the bottleneck downstream. Always check the per-partition lag distribution before scaling pods.

Another common one: blaming Kafka itself for “being slow.” Kafka brokers handle millions of messages per second on cheap hardware. If your throughput is in the tens of thousands and you are struggling, the bottleneck is almost certainly the consumer code or a downstream service.

Prevention

  • Always provision partitions for the largest realistic consumer fleet, plus 2-4x headroom. Adding partitions later breaks key-based ordering for in-flight data.
  • Wire a DLQ from day one. Poison messages will happen.
  • Monitor per-partition lag, not just group lag. The average hides the bug.
  • Use cooperative-sticky assignor by default in new consumer groups.
  • Treat consumer CPU and downstream-write latency as two separate signals; lag with low CPU usually means the limit is downstream, or that extra consumers are sitting idle past the partition cap.

FAQ

  • Can I reduce partition count to fix skew? No. Partition count can only be increased. You have to create a new topic and migrate.
  • Should every consumer commit synchronously? Sync commits are safer but slower. Async commits with periodic sync flushes are the standard pattern.
