Kafka Consumer Lag Keeps Growing After Scaling Consumers

You added more consumer pods and lag is still climbing. The bottleneck is almost never too few consumers. It is partition count, a poison message, a slow downstream write, or rebalance thrash.

Published: May 24, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

A topic is producing 50k messages per second. Your consumer group used to keep up. Then traffic doubled, lag started climbing, and you scaled the consumer Deployment from 8 pods to 24. Lag kept climbing anyway. Some pods now sit at 0% CPU. Kafka UI shows the group is Stable, but the end-to-end latency from produce to commit has gone from 200 ms to 14 minutes and keeps growing.

Fastest fix: run kafka-consumer-groups.sh --describe and read the per-partition LAG column first. If the lag is piled on one or two partitions, you have a poison message or producer skew (do not add pods). If lag is even but pod CPU is low, your downstream write is the limit (batch it). If you have more pods than partitions, adding pods does nothing. Adding consumers only helps when the existing consumers are CPU-bound and the group still has fewer consumers than partitions, which is the rare case.

This guide walks through the actual common bottlenecks, how to tell them apart from the metrics you already have, and the exact command or config that fixes each one. Examples use the bundled kafka-consumer-groups.sh / kafka-topics.sh CLI tools (Apache Kafka 4.3, released 22 May 2026, the current stable line as of June 2026) and a Node-style consumer loop, but the diagnosis applies to any client.

Which bucket are you in

Two metrics narrow it down in under a minute: the per-partition LAG distribution and consumer-pod CPU.

What you see	Most likely cause	Jump to
Lag on 1-2 partitions, rest near 0; same offset retried in logs	Poison message	Cause 2 / Step 3
Lag on 1-2 partitions, one partition’s produce rate far above the rest	Producer skew	Cause 6 / Step 6
Lag even across partitions, pod CPU under ~30%	Slow downstream write	Cause 3 / Step 5
Lag even, more pods than partitions, some pods idle	Out of partitions	Cause 1 / Step 2
Lag oscillates up and down; logs show `leaving group` / `rebalance`	Rebalance thrash or `max.poll.records` too high	Causes 4-5 / Steps 4 and 7
Consumer network throughput far below partition produce rate, CPU on decompression	Compression / fetch sizing	Cause 7
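The first four rows of the table can be sketched as a small triage function. This is only an illustration: the function name, the input shape, and every threshold (90% of lag on the top two partitions, 30% CPU) are assumptions, not values Kafka defines.

```javascript
// Rough triage of the table above. All thresholds are illustrative assumptions.
function classifyLag({ lagByPartition, consumerCount, avgCpuPercent }) {
  const lags = Object.values(lagByPartition);
  const total = lags.reduce((a, b) => a + b, 0);
  if (total === 0) return 'no-lag';

  // Share of the total lag held by the two worst partitions.
  const topTwo = [...lags].sort((a, b) => b - a).slice(0, 2);
  const topShare = (topTwo[0] + (topTwo[1] ?? 0)) / total;

  if (lags.length > 2 && topShare > 0.9) return 'poison-message-or-skew';
  if (consumerCount > lags.length) return 'out-of-partitions';
  if (avgCpuPercent < 30) return 'slow-downstream';
  return 'cpu-bound';
}

console.log(classifyLag({
  lagByPartition: { 0: 12, 1: 980000, 2: 7, 3: 15 },
  consumerCount: 4,
  avgCpuPercent: 20,
})); // poison-message-or-skew
```

Telling a poison message from producer skew still needs the logs or the per-partition produce rate, so the function deliberately returns one combined label for both.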

Common causes

Ordered by hit rate, highest first.

1. More consumers than partitions

Within a consumer group, a Kafka partition is consumed by at most one consumer at a time. If the topic has 12 partitions and you have 24 consumers, 12 of them are idle. Scaling further does nothing.

How to spot it: kafka-consumer-groups.sh --describe --group orders-consumer shows each partition with its CURRENT-OFFSET, LAG, and CONSUMER-ID. Adding --members lists every member with its #PARTITIONS count. If several members own 0 partitions, partition count is the cap.

2. A poison message blocking a partition

One message in a partition fails to process. The consumer retries it forever (or with a long backoff) and never commits past it, so every later message in that partition waits. The other partitions look fine; only one is stuck.

How to spot it: Lag on the group is concentrated on one or two partitions while others are at 0. Logs show the same offset being retried.
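If you collect offsets programmatically, a blocked partition shows up as a committed offset that does not move while lag remains. A minimal sketch, assuming two snapshots taken a minute or so apart; the snapshot shape and the function name are hypothetical:

```javascript
// prev and curr are snapshots shaped like
// { [partition]: { committed: number, logEnd: number } } (hypothetical shape).
function findStuckPartitions(prev, curr) {
  const stuck = [];
  for (const [partition, now] of Object.entries(curr)) {
    const before = prev[partition];
    if (!before) continue; // partition not present in the earlier snapshot
    const committedMoved = now.committed > before.committed;
    const hasLag = now.logEnd > now.committed;
    // Lag exists but the committed offset did not advance: likely blocked.
    if (hasLag && !committedMoved) stuck.push(Number(partition));
  }
  return stuck;
}

const earlier = { 0: { committed: 100, logEnd: 150 }, 1: { committed: 200, logEnd: 400 } };
const later = { 0: { committed: 180, logEnd: 190 }, 1: { committed: 200, logEnd: 900 } };
console.log(findStuckPartitions(earlier, later)); // [ 1 ]
```

A partition flagged here is a candidate, not a verdict: confirm against the logs that the same offset is being retried.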

3. Commit happens after slow downstream write

The consumer reads quickly but processing each message does a synchronous write to a slow downstream (a Postgres insert, an external API, an embedding model call). End-to-end throughput equals downstream throughput, not Kafka throughput. CPU on consumer pods stays low.

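How to spot it: Consumer CPU under 30%, and downstream p99 latency well above the per-message budget you would need. For a handler that processes one message at a time per partition, that budget is num_partitions / messages_per_second seconds; at 100k messages per second over 48 partitions it is about 0.48 ms per message.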
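One way to work out the per-message budget, assuming each partition's consumer handles one message at a time: the time available per message is the inverse of the per-partition message rate. The function names are illustrative.

```javascript
// Time one partition's consumer can spend per message without falling behind,
// assuming messages are spread evenly and handled strictly one at a time.
function perMessageBudgetMs(messagesPerSecond, partitionCount) {
  const perPartitionRate = messagesPerSecond / partitionCount;
  return 1000 / perPartitionRate;
}

function keepsUp(messagesPerSecond, partitionCount, handlerMs) {
  return handlerMs <= perMessageBudgetMs(messagesPerSecond, partitionCount);
}

// 100k msg/s over 48 partitions leaves roughly half a millisecond per message.
console.log(perMessageBudgetMs(100000, 48).toFixed(2)); // 0.48
// A 5 ms synchronous insert per message cannot keep up at that rate.
console.log(keepsUp(100000, 48, 5)); // false
```

When the handler time is far above the budget, batching or concurrent writes inside the handler are the levers; more pods are not.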

4. Rebalances thrashing the group

Every time you scale or a pod restarts, an eager rebalance pauses the entire group, reassigns partitions, and resumes. With a short max.poll.interval.ms or long-running message handlers, rebalances trigger constantly, and the group spends more time rebalancing than consuming.

How to spot it: Look for repeated Attempt to heartbeat failed or Member ... sending LeaveGroup in consumer logs. kafka-consumer-groups.sh shows the group flickering between Stable and PreparingRebalance.
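To put a number on the thrash, you can count how many log lines in a time window match the rebalance signals named above. A minimal sketch; the pattern list is taken from the strings in this guide, and the exact wording in your client's logs may differ:

```javascript
// Substrings this guide associates with rebalances; adjust to your client's wording.
const REBALANCE_PATTERNS = [
  /Attempt to heartbeat failed/,
  /sending LeaveGroup/i,
  /leaving group/i,
  /PreparingRebalance/,
];

// logLines: an array of log lines covering the window you care about.
function countRebalanceSignals(logLines) {
  return logLines.filter((line) =>
    REBALANCE_PATTERNS.some((pattern) => pattern.test(line))
  ).length;
}
```

A count that keeps climbing across consecutive windows with no deploys or scaling events in between points at thrash rather than a one-off rebalance.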

5. `max.poll.records` too high

The consumer polls 5000 messages at once, then takes longer than max.poll.interval.ms (default 5 minutes) to process them. Because no poll() happens in that window, the client sends a LeaveGroup, the group rebalances, and the work has to be redone. Lag oscillates wildly.

How to spot it: A repeating pattern of lag dropping and then spiking. Logs contain a warning along the lines of Maximum poll interval (300000ms) exceeded by 1532ms (adjust max.poll.interval.ms for long-running message processing): leaving group (the exact wording varies by client library and version), followed by Auto-offset-commit failed or This consumer instance is no longer part of the group on the next commit.

6. Producer skew — all messages on one partition

The producer is using a key that hashes to a small set of partitions, or no key at all with a sticky partitioner under bursty load. One partition gets 80% of traffic. No matter how many consumers you have, that one partition is consumed by exactly one of them.

How to spot it: kafka-topics.sh --describe --topic orders plus per-partition produce metrics. If one partition’s produce rate is 10x the others, you have skew.
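Given per-partition produce rates from your metrics system, the skew check is a comparison of the busiest partition against the median. A small sketch; the input shape and function name are assumptions:

```javascript
// ratesByPartition: { [partition]: messagesPerSecond } from your own metrics.
function skewRatio(ratesByPartition) {
  const rates = Object.values(ratesByPartition).sort((a, b) => a - b);
  if (rates.length === 0) return 1;
  const mid = Math.floor(rates.length / 2);
  const median = rates.length % 2 ? rates[mid] : (rates[mid - 1] + rates[mid]) / 2;
  const max = rates[rates.length - 1];
  return median === 0 ? Infinity : max / median;
}

// One partition far above the rest: the ratio lands near 10.
console.log(skewRatio({ 0: 100, 1: 120, 2: 110, 3: 1100 }).toFixed(1)); // 9.6
```

The median is used instead of the mean so that the hot partition does not drag the baseline up and hide itself.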

7. Compression mismatch between producer and consumer

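The producer sends zstd-compressed batches. The consumer's fetch.max.bytes or max.partition.fetch.bytes is small relative to those batches (fetch limits are measured on compressed bytes), so each fetch returns little more than a single batch: the consumer fetches a small amount, decompresses, processes, and fetches again. Throughput collapses.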

How to spot it: Network throughput at the consumer is far below what the partitions are producing, and consumer CPU is dominated by decompression.

Shortest path to fix

Step 1: Measure where the lag actually lives

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group orders-consumer

You want the per-partition LAG column. If one partition has 99% of the lag, you have a poison message or producer skew. If lag is even across all partitions, you have a throughput problem.
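If you would rather read the distribution in code than by eye, the output can be parsed on whitespace. This sketch assumes the column order GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG, so it takes TOPIC, PARTITION, and LAG from fields 2, 3, and 6; the function name is illustrative.

```javascript
// Turns the text output of `kafka-consumer-groups.sh --describe` into
// { "topic-partition": lag }. Header, blank, and "-" placeholder rows are skipped.
function parseLag(describeOutput) {
  const lagByPartition = {};
  for (const line of describeOutput.split('\n')) {
    const fields = line.trim().split(/\s+/);
    if (fields.length < 6) continue;
    const [, topic, partition, , , lag] = fields;
    if (!/^\d+$/.test(partition) || !/^\d+$/.test(lag)) continue;
    lagByPartition[`${topic}-${partition}`] = Number(lag);
  }
  return lagByPartition;
}
```

Sorting the resulting values in descending order shows at a glance whether one partition holds most of the backlog.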

Step 2: Check partition count versus consumer count

kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic orders

If partitions < consumers, scaling consumers further is wasted. Increase partitions:

kafka-topics.sh --bootstrap-server kafka:9092 \
  --alter --topic orders --partitions 48

Partition count can only go up, never down. Pick a number that gives you 2-4x headroom for scaling.

Step 3: Handle poison messages with a dead-letter pattern

In your consumer, set a retry budget per message. After N failures, send the message to a DLQ topic and commit forward.

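// Runs inside a per-message handler that receives { topic, partition, message }.
// Assumes a KafkaJS-style client with auto-commit off. `attemptsByOffset` is a
// hypothetical in-memory Map created once, outside the handler.
const retryKey = `${topic}:${partition}:${message.offset}`;
try {
  await processMessage(message);
} catch (err) {
  // Records are immutable, so the retry count is tracked in memory, not in headers.
  const attempts = (attemptsByOffset.get(retryKey) ?? 0) + 1;
  if (attempts < 3) {
    attemptsByOffset.set(retryKey, attempts);
    throw err;  // offset stays uncommitted, so the message is delivered again
  }
  await producer.send({
    topic: 'orders.dlq',
    messages: [{
      key: message.key,
      value: message.value,
      headers: { ...message.headers, attempts: String(attempts), lastError: String(err.message) },
    }],
  });
}
attemptsByOffset.delete(retryKey);
// Offsets arrive as strings; commit the offset of the next message to read.
await consumer.commitOffsets([
  { topic, partition, offset: (BigInt(message.offset) + 1n).toString() },
]);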

Never block a partition forever on a single bad message.

Step 4: Tune `max.poll.records` and `max.poll.interval.ms` together

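The relationship is: max.poll.records * avg_processing_time_ms < max.poll.interval.ms. Size it on your slow-path (p99) handler time rather than the average, because a single slow batch is enough to get the consumer kicked out of the group.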

max.poll.records: 500
max.poll.interval.ms: 300000   # 5 minutes
session.timeout.ms: 45000
heartbeat.interval.ms: 3000

Smaller batches commit more often and survive slow handlers without getting kicked out.
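The inequality can be turned around to give an upper bound for max.poll.records. A minimal sketch; the function name and the 50% safety factor are assumptions, not Kafka defaults:

```javascript
// Largest max.poll.records that keeps a full batch inside max.poll.interval.ms,
// using a worst-case handler time and leaving part of the interval as margin.
function maxSafePollRecords(maxPollIntervalMs, worstCaseHandlerMs, safetyFactor = 0.5) {
  if (worstCaseHandlerMs <= 0) {
    throw new RangeError('worstCaseHandlerMs must be positive');
  }
  const usableMs = maxPollIntervalMs * safetyFactor;
  return Math.max(1, Math.floor(usableMs / worstCaseHandlerMs));
}

// With the default 5-minute interval and an 800 ms slow-path handler:
console.log(maxSafePollRecords(300000, 800)); // 187
```

For comparison, 5000 records at 800 ms each would take over an hour, far past the 5-minute interval, which is the failure described in cause 5.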

Step 5: If downstream is the bottleneck, batch the writes

Instead of one DB insert per message, accumulate 200 messages and do one bulk insert. Commit after the bulk insert succeeds.

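// A sketch assuming KafkaJS's eachBatch API; db.bulkInsert and parse are your own code.
await consumer.run({
  autoCommit: false,
  eachBatch: async ({ batch, heartbeat }) => {
    for (let i = 0; i < batch.messages.length; i += 200) {
      const chunk = batch.messages.slice(i, i + 200);  // last chunk may be smaller
      await db.bulkInsert(chunk.map(parse));
      // Commit only after the bulk insert succeeds; offsets are strings.
      const next = (BigInt(chunk[chunk.length - 1].offset) + 1n).toString();
      await consumer.commitOffsets([{ topic: batch.topic, partition: batch.partition, offset: next }]);
      await heartbeat();  // keep the session alive during long batches
    }
  },
});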

This is often the single largest win.
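A size-only trigger can leave a partial batch waiting indefinitely on a quiet partition, so batching code usually also flushes on age. A minimal sketch of such a buffer; the class name, defaults, and injectable clock are all illustrative choices:

```javascript
// Buffers items and reports when to flush: on size, or when the oldest item is too old.
class BatchBuffer {
  constructor({ maxSize = 200, maxAgeMs = 1000, now = Date.now } = {}) {
    this.maxSize = maxSize;
    this.maxAgeMs = maxAgeMs;
    this.now = now;          // injectable clock, handy for tests
    this.items = [];
    this.firstAt = null;     // timestamp of the oldest buffered item
  }

  add(item) {
    if (this.items.length === 0) this.firstAt = this.now();
    this.items.push(item);
  }

  shouldFlush() {
    if (this.items.length === 0) return false;
    const tooBig = this.items.length >= this.maxSize;
    const tooOld = this.now() - this.firstAt >= this.maxAgeMs;
    return tooBig || tooOld;
  }

  // Hands back the buffered items and resets the buffer.
  drain() {
    const out = this.items;
    this.items = [];
    this.firstAt = null;
    return out;
  }
}
```

The caller checks shouldFlush() after each add and on a timer, bulk-inserts the result of drain(), and commits offsets only once that insert succeeds.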

Step 6: Fix producer skew with a better partition key

If your key is userId and 0.1% of users generate 50% of events, your partitions will skew no matter what. Either pick a different key (event id, or a composite key that spreads load) or shard hot users explicitly.

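// isHotUser and randomShard are placeholders for your own logic.
// Salting the key spreads a hot user across partitions but gives up per-user ordering.
const key = isHotUser(userId) ? `${userId}:${randomShard()}` : userId;
await producer.send({ topic, messages: [{ key, value }] });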
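The key-salting idea can be written as a self-contained helper. This is a sketch under assumptions: the hot-user set, the shard count, and the function name are hypothetical, and the random source is injectable so the behaviour can be tested.

```javascript
// Returns the plain userId for normal users and "userId:shard" for hot users,
// so a hot user's events spread over up to `shardCount` different keys.
function partitionKey(userId, hotUsers, shardCount, random = Math.random) {
  if (!hotUsers.has(userId)) return userId;
  const shard = Math.floor(random() * shardCount);
  return `${userId}:${shard}`;
}

const hotUsers = new Set(['user-9']);
console.log(partitionKey('user-1', hotUsers, 8));            // user-1
console.log(partitionKey('user-9', hotUsers, 8, () => 0.5)); // user-9:4
```

Keeping the shard count small limits how many partitions a consumer of that user's events has to merge, at the cost of less spreading.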

Step 7: Cut rebalance pain with the right protocol

The eager assignors (range, which the classic protocol uses by default, and round-robin) stop the world on every rebalance: every member revokes all partitions, the group syncs, then reassigns. A single pod restart pauses the whole group.

There are two ways to fix this, and which one you pick depends on your Kafka version.

If you are on Apache Kafka 4.0 or newer (4.0 released March 2025; 4.3 is the current stable line as of June 2026): switch the consumer to the new rebalance protocol from KIP-848, which went generally available (GA) in 4.0. It moves assignment to the broker-side group coordinator and is fully incremental, so unaffected members keep processing while a rebalance happens in the background. Confluent and Instaclustr have measured large groups rebalancing roughly an order of magnitude faster (for example, a 10-consumer group absorbing 900 new partitions in about 5 seconds instead of 103).

group.protocol: consumer   # new KIP-848 protocol; default is still "classic"

Important: when group.protocol=consumer is set, partition.assignment.strategy is no longer usable. Assignment is server-side, controlled by the broker config group.consumer.assignors (default uniform, which spreads partitions as evenly as possible; the alternative is range). Client heartbeat and session timeout also move server-side under group.consumer.heartbeat.interval.ms and group.consumer.session.timeout.ms, so the client-side heartbeat.interval.ms / session.timeout.ms shown in Step 4 only apply on the classic protocol. Both broker and clients must support the new protocol, so roll this out after the cluster is on 4.0+. See the official Kafka consumer rebalance protocol docs for the full config list.

If you are stuck on the classic protocol (Kafka 3.x, or a client that does not yet support group.protocol=consumer): use the cooperative-sticky assignor, which keeps most assignments in place across a rebalance instead of revoking everything.

partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor

Either way, a scaling event no longer pauses the entire group.
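The choice between the two options can be captured in a small helper that returns the consumer settings to apply. The function name and its two inputs are hypothetical; the config keys and values are the ones shown in this step.

```javascript
// Picks rebalance-related consumer settings based on what the cluster and client support.
function rebalanceConfig({ brokerMajor, clientSupportsConsumerProtocol }) {
  if (brokerMajor >= 4 && clientSupportsConsumerProtocol) {
    // KIP-848 protocol: assignment is server-side, so no assignor is set here.
    return { 'group.protocol': 'consumer' };
  }
  // Classic protocol: fall back to incremental cooperative rebalancing.
  return {
    'partition.assignment.strategy':
      'org.apache.kafka.clients.consumer.CooperativeStickyAssignor',
  };
}

console.log(rebalanceConfig({ brokerMajor: 4, clientSupportsConsumerProtocol: true }));
console.log(rebalanceConfig({ brokerMajor: 3, clientSupportsConsumerProtocol: false }));
```

The two settings are returned separately on purpose, since partition.assignment.strategy is not usable once group.protocol=consumer is set.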

How to confirm it’s fixed

Do not trust “Stable” in the UI alone; a group can be Stable and still falling behind. Confirm with three checks:

Lag is shrinking, not just stable. Re-run kafka-consumer-groups.sh --describe --group orders-consumer a minute apart. The summed LAG should trend down. If it is flat at a high number, you have only matched intake to output, not drained the backlog. Over-provision briefly to catch up.
No single partition is the long pole. The LAG should be roughly even across partitions. A lingering hot partition means the poison-message or skew fix has not fully landed.
No fresh rebalances. Tail the consumer logs for a few minutes. You should see no new leaving group, Attempt to heartbeat failed, or PreparingRebalance lines. The group state should hold at Stable across the whole window.
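For the first check, two total-lag readings are enough to estimate how fast the backlog is draining and when it will reach zero. A small sketch with illustrative names and numbers:

```javascript
// lagBefore and lagAfter are summed LAG readings taken secondsBetween apart.
function drainEstimate(lagBefore, lagAfter, secondsBetween) {
  const ratePerSecond = (lagBefore - lagAfter) / secondsBetween;
  if (ratePerSecond <= 0) {
    // Flat or growing lag: the backlog will never drain at the current rate.
    return { ratePerSecond, etaSeconds: Infinity };
  }
  return { ratePerSecond, etaSeconds: lagAfter / ratePerSecond };
}

console.log(drainEstimate(42000000, 41400000, 60));
// { ratePerSecond: 10000, etaSeconds: 4140 }
```

The estimate assumes the produce rate stays constant, so treat the ETA as a rough guide and re-measure as traffic changes.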

A quick one-liner to total the backlog from the LAG column. The --describe output columns are GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID, so LAG is field 6:

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group orders-consumer \
  | awk 'NR>1 && $6 ~ /^[0-9]+$/ {sum += $6} END {print "total lag:", sum + 0}'

The $6 ~ /^[0-9]+$/ guard skips the header and any - placeholder rows (a partition with no committed offset prints - in the LAG column).

When this is not on you

Broker-side throttling will cap a consumer group regardless of how you tune it. If the cluster has consumer_byte_rate quotas set per client and you are hitting them, no amount of consumer-side tuning helps. Check the kafka.server:type=Fetch and kafka.server:type=ClientQuotaManager JMX metrics on the brokers, or the consumer's own fetch-throttle-time-avg metric (a non-zero value means the broker is throttling you), or ask whoever runs the cluster.

Cluster under-provisioning is also a real cause: brokers maxing out disk or network mean fetches are slow regardless of consumer count.

Easy to misdiagnose as

“We need more consumers.” This was true the first three times you scaled. After that, you have probably hit the partition cap or shifted the bottleneck downstream. Always check the per-partition lag distribution before scaling pods.

Another common one: blaming Kafka itself for “being slow.” Kafka brokers handle millions of messages per second on cheap hardware. If your throughput is in the tens of thousands and you are struggling, the bottleneck is almost certainly the consumer code or a downstream service.

Prevention

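Always provision partitions for the largest realistic consumer fleet, plus 2-4x headroom. Adding partitions later changes which partition a key maps to, so new messages for a key can land on a different partition while older ones are still unread in the old one. That breaks key-based ordering for in-flight data.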
Wire a DLQ from day one. Poison messages will happen.
Monitor per-partition lag, not just group lag. The average hides the bug.
On Kafka 4.0+, default new consumer groups to the KIP-848 protocol (group.protocol=consumer); on 3.x, default to the cooperative-sticky assignor. The classic protocol is on the deprecation track (KIP-1274), so moving now avoids a forced migration later.
Treat consumer CPU and downstream-write latency as two separate signals; lag without high CPU means downstream is the limit.
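The ordering caveat in the first item follows from how keyed partitioning works: the partition is a hash of the key modulo the partition count, so changing the count remaps keys. The sketch below shows the effect with FNV-1a as a stand-in hash; Kafka's default partitioner uses murmur2, so the actual partition numbers will differ, but the remapping behaves the same way.

```javascript
// FNV-1a, used here only as a stand-in hash (Kafka's default partitioner uses murmur2).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return h;
}

function partitionFor(key, partitionCount) {
  return fnv1a(key) % partitionCount;
}

// Keys whose partition changes when the topic grows from oldCount to newCount.
function movedKeys(keys, oldCount, newCount) {
  return keys.filter((k) => partitionFor(k, oldCount) !== partitionFor(k, newCount));
}

const keys = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const moved = movedKeys(keys, 12, 48);
console.log(`${moved.length} of ${keys.length} keys change partition going from 12 to 48`);
```

Any key that moves can have unread messages in its old partition and new messages in its new one at the same time, which is why the partition count is best settled before keyed traffic starts.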

FAQ

Can I reduce partition count to fix skew?

No. Partition count can only be increased. You have to create a new topic with the partition count and key you want, dual-write or replay into it, and migrate the consumer group.

Should every consumer commit synchronously?

Sync commits are safer but slower. The standard pattern is async commits during normal processing, with a synchronous commit on shutdown and rebalance to avoid reprocessing.

Does adding partitions instantly clear existing lag?

No. New partitions only take new messages. The backlog already sitting in the old partitions still has to be drained by whatever consumers own those partitions. Adding partitions fixes future throughput, not the current backlog.

Will the KIP-848 protocol fix lag by itself?

No. It makes rebalances fast and non-blocking, so it removes lag that was caused by rebalance thrash (causes 4 and 5). It does nothing for poison messages, a slow downstream, or producer skew.

My group shows `Stable` but lag keeps growing. Is Kafka lying?

No. `Stable` only means no rebalance is in progress. A perfectly stable group can still fall behind because a downstream write or a stuck partition is the bottleneck. Always read the per-partition `LAG`, not just the group state.

One partition is at 0 lag but assigned to no consumer. Why?

That partition has no new data, or you have more consumers than partitions so some consumers got nothing. Check `kafka-consumer-groups.sh --describe`; an empty `CONSUMER-ID` on a partition with lag is the symptom that matters.

Tags: #Backend #Troubleshooting #infra #kafka #messaging #consumer-lag #streaming
