When partitions are redistributed across consumers in a consumer group, message processing pauses until the new assignment completes. ApsaraMQ for Kafka records each rebalance event -- including start time, duration, trigger, and consumer changes -- so you can identify the root cause and reduce unnecessary rebalances.
How rebalancing works
A consumer group has a group coordinator that tracks two things: the consumers in the group and the partitions of the subscribed topics. When either changes, the coordinator triggers a rebalance to redistribute partitions across consumers.
Common triggers:
Subscription changes: A consumer subscribes to or unsubscribes from a topic within the group.
Partition count changes: The number of partitions in a topic changes.
Consumer joins or leaves: A new consumer joins the group, or an existing consumer is removed -- either intentionally (scale-down, shutdown) or due to a failure such as:
A heartbeat timeout caused by slow message processing.
No poll() call within the max.poll.interval.ms window (default: 5 minutes), which causes the broker to remove the consumer from the group.
Too many consumers in the consumer group: to save topic and partition resources, some consumers need to be shut down, which triggers a rebalance.
Insufficient consumers in the consumer group, which causes message delays in topics and partitions; adding consumers to prevent the delays also triggers a rebalance.
During a rebalance, partition consumption pauses until the new assignment is finalized. Frequent or long-running rebalances directly reduce message processing throughput.
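The liveness check behind the poll() trigger above can be illustrated with a minimal sketch. This is pure Python with no Kafka client; the function name and logic are illustrative assumptions, not broker source code.

```python
def exceeded_max_poll_interval(last_poll_ms: int, now_ms: int,
                               max_poll_interval_ms: int = 300_000) -> bool:
    """Conceptual version of the check that removes a consumer.

    If the gap between consecutive poll() calls exceeds
    max.poll.interval.ms (default 300,000 ms = 5 minutes), the broker
    removes the consumer from the group and triggers a rebalance.
    """
    return (now_ms - last_poll_ms) > max_poll_interval_ms

# A consumer that last polled 6 minutes ago is removed;
# one that polled 1 minute ago is kept.
removed = exceeded_max_poll_interval(last_poll_ms=0, now_ms=360_000)
kept = exceeded_max_poll_interval(last_poll_ms=0, now_ms=60_000)
```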
Prerequisites
Before you begin, make sure that you have:
An ApsaraMQ for Kafka instance
A consumer group with at least one consumer
View rebalance details in the console
Log on to the ApsaraMQ for Kafka console.
In the top navigation bar, select the region of your instance. On the Instances page, click the instance name.
In the left-side navigation pane, click Groups. Find the target group and click its name.
On the Group Details page, click the Rebalance Details tab.
The tab displays each rebalance event with the following details:
| Field | Description |
|---|---|
| Start time | When the rebalance started |
| Duration | How long the rebalance took to complete |
| Trigger | The event that caused the rebalance (subscription change, consumer join/leave, partition change) |
| Total rebalance count | The cumulative number of rebalances for this consumer group |
| Consumers involved | The consumers that were added or removed during the rebalance |
Troubleshoot frequent rebalances
Frequent rebalances typically mean that consumers are being removed from the group faster than expected. The root cause depends on your client version.
Identify the cause
| Client version | Common cause |
|---|---|
| Before 0.10.2 | Heartbeat messages are sent through the poll() interface -- no dedicated heartbeat thread exists. If message processing blocks the poll loop, the heartbeat times out and the broker removes the consumer. |
| 0.10.2 and later | A dedicated heartbeat thread exists, but the consumer is still removed if no poll() call occurs within max.poll.interval.ms (default: 5 minutes). Long-running message processing is the most common trigger. |
Tune consumer parameters
Four parameters control how the broker detects unresponsive consumers. Each involves a trade-off between tolerance for slow processing and speed of failure detection.
| Parameter | What it controls | Default | Trade-off |
|---|---|---|---|
| session.timeout.ms | How long the broker waits for a heartbeat before removing the consumer. | 10 s (0.10.2 and later) | Higher values tolerate longer pauses but delay detection of crashed consumers. |
| heartbeat.interval.ms | How often the consumer sends heartbeats to the coordinator. Must be lower than session.timeout.ms (typically one-third). | 3 s | Lower values detect rebalances faster but increase network overhead. |
| max.poll.interval.ms | Maximum allowed time between consecutive poll() calls. If exceeded, the consumer is removed. | 5 min | Higher values tolerate slow batch processing but delay detection of stuck consumers. |
| max.poll.records | Maximum number of records returned per poll() call. Fewer records per batch means faster processing and more frequent poll() calls. | Varies by client | Lower values reduce the risk of exceeding max.poll.interval.ms but increase poll overhead. |
Set parameter values
session.timeout.ms
Clients before 0.10.2: Set this to a value larger than the time to process a single batch, but no greater than 30 seconds. A value of 25 seconds works well for most workloads.
Clients 0.10.2 and later: Keep the default value of 10 seconds.
heartbeat.interval.ms
Set this to one-third of session.timeout.ms. For example, if session.timeout.ms is 10 seconds, set heartbeat.interval.ms to 3 seconds.
max.poll.records
Set this well below the number of records your consumers can process within max.poll.interval.ms:
max.poll.records << records_per_second_per_thread x thread_count x (max.poll.interval.ms / 1000)
max.poll.interval.ms
Set this above the time needed to process a full batch at peak load:
max.poll.interval.ms > (max.poll.records / (records_per_second_per_thread x thread_count)) x 1000
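The two sizing rules above can be worked through with concrete numbers. The throughput figures (50 records per second per thread, 4 threads) are assumptions for illustration; measure your own consumers before choosing values.

```python
# Assumed measured throughput -- replace with your own numbers.
records_per_second_per_thread = 50
thread_count = 4
max_poll_interval_ms = 300_000  # 5 minutes (default)

# Total records the consumer can process within one poll interval:
# 50 * 4 * 300 = 60,000 records.
capacity = (records_per_second_per_thread * thread_count
            * (max_poll_interval_ms / 1000))

# Rule 1: choose max.poll.records well below that capacity
# (here, 1% of it, an illustrative margin).
max_poll_records = int(capacity * 0.01)  # 600

# Rule 2: verify a full batch fits comfortably in the interval.
# 600 / (50 * 4) * 1000 = 3,000 ms, far below 300,000 ms.
batch_time_ms = (max_poll_records
                 / (records_per_second_per_thread * thread_count) * 1000)
assert batch_time_ms < max_poll_interval_ms
```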
Additional recommendations
Offload processing from the poll loop. Move message processing to a separate thread pool so that poll() returns quickly.
Limit topic subscriptions per group. Subscribe each consumer group to no more than five topics. For best performance, assign one topic per group.
Upgrade to client version 0.10.2 or later. Older clients lack a dedicated heartbeat thread, which makes them far more susceptible to rebalances during slow processing.
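The first recommendation -- moving processing off the poll loop so poll() returns quickly -- can be sketched with a thread pool. This is pure Python with a simulated record source; handle_record and the batch lists are stand-ins for real Kafka records, not client API calls.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_record(record: str) -> str:
    # Stand-in for slow per-record work (deserialize, call a DB, ...).
    return record.upper()

def poll_loop(batches, pool: ThreadPoolExecutor):
    """Hand each polled batch to the pool so the loop itself never
    blocks on processing -- each loop iteration plays the role of a
    poll() call, which keeps the consumer inside max.poll.interval.ms."""
    futures = []
    for batch in batches:
        for record in batch:
            futures.append(pool.submit(handle_record, record))
    # Drain outstanding work before committing offsets / shutting down.
    return [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = poll_loop([["a", "b"], ["c"]], pool)
```

Note that in a real consumer you must still coordinate offset commits with the pool: commit only offsets whose records have finished processing, or you risk losing messages on a crash.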