ApsaraMQ for Kafka:View rebalance details

Last Updated: Mar 11, 2026

When partitions are redistributed across consumers in a consumer group, message processing pauses until the new assignment completes. ApsaraMQ for Kafka records each rebalance event -- including start time, duration, trigger, and consumer changes -- so you can identify the root cause and reduce unnecessary rebalances.

How rebalancing works

A consumer group has a group coordinator that tracks two things: the consumers in the group and the partitions of the subscribed topics. When either changes, the coordinator triggers a rebalance to redistribute partitions across consumers.

Common triggers:

  • Subscription changes: A consumer subscribes to or unsubscribes from a topic within the group.

  • Partition count changes: The number of partitions in a topic changes.

  • Consumer joins or leaves: A new consumer joins the group, or an existing consumer is removed -- either intentionally (scale-down, shutdown) or due to a failure such as:

    • A heartbeat timeout caused by slow message processing.

    • No poll() call within the max.poll.interval.ms window (default: 5 minutes), which causes the broker to remove the consumer from the group.

    • Too many consumers in the consumer group: to save topic and partition resources, some consumers are shut down, which triggers a rebalance.

    • Too few consumers in the consumer group, which causes message delays in topics and partitions: adding consumers to clear the backlog also triggers a rebalance.

During a rebalance, partition consumption pauses until the new assignment is finalized. Frequent or long-running rebalances directly reduce message processing throughput.
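To see why slow processing gets a consumer removed, it helps to compare the gap between poll() calls with max.poll.interval.ms. The sketch below uses hypothetical batch sizes and per-record processing times, not values from any real workload:

```python
# Illustrative check: does the gap between poll() calls exceed
# max.poll.interval.ms? (The batch sizes and per-record times below
# are hypothetical examples.)

MAX_POLL_INTERVAL_MS = 300_000  # default: 5 minutes

def poll_gap_ms(records_per_batch: int, ms_per_record: float) -> float:
    """Time spent processing one batch before the next poll() call."""
    return records_per_batch * ms_per_record

def would_be_removed(records_per_batch: int, ms_per_record: float) -> bool:
    """True if the broker would remove this consumer from the group."""
    return poll_gap_ms(records_per_batch, ms_per_record) > MAX_POLL_INTERVAL_MS

# 500 records at 100 ms each -> 50 s between polls: safe.
print(would_be_removed(500, 100))  # False
# 500 records at 700 ms each -> 350 s between polls: removed.
print(would_be_removed(500, 700))  # True
```

The same arithmetic underlies the parameter-tuning formulas later in this topic: either shrink the batch or raise the interval so that the gap stays below the limit.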

Prerequisites

Before you begin, make sure that you have:

  • An ApsaraMQ for Kafka instance

  • A consumer group with at least one consumer

View rebalance details in the console

  1. Log on to the ApsaraMQ for Kafka console.

  2. In the top navigation bar, select the region of your instance. On the Instances page, click the instance name.

  3. In the left-side navigation pane, click Groups. Find the target group and click its name.

  4. On the Group Details page, click the Rebalance Details tab.

The tab displays each rebalance event with the following details:

  • Start time: When the rebalance started.

  • Duration: How long the rebalance took to complete.

  • Trigger: The event that caused the rebalance (subscription change, consumer join or leave, partition change).

  • Total rebalance count: The cumulative number of rebalances for this consumer group.

  • Consumers involved: The consumers that were added or removed during the rebalance.

Troubleshoot frequent rebalances

Frequent rebalances typically mean that consumers are being removed from the group faster than expected. The root cause depends on your client version.

Identify the cause

  • Before 0.10.2: Heartbeat messages are sent through the poll() interface -- no dedicated heartbeat thread exists. If message processing blocks the poll loop, the heartbeat times out and the broker removes the consumer.

  • 0.10.2 and later: A dedicated heartbeat thread exists, but the consumer is still removed if no poll() call occurs within max.poll.interval.ms (default: 5 minutes). Long-running message processing is the most common trigger.

Tune consumer parameters

Four parameters control how the broker detects unresponsive consumers. Each involves a trade-off between tolerance for slow processing and speed of failure detection.

  • session.timeout.ms (default: 10 s in 0.10.2 and later): How long the broker waits for a heartbeat before removing the consumer. Higher values tolerate longer pauses but delay detection of crashed consumers.

  • heartbeat.interval.ms (default: 3 s): How often the consumer sends heartbeats to the coordinator; must be lower than session.timeout.ms (typically one-third). Lower values detect rebalances faster but increase network overhead.

  • max.poll.interval.ms (default: 5 min): Maximum allowed time between consecutive poll() calls; if exceeded, the consumer is removed. Higher values tolerate slow batch processing but delay detection of stuck consumers.

  • max.poll.records (default: varies by client): Maximum number of records returned per poll() call. Fewer records per batch means faster processing and more frequent poll() calls, which reduces the risk of exceeding max.poll.interval.ms but increases poll overhead.

Set parameter values

  • session.timeout.ms

    • Clients before 0.10.2: Set this to a value larger than the time to process a single batch, but no greater than 30 seconds. A value of 25 seconds works well for most workloads.

    • Clients 0.10.2 and later: Keep the default value of 10 seconds.

  • heartbeat.interval.ms

    • Set this to one-third of session.timeout.ms. For example, if session.timeout.ms is 10 seconds, set heartbeat.interval.ms to 3 seconds.

  • max.poll.records

    • Set this well below the number of records your consumers can process within max.poll.interval.ms:

      max.poll.records << records_per_second_per_thread x thread_count x (max.poll.interval.ms / 1000)

  • max.poll.interval.ms

    • Set this above the time needed to process a full batch at peak load:

      max.poll.interval.ms > (max.poll.records / (records_per_second_per_thread x thread_count)) x 1000
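The two inequalities above can be wrapped into small helper functions to sanity-check a planned configuration. The per-thread throughput and thread count are assumptions you would measure for your own consumers:

```python
def safe_max_poll_records(records_per_sec_per_thread: float,
                          thread_count: int,
                          max_poll_interval_ms: int,
                          headroom: float = 0.5) -> int:
    """Upper bound for max.poll.records, with headroom below the
    theoretical limit (the '<<' in the formula above)."""
    limit = records_per_sec_per_thread * thread_count * (max_poll_interval_ms / 1000)
    return int(limit * headroom)

def min_max_poll_interval_ms(max_poll_records: int,
                             records_per_sec_per_thread: float,
                             thread_count: int) -> int:
    """Lower bound for max.poll.interval.ms given a batch size."""
    return int(max_poll_records / (records_per_sec_per_thread * thread_count) * 1000)

# Example: 50 records/s per thread, 4 threads, 5-minute poll interval.
print(safe_max_poll_records(50, 4, 300_000))  # 30000
# A 500-record batch at that throughput needs at least 2.5 s between polls.
print(min_max_poll_interval_ms(500, 50, 4))   # 2500
```

The 50% headroom factor is an arbitrary safety margin standing in for the "well below" wording; tighten or loosen it based on how spiky your processing times are.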

Additional recommendations

  1. Offload processing from the poll loop. Move message processing to a separate thread pool so that poll() returns quickly.

  2. Limit topic subscriptions per group. Subscribe each consumer group to no more than five topics. For best performance, assign one topic per group.

  3. Upgrade to client version 0.10.2 or later. Older clients lack a dedicated heartbeat thread, which makes them far more susceptible to rebalances during slow processing.
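Recommendation 1 above can be sketched with a thread pool that takes slow processing off the poll loop. This is a minimal illustration using simulated records in place of a real Kafka client; in production you would also need to manage offset commits carefully, since records now complete out of band:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_poll(batch_num):
    # Hypothetical stand-in for the records returned by poll().
    return [f"record-{batch_num}-{i}" for i in range(3)]

def process(record):
    # Placeholder for slow per-record work.
    return record.upper()

futures = []
with ThreadPoolExecutor(max_workers=4) as pool:
    for batch_num in range(2):
        # The loop returns to poll() quickly because each record is
        # submitted to the pool instead of being processed inline.
        for record in fake_poll(batch_num):
            futures.append(pool.submit(process, record))

# Results are collected after the loop; a real consumer would commit
# offsets only once the corresponding records have finished processing.
results = [f.result() for f in futures]
print(len(results))  # 6
```

Keeping poll() fast this way means max.poll.interval.ms measures only the loop overhead, not the per-record processing time.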