Fix Kafka Consumer Pull Failures & Slow Reads - ApsaraMQ for Kafka

A consumer subscribed to topics with incoming messages may pull messages slowly or not at all, even though the consumer has not caught up to the latest offset. This typically happens when consumption traffic exceeds the available network bandwidth, especially over the Internet.

Possible causes

Three bandwidth-related conditions cause slow message pulling:

Cause	Description
Bandwidth saturation	Total consumption traffic from the instance has reached the network bandwidth limit
Oversized single message	An individual message is larger than the available network bandwidth can deliver promptly
Batch fetch exceeds bandwidth	The combined size of messages pulled in a single fetch request exceeds the available bandwidth

Consumer configuration parameters

The following consumer configuration parameters control how much data the consumer pulls at a time:

Parameter	What it controls
max.poll.records	Maximum number of messages returned to the consumer by a single poll() call
fetch.max.bytes	Maximum number of bytes of messages returned for a single fetch request
max.partition.fetch.bytes	Maximum number of bytes of messages returned from a single partition for a single fetch request
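
For reference, the three parameters can be written as plain key-value settings. The sketch below uses a Python dictionary with the property names from the table and the default values of the Apache Kafka Java client. It is not tied to any particular client library, and some clients spell or support these options differently, so treat the `to_properties` helper as illustrative only.

```python
# Consumer fetch settings as plain key-value pairs (illustrative only).
# Values are the Apache Kafka Java client defaults.
consumer_config = {
    "max.poll.records": 500,                   # messages returned by one poll() call
    "fetch.max.bytes": 50 * 1024 * 1024,       # 52,428,800 bytes per fetch request
    "max.partition.fetch.bytes": 1024 * 1024,  # 1,048,576 bytes per partition per fetch
}


def to_properties(config):
    """Render the settings in .properties syntax, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


print(to_properties(consumer_config))
```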

Diagnose and resolve the issue

Step 1: Confirm that messages exist in the topic

Log on to the ApsaraMQ for Kafka console.
Query messages for the target topic.

If no messages are returned, the issue is on the producer side, not the consumer. The remaining steps apply only when messages exist but the consumer cannot pull them fast enough.

Step 2: Check whether consumption traffic has reached the bandwidth limit

In the left-side navigation pane of the Instances page, choose Observability > CloudMonitor.
Click the Monitoring Chart tab.
Locate the instance_internet_rx.rate(bit/s) chart and check whether consumption traffic has reached the bandwidth ceiling.

If traffic is at the limit, increase the instance network bandwidth.
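
The check in this step amounts to comparing the observed rate with the configured bandwidth. A minimal sketch, assuming hypothetical sample values read off the chart and a hypothetical 90% threshold for "at the limit" (neither value comes from the console):

```python
def is_saturated(samples_bit_per_s, bandwidth_mbit_per_s, threshold_percent=90):
    """Return True if the peak observed rate reaches the threshold.

    samples_bit_per_s: rates in bit/s, for example read from the monitoring chart.
    bandwidth_mbit_per_s: configured bandwidth in Mbit/s (1 Mbit = 1,000,000 bit).
    threshold_percent: hypothetical cutoff for "at the limit"; adjust as needed.
    """
    limit_bit_per_s = bandwidth_mbit_per_s * 1_000_000
    return max(samples_bit_per_s) * 100 >= threshold_percent * limit_bit_per_s


# Hypothetical samples for an instance with 10 Mbit/s of bandwidth.
samples = [4_200_000, 9_700_000, 9_900_000]
print(is_saturated(samples, bandwidth_mbit_per_s=10))
```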

Step 3: Check whether a single message exceeds the bandwidth

Check whether any individual message in the topic is large enough to saturate the available bandwidth on its own.

If so, increase the network bandwidth or reduce the message size at the producer, for example by compressing payloads or splitting large messages into smaller ones.
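
To judge whether a message is large enough to matter, it helps to estimate how long one message takes to transfer at the available bandwidth. The sketch below is plain arithmetic. The message size and bandwidth are hypothetical, and the estimate ignores protocol overhead and any other traffic sharing the link.

```python
def transfer_seconds(message_bytes, bandwidth_bit_per_s):
    """Ideal time to move one message over the link, ignoring protocol overhead."""
    return message_bytes * 8 / bandwidth_bit_per_s


# Hypothetical case: a 5 MiB message over 10 Mbit/s of bandwidth.
seconds = transfer_seconds(5 * 1024 * 1024, 10_000_000)
print(f"{seconds:.2f} s")  # prints "4.19 s"
```

A message that needs several seconds of the full link on its own is a likely candidate for compression or splitting.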

Step 4: Reduce the batch fetch size

If neither a single oversized message nor overall bandwidth saturation is the cause, the combined size of messages in a single fetch request may be exceeding the bandwidth limit. Adjust the following parameters:

fetch.max.bytes -- Set this to a value lower than the network bandwidth. The parameter is a byte count, so convert a bandwidth quoted in bit/s to bytes before comparing.
max.partition.fetch.bytes -- Set this to a value lower than the per-partition limit, calculated as:
```
  limit = network bandwidth / number of partitions the consumer subscribes to
```

Important

The meaning of "network bandwidth" depends on how the consumer connects to the broker:

Through a virtual private cloud (VPC): network bandwidth refers to the maximum write traffic of elastic network interfaces (ENIs) on the instance.
Over the Internet: network bandwidth refers to the Internet bandwidth of the instance.
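
The sizing rules in Step 4 can be sketched as a small calculation. It assumes the bandwidth is quoted in bit/s and treats one second's worth of bandwidth, converted to bytes, as the budget for a fetch. The 80% headroom and the example numbers are hypothetical choices, not values recommended by ApsaraMQ for Kafka.

```python
def suggest_fetch_limits(bandwidth_bit_per_s, partitions, headroom_percent=80):
    """Suggest byte limits that stay below the network bandwidth.

    bandwidth_bit_per_s: Internet bandwidth or ENI limit, depending on the connection.
    partitions: number of partitions the consumer subscribes to.
    headroom_percent: hypothetical factor (< 100) that keeps values below the limit.
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    budget_bytes = bandwidth_bit_per_s // 8           # bits -> bytes
    per_partition_limit = budget_bytes // partitions  # limit = bandwidth / partitions
    return {
        "fetch.max.bytes": budget_bytes * headroom_percent // 100,
        "max.partition.fetch.bytes": per_partition_limit * headroom_percent // 100,
    }


# Hypothetical consumer: 10 Mbit/s of bandwidth, 5 subscribed partitions.
limits = suggest_fetch_limits(10_000_000, partitions=5)
print(limits)
```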