ApsaraMQ for Kafka: View instance risks

Last Updated: Mar 11, 2026

ApsaraMQ for Kafka periodically runs health checks on each instance through its real-time diagnosis feature. Diagnosed risks are surfaced with suggested fixes, and alert notifications for anomalies are sent to designated contacts.

How it works

[Figure: How it works]

Alert notifications

Alerts are triggered only for urgent and unhealthy conditions.

  • Default behavior: If no alert contact is configured, notifications are sent to the contact person for the Alibaba Cloud account that owns the instance.

  • Custom contacts: If alert contacts are configured, notifications are sent only to those contacts during their specified time ranges. For details, see Manage alert contacts.

  • Cloud Monitor integration: Instance risks are synchronized to the Cloud Monitor Event Center. Subscribe to risk events for advanced policies such as denoising, notification methods, channels, and templates. For details, see Manage event subscriptions (Recommended). To subscribe to Kafka risk events in Event Center, use the following settings:

    Field | Value
    Alibaba Cloud Service | Kafka
    Event Type | Abnormal
    Event Name | Instance Risk Alert

View instance risks in the console

  1. Log on to the ApsaraMQ for Kafka console.

  2. In the Resource Distribution section of the Overview page, select the region where the instance resides.

  3. On the Instances page, click the instance name.

  4. On the Instance Details page, click the Instance Risks tab.

The following table describes the columns on the Instance Risks tab:

Column | Description | Example
Risk Type | The type of instance risk. | Group with Long Consumption Time
Metric Level | The severity of the risk. Valid values: Repair Required, Important, Normal. | Important
Risk Status | The current state of the risk. Valid values: To Be Fixed, Fixed. | To Be Fixed
Time of Last Alert | The time when the risk was last detected. | March 31, 2022
Actions | Operations available for the risk item. See descriptions below. | --

Manage risk entries

  • Details: View the diagnosis details and suggested fixes for a risk.

  • Modify Alert Status: After resolving a risk, set Risk Status to Fixed. Alternatively, ignore an unfixed risk for the next month.

    Note

    After a risk is marked as Fixed, alert notifications stop. If the same risk recurs, a new alert is sent after 7 days.

  • Delete: Remove a risk entry after its status is changed to Fixed.

    Note

    Wait at least 7 days after marking a risk as Fixed before deleting it. This prevents new alerts from being triggered by data that has not yet been cleaned up.
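
The waiting period before deletion can be checked with a small helper. This is a minimal sketch, assuming you record the time at which a risk was marked as Fixed yourself; `can_delete_risk` and its parameters are hypothetical names and are not part of any ApsaraMQ for Kafka API.

```python
from datetime import datetime, timedelta

# Waiting period taken from the note on deleting risk entries.
MIN_WAIT_AFTER_FIXED = timedelta(days=7)


def can_delete_risk(fixed_at: datetime, now: datetime) -> bool:
    """Return True once at least 7 days have passed since the risk was marked Fixed.

    Hypothetical helper: the console does not expose this function.
    """
    return now - fixed_at >= MIN_WAIT_AFTER_FIXED
```

For example, a risk marked as Fixed on March 31 would become deletable on April 7 under this check.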

Risk types

Note

If a risk is detected, follow the suggestions in the console to resolve it.

Each section below describes a risk type, its severity levels, the diagnosis output, and the recommended fix.


CPU consumption percentage

High CPU usage is often caused by fragmented sending. An alert is triggered in Prometheus when CPU usage exceeds 70%.

Severity | Diagnosis
Normal | CPU consumption: XX%
Important | CPU consumption: XX%

Fix: Identify and resolve fragmented sending issues to reduce CPU consumption.

Recheck: Monitor the CPU usage metric in Prometheus.


Disk usage

Disk usage directly affects data safety and cluster availability. ApsaraMQ for Kafka takes automatic protective actions at specific thresholds:

  • 85% or higher: Unexpired messages are periodically deleted.

  • 90% or higher: Write protection is activated, blocking all incoming writes to the cluster.

An alert is triggered in Prometheus when disk usage exceeds 80%.

Severity | Diagnosis
Important | Disk usage: XX%
Fix as soon as possible | Disk usage: XX%

Fix: Reduce disk usage by cleaning up unused topics, reducing message retention periods, or upgrading the instance.

Recheck: Monitor the disk usage metric in Prometheus.
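
The disk usage thresholds in this section can be summarized in code. The following is a minimal sketch that only restates the documented thresholds (alert above 80%, deletion at 85% or higher, write protection at 90% or higher); the function name and the effect labels are hypothetical and chosen for illustration.

```python
from typing import List


def disk_usage_effects(usage_percent: float) -> List[str]:
    """Return the documented effects that apply at a disk usage percentage (0-100).

    The labels are illustrative names, not values returned by the service.
    """
    effects = []
    if usage_percent > 80:   # alert threshold
        effects.append("alert")
    if usage_percent >= 85:  # unexpired messages are periodically deleted
        effects.append("message_deletion")
    if usage_percent >= 90:  # write protection blocks incoming writes
        effects.append("write_protection")
    return effects
```

A check like this could back a custom dashboard or script that warns before the cluster reaches the write protection threshold.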


Disk skew

Uneven disk usage across the cluster prevents full utilization of disk performance and capacity.

An alert is triggered when the difference between the maximum and minimum disk usage exceeds 50%. The console displays only the maximum disk usage.

Severity | Diagnosis
Important | Disk skew detected
Fix as soon as possible | Disk skew detected

Fix: Add topic partitions based on the recommended value, or trigger the cluster partition balancing feature. For details, see What do I do if topic partitions are skewed?.
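
The skew condition can be reproduced from per-node disk usage figures. This is a minimal sketch, assuming that "exceeds 50%" means a gap of more than 50 percentage points between the fullest and the emptiest disk; the function name and the input mapping are hypothetical.

```python
from typing import Mapping


def disk_skew_exceeds(usage_by_node: Mapping[str, float], threshold: float = 50.0) -> bool:
    """Return True if max minus min disk usage is greater than the threshold.

    usage_by_node maps a node label to its disk usage percentage (0-100).
    The threshold is interpreted as percentage points, which is an assumption.
    """
    if len(usage_by_node) < 2:
        return False  # skew needs at least two nodes to compare
    values = list(usage_by_node.values())
    return max(values) - min(values) > threshold
```

With usages of 82%, 30%, and 45%, the gap is 52 percentage points, so the check returns True.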


Message production format conversion latency

Format conversion during message production affects overall sending performance. This occurs when the sending client version differs from the server version.

Severity | Diagnosis
Important | TP98 production format conversion latency: XX ms
Fix as soon as possible | TP98 production format conversion latency: XX ms

Fix: Align the sending client version with the server version.

Recheck: Check whether the production client version is inconsistent with the server version for the affected topics.


Message consumption format conversion latency

Format conversion during message consumption affects overall consumption performance. This occurs when the consumption client version differs from the server version.

Severity | Diagnosis
Important | TP98 consumption format conversion latency: XX ms
Fix as soon as possible | TP98 consumption format conversion latency: XX ms

Fix: Align the consumption client version with the server version.

Recheck: Check whether the consumption client version is inconsistent with the server version.


Topic format conversion

Topics undergoing format conversion may experience degraded sending performance. This occurs when the production client version differs from the server version.

Severity | Diagnosis
Normal | XX topics have format conversion
Important | XX topics have format conversion

Fix: Align the sending client version with the server version to eliminate format conversion overhead.

Recheck: Check whether the production client version is inconsistent with the server version for the affected topics.


Group subscribes to too many topics

A group that subscribes to too many topics is prone to rebalancing events, which degrades overall consumption performance. An alert is triggered when a group subscribes to more than one topic.

Severity | Diagnosis
Normal | XX groups subscribe to too many topics
Important | XX groups subscribe to too many topics

Fix: Maintain a one-to-one subscription relationship between groups and topics where possible. For details, see Best practices for subscribers.
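
To find groups that break the one-to-one relationship, you can scan your own inventory of subscriptions. This is a minimal sketch, assuming you already have a mapping from group ID to subscribed topics (for example, exported from your deployment configuration); the function name and data shape are hypothetical.

```python
from typing import Dict, List, Set


def groups_with_multiple_topics(subscriptions: Dict[str, Set[str]]) -> List[str]:
    """Return the IDs of groups that subscribe to more than one topic, sorted by name."""
    return sorted(group for group, topics in subscriptions.items() if len(topics) > 1)
```

Groups returned by this check are candidates for splitting into one group per topic.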


Use of Sarama Go client

The Sarama Go client has many known issues and is not recommended for production use. An alert is triggered when a consumer client uses Sarama Go to commit offsets.

Severity | Diagnosis
Normal | XX groups use the Sarama Go client for consumption
Important | XX groups use the Sarama Go client for consumption

Fix: Migrate to a different Go client. For details, see Why am I not advised to use the Sarama Go client to send and receive messages?.


Rebalancing timeout

Long rebalancing timeouts are typically caused by setting max.poll.interval.ms to an excessively large value, which delays Kafka rebalancing.

Severity | Diagnosis
Normal | XX groups have rebalancing timeouts
Important | XX groups have rebalancing timeouts

Fix: Reduce the max.poll.interval.ms value.

Recheck: Go to the details page of the affected group to view rebalancing details.
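
One way to pick a smaller max.poll.interval.ms is to derive it from how long one batch actually takes to process, because the parameter bounds the time between two poll calls. The sizing rule below is an illustrative heuristic and is not taken from the ApsaraMQ for Kafka documentation; the function name, its parameters, and the default headroom factor are assumptions.

```python
import math


def suggest_max_poll_interval_ms(
    max_poll_records: int,
    worst_case_ms_per_record: float,
    headroom: float = 2.0,
) -> int:
    """Estimate a max.poll.interval.ms that covers one full batch plus headroom.

    Heuristic sketch: batch time = records per poll * worst-case time per record,
    multiplied by a safety factor. Measure your own workload before relying on it.
    """
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    batch_ms = max_poll_records * worst_case_ms_per_record
    return math.ceil(batch_ms * headroom)
```

For a consumer that polls 500 records and needs up to 20 ms per record, this estimate yields 20000 ms.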


Consumer client actively leaves the queue

When a consumer client actively leaves a group, it triggers a rebalancing event that disrupts consumption for the entire group.

Severity | Diagnosis
Important | XX group consumers actively leave the queue and trigger a rebalancing event
Fix as soon as possible | XX group consumers actively leave the queue and trigger a rebalancing event

Fix: Check whether the consumer client is stuck or configured improperly. For details, see Why does the consumer client frequently perform rebalancing?.

Recheck: Go to the details page of the affected group to view rebalancing details.


Groups with high consumption latency

High consumption latency typically results from:

  • A fetch.max.bytes value that is too large on the consumer client.

  • Poor network conditions between the client and the cluster.

Severity | Diagnosis
Normal | XX groups have high latency in receiving consumed data
Important | XX groups have high latency in receiving consumed data

Fix: Reduce fetch.max.bytes or improve the network environment. For details, see Best practices for subscribers.

Recheck: Monitor the consumption latency of the consumer client.


Group quota

The group quota is approaching its limit.

Severity | Diagnosis
Normal | Remaining group quota: XX
Important | Remaining group quota: XX

Fix: Delete unused groups or upgrade the instance to increase the quota.

Recheck: View the current number of groups on the instance details page.


Topic quota

The topic quota is approaching its limit.

Severity | Diagnosis
Important | Remaining topic quota: XX
Fix as soon as possible | Remaining topic quota: XX

Fix: Delete unused topics or upgrade the instance to increase the quota.

Recheck: View the current number of topics on the instance details page.


Partition quota

The partition quota is approaching its limit.

Severity | Diagnosis
Important | Remaining partition quota: XX
Fix as soon as possible | Remaining partition quota: XX

Fix: Delete unused partitions or upgrade the instance to increase the quota.

Recheck: View the current number of partitions on the instance details page.


Server minor version upgrade

The server minor version is outdated. Upgrading to the latest minor version fixes known open source bugs and improves cluster performance and stability.

Severity | Diagnosis
Important | The current server minor version is XX versions behind the latest minor version
Fix as soon as possible | The current server minor version is XX versions behind the latest minor version

Fix: Upgrade the server to the latest minor version from the instance details page.


TCP connections for a single node

Too many TCP connections on a single node degrade cluster stability. If the connection count continues to grow, some connections may fail.

An alert is triggered when the connection count exceeds the specification limit. For specification limits, see Limits.

Severity | Diagnosis
Important | Number of TCP connections for a single node: XX
Fix as soon as possible | Number of TCP connections for a single node: XX

Fix:

  • Check whether connection objects are repeatedly instantiated.

  • Reduce the number of clients and adjust the batch.size and linger.ms parameters to aggregate data into batches before sending.

Recheck: View the maximum number of TCP connections for an instance node on the dashboard or in Prometheus.
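
Repeated instantiation usually happens when a new client object is created for every send. The sketch below illustrates the reuse pattern with a stand-in class, because no real Kafka client is assumed here; `StubProducer`, `get_producer`, and `publish` are hypothetical names, and in real code the stub would be replaced by your client library's producer.

```python
import functools


class StubProducer:
    """Stand-in for a real Kafka producer; a real one would own TCP connections."""

    instances_created = 0  # counts how many producer objects were built

    def __init__(self, bootstrap_servers: str) -> None:
        StubProducer.instances_created += 1
        self.bootstrap_servers = bootstrap_servers
        self.sent = []

    def send(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))


@functools.lru_cache(maxsize=None)
def get_producer(bootstrap_servers: str) -> StubProducer:
    """Create one producer per endpoint and return the same object on later calls."""
    return StubProducer(bootstrap_servers)


def publish(bootstrap_servers: str, topic: str, value: bytes) -> None:
    """Send through the shared producer instead of constructing a new one per message."""
    get_producer(bootstrap_servers).send(topic, value)
```

Calling `publish` many times for the same endpoint builds a single producer object, which keeps the connection count per client process flat.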


Public TCP connections for a single node

Too many public TCP connections on a single node degrade cluster stability.

Important

Public network connections are heavyweight with poor performance. Use them only for development and testing. For production, use VPC connections.

An alert is triggered when the connection count exceeds the specification limit. For specification limits, see Limits.

Severity | Diagnosis
Important | Number of public TCP connections for a single node: XX
Fix as soon as possible | Number of public TCP connections for a single node: XX

Fix:

  • Check whether connection objects are repeatedly instantiated.

  • Reduce the number of clients and adjust the batch.size and linger.ms parameters to aggregate data into batches before sending.

Recheck: View the maximum number of public TCP connections for an instance node on the dashboard or in Prometheus.


Synchronous sending

Topics using synchronous disk flushing with acks=all have poor sending performance and reduce the cluster's processing efficiency.

Severity | Diagnosis
Important | XX topics have a synchronous sending issue
Fix as soon as possible | XX topics have a synchronous sending issue

Fix: If your business allows, set acks=1 to improve sending efficiency. For details, see Best practices for publishers.

Recheck: Check whether the sending client for the affected topics is configured with acks set to all or -1.
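
The recheck can be scripted against the producer settings you deploy. This is a minimal sketch, assuming the settings are available as a dictionary keyed by the standard Kafka property names; the function name is hypothetical. A missing `acks` key returns False here, although the effective default depends on the client version.

```python
def uses_full_acks(producer_config: dict) -> bool:
    """Return True if the producer config sets acks to 'all' or -1."""
    value = str(producer_config.get("acks", "")).strip().lower()
    return value in {"all", "-1"}
```

Producers flagged by this check are the ones to review before switching to acks=1.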


Fragmented sending

Fragmented sending causes sending queue timeouts and degrades the cluster's throughput and stability.

An alert is triggered when the sent batchSize is less than 4 KB and the node CPU usage exceeds 60%. For best results, use client version 2.4.0 or later.

Severity | Diagnosis
Important | XX topics have a fragmented sending issue
Fix as soon as possible | XX topics have a fragmented sending issue

Fix:

  • Adjust the batch.size and linger.ms parameters.

  • For topics with many partitions, use the sticky partition sending policy.

For details, see Best practices for publishers.
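
The alert condition for fragmented sending combines two measurements. The following minimal sketch restates it, assuming 4 KB means 4,096 bytes and that you can obtain an average sent batch size and node CPU usage from your own monitoring; the function and constant names are hypothetical.

```python
# Thresholds from the alert condition; 4 KB is assumed to mean 4096 bytes.
FRAGMENT_BATCH_BYTES = 4 * 1024
FRAGMENT_CPU_PERCENT = 60.0


def is_fragmented_sending(avg_batch_bytes: float, node_cpu_percent: float) -> bool:
    """Return True when batches are under 4 KB and node CPU usage is above 60%."""
    return avg_batch_bytes < FRAGMENT_BATCH_BYTES and node_cpu_percent > FRAGMENT_CPU_PERCENT
```

Both conditions must hold: small batches on a lightly loaded node do not match the alert condition.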


Whitelist security group sharing

When multiple instances share the same whitelist security group, modifying the whitelist of one instance affects all other instances using that security group. This increases the blast radius of misconfigurations.

Severity | Diagnosis
Important | The default endpoint whitelist shares security group ID: XX
Fix as soon as possible | The default endpoint whitelist shares security group ID: XX

Fix: Check whether the security group is shared with other resources. If so, assign a dedicated security group to the instance.


Single-partition topic risk

A single partition of cloud storage may become unavailable during a node failure or upgrade. If a single partition is required, use local storage instead.

Severity | Diagnosis
Important | There are currently XX single-partition topics for cloud storage
Fix as soon as possible | There are currently XX single-partition topics for cloud storage

Fix: Add more partitions to the affected topics.

Recheck: Check the number of partitions for the affected topics.


Topic partition skew

Uneven partition distribution across topics has the following risks:

  • Prevents full utilization of disk performance and capacity.

  • May trigger single-node rate limiting.

Severity | Diagnosis
Important | There are currently XX topics with partition skew
Fix as soon as possible | There are currently XX topics with partition skew

Fix: Add topic partitions based on the recommended value. For details, see What do I do if topic partitions are skewed?.

Recheck: Check whether the number of partitions on the details page of the affected topic meets the recommended number.


Node sending traffic

Node sending traffic has reached its upper limit.

Severity | Diagnosis
Important | Node sending traffic has exceeded the limit: XX%
Fix as soon as possible | Node sending traffic has exceeded the limit: XX%

Fix: Upgrade the instance to increase the traffic capacity.

Recheck: Check the maximum production traffic of the node (bytes/s) in Prometheus. Also check whether the production rate limiting queue length (items/second) indicates active rate limiting.


Node consumption traffic

Node consumption traffic has reached its upper limit.

Severity | Diagnosis
Important | Node consumption traffic has exceeded the limit: XX%
Fix as soon as possible | Node consumption traffic has exceeded the limit: XX%

Fix: Upgrade the instance to increase the traffic capacity.

Recheck: Check the maximum consumption traffic of the node (bytes/s) in Prometheus. Also check whether the consumption rate limiting queue length (items/second) indicates active rate limiting.


Sending traffic

Cluster production traffic has reached its upper limit. Rate limiting may apply, causing production sending timeouts.

Severity | Diagnosis
Important | Sending traffic has exceeded the limit: XX%
Fix as soon as possible | Sending traffic has exceeded the limit: XX%

Fix: Upgrade the instance to increase the traffic capacity.

Recheck: Check the message production traffic (bytes/s) in Prometheus. Also check whether the production rate limiting queue length (items/second) indicates active rate limiting.


Consumption traffic

Cluster consumption traffic has reached its upper limit. Rate limiting may apply, causing messages to stack instead of being consumed promptly.

Severity | Diagnosis
Important | Consumption traffic has exceeded the limit: XX%
Fix as soon as possible | Consumption traffic has exceeded the limit: XX%

Fix: Upgrade the instance to increase the traffic capacity.

Recheck: Check the message consumption traffic (bytes/s) in Prometheus. Also check whether the consumption rate limiting queue length (items/second) indicates active rate limiting.


Partition assignment policy

The same partition is assigned to multiple consumer threads.

Severity | Diagnosis
Important | Currently, XX groups have the same partition assigned to multiple consumer threads
Fix as soon as possible | Currently, XX groups have the same partition assigned to multiple consumer threads

Fix: Check the consumer assignment policy. For details, see Why is the same partition consumed by multiple consumer threads?.


Consumer offset commit frequency

Committing consumer offsets too frequently degrades cluster performance and stability. The diagnosis lists the top 10 groups with the highest commit frequency.

Severity | Diagnosis
Important | The consumer client commits consumer offsets XX times per second
Fix as soon as possible | The consumer client commits consumer offsets XX times per second

Fix: Switch to autocommit for consumer offsets, or reduce the commit frequency. For details, see Best practices for subscribers.
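
If manual commits are required, the commit frequency can be reduced by committing at most once per interval instead of once per message. This is a minimal sketch that wraps any commit callable; `ThrottledCommitter` is a hypothetical helper, and the 5-second default is an assumption chosen to match the Java client's default auto.commit.interval.ms of 5000 ms.

```python
import time
from typing import Callable, Optional


class ThrottledCommitter:
    """Invoke a commit callable at most once per min_interval_s seconds."""

    def __init__(
        self,
        commit: Callable[[], None],
        min_interval_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._commit = commit
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the behavior can be tested
        self._last_commit: Optional[float] = None

    def maybe_commit(self) -> bool:
        """Commit if the interval has elapsed; return True when a commit was issued."""
        now = self._clock()
        if self._last_commit is not None and now - self._last_commit < self._min_interval_s:
            return False
        self._commit()
        self._last_commit = now
        return True
```

In a consume loop, `maybe_commit()` would be called after each processed batch, with the client's own commit method passed in as `commit`. Offsets processed since the last commit are re-consumed after a restart, so the interval trades commit load against duplicate processing.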


Groups with rebalancing within one day

Groups that triggered rebalancing events within the past day may indicate improper client configuration.

Severity | Diagnosis
Normal | XX groups triggered a rebalancing event within one day
Important | XX groups triggered a rebalancing event within one day

Fix: Check whether the rebalancing is caused by improper consumer client configuration. For details, see Why does the consumer client frequently perform rebalancing?.

Recheck: View the rebalancing details on the affected group's details page.


Disk cold read

Cold reads occur when consumers read large amounts of data from disk instead of the page cache. This degrades cluster performance and stability, and is typically caused by message stacking due to high consumption latency.

An alert is triggered when disk input/output operations per second (IOPS) or bandwidth usage exceeds 70%.

Severity | Diagnosis
Important | Cold read degree: XX%
Fix as soon as possible | Cold read degree: XX%

Fix: Increase the consumption rate or reset the consumer offset to skip the stacked messages.


Disk write protection

Disk write protection is triggered when disk usage exceeds 90%, blocking all incoming writes to the cluster.

Severity | Diagnosis
Important | Disk write protection is triggered.
Fix as soon as possible | Disk write protection is triggered.

Fix: Reduce disk usage immediately by cleaning up data or upgrading the instance.

Recheck: Check in Prometheus whether the instance disk usage exceeds 90%.


Consumer offset rollback

An offset reset may have been triggered, causing consumer offsets to roll back to a previous position.

Severity | Diagnosis
Important | XX groups have consumer offset rollbacks
Fix as soon as possible | XX groups have consumer offset rollbacks

Fix: Check whether a historical offset was committed by the consumer client.


Use of GZIP compression

GZIP compression places a heavier load on the cluster than other compression algorithms.

Severity | Diagnosis
Important | XX topics use GZIP compression
Fix as soon as possible | XX topics use GZIP compression

Fix: Check the production client's compression configuration and switch to a lighter algorithm (such as LZ4 or Snappy).
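
The compression setting can be audited across producer configurations. This is a minimal sketch, assuming the settings are available as a dictionary keyed by the standard Kafka property name `compression.type`; the function name is hypothetical.

```python
def uses_gzip(producer_config: dict) -> bool:
    """Return True if compression.type is set to gzip (case-insensitive)."""
    value = str(producer_config.get("compression.type", "none")).strip().lower()
    return value == "gzip"
```

Producers flagged by this check can be switched to `lz4` or `snappy`, as suggested in the fix.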


Possible early message cleanup

When a single disk has limited capacity and high usage, messages may be deleted before their time-to-live (TTL) expires.

Severity | Diagnosis
Important | Messages may be deleted before their time-to-live (TTL) expires
Fix as soon as possible | Messages may be deleted before their time-to-live (TTL) expires

Fix: Monitor disk capacity and current maximum disk usage. Upgrade the instance or reduce message retention to prevent early cleanup.


Server major version has expired

The server major version is too low (0.10.x) and has known open source bugs that affect service stability.

Severity | Diagnosis
Important | The server major version is too low and has expired.
Fix as soon as possible | The server major version is too low and has expired.

Fix: Upgrade the server major version by following the upgrade instructions in the documentation.


Consumer offset contains leader_epoch

When the consumer client carries leader_epoch records while committing offsets, consumption abnormalities or failures may occur. This affects client versions between 2.3 and 2.6 running against server version 2.2.0. For details, see KAFKA-9724.

Severity | Diagnosis
Important | XX groups have leader_epoch records when committing offsets
Fix as soon as possible | XX groups have leader_epoch records when committing offsets

Fix: Upgrade the consumer client to version 2.6 or later.
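
A version check can flag client and server combinations in the affected range. This is a minimal sketch, assuming the range means client 2.3 up to but not including 2.6 (because the fix is to upgrade to 2.6 or later) and a server version of exactly 2.2.0; the function names are hypothetical, and version strings are expected to be purely numeric, such as "2.4.1".

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Convert a dotted numeric version such as '2.4.1' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def leader_epoch_risk(client_version: str, server_version: str) -> bool:
    """Return True for clients in [2.3, 2.6) running against server 2.2.0.

    The half-open client range is an interpretation of the text, not a confirmed rule.
    """
    client = parse_version(client_version)[:2]
    server = parse_version(server_version)[:3]
    return (2, 3) <= client < (2, 6) and server == (2, 2, 0)
```

Under these assumptions, a 2.4.1 client against a 2.2.0 server is flagged, while a 2.6.0 client is not.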


Local storage risk

Topics using local storage (LocalTopic) have many known open source issues. Alibaba Cloud is not responsible for business losses caused by defects in open source Kafka or its third-party components, or by improper configuration and use.

Severity | Diagnosis
Important | XX topics use local storage
Fix as soon as possible | XX topics use local storage

Fix: Migrate the affected topics to cloud storage.


Scheduled ECS restart

A scheduled ECS restart event exists for nodes in the cluster. During the restart, slight service traffic jitter may occur.

Severity | Diagnosis
Important | There are XX nodes in the cluster with scheduled ECS restart events
Fix as soon as possible | There are XX nodes in the cluster with scheduled ECS restart events

Fix: Monitor the situation closely during the scheduled maintenance window.


Unused Connector service

The Connector service is transitioning to a paid model. Release the resource to avoid charges if it is no longer needed.

Severity | Diagnosis
Normal | The Connector service will be charged soon.
Important | The Connector service will be charged soon.

Fix: Go to Connector Ecosystem Integration > Task List to release the service.


Unused message retrieval service

The message retrieval service is transitioning to a paid model. Release the resource to avoid charges if it is no longer needed.

Severity | Diagnosis
Normal | The message retrieval service will be charged soon.
Important | The message retrieval service will be charged soon.

Fix: Go to the Message Retrieval page to release the service.


Unused cloud migration service

The cloud migration service is transitioning to a paid model. Release the resource to avoid charges if it is no longer needed.

Severity | Diagnosis
Normal | The cloud migration service will be charged soon.
Important | The cloud migration service will be charged soon.

Fix: Go to the Migration Task page to release the service.


Message batch is too large

When a message batch exceeds the server's maximum allowed size, the batch cannot reach the server and sending requests are rejected.

Severity | Diagnosis
Important | XX topics have an issue where the message batch is too large.
Fix as soon as possible | XX topics have an issue where the message batch is too large.

Fix: Increase the maximum message size on the server, or decrease the max.request.size value on the client. For details, see Limits.

Recheck: Check whether the batch.size configuration of the production client is too large.


Too many messages in a batch

When a batch contains too many individual messages, sending fails due to overflow.

Severity | Diagnosis
Important | XX topics have an issue where there are too many messages in a batch.
Fix as soon as possible | XX topics have an issue where there are too many messages in a batch.

Fix: Decrease the batch.size value to prevent overflow. For details, see Limits.

Recheck: Check whether the batch.size configuration is too large or whether individual messages are too small.


Server network architecture version is too low

The instance uses an outdated network architecture with known security weaknesses. This architecture is scheduled to be decommissioned.

Severity | Diagnosis
Important | The server network architecture version is too low.
Fix as soon as possible | The server network architecture version is too low.

Fix: Create a new instance (Serverless, subscription, or pay-as-you-go) and use the migration feature in the ApsaraMQ for Kafka console to migrate from the old instance. After migration is complete, unsubscribe from the old instance.

References

For other common issues and solutions, see FAQ.