ApsaraMQ for Kafka: View instance threats

Last Updated: Dec 10, 2025

The real-time diagnosis feature of ApsaraMQ for Kafka periodically performs health checks on your instances. You can view the diagnosed issues and their suggested fixes in the console, and alert notifications for abnormal items are sent to the specified contacts.

Implementation plan

(Figure: implementation plan of the real-time diagnosis feature)

Alert notification details

  • Alert notifications are triggered only for items in the urgent or unhealthy state.

  • If you do not add an alert contact, alert notifications are sent by default to the contact person for the Alibaba Cloud account that owns the instance.

  • If you add alert contacts, alert notifications are sent to them. Alert notifications are sent only during the time range specified for each contact. For more information, see Manage alert contacts.

Check items

Note

If a threat is detected in the instance, follow the suggestions in the console to fix it.

Each check item below is described by its threat type, metric level, diagnosis result, suggested fix, and threat recheck method.

CPU consumption percentage

Metric level: Normal or Important
Diagnosis result: CPU consumption: xx%
Suggested fix: High CPU usage is often caused by issues such as fragmented sending. You can fix these issues to reduce CPU consumption.
Threat recheck: An alert is triggered in Prometheus if the CPU usage exceeds 70%.

Disk usage

Metric level: Important or Fix as soon as possible
Diagnosis result: Disk usage: xx%
Suggested fix: To ensure data security and cluster stability:
  • If disk usage reaches 85% or higher, unexpired messages are periodically deleted.
  • If disk usage reaches 90% or higher, the cluster enables a write protection mechanism.
Threat recheck: An alert is triggered in Prometheus if the instance disk usage exceeds 80%.

Disk skew issue

Metric level: Important or Fix as soon as possible
Diagnosis result: None
Suggested fix: Disk skew in the cluster may prevent you from using the full disk performance and capacity. Add topic partitions based on the recommended value or trigger the cluster partition balancing feature. For more information about optimization, see What do I do if topic partitions are skewed?.
Threat recheck: An alert is triggered if the difference between the maximum and minimum disk usage is greater than 50%. The console displays only the maximum disk usage.

Time consumed for message production format conversion

Metric level: Important or Fix as soon as possible
Diagnosis result: TP98 time consumed for message production format conversion: xx ms
Suggested fix: The affected topics have format conversion, which affects the overall sending performance. To resolve this issue, align the versions of the sending client and the server.
Threat recheck: Check whether the version of the production client for the affected topics is inconsistent with the server version.

Time consumed for message consumption format conversion

Metric level: Important or Fix as soon as possible
Diagnosis result: TP98 time consumed for message consumption format conversion: xx ms
Suggested fix: Format conversion is occurring, which affects the overall consumption performance. To resolve this issue, align the versions of the consumption client and the server.
Threat recheck: Check whether the version of the consumption client is inconsistent with the server version.

Topic format conversion

Metric level: Normal or Important
Diagnosis result: xx topics have format conversion
Suggested fix: The affected topics have format conversion, which may affect the overall sending performance. Align the versions of the sending client and the server to reduce the performance loss caused by format conversion.
Threat recheck: Check whether the version of the production client for the affected topics is inconsistent with the server version.

Group subscribes to too many topics

Metric level: Normal or Important
Diagnosis result: xx groups subscribe to too many topics
Suggested fix: A group that subscribes to too many topics is prone to rebalancing events, which affects the overall consumption performance. If your business allows, maintain a one-to-one subscription relationship between groups and topics. For more information, see Best practices for subscribers.
Threat recheck: An alert is triggered if a group subscribes to more than one topic.

Use of Sarama Go client

Metric level: Normal or Important
Diagnosis result: xx groups use the Sarama Go client for consumption
Suggested fix: The affected groups use the Sarama Go client, which has many known issues and is not recommended. For more information, see Why am I not advised to use the Sarama Go client to send and receive messages?.
Threat recheck: An alert is triggered if the consumer client uses Sarama Go to commit offsets.

Rebalancing timeout

Metric level: Normal or Important
Diagnosis result: xx groups have rebalancing timeouts
Suggested fix: The affected groups take a long time to complete rebalancing. Do not set the max.poll.interval.ms parameter to a large value. Otherwise, each Kafka rebalance waits for that interval and takes a long time. A configuration sketch follows this entry.
Threat recheck: Go to the details page of the affected group to view the rebalancing details.
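
The following Java snippet is a minimal sketch of the relevant consumer settings. The endpoint, group name, and values are placeholders and illustrative assumptions, not values prescribed by ApsaraMQ for Kafka.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class RebalanceFriendlyConsumer {
        public static KafkaConsumer<String, String> create() {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "your-endpoint:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "your-group");                  // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Keep max.poll.interval.ms at the 5-minute default instead of raising it:
            // during a rebalance, the group waits up to this long for each member to rejoin.
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);
            // If one poll returns more records than you can process within that window,
            // fetch fewer records per poll rather than widening the interval.
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
            return new KafkaConsumer<>(props);
        }
    }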

Consumer client actively leaves the group

Metric level: Important or Fix as soon as possible
Diagnosis result: Consumers in xx groups actively leave the group and trigger a rebalancing event
Suggested fix: In the affected groups, the consumer client actively leaves the consumer group and triggers rebalancing. Check the following items:
  • Check whether the consumer client is stuck.
  • Check whether the consumer client is configured properly.
For more information, see Why does the consumer client frequently perform rebalancing?.
Threat recheck: Go to the details page of the affected group to view the rebalancing details.

Groups with high latency in receiving consumed data

Metric level: Normal or Important
Diagnosis result: xx groups have high latency in receiving consumed data
Suggested fix: The affected groups have high consumption latency. This may be caused by the following reasons:
  • The fetch.max.bytes parameter of the consumer client is set to a large value.
  • The client is in a poor network environment.
For more information about optimization, see Best practices for subscribers. A configuration sketch follows this entry.
Threat recheck: The consumer client checks the consumption latency.
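
As a reference, the consumer settings below cap how much data a single fetch can return, which shortens per-request transfer time on a slow network. The values are illustrative assumptions to tune against your message sizes and bandwidth.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class LowLatencyFetchConfig {
        public static Properties lowLatencyFetch() {
            Properties props = new Properties();
            // Cap the total bytes returned by one fetch so a single response
            // does not take long to assemble and transfer.
            props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 1048576);          // 1 MB per fetch
            props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 262144); // 256 KB per partition
            // Do not let the broker hold the fetch for long while waiting for data.
            props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
            return props;
        }
    }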

Group quota

Metric level: Normal or Important
Diagnosis result: Remaining group quota: xx
Suggested fix: The group quota is almost exhausted.
Threat recheck: You can view the current number of groups on the instance details page.

Topic quota

Metric level: Important or Fix as soon as possible
Diagnosis result: Remaining topic quota: xx
Suggested fix: The topic quota is almost exhausted.
Threat recheck: You can view the current number of topics on the instance details page.

Partition quota

Metric level: Important or Fix as soon as possible
Diagnosis result: Remaining partition quota: xx
Suggested fix: The partition quota is almost exhausted.
Threat recheck: You can view the current number of partitions on the instance details page.

Server minor version upgrade

Metric level: Important or Fix as soon as possible
Diagnosis result: The current server minor version is xx versions behind the latest minor version
Suggested fix: The latest minor version fixes several known open source bugs and improves the overall performance and stability of the cluster. For service stability, upgrade the server to the latest minor version as soon as possible.
Threat recheck: You can view the minor version details on the instance details page.

TCP connections for a single node

Metric level: Important or Fix as soon as possible
Diagnosis result: Number of TCP connections for a single node: xx
Suggested fix: An excessive number of TCP connections affects the overall stability of the cluster.
  • Check your connection method to determine whether connection objects are repeatedly instantiated. A sketch of reusing a single producer instance follows this entry.
  • A large number of clients can cause severe fragmented sending. Reduce the number of clients and adjust the batch.size and linger.ms sending parameters to aggregate data into batches before sending.
  • If the number of connections continues to increase, some connections may fail.
Threat recheck: You can view the maximum number of TCP connections for an instance node on the dashboard or in Prometheus. An alert is triggered if the number exceeds the specification limit. For more information about specification limits, see Limits.
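
A common cause of connection growth is creating a new producer for every send. The sketch below shows one way to share a single producer process-wide; the endpoint is a placeholder. The Java client is thread-safe and manages its own connection pool, so one instance per process is normally enough.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SharedProducer {
        // One KafkaProducer per process: per-send instantiation only
        // multiplies TCP connections without improving throughput.
        private static final KafkaProducer<String, String> PRODUCER = create();

        private static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "your-endpoint:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            return new KafkaProducer<>(props);
        }

        public static void send(String topic, String value) {
            PRODUCER.send(new ProducerRecord<>(topic, value));
        }
    }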

Public TCP connections for a single node

Metric level: Important or Fix as soon as possible
Diagnosis result: Number of public TCP connections for a single node: xx
Suggested fix: An excessive number of public TCP connections affects the overall stability of the cluster.
  • Check your connection method to determine whether connection objects are repeatedly instantiated.
  • A large number of clients can cause severe fragmented sending. Reduce the number of clients and adjust the batch.size and linger.ms sending parameters to aggregate data into batches before sending.
  • If the number of connections continues to increase, some connections may fail.
  • A public network connection is a heavyweight connection with poor performance. Use it only for development and testing. For production environments, use a VPC connection.
Threat recheck: You can view the maximum number of public TCP connections for an instance node on the dashboard or in Prometheus. An alert is triggered if the number exceeds the specification limit. For more information about specification limits, see Limits.

Synchronous sending issue

Metric level: Important or Fix as soon as possible
Diagnosis result: xx topics have a synchronous sending issue
Suggested fix: The affected topics use a synchronous disk flushing mechanism with acks=all. This results in poor overall sending performance and affects the processing efficiency of the cluster. If your business allows, set acks=1 to significantly improve sending efficiency. For more information, see Best practices for publishers. A configuration sketch follows this entry.
Threat recheck: Check whether the sending client for the affected topics is configured with acks set to all or -1.
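
A minimal sketch of the acks change, assuming your business tolerates the weaker durability guarantee of leader-only acknowledgment:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class AcksConfig {
        public static Properties leaderOnlyAcks() {
            Properties props = new Properties();
            // acks=1: the leader replica acknowledges without waiting for all
            // in-sync replicas, which is much faster than acks=all (-1).
            props.put(ProducerConfig.ACKS_CONFIG, "1");
            return props;
        }
    }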

Fragmented sending issue

Metric level: Important or Fix as soon as possible
Diagnosis result: xx topics have a fragmented sending issue
Suggested fix: The affected topics have a fragmented sending issue. This may cause sending queue timeouts and affect the sending throughput and stability of the cluster. To improve sending performance:
  • Set the batch.size and linger.ms parameters as needed.
  • If there are many partitions, use the sticky partition sending policy.
For more information, see Best practices for publishers. A configuration sketch follows this entry.
Threat recheck: An alert is triggered if the sent batch size is less than 4 KB and the node CPU usage is greater than 60%. Configure the sending client parameters for the affected topics. We recommend that you use client version 2.4.0 or later.
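
The sketch below shows the batching parameters named above; the values are illustrative assumptions to tune against your traffic, not recommended limits.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class BatchingConfig {
        public static Properties batching() {
            Properties props = new Properties();
            // Allow larger per-partition batches (default 16 KB) and give the
            // client a short window to fill them (default 0 ms).
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // 64 KB
            props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
            // On client 2.4.0 and later, records without a key use the sticky
            // partitioner by default, which also helps batches fill up.
            return props;
        }
    }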

Whitelist security group sharing

Metric level: Important or Fix as soon as possible
Diagnosis result: The default endpoint whitelist shares security group ID: xx
Suggested fix: When you deploy an instance, you specify the whitelist security group ID parameter, and multiple instances may share the same whitelist configuration. Modifying the whitelist configuration of one instance therefore also affects the other instances that use the same security group, which magnifies the impact of misoperations on the whitelist configuration. Be aware of the risks related to whitelists.
Threat recheck: Check whether the security group used by the instance is shared with other resources.

Single-partition topic threat

Metric level: Important or Fix as soon as possible
Diagnosis result: There are currently xx single-partition topics for cloud storage
Suggested fix: A single cloud storage partition may become unavailable during a breakdown or upgrade. We recommend that you add more partitions. If you must use a single partition, you can use local storage.
Threat recheck: Check the number of partitions for the affected topics.

Topic partition skew issue

Metric level: Important or Fix as soon as possible
Diagnosis result: There are currently xx topics with partition skew
Suggested fix: Topic partition skew has the following risks:
  • It may prevent you from using the full disk performance and capacity.
  • It may trigger single-node throttling.
Add topic partitions based on the recommended value. For more information about optimization, see What do I do if topic partitions are skewed?. A sketch of adding partitions follows this entry.
Threat recheck: Check whether the number of partitions on the details page of the affected topic meets the recommended number of partitions.
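
If you manage topics with the open source AdminClient rather than the console, a sketch like the following raises a topic to the recommended partition count. The endpoint, topic name, and target count are placeholders; note that Kafka can only increase a partition count, never decrease it.

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class AddPartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "your-endpoint:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                // Raise the topic to the partition count recommended in the console.
                admin.createPartitions(Map.of("your-topic", NewPartitions.increaseTo(12)))
                     .all().get();
            }
        }
    }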

Node sending traffic

Metric level: Important or Fix as soon as possible
Diagnosis result: Node sending traffic has exceeded the limit: xx%
Suggested fix: The node sending traffic has reached its upper limit. For instance stability, upgrade the instance as soon as possible.
Threat recheck: Check the maximum production traffic of the node (bytes/s) in Prometheus. Also check whether the production throttling queue length of the instance (items/s) has caused production throttling.

Node consumption traffic

Metric level: Important or Fix as soon as possible
Diagnosis result: Node consumption traffic has exceeded the limit: xx%
Suggested fix: The node consumption traffic has reached its upper limit. For instance stability, upgrade the instance as soon as possible.
Threat recheck: Check the maximum consumption traffic of the node (bytes/s) in Prometheus. Also check whether the consumption throttling queue length of the instance (items/s) has caused consumption throttling.

Sending traffic

Metric level: Important or Fix as soon as possible
Diagnosis result: Sending traffic has exceeded the limit: xx%
Suggested fix: The cluster production traffic has reached its upper limit. Some production traffic may be throttled, which causes production sending timeouts. To avoid affecting your business and cluster stability, upgrade the instance as soon as possible.
Threat recheck: Check the message production traffic of the instance (bytes/s) in Prometheus. Also check whether the production throttling queue length of the instance (items/s) has caused production throttling.

Consumption traffic

Metric level: Important or Fix as soon as possible
Diagnosis result: Consumption traffic has exceeded the limit: xx%
Suggested fix: The cluster consumption traffic has reached its upper limit. Some consumption traffic may be throttled, which causes some messages to accumulate instead of being consumed promptly. To avoid affecting your business and cluster stability, upgrade the instance as soon as possible.
Threat recheck: Check the message consumption traffic of the instance (bytes/s) in Prometheus. Also check whether the consumption throttling queue length of the instance (items/s) has caused consumption throttling.

Partition assignment policy issue

Metric level: Important or Fix as soon as possible
Diagnosis result: Currently, xx groups have the same partition assigned to multiple consumer threads
Suggested fix: The same partition is assigned to multiple consumer threads for consumption. Check whether there is an issue with the consumer assignment policy. For more information, see Why is the same partition consumed by multiple consumer threads?.
Threat recheck: The consumer client checks whether multiple consumers are consuming the same partition.

Consumer offset commit frequency

Metric level: Important or Fix as soon as possible
Diagnosis result: The consumer client commits consumer offsets xx times per second
Suggested fix: The consumer client commits consumer offsets too frequently, which affects the performance and stability of the cluster. Switch to automatic offset commits or reduce the commit frequency. The diagnosis result lists the top 10 groups with the highest offset commit frequency. For more information about optimization, see Best practices for subscribers. A configuration sketch follows this entry.
Threat recheck: None.
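
A minimal sketch of switching to automatic offset commits; the 5-second interval is the client default and is shown only for illustration.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class AutoCommitConfig {
        public static Properties autoCommit() {
            Properties props = new Properties();
            // Commit offsets in the background at a bounded rate instead of
            // calling commitSync()/commitAsync() for every record.
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
            props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 5000);
            return props;
        }
    }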

Groups with rebalancing within one day

Metric level: Normal or Important
Diagnosis result: xx groups triggered a rebalancing event within one day
Suggested fix: The affected groups triggered rebalancing within the last day. For the specific time, see the group details page. Also confirm whether this is caused by an improper configuration of the consumer client. For more information, see Why does the consumer client frequently perform rebalancing?.
Threat recheck: Check the rebalancing details of the affected group.

Disk cold read

Metric level: Important or Fix as soon as possible
Diagnosis result: Cold read degree: xx%
Suggested fix: Disk cold reads are occurring: consumers need to read a large amount of data from the disk, which affects cluster performance and stability. This may be caused by high consumption latency that leads to message accumulation. You can increase the consumption rate or reset the consumer offset.
Threat recheck: An alert is triggered when disk input/output operations per second (IOPS) or bandwidth usage exceeds 70%.

Disk write protection

Metric level: Important or Fix as soon as possible
Diagnosis result: Disk write protection is triggered
Suggested fix: The current disk usage is too high, and disk write protection has been triggered. Resolve this issue immediately.
Threat recheck: Check in Prometheus whether the instance disk usage exceeds 90%, which triggers disk write protection.

Consumer offset rollback

Metric level: Important or Fix as soon as possible
Diagnosis result: xx groups have consumer offset rollbacks
Suggested fix: The affected groups have consumer offset rollbacks. An offset reset may have been triggered.
Threat recheck: The consumer client checks whether a historical offset was committed, which caused the consumer offset to roll back.

Use of GZIP compression

Metric level: Important or Fix as soon as possible
Diagnosis result: xx topics use GZIP compression
Suggested fix: The affected topics use GZIP compression, which increases the load on the cluster. Resolve this issue immediately. A configuration sketch follows this entry.
Threat recheck: The production client checks whether GZIP compression is configured.
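
The document does not prescribe a replacement codec, but if your business allows, a lighter algorithm such as lz4 reduces broker-side CPU cost. A sketch under that assumption:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class CompressionConfig {
        public static Properties lighterCompression() {
            Properties props = new Properties();
            // lz4 (or snappy/zstd) is far cheaper for the broker to handle than
            // gzip, at the cost of a somewhat lower compression ratio.
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
            return props;
        }
    }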

Possible early message cleanup

Metric level: Important or Fix as soon as possible
Diagnosis result: Messages may be deleted before their time-to-live (TTL) expires
Suggested fix: The single-disk capacity is small and the disk usage is high, so messages may be deleted early. Monitor this issue and handle it promptly.
Threat recheck: Check the disk capacity and the current maximum disk usage.

Server major version has expired

Metric level: Important or Fix as soon as possible
Diagnosis result: The server major version is too low and has expired
Suggested fix: The server major version is too low and has known open source bugs. For service stability, upgrade the server major version as soon as possible by following the instructions in the documentation.
Threat recheck: The server major version is 0.10.x.

Consumer offset contains leader_epoch

Metric level: Important or Fix as soon as possible
Diagnosis result: xx groups have leader_epoch records when committing offsets
Suggested fix: The current consumer client carries leader_epoch records when it commits offsets. This may cause consumption abnormalities or even consumption failures. Upgrade the client to version 2.6 or later immediately. For more information, see the open source issue.
Threat recheck: The server major version is 2.2.0, and the consumer client carries the leader_epoch field when committing consumer offsets. This occurs when the client version is between 2.3 and 2.6.

Local storage threat

Metric level: Important or Fix as soon as possible
Diagnosis result: xx topics use local storage
Suggested fix: Local storage topics (LocalTopic) have many open source issues. We are not responsible for business losses caused by defects in open source Kafka or its third-party components, or by improper configuration and use.
Threat recheck: The affected topics use local storage.

Scheduled ECS restart

Metric level: Important or Fix as soon as possible
Diagnosis result: There are xx nodes in the cluster with scheduled ECS restart events
Suggested fix: The cluster has a scheduled ECS restart event. During the scheduled restart, slight service traffic jitter may occur. Monitor the situation closely.
Threat recheck: A node in the cluster has an O&M event, and this threat notification informs you of that event.

Unused Connector service exists for a long time

Metric level: Normal or Important
Diagnosis result: The Connector service will be charged soon. If you do not need it, release the resource as soon as possible.
Suggested fix: To release the resource, go to the Connector Ecosystem Integration/Task List page.
Threat recheck: Go to the specified page to view the Connector service.

Unused message retrieval service exists for a long time

Metric level: Normal or Important
Diagnosis result: The message retrieval service will be charged soon. If you do not need it, release the resource as soon as possible.
Suggested fix: To release the resource, go to the Message Retrieval page.
Threat recheck: Go to the specified page to view the message retrieval service.

Unused cloud migration service exists for a long time

Metric level: Normal or Important
Diagnosis result: The cloud migration service will be charged soon. If you do not need it, release the resource as soon as possible.
Suggested fix: To release the resource, go to the Migration Task page.
Threat recheck: Go to the specified page to view the cloud migration service.

Message batch is too large

Metric level: Important or Fix as soon as possible
Diagnosis result: xx topics have a message batch that is too large. This may cause sending requests to be rejected. Address this issue promptly.
Suggested fix: If a message batch is too large, the message cannot reach the server. You can increase the maximum message size on the server or decrease the value of the client parameter max.request.size. For more information, see Limits. A configuration sketch follows this entry.
Threat recheck: Check whether the batch.size configuration of the production client is too large.
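
On the client side, the cap on a single produce request is max.request.size. A sketch with an illustrative 1 MB value; check the Limits topic for the limit that applies to your instance.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class RequestSizeConfig {
        public static Properties boundedRequestSize() {
            Properties props = new Properties();
            // Keep each produce request below the server-side message size limit
            // so oversized batches fail on the client instead of at the broker.
            props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1048576); // 1 MB, illustrative
            return props;
        }
    }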

Too many messages in a batch

Metric level: Important or Fix as soon as possible
Diagnosis result: xx topics have batches that contain too many messages. This may cause sending requests to be rejected. Address this issue promptly.
Suggested fix: If a batch contains too many messages, sending fails. Decrease the batch.size value to prevent overflow. For more information, see Limits.
Threat recheck: Check whether the batch.size configuration of the production client is too large and whether individual messages are small.

Server network architecture version is too low

Metric level: Important or Fix as soon as possible
Diagnosis result: The server network architecture version is too low. The network architecture has poor security and needs to be unpublished.
Suggested fix: Create a new instance (Serverless, subscription, or pay-as-you-go), go to the ApsaraMQ for Kafka console, and use the migration feature to migrate the old instance to the new one. After the migration is complete, you can release the old instance.
Threat recheck: The instance network architecture is outdated and scheduled to be unpublished.

Procedure

  1. Log on to the ApsaraMQ for Kafka console. In the Resource Distribution section of the Overview page, select the region where the ApsaraMQ for Kafka instance that you want to manage resides.

  2. On the Instances page, click the name of the target instance.

  3. On the Instance Details page, click the Instance Risks tab.

    On the Instance Risks tab, you can view the instance threats.

    The following information is displayed for each threat:

    • Risk Type: The type of the current instance threat. Example: Group with Long Consumption Time.

    • Metric Level: The level of the current instance metric. Valid values: Repair Required, Important, and Normal. Example: Important.

    • Risk Status: The health status of the current instance. Valid values: To Be Fixed and Fixed. Example: To Be Fixed.

    • Time of Last Alert: The time when this threat was last detected. Example: March 31, 2022.

    • Actions: The actions that can be performed on the current instance threat item:

      • Details: View the details and suggested fixes for the current instance threat. In the Actions column of the target threat, click Details.

      • Modify Alert Status: After a threat is fixed, you can set the Threat Status to Fixed, or ignore an unfixed threat for the next month. In the Actions column of the target threat, click Modify Alert Status.

        Note: After a threat is fixed, no more alert notifications are sent for it. If the same threat occurs again after it is fixed, the system sends another alert notification after 7 days.

      • Delete: After a threat is fixed and its status is changed to Fixed, you can delete the threat. In the Actions column of the target threat, click Delete.

        Suggestion: After you change the Threat Status to Fixed, wait about 7 days before you delete the threat. This prevents new alerts from being generated by dirty data that has not yet been cleaned up.

References

For information about other common issues and solutions for instances, see the FAQ.