The real-time diagnosis feature of ApsaraMQ for Kafka periodically performs health checks on your instances. You can view the diagnosed issues and suggested fixes in the console, and alerts for abnormal items are sent to the specified contacts.
How it works
Alert notification details
- Alert notifications are triggered only for urgent and unhealthy items.
- If you do not add an alert contact, alert notifications are sent by default to the contact person of the Alibaba Cloud account that owns the instance.
- If you add alert contacts, alert notifications are sent to them, but only during the time range specified for each contact. For more information, see Manage alert contacts.
Check items
If a threat is detected in the instance, follow the suggestions in the console to fix it.
| Threat type | Metric level | Diagnosis result | Suggested fix | Threat recheck |
| --- | --- | --- | --- | --- |
| CPU consumption percentage | | CPU consumption: xx% | High CPU usage is often caused by issues such as fragmented sending. You can fix these issues to reduce CPU consumption. | An alert is triggered in Prometheus if the CPU usage exceeds 70%. |
| Disk usage | | Disk usage: xx% | To ensure data security and cluster stability, reduce the disk usage promptly. | An alert is triggered in Prometheus if the instance disk usage exceeds 80%. |
| Disk skew issue | | None | Disk skew in the cluster may prevent you from using the full disk performance and capacity. Add topic partitions based on the recommended value or trigger the cluster partition balancing feature. For more information about optimization, see What do I do if topic partitions are skewed?. | An alert is triggered if the difference between the maximum and minimum disk usage is greater than 50%. The console displays only the maximum disk usage. |
| Time consumed for message production format conversion | | TP98 time consumed for message production format conversion: xx ms | The affected topics have format conversion, which affects the overall sending performance. To resolve this, align the versions of the sending client and the server. | Check whether the version of the production client for the affected topics is inconsistent with the server version. |
| Time consumed for message consumption format conversion | | TP98 time consumed for message consumption format conversion: xx ms | Format conversion is occurring, which affects the overall consumption performance. To resolve this, align the versions of the consumption client and the server. | Check whether the version of the consumption client is inconsistent with the server version. |
| Topic format conversion | | xx topics have format conversion | The affected topics have format conversion, which may affect the overall sending performance. Align the versions of the sending client and the server to reduce performance loss caused by format conversion. | Check whether the version of the production client for the affected topics is inconsistent with the server version. |
| Group subscribes to too many topics | | xx groups subscribe to too many topics | A group that subscribes to too many topics is prone to rebalancing events, which affects the overall consumption performance. If your business allows, maintain a one-to-one subscription relationship between groups and topics. For more information, see Best practices for subscribers. | An alert is triggered if a group subscribes to more than one topic. |
| Use of Sarama Go client | | xx groups use the Sarama Go client for consumption | The affected groups use the Sarama Go client. The Sarama Go client has many known issues and its use is not recommended. For more information, see Why am I not advised to use the Sarama Go client to send and receive messages?. | An alert is triggered if the consumer client uses Sarama Go to commit offsets. |
| Rebalancing timeout | | xx groups have rebalancing timeouts | The affected groups have long rebalancing timeouts. Do not set the rebalancing-related timeout parameters of the consumer client to excessively large values (see the consumer configuration sketch after this table). | Go to the details page of the affected group to view the rebalancing details. |
| Consumer client actively leaves the queue | | xx group consumers actively leave the queue and trigger a rebalancing event | In the affected groups, the consumer client actively leaves the queue and triggers rebalancing. Check the consumer client against the items described in Why does the consumer client frequently perform rebalancing?. | Go to the details page of the affected group to view the rebalancing details. |
| Groups with high latency in receiving consumed data | | xx groups have high latency in receiving consumed data | The affected groups have high consumption latency. For information about the possible causes and how to optimize consumption, see Best practices for subscribers. | The consumer client checks the consumption latency. |
| Group quota | | Remaining group quota: xx | The group quota is almost exhausted. | You can view the current number of groups on the instance details page. |
| Topic quota | | Remaining topic quota: xx | The topic quota is almost exhausted. | You can view the current number of topics on the instance details page. |
| Partition quota | | Remaining partition quota: xx | The partition quota is almost exhausted. | You can view the current number of partitions on the instance details page. |
| Server minor version upgrade | | The current server minor version is xx versions behind the latest minor version | The latest minor version fixes several known open source bugs and improves the overall performance and stability of the cluster. For service stability, upgrade the server to the latest minor version as soon as possible. | You can view the minor version details on the instance details page. |
| TCP connections for a single node | | Number of TCP connections for a single node: xx | An excessive number of TCP connections affects the overall stability of the cluster. | You can view the maximum number of TCP connections for an instance node on the dashboard or in Prometheus. An alert is triggered if the number exceeds the specification limit. For more information about specification limits, see Limits. |
| Public TCP connections for a single node | | Number of public TCP connections for a single node: xx | An excessive number of public TCP connections affects the overall stability of the cluster. | You can view the maximum number of public TCP connections for an instance node on the dashboard or in Prometheus. An alert is triggered if the number exceeds the specification limit. For more information about specification limits, see Limits. |
| Synchronous sending issue | | xx topics have a synchronous sending issue | The affected topics use a synchronous disk flushing mechanism, which affects the overall sending performance. | Check the related configuration of the sending client for the affected topics. |
| Fragmented sending issue | | xx topics have a fragmented sending issue | The affected topics have a fragmented sending issue. This may cause sending queue timeouts and affect the sending throughput and stability of the cluster. To improve sending performance, batch messages on the producer client (see the producer configuration sketch after this table). For more information, see Best practices for publishers. | An alert is triggered if the sent batches are too small. |
| Whitelist security group sharing | | The default endpoint whitelist shares security group ID: xx | When you deploy an instance, you specify the whitelist security group ID parameter. Multiple instances may share the same whitelist configuration. This means that modifying the whitelist configuration of one instance also affects other instances that use the same security group. This increases the impact of misoperations on the whitelist configuration. Be aware of the risks related to whitelists. | Check whether the security group used by the instance is shared with other resources. |
| Single-partition topic threat | | There are currently xx single-partition topics that use cloud storage | A single partition that uses cloud storage may become unavailable during a breakdown or upgrade. We recommend that you add more partitions. If you must use a single partition, you can use local storage. | Check the number of partitions for the affected topics. |
| Topic partition skew issue | | There are currently xx topics with partition skew | Topic partition skew poses risks to the performance and stability of the cluster. For more information about the risks and optimization, see What do I do if topic partitions are skewed?. To increase the partition count to the recommended value, you can also refer to the partition expansion sketch after this table. | Check whether the number of partitions on the details page of the affected topic meets the recommended number of partitions. |
| Node sending traffic | | Node sending traffic has exceeded the limit: xx% | The node sending traffic has reached its upper limit. For instance stability, upgrade the instance as soon as possible. | Check the maximum production traffic of the node (bytes/s) in Prometheus. Also, check whether the production throttling queue length of the instance (items/s) has caused production throttling. |
| Node consumption traffic | | Node consumption traffic has exceeded the limit: xx% | The node consumption traffic has reached its upper limit. For instance stability, upgrade the instance as soon as possible. | Check the maximum consumption traffic of the node (bytes/s) in Prometheus. Also, check whether the consumption throttling queue length of the instance (items/s) has caused consumption throttling. |
| Sending traffic | | Sending traffic has exceeded the limit: xx% | The cluster production traffic has reached its upper limit. Some production traffic may be throttled, which causes production sending timeouts. To avoid affecting your business and cluster stability, upgrade the instance as soon as possible. | Check the message production traffic of the instance (bytes/s) in Prometheus. Also, check whether the production throttling queue length of the instance (items/s) has caused production throttling. |
| Consumption traffic | | Consumption traffic has exceeded the limit: xx% | The cluster consumption traffic has reached its upper limit. Some consumption traffic may be throttled, which causes some message data to accumulate instead of being consumed promptly. To avoid affecting your business and cluster stability, upgrade the instance as soon as possible. | Check the message consumption traffic of the instance (bytes/s) in Prometheus. Also, check whether the consumption throttling queue length of the instance (items/s) has caused consumption throttling. |
| Partition assignment policy issue | | Currently, xx groups have the same partition assigned to multiple consumer threads | The same partition is assigned to multiple consumer threads for consumption. Check whether there is an issue with the consumer assignment policy. For more information, see Why is the same partition consumed by multiple consumer threads?. | The consumer client checks whether multiple consumers are consuming the same partition. |
| Consumer offset commit frequency | | The consumer client commits consumer offsets xx times per second | The consumer client commits consumer offsets too frequently, which affects the performance and stability of the cluster. Switch to automatic offset commits or reduce the commit frequency (see the consumer configuration sketch after this table). The diagnosis result lists the top 10 groups with the highest offset commit frequency. For more information about optimization, see Best practices for subscribers. | None. |
| Groups with rebalancing within one day | | xx groups triggered a rebalancing event within one day | The affected groups triggered rebalancing within the last day. For the specific time, see the group details page. Also confirm whether this is caused by an improper configuration of the consumer client. For more information, see Why does the consumer client frequently perform rebalancing?. | Check the rebalancing details of the affected group. |
| Disk cold read | | Cold read degree: xx% | Disk cold reads are occurring. Consumers need to read a large amount of data from the disk, which affects cluster performance and stability. This may be caused by high consumption latency that leads to message accumulation. You can increase the consumption rate or reset the consumer offset. | An alert is triggered when the disk input/output operations per second (IOPS) or bandwidth usage exceeds 70%. |
| Disk write protection | | Disk write protection is triggered | The current disk usage is too high, and disk write protection has been triggered. Reduce the disk usage immediately. | Check in Prometheus whether the instance disk usage exceeds 90%, which triggers disk write protection. |
| Consumer offset rollback | | xx groups have consumer offset rollbacks | The affected groups have consumer offset rollbacks. An offset reset may have been triggered. | The consumer client checks whether a historical offset was committed, which caused the consumer offset to roll back. |
| Use of GZIP compression | | xx topics use GZIP compression | The affected topics use GZIP compression, which increases the load on the cluster. Optimize the compression configuration immediately (see the producer configuration sketch after this table). | The production client checks whether GZIP compression is configured. |
| Possible early message cleanup | | Messages may be deleted before their time-to-live (TTL) expires | The single-disk capacity is small and the disk usage is high. Messages may be deleted early. Monitor this issue and handle it promptly. | Check the disk capacity and the current maximum disk usage. |
| Server major version has expired | | The server major version is too low and has expired | The server major version is too low and has known open source bugs. For service stability, upgrade the server major version as soon as possible by following the instructions in the documentation. | The server major version is 0.10.x. |
| Consumer offset contains leader_epoch | | xx groups have leader_epoch records when committing offsets | The current consumer client carries leader_epoch records when committing offsets. This may cause consumption abnormalities or even consumption failures. Upgrade the client to version 2.6 or later immediately. For more information, see the open source issue. | The server major version is 2.2.0, and the consumer client carries the leader_epoch field when committing consumer offsets. This occurs when the client version is between 2.3 and 2.6. |
| Local storage threat | | xx topics use local storage | Topics that use local storage are subject to many known open source issues. We are not responsible for business losses caused by defects in open source Kafka or its third-party components, or by improper configuration and use. | The affected topics use local storage. |
| Scheduled ECS restart | | There are xx nodes in the cluster with scheduled ECS restart events | The cluster has a scheduled ECS restart event. During the scheduled restart, slight service traffic jitter may occur. Monitor the situation closely. | A node in the cluster has an O&M event. You are notified of this event through this threat notification. |
| Unused Connector service exists for a long time | | The Connector service will be charged soon. | If you do not need the Connector service, release the resource as soon as possible. To release the resource, go to the Connector Ecosystem Integration > Task List page. | Go to the specified page to view the Connector service. |
| Unused message retrieval service exists for a long time | | The message retrieval service will be charged soon. | If you do not need the message retrieval service, release the resource as soon as possible. To release the resource, go to the Message Retrieval page. | Go to the specified page to view the message retrieval service. |
| Unused cloud migration service exists for a long time | | The cloud migration service will be charged soon. | If you do not need the cloud migration service, release the resource as soon as possible. To release the resource, go to the Migration Task page. | Go to the specified page to view the cloud migration service. |
| Message batch is too large | | xx topics have an issue where the message batch is too large. This may cause sending requests to be rejected. Address this issue promptly. | If a message batch is too large, the message cannot reach the server. You can increase the maximum message size on the server or decrease the value of the client parameter max.request.size. For more information, see Limits. | Check whether the batch.size configuration of the production client is too large. |
| Too many messages in a batch | | xx topics have an issue where there are too many messages in a batch. This may cause sending requests to be rejected. Address this issue promptly. | If a batch contains too many messages, sending fails. Decrease the batch.size value to prevent overflow. For more information, see Limits. | Check whether the batch.size configuration of the production client is too large and whether a single message is small. |
| Server network architecture version is too low | | The network architecture version of the server is outdated. The old architecture provides weaker security and is scheduled to be unpublished. | Create a new instance (Serverless, subscription, or pay-as-you-go) in the ApsaraMQ for Kafka console and use the migration feature to migrate the old instance to the new one. After the migration is complete, you can release the old instance. | The instance network architecture is outdated and scheduled to be unpublished. |
Procedure
1. Log on to the ApsaraMQ for Kafka console. In the Resource Distribution section of the Overview page, select the region where the ApsaraMQ for Kafka instance that you want to manage resides.
2. On the Instances page, click the name of the target instance.
3. On the Instance Details page, click the Instance Risks tab.
4. On the Instance Risks tab, view the instance threats.
| Parameter | Description | Example |
| --- | --- | --- |
| Risk Type | The type of the current instance threat. | Group with Long Consumption Time |
| Metric Level | The level of the current instance metric. Valid values: Repair Required, Important, and Normal. | Important |
| Risk Status | The status of the current instance threat. Valid values: To Be Fixed and Fixed. | To Be Fixed |
| Time of Last Alert | The time when the threat was last detected. | March 31, 2022 |
| Actions | The actions that you can perform on the current threat item: Details, Modify Alert Status, and Delete. For more information, see the descriptions after this table. | None |

- Details: View the details and suggested fixes for the current instance threat. In the Actions column of the target threat, click Details.
- Modify Alert Status: After a threat is fixed, set the threat status to Fixed, or ignore an unfixed threat for the next month. In the Actions column of the target threat, click Modify Alert Status. Note: After a threat is fixed, no more alert notifications are sent for it. If the same threat occurs again after it is fixed, the system sends another alert notification after 7 days.
- Delete: After a threat is fixed and its status is changed to Fixed, you can delete the threat. In the Actions column of the target threat, click Delete. Suggestion: After you change the threat status to Fixed, wait about 7 days before you delete the threat. This prevents new alerts from being generated by dirty data that is not cleaned up in real time.
References
For information about other common issues and solutions for instances, see the FAQ.