ApsaraMQ for Kafka periodically runs health checks on each instance through its real-time diagnosis feature. Diagnosed risks are surfaced with suggested fixes, and alert notifications are sent for anomalies to designated contacts.
How it works
Alert notifications
Alerts are triggered only for urgent and unhealthy conditions.
Default behavior: If no alert contact is configured, notifications are sent to the contact person for the Alibaba Cloud account that owns the instance.
Custom contacts: If alert contacts are configured, notifications are sent only to those contacts during their specified time ranges. For details, see Manage alert contacts.
Cloud Monitor integration: Instance risks are synchronized to the Cloud Monitor Event Center. Subscribe to risk events to configure advanced policies, such as noise reduction, notification methods, notification channels, and notification templates. For details, see Manage event subscriptions (Recommended). To subscribe to Kafka risk events in Event Center, use the following settings:
| Field | Value |
|---|---|
| Alibaba Cloud Service | Kafka |
| Event Type | Abnormal |
| Event Name | Instance Risk Alert |
View instance risks in the console
Log on to the ApsaraMQ for Kafka console.
In the Resource Distribution section of the Overview page, select the region where the instance resides.
On the Instances page, click the instance name.
On the Instance Details page, click the Instance Risks tab.
The following table describes the columns on the Instance Risks tab:
| Column | Description | Example |
|---|---|---|
| Risk Type | The type of instance risk. | Group with Long Consumption Time |
| Metric Level | The severity of the risk. Valid values: Repair Required, Important, Normal. | Important |
| Risk Status | The current state of the risk. Valid values: To Be Fixed, Fixed. | To Be Fixed |
| Time of Last Alert | The time when the risk was last detected. | March 31, 2022 |
| Actions | Operations available for the risk item. See descriptions below. | -- |
Manage risk entries
Details: View the diagnosis details and suggested fixes for a risk.
Modify Alert Status: After resolving a risk, set Risk Status to Fixed. Alternatively, ignore an unfixed risk for the next month.
Note: After a risk is marked as Fixed, alert notifications stop. If the same risk recurs, a new alert is sent after 7 days.
Delete: Remove a risk entry after its status is changed to Fixed.
Note: Wait at least 7 days after marking a risk as Fixed before deleting it. This prevents new alerts from being triggered by data that has not yet been cleaned up.
Risk types
If a risk is detected, follow the suggestions in the console to resolve it.
Each section below describes a risk type, its severity levels, the diagnosis output, and the recommended fix.
CPU consumption percentage
High CPU usage is often caused by fragmented sending. An alert is triggered in Prometheus when CPU usage exceeds 70%.
| Severity | Diagnosis |
|---|---|
| Normal | CPU consumption: XX% |
| Important | CPU consumption: XX% |
Fix: Identify and resolve fragmented sending issues to reduce CPU consumption.
Recheck: Monitor the CPU usage metric in Prometheus.
Disk usage
Disk usage directly affects data safety and cluster availability. ApsaraMQ for Kafka takes automatic protective actions at specific thresholds:
85% or higher: Unexpired messages are periodically deleted.
90% or higher: Write protection is activated, blocking all incoming writes to the cluster.
An alert is triggered in Prometheus when disk usage exceeds 80%.
| Severity | Diagnosis |
|---|---|
| Important | Disk usage: XX% |
| Fix as soon as possible | Disk usage: XX% |
Fix: Reduce disk usage by cleaning up unused topics, reducing message retention periods, or upgrading the instance.
Recheck: Monitor the disk usage metric in Prometheus.
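The thresholds described in this section can be summarized in a short sketch. The function name and the action labels are hypothetical and exist only for illustration; the percentages come from the text above.

```python
from typing import List


def disk_actions(usage_percent: float) -> List[str]:
    """Return the reactions that apply at a given disk usage percentage.

    Labels are illustrative only; they are not console or API values.
    """
    actions = []
    if usage_percent > 80:
        # Alert threshold: usage exceeds 80%.
        actions.append("alert")
    if usage_percent >= 85:
        # Unexpired messages are periodically deleted.
        actions.append("delete_unexpired_messages")
    if usage_percent >= 90:
        # Write protection blocks all incoming writes.
        actions.append("write_protection")
    return actions


print(disk_actions(82))
print(disk_actions(91))
```

Reading the result from left to right shows how the reactions stack up as usage grows, which is why the fix should be applied as soon as the first alert arrives.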
Disk skew
Uneven disk usage across the cluster prevents full utilization of disk performance and capacity.
An alert is triggered when the difference between the maximum and minimum disk usage exceeds 50%. The console displays only the maximum disk usage.
| Severity | Diagnosis |
|---|---|
| Important | Disk skew detected |
| Fix as soon as possible | Disk skew detected |
Fix: Add topic partitions based on the recommended value, or trigger the cluster partition balancing feature. For details, see What do I do if topic partitions are skewed?.
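As a rough illustration of the skew condition above, the following sketch compares the highest and lowest per-node disk usage. The function name and the sample values are assumptions; only the 50% threshold comes from this section.

```python
def disk_skew_exceeded(node_usages_percent, threshold=50.0):
    """Return True when max usage minus min usage exceeds the threshold."""
    if not node_usages_percent:
        raise ValueError("at least one node usage value is required")
    skew = max(node_usages_percent) - min(node_usages_percent)
    return skew > threshold


# Hypothetical per-node disk usage percentages.
print(disk_skew_exceeded([20.0, 48.0, 75.0]))
print(disk_skew_exceeded([40.0, 60.0, 85.0]))
```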
Message production format conversion latency
Format conversion during message production affects overall sending performance. This occurs when the sending client version differs from the server version.
| Severity | Diagnosis |
|---|---|
| Important | TP98 production format conversion latency: XX ms |
| Fix as soon as possible | TP98 production format conversion latency: XX ms |
Fix: Align the sending client version with the server version.
Recheck: Check whether the production client version is inconsistent with the server version for the affected topics.
Message consumption format conversion latency
Format conversion during message consumption affects overall consumption performance. This occurs when the consumption client version differs from the server version.
| Severity | Diagnosis |
|---|---|
| Important | TP98 consumption format conversion latency: XX ms |
| Fix as soon as possible | TP98 consumption format conversion latency: XX ms |
Fix: Align the consumption client version with the server version.
Recheck: Check whether the consumption client version is inconsistent with the server version.
Topic format conversion
Topics undergoing format conversion may experience degraded sending performance. This occurs when the production client version differs from the server version.
| Severity | Diagnosis |
|---|---|
| Normal | XX topics have format conversion |
| Important | XX topics have format conversion |
Fix: Align the sending client version with the server version to eliminate format conversion overhead.
Recheck: Check whether the production client version is inconsistent with the server version for the affected topics.
Group subscribes to too many topics
A group that subscribes to too many topics is prone to rebalancing events, which degrades overall consumption performance. An alert is triggered when a group subscribes to more than one topic.
| Severity | Diagnosis |
|---|---|
| Normal | XX groups subscribe to too many topics |
| Important | XX groups subscribe to too many topics |
Fix: Maintain a one-to-one subscription relationship between groups and topics where possible. For details, see Best practices for subscribers.
Use of Sarama Go client
The Sarama Go client has many known issues and is not recommended for production use. An alert is triggered when a consumer client uses Sarama Go to commit offsets.
| Severity | Diagnosis |
|---|---|
| Normal | XX groups use the Sarama Go client for consumption |
| Important | XX groups use the Sarama Go client for consumption |
Fix: Migrate to a different Go client. For details, see Why am I not advised to use the Sarama Go client to send and receive messages?.
Rebalancing timeout
Long rebalancing timeouts are typically caused by setting max.poll.interval.ms to an excessively large value, which delays Kafka rebalancing.
| Severity | Diagnosis |
|---|---|
| Normal | XX groups have rebalancing timeouts |
| Important | XX groups have rebalancing timeouts |
Fix: Reduce the max.poll.interval.ms value.
Recheck: Go to the details page of the affected group to view rebalancing details.
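One way to pick a smaller max.poll.interval.ms value is to derive it from how long a single poll batch takes to process. The sketch below is a heuristic under stated assumptions, not an official sizing rule: the function name, the per-record processing time, and the safety factor are all hypothetical.

```python
def suggest_max_poll_interval_ms(max_poll_records, per_record_ms, safety_factor=2.0):
    """Estimate a max.poll.interval.ms value that covers one full poll batch.

    Heuristic only: records per poll * processing time per record * headroom.
    """
    if max_poll_records <= 0 or per_record_ms <= 0 or safety_factor < 1:
        raise ValueError("inputs must be positive and safety_factor >= 1")
    return int(max_poll_records * per_record_ms * safety_factor)


# Illustrative consumer properties: 500 records per poll, about 20 ms each.
consumer_overrides = {
    "max.poll.records": 500,
    "max.poll.interval.ms": suggest_max_poll_interval_ms(500, 20),
}
print(consumer_overrides)
```

If processing time per record varies widely, lowering max.poll.records is usually a safer lever than raising the interval.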
Consumer client actively leaves the queue
When a consumer client actively leaves a group, it triggers a rebalancing event that disrupts consumption for the entire group.
| Severity | Diagnosis |
|---|---|
| Important | XX group consumers actively leave the queue and trigger a rebalancing event |
| Fix as soon as possible | XX group consumers actively leave the queue and trigger a rebalancing event |
Fix: Check whether the consumer client is stuck or configured improperly. For details, see Why does the consumer client frequently perform rebalancing?.
Recheck: Go to the details page of the affected group to view rebalancing details.
Groups with high consumption latency
High consumption latency typically results from:
A fetch.max.bytes value that is too large on the consumer client.
Poor network conditions between the client and the cluster.
| Severity | Diagnosis |
|---|---|
| Normal | XX groups have high latency in receiving consumed data |
| Important | XX groups have high latency in receiving consumed data |
Fix: Reduce fetch.max.bytes or improve the network environment. For details, see Best practices for subscribers.
Recheck: Monitor the consumption latency of the consumer client.
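To see why both causes above raise latency, it can help to estimate how long one full fetch response takes to transfer. This is a back-of-the-envelope sketch: the function name and the bandwidth figure are assumptions, and it ignores protocol overhead and round-trip time.

```python
def fetch_transfer_seconds(fetch_max_bytes, bandwidth_bytes_per_s):
    """Estimate the time to transfer one full fetch response."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return fetch_max_bytes / bandwidth_bytes_per_s


MIB = 1024 * 1024
link = 10 * MIB  # assumed effective bandwidth: 10 MiB/s

# 52428800 bytes (50 MiB) is the Apache Kafka client default for fetch.max.bytes.
print(fetch_transfer_seconds(52428800, link))
# A smaller, illustrative fetch.max.bytes value of 5 MiB.
print(fetch_transfer_seconds(5 * MIB, link))
```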
Group quota
The group quota is approaching its limit.
| Severity | Diagnosis |
|---|---|
| Normal | Remaining group quota: XX |
| Important | Remaining group quota: XX |
Fix: Delete unused groups or upgrade the instance to increase the quota.
Recheck: View the current number of groups on the instance details page.
Topic quota
The topic quota is approaching its limit.
| Severity | Diagnosis |
|---|---|
| Important | Remaining topic quota: XX |
| Fix as soon as possible | Remaining topic quota: XX |
Fix: Delete unused topics or upgrade the instance to increase the quota.
Recheck: View the current number of topics on the instance details page.
Partition quota
The partition quota is approaching its limit.
| Severity | Diagnosis |
|---|---|
| Important | Remaining partition quota: XX |
| Fix as soon as possible | Remaining partition quota: XX |
Fix: Delete unused partitions or upgrade the instance to increase the quota.
Recheck: View the current number of partitions on the instance details page.
Server minor version upgrade
The server minor version is outdated. Upgrading to the latest minor version fixes known open source bugs and improves cluster performance and stability.
| Severity | Diagnosis |
|---|---|
| Important | The current server minor version is XX versions behind the latest minor version |
| Fix as soon as possible | The current server minor version is XX versions behind the latest minor version |
Fix: Upgrade the server to the latest minor version from the instance details page.
TCP connections for a single node
Too many TCP connections on a single node degrade cluster stability. If the connection count continues to grow, some connections may fail.
An alert is triggered when the connection count exceeds the specification limit. For specification limits, see Limits.
| Severity | Diagnosis |
|---|---|
| Important | Number of TCP connections for a single node: XX |
| Fix as soon as possible | Number of TCP connections for a single node: XX |
Fix:
Check whether connection objects are repeatedly instantiated.
Reduce the number of clients and adjust the batch.size and linger.ms parameters to aggregate data into batches before sending.
Recheck: View the maximum number of TCP connections for an instance node on the dashboard or in Prometheus.
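Repeated instantiation of connection objects is often avoided by creating the client once and sharing it. The following is a minimal sketch of that pattern; `ProducerHolder` and the `factory` argument are hypothetical names, and the factory stands in for whatever call your Kafka client library uses to build a producer.

```python
import threading


class ProducerHolder:
    """Create the client once and hand the same object to every caller."""

    def __init__(self, factory):
        self._factory = factory  # hypothetical callable that builds a client
        self._producer = None
        self._lock = threading.Lock()

    def get(self):
        # The lock keeps two threads from both creating a client.
        with self._lock:
            if self._producer is None:
                self._producer = self._factory()
            return self._producer


created = []
holder = ProducerHolder(lambda: created.append(1) or object())
first = holder.get()
second = holder.get()
print(first is second, len(created))
```

Because every caller receives the same object, the number of connections stays tied to the number of processes rather than the number of send calls.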
Public TCP connections for a single node
Too many public TCP connections on a single node degrade cluster stability.
Public network connections are heavyweight with poor performance. Use them only for development and testing. For production, use VPC connections.
An alert is triggered when the connection count exceeds the specification limit. For specification limits, see Limits.
| Severity | Diagnosis |
|---|---|
| Important | Number of public TCP connections for a single node: XX |
| Fix as soon as possible | Number of public TCP connections for a single node: XX |
Fix:
Check whether connection objects are repeatedly instantiated.
Reduce the number of clients and adjust the batch.size and linger.ms parameters to aggregate data into batches before sending.
Recheck: View the maximum number of public TCP connections for an instance node on the dashboard or in Prometheus.
Synchronous sending
Topics using synchronous disk flushing with acks=all have poor sending performance and reduce the cluster's processing efficiency.
| Severity | Diagnosis |
|---|---|
| Important | XX topics have a synchronous sending issue |
| Fix as soon as possible | XX topics have a synchronous sending issue |
Fix: If your business allows, set acks=1 to improve sending efficiency. For details, see Best practices for publishers.
Recheck: Check whether the sending client for the affected topics is configured with acks set to all or -1.
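The recheck above amounts to testing whether an explicitly configured acks value is all or -1, which are equivalent in Apache Kafka clients. A small sketch of that test follows; the function name is hypothetical. It only evaluates a value that is set explicitly, because the client default for acks depends on the client version.

```python
def is_synchronous_acks(acks):
    """Return True when the acks setting waits for all in-sync replicas."""
    return str(acks).strip().lower() in {"all", "-1"}


print(is_synchronous_acks("all"))
print(is_synchronous_acks(-1))
print(is_synchronous_acks("1"))
```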
Fragmented sending
Fragmented sending causes sending queue timeouts and degrades the cluster's throughput and stability.
An alert is triggered when the size of sent batches is less than 4 KB and the node CPU usage exceeds 60%. For best results, use client version 2.4.0 or later.
| Severity | Diagnosis |
|---|---|
| Important | XX topics have a fragmented sending issue |
| Fix as soon as possible | XX topics have a fragmented sending issue |
Fix:
Adjust the batch.size and linger.ms parameters.
For topics with many partitions, use the sticky partition sending policy.
For details, see Best practices for publishers.
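The alert condition and the first fix in this section can be sketched as follows. The function name and the property values are illustrative assumptions, not recommended settings; batch.size and linger.ms are standard Kafka producer properties.

```python
def is_fragmented_sending(avg_batch_bytes, node_cpu_percent):
    """Mirror the alert condition: batches under 4 KB and CPU above 60%."""
    return avg_batch_bytes < 4 * 1024 and node_cpu_percent > 60


# Illustrative producer properties that let more records accumulate per batch.
producer_overrides = {
    "batch.size": 32768,  # bytes per partition batch (assumed value)
    "linger.ms": 20,      # wait up to 20 ms for a batch to fill (assumed value)
}

print(is_fragmented_sending(2048, 65))
print(is_fragmented_sending(8192, 65))
```

A larger linger.ms trades a small amount of send latency for fewer, fuller requests, so choose a value your latency budget can absorb.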
Whitelist security group sharing
When multiple instances share the same whitelist security group, modifying the whitelist of one instance affects all other instances using that security group. This increases the blast radius of misconfigurations.
| Severity | Diagnosis |
|---|---|
| Important | The default endpoint whitelist shares security group ID: XX |
| Fix as soon as possible | The default endpoint whitelist shares security group ID: XX |
Fix: Check whether the security group is shared with other resources. If so, assign a dedicated security group to the instance.
Single-partition topic risk
A single-partition topic that uses cloud storage may become unavailable during a node failure or upgrade. If a single partition is required, use local storage instead.
| Severity | Diagnosis |
|---|---|
| Important | There are currently XX single-partition topics for cloud storage |
| Fix as soon as possible | There are currently XX single-partition topics for cloud storage |
Fix: Add more partitions to the affected topics.
Recheck: Check the number of partitions for the affected topics.
Topic partition skew
Uneven partition distribution across topics has the following risks:
Prevents full utilization of disk performance and capacity.
May trigger single-node rate limiting.
| Severity | Diagnosis |
|---|---|
| Important | There are currently XX topics with partition skew |
| Fix as soon as possible | There are currently XX topics with partition skew |
Fix: Add topic partitions based on the recommended value. For details, see What do I do if topic partitions are skewed?.
Recheck: Check whether the number of partitions on the details page of the affected topic meets the recommended number.
Node sending traffic
Node sending traffic has reached its upper limit.
| Severity | Diagnosis |
|---|---|
| Important | Node sending traffic has exceeded the limit: XX% |
| Fix as soon as possible | Node sending traffic has exceeded the limit: XX% |
Fix: Upgrade the instance to increase the traffic capacity.
Recheck: Check the maximum production traffic of the node (bytes/s) in Prometheus. Also check whether the production rate limiting queue length (items/second) indicates active rate limiting.
Node consumption traffic
Node consumption traffic has reached its upper limit.
| Severity | Diagnosis |
|---|---|
| Important | Node consumption traffic has exceeded the limit: XX% |
| Fix as soon as possible | Node consumption traffic has exceeded the limit: XX% |
Fix: Upgrade the instance to increase the traffic capacity.
Recheck: Check the maximum consumption traffic of the node (bytes/s) in Prometheus. Also check whether the consumption rate limiting queue length (items/second) indicates active rate limiting.
Sending traffic
Cluster production traffic has reached its upper limit. Rate limiting may apply, causing production sending timeouts.
| Severity | Diagnosis |
|---|---|
| Important | Sending traffic has exceeded the limit: XX% |
| Fix as soon as possible | Sending traffic has exceeded the limit: XX% |
Fix: Upgrade the instance to increase the traffic capacity.
Recheck: Check the message production traffic (bytes/s) in Prometheus. Also check whether the production rate limiting queue length (items/second) indicates active rate limiting.
Consumption traffic
Cluster consumption traffic has reached its upper limit. Rate limiting may apply, causing messages to stack instead of being consumed promptly.
| Severity | Diagnosis |
|---|---|
| Important | Consumption traffic has exceeded the limit: XX% |
| Fix as soon as possible | Consumption traffic has exceeded the limit: XX% |
Fix: Upgrade the instance to increase the traffic capacity.
Recheck: Check the message consumption traffic (bytes/s) in Prometheus. Also check whether the consumption rate limiting queue length (items/second) indicates active rate limiting.
Partition assignment policy
The same partition is assigned to multiple consumer threads.
| Severity | Diagnosis |
|---|---|
| Important | Currently, XX groups have the same partition assigned to multiple consumer threads |
| Fix as soon as possible | Currently, XX groups have the same partition assigned to multiple consumer threads |
Fix: Check the consumer assignment policy. For details, see Why is the same partition consumed by multiple consumer threads?.
Consumer offset commit frequency
Committing consumer offsets too frequently degrades cluster performance and stability. The diagnosis lists the top 10 groups with the highest commit frequency.
| Severity | Diagnosis |
|---|---|
| Important | The consumer client commits consumer offsets XX times per second |
| Fix as soon as possible | The consumer client commits consumer offsets XX times per second |
Fix: Switch to autocommit for consumer offsets, or reduce the commit frequency. For details, see Best practices for subscribers.
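If manual commits are required, one way to reduce the commit frequency is to commit at most once per time interval instead of after every message. The sketch below shows only the throttling logic; `CommitThrottle` is a hypothetical helper, and the actual commit call depends on your client library. The 5-second interval is an assumption that happens to match the Apache Kafka default for auto.commit.interval.ms.

```python
import time


class CommitThrottle:
    """Allow a commit only when the minimum interval has elapsed."""

    def __init__(self, min_interval_s=5.0, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_commit = None

    def should_commit(self):
        now = self._clock()
        if self._last_commit is None or now - self._last_commit >= self._min_interval_s:
            self._last_commit = now
            return True
        return False


# Demonstration with a fake clock instead of real time.
fake_now = [0.0]
throttle = CommitThrottle(5.0, clock=lambda: fake_now[0])
decisions = []
for t in (0.0, 1.0, 5.0, 9.9, 10.0):
    fake_now[0] = t
    decisions.append(throttle.should_commit())
print(decisions)
```

In a consume loop, call `should_commit()` after processing each message and commit offsets only when it returns True, plus once more on shutdown.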
Groups with rebalancing within one day
Groups that triggered rebalancing events within the past day may indicate improper client configuration.
| Severity | Diagnosis |
|---|---|
| Normal | XX groups triggered a rebalancing event within one day |
| Important | XX groups triggered a rebalancing event within one day |
Fix: Check whether the rebalancing is caused by improper consumer client configuration. For details, see Why does the consumer client frequently perform rebalancing?.
Recheck: View the rebalancing details on the affected group's details page.
Disk cold read
Cold reads occur when consumers read large amounts of data from disk instead of the page cache. This degrades cluster performance and stability, and is typically caused by message stacking due to high consumption latency.
An alert is triggered when disk input/output operations per second (IOPS) or bandwidth usage exceeds 70%.
| Severity | Diagnosis |
|---|---|
| Important | Cold read degree: XX% |
| Fix as soon as possible | Cold read degree: XX% |
Fix: Increase the consumption rate or reset the consumer offset to skip the stacked messages.
Disk write protection
Disk write protection is triggered when disk usage exceeds 90%, blocking all incoming writes to the cluster.
| Severity | Diagnosis |
|---|---|
| Important | Disk write protection is triggered. |
| Fix as soon as possible | Disk write protection is triggered. |
Fix: Reduce disk usage immediately by cleaning up data or upgrading the instance.
Recheck: Check in Prometheus whether the instance disk usage exceeds 90%.
Consumer offset rollback
An offset reset may have been triggered, causing consumer offsets to roll back to a previous position.
| Severity | Diagnosis |
|---|---|
| Important | XX groups have consumer offset rollbacks |
| Fix as soon as possible | XX groups have consumer offset rollbacks |
Fix: Check whether a historical offset was committed by the consumer client.
Use of GZIP compression
GZIP compression places a heavier load on the cluster than other compression algorithms.
| Severity | Diagnosis |
|---|---|
| Important | XX topics use GZIP compression |
| Fix as soon as possible | XX topics use GZIP compression |
Fix: Check the production client's compression configuration and switch to a lighter algorithm (such as LZ4 or Snappy).
Possible early message cleanup
When a single disk has limited capacity and high usage, messages may be deleted before their time-to-live (TTL) expires.
| Severity | Diagnosis |
|---|---|
| Important | Messages may be deleted before their time-to-live (TTL) expires |
| Fix as soon as possible | Messages may be deleted before their time-to-live (TTL) expires |
Fix: Monitor disk capacity and current maximum disk usage. Upgrade the instance or reduce message retention to prevent early cleanup.
Server major version has expired
The server major version is too low (0.10.x) and has known open source bugs that affect service stability.
| Severity | Diagnosis |
|---|---|
| Important | The server major version is too low and has expired. |
| Fix as soon as possible | The server major version is too low and has expired. |
Fix: Upgrade the server major version by following the upgrade instructions in the documentation.
Consumer offset contains leader_epoch
When the consumer client carries leader_epoch records while committing offsets, consumption abnormalities or failures may occur. This affects client versions between 2.3 and 2.6 running against server version 2.2.0. For details, see KAFKA-9724.
| Severity | Diagnosis |
|---|---|
| Important | XX groups have leader_epoch records when committing offsets |
| Fix as soon as possible | XX groups have leader_epoch records when committing offsets |
Fix: Upgrade the consumer client to version 2.6 or later.
Local storage risk
Topics using local storage (LocalTopic) have many known open source issues. Alibaba Cloud is not responsible for business losses caused by defects in open source Kafka or its third-party components, or by improper configuration and use.
| Severity | Diagnosis |
|---|---|
| Important | XX topics use local storage |
| Fix as soon as possible | XX topics use local storage |
Fix: Migrate the affected topics to cloud storage.
Scheduled ECS restart
A scheduled ECS restart event exists for nodes in the cluster. During the restart, slight service traffic jitter may occur.
| Severity | Diagnosis |
|---|---|
| Important | There are XX nodes in the cluster with scheduled ECS restart events |
| Fix as soon as possible | There are XX nodes in the cluster with scheduled ECS restart events |
Fix: Monitor the situation closely during the scheduled maintenance window.
Unused Connector service
The Connector service is transitioning to a paid model. Release the resource to avoid charges if it is no longer needed.
| Severity | Diagnosis |
|---|---|
| Normal | The Connector service will be charged soon. |
| Important | The Connector service will be charged soon. |
Fix: Go to Connector Ecosystem Integration/Task List to release the service.
Unused message retrieval service
The message retrieval service is transitioning to a paid model. Release the resource to avoid charges if it is no longer needed.
| Severity | Diagnosis |
|---|---|
| Normal | The message retrieval service will be charged soon. |
| Important | The message retrieval service will be charged soon. |
Fix: Go to the Message Retrieval page to release the service.
Unused cloud migration service
The cloud migration service is transitioning to a paid model. Release the resource to avoid charges if it is no longer needed.
| Severity | Diagnosis |
|---|---|
| Normal | The cloud migration service will be charged soon. |
| Important | The cloud migration service will be charged soon. |
Fix: Go to the Migration Task page to release the service.
Message batch is too large
When a message batch exceeds the server's maximum allowed size, the batch cannot reach the server and sending requests are rejected.
| Severity | Diagnosis |
|---|---|
| Important | XX topics have an issue where the message batch is too large. |
| Fix as soon as possible | XX topics have an issue where the message batch is too large. |
Fix: Increase the maximum message size on the server, or decrease the max.request.size value on the client. For details, see Limits.
Recheck: Check whether the batch.size configuration of the production client is too large.
Too many messages in a batch
When a batch contains too many individual messages, sending fails due to overflow.
| Severity | Diagnosis |
|---|---|
| Important | XX topics have an issue where there are too many messages in a batch. |
| Fix as soon as possible | XX topics have an issue where there are too many messages in a batch. |
Fix: Decrease the batch.size value to prevent overflow. For details, see Limits.
Recheck: Check whether the batch.size configuration is too large or whether individual messages are too small.
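For the recheck above, a rough estimate of how many messages fit into one batch shows how batch.size and message size interact. This sketch ignores per-record and batch header overhead, and the function name and sample sizes are assumptions; the actual per-batch limit is described in Limits.

```python
def estimated_messages_per_batch(batch_size_bytes, avg_message_bytes):
    """Roughly estimate how many messages fill one batch (overhead ignored)."""
    if avg_message_bytes <= 0:
        raise ValueError("avg_message_bytes must be positive")
    return batch_size_bytes // avg_message_bytes


# A 1 MiB batch.size with 100-byte messages versus a 16 KiB batch.size.
print(estimated_messages_per_batch(1024 * 1024, 100))
print(estimated_messages_per_batch(16 * 1024, 100))
```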
Server network architecture version is too low
The instance uses an outdated network architecture with known security weaknesses. This architecture is scheduled to be decommissioned.
| Severity | Diagnosis |
|---|---|
| Important | The server network architecture version is too low. |
| Fix as soon as possible | The server network architecture version is too low. |
Fix: Create a new instance (Serverless, subscription, or pay-as-you-go) and use the migration feature in the ApsaraMQ for Kafka console to migrate from the old instance. After migration is complete, unsubscribe from the old instance.
References
For other common issues and solutions, see FAQ.