Set Kafka Alert Rules to Detect Issues Early - ApsaraMQ for Kafka

When your Kafka workloads experience message accumulation, disk pressure, or traffic throttling, you need visibility into what is happening and a way to get notified before the problem escalates. ApsaraMQ for Kafka integrates with Cloud Monitor to provide real-time metrics for instances, topics, and consumer groups, along with configurable alert rules that notify you through phone calls, text messages, emails, or DingTalk chatbot messages.

Cloud Monitor is free for ApsaraMQ for Kafka.

Prerequisites

Before you begin, make sure you have:

The service-linked role AliyunServiceRoleForAlikafka with the AliyunServiceRolePolicyForAlikafka policy attached -- this allows ApsaraMQ for Kafka to access other Alibaba Cloud services, such as Cloud Monitor and Application Real-Time Monitoring Service (ARMS), for monitoring and dashboard features

For details, see Service-linked roles.

View monitoring data

Log on to the ApsaraMQ for Kafka console.
In the Resource Distribution section of the Overview page, select the region where your instance resides.
On the Instances page, click the name of the target instance.
In the left-side navigation pane, choose Observability > CloudMonitor.
On the Monitoring Chart tab, set a time range.

Charts for all metrics of the selected resource are displayed automatically.

Create an alert rule

Log on to the ApsaraMQ for Kafka console.
In the Resource Distribution section of the Overview page, select the region where your instance resides.
On the Instances page, click the name of the target instance.
In the left-side navigation pane, choose Observability > CloudMonitor.
Click the Alert Rule tab, and then click Create Alert Rule.
In the Create Alert Rule panel, configure the alert rule and notification method, and then click OK.

To modify an existing rule, find the rule and click Modify in the Actions column.
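To reason about what a threshold rule does before you create it in the console, the following minimal sketch models one in Python. It is an illustration only, not Cloud Monitor's implementation: the AlertRule class, the should_alert function, the 80% threshold, and the "breach for N consecutive one-minute datapoints" condition are all assumptions made for this example. Only the metric ID comes from the metrics reference on this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical, simplified model of a threshold alert rule."""
    metric_id: str            # for example "DiskInstanceRatioV3" from the metrics reference
    threshold: float          # value the metric is compared against
    consecutive_periods: int  # number of datapoints in a row that must breach


def should_alert(rule: AlertRule, datapoints: Sequence[float]) -> bool:
    """Return True if the most recent datapoints all meet or exceed the threshold."""
    n = rule.consecutive_periods
    if n <= 0 or len(datapoints) < n:
        return False
    return all(value >= rule.threshold for value in datapoints[-n:])


# Assumed rule: alert when disk usage stays at or above 80% for 3 datapoints.
disk_rule = AlertRule(metric_id="DiskInstanceRatioV3", threshold=80.0, consecutive_periods=3)

print(should_alert(disk_rule, [70.0, 82.0, 85.0, 91.0]))  # True
print(should_alert(disk_rule, [82.0, 85.0, 79.0, 91.0]))  # False
```

Requiring several consecutive breaches is one common way to keep a single spike from triggering a notification. The conditions that the Create Alert Rule panel actually offers may differ.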

View alert details

Log on to the ApsaraMQ for Kafka console.
In the Resource Distribution section of the Overview page, select the region where your instance resides.
On the Instances page, click the name of the target instance.
In the left-side navigation pane, choose Observability > CloudMonitor.
Click the Alert Rule tab. Find the target rule and click Details in the Actions column.

Metrics reference

All metrics are aggregated at one-minute intervals. Traffic metrics are reported in bytes per second (B/s) and represent the average value over each one-minute period. The exception is public network bandwidth, which is reported in bit/s. Metric data has a one-minute latency.
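Because each traffic datapoint is a one-minute average in B/s, you can estimate the volume transferred in a period by multiplying the average by 60 seconds. The following sketch shows the arithmetic. The function name and the sample values are hypothetical.

```python
SECONDS_PER_PERIOD = 60  # metrics are aggregated at one-minute intervals


def bytes_transferred(avg_rates_bps: list[float]) -> float:
    """Estimate total bytes from consecutive one-minute average rates in B/s."""
    return sum(rate * SECONDS_PER_PERIOD for rate in avg_rates_bps)


# Three consecutive one-minute datapoints (hypothetical values, in B/s).
samples = [1_000_000.0, 1_500_000.0, 500_000.0]

print(bytes_transferred(samples))  # 180000000.0
```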

Instance metrics

All instance metrics use the dimensions userId and instanceId.

Traffic metrics

Metric name	Metric ID	Unit
Inbound traffic of the instance cluster (including replication traffic)	ClusterMessageInputV3	B/s
Actual inbound traffic of the instance	InstanceMessageInputV3	B/s
Actual outbound traffic of the instance	InstanceMessageOutputV3	B/s
Number of messages produced for the instance	InstanceMessageNumInputV3	count/s
Number of messages consumed for the instance	InstanceMessageNumOutputV3	count/s
Number of message production requests for the instance	InstanceReqsInputV3	count/s
Number of message consumption requests for the instance	InstanceReqsOutputV3	count/s
Public network write bandwidth of the instance	InstanceInternetTxRateV3	bit/s
Public network read bandwidth of the instance	InstanceInternetRxRateV3	bit/s
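The public network bandwidth metrics use bit/s, while the other traffic metrics use B/s. To compare them on the same scale, divide the bit/s value by 8. A minimal sketch, with a hypothetical function name and sample value:

```python
BITS_PER_BYTE = 8


def bits_to_bytes_per_second(rate_bit_s: float) -> float:
    """Convert a rate in bit/s to B/s."""
    return rate_bit_s / BITS_PER_BYTE


# Hypothetical public network bandwidth reading of 80 Mbit/s.
print(bits_to_bytes_per_second(80_000_000))  # 10000000.0
```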

Storage metrics

Metric name	Metric ID	Unit
Instance disk usage	DiskInstanceRatioV3	%
Instance storage size	InstanceDiskLogSizeV3	B

Connection metrics

Metric name	Metric ID	Unit
Maximum connections on a single node (public and private networks)	InstanceMaxConnectionV3	count
Maximum connections on a single node (public network)	InstanceMaxInternetConnectionV3	count
Total connections of the instance (public and private networks)	InstanceTotalConnectionV3	count
Total connections of the instance (public network)	InstanceTotalInternetConnectionV3	count
Usage of maximum connections on a single node (public and private networks)	InstanceMaxConnectionRatioV3	%
Usage of maximum connections on a single node (public network)	InstanceMaxInternetConnectionRatioV3	%

Capacity ratio metrics

Metric name	Metric ID	Unit
Ratio of production traffic on the busiest node to the elastic limit of the node	InstanceMaxNodeInputRatioV3	%
Ratio of consumption traffic on the busiest node to the elastic limit of the node	InstanceMaxNodeOutputRatioV3	%
Ratio of production traffic to the elastic limit	InstanceMessageInputRatioV3	%
Ratio of consumption traffic to the elastic limit	InstanceMessageOutputRatioV3	%
Instance partition usage	PartitionInstanceRatioV3	%
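The ratio metrics above express a current value as a percentage of a limit. The sketch below shows that arithmetic under the assumption that the ratio is simply current traffic divided by the limit. The exact formula that the service uses is not stated on this page, and the function name and the traffic figures are hypothetical.

```python
def usage_ratio(current: float, limit: float) -> float:
    """Return `current` as a percentage of `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current / limit * 100.0


# Hypothetical values: 45 MB/s of production traffic against a 60 MB/s elastic limit.
print(usage_ratio(45_000_000, 60_000_000))  # 75.0
```

A value that approaches 100% indicates little remaining headroom before the limit is reached.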

Throttling metrics

Metric name	Metric ID	Unit
Production throttling duration of the instance	InstanceThrottleTimeP99InputV3	ms
Consumption throttling duration of the instance	InstanceThrottleTimeP99OutputV3	ms

Consumer group metrics

Consumer group metrics track message accumulation (lag) and consumption throughput. A rising accumulation value means consumers are falling behind producers -- scale your consumer group or investigate processing bottlenecks.

Metric name	Metric ID	Dimensions	Unit
Message accumulation	MessageAccumulationV3	userId, instanceId, consumerGroup	count
Number of unconsumed messages of a topic in a consumer group	MessageAccumulationOnetopicV3	userId, instanceId, consumerGroup, topic	count
Number of messages consumed by a consumer group	GroupMessageNumOutputV3	userId, instanceId, consumerGroup	count/s
Number of messages consumed from a topic by a consumer group	GroupMessageNumOutputOnetopicV3	userId, instanceId, consumerGroup, topic	count/s
Number of messages consumed from a partition of a topic by a consumer group	GroupMessageNumOutputOnetopicOnepartitionV3	userId, instanceId, consumerGroup, topic, partition	count/s
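A rising accumulation value is the signal to act on. As an illustration, the following sketch checks whether the most recent one-minute accumulation datapoints are strictly increasing. The function name, the window of five datapoints, and the sample values are assumptions for this example, not part of the service.

```python
def is_lag_rising(accumulation: list[int], window: int = 5) -> bool:
    """Return True if the last `window` datapoints are strictly increasing."""
    if window < 2:
        return False
    recent = accumulation[-window:]
    if len(recent) < window:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical MessageAccumulationV3 datapoints, one per minute.
print(is_lag_rising([120, 100, 150, 300, 700, 1500]))  # True
print(is_lag_rising([500, 480, 510, 490, 505, 495]))   # False
```

A strict-increase check is deliberately simple. Accumulation that fluctuates around a stable level does not trip it, but a steady climb does.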

Topic metrics

Metric name	Metric ID	Dimensions	Unit
Number of partitions with abnormal HA in a topic	TopicAbnormalHaPartitionNumV3	userId, instanceId, topic	count
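Each metric is identified by a metric ID and a fixed set of dimensions. If you script against these metrics, it can help to validate the dimension set before you send a query. The sketch below covers a subset of the metrics listed above and serializes the dimensions as JSON. The helper name, the JSON shape, and the placeholder values are assumptions. Check the Cloud Monitor API reference for the format that its query operations expect.

```python
import json

# Dimension keys per metric ID, copied from the tables above (subset).
REQUIRED_DIMENSIONS = {
    "DiskInstanceRatioV3": ("userId", "instanceId"),
    "MessageAccumulationV3": ("userId", "instanceId", "consumerGroup"),
    "MessageAccumulationOnetopicV3": ("userId", "instanceId", "consumerGroup", "topic"),
    "TopicAbnormalHaPartitionNumV3": ("userId", "instanceId", "topic"),
}


def build_dimensions(metric_id: str, **values: str) -> str:
    """Return the dimensions for `metric_id` as a JSON string (hypothetical helper)."""
    required = REQUIRED_DIMENSIONS[metric_id]
    missing = [key for key in required if key not in values]
    if missing:
        raise ValueError(f"{metric_id} is missing dimensions: {missing}")
    return json.dumps({key: values[key] for key in required})


# Placeholder identifiers, not real resources.
print(build_dimensions(
    "MessageAccumulationV3",
    userId="1234567890",
    instanceId="alikafka_example",
    consumerGroup="orders-consumer",
))
```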

References

To monitor resources with the dashboard, see Dashboard.
For answers to common monitoring questions, see Monitoring and alerts FAQ.