When your Kafka workloads experience message accumulation, disk pressure, or traffic throttling, you need visibility into what is happening and a way to get notified before the problem escalates. ApsaraMQ for Kafka integrates with Cloud Monitor to provide real-time metrics for instances, topics, and consumer groups, along with configurable alert rules that notify you through phone calls, text messages, emails, or DingTalk chatbot messages.
Cloud Monitor is free for ApsaraMQ for Kafka.
Prerequisites
Before you begin, make sure you have:
The service-linked role AliyunServiceRoleForAlikafka, with the AliyunServiceRolePolicyForAlikafka policy attached. This role allows ApsaraMQ for Kafka to access other Alibaba Cloud services, such as Cloud Monitor and Application Real-Time Monitoring Service (ARMS), for monitoring and dashboard features.
For details, see Service-linked roles.
View monitoring data
Log on to the ApsaraMQ for Kafka console.
In the Resource Distribution section of the Overview page, select the region where your instance resides.
On the Instances page, click the name of the target instance.
In the left-side navigation pane, choose Observability > CloudMonitor.
On the Monitoring Chart tab, set a time range.
Charts for all metrics of the selected resource are displayed automatically.
Create an alert rule
Log on to the ApsaraMQ for Kafka console.
In the Resource Distribution section of the Overview page, select the region where your instance resides.
On the Instances page, click the name of the target instance.
In the left-side navigation pane, choose Observability > CloudMonitor.
Click the Alert Rule tab, and then click Create Alert Rule.
In the Create Alert Rule panel, configure the alert rule and notification method, and then click OK.
To modify an existing rule, find the rule and click Modify in the Actions column.
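The exact rule options are set in the Create Alert Rule panel. As a rough illustration of how a threshold-style rule behaves, the following minimal sketch checks whether a metric stayed at or above a threshold for several consecutive one-minute samples. The function name `breaches_threshold`, the `consecutive` parameter, and the sample values are hypothetical and are not part of any Cloud Monitor API.

```python
from typing import Sequence


def breaches_threshold(samples: Sequence[float],
                       threshold: float,
                       consecutive: int = 3) -> bool:
    """Return True if the most recent `consecutive` samples are all
    at or above `threshold`.

    Hypothetical helper: it only mimics the idea of a threshold rule
    and does not reproduce how Cloud Monitor evaluates rules.
    """
    if consecutive <= 0 or len(samples) < consecutive:
        return False
    return all(value >= threshold for value in samples[-consecutive:])


# Example: disk usage (%) sampled once per minute, oldest first.
disk_usage = [70.0, 82.0, 85.0, 91.0]
should_alert = breaches_threshold(disk_usage, threshold=80.0)
```

Requiring several consecutive breaches, rather than a single one, is a common way to avoid notifications for short spikes.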
View alert details
Log on to the ApsaraMQ for Kafka console.
In the Resource Distribution section of the Overview page, select the region where your instance resides.
On the Instances page, click the name of the target instance.
In the left-side navigation pane, choose Observability > CloudMonitor.
Click the Alert Rule tab. Find the target rule and click Details in the Actions column.
Metrics reference
All metrics are aggregated at one-minute intervals, and metric data has a one-minute latency. Traffic metrics are reported in bytes per second (B/s), except public network bandwidth, which is reported in bits per second (bit/s). Each traffic value is the average over the one-minute period.
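Because each traffic data point is a per-second average over one minute, some arithmetic is needed to compare values or estimate volume. The following is a minimal sketch of those conversions. The helper names are hypothetical, and the 60-second interval is taken from the aggregation period described above.

```python
def bytes_in_interval(avg_bytes_per_sec: float, interval_sec: int = 60) -> float:
    """Approximate bytes transferred during one aggregation interval,
    given the average rate reported for that interval."""
    return avg_bytes_per_sec * interval_sec


def bits_to_bytes_per_sec(bits_per_sec: float) -> float:
    """Convert a bit/s value (public network bandwidth metrics)
    to B/s so it can be compared with the other traffic metrics."""
    return bits_per_sec / 8


def to_mib_per_sec(bytes_per_sec: float) -> float:
    """Convert B/s to MiB/s for easier reading."""
    return bytes_per_sec / (1024 * 1024)


# Example: a data point of 1,048,576 B/s averaged over one minute.
rate = 1_048_576.0
total_bytes = bytes_in_interval(rate)   # bytes moved in that minute
rate_mib = to_mib_per_sec(rate)         # same rate in MiB/s
```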
Instance metrics
All instance metrics use the dimensions userId and instanceId.
Traffic metrics
| Metric name | Metric ID | Unit |
|---|---|---|
| Inbound traffic of the instance cluster (including replication traffic) | ClusterMessageInputV3 | B/s |
| Actual inbound traffic of the instance | InstanceMessageInputV3 | B/s |
| Actual outbound traffic of the instance | InstanceMessageOutputV3 | B/s |
| Number of messages produced for the instance | InstanceMessageNumInputV3 | count/s |
| Number of messages consumed for the instance | InstanceMessageNumOutputV3 | count/s |
| Number of message production requests for the instance | InstanceReqsInputV3 | count/s |
| Number of message consumption requests for the instance | InstanceReqsOutputV3 | count/s |
| Public network write bandwidth of the instance | InstanceInternetTxRateV3 | bit/s |
| Public network read bandwidth of the instance | InstanceInternetRxRateV3 | bit/s |
Storage metrics
| Metric name | Metric ID | Unit |
|---|---|---|
| Instance disk usage | DiskInstanceRatioV3 | % |
| Instance storage size | InstanceDiskLogSizeV3 | B |
Connection metrics
| Metric name | Metric ID | Unit |
|---|---|---|
| Maximum connections on a single node (public and private networks) | InstanceMaxConnectionV3 | count |
| Maximum connections on a single node (public network) | InstanceMaxInternetConnectionV3 | count |
| Total connections of the instance (public and private networks) | InstanceTotalConnectionV3 | count |
| Total connections of the instance (public network) | InstanceTotalInternetConnectionV3 | count |
| Usage of maximum connections on a single node (public and private networks) | InstanceMaxConnectionRatioV3 | % |
| Usage of maximum connections on a single node (public network) | InstanceMaxInternetConnectionRatioV3 | % |
Capacity ratio metrics
| Metric name | Metric ID | Unit |
|---|---|---|
| Ratio of production traffic on the busiest node to the elastic limit of the node | InstanceMaxNodeInputRatioV3 | % |
| Ratio of consumption traffic on the busiest node to the elastic limit of the node | InstanceMaxNodeOutputRatioV3 | % |
| Ratio of production traffic to the elastic limit | InstanceMessageInputRatioV3 | % |
| Ratio of consumption traffic to the elastic limit | InstanceMessageOutputRatioV3 | % |
| Instance partition usage | PartitionInstanceRatioV3 | % |
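The ratio metrics above are computed by the service. To make the relationship concrete, the following minimal sketch shows the underlying arithmetic: observed traffic divided by a limit, expressed as a percentage. The function name and the numeric values are hypothetical. The actual elastic limit depends on your instance and is not given here.

```python
def usage_ratio_percent(traffic_bytes_per_sec: float,
                        limit_bytes_per_sec: float) -> float:
    """Return traffic as a percentage of the given limit.

    Hypothetical helper for illustration only. The service reports
    these ratios directly, so you do not need to compute them.
    """
    if limit_bytes_per_sec <= 0:
        raise ValueError("limit_bytes_per_sec must be positive")
    return traffic_bytes_per_sec / limit_bytes_per_sec * 100.0


# Example with made-up numbers: 30 B/s of traffic against a 120 B/s limit.
ratio = usage_ratio_percent(30.0, 120.0)
```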
Throttling metrics
| Metric name | Metric ID | Unit |
|---|---|---|
| Production throttling duration of the instance | InstanceThrottleTimeP99InputV3 | ms |
| Consumption throttling duration of the instance | InstanceThrottleTimeP99OutputV3 | ms |
Consumer group metrics
Consumer group metrics track message accumulation (lag) and consumption throughput. A rising accumulation value means consumers are falling behind producers. In that case, scale out your consumer group or investigate processing bottlenecks.
| Metric name | Metric ID | Dimensions | Unit |
|---|---|---|---|
| Message accumulation | MessageAccumulationV3 | userId, instanceId, consumerGroup | count |
| Number of unconsumed messages of a topic in a consumer group | MessageAccumulationOnetopicV3 | userId, instanceId, consumerGroup, topic | count |
| Number of messages consumed by a consumer group | GroupMessageNumOutputV3 | userId, instanceId, consumerGroup | count/s |
| Number of messages consumed from a topic by a consumer group | GroupMessageNumOutputOnetopicV3 | userId, instanceId, consumerGroup, topic | count/s |
| Number of messages consumed from a partition of a topic by a consumer group | GroupMessageNumOutputOnetopicOnepartitionV3 | userId, instanceId, consumerGroup, topic, partition | count/s |
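To judge whether accumulation is rising rather than fluctuating, you can fit a trend to recent samples. The following minimal sketch computes a least-squares slope over one-minute accumulation samples. The function names, the `min_points` default, and the sample values are hypothetical. How you retrieve the samples is outside the scope of this sketch.

```python
from typing import Sequence


def lag_slope_per_minute(samples: Sequence[float]) -> float:
    """Least-squares slope of accumulation samples taken one minute
    apart (oldest first), in messages per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def lag_is_rising(samples: Sequence[float], min_points: int = 5) -> bool:
    """True if there are enough samples and the fitted trend is upward."""
    return len(samples) >= min_points and lag_slope_per_minute(samples) > 0


# Example: accumulation grows by about 100 messages per minute.
accumulation = [100.0, 200.0, 300.0, 400.0, 500.0]
growing = lag_is_rising(accumulation)
```

A fitted slope is less sensitive to a single noisy data point than comparing only the first and last samples.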
Topic metrics
| Metric name | Metric ID | Dimensions | Unit |
|---|---|---|---|
| Number of partitions with abnormal HA in a topic | TopicAbnormalHaPartitionNumV3 | userId, instanceId, topic | count |
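Each metric requires a specific set of dimensions, as listed in the tables above. The following minimal sketch validates that all required dimensions are supplied for a few of those metrics and serializes them as JSON. The helper `build_dimensions`, the JSON encoding, and the example identifiers are assumptions made for illustration. Check the Cloud Monitor API reference for the format it actually expects.

```python
import json

# Required dimensions for a subset of metrics, copied from the tables above.
REQUIRED_DIMENSIONS = {
    "InstanceMessageInputV3": ("userId", "instanceId"),
    "MessageAccumulationV3": ("userId", "instanceId", "consumerGroup"),
    "MessageAccumulationOnetopicV3": (
        "userId", "instanceId", "consumerGroup", "topic"),
    "GroupMessageNumOutputOnetopicOnepartitionV3": (
        "userId", "instanceId", "consumerGroup", "topic", "partition"),
    "TopicAbnormalHaPartitionNumV3": ("userId", "instanceId", "topic"),
}


def build_dimensions(metric_id: str, **values: str) -> str:
    """Return a JSON object holding exactly the dimensions required
    for `metric_id`. Hypothetical helper, not a Cloud Monitor API."""
    required = REQUIRED_DIMENSIONS[metric_id]
    missing = [key for key in required if key not in values]
    if missing:
        raise ValueError(f"missing dimensions for {metric_id}: {missing}")
    return json.dumps({key: str(values[key]) for key in required})


# Example with placeholder identifiers.
dims = build_dimensions(
    "TopicAbnormalHaPartitionNumV3",
    userId="1234567890",
    instanceId="alikafka_example",
    topic="orders",
)
```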
References
To monitor resources with the dashboard, see Dashboard.
For answers to common monitoring questions, see Monitoring and alerts FAQ.