Production Kafka workloads require continuous visibility into throughput, disk pressure, and consumer health. CloudMonitor collects instance, topic, and consumer group metrics for ApsaraMQ for Kafka in real time, and triggers alert notifications through phone calls, text messages, emails, or DingTalk chatbot messages when a metric crosses a threshold.
CloudMonitor is free to use with ApsaraMQ for Kafka.
Metrics
CloudMonitor collects the following metrics for ApsaraMQ for Kafka. All metrics use a 1-minute aggregation cycle with a 1-minute data latency. Values reported in bytes per second (B/s) represent the average over each 1-minute window.
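Each B/s value is a 1-minute average, so short bursts are smoothed out. The following Python sketch shows how a 1-minute average relates to raw byte counts. It assumes the average is taken over per-second byte counts. CloudMonitor performs the aggregation server-side, and the function name below is hypothetical.

```python
# Hypothetical illustration: CloudMonitor aggregates metrics server-side.
# This only shows how a 1-minute average in B/s relates to raw byte counts.

WINDOW_SECONDS = 60  # 1-minute aggregation cycle


def average_bytes_per_second(byte_counts: list[int]) -> float:
    """Average throughput in B/s for one 1-minute window.

    byte_counts holds the bytes transferred in each second of the window
    (an assumed sampling granularity, not CloudMonitor's actual input).
    """
    if len(byte_counts) != WINDOW_SECONDS:
        raise ValueError(f"expected {WINDOW_SECONDS} per-second samples")
    return sum(byte_counts) / WINDOW_SECONDS


# A burst of 6,000,000 bytes in one second followed by 59 idle seconds
# reports the same 1-minute average as a steady 100,000 B/s.
burst = [6_000_000] + [0] * 59
steady = [100_000] * 60
print(average_bytes_per_second(burst))   # 100000.0
print(average_bytes_per_second(steady))  # 100000.0
```

A 1-minute chart therefore cannot distinguish a one-second spike from steady load. Keep this in mind when you compare averaged throughput against per-second limits.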
Instance metrics
| Metric | Description | Unit |
|---|---|---|
| InstanceMessageInput | Average inbound message throughput of the instance. Use this to track producer traffic volume. | B/s |
| InstanceMessageOutput | Average outbound message throughput of the instance. Use this to track consumer traffic volume. | B/s |
| InstanceMessageNumInput | Number of messages produced to the instance per second. | count/s |
| InstanceReqsInput | Inbound request rate of the instance. | count/s |
| InstanceReqsOutput | Outbound request rate of the instance. | count/s |
| instance_disk_capacity | Maximum disk usage across all broker nodes on the instance. | % |
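The following Python sketch shows two practical consequences of the instance metrics. First, dividing InstanceMessageInput by InstanceMessageNumInput gives an approximate average message size. CloudMonitor does not report this derived value directly. Second, instance_disk_capacity reports the maximum across brokers, so a single nearly full broker determines the value even when the other brokers have room. The function names and broker IDs are hypothetical.

```python
def average_message_size(message_input_bps: float, message_num_input: float) -> float:
    """Approximate average size, in bytes, of messages produced to the instance.

    message_input_bps: InstanceMessageInput value (B/s)
    message_num_input: InstanceMessageNumInput value (count/s)
    """
    if message_num_input <= 0:
        return 0.0  # no messages produced in this window
    return message_input_bps / message_num_input


def instance_disk_usage(node_disk_usage: dict[str, float]) -> float:
    """Mirror instance_disk_capacity: the highest disk usage (%) of any broker."""
    return max(node_disk_usage.values())


print(average_message_size(2_048_000, 2_000))  # 1024.0
print(instance_disk_usage({"broker-0": 41.5, "broker-1": 78.0, "broker-2": 52.3}))  # 78.0
```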
Network traffic metrics
| Metric | Description | Unit |
|---|---|---|
| instance_internet_tx.rate | Total outbound Internet bandwidth of the instance. | bit/s |
| instance_internet_rx.rate | Total inbound Internet bandwidth of the instance. | bit/s |
| InstanceInternetRxRateByNode | Inbound Internet bandwidth per broker node. Use this to identify uneven traffic distribution. | bit/s |
| InstanceInternetTxRateByNode | Outbound Internet bandwidth per broker node. | bit/s |
| InstanceInternetRxUtilizationByNode | Inbound Internet bandwidth utilization per broker node. A value near 100% indicates the node is approaching its bandwidth limit. | % |
| InstanceInternetTxUtilizationByNode | Outbound Internet bandwidth utilization per broker node. | % |
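The Internet bandwidth metrics are reported in bit/s, while the instance throughput metrics use B/s. Convert units before you compare values across the two tables. The following Python sketch shows the conversion and how per-node rates can reveal uneven traffic distribution. The per-node bandwidth limit passed to utilization_percent is a hypothetical input, because CloudMonitor calculates InstanceInternetRxUtilizationByNode and InstanceInternetTxUtilizationByNode itself. The node names are also illustrative.

```python
BITS_PER_BYTE = 8


def bytes_to_bits_per_second(bytes_per_second: float) -> float:
    """Convert a B/s value (throughput metrics) to bit/s (bandwidth metrics)."""
    return bytes_per_second * BITS_PER_BYTE


def utilization_percent(rate_bps: float, node_limit_bps: float) -> float:
    """Bandwidth used by one node as a percentage of an assumed per-node limit."""
    return rate_bps * 100 / node_limit_bps


def busiest_node(rates_by_node: dict[str, float]) -> tuple[str, float]:
    """Return the node with the highest rate and its share (%) of the total."""
    total = sum(rates_by_node.values())
    node = max(rates_by_node, key=rates_by_node.__getitem__)
    if total == 0:
        return node, 0.0  # no traffic on any node
    return node, rates_by_node[node] * 100 / total


rx_by_node = {"broker-0": 30_000_000, "broker-1": 10_000_000, "broker-2": 10_000_000}
print(bytes_to_bits_per_second(12_500_000))  # 100000000
print(utilization_percent(95_000_000, 100_000_000))  # 95.0
print(busiest_node(rx_by_node))  # ('broker-0', 60.0)
```

With three brokers, an even distribution would give each node about 33% of the traffic. A 60% share points to a hot node that will hit its bandwidth limit before the instance total does.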
Capacity utilization metrics
| Metric | Description | Unit |
|---|---|---|
| Proportion of Production Traffic in Instance Type | Percentage of the instance type's production traffic capacity in use. Monitor this to avoid hitting throughput limits. | % |
| Proportion of Consumption Traffic in Instance Type | Percentage of the instance type's consumption traffic capacity in use. | % |
| Proportion of Partitions in Instance Type | Percentage of the instance type's partition quota in use. | % |
| Topic Consumption and Production Traffic Ratio | Ratio of topic-level consumption traffic to production traffic. | % |
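The capacity proportions compare current usage with the limits of the instance type. The following Python sketch shows the arithmetic. The capacity figures and the 80% warning level are hypothetical. As the second example shows, a Topic Consumption and Production Traffic Ratio above 100% can be expected when several consumer groups each read the full topic.

```python
from typing import Optional

WARNING_LEVEL_PERCENT = 80.0  # hypothetical threshold; tune for your workload


def proportion_percent(used: float, instance_type_limit: float) -> float:
    """Share of an instance-type limit (traffic or partitions) currently in use."""
    return used * 100 / instance_type_limit


def consumption_production_ratio(consumption_bps: float, production_bps: float) -> Optional[float]:
    """Topic consumption traffic as a percentage of its production traffic."""
    if production_bps == 0:
        return None  # ratio is undefined when nothing is produced
    return consumption_bps * 100 / production_bps


# Hypothetical figures: 40 MB/s produced against a 50 MB/s instance-type limit.
production_share = proportion_percent(40, 50)
print(production_share, production_share >= WARNING_LEVEL_PERCENT)  # 80.0 True

# One topic written at 10 MB/s and read in full by three consumer groups.
print(consumption_production_ratio(30, 10))  # 300.0
```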
View monitoring data
Before you begin, make sure that the AliyunServiceRoleForAlikafka service-linked role exists and has the AliyunServiceRolePolicyForAlikafka policy attached. This role grants ApsaraMQ for Kafka access to CloudMonitor and Application Real-Time Monitoring Service (ARMS). For details, see Service-linked roles.
1. Log on to the ApsaraMQ for Kafka console.
2. In the Resource Distribution section of the Overview page, select the region of your instance.
3. On the Instances page, click the name of your instance.
4. In the left-side navigation pane, choose .
5. On the CloudMonitor page, click Alert Rule. Select a resource type tab (Instance, Topic, or Group), find the target resource, and then click View CloudMonitor Metrics in the Actions column.
6. In the dialog box, select a time range to view metric charts for the resource.
Create an alert rule
1. On the CloudMonitor page, click Alert Rule. Select the Instance, Topic, or Group tab.
2. Click Create Alert Rule. You are redirected to the Create Alert Rule panel in the CloudMonitor console.
3. Configure the rule and notification settings, and then click Confirm. For parameter descriptions, see Create an alert rule.
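CloudMonitor evaluates alert rules for you, so no code is needed to use them. The following Python sketch models a threshold rule to show why a rule that waits for several consecutive breaching periods ignores a single spike. The ThresholdRule class, its fields, and the consecutive-period behavior are assumptions for illustration, not the CloudMonitor API. For the parameters that are actually available, see Create an alert rule.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Comparison operators a rule might use (illustrative subset).
_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class ThresholdRule:
    """Hypothetical alert-rule model; not the CloudMonitor API."""

    metric: str
    comparison: str
    threshold: float
    consecutive_periods: int = 1

    def breaches(self, value: float) -> bool:
        return _COMPARATORS[self.comparison](value, self.threshold)

    def first_alert_index(self, values: list[float]) -> Optional[int]:
        """Index of the 1-minute data point at which the alert would fire, if any."""
        streak = 0
        for i, value in enumerate(values):
            # Extend the streak on a breach; any healthy period resets it.
            streak = streak + 1 if self.breaches(value) else 0
            if streak >= self.consecutive_periods:
                return i
        return None


rule = ThresholdRule("instance_disk_capacity", ">=", 85.0, consecutive_periods=3)
print(rule.first_alert_index([80, 86, 87, 84, 86, 88, 90]))  # 6
print(rule.first_alert_index([80, 95, 70]))  # None
```

Requiring several consecutive periods reduces false alarms from brief spikes. The trade-off is that a sustained problem is reported a few minutes later.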
View alert details
1. On the CloudMonitor page, click Alert Rule. Select the tab for the resource type.
2. Find the target resource and click Alert Rule in the Actions column.
3. In the panel, find the alert rule and click Details in the Actions column to view the rule configuration and alert history. The panel also provides options to disable, enable, or delete the rule.
What's next
Prometheus monitoring: Set up Managed Service for Prometheus for more granular metric collection and custom dashboards.
Monitoring and alerting FAQ: Troubleshoot common monitoring issues.