Dashboard metrics and query methods - ApsaraMQ for RocketMQ

ApsaraMQ for RocketMQ provides a dashboard for real-time data statistics that uses the metric storage and display capabilities of Alibaba Cloud ARMS Managed Service for Prometheus and Grafana. This feature helps you centrally collect and observe metrics from multiple dimensions to quickly understand the operational status of your business. This topic describes the scenarios, billing, metrics, and usage of the dashboard.

Scenarios

Scenario 1: You need to receive alerts and locate issues in a timely manner when exceptions occur during online message consumption.
Scenario 2: You need to check whether messages are sent as expected in the messaging system when the status of specific online orders is abnormal.
Scenario 3: You need to analyze message traffic trends, traffic distribution, or message volume to understand business trends and make business plans.
Scenario 4: You need to view and analyze the upstream and downstream dependency topologies of applications to upgrade, optimize, or transform the architecture.

Prerequisites

Activate Managed Service for Prometheus.
Create a service-linked role.
- Role name: AliyunServiceRoleForOns
- Policy name: AliyunServiceRolePolicyForOns
- Permissions: Allows ApsaraMQ for RocketMQ to use this role to access other Alibaba Cloud services, such as CloudMonitor and ARMS, to implement features for monitoring, alerting, and dashboards.
- For more information, see Service-linked Role.

Billing

The dashboard metrics for ApsaraMQ for RocketMQ are basic metrics in ARMS Managed Service for Prometheus. Basic metrics are free of charge. Therefore, the dashboard feature is also free.

For more information, see Metrics and Pay-as-you-go.

Concepts

Before you view dashboard metrics, you need to understand the following concepts related to message accumulation.

The following figure shows the status of each message in a queue of a specific topic.

Figure: Message status in a queue

In the preceding figure, ApsaraMQ for RocketMQ calculates the number of messages and the processing duration at different processing stages. The metrics that are used in this process reflect the processing rate and message accumulation in the queue. By monitoring the metrics, you can determine whether exceptions occur during consumption. The following table describes the details of the metrics and the formulas that are used to calculate the metrics.

Category	Metric	Description	Calculation formula
Message quantity	Inflight messages	The messages that a consumer client is processing and for which the client has not returned the consumption results.	Number of inflight messages = Offset of the latest pulled message - Offset of the latest acknowledged message
	Ready messages	The messages that are visible to consumers and are ready for consumption on the ApsaraMQ for RocketMQ broker.	Number of ready messages = Maximum offset - Offset of the latest pulled message
	Consumer lag	The total number of messages that are being processed or are ready to be processed.	Consumer lag = Number of inflight messages + Number of ready messages
Duration	Ready time	For a normal message or an ordered message, the ready time is the time when the message is stored in the broker. For a scheduled message, the ready time is the time that is scheduled for the broker to deliver the message. For a delayed message, the ready time is the time when the specified delay period elapses. For a transactional message, the ready time is the time when a transaction is committed.	N/A
	Ready message queue time	The interval between the current point in time and the ready time of the earliest ready message. This metric indicates how soon a consumer pulls messages.	Ready message queue time = Current time - Ready time of the earliest ready message
	Consumer lag time	The interval between the ready time of the earliest unacknowledged message and the current time. This metric indicates how soon a consumer processes messages.	Consumer lag time = Current time - Ready time of the earliest unacknowledged message
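The message quantity formulas in the preceding table can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `QueueOffsets` class and its field names are hypothetical and are not part of any ApsaraMQ for RocketMQ SDK.

```python
from dataclasses import dataclass


@dataclass
class QueueOffsets:
    """Offsets of one queue as seen by one consumer group (hypothetical names)."""
    max_offset: int        # maximum offset in the queue
    last_pull_offset: int  # offset of the latest pulled message
    committed_offset: int  # offset of the latest acknowledged message


def inflight_messages(o: QueueOffsets) -> int:
    # Pulled by the consumer but not yet acknowledged.
    return o.last_pull_offset - o.committed_offset


def ready_messages(o: QueueOffsets) -> int:
    # Visible on the broker but not yet pulled.
    return o.max_offset - o.last_pull_offset


def consumer_lag(o: QueueOffsets) -> int:
    # Consumer lag = inflight messages + ready messages.
    return inflight_messages(o) + ready_messages(o)


offsets = QueueOffsets(max_offset=100, last_pull_offset=80, committed_offset=70)
print(inflight_messages(offsets), ready_messages(offsets), consumer_lag(offsets))
# prints: 10 20 30
```

Note that the consumer lag is equivalent to the maximum offset minus the offset of the latest acknowledged message, because the offset of the latest pulled message cancels out.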

Metric details

The ApsaraMQ for RocketMQ dashboard provides the following metrics:

Producer: View metrics for a topic, such as the number of messages sent, send success rate, and send latency.
Consumer: View metrics related to a group's subscription to a specific topic, such as consumption volume, consumption success rate, and message accumulation.
Instance Top 20 overview: View the top 20 topics or groups for specific metric values within an instance.
Billing metrics: View metrics for an instance, such as message TPS, API calls, and average message size. These metrics can be used as a reference for estimating billing items.

Important

The collection period for all metrics is 1 minute. ApsaraMQ for RocketMQ supports queries for data from the last 15 days. The maximum time range for a single query is 24 hours.
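If you need to review a period that is longer than 24 hours, one option is to split it into several queries. The following Python sketch shows one way to do this under the limits stated above (15 days of data, at most 24 hours per query). The function name and the error handling are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=24)  # maximum time range for a single query
RETENTION = timedelta(days=15)    # data older than this cannot be queried


def split_query_range(start: datetime, end: datetime, now: datetime):
    """Split [start, end) into consecutive windows of at most 24 hours."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    if start < now - RETENTION:
        raise ValueError("start is outside the 15-day query period")
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + MAX_WINDOW, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


now = datetime(2024, 1, 10)
windows = split_query_range(datetime(2024, 1, 7), datetime(2024, 1, 9, 12), now)
for window_start, window_end in windows:
    print(window_start, "->", window_end)
```

In this example, a 60-hour range is split into two 24-hour windows and one 12-hour window.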

Producer

Metric	Description
Message Production Rate	The message production rate and the API call rate for message production for a topic. Units: messages/second for the message rate; calls/second for the API call rate.
Peak Message Production Rate	The maximum message production rate. Unit: messages/second.
Total Messages Produced	The total number of messages produced in a specific instance. Unit: messages.
Message Production Call Success Rate	The success rate of message production for a topic.
Message Production Call Latency	The latency of message production for a topic. Unit: ms.

Consumer

Metric	Description
Average Consumption Success Rate	The consumption success rate for all messages in a specific instance.
Accumulated Messages (Ready + Inflight)	The total number of accumulated messages in a specific instance, including ready and inflight messages. Unit: messages.
Inflight Messages	The number of messages that are being processed by a consumer client but for which a success response has not been returned. Unit: messages.
Ready Messages	The number of messages that are ready on the ApsaraMQ for RocketMQ server and can be consumed. This metric reflects the scale of messages that have not yet been processed by consumers. Unit: messages.
Ready Message Queue Time	The time difference between the current time and the ready time of the earliest ready message. This metric reflects the latency of unprocessed messages and is a critical measure for time-sensitive services. The metric value in the overview represents the average ready message queue time for the instance. The metric value in a specific chart represents the ready message queue time for a specific group subscribing to a specific topic. Unit: ms.
Message Consumption Rate	The rate at which a group consumes messages. Unit: messages/second
Peak Message Consumption Rate	The maximum message consumption rate. Unit: messages/second
Total Messages Consumed	The total number of messages consumed in a specific instance. Unit: messages.
Consumption Accumulation	The number of accumulated messages for a group, including ready and inflight messages. Unit: messages.
Message Processing Latency	The time it takes for a group to process a message, from the start of consumption to completion. Unit: ms.
Consumer Local Wait Time	The time it takes for a message to be processed after it arrives at the consumer client. Unit: ms.
Consumption Success Rate	The success rate of message consumption.
Consumer Client Access Protocol Ratio	The ratio of consumed messages by protocol type.

Instance Top 20 overview

Metric	Description
Top 20 Topics by Message Production Rate	The top 20 topics with the highest message production rate. Unit: messages/second.
Top 20 GroupIDs by Message Consumption Rate	The top 20 groups with the highest message consumption rate. Unit: messages/second.
Top 20 GroupIDs by Number of Ready Messages	The top 20 groups with the most ready messages. Unit: messages.
Top 20 GroupIDs by Ready Message Queue Time	The top 20 groups with the longest ready message queue time. Unit: ms.
Top 20 GroupIDs by Number of Accumulated Messages (Ready + Inflight)	The top 20 groups with the most accumulated messages. Unit: messages.
Top 20 GroupIDs by Number of Inflight Messages	The top 20 groups with the most inflight messages. Unit: messages.
Top 20 GroupIDs by Consumption Processing Latency	The top 20 groups with the longest consumption processing latency. Unit: ms.
Top 20 GroupIDs by Consumer Local Wait Time	The top 20 groups with the longest consumer local wait time. Unit: ms.
Top 20 Topics by Message Production Call Failure Rate	The top 20 topics with the highest failure rate for message production.
Top 20 GroupIDs by Message Consumption Failure Rate	The top 20 groups with the highest failure rate for message consumption.

Billing metrics

Note

The values of the following billing metrics include multipliers for large messages and advanced features.

Large message multiplier: The unit of measurement is 4 KB. For example, if you send a 16 KB message, the number of API calls is calculated as 16 KB / 4 KB = 4.
Advanced feature multiplier: The number of API calls for messages with advanced features, such as ordered, scheduled, delayed, and transactional messages, is five times the number of API calls for normal messages.
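The two multipliers can be sketched as a small Python function. This is a rough estimate under stated assumptions, not the actual billing logic: the sketch assumes that sizes that are not a multiple of 4 KB are rounded up, that a message counts as at least one call, and that the two multipliers are multiplied together. The function and constant names are hypothetical.

```python
import math

BASE_UNIT_BYTES = 4 * 1024  # unit of measurement for large messages
ADVANCED_MULTIPLIER = 5     # ordered, scheduled, delayed, and transactional messages
ADVANCED_TYPES = {"fifo", "transaction", "delay"}  # message_type label values


def estimated_api_calls(size_bytes: int, message_type: str = "normal") -> int:
    """Estimate the number of API calls counted for one message."""
    # Assumption: round up to whole 4 KB units, with a minimum of one unit.
    size_units = max(1, math.ceil(size_bytes / BASE_UNIT_BYTES))
    # Assumption: the advanced feature multiplier applies on top of the size units.
    feature = ADVANCED_MULTIPLIER if message_type in ADVANCED_TYPES else 1
    return size_units * feature


print(estimated_api_calls(16 * 1024))          # 16 KB normal message
print(estimated_api_calls(4 * 1024, "fifo"))   # 4 KB ordered message
```

The first call reproduces the 16 KB / 4 KB = 4 example above. The second call returns five times the value for a 4 KB normal message.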

Metric	Description
Peak Production TPS	The maximum message production TPS. This metric can be used as a reference for estimating the peak TPS specification in the instance's billing items. Unit: calls/second.
Peak Consumption TPS	The maximum message consumption TPS. This metric can be used as a reference for estimating the peak TPS specification in the instance's billing items. Unit: calls/second.
Peak TPS	The maximum value of the sum of message production TPS and message consumption TPS. This metric can be used as a reference for estimating the peak TPS specification in the instance's billing items. Unit: calls/second.
Total API Calls	The total number of API calls. This metric can be used as a reference for estimating the number of API calls in the instance's billing items. Unit: calls.
Average Message Size	The average size of all produced messages. Unit: bytes.
Production And Consumption TPS	The sum of message production TPS and message consumption TPS. Unit: calls/second.
Daily API Calls	The daily total number of API calls for message production and consumption. Unit: calls.

Metrics Details

Important

When calculating metrics related to message TPS, the number of messages sent and received, or the total number of messages, the base unit is a 4 KB normal message. Multipliers for message size and advanced message types are applied to this base unit.

The following table describes the fields in the metrics.

Field	Value
Metric type	Gauge: A metric that can increase or decrease. Its value represents an instantaneous measurement of the statistical object. For example, the TPS of API calls.
Label	instance_id: The ApsaraMQ for RocketMQ instance ID; topic: The ApsaraMQ for RocketMQ topic; message_type: The message type (normal: normal message, fifo: ordered message, transaction: transactional message, delay: scheduled or delayed message); uid: Your Alibaba Cloud account ID; protocol_type: The protocol type (tcp: TCP, http: HTTP).

Server-side metrics

Metric type	Metric name	Unit	Description	Label
Gauge	rocketmq_instance_requests_threshold	count/s	Instance throttling threshold.	uid instance_id
Gauge	rocketmq_instance_requests_max	count/s	The maximum TPS of an instance per minute. Requests that are throttled are not included. Rule: The maximum value among the 60 TPS samples taken within 1 minute.	uid instance_id

Producer metrics

Metric type	Metric name	Unit	Description	Label
Gauge	rocketmq_producer_requests (commercialCount, billable requests)	count	Number of API calls related to sending messages.	uid instance_id topic message_type="normal|fifo|transaction|delay"
Gauge	rocketmq_producer_messages	message	Number of sent messages.	uid instance_id topic message_type="normal|fifo|transaction|delay"
Gauge	rocketmq_producer_message_size_bytes	byte	Total size of sent messages.	uid instance_id topic message_type="normal|fifo|transaction|delay"
Gauge	rocketmq_producer_send_success_rate	%	Send success rate.	uid instance_id topic
Gauge	rocketmq_producer_failure_api_calls	count	Number of failed API calls for sending messages.	uid instance_id topic
Gauge	rocketmq_producer_send_rt_milliseconds_avg	ms	Average latency of sending messages.	uid instance_id topic
Gauge	rocketmq_producer_send_rt_milliseconds_min	ms	Minimum latency of sending messages.	uid instance_id topic
Gauge	rocketmq_producer_send_rt_milliseconds_max	ms	Maximum latency of sending messages.	uid instance_id topic
Gauge	rocketmq_producer_send_rt_milliseconds_p95	ms	P95 latency of sending messages.	uid instance_id topic
Gauge	rocketmq_producer_send_rt_milliseconds_p99	ms	P99 latency of sending messages.	uid instance_id topic

Consumer metrics

Metric type	Metric name	Unit	Description	Label
Gauge	rocketmq_consumer_requests	count	Number of API calls related to consuming messages.	uid instance_id topic client_group protocol_type="tcp|http"
Gauge	rocketmq_consumer_send_back_requests	count	Number of API calls to send back messages that failed to be consumed.	uid instance_id topic group_id
Gauge	rocketmq_consumer_send_back_messages	message	Messages that failed to be consumed and were sent back by consumers.	uid instance_id topic group_id
Gauge	rocketmq_consumer_messages	message	Number of consumed messages.	uid instance_id topic client_group protocol_type="tcp|http"
Gauge	rocketmq_consumer_message_size_bytes	byte	Size of consumed messages (accumulated over one minute).	uid instance_id topic client_group protocol_type="tcp|http"
Gauge	rocketmq_consumer_ready_and_inflight_messages	message	Message consumption lag (includes ready and inflight messages).	uid instance_id topic group_id
Gauge	rocketmq_consumer_ready_messages	message	Number of ready messages. Actual accumulation: maxOffset - lastPullOffset	uid instance_id topic group_id
Gauge	rocketmq_consumer_inflight_messages	message	Number of inflight messages. Rule: lastPullOffset - committedOffset	uid instance_id topic group_id
Gauge	rocketmq_consumer_queue_time_milliseconds	ms	Message queue time.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_await_time_milliseconds_avg	ms	Average time that a message waits for processing resources on the consumer client.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_await_time_milliseconds_min	ms	Minimum time that a message waits for processing resources on the consumer client.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_await_time_milliseconds_max	ms	Maximum time that a message waits for processing resources on the consumer client.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_await_time_milliseconds_p95	ms	P95 time that a message waits for processing resources on the consumer client.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_await_time_milliseconds_p99	ms	P99 time that a message waits for processing resources on the consumer client.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_process_time_milliseconds_avg	ms	Average message processing latency for a consumer.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_process_time_milliseconds_min	ms	Minimum message processing latency for a consumer.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_process_time_milliseconds_max	ms	Maximum message processing latency for a consumer.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_process_time_milliseconds_p95	ms	P95 message processing latency for a consumer.	uid instance_id topic group_id
Gauge	rocketmq_consumer_message_process_time_milliseconds_p99	ms	P99 message processing latency for a consumer.	uid instance_id topic group_id
Gauge	rocketmq_consumer_consume_success_rate	%	Message consumption success rate.	uid instance_id topic group_id
Gauge	rocketmq_consumer_failure_api_calls	count	Number of failed API calls for consumption.	uid instance_id topic group_id
Gauge	rocketmq_consumer_to_dlq_messages	message	Number of messages sent to the dead-letter queue (DLQ).	uid instance_id topic group_id
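The metric names and labels in the preceding tables can be combined into PromQL selectors when you query the data yourself. The following Python sketch builds such a selector string. The helper name `build_selector` and the label values are hypothetical placeholders.

```python
def escape_label_value(value: str) -> str:
    # Escape backslashes and double quotes inside a PromQL string literal.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_selector(metric: str, **labels: str) -> str:
    """Build a PromQL instant vector selector such as metric{label="value"}."""
    if not labels:
        return metric
    matchers = ",".join(
        f'{name}="{escape_label_value(value)}"'
        for name, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}"


selector = build_selector(
    "rocketmq_consumer_ready_messages",
    instance_id="rmq-cn-example",  # placeholder instance ID
    topic="orders",                # placeholder topic name
    group_id="GID_demo",           # placeholder group ID
)
print(selector)
# prints: rocketmq_consumer_ready_messages{group_id="GID_demo",instance_id="rmq-cn-example",topic="orders"}
```

The labels are sorted only to make the output deterministic. PromQL does not require a specific label order.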

View the dashboard

Log on to the ApsaraMQ for RocketMQ console. In the left-side navigation pane, click Instances.
In the top navigation bar, select a region, such as China (Hangzhou). On the Instances page, click the name of the instance that you want to manage.
Use one of the following methods to view the dashboard:
- On the Instance Details page, click the Dashboard tab.
- In the left-side navigation pane of the Instance Details page, click Dashboard.
- In the left-side navigation pane of the Instance Details page, click Topics. On the page that appears, click the name of the topic that you want to manage. On the Topic Details page, click the Dashboard tab.
- In the left-side navigation pane of the Instance Details page, click Groups. On the page that appears, click the name of the group that you want to manage. On the Group Details page, click the Dashboard tab.

Dashboard FAQ

How do I obtain dashboard metric data?

Log on to the ARMS console with your Alibaba Cloud account.
In the navigation pane on the left, click Integration Center.
On the Integration Center page, enter RocketMQ in the search box and click the search icon.
In the search results, select the Alibaba Cloud service that you want to integrate, such as Alibaba Cloud RocketMQ (4.0) Service. For more information, see Step 1: Integrate monitoring data of an Alibaba Cloud service.
After the integration is successful, click Provisioning in the navigation pane on the left.
In the Cloud Service Area Environment list, click the name of the target environment to go to its details page.
On the Component Management tab, in the Basic Information section, click the region of the Prometheus Instance.
On the Settings tab, you can find different data access methods.

How do I integrate metric data provided by the dashboard of ApsaraMQ for RocketMQ into a self-managed Grafana system?

All metric data on the dashboard of ApsaraMQ for RocketMQ is stored in Alibaba Cloud Managed Service for Prometheus. Follow the procedure in the "How do I obtain dashboard metric data?" section to integrate the monitoring data of ApsaraMQ for RocketMQ into Managed Service for Prometheus and obtain the environment name and HTTP API URL. Then, use the HTTP API URL to integrate the dashboard metric data into your self-managed Grafana system. For more information, see Use an HTTP API URL to connect a Prometheus instance to a self-managed Grafana system.
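Outside Grafana, the same HTTP API URL can in principle be queried directly. The following Python sketch only builds the request URL and sends nothing. It assumes that the HTTP API URL accepts the standard Prometheus `/api/v1/query` path, and it uses a placeholder host. Verify the URL format and any required authentication for your own Prometheus instance.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_instant_query_url(http_api_url: str, promql: str) -> str:
    """Build a URL for a Prometheus instant query (assumed /api/v1/query path)."""
    base = http_api_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder URL. Replace it with the HTTP API URL from the Settings tab.
HTTP_API_URL = "https://prometheus.example.invalid/"
promql = 'rocketmq_instance_requests_max{instance_id="demo"}'

url = build_instant_query_url(HTTP_API_URL, promql)
print(url)

# Round-trip check: decoding the query string returns the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == promql)
```

The expression is URL-encoded because PromQL selectors contain characters such as braces and double quotes that are not safe in a query string.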

How do I understand the average TPS and max TPS of an instance?

Average TPS = Total requests in 1 minute / 60 seconds
Max TPS: Within a 1-minute statistical period, the TPS value is sampled once per second. The max TPS is the highest of these 60 sampled values.

For example:

Assume that an instance produces 60 messages in 1 minute. All messages are normal messages and each is 4 KB in size. The production rate of the instance is 60 messages per minute.

Average instance TPS = 60 calls / 60 seconds = 1 call per second

The max instance TPS is calculated as follows:

If the 60 messages are sent in the first second, the TPS values for each second of that minute are 60, 0, 0, ..., 0.
Max instance TPS = 60 calls per second.
If 40 messages are sent in the first second and 20 messages are sent in the second second, the TPS values for each second of that minute are 40, 20, 0, 0, ..., 0.
Max instance TPS = 40 calls per second.
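The two calculations in this example can be reproduced with a short Python sketch. The function names are hypothetical. The input is a list of 60 per-second call counts for one statistical period.

```python
def average_tps(per_second_calls: list[int]) -> float:
    """Average TPS = total requests in 1 minute / 60 seconds."""
    if len(per_second_calls) != 60:
        raise ValueError("expected 60 one-second samples")
    return sum(per_second_calls) / 60


def max_tps(per_second_calls: list[int]) -> int:
    """Max TPS = the highest of the 60 values sampled once per second."""
    if len(per_second_calls) != 60:
        raise ValueError("expected 60 one-second samples")
    return max(per_second_calls)


burst = [60] + [0] * 59       # all 60 messages sent in the first second
spread = [40, 20] + [0] * 58  # 40 messages, then 20 messages

print(average_tps(burst), max_tps(burst))    # prints: 1.0 60
print(average_tps(spread), max_tps(spread))  # prints: 1.0 40
```

Both cases have the same average TPS but different max TPS values, which is why short bursts can reach the throttling threshold even when the average is low.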