Metrics Analysis

Observability from the Perspective of the Message Lifecycle

Before diving into the topic, let's look at how RocketMQ producers, consumers, and servers interact:

message produce and consume process

RocketMQ messages are partitioned and stored in order in the form of queues. This queue model gives producers, consumers, and read/write queues a many-to-many mapping, so the system can be scaled out almost without limit, which is a clear advantage over traditional message queues such as RabbitMQ. In stream-processing scenarios in particular, it guarantees that messages in the same queue are processed by the same consumer, which makes it friendlier to batch and aggregation processing.
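To make the queue model concrete, here is a minimal sketch using the RocketMQ 4.x Java client, in which a producer routes all messages with the same sharding key to the same queue via a MessageQueueSelector. The group name, topic, nameserver address, and sharding key below are placeholders for illustration only.

```java
import java.util.List;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.MessageQueueSelector;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageQueue;

public class OrderedProducerSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical producer group and nameserver address.
        DefaultMQProducer producer = new DefaultMQProducer("order_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();

        // Sharding key: messages of the same order always land in the same queue.
        String orderId = "order-1001";
        Message msg = new Message("OrderTopic", "created", orderId.getBytes());

        // Select the queue by hashing the sharding key, so one consumer
        // processes all messages of this order in order.
        SendResult result = producer.send(msg, new MessageQueueSelector() {
            @Override
            public MessageQueue select(List<MessageQueue> mqs, Message m, Object shardingKey) {
                int index = Math.floorMod(shardingKey.hashCode(), mqs.size());
                return mqs.get(index);
            }
        }, orderId);

        System.out.printf("sent to %s, msgId=%s%n", result.getMessageQueue(), result.getMsgId());
        producer.shutdown();
    }
}
```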

Next, let's take a look at the important nodes in the whole life cycle of messages:

message life cycle

The first stage is message sending: the send latency is the time from when the producer sends a message until the server persists it to disk. For a timed (delayed) message, the message only becomes visible to consumers once the scheduled delivery time is reached.

After receiving a message, the server processes it according to its type: a timed or transactional message becomes visible to consumers only when the scheduled time arrives or the transaction is committed. RocketMQ also supports message accumulation, that is, messages do not have to be pulled immediately after reaching the server; they can be delivered according to the consumption capacity of the client.
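As a simple illustration of timed messages, the following sketch sends a delayed message with the RocketMQ 4.x client. The topic, group, and delay level are placeholders; the actual delay mapping depends on the broker's messageDelayLevel configuration (level 3 is 10 seconds under the default settings).

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

public class DelayedMessageSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical producer group and nameserver address.
        DefaultMQProducer producer = new DefaultMQProducer("delay_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();

        Message msg = new Message("DelayTopic", "Hello scheduled".getBytes());
        // The message is stored immediately but only becomes visible to
        // consumers once the scheduled time (delay level 3 ~ 10s by default) is reached.
        msg.setDelayTimeLevel(3);

        producer.send(msg);
        producer.shutdown();
    }
}
```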

From the consumer's perspective, there are three stages to focus on (a client-side measurement sketch follows this list):

• Message pulling: the time from when a pull request is initiated, through network transfer and server-side processing, until the message reaches the client;

• Message queuing: the time spent waiting for processing resources, that is, from when the message arrives at the client until processing starts;

• Message consumption: from the start of processing until the offset is committed or an ACK is returned.
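The sketch below (not part of RocketMQ's built-in Metrics) shows one way these stages can be observed from the client side with the 4.x push consumer: the delivery lag is approximated from the broker store timestamp, and the business handler is timed directly. The topic, group, and handler are placeholders.

```java
import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class ConsumeLatencySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical consumer group, topic, and nameserver address.
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("DemoTopic", "*");

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            for (MessageExt msg : msgs) {
                long start = System.currentTimeMillis();
                // Rough end-to-end lag since the broker persisted the message:
                // covers server-side accumulation, pulling, and client-side queuing.
                long deliveryLag = start - msg.getStoreTimestamp();
                handle(msg); // business logic placeholder
                long processTime = System.currentTimeMillis() - start;
                System.out.printf("msgId=%s deliveryLag=%dms processTime=%dms%n",
                        msg.getMsgId(), deliveryLag, processTime);
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }

    private static void handle(MessageExt msg) {
        // placeholder for real consumption logic
    }
}
```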

Being able to clearly define and observe a message at every stage of its life cycle is the core idea of RocketMQ observability. The Metrics described in this article implements this idea and provides instrumentation points covering every phase of the message lifecycle. With these atomic capabilities, we can build a monitoring system that fits the needs of the business:

• Daily inspection and monitoring alerts;

• Macro trend/cluster capacity analysis;

• Fault diagnosis and troubleshooting.

RocketMQ 4.x Metrics Implementation – Exporter

The rocketmq-exporter contributed by the RocketMQ team has been included in Prometheus's official open source exporter ecosystem, and it provides rich monitoring metrics for brokers, producers, and consumers at each stage.

exporter metrics spec

How the Exporter Works

The process by which the RocketMQ exporter obtains monitoring metrics is shown in the figure below. The exporter requests data from the RocketMQ cluster through MQAdminExt, converts the acquired data into the format required by Prometheus, and then exposes it through the /metrics endpoint.

rocketmq exporter
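A simplified sketch of the kind of admin calls the exporter relies on is shown below, using DefaultMQAdminExt from the RocketMQ 4.x tools module (package names as in 4.x). The nameserver address, broker address, and consumer group are placeholders; the real exporter issues many such calls on a schedule.

```java
import java.util.Map;
import org.apache.rocketmq.common.admin.ConsumeStats;
import org.apache.rocketmq.common.protocol.body.KVTable;
import org.apache.rocketmq.tools.admin.DefaultMQAdminExt;

public class AdminPollSketch {
    public static void main(String[] args) throws Exception {
        DefaultMQAdminExt admin = new DefaultMQAdminExt();
        admin.setNamesrvAddr("127.0.0.1:9876"); // hypothetical nameserver
        admin.start();

        // Broker runtime statistics (e.g. put/get TPS, page cache status), keyed by name.
        KVTable runtimeStats = admin.fetchBrokerRuntimeStats("127.0.0.1:10911"); // hypothetical broker address
        for (Map.Entry<String, String> e : runtimeStats.getTable().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }

        // Consumer lag for a group, aggregated from per-queue broker/consumer offsets.
        ConsumeStats consumeStats = admin.examineConsumeStats("demo_consumer_group"); // hypothetical group
        System.out.println("total diff = " + consumeStats.computeTotalDiff());

        admin.shutdown();
    }
}
```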

As RocketMQ evolved, the exporter approach gradually revealed some shortcomings:

• It cannot support the observability requirements of the proxy and other modules newly added in RocketMQ 5.x;

• The metric definitions do not follow open source conventions, making them hard to use with other open source observability components;

• The large number of RPC calls puts extra pressure on the broker;

• Poor extensibility: adding or modifying a metric requires changing the broker's admin interface first.

To solve these problems, the RocketMQ community decided to embrace community standards and introduced an OpenTelemetry-based Metrics solution in RocketMQ 5.x.

RocketMQ 5.x Native Metrics Implementation

Metrics Based on OpenTelemetry

OpenTelemetry is a CNCF observability project that aims to provide a standardized solution for observability: it standardizes the data model, collection, processing, and export of telemetry data, and offers services that are independent of third-party vendors.

When discussing the new metrics scheme, the RocketMQ community decided to follow the OpenTelemetry specification and completely redesign the metric definitions: the data types are Counter, Gauge, and Histogram, which are compatible with Prometheus, and the naming follows the conventions recommended by Prometheus. The new metrics are not compatible with the old rocketmq-exporter metrics. They cover the broker, proxy, producer, and consumer modules and provide monitoring for every phase of the message lifecycle.
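As a rough illustration of what such OpenTelemetry-style instruments look like in Java, here is a sketch using the OpenTelemetry API. The metric and attribute names below are invented for the example and are not the exact names defined by RocketMQ.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class BrokerMetricsSketch {
    private static final Meter METER = GlobalOpenTelemetry.get().getMeter("rocketmq-broker");

    // Counter: monotonically increasing message total, Prometheus-style snake_case name.
    private static final LongCounter MESSAGES_IN = METER
            .counterBuilder("rocketmq_messages_in_total")
            .setDescription("Number of messages produced to the broker")
            .build();

    // Histogram: latency distribution, from which avg/p90/p99 can later be derived.
    private static final DoubleHistogram SEND_LATENCY = METER
            .histogramBuilder("rocketmq_send_latency")
            .setDescription("Latency of handling send requests")
            .setUnit("ms")
            .build();

    public static void recordSend(String topic, long count, double latencyMs) {
        Attributes attrs = Attributes.of(AttributeKey.stringKey("topic"), topic);
        MESSAGES_IN.add(count, attrs);
        SEND_LATENCY.record(latencyMs, attrs);
    }
}
```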

Metric Reporting Modes

Three ways of reporting metrics are provided:

• Pull mode: suitable for users who operate their own Kubernetes and Prometheus clusters;

• Push mode: suitable for users who want to post-process metrics data or send it to a cloud vendor's observability service;

• Exporter compatibility mode: suitable for users who are already using the exporter or who need to transfer metrics data across data centers (or other network-isolated environments).

Pull

The pull mode is designed to be compatible with Prometheus. In a Kubernetes environment, no extra components need to be deployed: through the community's Kubernetes service discovery mechanism (by creating PodMonitor and ServiceMonitor CRDs), Prometheus automatically discovers the list of brokers/proxies to scrape and pulls metrics data from the endpoints they expose.

pull mode
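For reference, this is roughly how an application exposes a Prometheus-scrapable endpoint with the OpenTelemetry Java SDK; in RocketMQ's case the broker/proxy already expose such endpoints, and the port below is only a placeholder that must match the scrape configuration.

```java
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;

public class PullModeSketch {
    public static void main(String[] args) {
        // Expose an HTTP endpoint (e.g. http://<pod-ip>:9464/metrics) that Prometheus,
        // having discovered the pod through a PodMonitor/ServiceMonitor, can scrape.
        PrometheusHttpServer prometheusReader = PrometheusHttpServer.builder()
                .setPort(9464) // hypothetical port
                .build();

        SdkMeterProvider meterProvider = SdkMeterProvider.builder()
                .registerMetricReader(prometheusReader)
                .build();

        // meterProvider is then used to create Meters and instruments as usual.
    }
}
```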

Push

OpenTelemetry recommends the push mode, which requires deploying a collector to forward the metrics data.

push mode

OpenTelemetry provides an official collector implementation that supports user-defined operations on metrics, such as filtering and enrichment, and you can implement your own collector using plug-ins provided by the community. In addition, most observability services offered by cloud vendors (such as AWS CloudWatch and Alibaba Cloud SLS) have embraced OpenTelemetry, so data can be pushed directly to the collectors they provide without any extra bridging components.

OpenTelemetry collector
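A minimal sketch of the push mode with the OpenTelemetry Java SDK, assuming a collector reachable at a placeholder OTLP/gRPC endpoint and a 60-second export interval:

```java
import java.time.Duration;
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

public class PushModeSketch {
    public static void main(String[] args) {
        // Push metrics over OTLP/gRPC to a collector; the endpoint is a placeholder
        // for wherever the OpenTelemetry collector (or a cloud vendor's collector) is reachable.
        OtlpGrpcMetricExporter exporter = OtlpGrpcMetricExporter.builder()
                .setEndpoint("http://otel-collector:4317")
                .build();

        PeriodicMetricReader reader = PeriodicMetricReader.builder(exporter)
                .setInterval(Duration.ofSeconds(60)) // export interval
                .build();

        SdkMeterProvider meterProvider = SdkMeterProvider.builder()
                .registerMetricReader(reader)
                .build();

        // Instruments created from this provider are exported periodically to the collector.
    }
}
```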

Compatibility with RocketMQ Exporter

The new Metrics is also compatible with the RocketMQ exporter: users who already run the exporter can access the new Metrics without changing their deployment architecture. Moreover, control-plane applications (such as Prometheus) and data-plane applications (such as RocketMQ) may be deployed separately, so using the exporter as a proxy to obtain the new Metrics data is a good choice.

The RocketMQ community has embedded an OpenTelemetry collector implementation in the exporter. The broker exports metrics data to the exporter, which exposes a new endpoint (metrics-v2 in the figure below) for Prometheus to scrape.

exporter mode

Best Practices for Building a Monitoring System

Rich metric coverage and adherence to community standards make it easy to build a monitoring system that suits business needs on top of RocketMQ's Metrics capabilities. This section walks through the best practices of building such a system along a typical workflow:

cluster monitoring/inspection -> alert triggering -> troubleshooting and analysis.

Cluster Status Monitoring and Inspection

Once the metrics have been collected into Prometheus, we can configure monitoring based on them. Here are some examples.

Interface monitoring:

Monitor interface calls so that abnormal requests can be caught quickly.

The following figure shows some related examples: latency (avg, p90, p99, etc.), success rate, failure reasons, call volume, and return-value distribution of all RPCs.

rpc metrics

Client monitoring:

Monitor client usage and catch unexpected behavior, such as sending oversized messages, clients going online and offline, and client version management.

The following figure shows some related examples: number of client connections, client language/version distribution, and size/type distribution of messages sent.

client metrics

Broker monitoring:

Monitor the broker's utilization and service quality so that cluster capacity bottlenecks can be found in time.

The following figure shows some related examples: dispatch latency, message retention time, thread-pool queuing, and message accumulation.

broker metrics

The examples above are just the tip of the iceberg. Different metrics should be combined flexibly to configure monitoring and inspection according to business needs.

Alert Configuration

With monitoring in place, you can configure alerts for the metrics that need attention. For example, you can configure an alert on the dispatch latency metric in broker monitoring:

broker alert

After receiving an alert, you can drill into the monitoring to find the specific cause. For example, from the failure rate of the associated send interface you may find that 1.7% of sends fail, with the corresponding error indicating that the subscription group has not been created:

problem analysis

Troubleshooting and Analysis

Finally, let's take the message accumulation scenario as an example to see how to analyze online problems based on Metrics.

Looking at Accumulation from the Message Lifecycle

As mentioned at the beginning of this article, RocketMQ problems need to be analyzed holistically against the message lifecycle. Assuming one-sidedly that the fault lies with the server or the client can lead the investigation down the wrong path.

For the accumulation problem, we mainly focus on two states in the message lifecycle:

• Ready messages: messages that are available for consumption but have not yet been pulled, that is, messages accumulated on the server;

• In-flight messages: messages that have been pulled by the client but not yet consumed.

consume lag

Analyzing Accumulation with Multi-Dimensional Metrics

For the accumulation problem, RocketMQ provides the consumption-latency metric rocketmq_consumer_lag_latency, on which alerts can be configured. The alert threshold should be set according to how much consumption delay the business can tolerate.

After the alert fires, you need to determine whether the backlog consists of ready messages or in-flight messages. RocketMQ provides the two metrics rocketmq_consumer_ready_messages and rocketmq_consumer_inflight_messages; combined with other consumption-related metrics and the client configuration, they can pinpoint the root cause of the accumulation:

• Case 1: Ready messages keep rising, and in-flight messages have reached the client's buffering limit

This is the most common accumulation scenario. The number of messages being processed on the client, rocketmq_consumer_inflight_messages, has reached the threshold configured on the client, which means the consumers' processing capacity is lower than the sending rate. If the business requires messages to be consumed as close to real time as possible, add more consumer machines; if the business is not very sensitive to message delay, you can wait for the traffic peak to pass and let the backlog drain.

• Case 2: Ready messages are almost zero, and in-flight messages keep rising

This case mostly occurs with RocketMQ 4.x clients, where consumer offsets are committed in order. If consumption of one message gets stuck, the offset cannot be committed, so a large number of messages appear to pile up on the client side, i.e. in-flight messages keep rising. You can combine the message trace with the rocketmq_process_time metric to locate messages that are slow to consume, analyze the upstream and downstream links, find the root cause, and optimize the consumption logic.

• Case 3: Ready messages keep rising, and in-flight messages are almost zero

This scenario indicates that the client is not pulling messages. The common causes are:

• Authentication problems: check the ACL configuration; if you use a public cloud product, check the AccessKey and SecretKey configuration;

• Consumer hang: dump the thread stacks or GC logs to determine whether the process is stuck;

• Slow server response: use the RPC-related metrics to check the call volume and latency of the pull-message interface as well as disk read/write latency, and determine whether it is a server-side problem, such as the disk IOPS being exhausted.
