O&M Practices and Governance of Xiaohongshu Message Queue

This article discusses the business scenarios, stability challenges, and monitoring metrics behind the message queues at Xiaohongshu.

By Yihao Zhang, Head of Message Queue at Xiaohongshu


1. Message Queue Business Scenarios and Challenges

1.1 Overall Scale

The following figure shows the overall scale of RocketMQ and Kafka. Peak TPS reaches 80 million (8000w/s, where w denotes 10,000) and generally occurs in the evening after work hours. The write volume reaches 50 GB/s, 2-3 PB of data is added every day, and there are over 1,200 nodes.


1.2 Business Architecture

Although RocketMQ and Kafka have similar performance, they are used differently. The rich business features of RocketMQ make it suitable for online business scenarios, while the high throughput of Kafka makes it better suited to offline and nearline business. The two overlap in practice: some online businesses use Kafka for decoupling, and some stream processing data is stored in RocketMQ.

The following figure shows the overall business architecture. Business logs and app user behavior tracking events are sent to Kafka. Database incremental logs, online business data, and online data exchange are sent to RocketMQ. Part of the data in Kafka and RocketMQ flows into Flink to build real-time data warehouses, offline data warehouses, and data products such as reports and monitoring. Another part of the data in RocketMQ is used for asynchronous decoupling of online business applications.

Message Queue Business Architecture

1.3 Stability Challenges

1) Background

Xiaohongshu started consolidating its messaging components relatively late. The main goal of the company's technical architecture is to improve system stability.

2) Challenges

The existing message components are heavily used, but there is no stability guarantee. The team also faces a shortage of manpower, tight schedules, and a lack of in-depth understanding of MQ principles.

3) Strategies

Monitoring and enhancing the observability of the cluster is the most efficient way to understand its health.

1.4 Stability Governance

In addition to monitoring and alerting, we have made the following changes for stability governance:

1) Engine: Resource isolation and new monitoring instrumentation points

2) Platform: Work order review, permission control, and business traceability

3) Governance: Building cluster visualization capabilities and cluster O&M capabilities


2. Message Queue Governance Practice

2.1 Cluster Visualization: Monitoring Metrics

The following figure shows the message queue monitoring architecture built on Prometheus and Grafana:

Message Queue Monitoring Architecture

The figure contains three monitoring dimensions: hardware, service, and business. More than 150 monitoring metrics are collected in total.

How can we define the monitoring metrics of these three dimensions?

1) Hardware: It mainly includes network bandwidth, CPU utilization, disk capacity/IO, TCP packet loss/latency, and other resource metrics.

2) Service: It mainly refers to the metrics of running status, such as downtime monitoring, JVM metrics, read and write latency, request queue, etc.

3) Business: It refers to user-facing metrics, such as consumption latency/backlog, QPS, topic throughput, offset, etc.

Company policy allows a node to expose only one port to Prometheus, while most of the monitoring metrics are collected separately. We therefore designed the metric aggregation service (MAS) to bring all the metrics together and attach metadata that helps with troubleshooting. MAS acts as a proxy layer for metrics, to which new metrics can be added as needed.
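The aggregation idea can be sketched as follows. This is a minimal illustration, not the actual MAS implementation: it assumes every collector returns Prometheus text exposition format, and the function names and the `cluster` and `source` labels are hypothetical.

```python
import re

# Matches a sample line such as `name{a="b"} 1.0` or `name 1.0`.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{(.*)\})?\s+(.+)$')


def add_labels(exposition: str, extra: dict) -> str:
    """Append extra labels to every sample line of a text exposition payload.

    Assumes label values in `extra` need no escaping.
    """
    extra_str = ",".join(f'{k}="{v}"' for k, v in sorted(extra.items()))
    out = []
    for line in exposition.splitlines():
        match = _SAMPLE.match(line)
        # Comments (# HELP / # TYPE), blank lines, and unparsable lines pass through.
        if line.startswith("#") or match is None:
            out.append(line)
            continue
        name, _, labels, value = match.groups()
        merged = f"{labels},{extra_str}" if labels else extra_str
        out.append(f"{name}{{{merged}}} {value}")
    return "\n".join(out)


def merge_expositions(sources: dict, metadata: dict) -> str:
    """Merge payloads from several collectors into one body served on one port."""
    parts = [
        add_labels(text, {**metadata, "source": name})
        for name, text in sorted(sources.items())
    ]
    return "\n".join(parts) + "\n"
```

A real aggregation layer would also need to deduplicate `# HELP` and `# TYPE` lines when two collectors expose the same metric name, which this sketch does not attempt.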

2.2 Alert Handling

The following figure lists some of the alerts raised when the monitoring system was first established. At that time, there were about 600-700 alerts every day. The alerts were so varied that they could not all be handled, which made the monitoring system effectively useless.


Given this situation, we put quality before quantity in monitoring: too many alerts are no better than no alerts. Following this principle, we formulated a phased strategy:

  • Initial Stage: Turn off low-quality alerts to ensure each high-quality alert can be detected and handled in time.
  • Middle Stage: As the number of high-quality alerts falls, the previously muted alerts are turned back on and worked through to further reduce the alert count.
  • Later Stage: Turn on all alerts to ensure each alert can be detected and handled in time.

In our experience, alerts such as service unavailable no longer occur in the later stage, and most alerts are early warnings. If an early warning is acted on in time, the problem can be solved before it grows.

Alert Processing Phased Policy

2.3 Cluster Visualization: Metric Design and Optimization

Monitoring of RocketMQ service and business metrics is based on the open-source RocketMQ-exporter, which we modified to solve problems such as metric leakage and inaccurate collection of some metrics.

Here are two important transformations:

A. Lag Monitoring Optimization

  • Case 1 – Consumer Metric Leakage: The number of metrics in the exporter can exceed 3 million (300w+) after it runs for a few days. A curl request to the interface takes 25 seconds, and the log text is 600 MB.

Cause: As shown in the following figure, the port value increases every time a new client connects. Because the exporter implementation does not clean up the metrics of offline clients in time, the number of client ports keeps growing, which results in system alerts.


Improvement: Add a metric expiration module to the exporter.

Result: The time taken by the curl interface is reduced to two seconds.
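The expiration idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code added to RocketMQ-exporter: the class name, the label layout, and the TTL value are hypothetical.

```python
import time


class ExpiringGauges:
    """Gauge store that forgets label sets not updated within ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._values = {}   # label tuple -> (value, last update time)

    def set(self, labels: tuple, value: float) -> None:
        """Record a value and refresh the last update time of this label set."""
        self._values[labels] = (value, self.clock())

    def expire(self) -> int:
        """Drop label sets that went stale; return how many were removed."""
        now = self.clock()
        stale = [k for k, (_, ts) in self._values.items() if now - ts > self.ttl]
        for key in stale:
            del self._values[key]
        return len(stale)

    def collect(self) -> dict:
        """Return only live values, so offline clients stop being exported."""
        self.expire()
        return {k: v for k, (v, _) in self._values.items()}
```

Because each client connection carries a new port in its labels, a store without expiration grows with every reconnect. Expiring on the last update time bounds the size to the clients seen within the TTL.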

  • Case 2: The lag metric is not accurate, which causes false alerts online.

Cause: The exporter only provides rocketmq_group_diff at the group dimension. There is no broker dimension, so an additional calculation is required.

Improvement: Add computational logic to the broker so that the lag is calculated there first.

Result: As shown in the following figure, the message backlog value no longer jitters by about 6K and returns to a stable value.
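A broker-dimension lag calculation can be sketched as follows. The input shape and function name are hypothetical, and this only illustrates the arithmetic (lag is the broker's maximum offset minus the consumer offset, summed per broker), not the change made inside the broker.

```python
from collections import defaultdict


def lag_by_broker(queues):
    """Sum consumer lag per broker.

    queues: iterable of (broker, queue_id, max_offset, consumer_offset).
    """
    totals = defaultdict(int)
    for broker, _queue_id, max_offset, consumer_offset in queues:
        # Clamp at zero: if the two offsets are read at slightly different
        # moments, the difference can briefly be negative.
        totals[broker] += max(0, max_offset - consumer_offset)
    return dict(totals)
```

Computing both offsets in one place, at the same moment, is what removes the jitter that appears when they are sampled separately and subtracted later.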


B. Optimization of Quantile Line/Sliding Window

  • Case 1: The broker busy problem is often encountered online, and we need to know when it occurs. Although the exporter comes with metrics such as the send pool, they are instantaneous values and have little reference value.

Improvement: Add a metric to the broker that records the maximum value over a five-minute window.
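A five-minute sliding maximum can be sketched with a monotonic deque. This is an illustrative sketch with hypothetical names, not the broker's implementation, and it assumes samples arrive in timestamp order.

```python
from collections import deque


class WindowMax:
    """Maximum of the samples seen within the last window_seconds."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._dq = deque()  # (timestamp, value), values strictly decreasing

    def add(self, ts: float, value: float) -> None:
        # Older samples that are not larger than the new one can never be
        # the maximum again, so they are discarded.
        while self._dq and self._dq[-1][1] <= value:
            self._dq.pop()
        self._dq.append((ts, value))
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop samples that have fallen out of the window.
        while self._dq and self._dq[0][0] <= now - self.window:
            self._dq.popleft()

    def current_max(self, now: float):
        """Return the window maximum, or None if the window is empty."""
        self._evict(now)
        return self._dq[0][1] if self._dq else None
```

Unlike an instantaneous gauge, the window maximum still shows a short spike (for example, in send pool queue length) when the scrape happens a minute or two after it occurred.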



  • Case 2: The message writing time metric is a historical maximum, so it has limited reference value.

Improvement: Report quantiles such as P99 and P999 over a five-minute window instead.

Result: The accurate message writing time is obtained.
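Windowed quantiles can be sketched as follows, using the nearest-rank definition. The class name is hypothetical, and keeping every sample is an assumption made for clarity; a production broker would more likely use histogram buckets to bound memory.

```python
import math
from collections import deque


class WindowQuantiles:
    """Nearest-rank quantiles over the samples of the last window_seconds."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._samples = deque()  # (timestamp, latency_ms) in arrival order

    def observe(self, ts: float, latency_ms: float) -> None:
        self._samples.append((ts, latency_ms))

    def quantile(self, q: float, now: float):
        """Return the q-quantile (0 < q <= 1), or None if the window is empty."""
        # Drop samples that have fallen out of the window.
        while self._samples and self._samples[0][0] <= now - self.window:
            self._samples.popleft()
        if not self._samples:
            return None
        ordered = sorted(value for _, value in self._samples)
        rank = max(1, math.ceil(q * len(ordered)))  # nearest-rank definition
        return ordered[rank - 1]
```

Calling `quantile(0.99, now)` and `quantile(0.999, now)` gives the P99 and P999 write time for the last five minutes, so a single slow write from long ago no longer dominates the metric.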


2.4 Cluster Visualization: Inspection System

The inspection system differs from the monitoring system. The monitoring system responds to instantaneous, fast-changing problems that must be discovered and handled promptly, and its presentation is relatively fixed. The inspection system provides long-term supervision of the static environment and configuration, which change less often, and its presentation is flexible.

As governance continues, how can a cluster be confirmed to be healthy?

1) Deploy clusters in strict accordance with the deployment standards, including hardware configurations, running parameters, and zones. Regularly inspect all clusters and produce reports that reflect the cluster status.

2) A total of over 20 items of core standards have been formulated, and the inspection results are presented in the form of tables, as shown in the following table:


3) Because there are too many metrics to judge a problem from directly, we set up a cluster health score system. Based on the idea that the availability of a cluster should be reflected by a single metric, a weight is set for each metric. The final score determines whether the cluster has problems, as shown in the following figure:
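A weighted health score can be sketched as follows. The inspection item names, the weights, and the 0-100 scale are hypothetical; the source does not give the actual items or weights.

```python
def health_score(checks: dict, weights: dict) -> float:
    """Return a 0-100 score: the weighted share of inspection items that pass.

    checks:  item name -> True if the item passed inspection
    weights: item name -> positive weight; items missing from `checks`
             count as failed
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    passed = sum(w for name, w in weights.items() if checks.get(name, False))
    return 100.0 * passed / total
```

For example, with weights `{"disk": 3, "config": 1, "version": 1}` and a failed `config` item, the score is 80.0. Giving heavier weights to items that threaten availability keeps the single number meaningful.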


2.5 Cluster Visualization: Message Reconciliation Monitoring

Some alert items will inevitably be overlooked when alerts are designed. Our solution is a message reconciliation system, which effectively monitors message latency, message loss, and cluster health.

The advantage of the message reconciliation system is that it provides end-to-end monitoring and covers several monitoring concerns at once. Because it actively probes the cluster, it can stand in for alert items that were not considered, and it detects and locates faults independently.

Message Reconciliation Monitoring System


The Kafka community provides a corresponding component, Kafka Monitor. We turned this component into a service that can automatically add monitoring for new clusters, which reduces the pressure on O&M.
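The reconciliation logic can be sketched as follows. This is a simplified illustration that is independent of any Kafka client and is not how Kafka Monitor is implemented: the `Probe` type, the function name, and the 30-second timeout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    seq: int        # sequence number written into the probe message
    sent_at: float  # send timestamp in seconds


def reconcile(sent, received, now, timeout=30.0):
    """Compare probes that were sent with probes that were consumed.

    sent:     list of Probe produced to the cluster
    received: dict mapping seq -> timestamp at which the probe was consumed
    """
    latencies, lost, pending = [], [], []
    for probe in sent:
        if probe.seq in received:
            latencies.append(received[probe.seq] - probe.sent_at)
        elif now - probe.sent_at > timeout:
            lost.append(probe.seq)      # never arrived within the timeout
        else:
            pending.append(probe.seq)   # still in flight, not yet judged
    return {
        "max_latency": max(latencies, default=None),
        "lost": lost,
        "pending": pending,
    }
```

Because the check only depends on what was sent and what came back, it catches failures end to end even when no specific alert rule was written for the underlying cause.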

2.6 Cluster O&M: Automation Platform

O&M capability is built through automation, whose fundamental purpose is to free up manpower.

The following figure shows the topic migration tools that we built by modifying RocketMQ and Kafka.

1) RocketMQ

  • The nameserver delete logic is modified to migrate topics between brokers automatically.
  • The tool also handles consumer groups and retry/DLQ topics.
  • It relies on the proprietary management platform.

2) Kafka

  • The tool is built on reassign, with a customized reassign algorithm that reduces the impact of partition relocation.
  • Staged Workflow: Each step is executed automatically, and the next operation is confirmed manually.
  • It is integrated with the proprietary management platform.

Topic Migration Tool
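The staged workflow described for Kafka (automatic execution of each step, manual confirmation before the next) can be sketched as follows. The function name and the step names are hypothetical, and the customized reassign algorithm itself is not shown because the source does not describe it.

```python
def run_staged(steps, confirm):
    """Run steps in order, asking for confirmation before each following step.

    steps:   list of (name, action) pairs, where action takes no arguments
    confirm: callable that receives the next step's name and returns True
             to continue or False to stop
    Returns the names of the steps that were executed.
    """
    done = []
    for index, (name, action) in enumerate(steps):
        action()  # the step itself runs without human involvement
        done.append(name)
        has_next = index + 1 < len(steps)
        if has_next and not confirm(steps[index + 1][0]):
            break  # the operator declined; stop before the next step
    return done
```

In an interactive tool, `confirm` could wrap a prompt or an approval button on the management platform. Stopping between steps gives the operator a chance to check cluster load before partitions continue to move.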

3. Future Exploration and Planning

Message middleware has become increasingly cloud-native in recent years. For example, RocketMQ 5.0 introduced a storage-compute separation design and a Raft mode, and Kafka 3.0 introduced a tiered storage mode and a Raft mode. The newer Pulsar has also adopted a cloud-native architecture. In the future, we can adopt these new capabilities to meet specific business requirements and maximize the value of each component.
