The Innovative Practice of RocketMQ in Sohu

This article describes the usage scenarios and problems of MQ in several typical businesses.

1. Usage Scenario and Selection of MQ

The message middleware used in most video departments includes RedisMQ, ActiveMQ, RocketMQ, and Kafka.

1.1 The Introduction of RocketMQ

RocketMQ was first used in the counting service, which calculates and displays client view counts in real time. At that time, Redis handled the real-time counting, and the database was updated asynchronously. This model worked well at first, but as business volume grew, so did the pressure on the database, and at times its CPU resources were almost used up. In addition, writes had to be suspended whenever the database was migrated, which put the counts at risk of data loss.

Thus, the counting service urgently needed a reliable MQ that could accumulate messages and support real-time consumption.

We considered RocketMQ and Kafka but eventually chose RocketMQ for the reasons listed below:

1.2 Abandonment of Kafka

The delivery service delivers recommended content to users in each area, and the recommendation service in turn needs users' feedback on that content. Kafka was therefore adopted in the delivery service to interact with the recommendation service. However, a machine failure triggered a failover in the Kafka cluster. Unfortunately, the cluster had so many partitions that the failover took several minutes to complete.

This blocked the service thread, and the service entered an unresponsive state. Later, we learned that even if a broker of RocketMQ is down, messages will be sent to other brokers without blocking the entire cluster. So, the delivery service migrated all message interactions to RocketMQ.

1.3 Unreliable RedisMQ

In the past, our basic video service used RedisMQ to notify callers of video data changes so that they could update their data. However, message push in RedisMQ is based on the pub/sub model. This model is highly real-time, but it guarantees neither reliable delivery nor message persistence.

In some cases, these two defects made the caller unable to receive the notification. When the message was lost, it was nearly impossible to get it back.

Therefore, the video business eventually abandoned RedisMQ and turned to RocketMQ. RocketMQ ensures that messages are delivered at least once and can be persisted. Even if the client is restarted, it can resume consumption from where it left off.

1.4 Low-Performance ActiveMQ

Previously, the basic service for user videos used ActiveMQ, mainly to notify dependent parties of data changes, with the changed data carried in the message body. Unfortunately, when the number of messages was large, ActiveMQ often failed to respond, and consumers went a long time without receiving messages. We learned that a single RocketMQ broker can support a TPS of over 100,000 and hundreds of millions of accumulated messages. Thus, this business was also migrated to RocketMQ.

Currently, RocketMQ is used in basic video services, user services, livestreaming services, payment services, audit services, and other business systems. Kafka is mostly used for log-related processing services, such as log reporting and log collection.

In addition, since RocketMQ supports clients in more languages, it is easier for our businesses written in other languages to access RocketMQ, such as Python-based clients for AI groups and Go-based clients for some services.

2. O&M Challenges

In the early stage, we relied on command lines and the RocketMQ-Console to maintain RocketMQ. The questions frequently asked by business parties include:

  • Which of my machines are sending messages to this topic?
  • Why did the message sending time out?
  • Can I get notified when message sending fails?
  • Can I get notified when consumption fails?
  • What is the message body like?
  • Can RocketMQ clusters be degraded and isolated when they are unavailable?
  • Why does the consumption of my topic lead to confusion in other business consumption?
  • Why do I need to serialize it myself?

There are many strange problems!

As O&M personnel, besides answering business parties' questions, we had to be very careful when maintaining RocketMQ from the command line, because a small mistake could cause a large-scale failure. As we grew more familiar with O&M, we wrote several documents on usage specifications, best practices, naming conventions, operation procedures, and other topics. However, we found that these documents did little to improve production efficiency. We concluded that it is better to turn experience and practice into products that serve the business than to write documents. As a result, MQCloud was created.

3. The Birth of MQCloud

Let's first look at the positioning of MQCloud:


It is an all-in-one service platform that integrates client SDKs, monitoring and alerting, and cluster O&M. The system architecture of MQCloud is shown below:


Now, I will explain how MQCloud solves the pain points mentioned above.

3.1 After the Separation of Business End and O&M End, Users Only Focus on Business Data

The dimensions of the user and the resource are introduced to achieve this goal. Users and resources are managed to ensure that different users only focus on their data.


  • Producers care only about the topic configuration, message-sending data, and the consuming parties, so we show them only the corresponding data.
  • Consumers only care about consumption, accumulation, consumption failure, and other aspects.
  • Administrators can perform daily operations, such as deployment, monitoring, unified configuration, and approval.

3.2 Clear Operations

Showing different views to different people makes the operations users can perform very clear.

3.3 Specifications and Safety

All operations are approved by administrators in the background approval system in the form of application forms to ensure the security and standardization of cluster operations. This improves security significantly.

3.4 Multidimensional Data Statistics and Monitoring and Alerting

One of the core functions of MQCloud is monitoring and alerting. Currently, MQCloud supports alerting of the following aspects:

  • Production Message Exception
  • Consumption Message Accumulation (from the Perspective of the Broker)
  • Consumer Client Blocking (from the Perspective of the Client)
  • Consumption Failure
  • Consumption Offset Error
  • Consumer Subscription Error
  • Consumption Lagging (The consumer falls so far behind that messages exceed the in-memory threshold and must be pulled from the hard disk.)
  • Dead Message (Too many consumption failures put messages in the dead-letter queue)
  • Message Traffic Exception
  • High Time Consumption of Message Storage (Time Consumption of Message Storage by the Broker)
  • Broker and NameServer Crash
  • Server Crash
  • Server CPU, memory, network traffic, and other indicators

Statistics is a must for monitoring. MQCloud does a lot of statistical work to understand the operating status of RocketMQ clusters. (Most of it depends on broker statistics.) It mainly includes the following items:

  • Topic Production Traffic per Minute: Used for monitoring, alerting, and graphing topic production traffic.
  • Consumer Traffic per Minute: Used for monitoring, alerting, and graphing consumption traffic.
  • Topic Production Traffic Every Ten Minutes: Used to display topics ranked by traffic.
  • Broker Production and Consumption Traffic per Minute: Used to graph the production and consumption traffic of each broker.
  • Cluster Production and Consumption Traffic per Minute: Used to graph the production and consumption traffic of each cluster.
  • Producer Percentile Time Consumption per Minute and Exception Statistics: Used for monitoring, alerting, and graphing the time consumption of each producer by IP address.
  • Statistics of CPU, Memory, I/O, Network Traffic, and Network Connections: Used for monitoring, alerting, and graphing the server status.

3.4.1 Statistics on Abnormal Production Time Consumption


RocketMQ does not provide per-producer traffic statistics. (Statistics are available per topic, but the situation of each individual producer is unknown.) MQCloud collects producer statistics through the hook mechanism of RocketMQ.


Statistics mainly contain the following information:

  • Client IP to Broker IP
  • Time Consumed for Sending Messages
  • Number of Messages
  • Sending Exception

After the statistics are completed, the data is regularly sent to MQCloud for storage, real-time monitoring, and display.
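
The collection step described above can be sketched in plain Java. This is only an illustration under assumptions: the class ProducerSendStats, its methods, and the "clientIp->brokerIp" key format are hypothetical, and the wiring into the RocketMQ send hook is left out so that the example runs with the JDK alone.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Per-link send statistics; a stand-in for what a producer-side hook could collect. */
class ProducerSendStats {
    /** Counters for one "client IP -> broker IP" link. */
    static final class LinkStats {
        final LongAdder messages = new LongAdder();
        final LongAdder totalMillis = new LongAdder();
        final LongAdder exceptions = new LongAdder();
    }

    private final Map<String, LinkStats> byLink = new ConcurrentHashMap<>();

    /** Would be called after each send attempt (hypothetical hook wiring, not shown). */
    void record(String clientIp, String brokerIp, long costMillis, boolean failed) {
        LinkStats s = byLink.computeIfAbsent(clientIp + "->" + brokerIp, k -> new LinkStats());
        s.messages.increment();
        s.totalMillis.add(costMillis);
        if (failed) {
            s.exceptions.increment();
        }
    }

    /** Returns {messages, totalMillis, exceptions} per link and resets the counters. */
    Map<String, long[]> snapshotAndReset() {
        Map<String, long[]> out = new TreeMap<>();
        byLink.forEach((link, s) -> out.put(link, new long[] {
            s.messages.sumThenReset(), s.totalMillis.sumThenReset(), s.exceptions.sumThenReset()
        }));
        return out;
    }

    public static void main(String[] args) {
        ProducerSendStats stats = new ProducerSendStats();
        stats.record("10.0.0.1", "10.0.1.1", 12, false);
        stats.record("10.0.0.1", "10.0.1.1", 30, true);
        stats.snapshotAndReset().forEach((link, v) -> System.out.println(
            link + " messages=" + v[0] + " totalMillis=" + v[1] + " exceptions=" + v[2]));
    }
}
```

A scheduled task could call snapshotAndReset() at a fixed interval and report the result to the platform. The reset is not atomic with respect to concurrent senders, which is usually acceptable for monitoring data.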

One thing to note is that time consumption statistics generally report maximum, minimum, and average values. Usually, however, the time consumption at the 99th percentile (the value that 99% of requests stay below) represents the real response situation better. The biggest obstacle is controlling memory usage, because all time consumption values within a given period must be sorted before an exact result can be obtained. There are algorithms and data structures for statistics on streaming data, such as t-digest. MQCloud uses an approximate but relatively simple segmented statistics method, as follows:

1) Create a piecewise array that covers the range up to the maximum time consumption, with a different time span (bucket width) for each piece:

  • Piece 1: Time consumption ranges from 0 to 10 milliseconds, and the time span is 1 millisecond.
  • Piece 2: Time consumption ranges from 11 to 100 milliseconds, and the time span is 5 milliseconds.
  • Piece 3: Time consumption ranges from 101 to 3,500 milliseconds, and the time span is 50 milliseconds.

Advantage: This method occupies a fixed amount of memory. For example, if the maximum time consumption is 3,500 milliseconds, only an array with a size of 96 is required.

Disadvantage: The accuracy needs to be set in advance and cannot be changed.

2) For the piecewise array above, create a counting array of AtomicLong of the same size. It should support concurrent statistics.


3) When performing time-consumption statistics, calculate the subscript of the piecewise array and then call the counting array to perform statistics. Please see the following figure:


  • For example, for a time consumption of 18 milliseconds, first find the interval to which it belongs. It belongs to the interval between 16 and 20 milliseconds, and the corresponding array subscript is 12.
  • Use the subscript 12 found in the first step to locate the corresponding counter in the counting array.
  • Increment that counter by one, which records one request in the interval that contains 18 milliseconds.

As such, time consumption statistics can be obtained from the counting array in real-time. This process is shown in the following figure:


4) Then, the scheduled sampling task takes snapshots of the counting array every minute to generate the following time consumption data:


5) Since the time consumption data above is naturally arranged in order, it is easy to calculate the 99th and 90th percentile time consumption as well as the average time consumption.
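
The five steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not MQCloud's actual code: the class and method names are hypothetical, each bucket is identified by its inclusive upper bound, and times above 3,500 milliseconds are clamped into the last bucket. Because the sketch keeps a dedicated bucket for 0 milliseconds, its array has 97 slots rather than the 96 quoted above; the exact count depends on how the boundaries are drawn.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Fixed-memory latency histogram with three pieces of different bucket widths. */
class SegmentedLatencyStats {
    private final int[] upperBounds = buildUpperBounds();
    private final AtomicLong[] counters = new AtomicLong[upperBounds.length];

    SegmentedLatencyStats() {
        for (int i = 0; i < counters.length; i++) {
            counters[i] = new AtomicLong();
        }
    }

    /** Inclusive upper bound (ms) of every bucket, in ascending order. */
    static int[] buildUpperBounds() {
        int[] bounds = new int[11 + 18 + 68];
        int i = 0;
        for (int v = 0; v <= 10; v++) bounds[i++] = v;            // piece 1: width 1 ms
        for (int v = 15; v <= 100; v += 5) bounds[i++] = v;       // piece 2: width 5 ms
        for (int v = 150; v <= 3500; v += 50) bounds[i++] = v;    // piece 3: width 50 ms
        return bounds;
    }

    /** Maps a time in milliseconds to its bucket subscript. */
    static int indexOf(long ms) {
        if (ms <= 0) return 0;
        if (ms <= 10) return (int) ms;
        if (ms <= 100) return 11 + (int) ((ms - 11) / 5);
        if (ms <= 3500) return 29 + (int) ((ms - 101) / 50);
        return 96; // clamp anything slower into the last bucket
    }

    /** Step 3: find the subscript, then increment the matching counter. */
    void record(long ms) {
        counters[indexOf(ms)].incrementAndGet();
    }

    /** Step 4: take a snapshot of the counters and reset them for the next window. */
    long[] snapshotAndReset() {
        long[] snapshot = new long[counters.length];
        for (int i = 0; i < counters.length; i++) {
            snapshot[i] = counters[i].getAndSet(0);
        }
        return snapshot;
    }

    /** Step 5: walk the ordered buckets until the requested share of requests is covered. */
    int percentile(long[] snapshot, int pct) {
        long total = 0;
        for (long c : snapshot) total += c;
        if (total == 0) return 0;
        long rank = (total * pct + 99) / 100; // ceil(total * pct / 100)
        long seen = 0;
        for (int i = 0; i < snapshot.length; i++) {
            seen += snapshot[i];
            if (seen >= rank) return upperBounds[i];
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        SegmentedLatencyStats stats = new SegmentedLatencyStats();
        for (int ms = 1; ms <= 100; ms++) stats.record(ms);
        long[] snapshot = stats.snapshotAndReset();
        System.out.println("p99=" + stats.percentile(snapshot, 99) + " ms");
        System.out.println("p90=" + stats.percentile(snapshot, 90) + " ms");
    }
}
```

The reported percentile is the upper bound of the bucket that contains the true value, so the error is at most one bucket width: 1, 5, or 50 milliseconds depending on the piece.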

The trace function added in RocketMQ 4.4.0 is also implemented through hooks, so it originally conflicted with MQCloud's statistics. The two are now compatible. Trace and statistics cover two different dimensions: trace reflects the path of a message from production to storage and consumption, while MQCloud collects statistics on producers. With this statistical data, MQCloud can display production time consumption and alert on production exceptions.

3.4.2 Machine Statistics

MQCloud automatically places nmon in the /tmp directory of each machine to collect server metrics. A scheduled task then connects to the machine over SSH, executes the nmon command, parses the returned data, and stores it.

The process above has laid a solid data foundation for monitoring and alerting.

3.5 A Customized Client


To meet several client-side requirements, mq-client was developed as a customization of rocketmq-client.

3.5.1 Multi-Cluster Support

MQCloud stores the relationship between producers, consumers, and clusters. Clients are routed to the target cluster automatically through route adaptation, so the existence of multiple clusters is transparent to them.

3.5.2 Transparent Trace Clusters

Trace data can be sent to separate clusters by building separate trace clusters and customized clients. This will not influence the primary cluster.

3.5.3 Serialization

Because the client integrates different serialization mechanisms in cooperation with MQCloud, businesses do not need to handle serialization themselves. Currently, Protobuf and JSON are supported, and the mechanism can be switched online through type detection.
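
The source does not say how type detection works. One plausible approach, shown here purely as an assumption, is for the consumer to inspect the first byte of the payload: JSON objects and arrays start with '{' or '[', and anything else is treated as binary. The class PayloadTypeDetector and its Format enum are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

/** Guesses the wire format of a message body from its first byte (assumed rule). */
class PayloadTypeDetector {
    enum Format { JSON, BINARY }

    static Format detect(byte[] payload) {
        if (payload.length == 0) {
            return Format.BINARY;
        }
        // Only the first byte is checked; no whitespace is skipped on purpose,
        // because 0x0A (newline) is also a very common first byte in Protobuf data.
        byte first = payload[0];
        return (first == '{' || first == '[') ? Format.JSON : Format.BINARY;
    }

    public static void main(String[] args) {
        byte[] json = "{\"videoId\":42}".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {0x08, 0x2A}; // Protobuf-style bytes: field 1, varint 42
        System.out.println(detect(json));
        System.out.println(detect(binary));
    }
}
```

The rule is workable because the bytes 0x7B ('{') and 0x5B ('[') would decode as Protobuf start-group tags, a wire type that is deprecated. An explicit format flag carried with the message would be a more robust alternative.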

3.5.4 Traffic Control

The client provides token bucket and leaky bucket throttling mechanisms, and traffic control is enabled automatically. This prevents message peaks from flooding the business end and is convenient for businesses that need to control the traffic rate accurately.
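
As an illustration of the token bucket idea, here is a minimal generic sketch. It is not the mq-client implementation, and the class name and constructor parameters are assumptions. The current time is passed in explicitly so that the behavior is deterministic and easy to test.

```java
/** Minimal token bucket: permits refill continuously up to a fixed capacity. */
class TokenBucket {
    private final double capacity;
    private final double permitsPerSecond;
    private double tokens;
    private long lastNanos;

    TokenBucket(double capacity, double permitsPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.permitsPerSecond = permitsPerSecond;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastNanos = nowNanos;
    }

    /** Tries to take one permit at the given time; returns false when throttled. */
    synchronized boolean tryAcquire(long nowNanos) {
        if (nowNanos > lastNanos) {
            double refill = (nowNanos - lastNanos) * permitsPerSecond / 1_000_000_000.0;
            tokens = Math.min(capacity, tokens + refill);
            lastNanos = nowNanos;
        }
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    /** Convenience overload that uses the system clock. */
    boolean tryAcquire() {
        return tryAcquire(System.nanoTime());
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(5, 100.0, System.nanoTime());
        int admitted = 0;
        for (int i = 0; i < 8; i++) {
            if (bucket.tryAcquire()) admitted++;
        }
        System.out.println("admitted " + admitted + " of 8 messages in the first burst");
    }
}
```

A consumer could call tryAcquire() before handling each message and pause or skip when it returns false. A token bucket tolerates short bursts up to its capacity, whereas a leaky bucket smooths output to a constant rate.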

3.5.5 Isolation and Degradation

By providing a Hystrix-based isolation API for message production, the client keeps the business end from being affected when a broker fails.
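
The sketch below shows the degradation idea with a hand-rolled circuit breaker instead of Hystrix, so that it runs with the JDK alone. It is a simplification under assumptions: SendCircuitBreaker and its parameters are hypothetical, and Hystrix additionally provides thread-pool or semaphore isolation, timeouts, and metrics that are not reproduced here.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Opens after N consecutive failures and serves a fallback while open. */
class SendCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SendCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Runs send unless the circuit is open; falls back on failure or while open. */
    synchronized <T> T call(Supplier<T> send, Supplier<T> fallback) {
        long now = clock.getAsLong();
        if (openedAt >= 0 && now - openedAt < openMillis) {
            return fallback.get(); // circuit open: do not touch the broker at all
        }
        try {
            T result = send.get(); // closed, or a trial call after the open period
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = now; // open (or re-open) the circuit
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SendCircuitBreaker breaker = new SendCircuitBreaker(3, 5_000, System::currentTimeMillis);
        String result = breaker.call(
            () -> { throw new IllegalStateException("broker unavailable"); },
            () -> "stored locally for later resend");
        System.out.println(result);
    }
}
```

Holding the lock during the send keeps the sketch short but serializes callers. A production version would track state with atomics and run the send outside the lock.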

3.5.6 Tracking Point Monitoring

Any anomaly on the client can be found in time through statistics, collection, and monitoring in MQCloud.

3.5.7 Standardization

Certain conventions, specifications, and best practices can be implemented through coding assurance, including (but not limited to):

  • Naming Conventions
  • Global Uniqueness of Consumer Groups to Prevent Consumption Issues
  • Retry Message Skipping
  • Secure Shutdown
  • A More Thorough Retry Mechanism

3.6 Nearly Automated O&M

3.6.1 Deployment

Manual deployment of a broker instance is not very difficult. However, when the number of instances increases, manual deployment is highly error-prone and time-consuming.

MQCloud provides a set of automated deployment mechanisms, including writing suspension, enabling and disabling, local update, and remote migration (including data verification).


Support Quick Deployment:


In addition, as the core of RocketMQ, the broker has hundreds of configuration items, and many of them involve performance tuning. This often requires careful tuning according to the status of the server. MQCloud has developed the configuration template feature to support flexible deployment.

As an O&M platform, MQCloud involves the following things that we need to consider:

1) Broker configuration items are complicated and need to be managed clearly.

2) Prompts and suggestions should be provided when existing broker parameters are adjusted. In addition, two situations need to be distinguished:

  • The adjustment takes effect without restarting the broker.
  • The adjustment takes effect only after the broker is restarted.

3) Parameters are inherited when a broker is newly deployed. Parameters that have been optimized and verified by the online brokers are expected to be used automatically when a broker is newly deployed.

Broker Configuration Template

MQCloud uses the following methods to solve the problems above:

  • Template groups for broker configuration. MQCloud groups the parameters to facilitate the differentiation and management of broker parameters. (Group information can be modified according to the specific conditions.)


  • Templates for broker configuration. You can add the default configuration parameters of the broker to the template.


  • Templates for cluster configuration. Configurations in templates are used when you deploy a new broker. The Add to Cluster button adds selected items in the broker configuration template to the cluster configuration template.
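
A template of this kind can be modeled as an ordered map of parameters, each with a group, a default value, and a flag that says whether a change needs a restart. The sketch below is an assumption about the data model, not MQCloud's schema: the class BrokerConfigTemplate is hypothetical, and the groups, values, and restart flags in the example are illustrative rather than statements about RocketMQ.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Ordered set of broker parameters used to render the config of a new broker. */
class BrokerConfigTemplate {
    static final class Param {
        final String group;
        final String defaultValue;
        final boolean restartRequired;

        Param(String group, String defaultValue, boolean restartRequired) {
            this.group = group;
            this.defaultValue = defaultValue;
            this.restartRequired = restartRequired;
        }
    }

    private final Map<String, Param> params = new LinkedHashMap<>();

    BrokerConfigTemplate add(String key, String group, String defaultValue, boolean restartRequired) {
        params.put(key, new Param(group, defaultValue, restartRequired));
        return this;
    }

    /** Lets the platform warn the operator before a parameter is changed. */
    boolean restartRequired(String key) {
        Param p = params.get(key);
        return p != null && p.restartRequired;
    }

    /** Renders key=value lines; per-broker overrides replace the template defaults. */
    String render(Map<String, String> overrides) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Param> e : params.entrySet()) {
            String value = overrides.getOrDefault(e.getKey(), e.getValue().defaultValue);
            sb.append(e.getKey()).append('=').append(value).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BrokerConfigTemplate template = new BrokerConfigTemplate()
            .add("flushDiskType", "storage", "ASYNC_FLUSH", true)       // illustrative flag
            .add("sendMessageThreadPoolNum", "threads", "16", false);   // illustrative flag
        System.out.print(template.render(
            Collections.singletonMap("sendMessageThreadPoolNum", "32")));
    }
}
```

Keeping verified values in the template means that a newly deployed broker inherits them automatically, and only the overrides differ per machine.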


3.6.2 Machine O&M

MQCloud provides a complete set of machine O&M mechanisms to improve productivity.

3.6.3 Visualized Cluster Topology


3.7 Security Reinforcement

3.7.1 Enable Administrator Permissions

RocketMQ has supported ACL since version 4.4.0, but it is not enabled by default. This means that anyone can control online clusters using management tools or the API. However, enabling ACL would have too much impact on existing businesses, so MQCloud takes a narrower approach to this problem.

Following the RocketMQ ACL mechanism, MQCloud strengthens permission verification only for RocketMQ administrator operations.

It also supports customization and hot-loading of the administrator request codes, which prevents unauthorized operation of RocketMQ clusters and improves security significantly.
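
The following sketch illustrates the idea of checking only administrator request codes. It is an assumption-laden illustration, not MQCloud's code: AdminRequestGuard is a hypothetical class, the HMAC-SHA1 signature scheme is one possible choice, and the request code numbers in the example are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Requires a valid signature only for request codes marked as administrative. */
class AdminRequestGuard {
    private volatile Set<Integer> adminCodes;
    private final byte[] secret;

    AdminRequestGuard(Set<Integer> adminCodes, String secret) {
        this.secret = secret.getBytes(StandardCharsets.UTF_8);
        reloadAdminCodes(adminCodes);
    }

    /** Hot-loading: swap the whole set atomically without restarting anything. */
    void reloadAdminCodes(Set<Integer> codes) {
        this.adminCodes = Collections.unmodifiableSet(new HashSet<>(codes));
    }

    /** Signs the request code together with the body using HMAC-SHA1. */
    static byte[] sign(byte[] secret, int code, byte[] body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secret, "HmacSHA1"));
            mac.update(Integer.toString(code).getBytes(StandardCharsets.UTF_8));
            return mac.doFinal(body);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC not available", e);
        }
    }

    /** Ordinary requests pass untouched; administrative ones need a matching signature. */
    boolean allow(int code, byte[] body, byte[] signature) {
        if (!adminCodes.contains(code)) {
            return true;
        }
        return signature != null && MessageDigest.isEqual(sign(secret, code, body), signature);
    }

    public static void main(String[] args) {
        int adminCode = 17;    // placeholder for an administrative request code
        int ordinaryCode = 10; // placeholder for an ordinary request code
        AdminRequestGuard guard =
            new AdminRequestGuard(new HashSet<>(Arrays.asList(adminCode)), "demo-secret");
        byte[] body = "topic=demo".getBytes(StandardCharsets.UTF_8);
        System.out.println("ordinary request allowed: " + guard.allow(ordinaryCode, body, null));
        System.out.println("unsigned admin request allowed: " + guard.allow(adminCode, body, null));
    }
}
```

Because ordinary send and pull requests are never checked, existing producers and consumers keep working unchanged, which is the reason for limiting verification to administrator operations.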

3.7.2 Communication Reinforcement of the Broker

Because the broker's data synchronization code does not verify the peer, there are security risks. If any client connects to the slave communication port that the master listens on and sends 8 or more bytes of data, it may cause synchronization offset errors. The relevant code is listed below:

if ((this.byteBufferRead.position() - this.processPostion) >= 8) {
    // Take the last complete 8-byte value in the buffer as the reported offset.
    int pos = this.byteBufferRead.position() - (this.byteBufferRead.position() % 8);
    long readOffset = this.byteBufferRead.getLong(pos - 8);
    this.processPostion = pos;
    // The value is accepted as the slave's ack offset without any verification.
    HAConnection.this.slaveAckOffset = readOffset;
    if (HAConnection.this.slaveRequestOffset < 0) {
        HAConnection.this.slaveRequestOffset = readOffset;
        log.info("slave[" + HAConnection.this.clientAddr + "] request offset " + readOffset);
    }
}

MQCloud ensures communication security by verifying the first packet of data.
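
The source does not describe how the first packet is verified. Purely as an illustration of the idea, the sketch below assumes that a legitimate slave sends a 16-byte first packet made of a shared handshake token followed by its requested offset, and that the master rejects anything else. FirstPacketCheck, HANDSHAKE_MAGIC, and the packet layout are all hypothetical.

```java
import java.nio.ByteBuffer;

/** Validates the first packet on a replication connection (assumed layout). */
class FirstPacketCheck {
    /** Hypothetical shared token; a real deployment would not hard-code it. */
    static final long HANDSHAKE_MAGIC = 0x4D51434C4F554431L;

    /**
     * Returns the requested offset if the packet is a valid handshake,
     * or -1 if the connection should be rejected.
     */
    static long readFirstPacket(ByteBuffer in) {
        if (in.remaining() < 16) {
            return -1; // too short to hold token + offset
        }
        long magic = in.getLong();
        long offset = in.getLong();
        if (magic != HANDSHAKE_MAGIC || offset < 0) {
            return -1; // unknown peer or nonsensical offset
        }
        return offset;
    }

    public static void main(String[] args) {
        ByteBuffer valid = ByteBuffer.allocate(16);
        valid.putLong(HANDSHAKE_MAGIC).putLong(1024L);
        valid.flip();
        System.out.println("valid packet offset: " + readFirstPacket(valid));

        ByteBuffer stray = ByteBuffer.wrap(new byte[9]); // 9 arbitrary bytes
        System.out.println("stray packet offset: " + readFirstPacket(stray));
    }
}
```

With a check like this in front of the offset handling, a stray connection that sends a few bytes can no longer overwrite the slave ack offset. A fixed token is only a basic safeguard and does not replace network-level access control.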

4. The Road to Open-Source

The O&M scale of MQCloud is listed below:

  • Server: More than 50
  • Cluster: More than 5
  • Topic: More than 800
  • Consumer: More than 1,400
  • Number of Messages Produced and Consumed per Day: More than 400,000,000
  • Size of Messages Produced and Consumed per Day: More than 400 Gigabytes

MQCloud is built around the concerns of each role, with comprehensive monitoring as its goal, so that it can meet the needs of each business end. It is constantly being developed and improved.

After MQCloud matured gradually, we opened the source code to gain more experience and serve the community. After the design and split, MQCloud was open-sourced in 2018. By now, more than 20 update versions have been released. These versions include function updates, bug fixes, and descriptions in the Wiki. Each major version has undergone detailed testing and internal operation. After that, many users were eager to try it out and provided many useful suggestions. Then, we improved it according to the feedback.

We will follow our goal and remain focused on the path of open-source:

  • We will provide businesses with stable MQ services that support monitoring, alerting, and functions to meet various needs of businesses.
  • We will accumulate experience in the MQ field and transform it into products to serve businesses better.