Exploration and Practice of RocketMQ Full-Link Grayscale

This article discusses Full-Link Grayscale, including its background, design, and best practices.

By Jing Xiao (Spring Cloud Alibaba PMC + Alibaba Cloud Intelligence Technology Expert)

Full-Link Grayscale Background


A grayscale release limits the impact of a change: the new version first receives only a small, controlled share of traffic, so the correctness of its code logic can be verified cautiously before a full rollout.

For example, a set of applications may contain multiple modules, such as the trading center, commodity center, and inventory center. A feature in a new version may modify both the trading center and the commodity center. To verify the new version properly, grayscale traffic must reach the grayscale versions of both the trading center and the commodity center. This is why full-link grayscale is required.

As shown in the preceding figure, when traffic arrives at the ingress application and is identified as grayscale traffic, it is routed to the grayscale versions of the trading center and the commodity center. If the inventory center has a grayscale version, the traffic goes there. If not, it falls back to the baseline environment of the inventory center.


Assume there is a gateway that clients access from iOS, Android, or H5. Each request carries an HTTP header that contains the user ID. The backend is divided into three modules: A, B, and C. Suppose a new feature requires updates to A and C only, so grayscale versions of A and C are released at the same time while B stays unchanged. We set the grayscale rule to route requests with user ID = 120 to the grayscale environment, in order to verify the logic of the new versions of A and C. Traffic from the gateway first goes to the grayscale environment of A. When A calls B, B has no grayscale environment, so the call falls back to the baseline environment of B. When B calls C in the next hop, C has a grayscale environment, so the call returns to the grayscale environment of C. This realizes a full-link grayscale release.
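The per-hop decision in this example can be sketched as follows. This is a minimal illustration, not the MSE implementation: the class `GrayRouter`, the header name `userId`, and the environment names `gray` and `base` are all assumptions made for the sketch.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: class, header, and environment names are illustrative only.
class GrayRouter {
    static final String GRAY = "gray";
    static final String BASE = "base";

    // Applications that currently have a grayscale deployment (A and C in the example).
    private final Set<String> appsWithGray;

    GrayRouter(Set<String> appsWithGray) {
        this.appsWithGray = appsWithGray;
    }

    // Ingress rule from the example: user ID 120 is grayscale traffic.
    static boolean isGrayRequest(Map<String, String> headers) {
        return "120".equals(headers.get("userId"));
    }

    // Per-hop decision: use the gray environment if the target application has one,
    // otherwise fall back to the baseline environment.
    String route(String app, boolean grayRequest) {
        return (grayRequest && appsWithGray.contains(app)) ? GRAY : BASE;
    }
}
```

With A and C registered as having grayscale deployments, a grayscale request is routed to gray for A, base for B, and gray again for C. The grayscale flag has to travel with the request on every hop, otherwise the call from B to C could not return to the grayscale environment.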

This full-link grayscale release process verifies that the new versions of A and C work correctly together. Only traffic within a controlled range enters the grayscale environment, which avoids major losses or business failures caused by logic errors in the new version.

The preceding describes full-link grayscale at the RPC layer.


How can we implement full-link grayscale when the request link includes a message?

Assume inventory center C receives an order and sends a message to the Message Queue for Apache RocketMQ server, and application A consumes the message. If the consumption logic has been modified in the new version, the grayscale messages produced by application C must be consumed by the grayscale environment of A, that is, the grayscale consumer of RocketMQ. This closes the loop of the grayscale environment.

Design and Implementation of Message Grayscale


In the design of message grayscale, how does the producer mark a message as a grayscale message?

You can use one of the following methods to add a grayscale tag:

① If the request is identified as a grayscale request at the ingress, the message will be marked as a grayscale message.

② If the node is a grayscale node and traffic coloring is enabled, the message is marked as a grayscale message.

③ If an ingress request is not identified as grayscale traffic but the message payload belongs to grayscale traffic, the message is also marked as a grayscale message.

When producing a message, the producer can attach grayscale information to the message through either the tag or the user-property field. However, tags carry business semantics, and each message can have only one tag, so we do not recommend using tags. The user-property field is a key-value structure, which is more flexible and better suited to storing the grayscale identifier of the message.

The producer of RocketMQ provides a SendMessageHook that allows you to customize logic. When you produce a message, you can store the grayscale tag in the user-property. When the message is sent to the RocketMQ server, the grayscale information is included.
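The three marking rules and the user-property approach can be combined as in the sketch below. It deliberately avoids the real client types so that it runs on its own: `SketchMessage` stands in for a RocketMQ message, and `GrayTagSendHook.beforeSend` stands in for the logic one might place in a SendMessageHook. The property key `version` and value `gray` are assumptions. With the real client, the property would typically be set through `Message.putUserProperty`.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a RocketMQ message; only the user properties matter for this sketch.
class SketchMessage {
    final Map<String, String> userProperties = new HashMap<>();
}

// Hypothetical sketch of logic that could run in a send hook before a message is sent.
class GrayTagSendHook {
    static final String KEY = "version";  // assumed property key
    static final String GRAY = "gray";    // assumed property value

    private final boolean nodeIsGray;       // rule 2: this node is a grayscale node
    private final boolean trafficColoring;  // rule 2: traffic coloring is enabled

    GrayTagSendHook(boolean nodeIsGray, boolean trafficColoring) {
        this.nodeIsGray = nodeIsGray;
        this.trafficColoring = trafficColoring;
    }

    // requestIsGray: rule 1, decided at the ingress and propagated with the request.
    // payloadHitsGrayRule: rule 3, the result of a custom check on the message content.
    void beforeSend(SketchMessage msg, boolean requestIsGray, boolean payloadHitsGrayRule) {
        boolean gray = requestIsGray
                || (nodeIsGray && trafficColoring)
                || payloadHitsGrayRule;
        if (gray) {
            // The grayscale identifier goes into a user property, not the tag.
            msg.userProperties.put(KEY, GRAY);
        }
    }
}
```

Baseline messages are left without the property in this sketch, so a filter only needs to distinguish "has the gray value" from everything else.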


Consumer-side grayscale is more complex. Filtering can be done on either the client side or the server side.

As shown in the figure, the open-source RocketMQ client provides FilterMessageHook for custom filtering logic. You can use FilterMessageHook to filter out messages that should not be consumed in the current environment. Consumers in the formal environment and the grayscale environment use different consumer groups so that their offsets are kept separate. Then, add logic to the FilterMessageHook so that the grayscale environment drops every non-grayscale message it receives.

The formal environment pulls all messages and inspects the user-property field, whose key-value pairs carry the environment tag of each message. If a message belongs to the grayscale environment, the formal environment ignores it by removing it from the pulled list. The grayscale environment does the same for messages of the formal environment.
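A minimal sketch of this client-side filtering is shown below. Each message is represented only by its user-property map, and `EnvFilterSketch`, the key `version`, and the value `gray` are assumptions for illustration. In the real client, comparable logic would sit in a FilterMessageHook and operate on the list of pulled messages.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each message is reduced to its user-property map.
class EnvFilterSketch {
    static final String KEY = "version";  // assumed property key
    static final String GRAY = "gray";    // assumed property value

    // Removes, in place, every message that does not belong to this consumer's environment.
    static void filter(List<Map<String, String>> pulled, boolean consumerIsGray) {
        Iterator<Map<String, String>> it = pulled.iterator();
        while (it.hasNext()) {
            boolean messageIsGray = GRAY.equals(it.next().get(KEY));
            if (messageIsGray != consumerIsGray) {
                it.remove();
            }
        }
    }
}
```

Both consumer groups still receive the full message stream and discard part of it locally, which is the cost discussed next.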

In this scenario, the formal environment and the grayscale environment are two different consumer groups, and both must pull all messages locally. In an extreme case, the grayscale consumer has only one machine while the formal environment has 100, yet that single grayscale machine must pull and filter the full message volume, which puts it under great pressure. In addition, the RocketMQ server has to push each message twice, which increases the load on the server.

Client-side filtering has drawbacks. Can server-side filtering be used to avoid these drawbacks?

Server-side filtering in RocketMQ provides two modes: Tag filtering and SQL92 filtering.

In the implementation of Tag filtering, when a RocketMQ consumer subscribes, its subscription information is transmitted to the server. The subscription information is carried in SubscriptionData, whose relevant fields are:

  • topic
  • tagSet
  • expressionType=tag: the type of the expression. Here, it is a tag filter, so the value is TAG.
  • client version: the version number of the subscription

The client continuously sends heartbeats to the server, every 30 seconds by default. The SubscriptionData can change dynamically at runtime. If the tagSet or the expression type changes, the client version is increased. When the server receives a heartbeat whose SubscriptionData version number has changed, it knows that the subscription rule has changed. The server then updates the subscription it holds for that client and applies the new filter when pushing messages.
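The version check on heartbeat can be pictured with the following sketch. It is a simplified model under stated assumptions, not broker code: `SubscriptionSketch` and `BrokerSubscriptionTable` are hypothetical names, and the subscription is reduced to a topic, a tag set, and a version number.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the subscription data a client sends in its heartbeat.
class SubscriptionSketch {
    final String topic;
    final Set<String> tagSet;
    final long version;

    SubscriptionSketch(String topic, long version, String... tags) {
        this.topic = topic;
        this.version = version;
        this.tagSet = new HashSet<>(Arrays.asList(tags));
    }
}

// Hypothetical server-side table: one current subscription per (group, topic).
class BrokerSubscriptionTable {
    private final Map<String, SubscriptionSketch> current = new HashMap<>();

    // Returns true when the heartbeat carried a new or newer subscription and it was stored.
    boolean onHeartbeat(String group, SubscriptionSketch incoming) {
        String key = group + "@" + incoming.topic;
        SubscriptionSketch known = current.get(key);
        if (known != null && incoming.version <= known.version) {
            return false;  // same or older version: nothing changed
        }
        current.put(key, incoming);
        return true;
    }

    // Tag filter applied with whatever subscription is currently stored.
    boolean wouldPush(String group, String topic, String tag) {
        SubscriptionSketch s = current.get(group + "@" + topic);
        return s != null && s.tagSet.contains(tag);
    }
}
```

Because the comparison is on the version number only, a client that changes its tag set without increasing the version would not be noticed in this model.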

The processing logic on the server side is as follows:

In RocketMQ, a MessageFilter first matches against the consume queue. If that match succeeds, the tagsCode is compared. A message is pushed to the client only if both comparisons match.

The advantage of this process is that the grayscale environment no longer pulls all messages, which reduces the burden on consumers in the grayscale environment. At the same time, the server no longer pushes every message twice, which significantly reduces the load on the server.


If the grayscale information is stored in the user-property field, you can use SQL92 to filter on it.

On the server, ConsumerFilterManager stores a FilterDataMapByTopic for each topic, and each FilterDataMapByTopic stores the ConsumerFilterData of each consumer group. ConsumerFilterData contains the consumer group, topic, expression, and client version. This mirrors the information sent by the client, so the server can use it to filter.
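The two-level lookup described here (topic first, then consumer group) can be modeled as a nested map. The sketch below is an illustration of that structure only. `FilterRegistrySketch` and its nested `FilterData` are hypothetical names, not the broker's classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "topic -> (consumer group -> filter data)".
class FilterRegistrySketch {

    // Mirrors the four pieces of information listed in the text.
    static class FilterData {
        final String consumerGroup;
        final String topic;
        final String expression;
        final long clientVersion;

        FilterData(String consumerGroup, String topic, String expression, long clientVersion) {
            this.consumerGroup = consumerGroup;
            this.topic = topic;
            this.expression = expression;
            this.clientVersion = clientVersion;
        }
    }

    private final Map<String, Map<String, FilterData>> byTopic = new HashMap<>();

    // Stores (or overwrites) the filter of one consumer group on one topic.
    void register(FilterData data) {
        byTopic.computeIfAbsent(data.topic, t -> new HashMap<>())
               .put(data.consumerGroup, data);
    }

    // Returns the filter expression for the group on the topic, or null if none is registered.
    String expressionFor(String topic, String consumerGroup) {
        Map<String, FilterData> byGroup = byTopic.get(topic);
        FilterData data = (byGroup == null) ? null : byGroup.get(consumerGroup);
        return (data == null) ? null : data.expression;
    }
}
```

Keeping one entry per consumer group is what lets a grayscale group and a baseline group on the same topic hold different expressions.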

SQL92 filtering supports complex expressions. In addition to tag filtering, it can filter on user-property fields.


The preceding figure shows filtering rules written in SQL92:

  • To consume messages with tag A or B: (TAGS is not null and TAGS in ('A', 'B')).
  • To consume messages whose version in user-property is gray: (version = 'gray').
  • To consume messages whose tag is A and whose version in user-property is gray: (TAGS = 'A') and (version = 'gray').
  • To consume messages whose tag is A and whose version in user-property is green or blue: (TAGS is not null and TAGS = 'A') and ((version = 'green') or (version = 'blue')).
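Expressions of this shape can be assembled from a tag list and an environment value, as in the sketch below. `GrayExpressionSketch` is a hypothetical helper and the property name `version` is taken from the examples. In RocketMQ's SQL92 syntax, string equality is written with `=`, while `is` is reserved for `is null` and `is not null`. With the real client, the resulting string would typically be passed to `MessageSelector.bySql` when subscribing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds a SQL92 filter expression as a plain string.
class GrayExpressionSketch {

    // tags may be empty (no tag condition); version may be null (no environment condition).
    static String build(List<String> tags, String version) {
        List<String> parts = new ArrayList<>();
        if (!tags.isEmpty()) {
            StringBuilder quoted = new StringBuilder();
            for (String tag : tags) {
                if (quoted.length() > 0) {
                    quoted.append(", ");
                }
                quoted.append('\'').append(tag).append('\'');
            }
            parts.add("(TAGS is not null and TAGS in (" + quoted + "))");
        }
        if (version != null) {
            parts.add("(version = '" + version + "')");
        }
        return String.join(" and ", parts);
    }
}
```

The helper does no escaping, so it assumes tag and version values that contain no quote characters.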


In addition, several issues need to be considered to make message grayscale work well in real application scenarios.

  • If messages are produced from a separate thread pool, how can the grayscale tag be passed through? A common method is to store the grayscale tag in a field of the task when it is submitted to the thread pool, read that field when the task runs, and put the tag back into a thread local, which carries the grayscale tag across threads.
  • If a rollback occurs during a full-link grayscale release, some grayscale messages may be left without a grayscale consumer. What should we do with messages that have not been consumed? Two options are available. One is to find the unconsumed messages and perform compensation. The other applies when the consumption logic has not changed significantly: let the baseline environment consume the grayscale messages.
  • What can we do if repeated consumption occurs during message grayscale? We can ensure idempotence in the consumption logic or set more precise grayscale control logic.
  • Can the consumer's subscription behavior support dynamic changes? For example, when an application has no grayscale consumers, can its consumers in the formal environment consume grayscale messages?
  • If the formal environment can consume grayscale messages, is the default behavior to consume all messages or only messages in the formal environment?
  • How do we implement custom message grayscale logic? For example, traffic is not recognized as grayscale when it first comes in, but during sending, the message content turns out to hit a grayscale rule, so the message needs to be marked as a grayscale message. How do we apply custom grayscale logic in this case?
  • Assume the formal environment can consume grayscale messages. If a grayscale consumer comes online in the meantime, can the formal environment automatically detect whether it still needs to consume the grayscale messages? Supporting this avoids repeated consumption and also solves the problem of coordinating upstream and downstream releases.
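The cross-thread pass-through from the first item can be sketched with a task wrapper. `GrayContext` is a hypothetical holder for the grayscale tag, not part of any library mentioned here. The wrapper captures the tag on the submitting thread and restores it on the worker thread for the duration of the task.

```java
// Hypothetical holder for the grayscale tag of the current request.
class GrayContext {
    private static final ThreadLocal<String> TAG = new ThreadLocal<>();

    static String get() {
        return TAG.get();
    }

    static void set(String tag) {
        TAG.set(tag);
    }

    static void clear() {
        TAG.remove();
    }

    // Call on the submitting thread: the tag is captured here, stored in the
    // returned task, and put back into the thread local when the task runs.
    static Runnable wrap(Runnable task) {
        final String captured = TAG.get();
        return () -> {
            String previous = TAG.get();
            TAG.set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker thread's earlier state so pooled threads do not leak tags.
                if (previous == null) {
                    TAG.remove();
                } else {
                    TAG.set(previous);
                }
            }
        };
    }
}
```

A producer would submit `GrayContext.wrap(task)` instead of `task` to its thread pool. The `finally` block matters for pooled threads, which are reused and would otherwise carry one request's tag into the next task.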

Best Practices for Full-Link Grayscale Release of Microservices Engine (MSE)


The following figure shows the best practice for full-link grayscale in Alibaba Cloud MSE. Requests with name=xiaoming are grayscale traffic. After the rule is configured, the traffic passes through the grayscale environment of A to the baseline environment of B and then to the grayscale environment of C. The grayscale messages generated by C are received by the RocketMQ server and pushed precisely to the grayscale environment of A, which closes the loop.


As long as development is based on open-source standards, adopting MSE requires no code changes. It is non-intrusive and transparent to the business, the upgrade cost is zero, and the existing business architecture does not need to change. This is implemented mainly through One Java Agent. The Java Agent enhances bytecode so that grayscale-related logic is automatically injected into the sending and consumption behavior of open-source RocketMQ producers and consumers. After the service is deployed, the Java Agent is attached automatically and completes message grayscale without any change to business logic.

In the preceding figure, after the Java Agent is attached to the message producer, the agent automatically modifies the logic for sending messages. If a message is identified as grayscale traffic, a grayscale tag is automatically added to it before it is sent to the message server. If a message consumer detects at startup that it runs in a grayscale environment, the Java Agent automatically modifies its consumer group. In addition, when the subscription rule is sent, the consumer automatically subscribes only to grayscale messages. The message server matches messages against the filtering rules to ensure that only grayscale messages are pushed to grayscale consumers.

After you adopt MSE, you can manage message grayscale entirely in the console. You only need to configure governance rules in the MSE console. The rules are pushed to consumers through the configuration center, so consumers can change their consumption behavior dynamically to complete message grayscale. You do not need to modify any code in the process. Attaching the Java Agent is enough to enable all functions.

Click the link for more information:


In the demo, after MSE is adopted, the grayscale environment of application A consumes only grayscale messages, and the baseline environment consumes only baseline messages.


The preceding figure shows the demo. The left side is the baseline environment, which consumes only messages produced by the baseline application. Along the call chain through A, B, and C, messages are consumed only in the baseline environment. The log on the right shows the grayscale environment of A, where the consumed messages were produced in the grayscale environment of C. In this call chain, A is the grayscale environment. Because B has only a baseline environment, the call to B reaches the baseline environment of B and then continues to the grayscale environment of C. When a message is consumed, the grayscale tag continues to be passed along, so subsequent calls follow the correct path.

Demo Source Code:
