By Ling Chu, Development Engineer at Alibaba Cloud
Code review of the pull request that adds Apache RocketMQ support to Envoy as a network filter started at the end of 2019 and took four months to complete. In May 2020, the network filter was formally accepted by the official Envoy community (a CNCF project) and listed in Envoy's official documentation as the RocketMQ proxy filter. This made Apache RocketMQ the second Chinese middleware, after Dubbo, to be supported in the official Envoy community.
The following figure shows how messages are sent and received in Service Mesh.
The process of sending and consuming Apache RocketMQ messages in Service Mesh is described as follows:
Service Mesh is often called the next generation of microservices. This reflects the fact that microservices dominated the early development of Service Mesh, and that mesh-based microservices are more efficient. Problems appear when message queues and database products move to the mesh, and Apache RocketMQ is no exception.
The network model of Apache RocketMQ is composed of stateful network interactions and is more complex than that of the Remote Procedure Call (RPC).
In Service Mesh, the current SDK cannot use partitioned ordered messages, because the send and consume requests do not carry broker information such as the IP address, broker name, and broker ID; that information is erased. As a result, the mesh-based SDK cannot guarantee that a given message queue is always sent to and consumed from the same broker. Globally ordered messages are not affected: they are sent and consumed through a single broker, so the data plane has no other broker to balance load across, and no routing problem arises.
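A minimal sketch of why erasing the broker identity breaks partitioned ordering, assuming a hypothetical topic with two brokers and two queues each. The producer hashes a sharding key to one (broker, queue) pair, similar in spirit to RocketMQ's hash-based queue selector; the round-robin `mesh_forward` data plane and all names here are illustrative, not Envoy code.

```python
import zlib

# Hypothetical topology: two brokers, each hosting two queues of the same topic.
QUEUES = [("broker-a", 0), ("broker-a", 1), ("broker-b", 0), ("broker-b", 1)]


def select_queue(order_key, queues):
    """Hash the sharding key so that one key always maps to one (broker, queue)."""
    return queues[zlib.crc32(order_key.encode()) % len(queues)]


def mesh_forward(queue_id, brokers, request_index):
    """Toy data plane: it only sees the queue id (the broker identity was erased)
    and round-robins requests across every available broker."""
    return (brokers[request_index % len(brokers)], queue_id)


intended = select_queue("order-42", QUEUES)
# Two sends for the same key end up on different brokers: ordering is lost.
two_brokers = [mesh_forward(intended[1], ["broker-a", "broker-b"], i) for i in range(2)]
# With a single broker there is nothing to balance, so ordering survives.
one_broker = [mesh_forward(intended[1], ["broker-a"], i) for i in range(2)]
```

The single-broker case mirrors the globally ordered scenario: with only one possible target, the data plane's load balancing cannot reorder anything.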
For Apache RocketMQ pull and push consumers, the queue is the basic unit of load balancing. A native consumer knows how many other consumers in its consumer group subscribe to the same topic, and for a given topic and consumer group, each consumer uses its position among them to select an exclusive set of queues to consume. This is difficult to implement on the data plane, which cannot balance load at queue granularity. Therefore, the native load-balancing policies of Apache RocketMQ no longer apply in Service Mesh.
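The exclusive, position-based queue selection described above can be sketched as follows. This is a simplified averaging strategy written for illustration, not RocketMQ's actual rebalance source; `allocate_queues` is a hypothetical name.

```python
def allocate_queues(queue_ids, consumer_ids, me):
    """Give consumer `me` an exclusive, contiguous share of the topic's queues."""
    queues = sorted(queue_ids)
    consumers = sorted(consumer_ids)  # every consumer must see the same ordering
    index = consumers.index(me)       # this consumer's position in the group
    base, extra = divmod(len(queues), len(consumers))
    # The first `extra` consumers each take one additional queue.
    start = index * base + min(index, extra)
    size = base + (1 if index < extra else 0)
    return queues[start:start + size]


group = ["c1", "c2", "c3"]
shares = {c: allocate_queues(range(8), group, c) for c in group}
```

Each consumer needs the full, consistent membership list to compute its share. A sidecar that balances individual requests has no such view of the group, which is why this scheme does not carry over to the data plane.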
To address this issue, we use the POP consumption interface, which makes Apache RocketMQ compatible with the HyperText Transfer Protocol (HTTP). With POP consumption, a queue is no longer owned exclusively by a single consumer in the consumer group. Multiple consumers can consume the same queue simultaneously, which makes it possible to apply Envoy's native load-balancing policies in Service Mesh.
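A toy model of POP-style consumption, assuming simplified semantics: a popped message becomes invisible for a fixed time, an acknowledged message is never redelivered, and an unacknowledged one becomes visible again. The class and method names are hypothetical and are not the broker's actual API.

```python
class PopQueue:
    """Toy model of one broker queue consumed through POP (simplified, hypothetical)."""

    def __init__(self, bodies, invisible_time):
        self.bodies = list(bodies)          # index = offset, value = message body
        self.invisible_time = invisible_time
        self.hidden_until = {}              # offset -> time it becomes visible again
        self.acked = set()

    def pop(self, now, max_nums):
        """Return up to max_nums visible messages and hide them for invisible_time."""
        batch = []
        for offset, body in enumerate(self.bodies):
            if len(batch) >= max_nums:
                break
            if offset in self.acked or self.hidden_until.get(offset, 0) > now:
                continue
            self.hidden_until[offset] = now + self.invisible_time
            batch.append((offset, body))
        return batch

    def ack(self, offset):
        """Mark a message as consumed so it is never delivered again."""
        self.acked.add(offset)


q = PopQueue(["m0", "m1", "m2"], invisible_time=30)
consumer_a = q.pop(now=0, max_nums=2)  # two consumers share the same queue
consumer_b = q.pop(now=0, max_nums=2)
q.ack(0)
q.ack(2)                               # consumer_a never acknowledges m1
redelivered = q.pop(now=31, max_nums=5)
```

Because no consumer owns the queue, any consumer behind any sidecar can pop from it, which is what lets Envoy pick brokers with its own load-balancing policy.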
The right part of Figure 2 shows how POP consumers consume queues in Service Mesh. Envoy ignores the queue information sent by the SDK.
Within Alibaba Group, the NameServer stores gigabytes of topic routing information, which we abstract into CDS (Cluster Discovery Service) resources in Service Mesh. Therefore, if the topics that an application uses cannot be identified in advance, the control plane has to push the full CDS, which puts huge pressure on the control plane.
In the early days of Envoy, CDS was pushed in full: when the data plane started, the control plane delivered all xDS resources. Later, the control plane could control how often data was distributed, but each distribution was still a full push. Envoy then added delta xDS Application Programming Interfaces (APIs), which allow the control plane to send only incremental xDS data to the data plane. This greatly reduced the volume of data sent to sidecars on each update, but each sidecar still held the full xDS data set. For Apache RocketMQ, this meant keeping the full CDS information in memory, which was unacceptable. We therefore needed on-demand CDS, so that each sidecar obtains only the CDS resources it requires. At that time, Envoy already supported delta CDS and delta xDS. In fact, the delta xDS protocol was able to provide on-demand CDS, but this capability was not exposed by either the control plane or the data plane. To let the data plane request specific CDS resources from the control plane, we modified Envoy to expose the related interfaces and implemented a simple control plane based on delta gRPC. Envoy requests specific CDS resources and provides callback interfaces that run when those resources arrive.
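A toy model of on-demand CDS over delta xDS. The request fields `resource_names_subscribe` and `resource_names_unsubscribe` follow the delta xDS `DeltaDiscoveryRequest` message; everything else (`ToyControlPlane`, the dict-shaped `resources`, the cluster contents) is a hypothetical simplification. Cluster names equal topic names here only to mirror the early design that the Envoy maintainers later pushed back on.

```python
from dataclasses import dataclass, field

CLUSTER_TYPE_URL = "type.googleapis.com/envoy.config.cluster.v3.Cluster"


@dataclass
class DeltaDiscoveryRequest:
    # Field names follow the delta xDS DeltaDiscoveryRequest message.
    type_url: str
    resource_names_subscribe: list = field(default_factory=list)
    resource_names_unsubscribe: list = field(default_factory=list)


@dataclass
class DeltaDiscoveryResponse:
    type_url: str
    resources: dict  # simplified: resource name -> cluster config


class ToyControlPlane:
    """Holds every cluster, but sends a sidecar only the clusters it subscribed to."""

    def __init__(self, all_clusters):
        self.all_clusters = all_clusters
        self.subscribed = set()

    def handle(self, request):
        self.subscribed |= set(request.resource_names_subscribe)
        self.subscribed -= set(request.resource_names_unsubscribe)
        # Only the newly requested, known clusters are sent back.
        resources = {name: self.all_clusters[name]
                     for name in request.resource_names_subscribe
                     if name in self.all_clusters}
        return DeltaDiscoveryResponse(request.type_url, resources)


cp = ToyControlPlane({f"topic-{i}": {"lb_policy": "ROUND_ROBIN"} for i in range(1000)})
resp = cp.handle(DeltaDiscoveryRequest(CLUSTER_TYPE_URL, resource_names_subscribe=["topic-7"]))
sidecar_cache = dict(resp.resources)
```

The sidecar's memory now grows with the topics it actually uses rather than with every topic the control plane knows about.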
In Apache RocketMQ, when a SendMessage request arrives at Envoy, Envoy suspends processing of the request and asks the control plane for the corresponding CDS resource. Envoy resumes processing when the resource is returned.
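The suspend-and-resume flow can be sketched with a callback table, assuming a hypothetical `OnDemandClusters` cache on the sidecar side; it is not Envoy's cluster manager API. Requests for an unknown cluster are parked, only one subscription is sent per name, and every parked request resumes when the cluster arrives. The broker address is made up for the example.

```python
from collections import defaultdict


class OnDemandClusters:
    """Hypothetical sidecar-side cache that suspends requests until their cluster arrives."""

    def __init__(self, send_subscription):
        self.clusters = {}                  # cluster name -> config received so far
        self.waiting = defaultdict(list)    # cluster name -> suspended request callbacks
        self.send_subscription = send_subscription

    def with_cluster(self, name, resume):
        if name in self.clusters:
            resume(self.clusters[name])     # fast path: already cached
            return
        first_waiter = name not in self.waiting
        self.waiting[name].append(resume)   # suspend this request
        if first_waiter:
            self.send_subscription(name)    # ask the control plane once per name

    def on_cluster_added(self, name, config):
        self.clusters[name] = config
        for resume in self.waiting.pop(name, []):
            resume(config)                  # resume every request that was waiting


subscriptions, forwarded = [], []
od = OnDemandClusters(subscriptions.append)
for msg in ["m1", "m2"]:
    od.with_cluster("mesh", lambda cfg, msg=msg: forwarded.append((msg, cfg["address"])))
pending_before_update = list(forwarded)
od.on_cluster_added("mesh", {"address": "broker-a:10911"})
od.with_cluster("mesh", lambda cfg: forwarded.append(("m3", cfg["address"])))
```

Deduplicating the subscription matters when many SendMessage requests for the same topic arrive before the control plane answers.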
Earlier, we submitted a pull request with our on-demand CDS changes to the official Envoy community, but that approach turned out to be inappropriate. It ignored RDS (Route Discovery Service) and forcibly bound CDS to topics; the CDS resource names were even identical to the topic names. Senior maintainer htuch of the Envoy community rejected this approach. His point was that an actual CDS resource name may encode the load-balancing method and various prefixes and suffixes, such as inbound and outbound, and is not equivalent to the topic name. More importantly, the Envoy community intends CDS to be decoupled from services, and our approach was too hacky and ran counter to that intent.
Therefore, we need RDS to abstract topics: RDS determines the target CDS name from the topic and other request information. Because the data plane cannot know the target CDS name in advance at the code level, on-demand CDS cannot be keyed on topic names, and the clusters for all topics that are sent to or consumed from must still be delivered in full. Fortunately, this does not affect the code contribution to the official Envoy community.
```yaml
route_config:
  name: default_route
  routes:
    - match:
        topic:
          exact: mesh
        headers:
          - name: code
            exact_match: 105
      route:
        cluster: foo-v145-acme-tau-beta-lambda
```
The preceding snapshot shows that RDS routes requests for the topic mesh (with the code header equal to 105) to the CDS resource foo-v145-acme-tau-beta-lambda. The data plane knows only the topic name and cannot determine the matching CDS resource name in advance.
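A minimal sketch of how a route table like the one in the snapshot resolves a topic to a cluster. `Route` and `match_cluster` are hypothetical helpers that only mirror the `match.topic.exact`, `match.headers`, and `route.cluster` fields; they are not Envoy's implementation.

```python
from dataclasses import dataclass


@dataclass
class Route:
    topic_exact: str  # mirrors match.topic.exact
    headers: dict     # mirrors match.headers: header name -> exact value
    cluster: str      # mirrors route.cluster


def match_cluster(routes, topic, headers):
    """Return the cluster of the first route whose topic and headers all match."""
    for route in routes:
        if route.topic_exact != topic:
            continue
        if all(headers.get(name) == value for name, value in route.headers.items()):
            return route.cluster
    return None


routes = [Route("mesh", {"code": "105"}, "foo-v145-acme-tau-beta-lambda")]
```

The cluster name only appears after the route table is evaluated against a concrete request, so the data plane cannot derive it from the topic string alone.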
In hindsight, this mistake is easy to correct, but we did not fix it until later in the code review.
However, judging from current trends in the official Envoy community, on-demand xDS may be on the roadmap. xDS now fully supports delta updates, and Virtual Host Discovery Service (VHDS) has taken the lead in supporting on-demand loading.
A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It is responsible for the reliable delivery of requests through the complex topology of services that comprise a modern and cloud-native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the need for an application to be aware.
This definition was given by William Morgan, who coined the term "Service Mesh". In short, a service mesh is a network proxy layer that is transparent to users and takes on infrastructure responsibilities.
For Apache RocketMQ, the service mesh takes on responsibilities such as service discovery, load balancing, and traffic monitoring, which greatly reduces the responsibilities of both the caller and the callee.
RocketMQ Filter makes many concessions for compatibility. For example, so that the SDK can still obtain routes, RocketMQ Filter aggregates routing information into TopicRouteData and returns it to the SDK. Ideally, however, the SDK should not need to care about routing. An SDK designed for Service Mesh is more compact, with no consumer rebalancing and no service discovery for sending and consuming messages. In the future, the SDK and brokers may no longer need to implement message compression, schema validation, and other features. This may be the ultimate form of Apache RocketMQ in Service Mesh.
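A sketch of the aggregation step, assuming each upstream host in the cluster carries broker metadata (name, ID, queue counts, permission) and that the filter points every broker address at the local sidecar so traffic stays in the mesh. Both assumptions, the host dictionary shape, and `build_topic_route_data` are hypothetical; the output keys are only modeled loosely on RocketMQ's `TopicRouteData`.

```python
def build_topic_route_data(hosts, sidecar_addr, cluster_name):
    """Aggregate per-host broker metadata into a TopicRouteData-like dict (toy model)."""
    queue_datas = []
    broker_datas = {}
    for host in sorted(hosts, key=lambda h: (h["broker_name"], h["broker_id"])):
        name = host["broker_name"]
        if host["broker_id"] == 0:  # assume broker id 0 (master) carries the queue counts
            queue_datas.append({"brokerName": name,
                                "readQueueNums": host["read_queue_num"],
                                "writeQueueNums": host["write_queue_num"],
                                "perm": host["perm"]})
        entry = broker_datas.setdefault(
            name, {"cluster": cluster_name, "brokerName": name, "brokerAddrs": {}})
        # Assumption: every broker address is rewritten to the local sidecar.
        entry["brokerAddrs"][host["broker_id"]] = sidecar_addr
    return {"queueDatas": queue_datas, "brokerDatas": list(broker_datas.values())}


hosts = [
    {"broker_name": "broker-a", "broker_id": 0, "read_queue_num": 4, "write_queue_num": 4, "perm": 6},
    {"broker_name": "broker-a", "broker_id": 1, "read_queue_num": 4, "write_queue_num": 4, "perm": 6},
    {"broker_name": "broker-b", "broker_id": 0, "read_queue_num": 8, "write_queue_num": 8, "perm": 6},
]
route = build_topic_route_data(hosts, "127.0.0.1:10911", "DefaultCluster")
```

This is the kind of compatibility shim the paragraph describes: the SDK keeps its familiar route structure even though the mesh, not the SDK, decides where requests actually go.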
RocketMQ Filter already supports sending normal messages and consuming messages through POP. It still needs the following features:
The first pull request for RocketMQ Filter included almost all of the current features, which made it oversized: more than 8,000 lines of code. I am grateful to Tianqian for his efforts in the code review. He was very professional and helped us join the official Envoy community faster.
In addition, the Envoy community enforces strict Continuous Integration (CI) and requires more than 97% line coverage in unit tests. Because Bazel builds all dependencies from source and links them statically, a build without caches takes at least 30 minutes even with a 24-logical-core CPU running at full load. A full CI run takes two to seven hours, and the community has strict syntax and formatting requirements for newly submitted code. As a result, a minor code change in a pull request can require extensive changes to unit tests and formatting. Fortunately, unit tests also helped us find memory problems. Objectively speaking, these strict requirements are how the Envoy community keeps code quality under control. While completing the unit tests, we found and fixed many problems. Overall, the strictness is necessary: once a problem occurs in production C++ code, it is very difficult to debug and trace.
Shutian and I jointly completed the code for RocketMQ Filter. This was a valuable opportunity for me, as I had limited experience in open-source development. I am grateful to Shutian for his help and suggestions.