Service Mesh Goes from "Trending" to "Predictable"

This article outlines the achievements and insights gained during Service Mesh development in the last year.

By Li Yun (Zhijian)


In the past year, Alibaba Cloud has made solid progress in Service Mesh development. The motivation is twofold: a firm belief that Service Mesh will be a crucial part of future cloud computing infrastructure, and a wish to repay the technical debt accumulated in the past so that new, future-oriented technical value and experiences can be built on current technical thinking and best practices.

Most of the time, once we go deep into developing a new technology, we inevitably enter a "predictable" stage. In this stage, the question is no longer how to interpret the innovative parts of the new technology, but how to deal with the obstacles created by technical debt and how to create new value and experiences for different businesses, so that the new technology can be implemented in a win-win manner. This article summarizes Alibaba Cloud's achievements and insights in constructing Service Mesh.

Realizing Incremental Business Value is Essential to Development

Realizing (incremental) value is key to the development of Service Mesh as a new, platform-oriented basic technology. From a technical perspective, once the frequently changing parts of the SDK, which were built with framework thinking, are moved into the Sidecar of the Service Mesh, middleware technology can evolve and upgrade rapidly without the business noticing. At the same time, the problems in the distributed application architecture will be addressed under the guidance of platform-oriented, systematic thinking rather than through the various frameworks used in the past.

From a business perspective, the critical question when adopting a new technology is which current pain points it solves: whether machine costs are significantly reduced, whether stability is dramatically enhanced, and whether O&M and R&D efficiency is improved. These benefits are generally referred to as business value. For the development of Service Mesh, it is essential to realize (incremental) business value and to improve the new technology on that basis. Otherwise, it is difficult for a team developing a new technology such as Service Mesh to keep delivering achievements at each stage, which is crucial for holding the whole team together. When sufficient business value is realized, team members feel "needed" and further recognize the value of their work.


Over the past year, our development strategy has changed from "large-scale technological implementation first and then business value realization" to "business value realization first and then large-scale technological implementation." Under the former strategy, the implementation of Service Mesh ran into three main problems:

1) Too little incremental business value was realized, since only the existing capabilities in the Java SDK had been moved into Service Mesh.

2) The resource overhead could not be ignored.

3) The technology was not mature enough: there were no tool-based methods for locating and troubleshooting problems.

As long as these three problems are not solved properly, it is very difficult to press ahead with the large-scale implementation of Service Mesh in core applications, even with support from top to bottom within Alibaba Cloud. Therefore, value realization has been given priority.

In this way, some business teams have moved from worrying about the above three issues to actively thinking about how to use the implementation of Service Mesh as an opportunity for a major upgrade of their departments' business traffic governance capability. The business teams then quickly pinpointed their pain points and created new solutions together with the Service Mesh construction team. In the end, the two teams went from a Party A and Party B relationship to partners working hand in hand in a win-win scenario. The following are some of the insights from this experience during the last year:

  • Irrespective of the new technology, realizing its incremental business value must take priority over its large-scale implementation. No matter how advanced a technology is, it is only a vision until its incremental value is fulfilled. People will not always buy into such a technology, which means that market rules still apply during technological implementation. In addition, it takes time for a new technology to grow. No business will be willing to be the test subject in this process if no incremental business value is realized.
  • The development of basic technologies cannot rely on the basic technology team alone. The business team's active participation in seeking solutions to business pain points is an effective "catalyst" for new technologies. The engagement of the business team compensates for the basic technology team's lack of practical business experience. As a result, close collaboration between the two teams brings win-win benefits. The basic technology team needs to strengthen its cooperation with the business team to avoid working blindly in isolation.

Non-intrusive Solution: The Key Method but Not the Final Solution

In technological evolution, we should try our best to realize value without imposing any business transformation costs. This explains why iptables has been used for traffic hijacking since the launch of Istio. Fully aware of the importance of non-intrusive solutions, Alibaba Group adopted a non-intrusive solution early in the internal implementation of Service Mesh, and has since extended it to support traffic passthrough in a non-intrusive way.

At the beginning of last year, the technical solution for implementing Service Mesh inside Alibaba Group did not fully consider compatibility. For historical reasons, Dubbo serialization supports Hessian2, Java, and other protocols, while Service Mesh supported only Hessian2 because it is the mainstream protocol. If a meshed application called an application that used an unsupported serialization protocol, the implementation of Service Mesh would fail.

Furthermore, solving this compatibility problem helps build the overall Service Mesh capabilities: covering a wider range of scenarios either realizes value directly or lays a solid foundation for large-scale implementation, of which large-scale O&M support is one example. In addition, once all applications are meshed, the value can be realized at least on the links serialized by Hessian2. Thus, the links on which the value can be realized will not be shortened, and the value will not be weakened, because some applications cannot be meshed.

To this end, Service Mesh has to support all RPC serialization protocols. The following figure shows the extended solution. The figure contains Services A, B, and C, of which Service A is meshed. Note that the Sidecar (Envoy) adds traffic passthrough on top of its original capabilities: for protocols other than Hessian2, the Sidecar only collects the necessary statistics and passes the traffic through. In addition, the RPC SDK used by Service A is full-featured, with its own service governance capabilities. In other words, service connectivity is guaranteed because the SDK routes the request itself and simply sends the packet to the Sidecar for passthrough.
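The passthrough behavior described above can be illustrated with a minimal Python sketch. It is not the Envoy implementation: the `Request` and `Sidecar` classes, the `GOVERNED_SERIALIZATIONS` set, and the returned strings are hypothetical names chosen for illustration, and the only behavior taken from the article is that Hessian2 traffic is governed while other serializations are counted and passed through to the destination the SDK already chose.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumption for this sketch: the Sidecar can only parse Hessian2 traffic.
GOVERNED_SERIALIZATIONS = frozenset({"hessian2"})


@dataclass
class Request:
    serialization: str  # e.g. "hessian2" or "java"
    destination: str    # address already chosen by the full-featured SDK
    payload: bytes      # opaque message bytes, never decoded in this sketch


@dataclass
class Sidecar:
    stats: Counter = field(default_factory=Counter)

    def handle(self, request: Request) -> str:
        """Describe how the request is handled: governed or passed through."""
        # Statistics are collected for every request, whether parsed or not.
        self.stats[request.serialization] += 1
        if request.serialization in GOVERNED_SERIALIZATIONS:
            # Placeholder: a real mesh would apply its own routing rules here.
            return f"governed:{request.destination}"
        # Unsupported serialization: forward untouched to the SDK's choice.
        return f"passthrough:{request.destination}"
```

With this shape, a call serialized with Java serialization still reaches its destination and is still visible in the Sidecar's statistics, which matches the connectivity guarantee the paragraph describes.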


In the long run, the non-intrusive solution is definitely not the final one for Dubbo, an RPC protocol with built-in service governance. The Dubbo SDK needs to know whether it is working in Service Mesh mode, because in this mode responsibilities such as service governance are delegated to the Sidecar, which saves the memory and CPU the SDK would otherwise spend on service governance.

On this basis, Alibaba Cloud has devoted a considerable amount of effort to the final cloud-native solution over the past year to ensure Service Mesh works well with Dubbo 3.0. The following figure shows the Service Mesh on Dubbo 3.0.


There is a significant change in the final solution. Dubbo 3.0 SDK is more friendly to Service Mesh considering the cloud-native trend. Let's understand the main changes related to Service Mesh.

The Triple protocol, which is built on gRPC, is adopted. Everything the Sidecar needs to perceive or modify is placed in the protocol header, so the Sidecar no longer has to serialize or deserialize messages at all. Thus, the Sidecar is totally unaffected by the serialization protocol used for the message bodies.

Disaster recovery is provided for faults in Service Mesh. The Dubbo 3.0 SDK has two modes, Thin and Fat, which correspond to the Service Mesh mode and the traditional (non-mesh) mode, respectively. With the Thin SDK, governance is left to the Sidecar, so the SDK's CPU and memory overheads are reduced to a minimum. The Fat SDK provides comprehensive routing governance capabilities. When Service Mesh fails, the SDK routes the service call itself.

In the Service Mesh mode, the Sidecar is in charge of service registration and deregistration. In other words, when the SDK works in Service Mesh mode, the SDK is completely unaware of the back-end registry. In this case, Service Mesh can ignore the underlying infrastructure details as much as possible.

iptables is no longer used for traffic hijacking. Instead, the SDK communicates with the Sidecar through inter-process communication (TCP/IP network loopback or Unix Domain Socket) on the local machine. Traffic hijacking exists to make sure that Service Mesh can implement traffic governance without affecting the business or requiring an SDK upgrade. Since adopting Dubbo 3.0 involves upgrading the SDK anyway, iptables is removed to avoid the additional stability and performance problems it could introduce.
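The first change listed above, keeping everything the Sidecar needs in the protocol header, can be sketched in a few lines of Python. This is an illustration of the idea only, not the Triple wire format: the `Frame` type, the `forward` function, and the `x-service` and `x-routed-to` header names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Frame:
    headers: Dict[str, str]  # metadata the Sidecar may read or rewrite
    body: bytes              # serialized message; never decoded here


def forward(frame: Frame, routes: Dict[str, str], default_cluster: str) -> Tuple[str, Frame]:
    """Pick a target cluster from headers only and return the frame to send."""
    # Hypothetical header name; the routing decision needs no body access.
    service = frame.headers.get("x-service", "")
    cluster = routes.get(service, default_cluster)
    # Only the headers are rewritten; the body bytes are copied through as-is.
    routed_headers = {**frame.headers, "x-routed-to": cluster}
    return cluster, Frame(headers=routed_headers, body=frame.body)
```

Because `forward` never inspects `body`, the same routing logic works whether the payload is Hessian2, Java serialization, or anything else, which is the property the article attributes to the header-based design.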

Note that there is a prerequisite for the SDK to switch back to Fat mode and route service calls itself when Service Mesh fails: the SDK and Service Mesh must be equally capable of meeting the basic requirements of a disaster recovery scenario. In the long run, the service governance of Service Mesh will evolve faster than that of the SDK, but the features related to disaster recovery still need to be implemented in the SDK. Nevertheless, Service Mesh should ensure stability systematically, and the SDK's disaster recovery capability should apply to a single machine rather than to the whole application cluster.
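The Thin-to-Fat switch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Dubbo 3.0 SDK: `DualModeClient`, `MeshUnavailable`, and the two injected callables are hypothetical, and how a real SDK detects Sidecar failure is not covered by the article.

```python
from typing import Callable

Transport = Callable[[str, bytes], bytes]


class MeshUnavailable(Exception):
    """Raised by the (hypothetical) Sidecar transport when the local Sidecar is down."""


class DualModeClient:
    """Calls through the Sidecar first and routes in-process only on mesh failure."""

    def __init__(self, via_sidecar: Transport, via_sdk_routing: Transport) -> None:
        self._via_sidecar = via_sidecar          # Thin path: governance in the Sidecar
        self._via_sdk_routing = via_sdk_routing  # Fat path: governance in the SDK
        self.fallback_count = 0

    def invoke(self, service: str, payload: bytes) -> bytes:
        try:
            return self._via_sidecar(service, payload)
        except MeshUnavailable:
            # The fallback is local to this process, so a single failing
            # Sidecar does not switch the whole application cluster.
            self.fallback_count += 1
            return self._via_sdk_routing(service, payload)
```

Keeping the fallback decision inside each client instance reflects the point that SDK disaster recovery should apply to a single machine rather than to the entire cluster.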

Finally, as mentioned earlier, the development of Service Mesh requires the participation of the business side, so that new technologies are pushed forward in the process of solving business pain points. Solving business problems requires business transformation, which means that a non-intrusive solution alone cannot fully realize the business value. In other words, business teams that intend to adopt Service Mesh should not treat the need for business transformation as the central question. What matters is whether the business pain points are solved and whether, during the transformation, the technology is upgraded to a level that lays a good foundation for future business development. This needs to be weighed carefully when implementing Service Mesh. That said, based on our experience, we highly recommend starting with non-intrusive solutions that leave the business unaffected. Non-intrusive solutions are also a good choice for enterprises with similar interests that are exploring Service Mesh or cloud-native technologies while requiring connectivity between old and new applications and a progressive evolution to new technologies.

Realization of Incremental Business Value

In the past year, Alibaba Group has found two ways to realize business value with Service Mesh. Once construction is completed in the next few months, Service Mesh will be implemented on a large scale on thousands of application instances.

The first is to move the regionalization and multi-group routing governance capabilities of the international middle ground into Service Mesh, achieving unified traffic routing governance and application-level data center disaster recovery. In the past, the Java applications of the international middle ground specified their routing policies through annotations. Whenever a routing policy changed, the code had to be modified and the application relaunched, which was quite inconvenient. What's more, disaster recovery for the international middle ground could only be implemented at the data center level: when switching traffic, all traffic in the data center had to be shifted away.

With the introduction of Service Mesh, routing policies are no longer specified through annotations in Java applications but are pushed down to Service Mesh as configuration. As such, only a new YAML file needs to be dynamically delivered each time an application's routing policy changes, which completely decouples the policy from the application. Furthermore, since the routing policy is application-oriented, traffic shifting between data centers can easily be performed at application granularity, which improves the agility of disaster recovery and reduces the risk of traffic shifting.
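To make the contrast with annotation-based routing concrete, here is a minimal Python sketch of a routing policy held as data and swapped at runtime. It does not reproduce Alibaba's YAML schema, which the article does not show: the `PolicyRouter` class, the `data_center` key, and the plain dictionaries standing in for parsed YAML are all assumptions.

```python
from typing import Dict

# Stand-in for a parsed YAML document: application name -> routing settings.
Policy = Dict[str, Dict[str, str]]


class PolicyRouter:
    """Resolves the target data center per application from delivered config."""

    def __init__(self, policy: Policy) -> None:
        self._policy = policy

    def apply(self, policy: Policy) -> None:
        # Delivering a new policy replaces the old one without touching
        # application code or restarting the application.
        self._policy = policy

    def data_center_for(self, app: str, default: str) -> str:
        return self._policy.get(app, {}).get("data_center", default)


router = PolicyRouter({"orders": {"data_center": "dc-a"}, "cart": {"data_center": "dc-a"}})
# Shift only the "orders" application to another data center.
router.apply({"orders": {"data_center": "dc-b"}, "cart": {"data_center": "dc-a"}})
```

In this sketch, moving `orders` to `dc-b` leaves `cart` where it was, which is the application-level granularity that data-center-wide switching could not offer.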

As the business side, the international middle ground was very active in thinking about making full use of this precious technical upgrade when exploring business value with the Service Mesh team. All the components with single-point service governance were put into the Sidecar of Service Mesh to work in a distributed manner, which removed the past burden on O&M and greatly improved the overall business stability.

The second point is to apply the flexible traffic governance capability of Service Mesh to the governance of the development environment of the new retail business group and dynamically create an independent development environment according to the needs of developers. To better support development, Alibaba Group internally builds a daily environment that is completely independent of the production environment. It deploys online applications into both environments for the development and debugging of each application. Each application in the daily environment may be changed as required by the development. To decouple the interaction between applications, a baseline environment is further established in the daily environment. Each application must be developed and debugged in the development environment isolated from the baseline environment instead of using the baseline environment directly. It is very challenging and meaningful to maintain the daily development environment for the developers when there are tens of thousands of applications and developers and hundreds of daily application changes.

In the past, the development environment isolation technology was built on frameworks, which required each type of traffic (such as RPC, message, cache, and database) to plug into the same isolation framework at the protocol level, making it quite hard to evolve and maintain. In multi-language scenarios, a Java-centered capability is of little use. In addition, some isolation scenarios were hard to implement without platform-oriented technologies such as Service Mesh.

Service Mesh is designed for traffic governance. Its core capability is to implement dynamic and flexible traffic isolation and routing. The VirtualService and DestinationRule in Istio were extended to abstract TrafficLabel, a brand-new CRD. Traffic and application machines are dynamically labeled by issuing YAML files. At the same time, Envoy performs routing based on the traffic label and machine label to flexibly and quickly build the development environment supportive of multi-language applications as developers require. The following figure shows the application deployment and traffic topology when applications v1.1 and v1.2 are developed simultaneously on Service Mesh.


In the preceding figure, a YAML file is issued to label specific traffic and applications. Envoy directs traffic to the machines whose label matches the traffic label. When no machine carries the corresponding label, a fallback mechanism directs the traffic to the baseline environment. For example, when application B in development environment 1 calls application C, the traffic goes straight to the baseline environment since no machine of application C is labeled with tag 1.
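The label matching and baseline fallback described above can be sketched in Python as follows. This is not the TrafficLabel CRD or Envoy's routing code: the `select_instances` function, the `baseline` label name, the `tag1` label, and the addresses are hypothetical values chosen to mirror the example in the text.

```python
from typing import Dict, List

BASELINE = "baseline"  # hypothetical label for the shared baseline environment

# Application name -> machine label -> instance addresses.
Topology = Dict[str, Dict[str, List[str]]]


def select_instances(app: str, traffic_label: str, topology: Topology) -> List[str]:
    """Return instances matching the traffic label, else the baseline instances."""
    by_label = topology.get(app, {})
    # Prefer machines carrying the same label as the traffic; an empty or
    # missing entry falls back to the baseline environment.
    return by_label.get(traffic_label) or by_label.get(BASELINE, [])


topology: Topology = {
    "B": {"baseline": ["10.0.0.1"], "tag1": ["10.0.1.1"]},
    "C": {"baseline": ["10.0.0.2"]},  # no machine labeled tag1
}
```

Here traffic labeled `tag1` stays inside development environment 1 when it reaches application B, but a call to application C lands on the baseline instance, so a developer only needs to deploy the applications that actually changed.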

It can be predicted that this ability built by Service Mesh paves the road for future tests in production. In the future, the traffic isolation environment built on Service Mesh will help save the machine cost required to build an independent development environment and also provide a new idea for exploring the new generation of a safe production environment. We still have a long way to go.

Alibaba Group has now implemented Service Mesh internally on tens of thousands of application instances. The data plane, control plane, and O&M capabilities needed for large-scale applications are in place.

Software Lifecycle Theory Not to Be Ignored

The author takes this opportunity to share his insights into software lifecycle theory, hoping that it helps readers better understand the development of new technologies and seize opportunities in the cloud-native era.

Software is very likely to be dismissed as bloated and error-prone if it is viewed statically. Such a view ignores the fact that software has a lifecycle. Software, like human beings, goes through periods of formation, growth, maturity, and recession, as shown in the figure below.

In the figure, the vertical axis indicates the adaptability of the software to new requirements, which refers to the friendliness of the software to the implementation of new requirements. Behind it lies whether the relationships between concepts are clear and whether people's understanding of these conforms to intuition and common sense. Essentially, it refers to the software design quality. The straight line in the figure only represents a trend, while in reality, it is more of a curve with fluctuations.


The software enters the maturity phase when its functionality fulfills its original purposes and application scenarios. It enters the recession stage when new business scenarios appear and the software's concept abstraction (also known as "architecture" or "leading software design") is unfriendly to the requirements of those scenarios. As a result, the newly developed code becomes useless. Long-term software recession leads to the continuous deterioration of software quality and the continuous decline of the coding experience.

The software lifecycle exists because software engineers deepen their understanding of requirements only step by step over time. The initial software design can hardly meet long-term business needs, as the business gets more complicated day by day. In other words, software recession is inevitable, and technical debt is a natural product of software development.

The key to getting out of the recession is to lead the software into a new lifecycle, that is, to "repay the technical debt" by refactoring or by using new ideas and new technologies to solve problems. Engineers strengthen their abilities when they repay technical debt through continuous refactoring. In this process, engineers re-abstract the concepts based on their own understanding of the business or requirements, which is instrumental in developing the design ability needed for large software systems.


The software lifecycle theory tells us that good software is software that can withstand change, not software that remains unchanged. Of course, the ability to change must be backed by engineering capabilities. Methods including unit testing, integration testing, and system testing are required to guarantee software quality. Without these methods it is hard to support change, and software development stalls.

Even platform-oriented technologies like Service Mesh must take the software lifecycle theory seriously. Platform-oriented thinking seeks to solve common problems while balancing common and custom needs. When a platform-oriented technology cannot itself evolve rapidly enough to meet the requirements of technology or business development, it naturally becomes an obstacle rather than a booster to business development.


In the coming year, we will continue to explore the value of Service Mesh. While realizing the value of RPC traffic management, we will also complete the Service Mesh implementation for RocketMQ and other middleware to further extend the traffic governance capability of Service Mesh and realize more incremental business value.

We are committed to building Service Mesh into Alibaba Group's internal basic technologies, since Alibaba Group's own business scale makes Service Mesh all the more necessary. In the future, we will also cooperate with more companies with similar interests, hoping to step firmly into the cloud-native era hand in hand with our clients based on the experience gained so far.
