Business-Driven Upgrade of Alibaba Group – The Evolution of Dubbo 3.0

By Yuanyun

Trinity

At the end of 2020, Alibaba Cloud put forward the "Trinity" concept: integrating its proprietary technologies, open-source projects, and commercial products into one unified technology system in order to maximize the value of the technology.

After years of being tested under Double 11 traffic, Alibaba Group's internal HSF framework has built up core strengths in high performance and high availability. Dubbo, one of the most popular service governance frameworks in China and abroad, is known for its open-source friendliness.

As the first solution to use the Trinity architecture, Dubbo 3.0 carries high hopes at Alibaba. It integrates the features of the internal HSF framework and has the core capabilities of high performance and high availability. We hope to use it for internal adoption and to unify the technology stack. It has already been implemented on a large scale in Kaola, and it will be implemented in many core scenarios in the future, carrying complex business scenarios such as the 618 Shopping Festival and Double 11.

Benefits of Dubbo 3.0

Before specifying the details of the changes in Dubbo 3.0, let's discuss the benefits of upgrading to Dubbo 3.0 from two aspects:

Dubbo 3.0 focuses on improving performance and stability in large-scale cluster practice. By optimizing how data is stored, it reduces resource consumption on a single machine, which keeps a large cluster stable as it scales out horizontally. Dubbo 3.0 also puts forward the flexible cluster concept, which can effectively ensure and improve the overall reliability and resource utilization of the full call chain across heterogeneous systems.
Dubbo 3.0 marks the milestone at which Dubbo fully embraces cloud-native. Dubbo has a huge user base in China and abroad, and with the advent of the cloud-native era, demand for cloud migration from these users is increasing. Dubbo 3.0 will therefore provide a complete set of solutions, migration paths, and best practices to help enterprises carry out their cloud-native transformation and enjoy its benefits.

Business Benefits

From the perspective of business applications, what specific benefits can one gain by upgrading to Dubbo 3.0?

First, in terms of performance and resource utilization, Dubbo 3.0 can effectively reduce the additional resource consumption caused by the framework, improving resource utilization significantly.

From the single machine perspective, Dubbo 3.0 can save about 50% of memory usage. From the cluster perspective, it can support millions of cluster instances, laying the foundation for larger business scaling in the future. Dubbo 3.0's support for the reactive stream communication model can lead to a significant increase in the overall throughput in some business scenarios.

Second, Dubbo 3.0 brings more possibilities for business architecture upgrades. The most intuitive is the upgrade of communication protocols, which brings more options to the business architecture.

The original Dubbo protocol constrained how microservices could be accessed. For example, mobile and frontend services have to go through protocol conversion at the gateway layer to reach Dubbo's backend services. Likewise, the Dubbo protocol only supports request-response communication, so it cannot serve scenarios that require streaming or reverse communication.

Finally, Dubbo 3.0 supports the cloud-native upgrade of business products. Whether the upgrade is a passive change forced by an underlying infrastructure upgrade or a proactive step taken by the business to solve pain points, Dubbo 3.0 provides a cloud-native solution that helps business products move to cloud-native quickly.

Dubbo 3.0 Overview

After clarifying the benefits of upgrading to Dubbo 3.0, let's take a look at the specific changes in Dubbo 3.0:

A new service discovery model is supported. Dubbo 3.0 moves to an application-level model, which matches the mainstream cloud-native design and avoids interoperability problems caused by differing models, and it optimizes the storage structure. The new model organizes data far more compactly, which effectively improves performance and cluster scalability.
Triple, the next-generation RPC protocol, is proposed. This new open protocol is built on HTTP/2 and is fully compatible with the gRPC protocol. Being based on HTTP/2 makes it gateway-friendly and able to pass through intermediaries easily. Full compatibility with gRPC gives it a natural advantage in multi-language interoperability.
Unified governance rules are proposed. This set of rules is designed for cloud-native traffic governance and can be used across scenarios such as traditional SDK deployment, Service Mesh deployment, VM deployment, and container deployment. Governing all scenarios with one set of rules can reduce the cost of traffic governance significantly and unify global traffic governance under heterogeneous systems.
Solutions for accessing Service Mesh are provided. For Mesh scenarios, Dubbo 3.0 proposes two access methods. One is the Thin SDK mode, where the deployment model is the same as the current mainstream Service Mesh deployment. Dubbo is slimmed down: governance functions that duplicate those of the Mesh are switched off, and only the core RPC capabilities are retained. The other is the Proxyless mode, where Dubbo takes over the duties of the Sidecar, communicating with the control plane directly and applying cloud-native traffic management capabilities based on Dubbo 3.0's unified governance rules.

Application-Level Service Registration and Discovery

Application-Level Service Discovery Model

The prototype of the application-level service discovery model was first proposed in Dubbo 2.7.6. After several iterations, a relatively stable model was settled on for Dubbo 3.0.

In Dubbo 2.7 and earlier versions, applications perform service registration and discovery at the interface granularity. Each interface will correspond to a piece of data on the registry, and different machines will be registered with metadata information belonging to the current machine or interface-level configuration information, such as serialization, data center, unit, and timeout configuration.

When servers are restarted or released, the address list of every service they provide changes independently, at interface granularity. For example, suppose a gateway application relies on 30 interfaces of an upstream application. When the upstream application is released, all 30 corresponding address lists go offline and come back online.

Treating interfaces as first-class citizens for registration and discovery dates from the earliest splitting of SOA and microservices. It provides independence and dynamic change at the level of a single service or a single node. As the business evolves, the number of services that a single application relies on keeps increasing, and the number of machines per service provider grows as well, for business or capacity reasons. From the perspective of the client as a whole, the total number of service addresses it depends on increases rapidly. Given this, the design of the registration and discovery process is worth optimizing.

Two observations are worth noting here:

With the splitting of monolithic applications into microservices nearing completion, large-scale service splitting and reorganization is no longer a pain point. Most interfaces are provided by only one application or by a fixed set of applications.
Much of the content of the URLs used to describe addresses is redundant, such as timeout and serialization settings. These settings are repeated in every URL but change very rarely.

Based on these observations, application-level registration and discovery was proposed, with the application as the basic unit of registration and discovery. The difference from the interface level is easiest to see with numbers. With interface-level registration, an application that provides 100 interfaces must register 100 nodes in the registry; if the application runs on 100 machines, each release means 10,000 virtual node changes for its clients. With application-level registration, the same application needs only one node, and each release causes only 100 virtual node changes. For applications that rely on a large number of services and machines, this reduces the scale to somewhere between one-tenth and one-hundredth of the original, and memory consumption drops by at least half.
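The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a counting sketch: the function names are made up for illustration and do not correspond to any Dubbo API.

```python
def registry_nodes(interfaces: int, app_level: bool) -> int:
    """Registry nodes one application registers under each model."""
    # Interface level: one node per interface. Application level: one node in total.
    return 1 if app_level else interfaces


def address_changes_per_release(interfaces: int, machines: int, app_level: bool) -> int:
    """Virtual node changes that clients observe when every machine is released once."""
    return registry_nodes(interfaces, app_level) * machines


interfaces, machines = 100, 100
before = address_changes_per_release(interfaces, machines, app_level=False)
after = address_changes_per_release(interfaces, machines, app_level=True)
print(before, after, before // after)  # 10000 100 100
```

The reduction factor equals the number of interfaces the application exposes, which is why applications with many interfaces benefit the most.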

However, the technical design must preserve correct behavior and allow existing businesses to upgrade smoothly. The upgrade to application-level registration and discovery therefore has to match the capabilities of interface-level registration and discovery. Whether or not the client has been upgraded, and whether or not application-level registration and discovery is enabled, business calls must not be affected.

We have designed a new component to provide this guarantee: the metadata center, which manages two kinds of data:

Interface Application Mapping: By reporting and querying the mapping between interfaces and applications, the client can determine whether to use application-level discovery for an interface, which avoids business code changes.
Application-Level Metadata Snapshot: The configurations of different interfaces within one application can differ, so the data cannot simply be merged. The application-level solution therefore introduces the concept of a metadata snapshot. Each time an application is released, a metadata snapshot is generated. The snapshot contains the metadata version of the current application and the configuration of all interfaces it provides. Only the snapshot ID is stored in the URL, so the configuration can still change dynamically while the memory pressure of storing address data is reduced.
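The two kinds of data can be sketched as a small in-memory model. This is a minimal illustration under stated assumptions, not Dubbo's implementation: the class `MetadataCenter`, its methods, the interface names, and the choice of a SHA-256 hash as the snapshot ID are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    port: int
    revision: str  # snapshot ID carried in the instance URL instead of full config


def snapshot_revision(interface_configs: dict) -> str:
    """Derive a snapshot ID from the per-interface configuration (assumed scheme)."""
    canonical = json.dumps(interface_configs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MetadataCenter:
    def __init__(self):
        self._interface_to_apps = {}  # interface name -> set of application names
        self._snapshots = {}          # (application, revision) -> interface configs

    def report_mapping(self, interface: str, app: str) -> None:
        self._interface_to_apps.setdefault(interface, set()).add(app)

    def apps_for(self, interface: str) -> list:
        return sorted(self._interface_to_apps.get(interface, ()))

    def publish_snapshot(self, app: str, interface_configs: dict) -> str:
        revision = snapshot_revision(interface_configs)
        self._snapshots[(app, revision)] = interface_configs
        return revision

    def snapshot(self, app: str, revision: str) -> dict:
        return self._snapshots[(app, revision)]


center = MetadataCenter()
configs = {
    "com.example.OrderService": {"timeout": 3000, "serialization": "hessian2"},
    "com.example.RefundService": {"timeout": 1000, "serialization": "hessian2"},
}
for name in configs:
    center.report_mapping(name, "order-app")
rev = center.publish_snapshot("order-app", configs)

# Three machines of the same release share one snapshot ID.
instances = [Instance(f"10.0.0.{i}", 20880, rev) for i in range(1, 4)]

# Consumer side: interface -> application -> one snapshot lookup per distinct revision.
app = center.apps_for("com.example.OrderService")[0]
timeout = center.snapshot(app, instances[0].revision)["com.example.OrderService"]["timeout"]
print(app, timeout)  # order-app 3000
```

Because every instance of a release carries the same revision, a consumer fetches the interface configuration once per revision rather than once per address.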

Finally, since the new service discovery model is highly similar to the service discovery models of Spring Cloud, Service Mesh, and other systems, Dubbo can interoperate with the nodes of those systems at the registry level.

Dubbo 3.0 – Endorsed by Cloud-Native and Alibaba with Ease of Use

Dubbo 3.0 is the ideal microservice framework for the cloud-native era. Several trends are currently visible: Kubernetes has become the de facto standard for resource scheduling, Mesh has become mainstream, and Kubernetes deployments are growing rapidly in scale. These trends place higher requirements on Dubbo.

Firstly, users need a more convenient way to deploy and invoke Dubbo services on Kubernetes. A unified protocol and data exchange format are essential to solving this problem. Secondly, the popularity of Mesh brings diversity issues, such as how native Dubbo and Mesh-based Dubbo can coexist and how multi-language scenarios can be supported. Lastly, the increase in scale will bring greater challenges to the entire Dubbo architecture, since components such as the registry and the client will have to handle more data and more calls.

The top priority of the evolution of Dubbo is to provide more efficient services while maintaining stability.

These challenges of the cloud-native era have contributed to the development of the next generation of Dubbo, including new protocols, Kubernetes infrastructure support, multi-language support, and scalability.

1. Next-Generation RPC Protocol

The most basic capability of the RPC framework is to complete service calls across business processes, forming a chain and a network of services, of which the core carrier is the RPC protocol.

Meanwhile, due to the close coupling with business data, the design and implementation of RPC protocol also directly determines the business architecture in some aspects, such as the interaction from terminal equipment to the backend equipment, multilingual adoption in microservice architecture, and data transmission models between services.

The Dubbo 2 protocol provides the core semantics of RPC, including the protocol header, flag bits, request ID, and request/response data. However, in the cloud-native era, the Dubbo 2 protocol faces two main challenges. The first is that it does not interoperate with the wider ecosystem, and its binary format is difficult for users to understand. The second is that it is not friendly enough to gateway-style components such as Mesh, which have to parse the protocol completely to obtain call metadata (for example, some RPC context). This is a challenge in terms of both performance and usability.
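The gateway problem becomes concrete when looking at what the fixed header exposes. The sketch below assumes the commonly documented 16-byte Dubbo 2 header layout (2-byte magic `0xdabb`, one flag byte, one status byte, 8-byte request ID, 4-byte body length); treat the offsets and flag masks as assumptions to verify against the protocol reference. The body is a placeholder, not real serialized data.

```python
import struct
from typing import NamedTuple

# Assumed layout: magic (2) | flags (1) | status (1) | request ID (8) | body length (4)
HEADER = struct.Struct(">HBBqI")
MAGIC = 0xDABB


class Dubbo2Header(NamedTuple):
    is_request: bool
    two_way: bool
    is_event: bool
    serialization_id: int
    status: int
    request_id: int
    body_length: int


def parse_header(frame: bytes) -> Dubbo2Header:
    magic, flag, status, request_id, body_length = HEADER.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("not a Dubbo 2 frame")
    return Dubbo2Header(
        is_request=bool(flag & 0x80),
        two_way=bool(flag & 0x40),
        is_event=bool(flag & 0x20),
        serialization_id=flag & 0x1F,
        status=status,
        request_id=request_id,
        body_length=body_length,
    )


# Placeholder standing in for the serialized service name, method, arguments, and context.
body = b"<serialized call data>"
frame = HEADER.pack(MAGIC, 0x80 | 0x40 | 2, 0, 42, len(body)) + body
header = parse_header(frame)
print(header.request_id, header.serialization_id, header.body_length)  # 42 2 22
```

Nothing in the header names the target service or method. A gateway that wants to route on them has to deserialize the body with the serialization the flag byte selects, which is the cost the paragraph above describes.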

As a service framework, Dubbo's most important job is to provide remote communication. Practice has shown that the design and implementation of the Dubbo 2 RPC protocol limit the business architecture in several aspects, such as the interaction from terminal equipment to the backend equipment, multilingual adoption in microservice architecture, and data transmission models between services.

While supporting existing features and addressing remaining problems, the following features are needed for the next-generation protocol:

The protocol needs to solve the problem of cross-language communication. Both the traditional multi-language, multi-SDK mode and the Mesh-based cross-language mode require a more universal and extensible data transmission format.
The protocol should provide a richer request model. In addition to the request/response model, it should support streaming and bidirectional communication.
In terms of performance, the request ID mechanism should be retained to avoid the performance loss caused by head-of-line blocking.
The protocol should be easy to extend, including (but not limited to) tracing and monitoring. Moreover, it should be recognizable by devices at all levels, making it easier for users to understand.
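The request ID requirement in the list above can be illustrated with a small correlation table. This is a generic sketch of ID-based multiplexing on one connection, not Dubbo's client code; the `Multiplexer` class is hypothetical and the network write is left as a comment.

```python
import itertools
from concurrent.futures import Future


class Multiplexer:
    """Matches responses to requests by ID so replies may arrive in any order."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # request ID -> Future awaiting the response

    def send(self, payload):
        request_id = next(self._ids)
        future = Future()
        self._pending[request_id] = future
        # A real client would write (request_id, payload) to the shared connection here.
        return request_id, future

    def on_response(self, request_id, result):
        # Called by the reader loop; completes whichever caller owns this ID.
        self._pending.pop(request_id).set_result(result)


mux = Multiplexer()
slow_id, slow = mux.send("slow call")
fast_id, fast = mux.send("fast call")

# The later request finishes first and is not held up by the earlier one.
mux.on_response(fast_id, "fast result")
print(fast.done(), slow.done())  # True False

mux.on_response(slow_id, "slow result")
```

Without the ID, responses would have to be returned in request order, and one slow call would delay every call queued behind it on the same connection.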

Based on these requirements, the combination of HTTP/2 and Protobuf is the most suitable. This combination naturally brings the gRPC protocol to mind. The relationship between the new-generation protocol and gRPC is listed below:

(1) The new Dubbo protocol is a protocol extended based on gRPC, which also ensures that the new protocol and gRPC are interoperable and shared across the ecosystem.

(2) Building on the first clause, the new Dubbo protocol will more natively support Dubbo's service governance, providing greater flexibility.

(3) In terms of serialization, since most applications do not yet use Protobuf, the new protocol will adapt existing serialization formats, so that applications can keep working and migrate to Protobuf gradually.

(4) In the request model, the new protocol will support Reactive natively, which is also unavailable in the gRPC protocol.
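Since the new protocol is described as compatible with gRPC, the base gRPC-over-HTTP/2 conventions show why it is easier for gateways to handle. The sketch below covers only those standard conventions (the `:path` of the form `/service/method` and the 5-byte length-prefixed message frame). It does not show any Triple-specific headers or extensions, and the service name is made up.

```python
import struct


def frame_message(payload: bytes, compressed: bool = False) -> bytes:
    """gRPC length-prefixed message: 1-byte compressed flag + 4-byte big-endian length."""
    return struct.pack(">BI", int(compressed), len(payload)) + payload


def unframe_message(data: bytes):
    compressed, length = struct.unpack_from(">BI", data)
    payload = data[5:5 + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return bool(compressed), payload


def route_from_headers(headers: dict):
    """Read service and method from HTTP/2 headers without touching the body."""
    _, service, method = headers[":path"].split("/")
    return service, method


# Hypothetical request as a gateway would see it.
headers = {
    ":method": "POST",
    ":path": "/com.example.Greeter/SayHello",
    "content-type": "application/grpc",
    "te": "trailers",
}
body = frame_message(b"hello")

print(route_from_headers(headers))  # ('com.example.Greeter', 'SayHello')
print(unframe_message(body))        # (False, b'hello')
```

The routing information sits in plain HTTP/2 headers, so a gateway or Sidecar can route, trace, or apply policy without deserializing the payload.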

2. Service Mesh

To bring Dubbo into the Service Mesh ecosystem, we evaluated many options and settled on the two Mesh solutions that suit Dubbo 3.0 best. One is the classic Sidecar-based Service Mesh, and the other is the Proxyless Mesh without a Sidecar.

The Sidecar Mesh solution is deployed in the same way as the current mainstream Service Mesh deployment. Dubbo 3.0 focuses on making the upgrade as transparent as possible for business applications. The upgrade is imperceptible from a programming perspective, while the entire call path moves to the lightweight Dubbo 3.0 SDK and the Triple protocol, minimizing overhead and O&M costs. This solution is also known as the Thin SDK solution, because all unnecessary components are removed.

The Proxyless Mesh deployment solution is another Mesh form planned for Dubbo 3.0, where the goal is to interact directly with the control plane from the traditional SDK without starting Sidecar.

The Proxyless Mesh deployment solution suits scenarios such as the following:

Business parties want to upgrade to a Mesh solution, but they cannot accept the performance loss caused by Sidecar traffic hijacking. This situation is common in core business scenarios.
Users hope to reduce Sidecar O&M costs and system complexity.
Legacy systems are upgraded slowly, the migration process is long, and multiple deployment architectures coexist for a long time.
Multiple deployment environments coexist, including different deployment methods such as virtual machines (VMs) and containers, and different application types are deployed together. For example, Thin SDK and Proxyless can be mixed, with Proxyless mode for performance-sensitive applications and Thin SDK mode for peripheral applications, while multiple data planes are scheduled from the same control plane.

Taken together, the two forms give Dubbo a Mesh solution for different business scenarios, different migration phases, and different levels of infrastructure support, all governed by a unified control plane.

Future Deployment

1. Deployment on Kubernetes

The preceding figure shows the expected deployment solution of Dubbo 3.0 on Kubernetes. Dubbo 3.0 will align its service discovery model with native Kubernetes Services, supporting mutual calls without an independently deployed registry.

2. Deployment on Istio

The preceding figure shows the future deployment solution of Dubbo 3.0 on Istio. A hybrid deployment of Thin SDK and Proxyless is used here. As shown in Pod 1 and Pod 3, data traffic is sent directly from the Dubbo service. Pod 2 is deployed in Thin SDK mode, so its traffic is intercepted by the Sidecar before it flows out.

Flexible Enhanced Planning

Cloud-native has brought about major changes in technology standardization. The core objective of every cloud-native base component is to make it easier to create and run elastic, scalable applications on the cloud. With the elasticity of cloud-native technologies, an application can be scaled out across a large number of machines in a very short time to support business needs.

For example, to cope with flash sales at midnight or with emergencies, applications often need thousands or tens of thousands of machines to meet user demand. However, this expansion also brings a series of problems, such as more frequent node exceptions because the cluster has so many nodes, and uneven service capacity across nodes because of a variety of objective factors. These are the problems encountered in the large-scale deployment of clusters in cloud-native scenarios.

Dubbo is expected to solve these problems with a flexible cluster scheduling mechanism. This mechanism mainly addresses two problems. First, the distributed service should remain stable and avoid an avalanche of cascading failures when nodes are abnormal. Second, large-scale applications should run in their best state, providing higher throughput and performance:

From the single-service perspective, Dubbo is expected to provide an external service that cannot be overwhelmed. In the case of a very high number of requests, it can reject specific requests to ensure the correctness and timeliness of the entire service.
From a distributed perspective, the flexible scheduling mechanism can dynamically distribute traffic in an optimal way to minimize the degradation of overall performance due to complex topologies and the varying performance of different nodes. By doing so, heterogeneous systems can allocate requests rationally based on the exact service capacity at runtime, resulting in the best performance.
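Both perspectives can be sketched in a few lines. This is not Dubbo's scheduling algorithm, which the text only describes at the level of goals: the in-flight limit, the capacity numbers, and the names `AdmissionController` and `pick_node` are assumptions for illustration, and a real mechanism would measure capacity at runtime rather than take it as a constant.

```python
import random


class AdmissionController:
    """Single-service view: reject work beyond a limit instead of being overwhelmed."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            return False  # shed this request so accepted ones stay timely
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


def pick_node(capacities: dict, rng: random.Random) -> str:
    """Distributed view: choose a node with probability proportional to its capacity."""
    nodes = list(capacities)
    return rng.choices(nodes, weights=[capacities[n] for n in nodes], k=1)[0]


# Not thread-safe; a real server would guard the counter with a lock or atomic.
controller = AdmissionController(max_in_flight=2)
decisions = [controller.try_acquire() for _ in range(3)]
print(decisions)  # [True, True, False]

controller.release()
print(controller.try_acquire())  # True

# Hypothetical capacities, for example requests per second each node can sustain.
capacities = {"strong-node": 300, "weak-node": 100}
rng = random.Random(0)
counts = {name: 0 for name in capacities}
for _ in range(10_000):
    counts[pick_node(capacities, rng)] += 1
print(counts["strong-node"] > counts["weak-node"])  # True
```

Rejecting early keeps latency bounded for the requests that are accepted, and weighting by capacity keeps a weaker node from becoming the bottleneck for the whole cluster.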

Dubbo 3.0 Roadmap

Apache Dubbo 3.0.0 was officially released in June 2021 as a milestone version after it was donated to Apache. This means Apache Dubbo has fully embraced cloud-native.

In November 2021, we will release Apache Dubbo 3.1 and bring the implementation and practices of Apache Dubbo deployment in Mesh scenarios.

In March 2022, we will release Apache Dubbo 3.2, which will bring a new intelligent traffic scheduling mechanism for large-scale application deployment, improving system stability and resource utilization.

Finally, Apache Dubbo 3.0 has already been integrated with the internal RPC framework of Alibaba Group. This is expected to enable internal adoption and unify the technology stack. In the future, Apache Dubbo 3.0 will be implemented on a large scale in Alibaba Group, supporting complex business scenarios such as the 618 Shopping Festival and Double 11.

The community will try its best to ensure a short release cycle and fix the existing problems promptly. You are welcome to submit issues and performance requirements, and the community will review and reply as soon as possible. Thanks for your support.
