By Zehuan Shi
In increasingly complex deployment scenarios, multi-cluster deployment is becoming a crucial practice. By implementing multiple clusters, you can significantly enhance availability and naturally isolate multiple deployment environments, leading to improved stability and security. However, managing multiple clusters effectively poses significant challenges. In practice, to ensure mutual backup, multiple clusters are often deployed across different data centers, regions, or public clouds. The first challenge is establishing network connectivity between these clusters. Additionally, traffic needs to be managed according to specific rules for different multi-cluster deployment practices.
This series of articles will guide you through typical multi-cluster practices in different scenarios, highlighting the challenges and solutions you may encounter in multi-cluster deployment, and providing practical steps for some scenarios in subsequent content.
Alibaba Cloud has deployed cloud service infrastructure in multiple regions. Each region contains multiple zones. A region is a geographic location where Alibaba Cloud data centers are deployed. After a resource is created, its region cannot be changed. An Availability Zone (AZ) refers to a physically distinct area within the same region, characterized by independent power and network infrastructures. Instances within the same Availability Zone generally experience lower network latency.
In some multi-cluster practices, services need to call each other across clusters. For instance, in multi-cluster disaster recovery scenarios, if an application in one cluster fails, we want to switch seamlessly to the same service in another cluster without disruption. In Alibaba Cloud Container Service, each cluster's network belongs to a specific Virtual Private Cloud (VPC) within its region, so inter-cluster communication requires establishing connectivity between the VPC networks of these clusters. The following sections introduce three methods for connecting cluster networks: PrivateLink, CEN, and ASM East-West Gateway.
PrivateLink lets services interact over the private network of Alibaba Cloud. You can use PrivateLink to connect multiple VPCs over a private network without creating a NAT Gateway, an Elastic IP Address, or any other public IP address. Because data is not transmitted over the Internet, PrivateLink provides higher data security and network quality.
However, PrivateLink cannot connect VPCs across different regions. To this end, Alibaba Cloud also provides a cross-region network connection solution: Cloud Enterprise Network (CEN).
CEN is a highly available network built on the global private network of Alibaba Cloud. CEN uses transit routers to establish cross-region connections between VPCs. This enables VPCs to communicate with data centers and establish flexible, reliable, and enterprise-class networks in the cloud. CEN allows you to realize network communication between VPCs in any region.
In addition to PrivateLink and CEN, ASM, as cluster network infrastructure, provides a more economical and flexible way to connect networks: the ASM East-West Gateway. By deploying the ASM East-West Gateway and exposing it to the public internet, you can establish inter-cluster communication over public network links and enable cross-cluster application calls. Because this traffic travels over the public internet, network latency may be higher than with dedicated CEN connections. In exchange, the gateway is more cost-effective than CEN. Users should assess their business requirements for network quality and choose the appropriate solution accordingly.
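The trade-offs among the three options can be summarized as a simple rule of thumb. The following Python sketch is only an illustration of the reasoning above, not an Alibaba Cloud API: the `Cluster` type, the `choose_connectivity` function, and the `latency_sensitive` flag are hypothetical names, and a real decision would also weigh cost, compliance, and existing network topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of a cluster; only the region matters here."""
    name: str
    region: str  # for example "cn-hangzhou"


def choose_connectivity(a: Cluster, b: Cluster, latency_sensitive: bool) -> str:
    """Suggest one of the three connection methods for a pair of clusters.

    This is a simplified heuristic derived from the article text,
    not an official selection procedure.
    """
    if a.region == b.region:
        # PrivateLink connects VPCs privately but cannot cross regions.
        return "PrivateLink"
    if latency_sensitive:
        # CEN keeps cross-region traffic on the private backbone.
        return "CEN"
    # The East-West Gateway uses public links: cheaper, possibly slower.
    return "ASM East-West Gateway"


hangzhou = Cluster("cluster-1", "cn-hangzhou")
shanghai = Cluster("cluster-2", "cn-shanghai")
print(choose_connectivity(hangzhou, shanghai, latency_sensitive=False))
```

In this sketch, two clusters in the same region always resolve to PrivateLink, while cross-region pairs are split by how much the workload depends on network quality.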
Different business needs call for different multi-cluster deployment practices. The following sections introduce several typical ones. Because network environments and dependencies are more complex in multi-cluster scenarios than in single-cluster setups, it is essential to be familiar with these validated best practices.
By default, the ASM control plane assumes that all endpoints of a service are reachable. In some scenarios, however, we may prefer to restrict cross-cluster access. For example, consider an application with partitioned services where each partition is deployed in a separate cluster. Each cluster has its own independent storage (such as databases) for the application to consume, while the traffic rules and the application topology are identical across clusters. Typical businesses with this deployment pattern include gaming applications, which benefit from unified control plane management. For this scenario, ASM provides the cluster traffic retention feature. For more information about the traffic retention feature, please refer to the ASM documentation on the ASM homepage. You can also follow the best practices in subsequent articles in this series.
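Conceptually, traffic retention narrows the set of candidate endpoints to those in the caller's own cluster. The sketch below models that idea in Python under stated assumptions: `Endpoint` and `retain_in_cluster` are hypothetical names for illustration, and in practice the feature is configured in ASM rather than implemented in application code.

```python
from typing import List, NamedTuple


class Endpoint(NamedTuple):
    """Hypothetical endpoint record: owning cluster and network address."""
    cluster: str
    address: str


def retain_in_cluster(endpoints: List[Endpoint], caller_cluster: str) -> List[Endpoint]:
    """Keep only endpoints that live in the same cluster as the caller."""
    return [e for e in endpoints if e.cluster == caller_cluster]


all_endpoints = [
    Endpoint("cluster-1", "10.0.0.5"),
    Endpoint("cluster-2", "10.1.0.7"),
]
# A caller in cluster-1 sees only the cluster-1 endpoint.
print(retain_in_cluster(all_endpoints, "cluster-1"))
```

Without such a filter, a caller in either cluster would see both endpoints, which is the default behavior described above.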
Some distributed applications may need to be deployed across multiple clusters for reasons such as permission isolation and resource dependencies. Applications in different clusters may also call each other across clusters, which is a typical multi-cluster deployment practice. The topology looks like this. In this kind of deployment, multiple clusters are managed by the same ASM instance, so users need to ensure that services in different clusters do not conflict. For example, the default/httpbin service of cluster 1 and the default/httpbin service of cluster 2 are treated as the same service by the control plane.
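The conflict risk follows from how services are identified: by namespace and name, regardless of cluster. The following Python sketch models that merging behavior; `merge_services` and the input layout are assumptions made for illustration and do not reflect the control plane's actual data structures.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ServiceKey = Tuple[str, str]  # (namespace, service name)


def merge_services(
    clusters: Dict[str, Dict[ServiceKey, List[str]]]
) -> Dict[ServiceKey, List[Tuple[str, str]]]:
    """Merge per-cluster service registries into one mesh-wide view.

    Services sharing the same (namespace, name) key collapse into a single
    entry whose endpoints come from every cluster that defines it.
    """
    merged: Dict[ServiceKey, List[Tuple[str, str]]] = defaultdict(list)
    for cluster_name, services in clusters.items():
        for key, addresses in services.items():
            merged[key].extend((cluster_name, addr) for addr in addresses)
    return dict(merged)


mesh_view = merge_services({
    "cluster-1": {("default", "httpbin"): ["10.0.0.5"]},
    "cluster-2": {("default", "httpbin"): ["10.1.0.7"]},
})
# One service entry, two endpoints from two clusters.
print(mesh_view[("default", "httpbin")])
```

If the two httpbin deployments were actually unrelated applications, callers would be load balanced across both, which is why unintended name collisions need to be avoided.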
Cross-region or cross-zone multi-cluster deployment is a common deployment mode that uses the physical isolation between regions or zones to build a disaster recovery system. ASM provides cross-region load balancing capabilities, and its cross-region failover function keeps traffic in the local cluster by default. When a local service becomes unavailable due to a failure, calls to that service are automatically switched to the same-named service deployed in other regions or availability zones, achieving cross-region disaster recovery. For more information about cross-cluster failover, please refer to the ASM documentation. You can also follow the best practices in subsequent articles in this series.
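The failover behavior described above amounts to "prefer local, fall back to remote". Here is a minimal Python sketch of that selection logic, assuming a hypothetical `Endpoint` record with a `healthy` flag; the real mesh relies on its own health detection and configuration, which are not modeled here.

```python
from typing import List, NamedTuple


class Endpoint(NamedTuple):
    """Hypothetical endpoint record used only for this illustration."""
    region: str
    address: str
    healthy: bool


def pick_endpoints(endpoints: List[Endpoint], caller_region: str) -> List[Endpoint]:
    """Return healthy local endpoints, or healthy remote ones if none are local."""
    local = [e for e in endpoints if e.region == caller_region and e.healthy]
    if local:
        # Normal case: traffic stays in the caller's region.
        return local
    # Failover case: use any healthy endpoint of the same-named service.
    return [e for e in endpoints if e.healthy]


endpoints = [
    Endpoint("cn-hangzhou", "10.0.0.5", healthy=False),
    Endpoint("cn-shanghai", "10.1.0.7", healthy=True),
]
# The local endpoint is down, so the call is switched to cn-shanghai.
print(pick_endpoints(endpoints, "cn-hangzhou"))
```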
When developing large-scale microservice systems (some now exceed 1,000 services), deploying test environments is often challenging. The following two deployment methods are widely used:
Full and single test environment: The simplest deployment mode. All developers share one test environment, and for a given application the environment can contain only one version at a time. While one developer has deployed their own version of an application for testing, other developers of that application have to wait. In addition, once a developer deploys a version that does not work properly, testing is affected for all other application developers on the call chain.
Full and multiple test environment: Each developer deploys a separate, complete test environment. For a large-scale application, the cost of this method may be unacceptable (imagine how wasteful it is to deploy an independent copy of thousands of applications for every developer).
Both methods have significant drawbacks. Ideally, a developer's test traffic should be routed to their own version of the service under development, while all other traffic is routed to the stable version. Achieving this usually requires either a self-built implementation or the capabilities of an existing framework. The former carries significant development and implementation costs. The latter is mostly tied to a specific language and is not friendly to multi-language applications.
With the swimlane mode of ASM, users can easily create an independent swimlane for each developer in a multi-language application environment without any intrusion into the application, and use diversion rules to match request characteristics (such as user ID and type). Each developer deploys only the application under development in their own swimlane, while the other applications in the call chain fall back to the baseline environment. This greatly improves the flexibility of deployment in the development environment. For more information about the traffic lane feature, please refer to the ASM documentation. You can also follow the best practices in subsequent articles in this series.
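The fallback behavior can be illustrated with a small routing model. The Python sketch below is an assumption-laden illustration rather than ASM's implementation: the `x-lane` header name, the `resolve_lane` and `route_chain` functions, and the `deployments` mapping are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

BASELINE = "baseline"


def resolve_lane(service: str, lane: str, deployments: Dict[str, Set[str]]) -> str:
    """Use the requested lane if it hosts the service, otherwise the baseline."""
    if service in deployments.get(lane, set()):
        return lane
    return BASELINE


def route_chain(
    chain: List[str],
    headers: Dict[str, str],
    deployments: Dict[str, Set[str]],
    lane_header: str = "x-lane",  # hypothetical header carrying the lane name
) -> List[Tuple[str, str]]:
    """Decide, for each service on the call chain, which lane serves the request."""
    lane = headers.get(lane_header, BASELINE)
    return [(svc, resolve_lane(svc, lane, deployments)) for svc in chain]


# Alice deployed only "cart" in her lane; everything else runs in the baseline.
deployments = {"dev-alice": {"cart"}}
chain = ["frontend", "cart", "payment"]
print(route_chain(chain, {"x-lane": "dev-alice"}, deployments))
```

Only the service a developer actually deployed is served from their lane, so each lane stays small while the full call chain remains testable.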