Design a disaster recovery solution based on Kubernetes container clusters - Container Service for Kubernetes

When you design a system architecture, you must consider possible issues that may occur in information systems and infrastructure, such as hardware failures, software crashes, misoperations, attacks, and natural disasters. A comprehensive disaster recovery (DR) solution is necessary to ensure that your business recovers from the preceding issues. This topic describes how to design a resilient DR architecture and solution based on Kubernetes clusters and the network, database, middleware, and observability services from Alibaba Cloud. The Kubernetes clusters can be clusters provided by Alibaba Cloud (Container Service for Kubernetes (ACK) clusters), another cloud service provider, or those you deploy and manage in your data centers.

DR objectives

Recovery time objective (RTO): The maximum acceptable duration between service interruption and recovery. RTO represents the maximum duration of service interruption.
Recovery point objective (RPO): The maximum acceptable duration from the last data recovery point. RPO represents the maximum tolerable amount of lost or reconstructed data.

A smaller RTO or RPO indicates a shorter service downtime or less lost data, but also indicates a higher resource cost and more complicated O&M. Therefore, you must set an appropriate RTO and RPO based on your budget.
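
As a minimal sketch of how the two objectives can be checked against a candidate design, the following Python snippet compares an estimated recovery time and an estimated data-loss window with the targets. The class name, function name, and sample values are illustrative assumptions and are not taken from any Alibaba Cloud API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrObjectives:
    """Hypothetical container for the two DR targets."""
    rto: timedelta  # maximum acceptable service interruption
    rpo: timedelta  # maximum acceptable data-loss window


def meets_objectives(objectives: DrObjectives,
                     estimated_recovery: timedelta,
                     estimated_data_loss: timedelta) -> bool:
    """Return True only if both estimates stay within the targets."""
    return (estimated_recovery <= objectives.rto
            and estimated_data_loss <= objectives.rpo)


# Illustrative targets: recover within 1 hour, lose at most 15 minutes of data.
targets = DrObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))

# A design that is estimated to restore in 40 minutes and lose up to 10 minutes of data.
print(meets_objectives(targets, timedelta(minutes=40), timedelta(minutes=10)))
```

Tightening either target in this sketch quickly rules out cheaper designs, which mirrors the cost trade-off described above.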

DR strategies

Overview

The preceding figure shows three common DR strategies: Backup-Restore, Active-Standby, and Active-Active. The strategies differ in cost and benefits. You can choose a strategy based on your business importance, data loss tolerance, and budget.

Backup-Restore

As shown in the preceding figure, applications and data are regularly backed up in Backup-Restore mode. The backup data can be used to restore applications in another location to which business traffic is switched during disasters.

A specific amount of data is lost during data restoration because data is not backed up in real time. Restoration may also take a long period of time, which varies based on the size of the restored data.
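
The two effects above can be estimated with simple arithmetic. The following sketch assumes periodic backups at a fixed interval and a constant restore throughput. Both are simplifying assumptions, and the function names and numbers are hypothetical.

```python
def worst_case_data_loss_minutes(backup_interval_minutes: float) -> float:
    """With periodic backups, a disaster that strikes just before the next
    backup loses up to one full interval of data."""
    return backup_interval_minutes


def estimated_restore_minutes(data_size_gib: float,
                              restore_throughput_mib_per_s: float) -> float:
    """Rough transfer-time estimate: data size divided by sustained throughput."""
    if restore_throughput_mib_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_size_gib * 1024 / restore_throughput_mib_per_s
    return seconds / 60


# Illustrative values: hourly backups, 600 GiB restored at 100 MiB/s.
print(worst_case_data_loss_minutes(60))
print(estimated_restore_minutes(600, 100))
```

The restore estimate covers data transfer only. Time to provision the target cluster and restart applications is not included.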

Active-Standby

As shown in the preceding figure, in Active-Standby mode, the primary location handles most business traffic and the secondary location runs fewer instances to save costs. Test traffic is periodically sent to the secondary location to check system effectiveness.

If a fault or disaster occurs, a primary/secondary switchover is performed. In this case, more instances are started in the secondary location to handle diverted traffic.
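
A rough way to size the scale-out in the secondary location is sketched below. It assumes that each instance can serve a known request rate, which is a simplification. The function and parameter names are hypothetical.

```python
import math


def replicas_to_add(peak_rps: float,
                    rps_per_replica: float,
                    standby_replicas: int) -> int:
    """Return how many extra replicas the secondary location needs to
    absorb the diverted traffic after a switchover."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    required = math.ceil(peak_rps / rps_per_replica)
    return max(0, required - standby_replicas)


# Illustrative values: 1,000 requests/s at peak, 120 requests/s per replica,
# and 2 replicas already running in the secondary location.
print(replicas_to_add(1000, 120, 2))
```

The time needed to start these additional instances contributes directly to the RTO of an Active-Standby design.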

Active-Active

As shown in the preceding figure, in Active-Active mode, both locations run the same number of instances and concurrently handle the same amount of traffic.

If a fault or disaster occurs in one location, a switchover is performed to divert business traffic to a normally running location.
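
After such a switchover, the surviving locations must absorb the traffic of the failed one. The following sketch computes the highest steady-state utilization at which that is still possible, assuming that traffic is split equally across locations and that load scales linearly. Both are assumptions, and the function name is hypothetical.

```python
def max_safe_utilization(locations: int, cap: float = 1.0) -> float:
    """Highest per-location utilization that still lets the remaining
    locations stay at or below `cap` after one location is lost.

    With n equal locations at utilization u, each survivor ends up at
    u * n / (n - 1), so u must not exceed cap * (n - 1) / n.
    """
    if locations < 2:
        raise ValueError("at least two locations are required")
    return cap * (locations - 1) / locations


# Two locations: each should run at no more than half of its capacity.
print(max_safe_utilization(2))
# Three locations with a 90% post-failover ceiling.
print(max_safe_utilization(3, cap=0.9))
```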

DR scope

Across zones (Multi-AZ)

In Alibaba Cloud, a region contains multiple availability zones (AZs) that run on separate power and network resources. You can design a cross-AZ DR solution to protect your business from small-scale disasters, such as power or network outages. Inter-AZ communication has low latency. Therefore, cross-AZ DR is more suitable for applications such as databases, caches, and message processors.

For more information about regions and zones, see Regions and Zones.

Across regions (Multi-region)

Specific large-scale disasters affect multiple AZs in the same region. In this case, you can develop a cross-region DR solution. However, cross-region communication has high latency. This makes the DR solution more complicated and costly than a cross-AZ solution.

Design principle

When you design a multi-AZ or multi-region DR solution, you must determine whether the stateful applications, such as database, cache, and message processing applications, and dependent cloud services support cross-AZ or cross-region DR.

Solutions and examples

Backup-Restore

Solution 1: Cross-AZ and cross-region backup and recovery on Alibaba Cloud

The following figure shows the architecture of the solution:

The following items describe the solution:

Applications in ACK are backed up by using the backup center of Distributed Cloud Container Platform for Kubernetes (ACK One). The applications include stateless and stateful applications. For stateful applications, you can also back up the related storage data when you back up the application YAML data.
The backup center of ACK One integrates cloud services, such as Elastic Compute Service (ECS) (snapshots), Apsara File Storage NAS (NAS) (file systems), Object Storage Service (OSS) (buckets and objects), and Cloud Backup. These services support the one-click backup of application YAML data, persistent volumes (PVs) of cloud disk volumes, and PVs of file systems.
Backup data of applications and storage can be restored to ACK clusters in any region and AZ at any time.
Alibaba Cloud databases, such as ApsaraDB RDS for MySQL, can also be backed up and restored. For more information, see Backup and restoration and Migrate data between ApsaraDB RDS instances.

Solution 2: Data backup and restoration for hybrid clouds

The following figure shows the architecture of the solution:

The following items describe the solution:

You can connect Kubernetes clusters deployed in data centers or other cloud platforms to the ACK console by using registered clusters in ACK One.
Then, you can use the backup center of ACK One to back up the applications in the registered cluster. The applications include stateless applications and stateful applications. For stateful applications, you can back up the related storage data when you back up the YAML configuration file.
Backup application data (Deployment/StatefulSet) and storage data (PV/PVC) can be restored to ACK clusters in any region and AZ.

Pros and cons

A backup-restore solution is easier and lower in cost to implement than the other solutions. However, the RTO and RPO can be high, and the time to recover applications can be long depending on the amount of data and complexity of the applications. To lower the RTO and RPO, you can use a combination of full backup and incremental backups that is supported by the backup center of ACK One.
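
To illustrate why combining full and incremental backups helps, the following sketch selects the backups needed to restore to a given point in time: the most recent full backup plus every incremental backup taken after it. This is a generic model of full and incremental backups, not the backup center's API. The `Backup` type, its fields, and the hour-based timestamps are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: int  # hypothetical timestamp, in hours since the first backup
    kind: str      # "full" or "incremental"


def restore_chain(backups: List[Backup], target_time: int) -> List[Backup]:
    """Return the latest full backup at or before target_time, followed by
    every incremental backup taken after it and at or before target_time."""
    candidates = sorted((b for b in backups if b.taken_at <= target_time),
                        key=lambda b: b.taken_at)
    full_indexes = [i for i, b in enumerate(candidates) if b.kind == "full"]
    if not full_indexes:
        raise ValueError("no full backup available before the target time")
    # Everything after the last full backup is incremental by construction.
    return candidates[full_indexes[-1]:]


# Illustrative schedule: a full backup every 24 hours, incrementals every 6 hours.
backups = [
    Backup(0, "full"), Backup(6, "incremental"), Backup(12, "incremental"),
    Backup(24, "full"), Backup(30, "incremental"), Backup(36, "incremental"),
]
print(restore_chain(backups, 32))
```

In this model, more frequent incremental backups shorten the data-loss window, while more frequent full backups shorten the chain that must be replayed during restoration.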

Backup-restore solutions are usually used as the last line of protection against disasters, which makes these solutions important. During system O&M, you must ensure the regular backup of data and the usability of data backups.

Backup-restore solutions can also be used to migrate applications across clusters. The following items describe the scenarios:

Migrate applications from your data center to Alibaba Cloud ACK clusters. For more information, see Migrate applications from external Kubernetes clusters to ACK clusters.
Migrate applications to new-version clusters. This solution is suitable when the current cluster version is outdated and version updates pose risks. In this case, you can create new-version clusters and then migrate the applications. For more information, see Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version.
Adjust account permissions or re-arrange organizations. For more information, see Best practice for using an ACK One Fleet instance to manage multiple clusters across platforms or across accounts and Migrate applications across clusters in different regions.

Multi-cluster Services

During application migration, a large number of applications may need to be migrated in batches. In addition, these applications may need to communicate with each other. In this case, you can use the multi-cluster Services (MCS) feature of ACK One to implement cross-cluster access as long as the clusters are interconnected.

As shown in the following figure, MCS can inject the Kubernetes Service (including the endpoints) of Application 2 from Cluster 1 into Cluster 2. This way, Application 1 from Cluster 2 can access Application 2 that belongs to Cluster 1.

You can use a leased line and the registered cluster feature of ACK One to register Kubernetes clusters from your data center or another cloud platform with ACK and then enable the MCS feature of ACK One for the registered clusters.

Multi-active DR solutions for multiple AZs in a single region

You can use multi-active DR solutions in a single region that contains multiple AZs. Compared with active-standby DR solutions, multi-active DR solutions have the following advantages:

Higher resource utilization and lower costs.
Higher service quality and stronger fault tolerance: The increased number of service replicas enhances service quality and response speed. This allows you to handle peak traffic in a more efficient manner. In the event of a failure, service interruptions due to traffic switching are avoided. You can perform system updates or maintenance without interrupting the service.
Enhanced scalability: If a zone has insufficient resources, you can quickly scale the application in other zones that have available resources.

Multi-cluster gateway based on ACK One

The following figure shows the architecture of the solution:

When the system is running as expected:

When a disaster occurs, AZ1 becomes unavailable. A primary-secondary switchover is performed. The multi-cluster gateway (cloud-native MSE gateway) automatically switches traffic to ACK Cluster 2 in AZ2. Application instances are automatically scaled out in ACK Cluster 2.

The following items describe the solution:

Applications are deployed in two ACK clusters by using GitOps. Continuously consistent deployment is implemented based on Git repositories.
Standard Kubernetes Ingress rules are defined in YAML format by using a multi-cluster gateway of ACK One. This facilitates Layer-7 traffic distribution. Cross-AZ high availability (HA) is implemented for the multi-cluster gateway.
When Cluster 1 or the applications within Cluster 1 become unavailable, the multi-cluster gateway of ACK One seamlessly and automatically switches business traffic to Cluster 2 to complete the failover.
As traffic increases, the HPA feature scales out application replicas in Cluster 2, which triggers the autoscaler to scale out cluster nodes.
For more information about cross-AZ DR of Alibaba Cloud ApsaraDB RDS, see Build a high availability architecture.
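
The HPA scale-out mentioned in the preceding items follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The following sketch reproduces that formula only. It omits the tolerance band and stabilization behavior of the real controller, and the function and parameter names are assumptions.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """desired = ceil(current_replicas * current_metric / target_metric),
    clamped to the configured minimum and maximum."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


# Illustrative values: after failover, average CPU utilization in Cluster 2
# rises to 90% against a 60% target with 4 replicas running.
print(hpa_desired_replicas(4, 90, 60, min_replicas=1, max_replicas=20))
```

If the additional replicas cannot be scheduled on existing nodes, the pending pods are what cause the node autoscaler to add nodes.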

Note

This solution is based on HTTP Layer-7 traffic forwarding and supports Layer-7 health checks. Compared with DNS-based traffic distribution, this solution features less traffic loss during primary/secondary switchovers.
Traffic governance based on Ingress rules is supported on the gateway side. Compared with DNS-based traffic distribution, this solution features a simpler system architecture and lower maintenance cost by leveraging the combination of Layer-4 load balancing (between the primary and secondary systems) and Layer-7 Ingress gateways.

Pros and cons

Single-region multi-AZ solutions are cost-efficient. You can use multi-AZ deployment and HA for cloud services, such as Application Load Balancer (ALB), cloud native gateway, container, middleware, and database services, to minimize business changes and perform rapid switchover.

Compared with the DR solutions for multiple AZs in a single region based on DNS traffic distribution, the solution implemented based on the multi-cluster gateway of ACK One has the following advantages:

Region-level load balancing and centralized management of multi-cluster north-south Layer-7 traffic can reduce the number of gateways and costs. DNS-based traffic distribution does not support specific cross-cluster routing capabilities. For example, session persistence is required for zero round-trip time (0-RTT) of Quick UDP Internet Connections (QUIC).
Millisecond-level and second-level failovers eliminate DNS caching issues.
- Multi-cluster gateway based on ACK One: If the service in a cluster fails, traffic can be rerouted to other clusters in milliseconds or seconds. Failover is smoother than DNS-based traffic distribution.
- DNS-based traffic distribution: The IP address is changed during a failure. In most cases, the service is temporarily unavailable (at the minute level) due to client caching. To resolve caching issues, TTL values are reduced, which results in a large number of DNS access requests and higher usage costs.
Simplified management: Manage Ingress configurations and services in one control plane (Fleet). This provides an easier method to extend and maintain services or applications and reduces management costs.
Transparent cluster migration during cluster update or rebuild: Traffic is migrated to a healthy cluster based on rules and then forwarded back after the update or rebuild is complete.
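
The difference in failover time and the cost of lowering the TTL can be approximated as follows. The model is an assumption: clients honor the DNS TTL exactly, each client re-resolves once per TTL, and the gateway marks a backend unhealthy after a fixed number of consecutive failed health checks. All function names and numbers are hypothetical.

```python
def dns_failover_window_s(detection_s: float, ttl_s: float) -> float:
    """Worst case for DNS: the failure is detected, the record is changed,
    and a client that just cached the old IP address waits one full TTL."""
    return detection_s + ttl_s


def gateway_failover_window_s(check_interval_s: float,
                              unhealthy_threshold: int) -> float:
    """Worst case for a health-checked gateway: the backend is removed
    after `unhealthy_threshold` consecutive failed checks."""
    return check_interval_s * unhealthy_threshold


def dns_queries_per_hour(clients: int, ttl_s: float) -> float:
    """Each client re-resolves the name once per TTL."""
    return clients * 3600 / ttl_s


print(dns_failover_window_s(detection_s=30, ttl_s=300))
print(gateway_failover_window_s(check_interval_s=1, unhealthy_threshold=3))
# Reducing the TTL from 300 s to 60 s multiplies the query volume by five.
print(dns_queries_per_hour(1000, 300), dns_queries_per_hour(1000, 60))
```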

The following figure shows the architecture of DNS-based traffic distribution.

Cloud + IDC DR solutions for a single region

The architecture of this solution is similar to the architecture of the single-region multi-AZ DR solution. The following items describe this solution:

A leased line connection is established between a VPC and a data center to provide management and data channels.
The cluster from the data center is connected to ACK One by using the registered cluster feature. This allows you to use the observability and security capabilities of Alibaba Cloud to manage your on-premises cluster and ACK cluster in a centralized manner.
Applications are deployed in both clusters by using ACK One GitOps. Continuously consistent deployment is implemented based on Git repositories.

Solution based on ACK One multi-cluster gateways (single-region on-premises and off-premises active-active)

The following figure shows the architecture of the solution:

Multi-region DR solutions

Single-region DR solutions cannot ensure the HA of your business if your business is large in scale and serves a large number of users from many regions. In this case, you may need a multi-region DR solution. Business systems are separately deployed in multiple regions to ensure that each region can independently operate and provide services. In multi-region DR scenarios, you can select a DR solution based on multi-cluster gateways of ACK One or a DR solution based on DNS traffic distribution. These solutions are suitable for different scenarios.

Solution 1: Multi-cluster gateway based on ACK One

ACK One allows you to use ALB multi-cluster gateways to implement a multi-region DR solution. This solution is suitable for the following scenarios:

Cross-region HA and insufficient resources in the on-premises region. For example, GPU resources are extremely scarce due to the current AI boom.
Client applications are only slightly sensitive to latency but require strong multi-cluster traffic management capabilities.

The following items describe the solution:

The ALB multi-cluster gateway in Region 1 is used to forward Layer-7 traffic across regions, such as 0-RTT of QUIC and header-based forwarding. Region 2 is configured with a single-cluster ALB instance as a cold standby to distribute traffic to Cluster 2 after Region 1 becomes unavailable.
DNS resolution and load distribution are implemented based on GTM. The health status of single-cluster and multi-cluster ALB instances is monitored and DR can be automatically triggered when disasters occur.
When the entire Region 1 or its ALB instance fails, GTM switches the DNS resolution of the service domain name to the ALB instance of Region 2 to implement DR. If only the cluster or service in Region 1 experiences issues, or if Region 2 becomes unavailable, the ALB multi-cluster gateway automatically reroutes traffic to the healthy cluster. GTM is not involved in this case, which makes the failover smoother.
After Cluster 1 and Cluster 2 are connected by using a CEN or VPC peering connection, cross-region traffic is forwarded by using a leased line to ensure reliability.
A multi-region HA solution is used for caches. For more information, see Overview of Global Distributed Cache for Tair.
A cross-region HA solution is used for databases. For more information, see Architecture for multi-zone deployment.
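
The division of work between GTM and the ALB multi-cluster gateway described in the preceding items can be sketched as a small decision function. The enum, the function, and the boolean inputs are hypothetical. In practice, these signals come from GTM and ALB health checks rather than from application code.

```python
from enum import Enum


class FailoverPath(Enum):
    NONE = "no failover needed"
    GATEWAY_REROUTE = "ALB multi-cluster gateway reroutes to the healthy cluster"
    GTM_DNS_SWITCH = "GTM switches DNS to the ALB instance in Region 2"
    NO_HEALTHY_TARGET = "no healthy cluster is reachable"


def choose_failover(region1_up: bool, region1_alb_up: bool,
                    cluster1_up: bool, cluster2_up: bool) -> FailoverPath:
    """Pick the failover path for the two-region layout described above."""
    if region1_up and region1_alb_up:
        # The multi-cluster gateway in Region 1 is still serving traffic.
        if cluster1_up and cluster2_up:
            return FailoverPath.NONE
        if cluster1_up or cluster2_up:
            return FailoverPath.GATEWAY_REROUTE
        return FailoverPath.NO_HEALTHY_TARGET
    # Region 1 or its ALB instance is down: only the Region 2 standby can serve.
    if cluster2_up:
        return FailoverPath.GTM_DNS_SWITCH
    return FailoverPath.NO_HEALTHY_TARGET


print(choose_failover(True, True, False, True).value)
print(choose_failover(False, False, False, True).value)
```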

The following section describes the benefits of cross-region DR solutions based on ACK One multi-cluster gateways:

Stronger multi-cluster forwarding capabilities: Provides advanced content-based routing features and a more flexible health check mechanism than traditional GTM to adapt to more complex application scenarios.
Centralized multi-cluster traffic management Ingress: One control plane (Fleet) is used to manage Ingress configurations and services. This makes it easier to expand and maintain services and applications and reduces management costs.
Mitigated DNS caching issues: For the more frequent failure types, such as service exceptions or cluster failures, failover does not require switching the IP address in DNS resolution. Seamless failover can be achieved in seconds.

Solution 2: DNS-based traffic distribution and a single ACK One Fleet instance

The multi-region DR solution based on DNS traffic distribution provides a global GTM and is suitable for scenarios that involve nearby access. The following figure shows the architecture of the solution:

The following items describe the solution:

Use Anti-DDoS Proxy and Web Application Firewall (WAF) to protect services from volumetric and web application attacks, such as SQL injection, cross-site scripting (XSS), and command injection attacks. For more information, see GTM works with WAF, GA, and SLB and Protect a website service by using Anti-DDoS Pro or Anti-DDoS Premium and WAF.
User requests are routed to the nearby region by using GTM.
Applications are deployed in two ACK clusters by using GitOps. Continuously consistent deployment is implemented based on Git repositories.
A multi-region HA solution is used for caches. For more information, see Overview.
A cross-region HA solution is used for databases. For more information, see Architecture for multi-zone deployment.
Multi-AZ DR solutions can be used in each region.
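
Nearby access combined with DR can be pictured as choosing the lowest-latency region among those that are currently healthy. The following sketch is a simplified model of that idea and does not represent how GTM is configured or implemented. The function name, the latency figures, and the use of region IDs as dictionary keys are illustrative assumptions.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the healthy region with the lowest measured latency."""
    candidates = {region: latency
                  for region, latency in latency_ms.items()
                  if region in healthy}
    if not candidates:
        raise LookupError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies measured from one client location.
latencies = {"cn-hangzhou": 12.0, "cn-beijing": 35.0}

# Normal operation: the nearby region wins.
print(pick_region(latencies, {"cn-hangzhou", "cn-beijing"}))
# The nearby region fails: traffic falls back to the remaining region.
print(pick_region(latencies, {"cn-beijing"}))
```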

Solution 3: DNS-based traffic distribution and multiple ACK One Fleet instances

The following figure shows the architecture of the solution:

The following items describe the solution:

Use Anti-DDoS Proxy and WAF to protect services against volumetric and web application attacks, such as SQL injection, cross-site scripting (XSS), and command injection attacks. For more information, see GTM works with WAF, GA, and SLB and Protect a website service by using Anti-DDoS Pro or Anti-DDoS Premium and WAF.
User requests are routed to the nearby region by using GTM.
Applications are deployed in two ACK clusters by using GitOps. Continuously consistent deployment is implemented based on Git repositories.
A multi-region HA solution is used for caches. For more information, see Overview.
A cross-region HA solution is used for databases. For more information, see Architecture for multi-zone deployment.
Multi-AZ DR solutions can be used in each region.

Cross-region unit-based multi-active solution

Compared with the preceding cross-region DR solution, this solution requires you to configure rules to shard applications and data. This enables units to provide complete services based on data shards. This solution can securely isolate business and allows you to separately scale out business in different units to serve large user groups.

In this solution, business is classified into subunits and central units. A central unit that has user data manages multiple subunits that have sharded data. This solution requires the business system to support custom traffic distribution, data splitting, and unit interaction and has an extremely complex implementation.
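
One possible sharding rule is a stable hash of the user ID, sketched below. The choice of a hash-modulo rule, the unit names, and the function name are all assumptions. A real system may shard by geography, tenant, or another business key.

```python
import hashlib
from typing import List

# Hypothetical unit names.
UNITS = ["unit-a", "unit-b", "unit-c"]


def unit_for_user(user_id: str, units: List[str]) -> str:
    """Map a user to a unit with a stable hash so that every request from
    the same user is served by the unit that owns that user's data shard."""
    if not units:
        raise ValueError("at least one unit is required")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


print(unit_for_user("user-42", UNITS))
```

With a plain modulo rule, changing the number of units remaps most users, so data must be migrated between units. This is one reason the implementation of unit-based solutions is complex.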

The following figure shows the architecture of the solution: