A comprehensive disaster recovery (DR) plan protects your business against hardware failures, software crashes, operator errors, attacks, and natural disasters. This topic describes how to design a resilient DR architecture using Kubernetes clusters — including Container Service for Kubernetes (ACK) clusters, clusters from another cloud provider, or self-managed clusters in your data centers — combined with Alibaba Cloud network, database, middleware, and observability services.
Key concepts
Recovery Time Objective (RTO) is the maximum acceptable duration between a service interruption and full recovery. A smaller RTO means shorter downtime.
Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured as the time between the last recovery point and the interruption. A smaller RPO means less data lost.
Smaller RTO and RPO require more resources and more complex operations. Set your RTO and RPO targets based on business criticality and budget.
Choose a DR strategy
The following table compares the three DR strategies. Select one based on your business importance, data loss tolerance, and budget.
| Strategy | How it works | Cost | Best for |
|---|---|---|---|
| Backup-Restore | Applications and data are backed up on a schedule. On failure, restore from backup in another location. | Low | Non-critical workloads, last-resort protection |
| Active-Standby | Primary location handles most traffic. Secondary location runs fewer instances. Test traffic is periodically sent to check system effectiveness. On failure, scale up the secondary and switch traffic. | Medium | Workloads with moderate availability requirements |
| Active-Active | Both locations run equal instances and handle traffic simultaneously. On failure, route all traffic to the healthy location. | High | Business-critical workloads requiring near-zero downtime |
DR scope
Across availability zones (multi-AZ)
A region contains multiple availability zones (AZs) that operate on separate power and network infrastructure. A cross-AZ DR solution protects against localized failures such as power outages or network disruptions. Because inter-AZ latency is low, cross-AZ DR is well-suited for stateful applications like databases, caches, and message queues.
For more information about regions and AZs, see Regions and zones.
Across regions (multi-region)
Large-scale disasters can affect all AZs in a region simultaneously. A cross-region DR solution handles this scenario, but cross-region latency is higher than inter-AZ latency, which makes implementation more complex and costly.
Design principle
Before designing a multi-AZ or multi-region DR solution, confirm whether your stateful applications (databases, caches, message processors) and the cloud services they depend on support the intended DR scope.
Backup-Restore solutions
Backup-Restore is the simplest and lowest-cost DR approach. The tradeoff is higher RTO and RPO — recovery time grows with data volume and application complexity. Use full backups combined with incremental backups (supported by the ACK One backup center) to reduce RTO and RPO.
Backup-Restore also serves as the last line of protection. Maintain regular backup schedules and verify backup integrity during routine operations.
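The full-plus-incremental schedule described above can be sketched as a backup schedule resource. This is a minimal sketch that assumes the ACK One backup center accepts a Velero-compatible `Schedule` resource; the actual API group, namespace, and field names in your environment may differ, so treat the manifest as illustrative.

```yaml
# Hypothetical backup schedule, assuming a Velero-compatible API.
# The ACK One backup center console can generate the equivalent policy.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-app-backup
  namespace: csdr            # assumed backup-center namespace
spec:
  schedule: "0 2 * * *"      # full backup every day at 02:00
  template:
    includedNamespaces:
      - production           # back up application YAML and PVs in this namespace
    snapshotVolumes: true    # include persistent volume data
    ttl: 720h                # retain backups for 30 days
```

Verifying that restores from these backups actually succeed, in a scratch namespace or cluster, is part of the routine integrity check recommended above.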
Beyond DR, Backup-Restore solutions work for application migration across clusters:
- Migrate workloads from a data center to ACK. See Migrate applications from external Kubernetes clusters to ACK clusters.
- Upgrade aging clusters by migrating to new-version clusters instead of upgrading in place. See Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version.
- Reorganize account permissions or restructure organizations. See Best practice for using an ACK One Fleet instance to manage multiple clusters across platforms or across accounts and Migrate applications across clusters in different regions.
Solution 1: Cross-AZ and cross-region backup and recovery on Alibaba Cloud
The backup center of Distributed Cloud Container Platform for Kubernetes (ACK One) backs up both stateless and stateful applications running in ACK clusters. For stateful applications, storage data is backed up alongside the application YAML configuration.
The backup center integrates Alibaba Cloud storage and snapshot services to support one-click backup of application YAML data, persistent volumes (PVs) backed by cloud disks, and PVs backed by file systems.
Backup data can be restored to ACK clusters in any region and AZ at any time. Alibaba Cloud databases such as ApsaraDB RDS for MySQL also support backup and restore. See Backup and restoration and Migrate data between ApsaraDB RDS instances.
Solution 2: Data backup and restoration for hybrid clouds
Connect Kubernetes clusters running in on-premises data centers or other cloud platforms to ACK using registered clusters in ACK One. Once connected, use the ACK One backup center to back up applications in the registered cluster — including stateless and stateful workloads, with storage data backed up alongside the YAML configuration for stateful apps.
The backed-up application data (Deployments and StatefulSets) and storage data (persistent volumes (PVs) and persistent volume claims (PVCs)) can be restored to ACK clusters in any region and AZ.
Cross-cluster access during migration with multi-cluster Services
When migrating a large number of applications in batches, applications across clusters may need to communicate during the transition. Use the multi-cluster Services (MCS) feature of ACK One to enable cross-cluster access between interconnected clusters.
MCS injects the Kubernetes Service (including its endpoints) from one cluster into another. For example, MCS can inject the Service of Application 2 in Cluster 1 into Cluster 2, letting Application 1 in Cluster 2 reach Application 2 across clusters.
To register on-premises or third-party clusters with ACK One, connect via a leased line and use the registered cluster feature of ACK One, then enable MCS for those clusters.
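The Service injection described above can be sketched with the upstream Kubernetes Multi-Cluster Services (MCS) API. ACK One's MCS feature may use its own resource names and API group, so this manifest is an assumption for illustration only.

```yaml
# In Cluster 1, export the Service of Application 2 so other clusters
# can import it (assuming the upstream MCS ServiceExport API).
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: app2          # must match the name of the Service to export
  namespace: demo
# After the export syncs, Application 1 in Cluster 2 can reach
# Application 2 through the imported Service's DNS name, e.g.
# app2.demo.svc.clusterset.local per the MCS specification.
```

The key design point is that the exporting cluster owns the Service definition; consuming clusters only receive a synchronized import and its endpoints.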
Active-Active DR solutions for a single region (multi-AZ)
Multi-AZ Active-Active DR outperforms Active-Standby in three ways:
- Lower cost with higher utilization: Resources in both AZs serve live traffic, reducing idle capacity.
- Higher service quality and stronger fault tolerance: More replicas improve response speed and absorb peak traffic. During a failure there is no traffic-switching interruption, and you can perform system updates or maintenance without taking the service offline.
- Cross-zone scaling: If one zone runs short on resources, scale application replicas in other zones with available capacity.
Solution: Multi-cluster gateway based on ACK One
This solution deploys applications across two ACK clusters in separate AZs and uses the ACK One multi-cluster gateway for Layer 7 traffic routing and health-based failover.
Normal operation:
After AZ1 failure — automatic failover to Cluster 2 in AZ2:
How it works:
- Deploy applications to two ACK clusters using GitOps. Git repositories serve as the source of truth for continuous and consistent deployment.
- Define standard Kubernetes Ingress rules in YAML using the ACK One multi-cluster gateway. The gateway itself is deployed across AZs for high availability (HA).
- When Cluster 1 or its applications become unavailable, the multi-cluster gateway automatically reroutes traffic to Cluster 2 without manual intervention.
- As traffic increases in Cluster 2, the Horizontal Pod Autoscaler (HPA) scales out application replicas, which triggers the autoscaler to provision additional cluster nodes.
- For cross-AZ DR of ApsaraDB RDS, see Build a high availability architecture.
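The Ingress rules applied on the Fleet instance are standard `networking.k8s.io/v1` resources. In this sketch, the ingress class name and the cluster-weighting annotation are assumptions; check the multi-cluster gateway documentation for the exact keys.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: demo
  annotations:
    # Hypothetical annotation: prefer Cluster 1 and fail over automatically.
    # The real annotation keys depend on the multi-cluster gateway release.
    example.com/cluster-weight: "cluster1:100"
spec:
  ingressClassName: mse   # assumed class for the ACK One multi-cluster gateway
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```

Because the rule is plain Kubernetes Ingress YAML, the same manifest works unchanged if you later add or replace member clusters.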
This solution uses Layer 7 HTTP traffic forwarding with Layer 7 health checks. During a primary/secondary switchover, traffic loss is lower than with DNS-based traffic distribution.
Advantages over DNS-based traffic distribution:
| Dimension | ACK One multi-cluster gateway | DNS-based distribution |
|---|---|---|
| Failover speed | Milliseconds to seconds | Minutes (blocked by DNS TTL caching) |
| Routing capabilities | Advanced Layer 7 routing, session persistence, QUIC 0-RTT | Limited; no cross-cluster session persistence |
| Management | Single control plane (Fleet) manages all Ingress configurations | Separate DNS records per cluster |
| Cluster migration | Transparent — traffic shifts to the healthy cluster and back automatically | IP address changes disrupt clients until TTL expires |
DNS TTL workarounds (such as reducing TTL values) generate large volumes of DNS requests and increase costs without eliminating caching delays.
Architecture of the DNS-based alternative (for reference):
Cloud + data center DR for a single region
This solution extends the single-region multi-AZ architecture to a hybrid cloud setup, combining an ACK cluster with a Kubernetes cluster running in an on-premises data center.
Setup:
- Establish a leased line between the VPC and the data center to provide management and data channels.
- Connect the on-premises cluster to ACK One using the registered cluster feature. This lets you manage both clusters from a single control plane with Alibaba Cloud observability and security capabilities.
- Deploy applications to both clusters using ACK One GitOps.
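ACK One GitOps is based on Argo CD, so the deploy-to-both-clusters step can be sketched as one Argo CD `Application` per destination. The repository URL, path, and cluster name below are placeholders.

```yaml
# One Application per destination cluster; repeat with a different
# spec.destination for the on-premises registered cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-ack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/app-manifests.git  # placeholder repo
    targetRevision: main
    path: overlays/production
  destination:
    name: ack-cluster-1      # cluster registered with the GitOps instance
    namespace: demo
  syncPolicy:
    automated:
      prune: true
      selfHeal: true         # keep both clusters converged on the Git state
```

With `selfHeal` enabled, Git remains the single source of truth, so the cloud and on-premises clusters cannot drift apart silently.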
Architecture (single-region on-premises and cloud Active-Active):
Multi-region DR solutions
When your user base spans multiple regions or a single-region outage would be unacceptable, deploy business systems independently in multiple regions so each region can serve traffic on its own. Two solution patterns are available for multi-region DR.
Solution 1: Multi-cluster gateway based on ACK One (recommended)
Use this solution when:
- You need cross-region HA and the primary region has constrained resources (for example, GPU resources are scarce due to AI workload demand).
- Your applications have moderate latency sensitivity but require advanced multi-cluster traffic management.
How it works:
- An Application Load Balancer (ALB) multi-cluster gateway in Region 1 handles Layer 7 cross-region traffic routing, including QUIC 0-RTT and header-based routing. Region 2 runs a single-cluster ALB instance in cold standby, ready to accept traffic if Region 1 fails.
- Global Traffic Manager (GTM) provides DNS resolution and load distribution. GTM monitors the health of ALB instances in both regions and triggers DR automatically.
- Failure handling has two paths:
  - If the entire Region 1 or its ALB instance fails, GTM switches the DNS resolution of the service domain to the Region 2 ALB instance.
  - If only a cluster or specific service in Region 1 fails, or if Region 2 becomes unavailable, the ALB multi-cluster gateway reroutes traffic to the healthy cluster directly, without requiring a GTM DNS switch.
- Connect clusters across regions using a Cloud Enterprise Network (CEN) or VPC peering connection. Cross-region traffic is forwarded over a leased line for reliability.
- Use Global Distributed Cache for Tair for multi-region cache HA.
- Use the cross-region HA configuration for databases. See Architecture for multi-zone deployment.
Key advantages:
- Advanced traffic management: Content-based routing and flexible health checks beyond what traditional GTM provides.
- Centralized management: One Fleet control plane manages Ingress configurations and services across all clusters.
- Faster failover: Seamless failover in seconds for cluster or service failures, avoiding DNS propagation delays.
Solution 2: DNS-based traffic distribution with a single ACK One Fleet instance
Use this solution for globally distributed workloads that route users to the nearest region and where DNS-level latency is acceptable.
How it works:
- Protect services against volumetric attacks and web application attacks, including SQL injection, cross-site scripting (XSS), and command injection, using Anti-DDoS Proxy and Web Application Firewall (WAF). See GTM works with WAF, GA, and SLB and Protect a website service by using Anti-DDoS Pro or Anti-DDoS Premium and WAF.
- Route user requests to the nearest region using GTM.
- Deploy applications to both ACK clusters using GitOps for continuous and consistent deployment.
- Use Global Distributed Cache for Tair for multi-region cache HA.
- Use the cross-region HA configuration for databases. See Architecture for multi-zone deployment.
- Apply multi-AZ DR within each region.
Solution 3: DNS-based traffic distribution with multiple ACK One Fleet instances
This solution follows the same structure as Solution 2 but uses multiple ACK One Fleet instances instead of one. Use this when organizational boundaries or compliance requirements mandate separate management planes per region.
The traffic protection, routing, application deployment, cache HA, database HA, and per-region multi-AZ DR configuration are identical to Solution 2.
Cross-region unit-based Active-Active solution
This advanced pattern requires sharding both application traffic and data across geographic units. Each unit operates independently and serves its shard of the user base. The architecture isolates failures to individual units and lets you scale each unit separately.
Architecture:
How it works:
- Business is split into subunits (holding sharded data) and a central unit (holding user data and coordinating subunits).
- Traffic sharding rules determine which unit handles each request based on user identity or geography.
- Units interact for cross-shard operations.
This architecture requires implementing custom traffic distribution, data splitting, and cross-unit interaction logic. It is the most complex of all DR patterns described in this topic and is typically reserved for large-scale platforms with massive user bases.
Frequently asked questions
Why use the ACK One multi-cluster gateway instead of DNS-based distribution?
DNS-based distribution relies on TTL expiry for failover, which can take minutes. The ACK One multi-cluster gateway uses Layer 7 health checks and routes traffic in milliseconds to seconds. It also supports advanced routing features — such as QUIC 0-RTT and session persistence — that DNS cannot provide.
What happens during failover if my Cluster 2 is not pre-scaled?
If Cluster 2 does not have enough capacity when failover occurs, the HPA begins scaling replicas and the autoscaler provisions new nodes, but this takes time. Pre-scale Cluster 2 or configure HPA with appropriate minimum replica counts before enabling DR to avoid delays during an actual failure.
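The minimum-replica guardrail mentioned above can be sketched as a standard `autoscaling/v2` HorizontalPodAutoscaler; the workload name and thresholds here are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 4        # keep enough warm capacity to absorb failover traffic
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

Size `minReplicas` so that a single cluster can carry the combined load of both AZs for at least as long as node provisioning takes.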
How do I handle stateful applications (databases, caches) in a cross-AZ setup?
Confirm that your database and cache services support cross-AZ replication before deploying a cross-AZ DR solution. For ApsaraDB RDS, see Build a high availability architecture. For cache HA across regions, use Global Distributed Cache for Tair.
Can I use these DR solutions for on-premises clusters?
Yes. Use the registered cluster feature of ACK One to connect on-premises Kubernetes clusters to ACK One. Once registered, you can apply the same backup, GitOps deployment, and multi-cluster gateway configurations to your on-premises clusters.
Which solution should I choose for my use case?
| Scenario | Recommended solution |
|---|---|
| Non-critical workloads, cost is the primary constraint | Backup-Restore |
| Single-region HA with moderate availability requirements | Active-Standby |
| Single-region HA with near-zero downtime requirement | Multi-AZ Active-Active (ACK One multi-cluster gateway) |
| Hybrid cloud (on-premises + cloud) | Cloud + data center DR |
| Multi-region HA with advanced traffic management | Multi-region Solution 1 (ACK One multi-cluster gateway) |
| Multi-region HA with geo-based routing, DNS latency acceptable | Multi-region Solution 2 or 3 (DNS-based) |
| Massive scale, large user groups | Cross-region unit-based Active-Active |