A region-level failure affects all zones in a region simultaneously, causing connection failures, data loss, or workload outages. To protect against this, you deploy identical services across two regions, each with its own Alibaba Cloud Service Mesh (ASM) ingress gateway exposed through a public IP address. The ASM ingress gateway can be deployed in a Kubernetes cluster or an Elastic Container Instance (ECI). Global Traffic Manager (GTM) resolves your domain name to both gateway IP addresses and monitors their health. When one region fails, GTM removes the unhealthy IP address from the resolution pool and redirects all traffic to the surviving region.
Architecture
The following dual-region, dual-cluster architecture demonstrates region-level disaster recovery:
Multi-master control plane. Deploy a Kubernetes cluster in each region and create a separate ASM instance for each cluster. Each ASM instance acts as an independent control plane, keeping proxy-push latency low and ensuring the control plane remains available even if the other region is down.
Dual ingress gateways with GTM. Deploy an ASM ingress gateway in each cluster and expose it through a Classic Load Balancer (CLB) or Network Load Balancer (NLB) with a public IP address. Configure Alibaba Cloud DNS and GTM to resolve your domain name to both gateway IP addresses.
Automatic failover. When a region fails, GTM detects the unhealthy gateway through health checks, removes its IP address from the resolution pool, and routes all traffic to the healthy region.
Why multi-master with GTM
This architecture uses a multi-master control plane (one ASM instance per cluster) combined with DNS-based failover through GTM. Understanding the trade-offs helps you determine if this approach fits your requirements:
| Architecture | Failover mechanism | Control plane HA | Best for |
|---|---|---|---|
| Multi-master + GTM (this guide) | DNS-based, GTM health checks | Each region has an independent control plane | Region-level DR where each region operates autonomously |
| Single control plane + multi-cluster | Mesh-native locality failover | Single point of failure for the control plane | Low-latency failover within a single mesh |
Handle traffic spikes during failover
When one region goes down, the surviving region absorbs all traffic. To prevent overload:
Enable Horizontal Pod Autoscaler (HPA) for ASM gateways to scale out gateway instances automatically.
Note: HPA is available only in ASM Enterprise Edition or Ultimate Edition.
Configure local throttling on the ingress gateways to cap request rates within the cluster's capacity and prevent cascading failures.
(Optional) Set up monitoring and alerts for throttling metrics to detect traffic surges and trigger scaling early.
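As a sketch of the autoscaling setup, the following is a standard `autoscaling/v2` HorizontalPodAutoscaler targeting the gateway Deployment. The Deployment name `istio-ingressgateway`, the namespace, and the thresholds are assumptions based on common defaults; adjust them to match your gateway, and prefer the ASM console's HPA settings where available.

```yaml
# Hypothetical HPA for the ASM ingress gateway Deployment. The name
# (istio-ingressgateway) and namespace (istio-system) are common
# defaults, not values from this guide; verify them in your cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingressgateway-hpa
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
  minReplicas: 2           # keep at least two replicas for availability
  maxReplicas: 10          # headroom to absorb the failed region's load
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

Size `maxReplicas` so one region alone can carry the combined peak traffic of both regions.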
Process overview
Cross-region disaster recovery is supported in all types of clusters. The following process shows how to create an ACK cluster and ASM instance, configure disaster recovery, and conduct failure drills.
This guide uses a CLB instance associated with the ingress gateway as an example. For information about how to integrate GTM with an NLB instance associated with the ingress gateway, see GTM configuration process.
Prerequisites
Before you begin, make sure that you have:
Two Alibaba Cloud regions selected for deployment
Permissions to create Container Service for Kubernetes (ACK) clusters and ASM instances
A registered domain name managed by Alibaba Cloud DNS
A GTM instance (see GTM configuration process)
Step 1: Build a multi-master control plane
1. Create two ACK clusters, `cluster-1` and `cluster-2`, in two different regions. Enable Expose API Server by EIP during cluster creation. For details, see Create an ACK managed cluster.
2. Create two ASM instances, `mesh-1` and `mesh-2`, in the regions where the clusters reside. Add `cluster-1` to `mesh-1` and `cluster-2` to `mesh-2`. For details, see Step 1 and Step 2 in Multi-cluster disaster recovery through ASM multi-master control plane architecture.
Step 2: Deploy ingress gateways and sample applications
1. Create an ASM ingress gateway named `ingressgateway` in each ASM instance. For details, see Create an ingress gateway.
2. Deploy the Bookinfo sample application in each cluster. For details, see Deploy an application in an ACK cluster that is added to an ASM instance.
3. Create an Istio gateway and a virtual service in each ASM instance so the ingress gateway routes traffic to the Bookinfo application. For details, see Use Istio resources to route traffic to different versions of a service.
4. Enable cluster-local traffic retention globally for each ASM instance. For details, see the Enable cluster-local traffic retention globally section in "Enable the feature of keeping traffic in-cluster in multi-cluster scenarios".

   Note: Region-level disaster recovery requires all traffic from a cluster to stay within that cluster. Without this setting, the default load balancer in ASM may route requests to a peer service in another cluster under the same ASM instance. Cluster-local traffic retention prevents this cross-cluster routing.
(Optional) Step 3: Verify service status
1. Get the public IP addresses of the two ASM gateways. See Obtain the IP address of the ingress gateway in "Use Istio resources to route traffic to different versions of a service".
2. Run the following command with the kubeconfig file of each cluster to list the reviews service pods:

   ```shell
   kubectl get pod | grep reviews
   ```

   Expected output:

   ```
   reviews-v1-5d99dxxxxx-xxxxx   2/2   Running   0   3d17h
   reviews-v2-69fbbxxxxx-xxxxx   2/2   Running   0   3d17h
   reviews-v3-8c44xxxxx-xxxxx    2/2   Running   0   3d17h
   ```

3. Open the following URLs in a browser and refresh each page 10 times:

   - http://<mesh-1-gateway-ip>/productpage
   - http://<mesh-2-gateway-ip>/productpage

   Each refresh should cycle through the v1, v2, and v3 versions of the reviews service, with pod names matching those from step 2. This confirms that the services are running correctly and that cluster-local traffic retention is active.
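If you prefer the command line to a browser, the same check can be scripted. This is a sketch, assuming `curl` is installed; `<gateway-ip>` stands for either gateway's public IP from step 1, and `summarize_codes` is a small helper introduced here only to tally status codes.

```shell
# Tally a stream of HTTP status codes (one per line) into "code:count".
summarize_codes() {
  sort | uniq -c | awk '{printf "%s:%s\n", $2, $1}'
}

# Fetch the product page 10 times and print only each response's status
# code; a healthy gateway should yield "200:10".
check_gateway() {
  for i in $(seq 1 10); do
    curl -s -o /dev/null -w '%{http_code}\n' "http://$1/productpage"
  done | summarize_codes
}

# check_gateway "<mesh-1-gateway-ip>"
# check_gateway "<mesh-2-gateway-ip>"
```

Run the commented-out calls against each gateway IP; any non-200 entries in the tally point to a routing or deployment problem.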

Step 4: Configure GTM
Use the two public gateway IP addresses to configure multi-active load balancing and disaster recovery on GTM. For details, see Use GTM to implement multi-active load balancing and disaster recovery.
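Once GTM is configured, you can confirm from any client that the domain resolves to both gateway IPs. A sketch, assuming `dig` is installed and `<domain-name>` is the domain configured in GTM; `join_ips` is a helper added here only to normalize the output for comparison.

```shell
# Normalize a list of IPs (one per line) into a sorted, comma-joined
# string so two resolution snapshots are easy to compare.
join_ips() {
  sort | paste -sd, -
}

# Before a failure, this should show both gateway IPs; during a region
# outage, the unhealthy IP disappears once GTM updates the pool.
# dig +short "<domain-name>" A | join_ips
```

Repeat the query during the failure drill below to watch the failed region's IP drop out of the resolution pool.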
The following figure shows a sample GTM configuration:

(Optional) Step 5: Configure local throttling and monitoring
1. Apply the following ASMLocalRateLimiter resource to both `mesh-1` and `mesh-2`. This limits the ingress gateway to 100 requests per second on port 80. For details, see Configure local throttling on an ingress gateway.

   ```yaml
   apiVersion: istio.alibabacloud.com/v1beta1
   kind: ASMLocalRateLimiter
   metadata:
     name: ingressgateway
     namespace: istio-system
   spec:
     configs:
       - limit:
           fill_interval:
             seconds: 1
           quota: 100
         match:
           vhost:
             name: '*'
             port: 80
             route:
               name_match: gw-to-productage
     isGateway: true
     workloadSelector:
       labels:
         istio: ingressgateway
   ```

2. Set up metric collection and alerts for throttling events. For details, see the Configure metric collection and alerts for local throttling section in "Configure local throttling in traffic management center".
Failure drill
Run a failure drill to validate that GTM redirects traffic when a region goes down. This drill uses fortio, an open-source load testing tool, to simulate external user traffic while you remove an ingress gateway to trigger failover.
Run the drill
1. Replace `<domain-name>` with the domain name configured in GTM, then start a 5-minute load test:

   ```shell
   fortio load -jitter=False -c 1 -qps 100 -t 300s -keepalive=False -a http://<domain-name>/productpage
   ```

2. While the test is running, simulate a region failure by deleting the ingress gateway in `cluster-2`:

   a. Log on to the ACK console. In the left navigation pane, click Clusters.
   b. On the Clusters page, find `cluster-2` and click its name. In the left navigation pane, choose Workloads > Deployments.
   c. Select istio-system from the Namespace drop-down list.
   d. Find istio-ingressgateway in the workload list and click More > Delete in the Actions column.
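The console steps can equivalently be scripted with kubectl. A sketch, assuming your current kubectl context points at `cluster-2` and the gateway Deployment uses the default name `istio-ingressgateway` (verify both before running).

```shell
# Equivalent of the console steps: delete the gateway Deployment in
# cluster-2 so its public IP fails GTM health checks. The namespace and
# Deployment name below are assumed defaults; confirm them first with
#   kubectl -n istio-system get deployments
NAMESPACE=istio-system
GATEWAY_DEPLOY=istio-ingressgateway

# Uncomment to actually trigger the simulated failure:
# kubectl -n "$NAMESPACE" delete deployment "$GATEWAY_DEPLOY"
```

After the drill, redeploy the ingress gateway in `cluster-2` so the region rejoins the GTM resolution pool.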
Interpret the results
After the test completes, check the summary output for these two metrics:
| Metric | Expected value | Meaning |
|---|---|---|
| Code 200 | ~86% of requests | Requests served successfully by the healthy region |
| Code -1 | ~14% of requests | Requests that failed during the failover window (connection errors to the removed gateway) |
A small percentage of failed requests is expected. These are in-flight requests directed to the failed region before GTM updated the DNS resolution. The majority of requests succeed, confirming that GTM detected the failure and redirected traffic.
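The failed-request share maps directly to a failover window: assuming fortio held a constant request rate for the whole test, the fraction of failed requests equals the fraction of test time spent before DNS converged (health-check detection plus client TTL expiry). A quick back-of-the-envelope check:

```shell
# Estimate the failover window from the drill results. With a constant
# request rate, failed share of requests == failed share of test time.
failover_window() {
  # $1 = test duration in seconds, $2 = failed percentage (integer)
  echo $(( $1 * $2 / 100 ))
}

failover_window 300 14   # prints 42 -> roughly a 42-second window
```

If your measured window is much longer than the GTM health-check interval plus the DNS TTL, check for clients caching DNS beyond the TTL (for example, connection pools with keepalive).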
Verify in the GTM console
Open the Global Traffic Manager page and check the domain name access status. The IP address of cluster-2 should no longer appear in the resolution pool.

To receive alert notifications when an address becomes unavailable, configure an alert rule. For details, see Configure alert settings.