
Alibaba Cloud Service Mesh:Implement region-level disaster recovery by combining ASM with GTM

Last Updated: Mar 11, 2026

A region-level failure affects all zones in a region simultaneously, causing connection failures, data loss, or workload outages. To protect against this, you deploy identical services across two regions, each with its own Service Mesh (ASM) ingress gateway exposed through a public IP address. The ASM ingress gateway can be deployed in a Kubernetes cluster or an Elastic Container Instance (ECI). Global Traffic Manager (GTM) resolves your domain name to both gateway IP addresses and monitors their health. When one region fails, GTM removes the unhealthy IP address from the resolution pool and redirects all traffic to the surviving region.

Architecture

The following dual-region, dual-cluster architecture demonstrates region-level disaster recovery:

  1. Multi-master control plane. Deploy a Kubernetes cluster in each region and create a separate ASM instance for each cluster. Each ASM instance acts as an independent control plane, keeping proxy-push latency low and ensuring the control plane remains available even if the other region is down.

  2. Dual ingress gateways with GTM. Deploy an ASM ingress gateway in each cluster and expose it through a Classic Load Balancer (CLB) or Network Load Balancer (NLB) with a public IP address. Configure Alibaba Cloud DNS and GTM to resolve your domain name to both gateway IP addresses.

  3. Automatic failover. When a region fails, GTM detects the unhealthy gateway through health checks, removes its IP address from the resolution pool, and routes all traffic to the healthy region.

Architecture diagram

Why multi-master with GTM

This architecture uses a multi-master control plane (one ASM instance per cluster) combined with DNS-based failover through GTM. Understanding the trade-offs helps you determine if this approach fits your requirements:

Architecture | Failover mechanism | Control plane HA | Best for
Multi-master + GTM (this guide) | DNS-based, GTM health checks | Each region has an independent control plane | Region-level DR where each region operates autonomously
Single control plane + multi-cluster | Mesh-native locality failover | Single point of failure for the control plane | Low-latency failover within a single mesh

Handle traffic spikes during failover

When one region goes down, the surviving region absorbs all traffic. To prevent overload:

  1. Enable Horizontal Pod Autoscaler (HPA) for ASM gateways to scale out gateway instances automatically.

    Note

    HPA is available only on ASM Enterprise Edition or Ultimate Edition.

  2. Configure local throttling on the ingress gateways to cap request rates within the cluster's capacity and prevent cascading failures.

  3. (Optional) Set up monitoring and alerts for throttling metrics to detect traffic surges and trigger scaling early.
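
ASM Enterprise Edition manages gateway HPA for you, but as an illustration, the same effect can be sketched with a standard Kubernetes HorizontalPodAutoscaler targeting the gateway Deployment. The Deployment name (istio-ingressgateway) and the thresholds below are assumptions, not ASM defaults:

```yaml
# Hypothetical sketch: a standard autoscaling/v2 HPA for the ingress gateway.
# The Deployment name and thresholds are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
  minReplicas: 2        # keep headroom even in steady state
  maxReplicas: 10       # cap scale-out during a failover spike
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # scale out when average CPU exceeds 80%
```

Setting minReplicas above 1 matters here: after a regional failover, the surviving gateways see a traffic spike before the autoscaler reacts.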

Process overview

Cross-region disaster recovery is supported for all cluster types. The following process shows how to create the ACK clusters and ASM instances, configure disaster recovery, and conduct a failure drill.

Process flow
Note

This guide uses a CLB instance associated with the ingress gateway as an example. For information about how to integrate GTM with an NLB instance associated with the ingress gateway, see GTM configuration process.

Prerequisites

Before you begin, make sure that you have:

  • Two Alibaba Cloud regions selected for deployment

  • Permissions to create Container Service for Kubernetes (ACK) clusters and ASM instances

  • A registered domain name managed by Alibaba Cloud DNS

  • A GTM instance (see GTM configuration process)

Step 1: Build a multi-master control plane

  1. Create two ACK clusters, cluster-1 and cluster-2, in two different regions. Enable Expose API Server by EIP during cluster creation. For details, see Create an ACK managed cluster.

  2. Create two ASM instances, mesh-1 and mesh-2, in the regions where the clusters reside. Add cluster-1 to mesh-1 and cluster-2 to mesh-2. For details, see Step 1 and Step 2 in Multi-cluster disaster recovery through ASM multi-master control plane architecture.

Step 2: Deploy ingress gateways and sample applications

  1. Create an ASM ingress gateway named ingressgateway in each ASM instance. For details, see Create an ingress gateway.

  2. Deploy the Bookinfo sample application in each cluster. For details, see Deploy an application in an ACK cluster that is added to an ASM instance.

  3. Create an Istio gateway and a virtual service in each ASM instance so the ingress gateway routes traffic to the Bookinfo application. For details, see Use Istio resources to route traffic to different versions of a service.

  4. Enable cluster-local traffic retention globally for each ASM instance. For details, see the Enable cluster-local traffic retention globally section in "Enable the feature of keeping traffic in-cluster in multi-cluster scenarios".

    Note

    Region-level disaster recovery requires all traffic from a cluster to stay within that cluster. Without this setting, the default load balancer in ASM may route requests to a peer service in another cluster under the same ASM instance. Cluster-local traffic retention prevents this cross-cluster routing.
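
As a sketch of the Istio resources created in step 3, the gateway and virtual service for Bookinfo might look like the following. The resource names and the route name are illustrative assumptions; the route name is chosen to match the name_match used by the local throttling rule in Step 5:

```yaml
# Hypothetical sketch: an Istio gateway plus a virtual service that routes
# /productpage to the Bookinfo productpage service. Names are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: bookinfo-gateway
spec:
  selector:
    istio: ingressgateway      # bind to the ASM ingress gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: bookinfo
spec:
  hosts:
  - "*"
  gateways:
  - bookinfo-gateway
  http:
  - name: gw-to-productpage    # referenced by name_match in the throttling rule
    match:
    - uri:
        exact: /productpage
    route:
    - destination:
        host: productpage
        port:
          number: 9080
```

Apply the same resources through both mesh-1 and mesh-2 so each region routes identically.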

(Optional) Step 3: Verify service status

  1. Get the public IP addresses of the two ASM gateways. See Obtain the IP address of the ingress gateway in "Use Istio resources to route traffic to different versions of a service".

  2. Run the following command with the kubeconfig file of each cluster to list the reviews service pods:

    kubectl get pod | grep reviews

    Expected output:

    reviews-v1-5d99dxxxxx-xxxxx       2/2     Running   0          3d17h
    reviews-v2-69fbbxxxxx-xxxxx       2/2     Running   0          3d17h
    reviews-v3-8c44xxxxx-xxxxx        2/2     Running   0          3d17h
  3. Open the following URLs in a browser and refresh each page 10 times:

    • http://<mesh-1-gateway-ip>/productpage

    • http://<mesh-2-gateway-ip>/productpage

    Each refresh should cycle through the v1, v2, and v3 versions of the reviews service, with pod names matching those from step 2. This confirms the services are running correctly and cluster-local traffic retention is active.

    Service verification

Step 4: Configure GTM

Use the two public gateway IP addresses to configure multi-active load balancing and disaster recovery on GTM. For details, see Use GTM to implement multi-active load balancing and disaster recovery.

The following figure shows a sample GTM configuration:

GTM configuration

(Optional) Step 5: Configure local throttling and monitoring

  1. Apply the following ASMLocalRateLimiter resource to both mesh-1 and mesh-2. This limits the ingress gateway to 100 requests per second on port 80. For details, see Configure local throttling on an ingress gateway.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: ASMLocalRateLimiter
    metadata:
      name: ingressgateway
      namespace: istio-system
    spec:
      configs:
        - limit:
            fill_interval:
              seconds: 1
            quota: 100
          match:
            vhost:
              name: '*'
              port: 80
              route:
                name_match: gw-to-productpage
      isGateway: true
      workloadSelector:
        labels:
          istio: ingressgateway
  2. Set up metric collection and alerts for throttling events. For details, see the Configure metric collection and alerts for local throttling section in "Configure local throttling in traffic management center".
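
When sizing the limit, it helps to relate the per-gateway quota to post-failover traffic. A minimal back-of-the-envelope sketch follows; the 100 requests-per-second quota comes from the ASMLocalRateLimiter above, while the per-region load figure is a hypothetical assumption:

```python
import math

# Assumption: each region normally serves this many requests per second.
normal_rps_per_region = 150

# After a region failure, the surviving region absorbs both regions' traffic.
failover_rps = 2 * normal_rps_per_region          # 300 rps

# Quota from the ASMLocalRateLimiter: 100 requests per 1-second fill_interval.
# Local throttling is enforced per gateway replica.
per_gateway_quota_rps = 100

# Minimum gateway replicas so throttling does not reject legitimate load.
replicas_needed = math.ceil(failover_rps / per_gateway_quota_rps)
print(replicas_needed)  # 3
```

If the result exceeds the HPA's maximum replica count, either raise that maximum or raise the per-gateway quota.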

Failure drill

Run a failure drill to validate that GTM redirects traffic when a region goes down. This drill uses fortio, an open-source load testing tool, to simulate external user traffic while you remove an ingress gateway to trigger failover.

Run the drill

  1. Replace <domain-name> with the domain name configured in GTM, then start a 5-minute load test:

    fortio load -jitter=False -c 1 -qps 100 -t 300s -keepalive=False -a http://<domain-name>/productpage
  2. While the test is running, simulate a region failure by deleting the ingress gateway in cluster-2:

    1. Log on to the ACK console. In the left navigation pane, click Clusters.

    2. On the Clusters page, find cluster-2 and click its name. In the left navigation pane, choose Workloads > Deployments.

    3. Select istio-system from the Namespace drop-down list.

    4. Find istio-ingressgateway in the workload list and click More > Delete in the Actions column.

Interpret the results

After the test completes, check the summary output for these two metrics:

Metric | Expected value | Meaning
Code 200 | ~86% of requests | Requests served successfully by the healthy region
Code -1 | ~14% of requests | Requests that failed during the failover window (connection errors to the removed gateway)

A small percentage of failed requests is expected. These are in-flight requests directed to the failed region before GTM updated the DNS resolution. The majority of requests succeed, confirming that GTM detected the failure and redirected traffic.
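
The failed fraction also gives a rough estimate of the failover window. The sketch below uses the drill's parameters and the ~14% figure from the table; it assumes the failures are concentrated in one contiguous window, so treat the result as an approximation:

```python
# Fortio drill parameters from the command above.
duration_s = 300          # -t 300s
qps = 100                 # -qps 100

# Observed failure fraction (Code -1) from the drill summary.
failed_fraction = 0.14

# Roughly how long traffic kept hitting the removed gateway before
# GTM's health checks and DNS update took effect.
failover_window_s = failed_fraction * duration_s       # ~42 s
failed_requests = failed_fraction * duration_s * qps   # ~4200 requests
print(failover_window_s, failed_requests)
```

The actual window depends on the GTM health check interval and the DNS record TTL cached by resolvers, so tightening either shortens it.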

Verify in the GTM console

Open the Global Traffic Manager page and check the domain name access status. The IP address of cluster-2 should no longer appear in the resolution pool.

GTM access status

To receive alert notifications when an address becomes unavailable, configure an alert rule. For details, see Configure alert settings.
