
Alibaba Cloud Service Mesh:Use ASM to implement disaster recovery for service-level failures

Last Updated:Dec 15, 2025

Service-level failures are common in cloud-native businesses. Although they are smaller in scope than zone-level issues, they can still cause business unavailability or degradation due to single points of failure. Alibaba Cloud Service Mesh (ASM) supports service discovery and network communication across multiple clusters. With a multi-region, multi-cluster deployment, ASM can switch traffic to available service instances within seconds after a service fails, ensuring global availability for business applications. This topic describes how to use ASM to implement disaster recovery for service-level failures.

Disaster recovery architecture

Failover mechanism: Service Mesh-based geographic failover

ASM supports a geography-based failover mechanism. By detecting whether a workload continuously returns response errors within a time window, ASM determines the failure status of the workload and, when a workload fails, automatically redirects traffic to other available workloads.

The following example involves two clusters located in different regions, where two sets of applications are deployed peer-to-peer. The workloads of each service in each cluster are evenly distributed across zones through topology spread constraints.

  1. Normal traffic topology

    When services are normal, the service mesh keeps traffic within the same zone to minimize the impact of the target location on request latency.

    image
  2. Automatic geographic failover during workload anomalies

    When ASM detects continuous response errors from a workload, it evicts the workload from the load balancing pool so that requests are transferred to other available workloads. Traffic failover follows the priority order below:

    1. Other workloads in the same zone.

    2. Workloads in different zones within the same region.

    Additionally, if you configure Prometheus instances to collect service mesh metrics, you can receive alert notifications when a failover occurs.

    image
  3. Cross-region traffic transfer

    When there are no available workloads in the zones within the same region, traffic can be transferred to the same service deployed in another region's cluster. This process is automatically completed by the service mesh.

    In this case, routing must be established between the two clusters so that a pod in cluster A can connect to one in cluster B. For scenarios involving multiple regions in Alibaba Cloud, use Cloud Enterprise Network (CEN) to connect the physical networks between clusters and achieve cluster intercommunication. This solution requires planning the cluster networks (such as pod, service, and VPC CIDR blocks) and using CEN instances to achieve cross-region network connectivity, and is suitable for customers that require high-quality cross-region traffic.

    For clusters in data centers or on third-party clouds, network planning constraints may prevent direct physical connectivity. To address this, ASM can establish an mTLS secure channel through cross-cluster mesh proxies for essential communication and secure cross-region traffic transfer. However, because public network quality is limited, this solution is best for users with minimal inter-cluster communication needs or those facing practical network conflicts.

    image
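The failover priorities above are driven by Istio-style outlier detection combined with locality-aware load balancing. The following is a hedged sketch of how the two pieces fit together in a destination rule (the service name mocka is reused from this topic's example; the thresholds and the explicit localityLbSetting are illustrative, since locality load balancing is typically enabled by default once outlier detection is configured):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mocka
spec:
  host: mocka
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true          # keep traffic in the caller's zone while endpoints there are healthy
    outlierDetection:          # outlier detection is required for locality failover to trigger
      consecutive5xxErrors: 1  # illustrative threshold
      interval: 30s
      baseEjectionTime: 5m
      maxEjectionPercent: 100
```

The complete destination rules used by this topic's example are shown in the Disaster recovery configuration section below.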

Application deployment topologies: supported options and comparison

ASM's geography-based failover mechanism can be applied seamlessly to various application deployment topologies. The topologies differ in the failure dimensions they cover, their resource costs, and their operations and maintenance complexity.

  1. Single-cluster multi-zone deployment

    This is the simplest deployment method for disaster recovery. By configuring multi-zone node pools, selecting multi-zone virtual switches for node pools, and choosing a balanced distribution strategy when configuring scaling policies, you can evenly distribute Elastic Compute Service (ECS) instances across the specified multi-zone scaling group.

    image

    Comparison of advantages and disadvantages of this method:

    Advantages:
    • Only one Container Service for Kubernetes (ACK) cluster with a multi-zone node pool is needed. Applications are evenly deployed across zones through topology spread constraints.
    • Load balancing, databases, Kubernetes clusters, service meshes, and other cloud resources only need to be prepared within the same region.

    Disadvantages:
    • Cannot handle failures caused by Kubernetes misconfiguration or application dependencies.
    • Cannot handle region-level failures.

  2. Single-region multi-zone multi-cluster deployment

    Expand from a single-cluster multi-zone deployment to multiple clusters, each with its own configuration. During failover, ASM first checks for workloads in the same zone and then randomly selects one of the available workloads in other zones within the same region. We recommend that you choose separate zones for the clusters: if the zones overlap, workloads in the same zone may be prioritized over those in other clusters, which can lead to unexpected business outcomes.

    image

    Comparison of advantages and disadvantages of this method:

    Advantages:
    • Compared to single-cluster deployment, this topology expands the dimensions of fault causes covered, further improving the overall availability of the system.
    • Only two clusters need to be within the same VPC. When traffic failover to services in another cluster is required, this deployment topology has lower cluster intercommunication costs.

    Disadvantages:
    • Services must be deployed in two Kubernetes cluster environments, with high deployment and maintenance complexity. Load balancing, Kubernetes clusters, and other cloud resource infrastructure must be deployed multiple times, resulting in high costs.
    • Cannot handle region-level failures.

    Note
    • In this deployment topology, you must plan the network configuration of multiple clusters to avoid network conflicts. For more information, see Plan CIDR blocks for multiple clusters on the data plane.

    • Do not use single-region multi-zone multi-cluster deployment as the first choice, because it significantly increases deployment and maintenance complexity without providing the highest level of availability.

  3. Multi-region multi-cluster deployment

    Building on multiple clusters, change the single-region deployment to a multi-region deployment. This ensures that the workloads of each service are distributed across multiple dimensions: different zones, regions, clusters, and dependencies. As a result, normal workloads can continue to operate during potential failures.

    image

    Comparison of advantages and disadvantages of this method:

    Advantages:
    • Can distribute the workloads of the same service across different regions, zones, and clusters to improve service availability.
    • Based on a multi-master control plane architecture, latency issues can be effectively controlled.

    Disadvantages:
    • Services must be deployed in two Kubernetes cluster environments. For better push latency performance, two service mesh instances must also be prepared.
    • Cross-region network intercommunication is relatively complex.

    When you have high availability requirements, use this topology for business application deployment. In scenarios requiring cross-cluster network connectivity, choose one of the two methods described earlier (CEN-based physical connectivity or cross-cluster mesh proxies) to connect the cluster networks.

  4. Multi-cloud multi-region multi-cluster deployment

    This deployment topology represents the highest level of deployment architecture. ASM offers comprehensive management for Kubernetes clusters across various cloud providers and on-premises data centers. By deploying Kubernetes cluster environments across different clouds or hybrid environments, it guarantees continuous service availability even if one infrastructure experiences a failure.

    image

    Similar to multi-region multi-cluster deployments, multi-cloud multi-region multi-cluster deployments also require network connectivity between clusters. However, ASM cannot provide a managed control plane in environments outside of Alibaba Cloud. To address push latency issues in cross-cloud scenarios, use the ASM remote control plane.

    Comparison of advantages and disadvantages of this method:

    Advantages:
    • Allows distribution of the same service's workloads across various regions, zones, clusters, and cloud providers, enhancing service availability.
    • The multi-master control plane architecture effectively mitigates latency issues.

    Disadvantages:
    • Services must be deployed across two Kubernetes cluster environments. Multi-cloud deployments may face numerous compatibility and other unforeseen issues due to infrastructure and product capability differences between providers.
    • Cross-region intercommunication is quite complex.

Disaster recovery configuration

This section describes the process for setting up service-level disaster recovery using a multi-region, multi-cluster deployment approach as an example.

Step 1: Environment preparation

  1. Create a Kubernetes cluster in each of the two regions, naming them cluster 1 and cluster 2, both configured across multiple zones and with the Expose API Server with EIP feature enabled. For more information, see Create an ACK managed cluster.

  2. Create ASM instances mesh-1 and mesh-2 in the same regions as the ACK clusters, and ensure that the vSwitches selected during creation are in the same zones as the clusters. For more information, see Steps 1, 2, and 3 in Multi-cluster disaster recovery through ASM multi-master control plane architecture.

  3. Deploy ASM gateways and sample services.

    1. Create a Network Load Balancer (NLB)-type ingress gateway named ingressgateway in both mesh-1 and mesh-2, ensuring you select the same two zones used by the ASM instances and ACK clusters. For more information, see Associate an NLB instance with an ingress gateway.

    2. Enable automatic sidecar injection in the default namespace for both clusters within mesh-1 and mesh-2. For detailed instructions, see Manage global namespaces.

    3. Deploy sample applications in a peer-to-peer manner across both ACK clusters. The four zones used in the example are cn-hangzhou-h, cn-hangzhou-k, ap-northeast-1a, and ap-northeast-1b. Adjust the deployment commands in the YAML as needed.


      kubectl apply -f- <<EOF
      apiVersion: v1
      kind: Service
      metadata:
        name: mocka
        labels:
          app: mocka
          service: mocka
      spec:
        ports:
        - port: 8000
          name: http
        selector:
          app: mocka
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mocka-cn-hangzhou-h
        labels:
          app: mocka
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mocka
        template:
          metadata:
            labels:
              app: mocka
              locality: cn-hangzhou-h
          spec:
            nodeSelector:      
              topology.kubernetes.io/zone: cn-hangzhou-h  
            containers:
            - name: default
              image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
              imagePullPolicy: IfNotPresent
              env:
              - name: version
                value: cn-hangzhou-h 
              - name: app
                value: mocka
              - name: upstream_url
                value: "http://mockb:8000/"
              ports:
              - containerPort: 8000
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mocka-cn-hangzhou-k
        labels:
          app: mocka
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mocka
        template:
          metadata:
            labels:
              app: mocka
              locality: cn-hangzhou-k
          spec:
            nodeSelector:      
              topology.kubernetes.io/zone: cn-hangzhou-k
            containers:
            - name: default
              image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
              imagePullPolicy: IfNotPresent
              env:
              - name: version
                value: cn-hangzhou-k
              - name: app
                value: mocka
              - name: upstream_url
                value: "http://mockb:8000/"
              ports:
              - containerPort: 8000
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: mockb
        labels:
          app: mockb
          service: mockb
      spec:
        ports:
        - port: 8000
          name: http
        selector:
          app: mockb
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mockb-cn-hangzhou-h
        labels:
          app: mockb
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mockb
        template:
          metadata:
            labels:
              app: mockb
              locality: cn-hangzhou-h
          spec:
            nodeSelector:      
              topology.kubernetes.io/zone: cn-hangzhou-h
            containers:
            - name: default
              image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
              imagePullPolicy: IfNotPresent
              env:
              - name: version
                value: cn-hangzhou-h
              - name: app
                value: mockb
              - name: upstream_url
                value: "http://mockc:8000/"
              ports:
              - containerPort: 8000
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mockb-cn-hangzhou-k
        labels:
          app: mockb
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mockb
        template:
          metadata:
            labels:
              app: mockb
              locality: cn-hangzhou-k
          spec:
            nodeSelector:      
              topology.kubernetes.io/zone: cn-hangzhou-k
            containers:
            - name: default
              image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
              imagePullPolicy: IfNotPresent
              env:
              - name: version
                value: cn-hangzhou-k
              - name: app
                value: mockb
              - name: upstream_url
                value: "http://mockc:8000/"
              ports:
              - containerPort: 8000
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: mockc
        labels:
          app: mockc
          service: mockc
      spec:
        ports:
        - port: 8000
          name: http
        selector:
          app: mockc
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mockc-cn-hangzhou-h
        labels:
          app: mockc
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mockc
        template:
          metadata:
            labels:
              app: mockc
              locality: cn-hangzhou-h
          spec:
            nodeSelector:      
              topology.kubernetes.io/zone: cn-hangzhou-h
            containers:
            - name: default
              image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
              imagePullPolicy: IfNotPresent
              env:
              - name: version
                value: cn-hangzhou-h
              - name: app
                value: mockc
              ports:
              - containerPort: 8000
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mockc-cn-hangzhou-k
        labels:
          app: mockc
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mockc
        template:
          metadata:
            labels:
              app: mockc
              locality: cn-hangzhou-k
          spec:
            nodeSelector:      
              topology.kubernetes.io/zone: cn-hangzhou-k
            containers:
            - name: default
              image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
              imagePullPolicy: IfNotPresent
              env:
              - name: version
                value: cn-hangzhou-k
              - name: app
                value: mockc
              ports:
              - containerPort: 8000
      ---
      apiVersion: networking.istio.io/v1beta1
      kind: Gateway
      metadata:
        name: mocka
        namespace: default
      spec:
        selector:
          istio: ingressgateway
        servers:
          - hosts:
              - '*'
            port:
              name: test
              number: 80
              protocol: HTTP
      ---
      apiVersion: networking.istio.io/v1beta1
      kind: VirtualService
      metadata:
        name: demoapp-vs
        namespace: default
      spec:
        gateways:
          - mocka
        hosts:
          - '*'
        http:
          - name: test
            route:
              - destination:
                  host: mocka
                  port:
                    number: 8000
      EOF

      After you run the commands in both clusters, the services mocka, mockb, and mockc are deployed peer-to-peer. Each service has two single-replica stateless deployments distributed across different zones by node selectors, and each deployment sets environment variables that identify its zone.

      Note

      For demo purposes, this example utilizes the NodeSelector field to manually select the zone for the pod's location. When building a high-availability environment, set up topology spread constraints to maximize the distribution of pods across various zones. For more information, see Workload HA configuration.
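As a sketch of the topology spread constraints recommended above, a Deployment's pod template could declare the following (the app: mocka label is reused from this example; the maxSkew and whenUnsatisfiable values are illustrative assumptions, not required settings):

```yaml
# Illustrative pod-template fragment: spread mocka pods evenly across zones.
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                  # allow at most 1 pod of imbalance between zones
    topologyKey: topology.kubernetes.io/zone    # spread by zone
    whenUnsatisfiable: DoNotSchedule            # keep pods pending rather than skew zones
    labelSelector:
      matchLabels:
        app: mocka
```

Unlike the nodeSelector approach in the demo YAML, this lets the scheduler place replicas without pinning each Deployment to a single zone.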

  4. Set up Global Traffic Manager (GTM) for multi-active disaster recovery of ingress gateways using NLB instances.

Step 2: Enable geographic-based failover

ASM provides geography-based failover capabilities by default. For failover to take effect, you must configure host-level circuit breaking (outlier detection) in the destination rules.

  1. Deploy host-level circuit breaking rules in both cluster 1 and cluster 2.

    kubectl apply -f- <<EOF
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: mocka
    spec:
      host: mocka
      trafficPolicy:
        outlierDetection:
          splitExternalLocalOriginErrors: true
          consecutiveLocalOriginFailures: 1
          baseEjectionTime: 5m
          consecutive5xxErrors: 1
          interval: 30s
          maxEjectionPercent: 100
    ---
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: mockb
    spec:
      host: mockb
      trafficPolicy:
        outlierDetection:
          splitExternalLocalOriginErrors: true
          consecutiveLocalOriginFailures: 1
          baseEjectionTime: 5m
          consecutive5xxErrors: 1
          interval: 30s
          maxEjectionPercent: 100
    ---
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: mockc
    spec:
      host: mockc
      trafficPolicy:
        outlierDetection:
          splitExternalLocalOriginErrors: true
          consecutiveLocalOriginFailures: 1
          baseEjectionTime: 5m
          consecutive5xxErrors: 1
          interval: 30s
          maxEjectionPercent: 100
    EOF

    outlierDetection is the service mesh mechanism for detecting endpoint failures and evicting faulty endpoints from the load balancing pool. The configuration items are described below:

    • interval: The interval between failure detection analysis sweeps.

    • baseEjectionTime: The duration for which an endpoint is evicted from the load balancing pool after it is deemed faulty.

    • maxEjectionPercent: The maximum percentage of endpoints that can be evicted.

    • consecutive5xxErrors: The number of consecutive 5xx errors after which an endpoint is deemed faulty.

    • splitExternalLocalOriginErrors: Whether to evaluate locally originated errors (such as connection failures and timeouts) separately from externally originated (5xx) errors.

    • consecutiveLocalOriginFailures: The number of consecutive locally originated failures (such as connection failures and timeouts) after which an endpoint is deemed faulty.

  2. Verify the service status.

    Send a request to the service's domain name.

    curl mock.asm-demo.work/mock -v

    Expected output: The service call chain remains within workloads in the same zone.

    * Host mock.asm-demo.work:80 was resolved.
    * IPv6: (none)
    * IPv4: 8.xxx.xxx.47, 8.xxx.xxx.42
    *   Trying 8.xxx.xxx.47:80...
    * Connected to mock.asm-demo.work (8.209.XXX.XX) port 80
    > GET /mock HTTP/1.1
    > Host: mock.asm-demo.work
    > User-Agent: curl/8.7.1
    > Accept: */*
    > 
    * Request completely sent off
    < HTTP/1.1 200 OK
    < date: Wed, 11 Dec 2024 12:10:56 GMT
    < content-length: 153
    < content-type: text/plain; charset=utf-8
    < x-envoy-upstream-service-time: 3
    < server: istio-envoy
    < 
    * Connection #0 to host mock.asm-demo.work left intact
    -> mocka(version: ap-northeast-1b, ip: 10.1.225.40)-> mockb(version: ap-northeast-1b, ip: 10.1.225.31)-> mockc(version: ap-northeast-1b, ip: 10.1.225.32)%

Step 3: Fault simulation

Simulate a workload failure by manually changing the container image of the mockb service in a specific zone.

  1. Replace the image of mockb-ap-northeast-1a in cluster 2 to simulate a failure.

    1. Log on to the ACK console. In the left navigation pane, click Clusters.

    2. On the Clusters page, find the cluster you want to manage and click its name. In the left navigation pane, choose Workloads > Deployments.

    3. Click Edit in the Actions column of the mockb-ap-northeast-1a workload, change the image to registry.cn-hangzhou.aliyuncs.com/acs/curl:8.1.2, and set the Command in Start to ["sleep","3600"] so that the container can start normally. Click Update.

  2. Continuously access the application domain name to observe the request chain after the failure.

    curl mock.asm-demo.work/mock -v

    Expectation 1: When the request is sent to a zone other than ap-northeast-1a, the request remains within the normal zone.

    * Host mock.asm-demo.work:80 was resolved.
    * IPv6: (none)
    * IPv4: 8.209.XXX.XX, 8.221.XXX.XX
    *   Trying 8.209.247.47:80...
    * Connected to mock.asm-demo.work (8.209.247.47) port 80
    > GET /mock HTTP/1.1
    > Host: mock.asm-demo.work
    > User-Agent: curl/8.7.1
    > Accept: */*
    > 
    * Request completely sent off
    < HTTP/1.1 200 OK
    < date: Wed, 11 Dec 2024 12:10:56 GMT
    < content-length: 153
    < content-type: text/plain; charset=utf-8
    < x-envoy-upstream-service-time: 3
    < server: istio-envoy
    < 
    * Connection #0 to host mock.asm-demo.work left intact
    -> mocka(version: ap-northeast-1b, ip: 10.1.225.40)-> mockb(version: ap-northeast-1b, ip: 10.1.225.31)-> mockc(version: ap-northeast-1b, ip: 10.1.225.32)% 

    Expectation 2: When the request is first sent to the ap-northeast-1a zone, a connection refused error occurs when the request reaches the mockb service.

    * Host mock.asm-demo.work:80 was resolved.
    * IPv6: (none)
    * IPv4: 112.124.XX.XXX, 121.41.XXX.XXX
    *   Trying 112.124.65.120:80...
    * Connected to mock.asm-demo.work (112.124.65.120) port 80
    > GET /mock HTTP/1.1
    > Host: mock.asm-demo.work
    > User-Agent: curl/8.7.1
    > Accept: */*
    > 
    * Request completely sent off
    < HTTP/1.1 200 OK
    < date: Wed, 11 Dec 2024 12:08:45 GMT
    < content-length: 220
    < content-type: text/plain; charset=utf-8
    < x-envoy-upstream-service-time: 48
    < server: istio-envoy
    < 
    * Connection #0 to host mock.asm-demo.work left intact
    -> mocka(version: cn-hangzhou-h, ip: 192.168.122.135)upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused% 

    Expectation 3: When the request is sent again to the ap-northeast-1a zone, the request is then transferred to ap-northeast-1b in a different zone within the same region.

    * Host mock.asm-demo.work:80 was resolved.
    * IPv6: (none)
    * IPv4: 8.209.XXX.XX, 8.221.XXX.XX
    *   Trying 8.209.247.47:80...
    * Connected to mock.asm-demo.work (8.209.247.47) port 80
    > GET /mock HTTP/1.1
    > Host: mock.asm-demo.work
    > User-Agent: curl/8.7.1
    > Accept: */*
    > 
    * Request completely sent off
    < HTTP/1.1 200 OK
    < date: Wed, 11 Dec 2024 12:10:59 GMT
    < content-length: 154
    < content-type: text/plain; charset=utf-8
    < x-envoy-upstream-service-time: 4
    < server: istio-envoy
    < 
    * Connection #0 to host mock.asm-demo.work left intact
    -> mocka(version: ap-northeast-1a, ip: 10.0.239.141)-> mockb(version: ap-northeast-1b, ip: 10.1.225.31)-> mockc(version: ap-northeast-1b, ip: 10.1.225.32)% 

Step 4 (Optional): Configure alerts for service-level failures

  1. Configure the sidecar proxy's proxyStatsMatcher to report relevant metrics, then use Prometheus to collect and analyze circuit breaking-related metrics.

    1. Configure the sidecar proxy to report circuit breaking metrics using the proxyStatsMatcher. When setting up proxyStatsMatcher, select Regular Expression Match and enter .*outlier_detection.*. For more information, see Configure proxyStatsMatcher.

      Key circuit breaking metrics are described below:

      • envoy_cluster_outlier_detection_ejections_active (Gauge): The number of currently evicted hosts.

      • envoy_cluster_outlier_detection_ejections_enforced_total (Counter): The total number of host eviction events.

      • envoy_cluster_outlier_detection_ejections_overflow (Counter): The number of times host eviction was skipped because the maximum eviction percentage would have been exceeded.

      • envoy_cluster_outlier_detection_ejections_detected_consecutive_5xx (Counter): The number of times a host was detected producing consecutive 5xx errors.
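For reference, the console setting above corresponds to the standard Istio proxyStatsMatcher mesh configuration. The fragment below is a hedged sketch of that equivalent (field names follow the upstream Istio schema; verify them against your ASM version):

```yaml
# MeshConfig fragment: include outlier-detection stats in the metrics Envoy reports.
defaultConfig:
  proxyStatsMatcher:
    inclusionRegexps:
    - ".*outlier_detection.*"
```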

    2. Redeploy the stateless workloads so that the proxyStatsMatcher configuration takes effect.

  2. Create alert rules for host-level circuit breaking.

    1. To ensure that your Managed Service for Prometheus instance can collect the exposed circuit breaking metrics, integrate the Alibaba Cloud ASM component into the data plane cluster or upgrade the component to the latest version. For more information, see Manage components. If service mesh monitoring has already been configured with a self-managed Prometheus instance as described in Monitor ASM instances by using a self-managed Prometheus instance, skip this step.

    2. Create alert rules for host-level circuit breaking. For more information, see Create an alert rule for a Prometheus instance.

      Below are examples of key parameters for configuring alert rules. For other parameters, see the previously mentioned topics.

      • Custom PromQL Statements
        Example: (sum (envoy_cluster_outlier_detection_ejections_active) by (cluster_name, namespace)) > 0
        Description: The example queries the envoy_cluster_outlier_detection_ejections_active metric to determine whether any hosts are currently evicted in the cluster, and groups the results by namespace and service name (cluster_name).

      • Alert Message
        Example: Host-level circuit breaking triggered, workloads continuously experiencing errors have been evicted from the service load balancing pool! Namespace: {{$labels.namespace}}, Service where eviction occurred: {{$labels.cluster_name}}. Number of evictions: {{ $value }}
        Description: The example alert message displays the namespace and service name where circuit breaking was triggered, along with the current number of evictions for that service.
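If you run a self-managed Prometheus instance instead of the managed console workflow, the same condition can be expressed as a standard Prometheus alerting rule. The sketch below reuses the PromQL expression from this topic; the group name, for duration, and severity label are illustrative assumptions:

```yaml
# Illustrative Prometheus alerting rule for outlier-detection ejections.
groups:
- name: asm-circuit-breaking            # example group name
  rules:
  - alert: HostLevelCircuitBreaking
    expr: (sum (envoy_cluster_outlier_detection_ejections_active) by (cluster_name, namespace)) > 0
    for: 1m                             # example: require the condition to persist for 1 minute
    labels:
      severity: warning
    annotations:
      summary: "Host-level circuit breaking triggered in {{ $labels.namespace }} for {{ $labels.cluster_name }}; {{ $value }} endpoint(s) evicted."
```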