Service-level failures are common in cloud-native businesses. Although these failures are smaller in scope than zone-level issues, they can still cause business unavailability or degradation when a single point of failure exists. Alibaba Cloud Service Mesh (ASM) supports service discovery and network communication across multiple clusters. With a multi-region, multi-cluster deployment, ASM can switch traffic to available service instances within seconds when a service fails, ensuring global availability for business applications. This topic explains how to use ASM for disaster recovery from service-level failures.
Disaster recovery architecture
Failover mechanism: Service Mesh-based geographic failover
ASM supports a geography-based failover mechanism. By detecting whether a workload continuously returns response errors within a time window, ASM determines the failure status of the workload and automatically redirects traffic to other available workloads when one fails.
The following example involves two clusters located in different regions, where two sets of applications are deployed peer-to-peer. The workloads of each service in each cluster are evenly distributed across zones through topology spread constraints.
Normal traffic topology
When services are normal, the service mesh keeps traffic within the same zone to minimize the impact of the target location on request latency.
Automatic geographic failover during workload anomalies
When ASM detects continuous response errors from a workload, it removes the workload's endpoints from the load-balancing pool and transfers requests to other available workloads. Traffic failover follows the priority order below:
Other workloads in the same zone.
Workloads in different zones within the same region.
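Under the hood, this priority order corresponds to the service mesh's locality-aware load balancing, which ASM enables by default. As an illustrative sketch only (using the open-source Istio API; the resource name is a placeholder), the same behavior can be expressed explicitly in a DestinationRule. Note that locality failover only takes effect when outlier detection is also configured:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mockb-locality    # hypothetical name, for illustration
spec:
  host: mockb
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Prefer endpoints that match the client's labels in this order:
        # same region first, then same zone.
        failoverPriority:
        - "topology.kubernetes.io/region"
        - "topology.kubernetes.io/zone"
    # Outlier detection is required for locality failover to trigger.
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 30s
      baseEjectionTime: 5m
```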
Additionally, by configuring Prometheus instances to collect service mesh-related metrics, you will receive alert notifications when failover occurs.
Cross-region traffic transfer
When there are no available workloads in the zones within the same region, traffic can be transferred to the same service deployed in another region's cluster. This process is automatically completed by the service mesh.
In this case, routing must be established between the two clusters so that a pod in cluster A can connect to one in cluster B. For scenarios involving multiple regions in Alibaba Cloud, use Cloud Enterprise Network (CEN) to connect the physical networks between clusters and achieve cluster intercommunication. This solution requires planning the cluster network (such as pod, service, and VPC CIDR) and using CEN instances to achieve cross-region network connectivity, suitable for customers with high-quality cross-region traffic requirements.
For clusters in data centers or third-party cloud providers, network planning can limit physical connectivity. To address this, ASM allows the establishment of an mTLS secure channel through cross-cluster mesh proxies for essential communication and secure cross-region traffic transfer. However, due to the limited quality of public networks, this solution is best for users with minimal inter-cluster communication needs or those facing practical network conflicts.
Application deployment topology: Support methods and comparison of various cluster topologies
ASM's geographic-based failover mechanism can be seamlessly applied to various application deployment topologies. Deployment topologies differ in their scope of failure dimensions, resource costs, and operation and maintenance thresholds.
Single-cluster multi-zone deployment
This is the simplest deployment method for disaster recovery. By configuring multi-zone node pools, selecting multi-zone virtual switches for node pools, and choosing a balanced distribution strategy when configuring scaling policies, you can evenly distribute Elastic Compute Service (ECS) instances across the specified multi-zone scaling group.
Comparison of advantages and disadvantages of this method:

| Advantages | Disadvantages |
| --- | --- |
| Only one Container Service for Kubernetes (ACK) cluster with a multi-zone node pool is needed. Applications are evenly deployed across zones through topology spread constraints. | Cannot handle failures caused by Kubernetes misconfiguration or application dependencies. |
| Load balancing, databases, Kubernetes clusters, service meshes, and other cloud resources only need to be prepared within the same region. | Cannot handle region-level failures. |
Single-region multi-zone multi-cluster deployment
Expand from a single-cluster multi-zone to multiple clusters, each with its own configuration. During failover, ASM first checks for workloads in the same zone and then randomly selects one from other available workloads in different zones within the same region. We recommend that you choose independent zones for the clusters. If the zones overlap, workloads in the same zone may be prioritized over those in different clusters, which could lead to unexpected business outcomes.
Comparison of advantages and disadvantages of this method:

| Advantages | Disadvantages |
| --- | --- |
| Compared to single-cluster deployment, this topology covers more fault dimensions, further improving the overall availability of the system. | Services must be deployed in two Kubernetes cluster environments, with high deployment and maintenance complexity. Load balancing, Kubernetes clusters, and other cloud resource infrastructure must be deployed multiple times, resulting in high costs. |
| The two clusters only need to reside within the same VPC. When traffic must fail over to services in another cluster, this deployment topology has lower cluster intercommunication costs. | Cannot handle region-level failures. |
Note: In this deployment topology, you must plan the network configuration of multiple clusters to avoid network conflicts. For more information, see Plan CIDR blocks for multiple clusters on the data plane.
Do not use single-region multi-zone multi-cluster deployment as the first choice, because it significantly increases deployment and maintenance complexity without providing the highest level of availability.
Multi-region multi-cluster deployment
Building on the multi-cluster approach, change the single-region deployment to a multi-region deployment. This ensures that each service's workloads are distributed across every dimension: availability zones, regions, clusters, and dependencies. As a result, healthy workloads can continue to serve traffic during potential failures.
Comparison of advantages and disadvantages of this method:

| Advantages | Disadvantages |
| --- | --- |
| Can distribute the workloads of the same service across different regions, zones, and clusters to improve service availability. | Services must be deployed in two Kubernetes cluster environments. To keep configuration push latency low, two service mesh instances must also be prepared. |
| Based on a multi-master control plane architecture, latency issues can be effectively controlled. | Cross-region network intercommunication is relatively complex. |
When you have high availability requirements, use this topology for business application deployment. In scenarios requiring cross-cluster network connectivity, choose from the following two methods to achieve cluster network connectivity:
Use CEN to directly connect the underlying networks of cross-region clusters. First, plan the network configuration of multiple clusters to avoid network conflicts. For specific operations, see Plan CIDR blocks for multiple clusters on the data plane. Next, use CEN to connect cross-region networks. For specific operations, see Manage inter-region connections.
Use ASM cross-cluster mesh proxy to achieve cross-cluster network connectivity and failover, optimizing resource costs and solving complex network intercommunication issues. Using this method, the networks of the two clusters do not need to be directly interconnected, and cluster network planning is not required. For details on ASM cross-cluster mesh proxy, see Use ASM cross-cluster mesh proxy to implement cross-network communication among multiple clusters.
Multi-cloud multi-region multi-cluster deployment
This deployment topology represents the highest level of deployment architecture. ASM offers comprehensive management for Kubernetes clusters across various cloud providers and on-premises data centers. By deploying Kubernetes cluster environments across different clouds or hybrid environments, it guarantees continuous service availability even if one infrastructure experiences a failure.
Similar to multi-region multi-cluster deployments, multi-cloud multi-region multi-cluster deployments also require network connectivity between clusters. However, ASM cannot provide a managed control plane in environments outside of Alibaba Cloud. To address push latency issues in cross-cloud scenarios, use the ASM remote control plane.
Comparison of advantages and disadvantages of this method:

| Advantages | Disadvantages |
| --- | --- |
| Allows distribution of the same service's workloads across various regions, zones, clusters, and cloud providers, enhancing service availability. | Services must be deployed across two Kubernetes cluster environments. Multi-cloud deployments may face numerous compatibility and other unforeseen issues due to infrastructure and product capability differences between providers. |
| The multi-master control plane architecture effectively mitigates latency issues. | Cross-region network intercommunication is quite complex. |
Disaster recovery configuration
This section describes the process for setting up service-level disaster recovery using a multi-region, multi-cluster deployment approach as an example.
Step 1: Environment preparation
Create a Kubernetes cluster in each of the two regions, naming them cluster 1 and cluster 2, both configured across multiple zones and with the Expose API Server with EIP feature enabled. For more information, see Create an ACK managed cluster.
Create ASM instances mesh-1 and mesh-2 in the same regions as the ACK clusters, ensuring that the selected switches during creation are in the same zones. For more information, see Steps 1, 2, and 3 in Multi-cluster disaster recovery through ASM multi-master control plane architecture.
Deploy ASM gateways and sample services.
Create a Network Load Balancer (NLB)-type ingress gateway named `ingressgateway` in both mesh-1 and mesh-2, ensuring you select the same two zones used by the ASM instances and ACK clusters. For more information, see Associate an NLB instance with an ingress gateway.
Enable automatic sidecar injection in the default namespace for both clusters within mesh-1 and mesh-2. For detailed instructions, see Manage global namespaces.
Deploy sample applications in a peer-to-peer manner across both ACK clusters. The four zones used in the example are `cn-hangzhou-h`, `cn-hangzhou-k`, `ap-northeast-1a`, and `ap-northeast-1b`. Adjust the zone names in the deployment YAML as needed.
After executing the commands in both clusters, the services mocka, mockb, and mockc are deployed peer-to-peer. Each service has two stateless deployments with a single replica each, distributed across different zones using NodeSelectors, and each is configured with environment variables that specify its zone.
Note: For demo purposes, this example uses the NodeSelector field to manually pin each pod to a zone. When building a high-availability environment, configure topology spread constraints to distribute pods across zones. For more information, see Workload HA configuration.
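As a minimal sketch of the recommended approach, a deployment can replace the per-zone NodeSelector with a topology spread constraint so that the scheduler spreads replicas across zones automatically. The deployment name, labels, and image below are illustrative placeholders, not the sample application's actual manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mockb             # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mockb
  template:
    metadata:
      labels:
        app: mockb
    spec:
      # Spread the pods evenly across zones instead of pinning them
      # to a specific zone with a NodeSelector.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: mockb
      containers:
      - name: mockb
        image: example-registry/mockb:latest   # placeholder image
```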
Step 2: Enable geographic-based failover
ASM provides geographic-based failover capabilities by default, which require host-level circuit breaking in the destination rules to work effectively.
Deploy host-level circuit breaking rules in both cluster 1 and cluster 2.
```shell
kubectl apply -f- <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mocka
spec:
  host: mocka
  trafficPolicy:
    outlierDetection:
      splitExternalLocalOriginErrors: true
      consecutiveLocalOriginFailures: 1
      baseEjectionTime: 5m
      consecutive5xxErrors: 1
      interval: 30s
      maxEjectionPercent: 100
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mockb
spec:
  host: mockb
  trafficPolicy:
    outlierDetection:
      splitExternalLocalOriginErrors: true
      consecutiveLocalOriginFailures: 1
      baseEjectionTime: 5m
      consecutive5xxErrors: 1
      interval: 30s
      maxEjectionPercent: 100
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mockc
spec:
  host: mockc
  trafficPolicy:
    outlierDetection:
      splitExternalLocalOriginErrors: true
      consecutiveLocalOriginFailures: 1
      baseEjectionTime: 5m
      consecutive5xxErrors: 1
      interval: 30s
      maxEjectionPercent: 100
EOF
```

outlierDetection is the service mesh mechanism for detecting service failures and evicting faulty endpoints. The configuration items are described below:
| Configuration item | Description |
| --- | --- |
| interval | The interval between failure detection sweeps. |
| baseEjectionTime | The base duration for which an endpoint is evicted from the load-balancing pool after it is deemed faulty. |
| maxEjectionPercent | The maximum percentage of endpoints that can be evicted. |
| consecutive5xxErrors | The number of consecutive 5xx errors after which an endpoint is deemed faulty. |
| splitExternalLocalOriginErrors | Whether to distinguish locally originated errors (such as connection failures and timeouts) from externally originated 5xx errors. |
| consecutiveLocalOriginFailures | The number of consecutive locally originated failures (such as connection failures and timeouts) after which an endpoint is deemed faulty. |
Verify the service status.
Send a request to the service's domain name.
```shell
curl mock.asm-demo.work/mock -v
```

Expected output: the service call chain remains within workloads in the same zone.

```
* Host mock.asm-demo.work:80 was resolved.
* IPv6: (none)
* IPv4: 8.xxx.xxx.47, 8.xxx.xxx.42
* Trying 8.xxx.xxx.47:80...
* Connected to mock.asm-demo.work (8.209.XXX.XX) port 80
> GET /mock HTTP/1.1
> Host: mock.asm-demo.work
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< date: Wed, 11 Dec 2024 12:10:56 GMT
< content-length: 153
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 3
< server: istio-envoy
<
* Connection #0 to host mock.asm-demo.work left intact
-> mocka(version: ap-northeast-1b, ip: 10.1.225.40)
-> mockb(version: ap-northeast-1b, ip: 10.1.225.31)
-> mockc(version: ap-northeast-1b, ip: 10.1.225.32)
```
Step 3: Fault simulation
Simulate a workload failure by manually changing the container image of the mockb service in a specific zone.
Replace the image of mockb-ap-northeast-1a in cluster 2 to simulate a failure.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster you want to manage and click its name. In the left navigation pane, choose .
Click Edit in the Actions column of the mockb workload, change the image to `registry.cn-hangzhou.aliyuncs.com/acs/curl:8.1.2`, and add `["sleep","3600"]` to the command in Start so that the container can start normally. Click Update.
Continuously access the application domain name to observe the request chain after the failure.
```shell
curl mock.asm-demo.work/mock -v
```

Expectation 1: When the request is sent to a zone other than `ap-northeast-1a`, the request remains within the normal zone.

```
* Host mock.asm-demo.work:80 was resolved.
* IPv6: (none)
* IPv4: 8.209.XXX.XX, 8.221.XXX.XX
* Trying 8.209.247.47:80...
* Connected to mock.asm-demo.work (8.209.247.47) port 80
> GET /mock HTTP/1.1
> Host: mock.asm-demo.work
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< date: Wed, 11 Dec 2024 12:10:56 GMT
< content-length: 153
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 3
< server: istio-envoy
<
* Connection #0 to host mock.asm-demo.work left intact
-> mocka(version: ap-northeast-1b, ip: 10.1.225.40)
-> mockb(version: ap-northeast-1b, ip: 10.1.225.31)
-> mockc(version: ap-northeast-1b, ip: 10.1.225.32)
```

Expectation 2: When the request is first sent to the `ap-northeast-1a` zone, a connection refused error occurs when the request reaches the mockb service.

```
* Host mock.asm-demo.work:80 was resolved.
* IPv6: (none)
* IPv4: 112.124.XX.XXX, 121.41.XXX.XXX
* Trying 112.124.65.120:80...
* Connected to mock.asm-demo.work (112.124.65.120) port 80
> GET /mock HTTP/1.1
> Host: mock.asm-demo.work
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< date: Wed, 11 Dec 2024 12:08:45 GMT
< content-length: 220
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 48
< server: istio-envoy
<
* Connection #0 to host mock.asm-demo.work left intact
-> mocka(version: cn-hangzhou-h, ip: 192.168.122.135)
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
```

Expectation 3: When the request is sent again to the `ap-northeast-1a` zone, the request is transferred to `ap-northeast-1b`, a different zone within the same region.

```
* Host mock.asm-demo.work:80 was resolved.
* IPv6: (none)
* IPv4: 8.209.XXX.XX, 8.221.XXX.XX
* Trying 8.209.247.47:80...
* Connected to mock.asm-demo.work (8.209.247.47) port 80
> GET /mock HTTP/1.1
> Host: mock.asm-demo.work
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< date: Wed, 11 Dec 2024 12:10:59 GMT
< content-length: 154
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 4
< server: istio-envoy
<
* Connection #0 to host mock.asm-demo.work left intact
-> mocka(version: ap-northeast-1a, ip: 10.0.239.141)
-> mockb(version: ap-northeast-1b, ip: 10.1.225.31)
-> mockc(version: ap-northeast-1b, ip: 10.1.225.32)
```
Step 4 (Optional): Configure alerts for service-level failures
Configure the sidecar proxy's proxyStatsMatcher to report relevant metrics, then use Prometheus to collect and analyze circuit breaking-related metrics.
Configure the sidecar proxy to report circuit breaking metrics using the proxyStatsMatcher. When setting up proxyStatsMatcher, select Regular Expression Match and enter `.*outlier_detection.*`. For more information, see Configure proxyStatsMatcher.
Key circuit breaking metrics are described below:
| Metric | Metric type | Description |
| --- | --- | --- |
| envoy_cluster_outlier_detection_ejections_active | Gauge | The number of currently evicted hosts. |
| envoy_cluster_outlier_detection_ejections_enforced_total | Counter | The total number of enforced host evictions. |
| envoy_cluster_outlier_detection_ejections_overflow | Counter | The number of times a host eviction was skipped because it would exceed the maximum eviction percentage. |
| ejections_detected_consecutive_5xx | Counter | The number of times a host was detected producing consecutive 5xx errors. |
Create alert rules for host-level circuit breaking.
To ensure that the Prometheus instance used for observability monitoring can collect the exposed circuit breaking metrics, install the Alibaba Cloud ASM component in the data plane cluster or upgrade it to the latest version. For more information, see Manage components. If service mesh monitoring has been configured with a self-built Prometheus as described in Monitor ASM instances by using a self-managed Prometheus instance, skip this step.
Create alert rules for host-level circuit breaking. For more information, see Create an alert rule for a Prometheus instance.
Below are examples of how to fill in key parameters for configuring alert rules. Refer to the previously mentioned topics for other parameters as needed.
| Parameter | Example | Description |
| --- | --- | --- |
| Custom PromQL Statements | (sum (envoy_cluster_outlier_detection_ejections_active) by (cluster_name, namespace)) > 0 | The example queries the envoy_cluster_outlier_detection_ejections_active metric to determine whether any hosts are currently evicted in the cluster, grouping the results by namespace and service name. |
| Alert Message | Host-level circuit breaking triggered, workloads continuously experiencing errors have been evicted from the service load balancing pool! Namespace: {{$labels.namespace}}, Service where eviction occurred: {{$labels.cluster_name}}. Number of evictions: {{ $value }} | The example alert message displays the namespace and service name for which circuit breaking was triggered, along with the current number of evictions for that service. |
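If you manage alert rules as code with the Prometheus Operator rather than through the console, the same rule can be sketched as a `PrometheusRule` resource. The resource name, group name, and `for` duration below are illustrative assumptions; the expression and message mirror the parameters above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: asm-circuit-breaking-alerts   # illustrative name
spec:
  groups:
  - name: asm-circuit-breaking        # illustrative group name
    rules:
    - alert: HostLevelCircuitBreakingTriggered
      # Fires when any hosts are currently evicted, grouped by service and namespace.
      expr: (sum (envoy_cluster_outlier_detection_ejections_active) by (cluster_name, namespace)) > 0
      for: 1m                         # illustrative evaluation delay
      annotations:
        summary: >-
          Host-level circuit breaking triggered, workloads continuously
          experiencing errors have been evicted from the service load
          balancing pool! Namespace: {{$labels.namespace}}, Service where
          eviction occurred: {{$labels.cluster_name}}.
          Number of evictions: {{ $value }}
```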