
Service Mesh Disaster Recovery Scenarios (1): Use Service Mesh to Deal with Region-level Disaster Recovery


By Hang Yin

This article introduces how to achieve automatic detection of region-level faults and traffic redirection based on ASM. By deploying Kubernetes clusters and ASM gateways across multiple regions and integrating Alibaba Cloud DNS and Global Traffic Manager (GTM), it ensures business continuity and high availability.

A region-level fault is an extreme type of fault that cloud services may encounter. When a region-level fault occurs, services in any zone in a specific region are exposed to risks such as connection failure, data loss, and workload failure. Possible causes of region-level faults include but are not limited to:

Natural disasters: Earthquakes, floods, and other disasters may cause structural damage in data centers and power outages.

Telecom network issues: Major ISP faults affect the accessibility of cloud services, or fiber cuts on key communication links interrupt the network.

Human errors: An incorrect operation command issued by O&M personnel causes workloads in a region to fail.

Security incidents: Large-scale distributed denial-of-service (DDoS) attacks, for example.

Alibaba Cloud Service Mesh (ASM) allows you to deploy the ASM ingress gateway in Kubernetes clusters or on Elastic Container Instances (ECI) to serve as a unified traffic ingress for business applications. Each cluster has an independent traffic ingress IP address, so the ASM gateway can work in combination with Alibaba Cloud DNS and Global Traffic Manager (GTM). Under normal conditions, the ASM ingress gateways in the two regions each receive a share of the traffic according to a specified ratio. When one of the regions fails, the faulty IP address is removed and all traffic is shifted to the healthy region to achieve region-level disaster recovery.

Overview of Disaster Recovery Architecture

(Figure 1)

As shown in the preceding figure, to verify the response to region-level faults, we need to prepare a Kubernetes cluster in each region in a multi-region environment (the following example uses dual-region and dual-cluster) and deploy cloud-native services in each cluster in a fully peer-to-peer manner. Services call each other through Kubernetes cluster domain names, and the names of peer services in each cluster remain identical. In addition, an ASM gateway needs to be deployed in the cluster in each region. We must configure the ASM gateway to expose a public IP address through an Internet-facing CLB instance and use Alibaba Cloud DNS and GTM to resolve the domain name to the two IP addresses.

(Figure 2)

When a region-level fault occurs, the services in the other healthy region are not affected and can still work normally. In addition, ASM gateways distribute the workload in clusters in two regions. Therefore, ASM gateways deployed in the healthy region can still receive traffic and serve as traffic ingress. In this case, GTM can remove the IP address of the faulty region from the resolution pool and forward all traffic to the ASM gateway in the healthy region.

You can enable the following service mesh features to properly handle traffic redirection in case of region faults:

  1. Enable HPA for the ASM gateway: when traffic bursts, HPA quickly scales out additional ASM gateway instances to handle the burst traffic.
  2. Configure throttling for key services on the ASM gateway or in the cluster: when traffic bursts, the throttling feature of ASM rejects requests that exceed the capacity of the services in the cluster, preventing application services in the healthy region from being overwhelmed by the traffic that fails over. In addition, you can configure metric collection and alerts for ASM throttling, so that you can detect fault events and scale out service workloads in the healthy region in time.

Practice of Configuring Disaster Recovery

Step 1: Deploy a Kubernetes cluster in each of the two regions

To implement region-level disaster recovery, you must deploy two Kubernetes clusters across regions.

Create two Kubernetes clusters named cluster-1 and cluster-2. Enable Expose API Server with EIP. When you create the clusters, specify two different regions for cluster-1 and cluster-2.

For more information, see Create an ACK managed cluster.

Step 2: Build a service mesh with a multi-master control plane architecture based on the two clusters

After the Kubernetes clusters are created, you must create ASM instances and add the Kubernetes clusters to them for management.

The multi-master control plane architecture is an architecture mode that uses service mesh to manage multiple Kubernetes clusters. In this architecture, multiple service mesh instances manage the data plane components of their respective Kubernetes clusters and distribute configurations for the mesh proxies in the clusters. At the same time, these instances rely on a shared trusted root certificate for service discovery and communication across clusters.

(Figure 3)

In a cross-region cluster deployment environment, although both clusters can be managed by a single service mesh instance, building a multi-master service mesh remains the best practice for cross-region disaster recovery for the following reasons:

1. Lower configuration push latency: When deployed across regions, each mesh proxy connects to the service mesh instance closest to its Kubernetes cluster, which provides better configuration push performance.

2. Better stability: If a single service mesh control plane is connected to all clusters, a region failure may leave mesh proxies unable to synchronize configuration or even to start up because they cannot reach the control plane. In a multi-master control plane architecture, mesh proxies in the healthy region can still connect to their own control plane, so configuration distribution and proxy startup are not affected. This ensures that the ASM gateway and services in the healthy region can still be scaled out even if the control plane in the faulty region is unreachable.

For more information about how to build a multi-master service mesh, see Multi-cluster disaster recovery through ASM multi-master control plane architecture. Complete Step 1 and Step 2.

After the architecture is built, two service mesh instances named mesh-1 and mesh-2 exist; the mesh proxies in cluster-1 connect to the control plane of the mesh-1 instance, and the mesh proxies in cluster-2 connect to the control plane of the mesh-2 instance.

Step 3: Deploy the ASM gateway and sample service in the two clusters

1) Create an ASM ingress gateway named ingressgateway in the mesh-1 and mesh-2 instances. For more information, see Create an ingress gateway. When you create a gateway, you can enable HPA to automatically scale the ASM gateway based on CPU and memory metrics.
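
ASM can enable HPA for the gateway directly when you create it; for illustration only, the following is a minimal sketch of what such an autoscaling policy looks like as a standard Kubernetes HorizontalPodAutoscaler targeting the gateway deployment. The deployment name istio-ingressgateway and the thresholds are assumptions; adjust them to the gateway you created and to your capacity planning.

# Sketch only: autoscale the ingress gateway between 2 and 10 replicas at 80% average CPU.
# The deployment name "istio-ingressgateway" is an assumption; use the name of the gateway you created.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingressgateway-hpa
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80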

2) Deploy the Bookinfo sample application in cluster-1 and cluster-2 respectively. For more information, see Deploy an application in an ACK cluster that is added to an ASM instance.

The sample application, Bookinfo, is a book review application. Its service architecture is as follows:

(Figure 4: Bookinfo service architecture)

3) Create gateway rules and virtual services in the mesh-1 and mesh-2 instances respectively, and use the ASM gateway as the traffic ingress of the Bookinfo application. For more information, see Use Istio resources to route traffic to different versions of a service.
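
The following is a minimal sketch of the Istio gateway rule and virtual service that route gateway traffic to the Bookinfo productpage service. It assumes that Bookinfo is deployed in the default namespace and that the gateway pods carry the default istio: ingressgateway label; the resource names and the route name gw-to-productpage are examples, chosen to match the throttling rule shown in Step 6.

# Sketch only; adjust namespaces, hosts, and names to your environment.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: bookinfo-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway      # assumed gateway pod label
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - '*'
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: bookinfo
  namespace: default
spec:
  hosts:
    - '*'
  gateways:
    - bookinfo-gateway
  http:
    - name: gw-to-productpage  # route name referenced by the local throttling rule in Step 6
      match:
        - uri:
            exact: /productpage
      route:
        - destination:
            host: productpage
            port:
              number: 9080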

Step 4: Configure services that keep traffic in cluster for the ASM instance

When two or more Kubernetes clusters are added to the same service mesh instance and no additional configuration is provided, the default load balancing mechanism of the service mesh may cause a service to call the peer service deployed in the other cluster.

(Figure 5)

In the scenario of region-level disaster recovery, we expect that when everything is normal, the traffic is always maintained in a single cluster and no cross-cluster calls are made. When a fault occurs in a region, it usually means that all workloads in the clusters of the region are unavailable. In this case, cross-cluster calls are also not required. Therefore, in the scenario of region-level disaster recovery, we always want the traffic to remain in a single cluster.

By configuring services that keep traffic in cluster, you can keep requests to services within the current cluster, meeting the preceding requirements. For more information, see Configure services that keep traffic in cluster in multi-cluster scenarios.
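
In open-source Istio, the mechanism behind this behavior is the clusterLocal service setting in the mesh configuration. The snippet below is a minimal sketch of that setting for the Bookinfo services, shown only to illustrate the effect; in ASM you configure this through the feature described in the documentation linked above rather than by editing the mesh configuration by hand, and the host names assume that Bookinfo runs in the default namespace.

# Illustration of the equivalent open-source Istio MeshConfig setting:
# requests to the listed hosts resolve only to endpoints in the local cluster.
meshConfig:
  serviceSettings:
    - settings:
        clusterLocal: true
      hosts:
        - "productpage.default.svc.cluster.local"
        - "reviews.default.svc.cluster.local"
        - "ratings.default.svc.cluster.local"
        - "details.default.svc.cluster.local"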

(Figure 6)

Note:

Although services that keep traffic in cluster are simple to configure and help cope with region-level disaster recovery scenarios, when faults occur at a finer granularity (such as service-level or node-level faults), we expect traffic to fail over to healthy peer services in other regions or zones. In this case, services that keep traffic in cluster cannot meet such disaster recovery demands.

By configuring geographic location-based failover, disaster recovery can be implemented at a finer fault granularity, such as services and nodes: when everything is normal, traffic is always kept in a single cluster/region, and when a service on the call chain in the cluster fails, traffic fails over to peer services in other clusters through cross-cluster calls. Please stay tuned for subsequent articles for details.

Step 5: Configure GTM to implement cross-region disaster recovery based on the IP addresses of the two ASM ingress gateways

ASM gateways can be used in combination with Alibaba Cloud DNS and GTM. Under normal conditions, the ASM ingress gateways in the two regions each receive a share of the traffic according to a specified ratio. When a fault occurs in one region, the faulty IP address is removed and all traffic is redirected to the healthy region to implement region-level disaster recovery.

After the ASM gateways and the peer applications are deployed in the clusters in the two regions, each ASM ingress gateway has its own public IP address, which serves as a traffic ingress IP address of the application. You can log on to the ASM console to obtain the public IP addresses of the two ASM gateways. For more information, see Obtain the IP address of the ingress gateway.

Use the IP addresses of the two ASM gateways as the traffic ingress IP addresses of the application. Both IP addresses are connected to the same application domain name through Alibaba Cloud DNS. Configure multi-active load balancing and disaster recovery through GTM. For more information, see Use GTM to implement multi-active load balancing and disaster recovery.

(Figure 7)

[Optional] Step 6: Configure local throttling and throttling metrics and alerts on the ingress gateway

In the mesh-1 and mesh-2 instances, create local throttling rules on the ASM gateway based on the following YAML code. The rule allows at most 100 requests per second on the route from the gateway to the productpage service. For more information, see Configure local throttling on an ingress gateway.

apiVersion: istio.alibabacloud.com/v1beta1
kind: ASMLocalRateLimiter
metadata:
  name: ingressgateway
  namespace: istio-system
spec:
  configs:
    - limit:
        fill_interval:
          seconds: 1          # refill the token bucket every second
        quota: 100            # allow at most 100 requests per interval (100 requests per second)
      match:
        vhost:
          name: '*'
          port: 80
          route:
            name_match: gw-to-productpage   # route name defined in the virtual service
  isGateway: true
  workloadSelector:
    labels:
      istio: ingressgateway

At the same time, configure metric collection and alerts for local throttling for the two service mesh instances. For more information, see Configure metric collection and alerts for local throttling.
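
If your throttling metrics are collected by a Prometheus instance that supports the PrometheusRule CRD, an alert rule along the following lines can notify you as soon as the gateway starts rejecting requests, which is usually the first visible sign that traffic has failed over. The metric name in the expression is a placeholder assumption; replace it with the rate-limit counter that your metric collection configuration actually exposes.

# Sketch of a Prometheus alert on the local rate-limit counter.
# "envoy_http_local_rate_limit_rate_limited" is a placeholder metric name;
# replace it with the counter exposed by your metric collection configuration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: asm-local-ratelimit-alert
  namespace: istio-system
spec:
  groups:
    - name: asm-local-ratelimit
      rules:
        - alert: IngressGatewayThrottling
          expr: sum(rate(envoy_http_local_rate_limit_rate_limited[1m])) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: The ASM ingress gateway is throttling requests; check for a traffic failover and scale out services in the healthy region if needed.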

Region-level fault drill

1.  Run the following command to send requests to the sample application by using the stress testing tool fortio. Replace {domain name} with the domain name configured in GTM. Before you run the command, install fortio by referring to the installation instructions in the fortio project.

fortio load -jitter=false -c 1 -qps 100 -t 300s -keepalive=false -a http://{domain name}/productpage

2.  The stress test will last for 5 minutes, during which we will simulate the region-level fault by deleting the ingress gateway workload:

a) Log on to the Container Service for Kubernetes console, find cluster-2, and click Details.

b) Click the Deployments tab and select the istio-system namespace.

c) On the page that appears, find istio-ingressgateway, click More in the Actions column, and then click Delete.

(Figure 8)

3.  Wait for the stress test to end.

Expected output:

# range, mid point, percentile, count
>= -261.054 <= -0.0693516 , -130.561 , 100.00, 3899
# target 50% -130.595
WARNING 100.00% of sleep were falling behind
Aggregated Function Time : count 3899 avg 0.076910055 +/- 0.02867 min 0.062074583 max 1.079674 sum 299.872304
# range, mid point, percentile, count
>= 0.0620746 <= 0.07 , 0.0660373 , 19.34, 754
> 0.07 <= 0.08 , 0.075 , 71.94, 2051
> 0.08 <= 0.09 , 0.085 , 96.08, 941
> 0.09 <= 0.1 , 0.095 , 99.23, 123
> 0.1 <= 0.12 , 0.11 , 99.62, 15
> 0.12 <= 0.14 , 0.13 , 99.82, 8
> 0.14 <= 0.16 , 0.15 , 99.92, 4
> 1 <= 1.07967 , 1.03984 , 100.00, 3
# target 50% 0.0758289
# target 75% 0.0812673
# target 90% 0.0874825
# target 99% 0.0992691
# target 99.9% 0.155505
Error cases : count 527 avg 0.074144883 +/- 0.07572 min 0.062074583 max 1.079674 sum 39.0743532
# range, mid point, percentile, count
>= 0.0620746 <= 0.07 , 0.0660373 , 82.54, 435
> 0.07 <= 0.08 , 0.075 , 96.58, 74
> 0.08 <= 0.09 , 0.085 , 99.05, 13
> 0.09 <= 0.1 , 0.095 , 99.24, 1
> 0.12 <= 0.14 , 0.13 , 99.43, 1
> 1 <= 1.07967 , 1.03984 , 100.00, 3
# target 50% 0.0668682
# target 75% 0.0692741
# target 90% 0.0753108
# target 99% 0.0897923
# target 99.9% 1.06568
# Socket and IP used for each connection:
[0] 3900 socket used, resolved to [121.41.113.220:80 (3373), 8.209.197.28:80 (527)], connection timing : count 3900 avg 0.038202153 +/- 0.03097 min 0.027057 max 1.07747175 sum 148.988395
Connection time histogram (s) : count 3900 avg 0.038202153 +/- 0.03097 min 0.027057 max 1.07747175 sum 148.988395
# range, mid point, percentile, count
>= 0.027057 <= 0.03 , 0.0285285 , 13.28, 518
> 0.03 <= 0.035 , 0.0325 , 62.79, 1931
> 0.035 <= 0.04 , 0.0375 , 83.95, 825
> 0.04 <= 0.045 , 0.0425 , 86.13, 85
> 0.045 <= 0.05 , 0.0475 , 86.18, 2
> 0.05 <= 0.06 , 0.055 , 86.28, 4
> 0.06 <= 0.07 , 0.065 , 98.03, 458
> 0.07 <= 0.08 , 0.075 , 99.77, 68
> 0.08 <= 0.09 , 0.085 , 99.92, 6
> 1 <= 1.07747 , 1.03874 , 100.00, 3
# target 50% 0.0337079
# target 75% 0.0378848
# target 90% 0.0631659
# target 99% 0.0755882
# target 99.9% 0.0885
Sockets used: 3900 (for perfect keepalive, would be 1)
Uniform: false, Jitter: false, Catchup allowed: true
IP addresses distribution:
121.41.113.220:80: 3373
8.209.197.28:80: 527
Code  -1 : 527 (13.5 %)
Code 200 : 3372 (86.5 %)
Response Header Sizes : count 3899 avg 178.19851 +/- 70.45 min 0 max 207 sum 694796
Response Body/Total Sizes : count 3899 avg 4477.7081 +/- 1822 min 0 max 5501 sum 17458584
All done 3899 calls (plus 1 warmup) 76.910 ms avg, 13.0 qps

You can see that a small number of requests (the 527 Code -1 responses, about 13.5%) failed to connect during the simulated region-level fault, while most requests succeeded, showing that ASM and GTM effectively completed disaster recovery for the region-level fault.

In the GTM console, you can see that the IP address of the faulty side has been automatically removed.

(Figure 9)

In this scenario, the health check capability of GTM automatically removes the unavailable IP address for disaster recovery. You can also configure alerts so that you are notified when an IP address becomes unavailable and can remove it manually. For more information, see Alert configuration.
