This topic describes the high-availability architecture of SLB. You can use SLB in concert with DNS to implement geo-disaster recovery. SLB is designed to offer a multi-zone service availability of 99.99% and a single-zone service availability of 99.90%.
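
To put the two availability targets in perspective, the following minimal sketch converts an availability percentage into a downtime budget. The 30-day measurement window is an assumption made for illustration; the source does not state the period over which availability is measured.

```python
def downtime_minutes(availability_percent: float, period_days: float = 30) -> float:
    """Return the maximum downtime, in minutes, that an availability target allows.

    period_days is an assumed measurement window, not a value from the SLB documentation.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for label, target in (("multi-zone", 99.99), ("single-zone", 99.90)):
    print(f"{label}: {downtime_minutes(target):.2f} min per 30 days")
```

Under this assumption, the multi-zone target allows roughly a tenth of the downtime that the single-zone target allows.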

High availability of the SLB architecture

SLB instances are deployed in clusters. The clusters synchronize sessions and protect backend servers from single points of failure (SPOFs), which improves redundancy and service stability. Layer-4 SLB uses the open-source Linux Virtual Server (LVS) and Keepalived software to balance loads, whereas Layer-7 SLB uses Tengine. Tengine is a web server project launched by Taobao. It is based on NGINX and adds advanced features designed for high-traffic websites.

Requests from the Internet reach an LVS cluster over Equal-Cost Multi-Path (ECMP) routes. In the LVS cluster, each machine uses multicast packets to synchronize sessions with the other machines. At the same time, the LVS cluster performs health checks on the Tengine cluster and removes unhealthy machines from it to ensure the availability of Layer-7 SLB.
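
The way ECMP spreads flows over the LVS machines can be sketched as a hash over the flow 5-tuple. This is an illustrative model only: real routers use their own hardware hash functions, SHA-256 is a stand-in, and the node names and addresses below are hypothetical.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple to one next hop so that every packet of the flow takes the same path."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it to an index.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


lvs_nodes = ["lvs-1", "lvs-2", "lvs-3", "lvs-4"]  # hypothetical LVS machines
flow = ("203.0.113.7", 51234, "198.51.100.10", 443, "tcp")  # src IP, src port, dst IP, dst port, protocol

# The same flow always lands on the same machine while the set of next hops is unchanged.
assert pick_next_hop(flow, lvs_nodes) == pick_next_hop(flow, lvs_nodes)
```

When a machine leaves the set of next hops, a modulo-based mapping like this one can move existing flows to a different machine. Session synchronization is what lets the new machine continue serving those established connections.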

Best practice:

Session synchronization prevents persistent connections from being affected by server failures within a cluster. However, short-lived connections, and connections that have not yet triggered the session synchronization rule (for example, because the three-way handshake is not complete), can still be affected by server failures in the cluster. To reduce the impact of such interruptions on user access, add a retry mechanism to your service logic.
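
A retry mechanism of this kind could look like the following minimal sketch. The function name, the default attempt count, and the backoff values are assumptions chosen for illustration, not values recommended by SLB.

```python
import random
import time


def call_with_retry(operation, attempts=3, base_delay=0.2,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() and retry on connection-level errors.

    Waits with exponential backoff plus random jitter between attempts,
    and re-raises the last error once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

Retry only requests that are safe to repeat (idempotent operations), because a request may have reached a backend server before the connection was interrupted.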

The high-availability solution with one SLB instance

To provide more stable and reliable load balancing services, you can deploy SLB instances across multiple zones in most regions to achieve cross-data-center disaster recovery. Specifically, you can deploy an SLB instance in two zones within the same region, where one zone acts as the primary zone and the other acts as the secondary zone. If the primary zone suffers an outage, a failover is triggered and requests are redirected to the servers in the secondary zone within approximately 30 seconds. After the primary zone is restored, traffic is automatically switched back to the servers in the primary zone.

Note: Zone-disaster recovery is implemented between the primary and secondary zones. SLB fails over only when the entire SLB cluster in the primary zone is unavailable, for example, because of a power outage or an optical cable failure. The failure of a single backend server does not trigger a failover.

Best practice:
  1. We recommend that you create SLB instances in regions that support primary/secondary deployment for zone-disaster recovery.
  2. You can choose the primary zone for your SLB instance based on the distribution of ECS instances. That is, select the zone where most of the ECS instances are located as the primary zone to minimize latency.

    However, we recommend that you do not deploy all ECS instances in the primary zone. When you develop a failover solution, you must deploy several ECS instances in the secondary zone so that requests can still be distributed to backend servers in the secondary zone when the primary zone experiences downtime.
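
The zone selection described above can be sketched as a small helper. This is an illustration under stated assumptions: `plan_zones` and the zone IDs are hypothetical, and the sketch does not model which primary/secondary zone pairs a given region actually supports.

```python
from collections import Counter


def plan_zones(ecs_zone_ids):
    """Pick the zone with the most ECS instances as primary and the runner-up as secondary.

    Raises ValueError if all instances sit in one zone, because the secondary
    zone would then have no backend servers to process requests after a failover.
    """
    counts = Counter(ecs_zone_ids)
    if len(counts) < 2:
        raise ValueError("deploy ECS instances in at least two zones")
    (primary, _), (secondary, _) = counts.most_common(2)
    return primary, secondary


# Hypothetical inventory: one zone ID per ECS instance.
print(plan_zones(["zone-a", "zone-a", "zone-b", "zone-a", "zone-b"]))
```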

The high-availability solution with multiple SLB instances

With a single SLB instance, network attacks or invalid SLB configurations can still disrupt traffic distribution for your applications, because such problems do not trigger a failover between the primary zone and the secondary zone. Load-balancing performance suffers as a result. To avoid this situation, you can create multiple SLB instances to form a global load-balancing solution that provides cross-region backup and disaster recovery. You can also use DNS to schedule requests across the instances to ensure service continuity.

Best practice:

You can deploy SLB instances and ECS instances in multiple zones within the same region or across different regions, and then use DNS to schedule requests.
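
DNS-based scheduling across several SLB instances can be modeled as a weighted choice among the instances that are currently reachable. The sketch below is a simplified model, not the behavior of any particular DNS service: the record layout, the weights, the `healthy` map, and the IP addresses are all hypothetical.

```python
import random


def resolve(records, healthy, rng=random):
    """Return the IP address of one SLB instance.

    Instances marked unhealthy are skipped, and the remaining ones are
    chosen at random in proportion to their weights.
    """
    candidates = [r for r in records if healthy.get(r["ip"], False)]
    if not candidates:
        raise LookupError("no healthy SLB instance available")
    weights = [r["weight"] for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]["ip"]


records = [
    {"ip": "203.0.113.10", "weight": 3},   # hypothetical SLB instance in region A
    {"ip": "198.51.100.20", "weight": 1},  # hypothetical SLB instance in region B
]

# If the region A instance is down, every lookup is answered with region B.
print(resolve(records, {"203.0.113.10": False, "198.51.100.20": True}))
```

In practice, clients and resolvers cache DNS answers for the record's TTL, so a switch between instances takes effect only as cached answers expire.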

The high-availability solution with backend ECS instances

With health checks enabled, SLB verifies the availability of backend ECS instances (backend servers). This improves the availability of frontend services by minimizing the downtime caused by unhealthy ECS instances.

After you enable the health check feature, SLB distributes new requests to other healthy ECS instances when an ECS instance is detected as unhealthy. SLB resumes sending requests to that ECS instance only after it recovers and is considered healthy again. For more information, see Health check overview.
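
The healthy/unhealthy transition can be pictured as a small state machine driven by probe results. The sketch below assumes that a backend changes state only after a number of consecutive contradicting probes; the class name and the threshold defaults are assumptions for illustration, so refer to the health check documentation for the parameters that SLB actually exposes.

```python
class BackendHealth:
    """Track one backend server's health from a stream of probe results."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=3):
        # Thresholds are assumed values, not SLB defaults.
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return whether the backend should receive requests."""
        if probe_ok == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= threshold:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


backend = BackendHealth(healthy_threshold=2, unhealthy_threshold=2)
for result in (False, False, True, True):
    print(result, "->", backend.record(result))
```

Requiring several consecutive results before changing state keeps a single lost probe from removing a working server, at the cost of a slightly slower reaction to real failures.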

Best practice:

Make sure that health checks are enabled and properly configured. For more information, see Configure health check.