Server Load Balancer (SLB) checks the service availability of backend servers (ECS instances) by performing health checks. Health checks improve the overall availability of your frontend service and prevent failures of backend ECS instances from affecting it.

After you enable the health check function, SLB stops distributing requests to an instance that is found to be unhealthy and resumes forwarding requests to it only after it is declared healthy again.

If your service is highly sensitive to traffic load, frequent health checks may affect it. To reduce the impact, you can lower the health check frequency by increasing the health check interval, or change a layer-7 health check to a layer-4 one, depending on the service conditions. To guarantee service availability, we do not recommend disabling the health check function.

Health check process

SLB is deployed in clusters. Data forwarding and health checks are handled at the same time by node servers in the LVS cluster and Tengine cluster.

The node servers in the cluster perform health checks independently and in parallel, according to the health check configuration. If an LVS node server detects that a backend ECS instance has failed, the LVS node server no longer sends new client requests to this ECS instance. This operation is synchronized among all node servers.

The IP address range used for health checks is 100.64.0.0/10. Make sure that backend ECS instances do not block this CIDR block. You do not need to configure a security group rule to allow access from this CIDR block. However, if you have configured security rules such as iptables, you must allow access from this CIDR block. (100.64.0.0/10 is reserved by Alibaba Cloud. Other users cannot use any IP address in this CIDR block and therefore no security risks exist.)
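
When you write host-level filtering or log-processing rules, it can help to test whether a source address belongs to the health check range. The following is a minimal sketch that uses only the Python standard library. The function name is_slb_health_check_source is an illustrative assumption and is not part of any SLB API.

```python
import ipaddress

# CIDR block that SLB health checks originate from, as stated above.
SLB_HEALTH_CHECK_NET = ipaddress.ip_network("100.64.0.0/10")


def is_slb_health_check_source(addr: str) -> bool:
    """Return True if addr lies inside 100.64.0.0/10 (hypothetical helper)."""
    return ipaddress.ip_address(addr) in SLB_HEALTH_CHECK_NET


if __name__ == "__main__":
    for candidate in ("100.64.0.1", "100.127.255.255", "100.128.0.0", "10.0.0.1"):
        print(candidate, is_slb_health_check_source(candidate))
```

The /10 prefix covers 100.64.0.0 through 100.127.255.255, so the first two addresses in the usage example fall inside the block and the last two do not.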

Health checks of HTTP/HTTPS listeners

For layer-7 (HTTP or HTTPS) listeners, SLB checks the status of backend servers by sending HTTP HEAD requests, as shown in the following figure.

For HTTPS listeners, certificates are managed in SLB. HTTPS is not used for data exchange (including health check data and service interaction data) between SLB and backend ECS instances so that the system performance is improved.

The health check process of a layer-7 listener is as follows:

  1. A Tengine node server sends an HTTP HEAD request to the intranet IP address, health check port, and health check path of a backend server according to the health check settings.
  2. After receiving the request, the backend server returns an HTTP status code based on the running status.
  3. If the Tengine node server does not receive the response from the backend server within the specified response timeout period, the backend server is declared as unhealthy.
  4. If the Tengine node server receives a response from the backend ECS instance within the specified response timeout period, the node server compares the returned status code with the configured status code. If they match, the backend server is declared as healthy. Otherwise, the backend server is declared as unhealthy.
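
The steps above can be approximated from any machine that can reach the backend server, which is useful for verifying that a health check path behaves as expected. The following is a minimal sketch, not the Tengine implementation. The function name http_head_check, its parameters, and the default values are assumptions made for illustration.

```python
import http.client


def http_head_check(host, port, path="/", timeout=5.0, healthy_statuses=(200,)):
    """Send one HTTP HEAD request and report whether the backend looks healthy.

    Returns True only if a response arrives within `timeout` seconds and its
    status code is one of `healthy_statuses` (hypothetical helper).
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Timeout, refused connection, or malformed response: unhealthy.
        return False
    finally:
        conn.close()
    return status in healthy_statuses
```

A call such as http_head_check("192.168.0.10", 80, "/health") uses a placeholder address and path. Replace them with the intranet IP address, health check port, and health check path of your own backend server.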

Health checks of TCP listeners

For TCP listeners, SLB checks the status of backend servers by establishing TCP connections, as the following figure shows.

The health check process of a TCP listener is as follows:

  1. The LVS node server sends a TCP SYN packet to the intranet IP address and health check port of a backend ECS instance.
  2. After receiving the request, the backend server returns a TCP SYN and ACK packet if the corresponding port is listening normally.
  3. If the LVS node server does not receive the packet from the backend ECS instance within the specified response timeout period, the server determines that the service does not respond and the health check fails. Then, the server sends an RST packet to the backend ECS instance to terminate the TCP connection.
  4. If the LVS node server receives the packet from the backend ECS instance within the specified response timeout period, the server determines that the service runs properly and the health check succeeds. Then, the server sends an RST packet to the backend ECS instance to terminate the TCP connection.
Note In general, TCP three-way handshakes are conducted to establish a TCP connection. After the LVS node server receives the SYN and ACK data packet from the backend ECS instance, the LVS node server sends an ACK data packet, and then immediately sends an RST data packet to terminate the TCP connection.

This process may cause the backend server to treat the TCP connection as failed, for example as an abnormal exit, and to report a corresponding error message, such as Connection reset by peer.

Solution:

  • Use HTTP health checks.
  • If you have enabled the function of obtaining real client IP addresses, you can ignore connection errors that originate from the SLB CIDR block (100.64.0.0/10) described above.
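
The TCP probe described above, including the reset that ends it, can be imitated in user space. The following is a minimal sketch under stated assumptions: it runs on a POSIX system, it relies on the operating system to perform the handshake rather than crafting SYN packets as LVS does, and the function name tcp_check is hypothetical.

```python
import socket
import struct


def tcp_check(host, port, timeout=5.0):
    """Return True if a TCP handshake with (host, port) completes in time."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Timeout or refused connection: treat the port as not listening.
        return False
    # SO_LINGER with l_onoff=1 and l_linger=0 makes close() send an RST
    # instead of a normal FIN, similar to the probe described above.
    # Assumption: POSIX layout of struct linger (two C ints).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()
    return True
```

Running this probe against your own service is one way to reproduce the Connection reset by peer messages in a test environment and to confirm that they are harmless.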

Health checks of UDP listeners

For UDP listeners, SLB checks the status of backend servers by sending UDP packets, as shown in the following figure.

The health check process of a UDP listener is as follows:

  1. The LVS node server sends a UDP packet to the intranet IP address and health check port of the ECS instance according to health check configurations.
  2. If the corresponding port of the ECS instance is not listening normally, the system returns an ICMP error message, such as port XX unreachable. Otherwise, no message is sent.
  3. If the LVS node server receives the ICMP error message within the response timeout period, the ECS instance is declared as unhealthy.
  4. If the LVS node server does not receive any message within the response timeout period, the ECS instance is declared as healthy.
Note For UDP health checks, the health check result may fail to reflect the real status of a backend ECS instance in the following situation:

If the ECS instance runs Linux, the operating system limits the rate at which ICMP messages are sent during periods of high traffic, as a protection against ICMP attacks. In this case, even if an exception occurs on the ECS instance, SLB may declare the backend server as healthy because the error message port XX unreachable is not returned. The health check result then deviates from the actual service status.

Solution:

Specify a request and response pair for UDP health checks. If the specified response is returned, the ECS instance is considered healthy. Otherwise, the ECS instance is considered unhealthy. To achieve this, you must configure the client accordingly.
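
Both UDP modes described above can be sketched with a connected UDP socket. This is an illustration only, not the LVS implementation. The function name udp_check, the payloads, and the defaults are assumptions. On Linux, an ICMP port unreachable reply to a connected UDP socket is typically reported to the application as ConnectionRefusedError.

```python
import socket


def udp_check(host, port, request=b"ping", expected=None, timeout=2.0):
    """Probe a UDP port (hypothetical helper).

    If `expected` is None, silence within `timeout` counts as healthy and an
    ICMP error counts as unhealthy. If `expected` is set, the backend is
    healthy only when it replies with exactly that payload.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            # connect() lets the OS report ICMP errors on this socket.
            sock.connect((host, port))
            sock.send(request)
            data = sock.recv(4096)
        except socket.timeout:
            # No reply and no ICMP error within the timeout.
            return expected is None
        except OSError:
            # For example ConnectionRefusedError after ICMP port unreachable.
            return False
        return expected is None or data == expected
```

With expected set, a missing ICMP error can no longer make a failed backend look healthy, which is the point of the request and response pair.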

Health check time window

The health check function effectively improves the availability of your service. However, to prevent occasional health check failures from causing frequent status switching that would itself harm system availability, the status of a backend server changes (to healthy or unhealthy) only after the health check succeeds or fails a specified number of consecutive times within the time window. The health check time window is determined by the following three factors:

  • Health check interval: how often the health check is performed.
  • Response timeout: the length of time to wait for a response.
  • Health check threshold: the number of consecutive successes or failures of health checks.
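
The role of the threshold can be illustrated with a small state tracker that changes status only after enough consecutive contradicting results. This is a sketch of the idea, not SLB's internal logic. The class name HealthState and its default values are assumptions.

```python
class HealthState:
    """Track backend status using consecutive-result thresholds (illustrative)."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=3, healthy=True):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = healthy
        self._streak = 0  # consecutive results that contradict the current status

    def record(self, success: bool) -> bool:
        """Record one health check result and return the resulting status."""
        if success == self.healthy:
            # The result agrees with the current status, so the streak resets.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.healthy_threshold if success else self.unhealthy_threshold
            if self._streak >= needed:
                self.healthy = success
                self._streak = 0
        return self.healthy
```

With an unhealthy threshold of 3, two failures followed by one success leave the backend healthy, because the success resets the count of consecutive failures.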

The health check time window is calculated as follows:

  • Health check failure time window = Response timeout × Unhealthy threshold + Health check interval × (Unhealthy threshold - 1)
  • Health check success time window = Response time of a successful health check × Healthy threshold + Health check interval × (Healthy threshold - 1)
    Note The success response time of a health check is the duration from the time when the health check request is sent to the time when the response is received. When TCP health checks are used, the time is short and almost negligible because the health check only checks whether the port is alive. For HTTP health checks, the time depends on the performance and load of the application server and is generally within seconds.
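
As a worked example, the two formulas above can be evaluated directly. The function names and the sample values (a 5-second response timeout, a 2-second interval, thresholds of 3, and a 1-second successful response time) are assumptions chosen for illustration, not recommended settings.

```python
def failure_window(response_timeout, interval, unhealthy_threshold):
    """Health check failure time window, in the same unit as the inputs."""
    return response_timeout * unhealthy_threshold + interval * (unhealthy_threshold - 1)


def success_window(response_time, interval, healthy_threshold):
    """Health check success time window, in the same unit as the inputs."""
    return response_time * healthy_threshold + interval * (healthy_threshold - 1)


if __name__ == "__main__":
    # 5 * 3 + 2 * (3 - 1) = 19 seconds
    print(failure_window(response_timeout=5, interval=2, unhealthy_threshold=3))
    # 1 * 3 + 2 * (3 - 1) = 7 seconds
    print(success_window(response_time=1, interval=2, healthy_threshold=3))
```

With these sample values, a failed backend can keep receiving requests for up to 19 seconds before it is declared unhealthy, so shortening the timeout or interval narrows that window at the cost of more frequent probes.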

The health check result has the following impact on request forwarding:

  • If the health check of the target ECS instance fails, new requests are distributed to other ECS instances. The client access is normal.
  • If the health check of the target ECS instance succeeds, new requests are distributed to it. The client access is normal.
  • If a request arrives during a health check failure window, the request is still sent to the ECS instance because the ECS instance is still being checked and has not yet been declared unhealthy. As a result, the client access fails.