Working principles and types of health checks - Server Load Balancer

CLB uses health checks to determine the availability of backend servers. When health checks are enabled, if a backend server becomes unhealthy, CLB distributes requests to other healthy backend servers. When the server recovers, CLB resumes routing traffic to it. This mechanism improves service availability by preventing disruptions from individual server failures.

Important

If your workloads are sensitive to load, frequent health checks can affect normal service access. Depending on your business requirements, you can reduce the impact by lowering the health check frequency (that is, increasing the health check interval) or by switching from Layer 7 to Layer 4 checks. However, to ensure continuous service availability, we do not recommend disabling health checks.

How health checks work

A health check confirms the status of a server by sending requests at regular intervals.

CLB is deployed in a cluster. Its nodes handle both data forwarding and health checks. If a backend server is determined to be unhealthy, CLB stops distributing new client requests to it.

CLB health checks use the CIDR block 100.64.0.0/10. Ensure your backend servers do not block this CIDR block. You do not need to add an allow rule to your security group for this CIDR block. However, if you have configured other security policies, such as iptables, you must allow traffic from 100.64.0.0/10. This is a reserved CIDR block for Alibaba Cloud and poses no security risk.

HTTP/HTTPS listener health checks

Layer 7 listeners (HTTP/HTTPS) perform health checks by using HEAD or GET requests.

For an HTTPS listener, certificates are managed by CLB. CLB and backend servers exchange data over HTTP to improve performance.

The health check process for a Layer 7 listener is as follows:

A node sends an HTTP HEAD or GET request to a backend server, as specified in the listener configuration.
The backend server returns an HTTP status code.
If no response is received within the response timeout, the health check fails.
If a response is received within the response timeout, CLB compares the response's status code with the configured healthy status codes. If the status code matches, the check succeeds. Otherwise, it fails.
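The steps above can be sketched in Python with the standard library. This is an illustration only, not CLB's implementation: the function names, the "2xx"/"3xx" notation for status code classes, and the default path are assumptions made for the example. Note also that the timeout in http.client applies to each socket operation rather than to the whole exchange, so it only approximates a response timeout.

```python
import http.client

# Assumed notation for healthy status code classes (2xx and 3xx by default).
DEFAULT_HEALTHY_CLASSES = ("2xx", "3xx")


def status_is_healthy(status_code, healthy_classes=DEFAULT_HEALTHY_CLASSES):
    """Return True if the status code falls in one of the configured classes."""
    return f"{status_code // 100}xx" in healthy_classes


def http_health_check(host, port, path="/", timeout=5.0, method="HEAD",
                      healthy_classes=DEFAULT_HEALTHY_CLASSES):
    """Send one probe and report whether it passes.

    A refused connection or no response within `timeout` seconds is a failure.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request(method, path)
        response = conn.getresponse()
        return status_is_healthy(response.status, healthy_classes)
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

For example, status_is_healthy(302) passes with the default classes, while status_is_healthy(404) does not.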

Important

By default, CLB health checks treat only HTTP 2xx and 3xx status codes as healthy. If a backend server returns a 4xx status code (such as 400, 403, 404, or 429) or a 5xx status code (such as 500, 502, or 503), the health check fails.

We recommend creating a dedicated health check endpoint, such as /health, that returns an HTTP 200 status code, rather than adding 4xx or 5xx codes to the list of healthy status codes.
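A dedicated endpoint of this kind could look like the following minimal sketch, which uses Python's built-in http.server. The handler name, the response body, and the default port 8080 are assumptions for illustration. A production endpoint would typically also verify that the application's dependencies are reachable before it returns 200.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET and HEAD on /health with 200; every other path gets 404."""

    def _respond(self, include_body):
        if self.path == "/health":
            body = b"ok\n"
            self.send_response(200)
        else:
            body = b"not found\n"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._respond(include_body=True)

    def do_HEAD(self):
        # HEAD returns the same status and headers as GET, without a body.
        self._respond(include_body=False)

    def log_message(self, format, *args):
        pass  # keep frequent probes out of the access log


def make_health_server(host="0.0.0.0", port=8080):
    """Build the server; call serve_forever() on the result to start it."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Supporting both HEAD and GET keeps the endpoint compatible with either health check method.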

TCP listener health checks

To improve health check efficiency for a Layer 4 TCP listener, health checks use a custom TCP probe to obtain status information, as shown in the following figure.

The health check process for a TCP listener is as follows:

A node in the Layer 4 cluster sends a TCP SYN packet to the internal IP address and health check port of a backend server based on the health check configuration.
If the port on the backend server is actively listening, it responds with a SYN+ACK packet.
If the node does not receive a response from the backend server within the response timeout, the health check fails. The node then sends an RST packet to terminate the TCP connection.
If the node receives a response from the backend server within the response timeout, the health check succeeds. The node then sends an RST packet to terminate the TCP connection.
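The following Python sketch approximates this probe from user space. It is not CLB's implementation: connect() completes the full three-way handshake before the reset, whereas the probe described above sends the RST right after the SYN+ACK. The function name is hypothetical, and the SO_LINGER struct layout assumes a POSIX system.

```python
import socket
import struct


def tcp_health_check(host, port, timeout=5.0):
    """Return True if the port accepts a TCP connection within `timeout` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except OSError:
        # Covers timeouts and refused connections: the check fails.
        return False
    else:
        # SO_LINGER enabled with a zero timeout makes close() send RST, not FIN.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        return True
    finally:
        sock.close()
```

Because the connection ends with RST in either case, a backend application that has already accepted the connection can see it as reset.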

Note

This mechanism might cause the backend server to log TCP connection errors, such as Connection reset by peer, in its application logs.

Solutions:

Use HTTP-based health checks for the TCP listener.
Enable client IP preservation on the backend server, and then configure it to ignore connection errors from the CLB service CIDR block.
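For the second solution, the backend application needs a way to recognize probe traffic. A minimal sketch, assuming the application can see the peer IP address of each connection; the function names and the logging hook are hypothetical:

```python
import ipaddress

# CIDR block that this document gives as the source of CLB health checks.
CLB_HEALTH_CHECK_NET = ipaddress.ip_network("100.64.0.0/10")


def is_clb_probe(peer_ip):
    """Return True if peer_ip falls inside the CLB health check CIDR block."""
    return ipaddress.ip_address(peer_ip) in CLB_HEALTH_CHECK_NET


def should_log_reset(peer_ip):
    """Hypothetical logging hook: skip reset errors caused by CLB probes."""
    return not is_clb_probe(peer_ip)
```

The block 100.64.0.0/10 covers 100.64.0.0 through 100.127.255.255, so should_log_reset("100.64.0.1") returns False and should_log_reset("10.0.0.1") returns True.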

UDP listener health checks

For a Layer 4 UDP listener, a health check uses a UDP probe to get status information, as shown in the following figure.

The health check process for a UDP listener is as follows:

A node in the Layer 4 cluster sends a UDP packet to the internal IP address and health check port of a backend server based on the health check configuration.
If the backend server's port is not listening, its operating system returns an ICMP error message, such as port XX unreachable. If the port is listening, no message is returned.
If the node receives this ICMP error message within the response timeout, the health check fails.
If the node receives no response from the backend server within the response timeout, the health check succeeds.
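The logic above can be sketched in Python as follows. This is an approximation, not CLB's implementation. It assumes an operating system that reports an ICMP port unreachable message on a connected UDP socket as a "connection refused" error, as Linux does. The function name and the one-byte payload are hypothetical.

```python
import socket


def udp_health_check(host, port, timeout=5.0, payload=b"\x00"):
    """Return False on an ICMP error, True if nothing comes back in time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        # connect() on a UDP socket lets the OS deliver ICMP errors to us.
        sock.connect((host, port))
        sock.send(payload)
        sock.recv(1024)
    except ConnectionRefusedError:
        # ICMP "port unreachable" came back: nothing is listening.
        return False
    except socket.timeout:
        # Silence within the timeout counts as a pass.
        return True
    except OSError:
        return False  # assumption: other network errors count as failures
    else:
        return True  # assumption: any reply also counts as a pass
    finally:
        sock.close()
```

Note that a passing check here costs the full timeout, because success is inferred from the absence of an error.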

Note

A health check for a UDP service may not accurately reflect the true status of the service in the following scenario:

If the backend server runs Linux, the operating system's ICMP flood protection may limit the rate at which it sends ICMP messages during periods of high concurrency. In this case, even if the service is down, CLB may not receive the port XX unreachable error. CLB then incorrectly marks the check as successful because no ICMP error was received, leading to a mismatch between the health check status and the actual service status.

Solution:

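Configure the health check to send a specific string to the backend server and to expect a specific response. The check succeeds only if CLB receives the expected response within the response timeout. This method requires changes to the backend application so that it returns that response.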
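A request/response check of this kind could be sketched as follows. The strings b"ping" and b"pong", the function names, and the exact-match comparison are assumptions for illustration, not CLB's actual behavior.

```python
import socket


def udp_string_health_check(host, port, request, expected, timeout=5.0):
    """Pass only if the backend answers `request` with exactly `expected`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        sock.send(request)
        return sock.recv(4096) == expected
    except OSError:
        # Covers timeouts and ICMP errors: no valid reply means failure.
        return False
    finally:
        sock.close()


def serve_one_probe(sock, request=b"ping", reply=b"pong"):
    """Backend side: answer a single matching probe on a bound UDP socket."""
    data, addr = sock.recvfrom(4096)
    if data == request:
        sock.sendto(reply, addr)
```

Unlike the ICMP-based check, silence now counts as a failure, so a service that is down can no longer be reported as healthy.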

Health check time window

Health checks improve service availability. To prevent instability from frequent status changes, CLB changes a backend server's status only after it consecutively passes or fails multiple checks. This health check window depends on three factors:

Health check interval: The time between two consecutive health checks.
Response timeout: The maximum time to wait for a response from the server.
Health check threshold: The number of consecutive successful or failed health checks required to change the server's status.

The time window is calculated using the following formulas:

Time window for a health check to fail = response timeout × unhealthy threshold + health check interval × (unhealthy threshold - 1)
Time window for a health check to succeed = response time for a successful check × healthy threshold + health check interval × (healthy threshold - 1)

How health check status affects request forwarding:

If a backend server is marked unhealthy (it has reached the unhealthy threshold), new requests are no longer distributed to it, so client access is not affected.
If a backend server is marked healthy (it has reached the healthy threshold), new requests are distributed to it, and client access is normal.
If a backend server has an issue and is within the time window for a health check to fail but has not yet reached the unhealthy threshold (by default, three consecutive failures), requests are still distributed to it. This can cause client requests to fail.
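The threshold behavior described above can be modeled as a small state tracker. This is a sketch under stated assumptions: the class name is hypothetical, a server is assumed to start in the healthy state, and both thresholds default to 3.

```python
class BackendHealth:
    """Flips status only after enough consecutive contradicting results."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=3, healthy=True):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = healthy
        self._streak = 0  # consecutive results that contradict the status

    def record(self, passed):
        """Record one check result and return the (possibly updated) status."""
        if passed == self.healthy:
            # The result agrees with the current status: reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if passed else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = passed
            self._streak = 0
        return self.healthy
```

With the defaults, the first two failed checks leave the server healthy, which is why it keeps receiving requests during the failure window. Only the third consecutive failure changes its status.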

Example: Response timeout and health check interval

Consider the following health check configuration:

Response timeout: 5 seconds
Health check interval: 2 seconds
Healthy threshold: 3
Unhealthy threshold: 3

Time window for a health check to fail = response timeout × unhealthy threshold + health check interval × (unhealthy threshold - 1) = 5 × 3 + 2 × (3 - 1) = 19 seconds. If every failed check runs to the full response timeout, the server is marked unhealthy 19 seconds after the first failed check starts.

Time window for a health check to succeed = response time for a successful check × healthy threshold + health check interval × (healthy threshold - 1). Assuming a 1-second response time, this is 1 × 3 + 2 × (3 - 1) = 7 seconds. An unhealthy server is therefore marked healthy about 7 seconds after the first successful check starts.
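The same arithmetic can be expressed as two small helper functions, which makes it easy to try other settings. The function and parameter names are illustrative, and all values are in seconds.

```python
def failure_window(response_timeout, interval, unhealthy_threshold):
    """Time from the first failed check until the server is marked unhealthy."""
    return (response_timeout * unhealthy_threshold
            + interval * (unhealthy_threshold - 1))


def success_window(response_time, interval, healthy_threshold):
    """Time from the first successful check until the server is marked healthy."""
    return (response_time * healthy_threshold
            + interval * (healthy_threshold - 1))


print(failure_window(5, 2, 3))  # 19
print(success_window(1, 2, 3))  # 7
```

Lowering the response timeout shortens the failure window the most, because it is multiplied by the full unhealthy threshold.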

Note

The response time for a successful check is the duration between sending a health check request and receiving its response. For a TCP health check, this time is very short and can be ignored because it only probes whether the port is active. For an HTTP health check, this time depends on the performance and load of the backend server but is typically less than one second.

Domain name for HTTP health checks

When you use an HTTP health check, you can optionally set a domain name. Some backend servers require the Host header to be present in the request and will validate its content. If you configure a domain name, CLB adds it to the Host header of the health check request. If you do not configure a domain name, the server might reject the health check request, causing the check to fail.

Therefore, if your backend server validates the Host header, you must configure a domain name to ensure that health checks work correctly.
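To see how a backend responds with and without a configured domain name, you can send a probe by hand. The following sketch uses Python's http.client. The function name is hypothetical, and the address and domain in the usage note are placeholders. When the headers include Host, http.client uses that value instead of generating one from the connection address.

```python
import http.client


def http_check_with_host(ip, port, path="/", domain=None, timeout=5.0):
    """Send a HEAD probe to `ip`, optionally overriding the Host header.

    Returns the HTTP status code, or None if no response was received.
    """
    headers = {"Host": domain} if domain else {}
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.request("HEAD", path, headers=headers)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()
```

For example, compare http_check_with_host("192.0.2.10", 80, domain="www.example.com") with the same call without domain. If only the first returns a healthy status code, the backend validates the Host header and the health check needs a domain name.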

References

You must configure a health check when you add a listener. For instructions, see Configure and manage CLB health checks.
For frequently asked questions about health checks, see the CLB health check FAQ.