Serverless App Engine: Best practices for health checks

Last Updated: Dec 18, 2023

This topic describes the types, parameters, and recommended parameter values of health checks that are supported by Serverless App Engine (SAE).

Background information

SAE provides the following types of health checks based on Kubernetes:

  • Liveness: checks whether a container needs to be restarted.

  • Readiness: checks whether a container is ready to receive traffic.

Core parameters:

  • Check Method: TCP, HTTP, and CMD.

  • Latency: the period of time that the probe waits after the container starts before it performs the first health check.

  • Check Interval: the interval at which the probe checks a container. A shorter interval makes detection more sensitive.

  • Timeout period: the maximum period of time that the probe waits for a response before the check is counted as failed.

  • Success Threshold: the minimum number of consecutive successful checks that are required for a container to be considered healthy again after a failed check. The Success Threshold parameter must be set to 1 for a liveness probe.

  • Failure Threshold: the number of consecutive failed checks after which the container is considered to have failed the health check.

Recommended configurations (Quick settings)

Parameter

Description

Liveness

If you set the Check Method parameter to TCP Port Check, you must set the Latency parameter to a value that is close to the startup time of the application. The Success Threshold parameter is set to 1 and the Failure Threshold parameter is set to 3. You cannot change the values of the Success Threshold and Failure Threshold parameters.

Readiness

If you set the Check Method parameter to HTTP Request Check, you must set the Latency parameter to a value that is greater than the startup time of the application. The Success Threshold parameter is set to 1 and the Failure Threshold parameter is set to 1. You cannot change the values of the Success Threshold and Failure Threshold parameters.

Timeout Period

Default value: 1. Unit: seconds. In most cases, you can use the default value. If the expected API response time is greater than 1 second, you can specify a greater value.

Check Interval

The check interval controls the sensitivity of detection: a shorter check interval means that failures are detected sooner. Theoretically, high-frequency checks do not affect your business. However, if the check interval of a liveness probe is too short, application containers may be restarted because of short-lived failures.

You can use the following formula to calculate the check interval of a liveness probe: Check interval = Maximum tolerable failure duration of an instance / 3. For example, if you can tolerate a faulty instance for up to 30 seconds before it is restarted, set the Check Interval parameter to 10 seconds. You can set the check interval of a readiness probe to 1 second. Specify the check interval of a liveness probe based on your business scenario. If you do not have special requirements, use the default value of 30 seconds.
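The formula above can be sketched in Python as follows. The function name is hypothetical, and treating the divisor 3 as the liveness failure threshold is an assumption based on the fixed Failure Threshold value described in this topic, not something the formula itself states.

```python
def liveness_check_interval(max_tolerable_failure_seconds: float,
                            failure_threshold: int = 3) -> float:
    """Return a check interval, in seconds, for a liveness probe.

    Assumption: the divisor in the documented formula corresponds to the
    liveness failure threshold, which defaults to 3.
    """
    if max_tolerable_failure_seconds <= 0:
        raise ValueError("max_tolerable_failure_seconds must be positive")
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    return max_tolerable_failure_seconds / failure_threshold


# The example from the text: 30 seconds of tolerable failure.
print(liveness_check_interval(30))  # 10.0
```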

Parameter settings (Advanced settings)

Latency

The Latency parameter is important for a liveness configuration. The latency determines when a health check probe first checks the status of an application container. If the latency is too short, the liveness probe performs health checks before the application has finished starting, and the container is repeatedly restarted.

For example, a Java application may require 2 minutes to start. The application never finishes starting if you use the following default settings: the latency is 10 seconds, the check interval is 30 seconds, and the Failure Threshold parameter is set to 3. In this example, the liveness probe performs three health checks (at about 10, 40, and 70 seconds) before the application has started. All three checks fail, the failure threshold is reached, and the container is restarted. The same sequence repeats after each restart.
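The timing in this example can be checked with a small simulation. This is a simplified sketch, not SAE's actual scheduling logic: it assumes that every check performed before the application has started fails, that every later check succeeds, and that timeouts are ignored. The function name is hypothetical.

```python
def first_restart_time(latency, interval, failure_threshold, startup_time):
    """Return the time (seconds after container start) at which a liveness
    probe would trigger a restart, or None if the application starts first.

    Simplified model: a check fails if and only if it runs before
    `startup_time`; check timeouts are not modeled.
    """
    failures = 0
    check_time = latency  # the first check runs after the initial latency
    while check_time < startup_time:
        failures += 1
        if failures >= failure_threshold:
            return check_time
        check_time += interval
    return None


# Default settings against an application that needs 2 minutes to start:
# checks at 10, 40, and 70 seconds all fail, so a restart is triggered.
print(first_restart_time(10, 30, 3, 120))   # 70

# A latency of 5 minutes gives the application time to start.
print(first_restart_time(300, 30, 3, 120))  # None
```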

If you do not know the startup time of an application the first time you deploy the application, we recommend that you set the Latency parameter to a longer period, such as 5 minutes. After the application is started, you can change the latency based on the approximate startup time that is recorded in business logs. However, a long latency may extend the release time.

[Figure: different latency settings of a liveness check]

TCP and HTTP check methods

You must specify different check methods for the readiness configuration and liveness configuration of an application. The liveness probe and readiness probe have different functions, and using the same check method for both introduces risks. In some cases, an application is temporarily unavailable due to traffic congestion. In this case, you only need to remove traffic from the current instance. You do not need to restart the container. After the pending requests are processed, the instance can receive traffic again. If the container is restarted instead, the following issues may occur:

  • The requests that are being processed are lost.

  • The instance may be unable to receive traffic for a longer period of time, or the restart of the current instance may cause an avalanche.

The Spring Boot framework provides a built-in health check feature for Java applications. You can use the feature to check the connection and heartbeat of multiple components, such as Redis and Nacos. However, exceptions such as network jitters and the unavailability of dependent component services must not be used as the basis to determine whether the current application needs to be restarted.

To prevent unexpected instance restarts due to downstream network jitters, you must be familiar with the Liveness and Readiness parameters and select different check methods for the probes. If you cannot call an API operation to check the status of an application, select the TCP check method for a liveness probe and the HTTP check method for a readiness probe. A restart is a disruptive operation and requires caution, so the TCP check method, which has high fault tolerance, is suitable for a liveness probe. The HTTP check method reflects the actual status of an application and is suitable for a readiness probe.

Note

For example, a TCP connection to a port can be established even when the application does not return the expected status code 200 to an HTTP request.
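The difference between the two check methods can be illustrated with a minimal Python sketch that uses only the standard library. It is not how SAE implements its probes. The `BusyHandler` server, the `/health` path, and the helper names are hypothetical, and the HTTP check treats only status code 200 as a success, following the note above.

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def tcp_check(host, port, timeout=1.0):
    """Pass if a TCP connection can be established within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url, timeout=1.0):
    """Pass only if the endpoint answers with status code 200."""
    # Bypass any proxy configured in the environment for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # non-2xx responses, refused connections, timeouts


class BusyHandler(BaseHTTPRequestHandler):
    """Hypothetical application that is running but not ready for traffic."""

    def do_GET(self):
        self.send_response(503)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def demo():
    """Run both checks against the busy application."""
    server = HTTPServer(("127.0.0.1", 0), BusyHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return (tcp_check("127.0.0.1", port),
                http_check(f"http://127.0.0.1:{port}/health"))
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    tcp_ok, http_ok = demo()
    print("TCP check passed:", tcp_ok)    # the port accepts connections
    print("HTTP check passed:", http_ok)  # the application returns 503
```

In this sketch, the TCP check passes while the HTTP check fails. With TCP for liveness and HTTP for readiness, such an instance would stop receiving traffic but would not be restarted.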

Failure threshold and success threshold

A liveness probe performs only one action: it restarts the container if the health check fails. The Success Threshold parameter is set to 1 and the value cannot be changed. A threshold that is greater than 1 is meaningless because no explicit action is performed when a liveness check succeeds. The default failure threshold is 3. If three consecutive checks fail, the container fails the liveness health check and is restarted.

A readiness probe performs two types of actions: it allows the container to receive traffic if the health check is successful and removes traffic from the container if the health check fails. To ensure the availability of applications, we recommend that you set both the success threshold and the failure threshold to 1. If a container fails the health check, traffic is immediately removed from the container, which limits traffic loss at the earliest opportunity. If a container passes the health check (the application is considered ready to receive traffic), traffic is routed to the container again. Fast traffic switches may increase the average load of the remaining instances. Therefore, we recommend that you also configure a metric-based auto scaling policy.

Note

In this section, traffic refers to the inbound traffic that is generated when an SAE application that is associated with a Server Load Balancer (SLB) instance is accessed. It does not refer to the traffic that is generated by internal calls between microservices applications.
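The effect of the two thresholds described in this section can be sketched as a small state machine. This is a simplified model under stated assumptions: the function name and the sample check results are hypothetical, and for a liveness probe the unhealthy state stands for the point at which the container would be restarted.

```python
from typing import Iterable, List


def probe_states(results: Iterable[bool], success_threshold: int = 1,
                 failure_threshold: int = 1,
                 initially_healthy: bool = False) -> List[bool]:
    """Return the probe state (True = healthy) after each check.

    The state becomes unhealthy after `failure_threshold` consecutive
    failed checks and healthy again after `success_threshold` consecutive
    successful checks.
    """
    healthy = initially_healthy
    successes = failures = 0
    states = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if not healthy and successes >= success_threshold:
                healthy = True
        else:
            failures += 1
            successes = 0
            if healthy and failures >= failure_threshold:
                healthy = False
        states.append(healthy)
    return states


checks = [True, False, True, True, False, False, False, True]

# Readiness with both thresholds set to 1: the state follows every check.
print(probe_states(checks, 1, 1))
# [True, False, True, True, False, False, False, True]

# Liveness with a failure threshold of 3: a single failure is tolerated,
# and only the third consecutive failure marks the container as failed.
print(probe_states(checks, 1, 3, initially_healthy=True))
# [True, True, True, True, True, True, False, True]
```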

[Figure: thresholds of health checks]