Configure service health checks for a cloud-native API gateway

You can configure health checks for a service to monitor the status of its backend services. If a service instance node becomes abnormal, you can take it offline or isolate it to ensure service availability. You can also configure a panic threshold to maintain basic service functionality during extreme situations. This topic describes the service health check feature and provides the steps to configure it.

Scenarios

Active health check: Automatically takes abnormal instance nodes offline and restores them after they recover. The system actively sends requests, such as TCP connections or HTTP GET requests, to probe service nodes and determine their availability. This feature improves service availability when the backend service has multiple replicas.

Passive health check: Analyzes node health based on the failure rate of actual traffic requests. If a node has a high failure rate, the system temporarily isolates it. The node is automatically restored after it recovers.

Panic threshold: Prevents fault propagation across the entire cluster when the system load increases or some nodes fail. This avoids a system-wide service failure.

Procedure

Note

TCP health checks are enabled by default when you create a service.

Log on to the API Gateway console.
In the left-side navigation pane, click Cloud-native API Gateway > Instance. In the top navigation bar, select a region.
On the Instance page, click the target instance ID.
In the left-side navigation pane, click Service. Then, click the Services tab.

Find the service and click Health Check Settings in the Actions column. Turn on the desired health check switch and set the parameters as needed.

Configure an active health check

In the Configure Health Check panel, turn on the Active Health Check switch, configure the parameters, and click OK. The following table describes the parameters.

Configuration item	Example value	Description
Health Check Protocol	HTTP	TCP: A TCP health check sends SYN handshake messages to detect whether the server port is active. HTTP: An HTTP health check sends requests that simulate browser access to check whether the server application is healthy.
Health Check Path	/	The URI of the page that the health check requests. We recommend that you check a static page.
Normal Status Codes	http_2xx	The HTTP status code that indicates a successful health check.
Response Timeout Period	2	The maximum time to wait for a health check response. If no response is received within this period, the health check is considered failed.
Health Check Interval	2	The interval between two consecutive health checks. Unit: seconds.
Healthy Threshold in Health Check	2	The number of consecutive successful health checks required for a server to transition from Failed to Normal.
Unhealthy Threshold in Health Check	2	The number of consecutive failed health checks required for a server to transition from Normal to Failed.
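
The way the healthy and unhealthy thresholds interact can be sketched as a small state machine. This is an illustrative model only, not the gateway's implementation: the NodeHealth class and its field names are hypothetical, and it assumes that a success resets the failure count and a failure resets the success count.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Tracks one node's status from consecutive probe results (hypothetical model)."""
    healthy_threshold: int = 2    # consecutive successes needed: Failed -> Normal
    unhealthy_threshold: int = 2  # consecutive failures needed: Normal -> Failed
    status: str = "Normal"
    successes: int = 0
    failures: int = 0

    def record(self, probe_ok: bool) -> str:
        """Apply one probe result and return the resulting status."""
        if probe_ok:
            self.successes += 1
            self.failures = 0  # assumption: a success breaks the failure streak
            if self.status == "Failed" and self.successes >= self.healthy_threshold:
                self.status = "Normal"
        else:
            self.failures += 1
            self.successes = 0  # assumption: a failure breaks the success streak
            if self.status == "Normal" and self.failures >= self.unhealthy_threshold:
                self.status = "Failed"
        return self.status


# Two failed probes take the node offline; two successful probes restore it.
node = NodeHealth()
history = [node.record(ok) for ok in [False, False, True, True]]
print(history)  # ['Normal', 'Failed', 'Failed', 'Normal']
```

With the example values in the table (interval 2, both thresholds 2), a node in this model changes state after two consecutive probes with the same result.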

Configure a passive health check

In the Configure Health Check panel, turn on the Passive health check switch, configure the parameters, and click OK. The following table describes the parameters.

Configuration item	Example value	Description
Failure Rate Threshold	80	When the percentage of failed requests for a node reaches this threshold, the system triggers the outlier detection mechanism for that node. Unit: %.
Detection interval	30	The system calculates the request failure rate of a node at the specified interval, such as every 30 seconds. Unit: seconds.
Initial Isolation Duration	30	The initial duration, such as 30 seconds, for which a node is isolated after being ejected. The isolation duration is calculated using the formula: k × base_ejection_time. The initial value of k is 1. Each ejection extends the isolation duration (k is incremented by 1). If the node passes consecutive health checks, the isolation duration is gradually shortened (k is decremented by 1). Unit: seconds.

Note

To use the passive health check feature, you must upgrade the Deep Packet Inspection (DPI) engine to version 2.1.9 or later.

When you update the passive health check configuration, the passive health check status is reset, and all isolated nodes are restored.
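
The failure rate trigger and the isolation formula k × base_ejection_time can be worked through with the example values from the table. The sketch below is a simplified model, not the gateway's code: the function names are hypothetical, it assumes that a node with no traffic in an interval is not ejected, and it leaves out the decrement of k after recovery.

```python
BASE_EJECTION_TIME = 30  # seconds, the example Initial Isolation Duration


def isolation_duration(k: int, base: int = BASE_EJECTION_TIME) -> int:
    """Isolation duration in seconds: k x base_ejection_time."""
    return k * base


def should_eject(failed: int, total: int, threshold_pct: int = 80) -> bool:
    """True when the failure rate in one detection interval reaches the threshold."""
    if total == 0:
        return False  # assumption: no traffic means nothing to judge
    # Integer comparison avoids floating-point rounding at the boundary.
    return failed * 100 >= threshold_pct * total


# Three detection intervals in which the same node exceeds or meets the threshold.
k = 1  # initial value of k
durations = []
for failed, total in [(9, 10), (8, 10), (10, 10)]:
    if should_eject(failed, total):
        durations.append(isolation_duration(k))
        k += 1  # each ejection extends the next isolation period
print(durations)  # [30, 60, 90]
```

In this model, a node that keeps failing is isolated for 30, 60, and then 90 seconds on successive ejections.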

Panic threshold

The panic threshold prevents fault propagation across the entire cluster when the system load increases or some nodes fail. This avoids a system-wide service failure. This mechanism balances availability and correctness to ensure basic service functionality in extreme situations.

The behavior is as follows:

When the percentage of healthy nodes in the cluster is higher than the panic threshold, the health check mechanism works as expected. Requests are routed only to nodes marked as healthy. Failed or ejected nodes no longer receive traffic.
When the percentage of healthy nodes in the cluster is less than or equal to the panic threshold, the system enters "panic mode". The health check mechanism is temporarily bypassed, and requests are forwarded evenly to all nodes, including those marked as unhealthy or ejected.

This configuration is designed to prevent cascading failures that can occur when a few remaining healthy nodes become overloaded by handling all traffic after many nodes become abnormal. By resuming calls to some "unhealthy" nodes, the overall fault tolerance and availability of the service are improved.
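
The routing decision described above can be sketched as follows. This is a minimal illustration under stated assumptions: the routable_nodes function and the node names are hypothetical, and the sketch only selects which nodes are eligible for traffic. It does not model how requests are balanced across them.

```python
def routable_nodes(nodes: dict[str, bool], panic_threshold_pct: float = 1.0) -> list[str]:
    """Return the nodes eligible for traffic.

    nodes maps a node name to its health status (True = healthy).
    """
    if not nodes:
        return []
    healthy = [name for name, ok in nodes.items() if ok]
    healthy_pct = len(healthy) * 100 / len(nodes)
    if healthy_pct <= panic_threshold_pct:
        # Panic mode: health status is ignored and every node is eligible.
        return list(nodes)
    # Normal mode: only healthy nodes receive traffic.
    return healthy


cluster = {"a": True, "b": False, "c": False, "d": False}  # 25% healthy
print(routable_nodes(cluster))                          # ['a']
print(routable_nodes(cluster, panic_threshold_pct=50))  # ['a', 'b', 'c', 'd']
```

With the default threshold of 1%, this model enters panic mode only when almost no node is healthy. A higher threshold spreads traffic to unhealthy nodes earlier, which trades correctness for protection of the remaining healthy nodes.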

Note

To maximize service availability in extreme scenarios, the panic threshold is set to 1% by default. When the percentage of healthy nodes drops to this threshold or below, the system switches to panic mode and forwards requests to all nodes.

You can adjust this threshold based on your business scenarios and disaster recovery capabilities to achieve the best balance between stability and service correctness.

Troubleshoot health check failures

General cases of abnormal health checks

Perform the following steps:

If a TCP health check fails, a connection to the corresponding node cannot be established. Verify the following:
- Check whether the node exists.
- Check whether an excessive number of concurrent connections are established.
If an HTTP health check fails, switch to a TCP health check and verify that a connection can be established. If the TCP health check is normal, verify that the configured health check path is correct. You can test access using tools such as cURL or Postman.
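
As an alternative to cURL or Postman, the two checks can be reproduced manually with a short script run from a machine that can reach the node. This is a rough sketch using only the Python standard library. The function names are hypothetical, and the script approximates the probes rather than replicating the gateway's exact behavior. It treats any 2xx response as healthy and bypasses any configured HTTP proxy.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host: str, port: int, path: str = "/", timeout: float = 2.0) -> bool:
    """Return True if GET http://host:port<path> answers with a 2xx status."""
    url = f"http://{host}:{port}{path}"
    # An empty ProxyHandler makes the request go directly to the node.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Non-2xx responses such as 404 or 500 raise HTTPError, a URLError subclass.
        return False
```

For example, if tcp_check("10.0.0.12", 8080) returns True but http_check("10.0.0.12", 8080, "/health") returns False, the port is reachable and the configured health check path is the likely cause. The address and path here are placeholders.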

Troubleshoot health check failures when you add a service for the first time

Perform the following steps:

Check whether the virtual private cloud (VPC) that you purchased is the same as the VPC in which the Cloud-native API Gateway instance is deployed, or whether the environment in which the service resides is connected to that VPC through Cloud Enterprise Network (CEN) or a physical connection. If the two VPCs are different and are not connected to each other, the IP address of the Cloud-native API Gateway instance cannot be accessed.
Note
Cloud-native API Gateway instances do not support on-premises services that are registered with Nacos and ZooKeeper instances.
Verify that security group authorization is configured. If the service source is an ACK service, you must authorize the security group of the container cluster. For more information, see Configure security group rules.
If the IP address of the unhealthy instance is an Internet IP address, verify that an Internet NAT gateway is enabled for the VPC where the gateway is located.