
API Gateway: Configure service health checks

Last Updated: Oct 13, 2025

You can configure health checks for a service to monitor the status of its backend services. If a service instance node becomes abnormal, you can take it offline or isolate it. This practice ensures the availability of interfaces routed to the service. You can also configure a panic threshold to maintain the basic service capability of the system in extreme situations. This topic describes the service health check feature and explains how to configure it.

Scenarios

Active health checks: This feature automatically takes individual abnormal instance nodes offline. Active health checks send requests, such as TCP connections or HTTP GET requests, to probe whether service nodes are alive and determine their availability. The nodes are automatically brought back online after they recover. This feature improves the availability of interfaces routed to the service when the backend service is deployed with multiple replicas.

Passive health checks: This feature dynamically analyzes the health of nodes based on the failure rate of actual traffic requests. If a node behaves abnormally, such as having a high failure rate, it is temporarily isolated. The node is automatically re-enabled after it recovers.

Panic threshold: This feature prevents fault propagation across the entire cluster when the system load increases or some nodes fail. This helps avoid a systemic failure of the service.

Procedure

Note

TCP health checks are enabled by default when you create a service.

  1. Log on to the AI Gateway console.

  2. In the navigation pane on the left, click AI Gateway > Instance. In the navigation bar at the top, select a region.

  3. On the Instance page, click the ID of the gateway instance you want to manage.

  4. In the navigation pane on the left, click Service. Then, click the Services tab.

  5. In the Actions column of the target service, click Health Check Configuration. For the relevant health check type, select Enable and complete the configuration.

    Configure active health checks

    In the Configure Health Check panel, turn on the Enable Active Health Check switch, configure the parameters, and then click OK. The following table describes the configuration items.

    | Configuration Item | Example Value | Description |
    | --- | --- | --- |
    | Health Check Protocol | HTTP | TCP health checks send SYN handshake messages to detect whether server ports are alive. HTTP health checks send requests that simulate browser access to check whether server applications are healthy. |
    | Health Check Path | / | The URI of the page used for the health check. Use a static page for the check. |
    | Normal Status Code | http_2xx | The HTTP status code that indicates a successful health check. |
    | Health Check Response Timeout Period | 2 | The maximum timeout period for each health check response. A timeout indicates an unhealthy status. |
    | Health Check Interval | 2 | The time interval between two consecutive health checks. |
    | Healthy Threshold | 2 | The number of consecutive successful health checks required for an unhealthy Elastic Compute Service (ECS) instance to be considered healthy. |
    | Unhealthy Threshold | 2 | The number of consecutive failed health checks required for a healthy ECS instance to be considered unhealthy. |
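The Healthy Threshold and Unhealthy Threshold parameters add hysteresis: a node changes state only after several consecutive probe results agree, so a single flaky probe does not flip it. The following is a minimal sketch of that counting logic, assuming thresholds of 2; the class and method names are illustrative, not the gateway's actual implementation.

```python
# Illustrative sketch of consecutive-threshold state flipping; names are
# hypothetical and do not reflect the gateway's internal code.

class NodeHealth:
    def __init__(self, healthy_threshold=2, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self.successes = 0  # consecutive successful probes
        self.failures = 0   # consecutive failed probes

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the node's current state."""
        if probe_ok:
            self.successes += 1
            self.failures = 0
            # An unhealthy node comes back online only after N consecutive successes.
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            # A healthy node is taken offline only after N consecutive failures.
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With both thresholds set to 2, one failed probe leaves a node online; a second consecutive failure takes it offline, and it returns only after two consecutive successes.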

    Configure passive health checks

    In the Configure Health Check panel, turn on the Enable Passive Health Check switch, configure the parameters, and then click OK. The following table describes the configuration items.

    | Configuration Item | Example Value | Description |
    | --- | --- | --- |
    | Failure Rate Threshold | 80 | When the percentage of failed requests for a node reaches this threshold, the system triggers the ejection mechanism for that node. |
    | Detection Interval | 30 | The system calculates the request failure rate of a node at the specified interval, such as every 30 seconds. |
    | Initial Isolation Duration | 30 | The initial duration, such as 30 seconds, for which a node is isolated after being ejected. The isolation duration is calculated as k × base_ejection_time, where k starts at 1. Each ejection extends the isolation duration by incrementing k; consecutive successful checks gradually shorten it by decrementing k. |

    Note

    To use the passive health check feature, you must upgrade the engine to version 2.1.10 or later.

    When you update the passive health check configuration, the passive health check status is reset, and all isolated nodes are re-enabled.
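The isolation-duration rule above can be sketched in a few lines. This is a minimal illustration of k × base_ejection_time; the exact increment/decrement policy for k shown here is an assumption for clarity, not the gateway's actual code.

```python
# Sketch of the isolation-duration formula described above.
# The next_k policy is an illustrative assumption.

def isolation_duration(k: int, base_ejection_time: int = 30) -> int:
    """Seconds a node stays isolated when the multiplier is k."""
    return k * base_ejection_time

def next_k(k: int, ejected_again: bool) -> int:
    """Each new ejection increments k; recovery decrements it (never below 1)."""
    return k + 1 if ejected_again else max(1, k - 1)
```

With a 30-second base, a node's first ejection isolates it for 30 seconds; if it is ejected again right after re-enabling, the next isolation lasts 60 seconds, and so on.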

Panic threshold

The panic threshold prevents fault propagation across the entire cluster when the system load increases or some nodes fail. This helps avoid a systemic failure of the service. This mechanism balances availability and correctness to ensure basic service capability in extreme situations.

The behavior is as follows:

  • If the percentage of healthy nodes in the cluster is higher than the panic threshold, the health check mechanism works as expected. Requests are routed only to nodes that are marked as healthy. Failed or ejected nodes no longer receive traffic.

  • If the percentage of healthy nodes in the cluster is less than or equal to the panic threshold, the system enters "panic mode". The health check mechanism is temporarily bypassed, and requests are forwarded evenly to all nodes, including those marked as unhealthy or ejected.

This configuration is designed to prevent the few remaining healthy nodes from being overloaded with all traffic when many nodes become abnormal, which helps avoid cascading failures. By resuming calls to some "unhealthy" nodes, the overall fault tolerance and availability of the service are improved.

Note

To maximize service availability in extreme scenarios, the default panic threshold is set to 1%. When the percentage of healthy nodes drops to this threshold or below, the system switches to panic mode and forwards requests to all nodes.

You can adjust this threshold based on your business scenarios and disaster recovery capabilities. This helps achieve the best balance between stability and service correctness.
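The routing decision above reduces to a simple percentage check. Here is a minimal sketch of it, assuming a map of node names to health states; the function name and data shape are illustrative.

```python
# Hedged sketch of the panic-mode routing decision; names and structure
# are illustrative, not the gateway's actual implementation.

def pick_targets(nodes: dict, panic_threshold_pct: float = 1.0) -> list:
    """nodes maps node name -> True (healthy) / False (unhealthy or ejected)."""
    healthy = [name for name, ok in nodes.items() if ok]
    healthy_pct = 100.0 * len(healthy) / len(nodes)
    if healthy_pct > panic_threshold_pct:
        # Normal mode: route only to nodes marked healthy.
        return healthy
    # Panic mode: health results are bypassed and all nodes receive traffic.
    return list(nodes)
```

Raising the threshold makes the system enter panic mode earlier (favoring availability); lowering it keeps health-based routing in effect longer (favoring correctness).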

Troubleshoot health check exceptions

General health check exceptions

Follow these steps to troubleshoot:

  • If a TCP health check fails, it means that a connection cannot be established with the corresponding node. Confirm the following:

    • Whether the node exists.

    • Whether the number of concurrent connections is too high for the node to handle.

  • If an HTTP health check fails, switch to a TCP health check and confirm whether a connection can be established. If the TCP health check is successful, verify that the configured health check path is correct. You can use tools such as curl or Postman to test access.
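The two troubleshooting steps above, a raw TCP connect followed by an HTTP GET against the health check path, can also be reproduced with a small script instead of curl or Postman. This is a minimal sketch; the host, port, and path you pass in are placeholders for your own service.

```python
# Minimal local probes mirroring the troubleshooting steps above:
# first a TCP connect, then an HTTP GET on the health check path.

import socket
import http.client

def tcp_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Step 1: return True if a TCP connection can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_health_ok(host: str, port: int, path: str = "/", timeout: float = 2.0) -> bool:
    """Step 2: return True if GET <path> answers with a 2xx status."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return 200 <= status < 300
    except OSError:
        return False
```

If `tcp_alive` succeeds but `http_health_ok` fails, the node is reachable and the problem is most likely the configured health check path or status code.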

Health check exceptions when adding a service for the first time

Follow these steps to troubleshoot:

  1. Confirm that the VPC of the purchased gateway is the same as the VPC of the service instance. Alternatively, confirm that the service environment is connected to the gateway VPC through Cloud Enterprise Network (CEN) or a leased line. If the VPCs are different and not connected, the gateway cannot access the instance IP address.

    Note

    The gateway does not support on-premises services that are registered with Nacos or ZooKeeper instances.

  2. Confirm that security group authorization is granted. If the service source is an ACK service, grant authorization to the security group of the container cluster. For more information, see Set security group rules.

  3. If the IP address of the unhealthy instance is an Internet IP address, confirm whether an Internet NAT gateway is enabled for the VPC where the gateway resides.