Configure liveness, readiness, and startup probes using HTTP GET, TCP socket, or custom commands to monitor container health and prevent traffic to failed instances.
## Limitation
Health checks are available only for services deployed using custom images that include health check logic.
## How it works
EAS uses the Kubernetes health check mechanism with probes and health check methods to monitor service health and availability.
Probe types:

| Probe type | Description |
| --- | --- |
| Liveness probes | Determine whether a container is running. If a liveness probe fails, the kubelet kills the container and applies the restart policy. If no liveness probe is configured, the kubelet assumes the probe always returns Success. |
| Readiness probes | Determine whether a container is ready to serve requests. Only ready Pods can receive traffic. The association between a Service and its Endpoints depends on Pod readiness: if a Pod is not ready, its IP address is removed from the Endpoint list; when the Pod becomes ready, its IP address is added back. |
| Startup probes | Determine when a container's application has started. A startup probe delays liveness and readiness checks until the container fully initializes, preventing termination of slow-starting containers. |
Health check methods:

| Health check method | Description |
| --- | --- |
| http_get | Sends an HTTP GET request to check service health. The response status code determines success. |
| tcp_socket | Opens a TCP connection to check service health. The check succeeds if the connection can be established. |
| exec | Executes a specified command inside the container. The check succeeds if the command exits with code 0. |
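The three methods map onto standard checks that you can reproduce locally. The following sketch simulates each one with the Python standard library; the throwaway local HTTP server exists only so the checks have something to probe and is not part of EAS:

```python
import socket
import subprocess
import sys
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Throwaway local server that answers 200 OK, standing in for a healthy service.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok')

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def http_get_check(host, port, path='/'):
    """http_get: success if the HTTP status code indicates health."""
    try:
        with urllib.request.urlopen(f'http://{host}:{port}{path}', timeout=1) as r:
            return 200 <= r.status < 400
    except OSError:
        return False

def tcp_socket_check(host, port):
    """tcp_socket: success if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False

def exec_check(cmd):
    """exec: success if the command exits with code 0."""
    return subprocess.run(cmd).returncode == 0

ok_http = http_get_check('127.0.0.1', port)
ok_tcp = tcp_socket_check('127.0.0.1', port)
ok_exec = exec_check([sys.executable, '-c', 'pass'])
print(ok_http, ok_tcp, ok_exec)
```

All three checks succeed against the healthy local server; in EAS, the same logic runs against your container on the configured port.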
## Prepare a custom image
Use a web framework to encapsulate your prediction logic. The following example uses the Flask framework and an app.py file:
```python
import json

from flask import Flask, request, make_response

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def process_handle_func():
    # Parse the request body based on your requirements.
    data = request.get_data().decode('utf-8')
    body = json.loads(data)
    res = process(body)
    # Set the response based on your requirements.
    response = make_response(res)
    response.status_code = 200
    return response

def process(data):
    # Your prediction logic
    return 'result'

if __name__ == '__main__':
    # Note: Set host to '0.0.0.0'. Otherwise, the health check fails during
    # service deployment. The port must match the port specified in the JSON
    # deployment configuration file for the service.
    app.run(host='0.0.0.0', port=8000)
```
Write a simple Dockerfile to copy the prediction code and install required packages. The following is an example Dockerfile:
```dockerfile
# This example uses Python.
FROM registry.cn-shanghai.aliyuncs.com/eas/bashbase-amd64:0.0.1
COPY ./process_code /eas
RUN /xxx/pip install required_packages
CMD ["/xxx/python", "/eas/xxx/app.py"]
```
For steps on building a custom image, see Build Images on a Container Registry Enterprise Edition Instance. For more information, see Custom images. Alternatively, store your code in a NAS file system or Git repository and attach it to the service instance via storage mount during deployment. For more information, see Storage mount. This topic describes how to configure health checks during service deployment.
## Configure health checks
### Custom deployment
1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
2. On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
3. In the Environment Information section, configure the key parameters. For other parameters, see Deploy a custom inference service.

   | Parameter | Description |
   | --- | --- |
   | Image Configuration | Select Image Address and enter the custom image address. For example, registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz. |
   | Command | Entry command for the image. Only single commands are supported, not complex scripts. This command must match the Dockerfile command. For example, /data/eas/ENV/bin/python /data/eas/app.py. |
   | Port | Enter a port number. This is the local HTTP port the image listens on after starting, such as 8000. Important: The EAS engine listens on fixed ports 8080 and 9090, so your container port cannot be 8080 or 9090. This port must also match the port specified in the xxx.py file referenced by the run command. |
   | Health Check | Turn on the Health Check switch, configure the parameters, and click OK. For parameter details, see the Health check parameters table. Note: You can add up to three health checks, each with a unique probe type. |

4. After configuring the parameters, click Deploy.
### JSON deployment
Create a JSON file named service.json. The following is an example of the file content.
```json
{
  "metadata": {
    "name": "test",
    "instance": 1,
    "enable_webservice": true
  },
  "cloud": {
    "computing": {
      "instance_type": "ml.gu7i.c16m60.1-gu30"
    }
  },
  "containers": [
    {
      "image": "registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz",
      "env": [
        {
          "name": "VAR_NAME",
          "value": "var_value"
        }
      ],
      "liveness_check": {
        "http_get": {
          "path": "/",
          "port": 8000
        },
        "initial_delay_seconds": 3,
        "period_seconds": 3,
        "timeout_seconds": 1,
        "success_threshold": 2,
        "failure_threshold": 4
      },
      "command": "/data/eas/ENV/bin/python /data/eas/app1.py",
      "port": 8000
    }
  ]
}
```
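Before deploying, it can help to assemble and sanity-check the container section programmatically. The following sketch is illustrative only: `make_container` is not an EAS API, but the field names mirror the service.json example above, and the port check enforces the reserved-port rule from the console parameters:

```python
import json

RESERVED_PORTS = {8080, 9090}  # fixed ports used by the EAS engine

def make_container(image, command, port,
                   path='/', initial_delay=3, period=3,
                   timeout=1, success=2, failure=4):
    """Assemble a container entry with an http_get liveness check.

    Illustrative helper; field names mirror the service.json example.
    """
    if port in RESERVED_PORTS:
        raise ValueError(f'port {port} is reserved by the EAS engine')
    return {
        'image': image,
        'command': command,
        'port': port,
        'liveness_check': {
            'http_get': {'path': path, 'port': port},
            'initial_delay_seconds': initial_delay,
            'period_seconds': period,
            'timeout_seconds': timeout,
            'success_threshold': success,
            'failure_threshold': failure,
        },
    }

container = make_container(
    'registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz',
    '/data/eas/ENV/bin/python /data/eas/app1.py',
    8000,
)
print(json.dumps(container, indent=2))
```

Passing 8080 or 9090 raises a ValueError instead of producing a configuration that would conflict with the EAS engine.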
The following table describes the key parameters. For other parameters, see JSON deployment.
| Parameter | Description |
| --- | --- |
| image | Address of the custom image used to deploy the model service. EAS does not provide public network access, so use a VPC internal registry address. For example, registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz. |
| env.name | Name of the environment variable. |
| env.value | Value of the environment variable. |
| command | Entry command for the image. Supports only single commands, not complex scripts. For example, /data/eas/ENV/bin/python /data/eas/app1.py. |
| port | Network port that the process in the image listens on. For example, 8000. Important: This port must match the port configured in the xxx.py file specified in the command. |
| liveness_check | Health check configuration. Note: This example uses a liveness probe; you can also use a readiness or startup probe. The sub-parameters are described in the following rows. |
| liveness_check.http_get | Uses the HTTP GET method to check the specified port. Parameters: path (the request path) and port (the port to check). The two other health check methods are tcp_socket and exec. |
| liveness_check.initial_delay_seconds | Delay in seconds after the container starts before the first health check runs. Default: 0. |
| liveness_check.period_seconds | Interval in seconds between health checks. Default: 10. A short interval increases Pod overhead, while a long interval delays failure detection. |
| liveness_check.timeout_seconds | Number of seconds after which the health check times out. Default: 1. A timeout is counted as a failure. |
| liveness_check.failure_threshold | Number of consecutive failures after a success required to mark the container as failed. Default: 3 for readiness probes, 1 for liveness and startup probes. |
| liveness_check.success_threshold | Number of consecutive successes after a failure required to mark the container as successful. Default: 1. |
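These timing parameters interact: a failing container is not acted on immediately, but only after failure_threshold consecutive failed probes. The following back-of-the-envelope estimate (an approximation, not an EAS-documented formula) shows a rough upper bound on detection time:

```python
def worst_case_detection_seconds(initial_delay, period, timeout, failure_threshold):
    """Rough upper bound on time to declare failure after container start:
    wait out the initial delay, then failure_threshold probes, each of which
    may take up to `timeout` seconds before counting as a failure.
    """
    return initial_delay + failure_threshold * (period + timeout)

# Values from the service.json example above: 3 + 4 * (3 + 1) = 19 seconds.
print(worst_case_detection_seconds(3, 3, 1, 4))
```

Shortening period_seconds or lowering failure_threshold detects failures sooner, at the cost of more probe overhead and a higher risk of restarting a container on a transient blip.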