Platform For AI:Configure health checks

Last Updated:Mar 25, 2026

Configure liveness, readiness, and startup probes using HTTP GET, TCP socket, or custom commands to monitor container health and prevent traffic to failed instances.

Limitation

Health checks are available only for services deployed using custom images that include health check logic.

How it works

EAS relies on the Kubernetes health check mechanism, which combines probe types with health check methods, to monitor service health and availability.

  • Probe types:

    • Liveness probes: Determine whether a container is running. If a liveness probe fails, kubelet kills the container and applies the restart policy. If no liveness probe exists, kubelet assumes the probe always returns Success.

    • Readiness probes: Determine whether a container is ready to serve requests. Only ready Pods can receive traffic. The association between a Service and its Endpoints depends on Pod readiness:

      • If a Pod is not ready, its IP address is removed from the Endpoint list.

      • When the Pod becomes ready, its IP address is added back.

    • Startup probes: Determine when a container's application has started. Liveness and readiness checks are delayed until the container has fully initialized, which prevents slow-starting containers from being terminated.

  • Health check methods:

    • http_get: Sends an HTTP GET request to check service health. The response status code determines success.

    • tcp_socket: Opens a TCP connection to check service health.

    • exec: Executes a specified command inside the container. The check succeeds if the command exits with code 0.
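To make the three methods concrete, the following is a minimal Python sketch of what each check does. It is an illustration only, not the code that kubelet or EAS runs. The function names, the host parameter, and the default timeouts are assumptions made for this example. The status code range used for http_get (at least 200 and below 400) is the one given in the health check parameters later in this topic.

```python
import socket
import subprocess
import urllib.error
import urllib.request


def http_get_check(port, path="/", host="localhost", timeout=1.0):
    """Succeed when an HTTP GET returns a status code >= 200 and < 400."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers 4xx/5xx responses, refused connections, and timeouts.
        return False


def tcp_socket_check(port, host="localhost", timeout=1.0):
    """Succeed when a TCP connection can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exec_check(command, timeout=1.0):
    """Succeed when the command (a list of arguments) exits with code 0."""
    try:
        return subprocess.run(command, timeout=timeout).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False
```

Running these functions against your container locally, for example http_get_check(8000), can help confirm that the image responds as a probe would expect before you deploy it.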

Prepare a custom image

Use a web framework to encapsulate your prediction logic. The following example uses the Flask framework and an app.py file:

import json
from flask import Flask, request, make_response

app = Flask(__name__)

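@app.route('/', methods=['GET', 'POST'])
def process_handle_func():
    # Health check probes send GET requests without a body, so answer them
    # directly instead of trying to parse JSON from an empty request.
    if request.method == 'GET':
        return make_response('ok', 200)
    # Parse the request body based on your requirements.
    data = request.get_data().decode('utf-8')
    body = json.loads(data)
    res = process(body)
    # Set the response based on your requirements.
    response = make_response(res)
    response.status_code = 200
    return response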

def process(data):
    """
       Your prediction logic
    """
    return 'result'

if __name__ == '__main__':
    """
    Note: Set host to '0.0.0.0'. Otherwise, the health check fails during service deployment.
    The port must match the port specified in the JSON deployment configuration file for the service.
    """
    app.run(host='0.0.0.0', port=8000)
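The example above serves every request on /. If your model takes a while to load, a dedicated health path that reports readiness separately from prediction requests may be useful. The following sketch uses Python's standard http.server module instead of Flask so that it runs without extra packages. The /healthz path, the model_ready flag, and the make_server helper are hypothetical names chosen for this example and are not required by EAS.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag: set it once your model has finished loading.
model_ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # 200 while ready, 503 otherwise. An http_get check treats a
            # status code outside the 200-399 range as a failure.
            status = 200 if model_ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        # Silence per-request logging so frequent probes do not flood the logs.
        pass


def make_server(port=8000):
    # Bind to 0.0.0.0, as in the Flask example, so the health check can connect.
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

To try it, call model_ready.set() after loading completes and then make_server().serve_forever(). In that setup, the health check's call path would be /healthz rather than /.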

Write a simple Dockerfile to copy the prediction code and install required packages. The following is an example Dockerfile:

# This example uses Python.
FROM registry.cn-shanghai.aliyuncs.com/eas/bashbase-amd64:0.0.1
COPY ./process_code  /eas
RUN /xxx/pip install required_packages
CMD ["/xxx/python", "/eas/xxx/app.py"] 

For steps on building a custom image, see Build Images on a Container Registry Enterprise Edition Instance. For more information about custom images, see Custom images. Alternatively, store your code in a NAS file system or Git repository and mount it to the service instance during deployment. For more information, see Storage mount. The rest of this topic describes how to configure health checks during service deployment.

Configure health checks

Custom deployment

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. In the Environment Information section, configure the key parameters. For other parameters, see Deploy a custom inference service.

    Parameter

    Description

    Image Configuration

    Select Image Address and enter the custom image address. For example, registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz.

    Command

    Entry command for the image. Only single commands are supported, not complex scripts. This command must match the Dockerfile command. For example, /data/eas/ENV/bin/python /data/eas/app.py.

    Port Number

    Enter a port number. This is the local HTTP port that the image listens on after it starts, such as 8000.

    Important
    • The EAS engine listens on fixed ports 8080 and 9090. Your container port cannot be 8080 or 9090.

    • This port must match the port specified in the xxx.py file referenced by the run command.

    Health Check

    Turn on the Health Check switch, configure the parameters, and click OK. For parameter details, see the Health check parameters table.

    Note

    You can add up to three health checks, each with a unique probe type.

    Health check parameters

    Parameter

    Description

    Probe Type

    The probe types:

    • Liveness Probe: Checks if the container is in a normal running state.

    • Readiness Probe: Ensures the container has finished initialization and is ready to handle requests.

    • Startup Probe: Designed for applications that require long initialization. This probe prevents the system from incorrectly marking the container as failed due to slow startup.

    For more information about how each probe works, see How it works.

    Check Method

    The health check methods:

    • http_get: Sends an HTTP GET request to the container's IP address, port, and path. The container is healthy if the response status code is ≥200 and <400.

    • tcp_socket: Performs a TCP check using the container's IP address and port. The container is healthy if a TCP connection can be established.

    • exec (Custom health check): Executes a specified command in the container. The health check succeeds if the command exits with code 0.

    Call Path

    This parameter is available only when http_get is selected for Check Method.

    Path of the HTTP endpoint to check. The request URL starts with http://localhost, followed by the path that you specify. Default: /.

    Port Number

    This parameter is available only when http_get or tcp_socket is selected for Check Method.

    Port number to check, for example, 8000.

    Command

    This parameter is available only when exec is selected for Check Method.

    Enter the command to run. The console automatically converts your input into the required format and adds it to the deployment JSON.

    Latency for Check Initialization

    Time in seconds to wait after the container starts before the first health check. Default: 15.

    Check Interval

    Interval in seconds between health checks. Default: 10. A short interval increases Pod overhead, while a long interval delays failure detection.

    Check Timeout Period

    Timeout for the health check in seconds. Default: 1. If the check times out, it is marked as failed.

    Check Success Threshold

    Number of consecutive successes required after a failure to mark the container as healthy. Default: 1.

    Check Failure Threshold

    Number of consecutive failures required after a success to mark the container as failed. Default: 3 for readiness probe, 1 for liveness and startup probes.

  4. After configuring the parameters, click Deploy.

JSON deployment

Create a JSON file named service.json. The following is an example of the file content.

{
    "metadata": {
        "name": "test",
        "instance": 1,
        "enable_webservice": true
    },
    "cloud": {
        "computing": {
            "instance_type": "ml.gu7i.c16m60.1-gu30"
        }
    },
    "containers": [
        {
            "image":"registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz",
            "env":[
                {
                    "name":"VAR_NAME",
                    "value":"var_value"
                }
            ],
            "liveness_check":{
                "http_get":{
                    "path":"/",
                    "port":8000
                },
                "initial_delay_seconds":3,
                "period_seconds":3,
                "timeout_seconds":1,
                "success_threshold":2,
                "failure_threshold":4
            },
            "command":"/data/eas/ENV/bin/python /data/eas/app.py",
            "port":8000
        }
    ]
}
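If you generate service.json from a script, a small helper can catch common mistakes before deployment. The following is a sketch under stated assumptions: the helper names http_probe and add_probe are hypothetical, and the checks only encode rules stated in this topic (one health check per probe type, container ports 8080 and 9090 reserved by the EAS engine, and the keys liveness_check, health_check, and startup_check). Rejecting a probe on a reserved port is this example's own choice, made because probes usually target the container port.

```python
import json

# Probe type -> key in the container configuration, as named in this topic.
PROBE_KEYS = {
    "liveness": "liveness_check",
    "readiness": "health_check",
    "startup": "startup_check",
}
RESERVED_PORTS = {8080, 9090}  # fixed ports that the EAS engine listens on


def http_probe(port, path="/", **timing):
    """Build an http_get probe; timing holds fields such as period_seconds."""
    if port in RESERVED_PORTS:
        raise ValueError(f"port {port} is reserved by the EAS engine")
    return {"http_get": {"path": path, "port": port}, **timing}


def add_probe(container, probe_type, probe):
    """Attach a probe to a container dict under the key for its probe type."""
    key = PROBE_KEYS[probe_type]
    if key in container:
        raise ValueError(f"a {probe_type} probe is already configured")
    container[key] = probe
    return container


# Example: rebuild the liveness probe from the JSON file above.
example = {"image": "registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz", "port": 8000}
add_probe(example, "liveness", http_probe(
    8000, initial_delay_seconds=3, period_seconds=3, timeout_seconds=1,
    success_threshold=2, failure_threshold=4))
print(json.dumps({"containers": [example]}, indent=4))
```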

The following table describes the key parameters. For other parameters, see JSON deployment.

Parameter

Description

image

Address of the custom image used to deploy the model service.

EAS does not provide public network access. Use a VPC internal registry address for deployment. For example, registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz.

env

name

Name of the environment variable.

value

Value of the environment variable.

command

Entry command for the image. Only a single command is supported, not complex scripts. For example: /data/eas/ENV/bin/python /data/eas/app.py.

port

Network port that the process in the image listens on. For example, 8000.

Important

This port must match the port configured in the xxx.py file specified in the command.

liveness_check

Note

This example uses a liveness probe. You can also use health_check for a readiness probe or startup_check for a startup probe.

http_get

Uses the HTTP GET method to check the specified port. Parameters:

  • http_get.path: Path of the HTTP endpoint to check. The request URL starts with http://localhost, followed by the path that you specify. Default: /.

  • http_get.port: Port number of the HTTP server to check.

Two other health check methods:

  • tcp_socket: Performs a TCP check using the container's IP address and port. The check succeeds if a TCP connection can be established. Configuration example:

    "tcp_socket":{
        "port":8000
    }
  • exec: Executes a command in the container. The check succeeds if the command exits with code 0. Configuration example:

    "exec":{
        "command":[
            "your_script",
            "with_args"
        ]
    }

initial_delay_seconds

Delay in seconds after the container starts before the first health check runs. Default: 0.

period_seconds

Interval in seconds between health checks. Default: 10. A short interval increases Pod overhead, while a long interval delays failure detection.

timeout_seconds

Number of seconds after which the health check times out. Default: 1. A check that times out is marked as failed.

failure_threshold

Number of consecutive failures required after a success to mark the container as failed. Default: 3 for readiness probe, 1 for liveness and startup probes.

success_threshold

Number of consecutive successes required after a failure to mark the container as healthy. Default: 1.
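The two thresholds can be read as a small state machine: a healthy container is marked as failed only after failure_threshold consecutive failures, and a failed container is marked as healthy again only after success_threshold consecutive successes. The following Python sketch replays a sequence of check results under that reading. It is a simplified model, not the actual kubelet or EAS implementation. The function names and default values are assumptions for illustration, and the detection time formula is only a rough estimate.

```python
def apply_thresholds(results, success_threshold=1, failure_threshold=3, healthy=True):
    """Replay check results (True = success) and return the state after each one."""
    states = []
    successes = failures = 0
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if not healthy and successes >= success_threshold:
                healthy = True
        else:
            failures += 1
            successes = 0
            if healthy and failures >= failure_threshold:
                healthy = False
        states.append(healthy)
    return states


def rough_detection_seconds(period_seconds, failure_threshold, timeout_seconds):
    """Rough upper estimate of the time needed to notice a hung service.

    Assumes the service stops responding right after a successful check and
    that every later check runs for the full timeout before it fails.
    """
    return period_seconds * failure_threshold + timeout_seconds
```

With the values from the service.json example (period_seconds 3, failure_threshold 4, timeout_seconds 1), rough_detection_seconds(3, 4, 1) returns 13. Lowering period_seconds or failure_threshold shortens detection at the cost of more frequent checks or more sensitivity to transient failures.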