Custom deployment via console - Platform For AI - Alibaba Cloud Documentation Center

EAS (Elastic Algorithm Service) custom deployment provides a flexible and comprehensive way to host AI inference services, letting you package any algorithm or model into an online service.

Note

We recommend that you first try scenario-based deployment for use cases like LLMs and ComfyUI. If that approach does not meet your needs, use custom deployment.

Quick start: Deploy a simple web service

This section shows how to quickly deploy a simple web service by using image-based deployment.

Step 1: Prepare the code file

Save the following Flask application code as an app.py file. Note that the service listens on port 8000.

app.py

from flask import Flask

app = Flask(__name__)

@app.route('/hello')
def hello_world():
    # Your model inference or other business logic goes here.
    return 'Hello World'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

Step 2: Upload the code to OSS

Upload the app.py file to an OSS bucket. Make sure that the OSS bucket and the EAS workspace are in the same region. For example, upload the file to the oss://examplebucket/code/ directory.

Step 3: Configure and deploy the service

Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
On the configuration page, configure the key parameters in the Environment Information and Resource Information sections as follows:
- Deployment Method: Select Image-based Deployment.
- Image Configuration: Select the Alibaba Cloud Image python-inference:3.9-ubuntu2004.
- Storage Mounting: Mount the OSS directory that contains the app.py file to the /mnt/data/ path in the container.
  - URI: Select the OSS directory where your code is located. In this example, the directory is oss://examplebucket/code/.
  - Mount Path: Specify a local path in the container for the directory. In this example, the path is /mnt/data/.
- Command: Because app.py has been mounted to the /mnt/data/ directory of the container, the startup command is: python /mnt/data/app.py.
- Third-party Library Settings: The sample code depends on the flask library, which is not included in the selected official image. Add flask under Third-party Libraries so that EAS installs it automatically when the service starts.
- Resource Configuration: Allocate appropriate compute resources for the service. For this simple example, a small CPU instance is sufficient.
  - Resource Type: Public resources.
  - Resource Specification: Select ecs.c7.large.
After you complete the configuration, click Deploy. When the service status changes to Running, you can invoke the service.
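
Once the service is Running, a call to the /hello route from the quick start can be sketched as follows. This is a minimal sketch using only the Python standard library. It assumes that the service token is passed in the Authorization request header and that you copy the endpoint and token from the service's invocation information in the console. The endpoint and token values shown in the comment are hypothetical placeholders.

```python
import urllib.request


def build_request(endpoint: str, token: str, path: str = "/hello") -> urllib.request.Request:
    """Build a GET request for the deployed service.

    Assumption: the service token is sent as the Authorization header value.
    """
    url = endpoint.rstrip("/") + path
    return urllib.request.Request(url, headers={"Authorization": token}, method="GET")


def call_service(endpoint: str, token: str, timeout: float = 5.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(endpoint, token), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage (hypothetical values, replace with your own):
# print(call_service("http://<your-service-endpoint>", "<your-token>"))
```

Separating build_request from call_service makes it possible to inspect the URL and headers without sending any traffic.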

Additional configuration

Runtime environment

In the Environment Information section, configure the runtime environment and dependencies for your service.

Parameter	Description
Image Configuration	The base runtime environment for the service. You can use official images provided by PAI, or use your own image by selecting Custom Image or entering an Image Address. For more information, see Custom images. Note If the image includes a Web UI, select Enable Web App. EAS automatically starts the web server, allowing you to access the front-end page directly.
Mount storage	Mount models, code, or data from cloud storage services like OSS and NAS to a local path in the container. This practice decouples code and data from the environment and simplifies independent updates. For more information, see storage mounting.
Mount dataset	To manage versions of your models or data, you can use the Dataset feature for mounting. For more information, see Create and manage datasets.
Command	Set the startup command for the image, such as `python /run.py`.
Port Number	Set the port that the service listens on. You can leave this parameter empty if the service does not receive traffic from the EAS gateway, for example, when it subscribes to a message queue. Important The EAS engine reserves ports 8080 and 9090. To avoid startup conflicts, do not use these ports when you deploy a new service.
Third-party Library Settings	If you only need to install a few extra Python libraries, you can add the library names directly or specify a Path of requirements.txt to avoid rebuilding the image.
Environment Variables	Set environment variables for the service instance as key-value pairs.
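
The Port Number and Environment Variables settings can work together: the application reads its port from an environment variable and refuses the two ports that EAS reserves. The following is a minimal sketch. The variable name SERVICE_PORT is a hypothetical example, not a name that EAS defines. You would add it yourself as a key-value pair under Environment Variables.

```python
import os

# Ports that the EAS engine reserves, as stated in the Port Number description.
RESERVED_PORTS = frozenset({8080, 9090})


def resolve_port(default: int = 8000, env_var: str = "SERVICE_PORT") -> int:
    """Return the listening port, taken from an environment variable if set.

    SERVICE_PORT is a hypothetical variable name chosen for this sketch.
    """
    port = int(os.environ.get(env_var, default))
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} is outside the valid range 1-65535")
    if port in RESERVED_PORTS:
        raise ValueError(f"port {port} is reserved by EAS; choose another port")
    return port


# Example usage inside app.py from the quick start:
# app.run(host="0.0.0.0", port=resolve_port())
```

Failing fast at startup with a clear message is usually easier to diagnose than a port conflict that surfaces later in the service logs.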

Compute resources

In the Resource Information section, you can configure compute resources for your service, including selecting a resource type and instance specification, configuring the system disk, and setting the number of replicas and scheduling policies. For more information, see resource configuration.

Service networking

VPC: Configure a VPC if your service needs to access public network resources, access databases or message queues within a VPC, or if you want clients to call the service directly through a VPC. For more information, see Accessing public or internal network resources.
Service invocation: After deployment, choose an appropriate invocation method based on your business scenario, such as a gateway, NLB, Nacos, or gRPC. For more information, see service invocation.

Service security

In the Features section, configure authentication and data security for your service.

Parameter	Description
Custom Authentication	If you do not want to use the system-generated token, you can customize the authentication token for service access.
Configure Secure Encryption Environment	Integrate with the system trust management service to securely encrypt data, models, and code during deployment and invocation, enabling trusted inference. This feature primarily applies to mounted storage files. Enable this feature after you configure storage mounting. For more information, see Secure encrypted inference service.
Instance RAM Role	Associating an Instance RAM Role with an instance allows the service's code to use Security Token Service (STS) temporary credentials to access other cloud resources. This method eliminates the need for fixed AccessKeys, reducing the risk of key leakage. For more information, see Configure an EAS RAM role.
AI Safety Guardrail	This feature performs content safety checks on the input and output of LLM inference services to intercept harmful content. It supports the OpenAI Completions (text-to-text), Chat Completions (text-to-text), and Image Generation (text-to-image) APIs. Note This feature is available only in the China (Shanghai) region.
Visibility	Controls the visibility of the service in the service list: Visible in Workspace: All members of the workspace can view the service. This is suitable for team collaboration. Owner Only: Only the user who created the service can view it. This is suitable for personal testing or sensitive services.

Service stability

In the Features section, you can configure the following stability-related settings:

Service Response Timeout Period: Configure the timeout period for each request. The default value is 5 seconds. For time-consuming scenarios such as large model inference or streaming output, increase this value to prevent requests from being truncated.
Health Check: The system periodically probes instance liveness, automatically replacing abnormal instances to enable fault self-healing. For more information, see health check.
Compute monitoring & fault tolerance: Monitors the health status of compute resources for distributed inference services in real time, enabling automatic fault detection and intelligent self-healing. For more information, see Compute power monitoring and fault tolerance.
Deployment and update strategies: Use features like canary release, rolling updates, graceful shutdown, and update schedules to ensure uninterrupted service during service version upgrades. For more information, see Release management.
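
For the Health Check setting to be useful, the service needs something cheap to probe. The following is a minimal sketch of a dedicated health endpoint built with the Python standard library. It assumes that an HTTP probe is configured against a path such as /health. The path name and the response body are illustrative choices, not values that EAS requires.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with HTTP 200 while the process can serve requests."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Suppress per-request logging so frequent probes do not flood the logs.
        pass


def make_server(port: int = 8000) -> HTTPServer:
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)


# Example usage:
# make_server(8000).serve_forever()
```

Keeping the health route free of model inference means that a slow prediction does not cause a healthy instance to be replaced.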

Service performance

Use storage acceleration and intelligent scheduling to speed up service startup, increase throughput, and reduce latency.

Distributed cache acceleration: Caches model or data files from mounted storage, such as OSS, to local instance storage to reduce I/O latency. For more information, see Model cache acceleration.
Model Weight Service (MoWS): Significantly improves scaling efficiency and service startup speed for large-scale instance deployments by caching model weights locally and sharing them across instances. For more information, see Model Weight Service.
LLM Intelligent Router: For LLM services with multiple backend instances, this feature dynamically distributes requests based on the backend load. This balances the compute power and GPU memory usage across instances and improves cluster resource utilization. For more information, see LLM intelligent router deployment.

Service observability

In the Features section, enable the following observability features to gain insights into service status and troubleshoot issues:

Parameter	Description
Save Call Records	Persist all service request and response records in MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting. MaxCompute: Select an existing MaxCompute Project (or create one if you do not have one) and specify a name for the MaxCompute Table. The system automatically creates this table in the selected project upon service deployment. Simple Log Service: Select an existing Simple Log Service Project (or create one if you do not have one) and specify a name for the Logstore. The system automatically creates this Logstore in the selected project upon service deployment.
Tracing Analysis	Some official images have a built-in collection component that allows you to enable tracing with a single click. For other images, you can integrate an ARMS agent through simple configuration for end-to-end monitoring of the service call chain. For more information, see Enable tracing for LLM applications in EAS. To configure tracing: (1) In the Command field, add `aliyun-bootstrap -a install && aliyun-instrument python app.py` to install the agent and start the application with the ARMS Python agent. Replace app.py with your application's main file. (2) In the Third-party Library Configuration section, add aliyun-bootstrap to download the agent installer from the PyPI repository.

Asynchronous and elastic services

Asynchronous inference: For long-running inference scenarios such as AIGC and video processing, we recommend that you enable an Asynchronous Queue. The client receives an immediate response and can retrieve the final result by polling or using a callback. For more information, see Deploy an asynchronous inference service.
Elastic Job Service: In the Features section, enable Task Mode to deploy the inference service as an on-demand job service. This is suitable for batch data processing and scheduled tasks. Resources are automatically released after task completion to save costs. For more information, see Elastic Job Service.
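
The polling pattern mentioned for asynchronous inference can be sketched on the client side as follows. This is a generic illustration, not the EAS queue API: fetch is a hypothetical callable that you supply, and it is assumed to return None while the result is not ready and the result once it is available.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_for_result(
    fetch: Callable[[], Optional[T]],
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() repeatedly until it returns a non-None result.

    Raises TimeoutError if no result arrives within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        # Wait before the next attempt to avoid hammering the service.
        sleep(interval)
```

A callback avoids this loop entirely, so polling is mainly useful when the client cannot expose an endpoint of its own.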

JSON configuration

In the Service Configuration section, you can view and directly edit the complete JSON configuration for the current UI settings.

Note

For automation and fine-grained configuration, you can also define and deploy services directly by using a JSON file. For more information, see Deploy services by using JSON.
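
As a rough illustration, the quick-start settings could be assembled into a JSON document as follows. Every field name in this sketch (metadata, instance, cloud.computing.instance_type, containers, script, port, storage, mount_path, oss.path) is an assumption and may not match the actual schema. Compare the output with the JSON that the Service Configuration section generates for your service before you rely on it. The service name hello_flask is a hypothetical example.

```python
import json


def build_service_config(
    name: str,
    image: str,
    oss_uri: str,
    mount_path: str,
    command: str,
    port: int,
    instance_type: str,
    replicas: int = 1,
) -> dict:
    """Assemble a service definition as a dict.

    All keys below are assumed field names, used for illustration only.
    """
    return {
        "metadata": {"name": name, "instance": replicas},
        "cloud": {"computing": {"instance_type": instance_type}},
        "containers": [{"image": image, "script": command, "port": port}],
        "storage": [{"oss": {"path": oss_uri}, "mount_path": mount_path}],
    }


config = build_service_config(
    name="hello_flask",  # hypothetical service name
    image="python-inference:3.9-ubuntu2004",  # image name as shown in the console
    oss_uri="oss://examplebucket/code/",
    mount_path="/mnt/data/",
    command="python /mnt/data/app.py",
    port=8000,
    instance_type="ecs.c7.large",
)
print(json.dumps(config, indent=2))
```

Generating the file from code keeps values such as the mount path and the startup command in one place, so they cannot drift apart.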