
Platform For AI: Console-based custom deployment

Last Updated: Oct 27, 2025

The custom deployment of Elastic Algorithm Service (EAS) provides a flexible environment for hosting any AI model or algorithm as a scalable inference service. The service supports a wide range of workloads, from large language models (LLMs) to custom code. This guide is for advanced users who need full control over the runtime and configuration. For a streamlined setup with common applications like LLMs and ComfyUI, consider scenario-based deployment first.

How it works

An EAS service essentially runs in one or more isolated container instances. EAS deploys services based on the following core components:

  • Environment image: A read-only package containing the operating system, foundational libraries (such as CUDA), language runtimes (such as Python), and other required dependencies. Options include using official PAI-provided images or building custom images to meet specific application requirements.

  • Code and model files: These files include business logic code and model weights. The best practice is to store them on Object Storage Service (OSS) or File Storage NAS (NAS) and access them by mounting. This decouples your code and models from the runtime environment, allowing you to iterate on them independently by simply updating the files in storage, without rebuilding the image.

  • Storage mounting: When the service starts, EAS mounts the specified OSS or NAS path to a local directory inside the container. This allows the code in the container to access code and models on external storage as if they were local files.

  • Run command: The first command to execute after the container starts. This is typically a command to start an HTTP server, such as python app.py.

The workflow is as follows:

  1. EAS pulls the specified image to create a container.

  2. EAS mounts the external storage to a specified path in the container.

  3. EAS executes the run command inside the container.

  4. After the command is successfully executed, the service starts listening on the specified port and processing inference requests.

Note

EAS supports two deployment methods: image-based and processor-based. Image-based deployment is the recommended method, as it offers maximum flexibility and maintainability. The processor-based method is a legacy option with significant limitations.

Usage notes

  • Inactive services (not in a running state for 180 consecutive days) are subject to automatic deletion.

  • When invoking a service through a gateway, the request body cannot exceed 1 MB.

  • Avoid using ports 8080 and 9090 because they are reserved by the EAS engine.

Procedure

This section shows how to quickly deploy a simple web service using the image-based deployment method.

Step 1: Prepare the code file

Save the following Flask application code as an app.py file. Note that the service listens on port 8000.

app.py

from flask import Flask

app = Flask(__name__)

@app.route('/hello')
def hello_world():
    # TODO: Implement model inference or other business logic
    return 'Hello World'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
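
Optionally, verify the application on your own machine before uploading it: run python app.py in one terminal, then call the endpoint from another. The following is a minimal local check that assumes Flask and the requests library are installed locally.

# Quick local check of the /hello endpoint.
import requests

print(requests.get('http://127.0.0.1:8000/hello').text)  # expected output: Hello World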

Step 2: Upload the code to OSS

Upload the app.py file to an OSS Bucket. Make sure that the OSS Bucket and the EAS workspace are in the same region. For example, upload the file to the oss://examplebucket/code/ directory.
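
You can upload the file in the OSS console or with ossutil. The following sketch is an alternative that uses the oss2 Python SDK; the AccessKey pair, endpoint, and bucket name are placeholders that you must replace with your own values.

# Upload the local app.py to oss://examplebucket/code/app.py with the oss2 SDK.
import oss2

auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'examplebucket')  # use the endpoint of your region
bucket.put_object_from_file('code/app.py', 'app.py')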

Step 3: Configure and deploy the service

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the configuration page, configure the key parameters in the Environment Information and Resource Information sections as follows:

    • Deployment Method: Select Image-based Deployment. In the Image Configuration field, select the Alibaba Cloud Image python-inference:3.9-ubuntu2004.

    • Mount Storage: Click OSS and mount the OSS directory that contains app.py to the /mnt/data/ path in the container.

      • Uri: The path to the OSS directory that contains the code, such as oss://examplebucket/code/.

      • Mount Path: The path in the container to the mounted directory. For example, /mnt/data/.

    • Command: The startup command is python /mnt/data/app.py because app.py has been mounted to the /mnt/data/ directory of the container.

    • Third-party Library Settings: The sample code depends on the flask library, which is not included in the official image. Add flask to the Third-party Libraries to have it installed automatically on startup.

    • Resource configuration: Configure the computing resources for the service. For this example, a small CPU instance is sufficient.

      • Resource Type: Public Resources.

      • Instance Type: ecs.c7.large.

  4. After completing the configuration, click Deploy. The deployment is complete when the service status changes to Running. Then proceed to invoke the service.
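
For example, a minimal invocation sketch in Python: copy the service endpoint and token from the service details page (the values below are placeholders), then send a request to the /hello route defined in app.py with the token in the Authorization header.

# Call the deployed service; the request is authenticated with the service token.
import requests

ENDPOINT = 'http://<service-endpoint>'  # placeholder: copy from the service details page
TOKEN = '<service-token>'               # placeholder: copy from the service details page

resp = requests.get(ENDPOINT + '/hello', headers={'Authorization': TOKEN})
print(resp.status_code, resp.text)      # expected: 200 Hello World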

References

Manage environment and dependencies

In the Environment Information section, configure the environments and dependencies for the service.

  • Image Configuration: The image serves as the basic runtime environment for the service. Use an official image provided by PAI, or use a custom-built image by selecting Custom Image or entering an Image Address. For more information, see Custom Images.

    Note: If the image contains a WebUI, select Enable Web App. EAS then automatically starts a web server to provide direct access to the frontend page.

  • Mount Storage: Mount models, code, or data from cloud storage services such as OSS and NAS to a local path in the container. This decouples the code and data from the environment and allows them to be updated independently. For more information, see Storage mounting.

  • Mount Dataset: To manage versions of your models or data, mount them as a Dataset. For more information, see Create and manage datasets.

  • Command, Port Number: Set the image startup command, such as python /run.py, and the port on which the service listens.

    Important: The EAS engine listens on fixed ports 8080 and 9090. Avoid using these two ports.

  • Third-Party Library Settings: If you only need to install a few additional Python libraries, add the library names directly or specify a Path of Requirements.txt to avoid rebuilding the image.

  • Environment Variables: Set environment variables for the service instance as key-value pairs (a sketch of how code reads these settings follows at the end of this section).

For GPU-accelerated instances, specify the GPU driver version through Features > Resource Configuration > GPU Driver Version to meet the runtime requirements of specific models or frameworks.
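
As a minimal sketch of how these settings reach your code: the mounted directory appears as a local path, and environment variables are read with the standard library. The variable names and the /mnt/data/model path below are illustrative assumptions, not fixed EAS conventions.

# Read illustrative environment variables and locate files on the mount path.
import os

model_dir = os.environ.get('MODEL_DIR', '/mnt/data/model')  # hypothetical variable and mount path
log_level = os.environ.get('LOG_LEVEL', 'INFO')             # hypothetical key-value pair

print('loading weights from', model_dir, 'with log level', log_level)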

Configure computing resources

In the Resource Information section, configure a service's computing resources.

  • Resource Type: Select public resources, an EAS resource group, or a resource quota.

    Note: Enable GPU Sharing to deploy multiple model services on a single GPU card and improve GPU resource utilization by sharing computing power. This feature is suitable for small models or low inference loads and can be enabled only when you use an EAS resource group or a resource quota. For more information, see GPU Sharing.

  • Instances: Configure multiple instances to avoid the risk of a single point of failure.

  • Deployment Resources: When you use public resources with supported specifications, enable Bidding and set a maximum bid price to preempt idle resources at a price much lower than that of regular instances. This feature is suitable for inference tasks that are not sensitive to interruptions.

  • Configure a System Disk:

    • If you use public resources, EAS provides a 30 GiB system disk free of charge. If you need more capacity, you are charged for the actual usage.

    • If you use an EAS resource group or a resource quota, a 60 GiB system disk is enabled by default. If you change the capacity, the space is allocated from the host.

  • Elastic Resource Pool: Enables cloud bursting. When traffic exceeds the capacity of your dedicated resources (EAS resource groups or resource quotas), the service automatically bursts to on-demand public resources to handle the spike. During scale-in, these public instances are released first to minimize costs. For more information, see Elastic resource pool.

  • High-priority Resource Rescheduling: After you enable this feature, the system periodically tries to migrate service instances from low-priority resources, such as public resources or regular instances, to high-priority resources, such as dedicated resource groups or spot instances. This optimizes costs and resource allocation. The feature addresses the following problems:

    • During a rolling update, new instances may be temporarily scheduled to the public resource group because old instances still occupy the dedicated resources; rescheduling migrates them back afterward.

    • When you use both spot instances and regular instances, the system periodically checks for more economical spot instances and migrates instances to them when they become available.

Service registration and network

EAS provides flexible service registration and network configuration options to meet different business integration needs. For more information, see Service Invocation.

  • Select Gateway: By default, services are exposed through a complimentary Shared Gateway. For advanced capabilities, such as custom domain names and fine-grained access control, you can upgrade to a paid Dedicated Gateway. For more information, see Invoke through a dedicated gateway.

    Important: When you call a service through a gateway, the request body cannot exceed 1 MB.

  • VPC Configuration: Configure a VPC, vSwitch, and security group to enable direct access to the service from within the VPC or to allow the service to access Internet resources. For more information, see Network configuration.

  • Associate NLB: Associate the service with a Network Load Balancer (NLB) instance for more flexible and fine-grained load balancing. For more information, see Invoke services using associated NLB instances.

  • Service Discovery Nacos: Register the service with the Microservices Registry to enable automatic service discovery and synchronization in a microservices architecture. For more information, see Call a service using Nacos.

To enable high-performance RPC communication, activate gRPC support for the service gateway through Features > Advanced Networking.

Service security

To enhance service security, configure the following parameters in the Features section:

  • Custom Authentication: If you do not want to use the system-generated token, customize the authentication token for service access here.

  • Configure Secure Encryption Environment: Integrates with confidential computing services to run inference within a secure enclave, encrypting data, models, and code while in use. This feature is primarily for mounted storage files, so you must mount the storage before you enable it. For more information, see Secure Encrypted Inference Service.

  • Instance RAM Role: Associate a RAM role with the instance so that code within the service can use STS temporary credentials to access other cloud resources. This eliminates the need to configure a fixed AccessKey and reduces the risk of key leakage. For more information, see Configure an EAS RAM role.

Ensure service stability and high availability

The group feature in the Basic Information section lets you group multiple service versions or services that use heterogeneous resources. These groups can then be used with traffic management policies to implement phased releases. For more information, see Phased Release.

To ensure the stability and reliability of services in your production environment, configure the following parameters in the Features section:

  • Service Response Timeout Period: Configure an appropriate timeout for each request. The default is 5 seconds. This prevents slow requests from occupying service resources for a long time.

  • Health Check: When you configure a health check for the service, the system periodically checks the health status of its instances. If an instance becomes abnormal, the system automatically launches a new instance to enable self-healing. For more information, see Health check.

  • Graceful Shutdown: Configure the Graceful Shutdown Time so that during service updates or scale-ins, an instance has sufficient time to finish processing the requests it has already received before it exits. This prevents request processing from being interrupted. Enable Send SIGTERM for more fine-grained exit handling at the application layer, as shown in the sketch after this list. For more information, see Rolling updates and graceful exit.

  • Rolling Update: Configure Number of Instances Exceeding Expectation and Maximum Number of Unavailable Instances to gain fine-grained control over the instance replacement policy during a service update and complete the version upgrade without interrupting the service. For more information, see Rolling updates and graceful exit.
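
The following is a minimal, non-EAS-specific sketch of application-level handling when Send SIGTERM is enabled: the process receives SIGTERM before it is stopped, marks itself as draining, and lets in-flight requests finish within the configured graceful shutdown time. The /health route name is an illustrative assumption.

# Extend the Flask example with a drain flag that is set when SIGTERM arrives.
import signal
import threading
from flask import Flask

app = Flask(__name__)
draining = threading.Event()

@app.route('/health')
def health():
    # Report unhealthy once draining starts so no new traffic is routed here.
    return ('draining', 503) if draining.is_set() else ('ok', 200)

def handle_sigterm(signum, frame):
    draining.set()  # stop accepting new work; in-flight requests finish normally

signal.signal(signal.SIGTERM, handle_sigterm)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)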

Performance optimization

The following configurations are crucial for improving service performance. These configurations can accelerate startup, increase throughput, and reduce latency, especially for resource-intensive applications such as large models.

  • Storage Acceleration

    • Distributed Cache Acceleration: Caches models or data files from mounted storage, such as OSS, on the local instance to improve read speeds and reduce I/O latency. For more information, see Cache data to local directories.

    • Model Weights Service (MoWS): Significantly improves scaling efficiency and service startup speed in large-scale instance deployments through local caching and cross-instance sharing of model weights. For more information, see Model Weight Service.

  • Resource Configuration

    • Shared Memory: Configure shared memory for an instance so that multiple processes in the container can directly read and write the same memory region, avoiding the overhead of data copying and transmission. This suits scenarios that require efficient inter-process communication; see the sketch after this list.

    • Distributed Inference: Deploys a single inference instance across multiple machines that jointly complete an inference task. This solves the problem that ultra-large models cannot be deployed on a single machine. For more information, see Multi-machine distributed inference.

  • Intelligent Scheduling

    • LLM Intelligent Router: When an LLM service has multiple backend instances, the LLM Intelligent Router dynamically distributes requests based on backend load to balance computing power and GPU memory usage across all instances and improve cluster resource utilization. For more information, see LLM intelligent router.
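
As a minimal illustration of the shared-memory pattern, the following sketch uses Python's standard multiprocessing.shared_memory module (Python 3.8+): one process writes a payload into a shared segment and another process reads it without copying. In a container, such segments are typically backed by /dev/shm, which is the space the Shared Memory setting controls.

# Pass a payload between processes through shared memory instead of copying it.
from multiprocessing import Process, shared_memory

def reader(name, size):
    shm = shared_memory.SharedMemory(name=name)   # attach to the existing segment
    print('reader got:', bytes(shm.buf[:size]).decode())
    shm.close()

if __name__ == '__main__':
    payload = b'preprocessed tensor bytes or any other data'
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload              # write once, readable by other processes

    p = Process(target=reader, args=(shm.name, len(payload)))
    p.start()
    p.join()

    shm.close()
    shm.unlink()                                  # release the segment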

Service observation and diagnostics

To gain insights into your service status and quickly troubleshoot problems, enable the following features in the Features section:

  • Save Call Records: Persistently save all request and response records of the service to MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting.

    • MaxCompute

      • MaxCompute Project: Select an existing project. If the list is empty, click Create MaxCompute Project to create one. For more information, see Create a project in the MaxCompute console.

      • MaxCompute Table: Specify the data table name. The system automatically creates this table in the selected project when you deploy the service.

    • Simple Log Service

      • Simple Log Service Project: Select an existing log project. If the list is empty, click Create Simple Log Service Project to create one. For more information, see Manage Project.

      • Logstore: Specify a name for the logstore. When you deploy the service, the system automatically creates this logstore in the selected project.

  • Tracing Analysis: Some official images include a built-in collection component, allowing you to enable tracing with a single click. For other images, integrate an ARMS probe through simple configuration to achieve end-to-end monitoring of service invocations. For more information, see Enable Tracing for LLM-based Applications in EAS. The configuration method is as follows:

    • Add aliyun-bootstrap -a install && aliyun-instrument python app.py to the Command to install the probe and start the application with the ARMS Python probe. Replace app.py with the main file used to provide the prediction service.

    • Add aliyun-bootstrap to the Third-party Library Settings to download the probe installer from the PyPI repository.

Asynchronous and elastic services

  • Asynchronous inference: For long-running inference scenarios, such as AIGC and video processing, enable the Asynchronous Queue. This allows a client to receive an immediate response after initiating a call and obtain the final result later using polling or a callback (see the sketch after this list). For more information, see Deploy an asynchronous inference service.

  • Elastic Job service: In the Features section, enable Task Mode to run inference workloads as on-demand, serverless jobs. Resources are automatically provisioned for the task and released upon completion to save costs. For more information, see Elastic Job Service.
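
The asynchronous pattern typically follows a submit-then-poll flow, as in the hedged sketch below. The endpoint paths, token, and response fields are placeholders for illustration only, not the actual EAS asynchronous queue interface; see the linked topic for the real API and SDKs.

# Hypothetical submit-then-poll flow against a placeholder endpoint.
import time
import requests

ENDPOINT = 'http://<service-endpoint>'          # placeholder
HEADERS = {'Authorization': '<service-token>'}  # placeholder

# Submit the long-running task and return immediately with an identifier.
task_id = requests.post(ENDPOINT + '/submit', json={'prompt': '...'}, headers=HEADERS).json()['task_id']

# Poll until the result is ready (a callback could replace this loop).
while True:
    result = requests.get(ENDPOINT + '/result/' + task_id, headers=HEADERS).json()
    if result.get('status') == 'done':
        print(result['output'])
        break
    time.sleep(5)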

Modify configuration parameters in the JSON file

In the Service Configuration section, view and directly edit the complete JSON for the current UI configuration.

Note

For automated and fine-grained configuration scenarios, you can also use a JSON file to directly define and deploy services. For more information, see JSON deployment.