The custom deployment of Elastic Algorithm Service (EAS) provides a flexible environment for hosting any AI model or algorithm as a scalable inference service. The service supports a wide range of workloads, from large language models (LLMs) to custom code. This guide is for advanced users who need full control over the runtime and configuration. For a streamlined setup with common applications like LLMs and ComfyUI, consider scenario-based deployment first.
How it works
An EAS service essentially runs in one or more isolated container instances. EAS deploys services based on the following core components:
Environment image: A read-only package containing the operating system, foundational libraries (such as CUDA), language runtimes (such as Python), and other required dependencies. Options include using official PAI-provided images or building custom images to meet specific application requirements.
Code and model files: These files include business logic code and model weights. The best practice is to store them on Object Storage Service (OSS) or File Storage NAS (NAS) and access them by mounting. This decouples your code and models from the runtime environment, allowing you to iterate on them independently by simply updating the files in storage, without rebuilding the image.
Storage mounting: When the service starts, EAS mounts the specified OSS or NAS path to a local directory inside the container. This allows the code in the container to access code and models on external storage as if they were local files.
Run command: The first command to execute after the container starts. This is typically a command that starts an HTTP server, such as python app.py.
The workflow is as follows:
EAS pulls the specified image to create a container.
EAS mounts the external storage to a specified path in the container.
EAS executes the run command inside the container.
After the command is successfully executed, the service starts listening on the specified port and processing inference requests.
EAS supports two deployment methods: image-based and processor-based. Image-based deployment is the recommended method, as it offers maximum flexibility and maintainability. The processor-based method is a legacy option with significant limitations.
Usage notes
Inactive services (not in a running state for 180 consecutive days) are subject to automatic deletion.
When invoking a service through a gateway, the request body cannot exceed 1 MB.
Avoid using ports 8080 and 9090 because they are reserved by the EAS engine.
Procedure
This section shows how to quickly deploy a simple web service using the image-based deployment method.
Step 1: Prepare the code file
Save the following Flask application code as an app.py file. Note that the service listens on port 8000.
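The exact sample code may differ; the sketch below is a minimal Flask application that listens on port 8000, assuming a simple echo-style endpoint that you can replace with your own inference logic.

```python
# app.py - a minimal sketch, assuming a simple echo-style endpoint.
# Replace the route logic with your own model inference code.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def predict():
    # Echo the request body, or return a greeting when the body is empty.
    data = request.get_data(as_text=True)
    return data or "Hello, EAS!"

if __name__ == "__main__":
    # Listen on all interfaces on port 8000, matching the port configured in EAS.
    app.run(host="0.0.0.0", port=8000)
```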
Step 2: Upload the code to OSS
Upload the app.py file to an OSS Bucket. Make sure that the OSS Bucket and the EAS workspace are in the same region. For example, upload the file to the oss://examplebucket/code/ directory.
Step 3: Configure and deploy the service
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
On the configuration page, configure the key parameters in the Environment Information and Resource Information sections as follows:
Deployment Method: Select Image-based Deployment. In the Image Configuration field, select the Alibaba Cloud Image python-inference:3.9-ubuntu2004.
Mount Storage: Click OSS and mount the OSS directory that contains app.py to the /mnt/data/ path in the container.
Uri: The path to the OSS directory that contains the code, such as oss://examplebucket/code/.
Mount Path: The path in the container to the mounted directory. For example, /mnt/data/.
Command: The startup command is python /mnt/data/app.py, because app.py has been mounted to the /mnt/data/ directory of the container.
Third-party Library Settings: The sample code depends on the flask library, which is not included in the official image. Add flask to the Third-party Libraries to have it installed automatically on startup.
Resource configuration: Configure the computing resources for the service. For this example, a small CPU instance is sufficient.
Resource Type: Public Resources.
Instance Type: ecs.c7.large.
After completing the configuration, click Deploy. The deployment is complete when the service status changes to Running. Then proceed to invoke the service.
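Once the service is Running, you can test it from any client that can reach the endpoint. The sketch below assumes token-based authentication; the endpoint URL and token are placeholders that you copy from the service's invocation information in the console.

```python
# A minimal invocation sketch. Replace the placeholder endpoint and token with
# the values shown in the service's invocation information in the console.
import requests

SERVICE_URL = "http://<your-service-endpoint>"   # placeholder endpoint
TOKEN = "<your-service-token>"                   # placeholder token

response = requests.post(
    SERVICE_URL,
    headers={"Authorization": TOKEN},  # token-based authentication header
    data="hello",                      # request body handled by app.py
    timeout=10,
)
print(response.status_code, response.text)
```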
References
Manage environment and dependencies
In the Environment Information section, configure the runtime environment and dependencies for the service.
Parameter | Description |
Image Configuration | The image serves as the basic runtime environment for the service. Use an official image provided by PAI or a custom-built image by selecting Custom Image or entering an Image Address. For more information, see Custom Images. Note If the image contains a WebUI, select Enable Web App, and EAS automatically starts a web server to provide direct access to the frontend page. |
Mount Storage | Mounting models, code, or data from cloud storage services such as OSS and NAS to a local path in a container decouples the code or data from the environment, allowing for independent updates. For more information, see Storage mounting. |
Mount Dataset | If you want to perform version management for your models or data, use the Dataset to mount them. For more information, see Create and manage datasets. |
Command, Port Number | Set the image startup command, such as python app.py, and the port on which the service listens. Important The EAS engine listens on fixed ports 8080 and 9090. Avoid using these two ports. |
Third-Party Library Settings | If you only need to install a few additional Python libraries, add the library names directly or specify a Path of Requirements.txt to avoid rebuilding the image. |
Environment Variables | Set environment variables for the service instance as key-value pairs. |
For GPU-accelerated instances, you can specify the GPU driver version to meet the runtime requirements of specific models or frameworks.
Configure computing resources
In the Resource Information section, configure a service's computing resources.
Parameter | Description |
Resource Type | Select public resources, EAS resource groups, or resource quotas. Note Enable GPU Sharing to deploy multiple model services on a single GPU card and improve GPU resource utilization by sharing computing power. This feature is suitable for use cases with small models or low inference payloads. It can be enabled only when you use an EAS resource group or a resource quota. For more information, see GPU Sharing. |
Instances | Configure multiple instances to avoid the risk of a single point of failure. |
Deployment Resources | When using public resources for supported specifications, enable Bidding and set a maximum bid price to preempt idle resources at a price much lower than that of regular instances. This feature is suitable for inference tasks that are not sensitive to interruptions. |
Configure a System Disk | |
Elastic Resource Pool | This feature enables cloud bursting. When traffic exceeds the capacity of your dedicated resources (EAS resource groups or quotas), the service automatically bursts to on-demand public resources to handle the spike. During scale-in, these public instances are released first to minimize costs. For more information, see Elastic resource pool. |
High-priority Resource Rescheduling | After you enable this feature, the system periodically tries to migrate service instances from low-priority resources, such as public resources or regular instances, to high-priority resources, such as dedicated resource groups or spot instances. This optimizes costs and resource allocation. |
Service registration and network
EAS provides flexible service registration and network configuration options to meet different business integration needs. For more information, see Service Invocation.
Parameter | Description |
Select Gateway | By default, services are exposed through a complimentary Shared Gateway. For advanced capabilities, including custom domain names and fine-grained access control, an upgrade to a Dedicated Gateway is available as a paid option. For more information, see Invoke through a dedicated gateway. Important When you call a service through a gateway, the request body cannot exceed 1 MB. |
VPC Configuration | By configuring a VPC, vSwitch, and security group, enable direct access to services within the VPC or allow services to access Internet resources. For more information, see Network configuration. |
Associate NLB | Associate a service with a Network Load Balancer (NLB) instance to achieve more flexible and fine-grained load balancing. For more information, see Invoke services using associated NLB instances. |
Service Discovery Nacos | Register services to the Microservices Registry to enable automatic service discovery and synchronization in a microservices model. For more information, see Call a service using Nacos. |
To enable high-performance RPC communication, enable gRPC support on the service gateway.
Service security
To enhance service security, configure the following parameters in the Features section:
Parameter | Description |
Custom Authentication | If you do not want to use the system-generated token, customize the authentication token for service access here. |
Configure Secure Encryption Environment | Integrates with confidential computing services to run inference within a secure enclave, encrypting data, models, and code while in use. This feature is primarily for mounted storage files. You must mount the storage before you enable this feature. For more information, see Secure Encrypted Inference Service. |
Instance RAM Role | By associating a RAM role with an instance, the code within the service can use STS temporary credentials to access other cloud resources. This eliminates the need to configure a fixed AccessKey and reduces the risk of key leakage. For more information, see Configure an EAS RAM role. |
Ensure service stability and high availability
The group feature in the Basic Information section lets you group multiple service versions or services that use heterogeneous resources. These groups can then be used with traffic management policies to implement phased releases. For more information, see Phased Release.
To ensure the stability and reliability of services in your production environment, configure the following parameters in the Features section:
Parameter | Description |
Service Response Timeout Period | Configure an appropriate timeout for each request. The default is 5 seconds. This prevents slow requests from occupying service resources for a long time. |
Health Check | When you configure a Health Check for a service, the system periodically checks the health status of its instances. If an instance becomes abnormal, the system automatically launches a new instance to enable self-healing. For more information, see Health check. |
Graceful Shutdown | Configure the Graceful Shutdown Time to ensure that during service updates or scale-ins, an instance has sufficient time to finish processing received requests before it exits. This prevents request processing from being interrupted. Enable Send SIGTERM for more fine-grained exit handling at the application layer (see the sketch after this table). For more information, see Rolling updates and graceful exit. |
Rolling Update | By configuring Number of Instances Exceeding Expectation and Maximum Number of Unavailable Instances, have fine-grained control over the instance replacement policy during the service update process and complete the version upgrade without interrupting the service. For more information, see Rolling updates and graceful exit. |
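If you enable Send SIGTERM for Graceful Shutdown, the application itself must react to the signal. The following is a minimal, framework-agnostic sketch of one way a Python process can drain in-flight work on SIGTERM; it is illustrative only and not an EAS-specific API.

```python
# A sketch of application-level SIGTERM handling: stop accepting new work,
# finish in-flight requests, then exit cleanly.
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Mark the process as draining instead of exiting immediately.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    while not shutting_down:
        # ... serve requests here ...
        time.sleep(0.1)
    # Finish any in-flight work before exiting.
    sys.exit(0)

if __name__ == "__main__":
    main()
```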
Performance optimization
The following configurations are crucial for improving service performance. These configurations can accelerate startup, increase throughput, and reduce latency, especially for resource-intensive applications such as large models.
Category | Parameter | Description |
Storage Acceleration | Distributed Cache Acceleration | Caches models or data files from mounted storage, such as OSS, to the local instance to improve read speeds and reduce I/O latency. For more information, see Cache data to local directories. |
Storage Acceleration | Model Weights Service (MoWS) | Significantly improves scaling efficiency and service startup speed in large-scale instance deployment scenarios through local caching and cross-instance sharing of model weights. For more information, see Model Weight Service. |
Resource Configuration | Shared Memory | Configure shared memory for an instance. This allows multiple processes within the container to directly read and write to the same memory region, avoiding the overhead of data copying and transmission. This is suitable for scenarios that require efficient inter-process communication. |
Resource Configuration | Distributed Inference | A single inference instance is deployed on multiple machines to jointly complete an inference task. This solves the problem that ultra-large models cannot be deployed on a single machine. For more information, see Multi-machine distributed inference. |
Intelligent Scheduling | LLM Intelligent Router | When an LLM service has multiple backend instances, the LLM Intelligent Router dynamically distributes requests based on backend load to balance computing power and GPU memory usage across all instances and improve cluster resource utilization. For more information, see LLM intelligent router. |
Service observation and diagnostics
To gain insights into your service status and quickly troubleshoot problems, enable the following features in the Features section:
Parameter | Description |
Save Call Records | Persistently save all request and response records of the service in MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting. |
Tracing Analysis | Some official images include a built-in collection component, allowing you to enable tracing with a single click. For other images, integrate an ARMS probe through simple configuration to achieve end-to-end monitoring of service invocations. For more information, see Enable Tracing for LLM-based Applications in EAS. |
Asynchronous and elastic services
Asynchronous inference: For long-running inference scenarios, such as AIGC and video processing, enable the Asynchronous Queue. This allows a client to receive an immediate response after initiating a call and obtain the final result using polling or a callback. For more information, see Deploy an asynchronous inference service.
Elastic Job service: In the Features section, enable Task Mode to run inference workloads as on-demand, serverless jobs. Resources are automatically provisioned for the task and released upon completion to save costs. For more information, see Elastic Job Service.
Modify configuration parameters in the JSON file
In the Service Configuration section, view and directly edit the complete JSON for the current UI configuration.
For automated and fine-grained configuration scenarios, you can also use a JSON file to directly define and deploy services. For more information, see JSON deployment.
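As a rough illustration only, a JSON configuration for the quick-start deployment in this topic might look similar to the sketch below; the field names and structure shown here are assumptions and should be verified against the JSON deployment reference before use.

```json
{
  "name": "demo_flask_service",
  "containers": [
    {
      "image": "<python-inference:3.9-ubuntu2004 image address>",
      "command": "python /mnt/data/app.py",
      "port": 8000
    }
  ],
  "storage": [
    {
      "mount_path": "/mnt/data/",
      "oss": {
        "path": "oss://examplebucket/code/"
      }
    }
  ],
  "metadata": {
    "instance": 1
  },
  "cloud": {
    "computing": {
      "instance_type": "ecs.c7.large"
    }
  }
}
```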