Custom deployment in Elastic Algorithm Service (EAS) provides flexible and comprehensive hosting for AI inference services, allowing you to deploy any algorithm or model as an online service.
First, try Scenario-based deployment for use cases such as LLMs and ComfyUI. If this option does not meet your needs, use a custom deployment.
Quick start: Deploy a simple web service
This section shows how to quickly deploy a simple web service using the image-based deployment method.
Step 1: Prepare the code file
Save the following Flask application code as an app.py file. Note that the service listens on port 8000.
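The original code listing is not reproduced here; a minimal sketch of such an app.py is shown below. The /predict route and the echo-style response are illustrative choices, not part of the official sample; the only hard requirement is that the app listens on port 8000.

```python
# app.py - minimal Flask web service for EAS image-based deployment.
# The /predict route and the echo-style response are illustrative; any
# HTTP handler that listens on port 8000 works the same way.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["GET"])
def health():
    # Simple liveness response so the service root returns HTTP 200.
    return "Service is running"

@app.route("/predict", methods=["POST"])
def predict():
    # Echo the request body back; replace with real inference logic.
    data = request.get_data(as_text=True)
    return {"input": data, "result": "ok"}

if __name__ == "__main__":
    # EAS routes traffic to the port configured for the service; this
    # example listens on 8000, matching the deployment configuration.
    app.run(host="0.0.0.0", port=8000)
```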
Step 2: Upload the code to OSS
Upload the app.py file to an OSS Bucket. Make sure that the OSS Bucket and the EAS workspace are in the same region. For example, upload the file to the oss://examplebucket/code/ directory.
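Uploading through the OSS console works. If you prefer to script the upload, a sketch with the oss2 Python SDK looks like the following; the endpoint, bucket name, and credentials are placeholders that you must replace with your own values.

```python
# Upload app.py to OSS with the oss2 SDK (pip install oss2).
# Replace the endpoint, bucket name, and credentials with your own values.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
# Use the endpoint of the region that also hosts your EAS workspace.
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "examplebucket")

# Writes the local app.py to oss://examplebucket/code/app.py.
bucket.put_object_from_file("code/app.py", "app.py")
```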
Step 3: Configure and deploy the service
1. Log on to the PAI console. Select a region at the top of the page, then select the desired workspace and click Elastic Algorithm Service (EAS).
2. On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
3. On the configuration page, configure the key parameters in the Environment Information and Resource Information sections as follows:
- Deployment Method: Select Image-based Deployment.
- Image Configuration: Select the Alibaba Cloud image python-inference:3.9-ubuntu2004.
- Mount Storage: Click OSS and mount the OSS directory that contains app.py to the /mnt/data/ path in the container.
  - Uri: The OSS directory that contains the code, such as oss://examplebucket/code/.
  - Mount Path: The path in the container to which the directory is mounted. For example, /mnt/data/.
- Command: Set the startup command to python /mnt/data/app.py, because app.py is mounted to the /mnt/data/ directory of the container.
- Third-party Library Settings: The sample code depends on the flask library, which is not included in the official image. Add flask to Third-party Libraries so that it is installed automatically on startup.
- Resource Configuration: Configure the computing resources for the service. A small CPU instance is sufficient for this example.
  - Resource Type: Public Resources.
  - Instance Type: ecs.c7.large.
After completing the configuration, click Deploy. The deployment is complete when the service status changes to Running. Then proceed to invoke the service.
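Once the service is Running, you can find its endpoint URL and token in the service's invocation information in the console. A minimal invocation sketch with the requests library follows; the endpoint and token below are placeholders, and the /predict path matches the illustrative app.py above.

```python
# Call the deployed EAS service over HTTP (pip install requests).
# The endpoint URL and token are placeholders; copy the real values from
# the service's invocation information in the console.
import requests

url = "http://<service-endpoint>/predict"   # /predict matches the sample app.py
token = "<service-token>"

resp = requests.post(url, headers={"Authorization": token}, data="hello EAS", timeout=10)
print(resp.status_code, resp.text)
```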
References
Manage environment and dependencies
In the Environment Information section, configure the environments and dependencies for the service.
Parameter | Description |
Image Configuration | The image serves as the basic runtime environment for the service. Use an official image provided by PAI or a custom-built image by selecting Custom Image or entering an Image Address. For more information, see Custom Images. Note If the image contains a WebUI, select Enable Web App, and EAS automatically starts a web server to provide direct access to the frontend page. |
Mount storage | Mounting models, code, or data from cloud storage services such as OSS and NAS to a local path in a container decouples the code or data from the environment, allowing for independent updates. For more information, see Storage mounting. |
Mount dataset | If you want to perform version management for your models or data, use the Dataset to mount them. For more information, see Create and manage datasets. |
Command | Set the image startup command, such as python /mnt/data/app.py in the quick-start example. |
Port Number | Specifies the listening port for the service. This parameter is optional in some scenarios. For example, the port number is optional if a service receives messages by subscribing to a message queue in the runtime image instead of relying on traffic from the EAS gateway. Important The EAS engine reserves ports 8080 and 9090. To prevent port conflicts that can cause service startup failures, do not use these ports when you deploy a service. |
Third-party Library Settings | If you only need to install a few additional Python libraries, add the library names directly or specify the path to a requirements.txt file to avoid rebuilding the image. |
Environment Variables | Set environment variables for the service instance as key-value pairs. A small example of reading these settings from service code follows this table. |
For GPU-accelerated instances, specify the GPU driver version through Features > Resource Configuration > GPU driver version to meet the runtime requirements of specific models or frameworks.
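As a small illustration of how the Mount Storage and Environment Variables settings surface inside the container, the service code can read them like any local path or process environment variable. The variable name MODEL_DIR and the config.json file below are examples, not settings defined by EAS.

```python
# Reading deployment configuration from inside the container.
# /mnt/data is the mount path configured in the console; MODEL_DIR is an
# example environment variable you might define in Environment Variables.
import os

model_dir = os.environ.get("MODEL_DIR", "/mnt/data")
config_path = os.path.join(model_dir, "config.json")

if os.path.exists(config_path):
    print("Found config at", config_path)
else:
    print("No config file; falling back to defaults")
```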
Configure computing resources
In the Resource Information section, configure a service's computing resources.
Parameter | Description |
Resource Type | Select public resources, EAS resource groups, or resource quotas. Note Enable GPU Sharing to deploy multiple model services on a single GPU card and improve GPU resource utilization by sharing computing power. This feature is suitable for use cases with small models or low inference payloads. It can be enabled only when you use an EAS resource group or a resource quota. For more information, see GPU Sharing. |
Replicas | Configure multiple instances to avoid the risk of a single point of failure. |
Deployment | When using public resources for supported specifications, enable bidding and set a maximum bid price to preempt idle resources at a price much lower than that of regular instances. This feature is suitable for inference tasks that are not sensitive to interruptions. |
Configure a System Disk |
Elastic Resource Pool | This feature enables cloud bursting. When traffic exceeds the capacity of your dedicated resources (EAS resource groups or quotas), the service automatically bursts to on-demand public resources to handle the spike. During scale-in, these public instances are released first to minimize costs. For more information, see Elastic resource pool. |
Specify Node Scheduling | This setting applies only when you use EAS resource groups or resource quotas. |
High-priority Resource Rescheduling | After you enable this feature, the system periodically tries to migrate service instances from low-priority resources, such as public resources or regular instances, to high-priority resources, such as dedicated resource groups or spot instances. This optimizes costs and resource allocation. |
Service registration and network
EAS provides flexible service registration and network configuration options to meet different business integration needs. For more information, see Service Invocation.
Parameter | Description |
Select Gateway | By default, services are exposed through a complimentary Shared Gateway. For advanced capabilities, including custom domain names and fine-grained access control, an upgrade to a Dedicated Gateway is available as a paid option. For more information, see Invoke through a dedicated gateway. Important When you call a service through a gateway, the request body cannot exceed 1 MB. |
VPC | Configure a VPC, vSwitch, and security group to enable direct access to the service from within the VPC, or to allow the service to access Internet resources. For more information, see Enable EAS to access the internet and private resources. |
Associate NLB | Associate a service with a Network Load Balancer (NLB) instance to achieve more flexible and fine-grained load balancing. For more information, see Invoke services using associated NLB instances. |
Service Discovery Nacos | Register services to the Microservices Registry to enable automatic service discovery and synchronization in a microservices model. For more information, see Call a service using Nacos. |
To enable high-performance RPC communication, activate gRPC support for the service gateway through Features > Advanced Networking > Enable gRPC.
Service security
To enhance service security, configure the following parameters in the Features section:
Parameter | Description |
Custom Authentication | If you do not want to use the system-generated token, customize the authentication token for service access here. |
Configure Secure Encryption Environment | Integrates with confidential computing services to run inference within a secure enclave, encrypting data, models, and code while in use. This feature is primarily for mounted storage files. You must mount the storage before you enable this feature. For more information, see Secure Encrypted Inference Service. |
Instance RAM Role | After associating a RAM role with an instance, the code within the service can use STS temporary credentials to access other cloud resources. This eliminates the need to configure a fixed AccessKey and reduces the risk of key leakage. For more information, see Configure an EAS RAM role. |
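As an illustration of what an instance RAM role enables, code inside the service can build cloud clients from STS temporary credentials instead of a fixed AccessKey. The sketch below assumes the temporary credentials have already been obtained by the mechanism described in Configure an EAS RAM role and, as a placeholder, reads them from environment variables; the variable names and bucket details are illustrative.

```python
# Access OSS from inside the service with STS temporary credentials.
# How the credentials reach your code depends on the RAM role setup; here
# they are read from environment variables as a placeholder mechanism.
import os
import oss2

sts_auth = oss2.StsAuth(
    os.environ["STS_ACCESS_KEY_ID"],        # temporary AccessKey ID
    os.environ["STS_ACCESS_KEY_SECRET"],    # temporary AccessKey secret
    os.environ["STS_SECURITY_TOKEN"],       # session token
)
bucket = oss2.Bucket(sts_auth, "https://oss-cn-hangzhou.aliyuncs.com", "examplebucket")
print(bucket.get_object("code/app.py").read()[:100])
```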
Ensure service stability and high availability
The group feature in the Basic Information section lets you group multiple service versions or services that use heterogeneous resources. These groups can then be used with traffic management policies to implement phased releases. For more information, see Phased Release.
To ensure the stability and reliability of services in your production environment, configure the following parameters in the Features sections:
Parameter | Description |
Service Response Timeout Period | Configure an appropriate timeout for each request. The default is 5 seconds. This prevents slow requests from occupying service resources for a long time. |
Health Check | When you configure a Health Check for a service, the system periodically checks the health status of its instances. If an instance becomes abnormal, the system automatically launches a new instance to enable self-healing. For more information, see Health check. |
Compute monitoring & fault tolerance | The platform monitors the real-time health of the computing power for distributed inference services to automatically detect faults and perform intelligent self-healing. This ensures high service availability and stability. For more information, see Computing power check and fault tolerance. |
Graceful Shutdown | Configure the Graceful Shutdown Time to ensure that, during service updates or scale-ins, an instance has sufficient time to finish processing the requests it has already received before it exits. This prevents request processing from being interrupted. Enable Send SIGTERM for more fine-grained exit handling at the application layer, as illustrated in the sketch after this table. For more information, see Rolling updates and graceful exit. |
Rolling Update | By configuring Exceeds the expected number of replicas and Maximum Unavailable Replicas, you gain fine-grained control over the instance replacement policy during a service update and can complete a version upgrade without interrupting the service. For more information, see Rolling updates and graceful exit. |
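For the Graceful Shutdown row above, a minimal sketch of application-level SIGTERM handling in the sample Flask service might look like this; the flag-based draining logic is illustrative, and the actual cleanup depends on your application.

```python
# Application-level handling of SIGTERM for graceful shutdown.
# When Send SIGTERM is enabled, the instance is signaled before it exits;
# the handler below stops accepting new work and lets in-flight requests finish.
import signal
import threading

shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    # Mark the process as draining; a health-check route can start returning
    # failure so no new requests are routed here while existing ones complete.
    shutting_down.set()
    print("SIGTERM received, draining in-flight requests")

signal.signal(signal.SIGTERM, handle_sigterm)
```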
Performance optimization
The following configurations are crucial for improving service performance. These configurations can accelerate startup, increase throughput, and reduce latency, especially for resource-intensive applications such as large models.
Category | Parameter | Description |
Storage Acceleration | Distributed cache acceleration | Caches models or data files from mounted storage, such as OSS, to the local instance to improve read speeds and reduce I/O latency. For more information, see Cache data to local directories. |
Storage Acceleration | Model Weight Service (MoWS) | Significantly improves scaling efficiency and service startup speed in large-scale instance deployment scenarios through local caching and cross-instance sharing of model weights. For more information, see Model Weight Service. |
Resource Configuration | Shared Memory | Configure shared memory for an instance. This allows multiple processes within the container to directly read and write to the same memory region, avoiding the overhead of data copying and transmission. This is suitable for scenarios that require efficient inter-process communication. |
Resource Configuration | Distributed Inference | A single inference instance is deployed on multiple machines to jointly complete an inference task. This solves the problem that ultra-large models cannot be deployed on a single machine. For more information, see Multi-machine distributed inference. |
Intelligent Scheduling | LLM Intelligent Router | When an LLM service has multiple backend instances, the LLM Intelligent Router dynamically distributes requests based on backend load to balance computing power and GPU memory usage across all instances and improve cluster resource utilization. For more information, see LLM intelligent router. |
Service observation and diagnostics
To gain insights into your service status and quickly troubleshoot problems, enable the following features in the Features section:
Parameter | Description |
Save Call Records | Persistently save all request and response records of the service in MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting. |
Tracing Analysis | Some official images include a built-in collection component, allowing you to enable tracing with a single click. For other images, integrate an ARMS probe through simple configuration to achieve end-to-end monitoring of service invocations. For more information, see Enable Tracing for LLM-based Applications in EAS. |
Asynchronous and elastic services
Asynchronous inference: For long-running inference scenarios, such as AIGC and video processing, enable the Asynchronous Queue. This allows a client to receive an immediate response after initiating a call and obtain the final result using polling or a callback. For more information, see Deploy an asynchronous inference service.
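The client-side pattern for an asynchronous service is submit-then-poll. The sketch below is purely illustrative of that pattern: the URLs, headers, and response field names are placeholders, and the actual protocol, endpoints, and callback option are defined in Deploy an asynchronous inference service.

```python
# Illustrative submit-then-poll pattern for an asynchronous inference service.
# All URLs and response fields are placeholders; consult
# "Deploy an asynchronous inference service" for the actual protocol.
import time
import requests

headers = {"Authorization": "<service-token>"}

# 1. Submit the task and record its identifier.
submit = requests.post("http://<async-service-endpoint>/submit",
                       headers=headers, data="input payload", timeout=10)
task_id = submit.json()["task_id"]          # placeholder field name

# 2. Poll until the result is ready (a callback URL is the alternative).
while True:
    result = requests.get(f"http://<async-service-endpoint>/result/{task_id}",
                          headers=headers, timeout=10)
    if result.json().get("status") == "done":   # placeholder field name
        print(result.json()["output"])
        break
    time.sleep(5)
```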
Elastic Job service: In the Features section, enable Task Mode to run inference workloads as on-demand, serverless jobs. Resources are automatically provisioned for the task and released upon completion to save costs. For more information, see Elastic Job Service.
Modify configuration parameters in the JSON file
In the Service Configuration section of the deployment page, you can view and directly edit the complete JSON that corresponds to the current UI configuration.
For automated and fine-grained configuration scenarios, you can also use a JSON file to directly define and deploy services. For more information, see JSON deployment.
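As a rough illustration of what such a configuration contains for the quick-start service, the sketch below builds an approximate payload in Python and prints it as JSON. The field names are an assumption that mirrors the UI settings above, not the authoritative schema; verify them against the JSON deployment reference before use.

```python
# Approximate shape of an image-based EAS service configuration.
# Field names here mirror the UI settings from the quick start but are an
# assumption; check the JSON deployment reference for the authoritative schema.
import json

service_config = {
    "metadata": {"name": "flask_demo", "instance": 1},
    "cloud": {"computing": {"instance_type": "ecs.c7.large"}},
    "containers": [{
        "image": "<python-inference image address>",
        "script": "python /mnt/data/app.py",
        "port": 8000,
    }],
    "storage": [{
        "oss": {"path": "oss://examplebucket/code/"},
        "mount_path": "/mnt/data/",
    }],
}

print(json.dumps(service_config, indent=2))
```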