EAS custom deployment lets you package any algorithm or model as an online inference service using custom containers, processors, or framework-specific configurations.
First, try Deploy pre-built AI services for use cases such as LLMs and ComfyUI. If this option does not meet your needs, use a custom deployment.
Quick start: Deploy a simple web service
Use image-based deployment to quickly deploy a simple web service.
Step 1: Prepare the code file
Save the following Flask application code as an app.py file. The service listens on port 8000.
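A minimal sketch of such an app.py is shown below. The only details taken from this guide are the use of Flask and port 8000; the `/hello` route and its response text are hypothetical and chosen purely for illustration.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/hello")  # hypothetical route, not prescribed by this guide
def hello():
    # Return a fixed plain-text body so the service is easy to verify.
    return "Hello from EAS"


if __name__ == "__main__":
    # Bind to all interfaces so the container is reachable from outside,
    # and listen on port 8000 as stated in this step.
    app.run(host="0.0.0.0", port=8000)
```

The `if __name__ == "__main__"` guard keeps the module importable for local testing while still starting the server when the file is run as the startup command.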
Step 2: Upload the code to OSS
Upload the app.py file to an OSS Bucket. Ensure the OSS Bucket and EAS workspace are in the same region. For example, upload to oss://examplebucket/code/.
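If you prefer to script the upload instead of using the console, the following sketch may help. It assumes the third-party `oss2` SDK is installed (`pip install oss2`) and that you supply your own endpoint and credentials; the helper names `split_oss_uri` and `upload_code` are hypothetical.

```python
import os
import posixpath
from typing import Tuple
from urllib.parse import urlparse


def split_oss_uri(uri: str) -> Tuple[str, str]:
    """Split an oss://bucket/prefix/ URI into (bucket name, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "oss" or not parsed.netloc:
        raise ValueError(f"not an OSS URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_code(uri: str, local_path: str, endpoint: str,
                key_id: str, key_secret: str) -> str:
    """Upload local_path under the OSS directory named by uri; return the object key."""
    import oss2  # imported lazily so the URI helper works without the SDK

    bucket_name, prefix = split_oss_uri(uri)
    object_key = posixpath.join(prefix, os.path.basename(local_path))
    bucket = oss2.Bucket(oss2.Auth(key_id, key_secret), endpoint, bucket_name)
    bucket.put_object_from_file(object_key, local_path)
    return object_key
```

Choose the endpoint of the region that hosts both the bucket and the EAS workspace, since the two must be in the same region.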
Step 3: Configure and deploy the service
- Log on to the PAI console. Select a region at the top of the page. Then select the desired workspace and click Elastic Algorithm Service (EAS).
- On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
- Configure the key parameters in the Environment Information and Resource Information sections:
  - Deployment Method: Select Image-based Deployment.
  - Image Configuration: Select the Alibaba Cloud Image python-inference:3.9-ubuntu2004.
  - Mount Storage: Click OSS and mount the OSS directory that contains app.py to the /mnt/data/ path in the container.
    - Uri: The path to the OSS directory that contains the code, such as oss://examplebucket/code/.
    - Mount Path: The path in the container where the directory is mounted, such as /mnt/data/.
  - Command: Set the startup command to python /mnt/data/app.py, because app.py is mounted to /mnt/data/.
  - Third-party Library Settings: The sample code depends on the flask library, which is not included in the official image. Add flask to Third-party Libraries so that it is installed automatically on startup.
  - Resource Configuration: Configure the computing resources for the service. For this example, a small CPU instance is sufficient.
    - Resource Type: Public Resources.
    - Resource Specification: ecs.c7.large.
- After you complete the configuration, click Deploy. Deployment is complete when the service status changes to Running. You can then invoke the service.
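Once the service is running, a call from Python might look like the sketch below. It assumes you copy the endpoint URL and token from the service's invocation information in the console and that the token is sent in the `Authorization` header; the `/hello` path and the helper names are hypothetical, so adjust them to the routes your application actually exposes.

```python
import urllib.request


def build_request(endpoint: str, token: str,
                  path: str = "/hello") -> urllib.request.Request:
    """Build a GET request for the service, carrying the token as a header."""
    url = endpoint.rstrip("/") + path
    return urllib.request.Request(
        url, headers={"Authorization": token}, method="GET"
    )


def call_service(endpoint: str, token: str,
                 path: str = "/hello", timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    request = build_request(endpoint, token, path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from the network call makes the URL and header logic easy to check without a live service.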
References
Manage environment and dependencies
In the Environment Information section, configure service environments and dependencies.
| Parameter | Description |
| --- | --- |
| Image Configuration | Provides the runtime environment. Use an official PAI image or a custom-built image by selecting Custom Image or entering an Image Address. For more information, see Custom Images. Note: If the image contains a WebUI, select Enable Web App, and EAS automatically starts a web server to provide direct access to the frontend page. |
| Mount storage | Mount models, code, or data from OSS or NAS to container paths so that they can be updated independently of the image. For more information, see Storage mounting. |
| Mount dataset | For model or data versioning, use Dataset mounting. For more information, see Create and manage datasets. |
| Command | Set the image startup command. |
| Port Number | The port that the service listens on. Optional in some scenarios. For example, skip this if the service receives messages through a message queue in the runtime image instead of the EAS gateway. Important: The EAS engine reserves ports 8080 and 9090. To prevent port conflicts that can cause service startup failures, do not use these ports when you deploy a service. |
| Third-party Library Settings | To install additional Python libraries without rebuilding the image, add library names directly or specify a Path of requirements.txt. |
| Environment Variables | Set environment variables for the service instance as key-value pairs. |
For GPU-accelerated instances, specify the GPU driver version through Features > Resource Configuration > GPU driver version to match your model or framework requirements.
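Because a service that listens on a reserved port can fail to start, it can be useful to check the port before deploying. The sketch below encodes the two reserved ports stated in the Port Number description (8080 and 9090); the function name `validate_service_port` is hypothetical.

```python
# Ports that the EAS engine reserves, as stated in the Port Number description.
RESERVED_PORTS = frozenset({8080, 9090})


def validate_service_port(port: int) -> int:
    """Return port unchanged if it is usable, otherwise raise ValueError."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} is outside the valid range 1-65535")
    if port in RESERVED_PORTS:
        raise ValueError(f"port {port} is reserved by the EAS engine")
    return port
```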
Configure computing resources
In the Resource Information section, configure a service's computing resources.
| Parameter | Description |
| --- | --- |
| Resource Type | Select public resources, EAS resource groups, or resource quotas. Note: Enable GPU Sharing to run multiple services on a single GPU. This is suitable for small models or low inference payloads and is available only with EAS resource groups or resource quotas. For more information, see GPU Sharing. |
| Replicas | Configure multiple instances to avoid a single point of failure. |
| Deployment | When using public resources with supported specifications, enable bidding and set a maximum bid price to preempt idle resources at a price much lower than that of regular instances. Suitable for tasks that tolerate interruptions. |
| Configure a System Disk | |
| Elastic Resource Pool | Enables cloud bursting. When traffic exceeds the capacity of your dedicated resources (EAS resource groups or quotas), the service bursts to public resources. During scale-in, burst instances are released first to minimize costs. For more information, see Elastic resource pool. |
| Specify Node Scheduling | This setting applies only when you use EAS resource groups or resource quotas. |
| High-priority Resource Rescheduling | Periodically migrates instances from low-priority resources (public or regular) to high-priority resources (dedicated groups or spot instances) to optimize costs. |
Service registration and network
EAS supports flexible service registration and network options. For more information, see Service Invocation.
| Parameter | Description |
| --- | --- |
| Select Gateway | By default, services use a free Shared Gateway. For custom domains and fine-grained access control, upgrade to a paid Dedicated Gateway. For more information, see Invoke through a dedicated gateway. Important: When you call a service through a gateway, the request body cannot exceed 1 MB. |
| VPC | Configure a VPC, vSwitch, and security group to enable direct VPC access or allow services to reach Internet resources. For more information, see Configure network access. |
| Associate NLB | Associate a service with a Network Load Balancer (NLB) instance for flexible load balancing. For more information, see Invoke services using associated NLB instances. |
| Service Discovery Nacos | Register services to the Microservices Registry to enable automatic service discovery and synchronization in a microservices model. For more information, see Call a service using Nacos. |
To enable high-performance RPC communication, activate gRPC support for the service gateway through Features > Advanced Networking > Enable gRPC.
Service security
Configure security parameters in the Features section:
| Parameter | Description |
| --- | --- |
| Custom Authentication | Customize the authentication token instead of using the system-generated one. |
| Configure Secure Encryption Environment | Integrates with confidential computing services to run inference within a secure enclave, encrypting data, models, and code in use. Storage must be mounted before you enable this feature. For more information, see Secure Encrypted Inference Service. |
| Instance RAM Role | Associate a RAM role with an instance to use STS temporary credentials for accessing other cloud resources. This eliminates fixed AccessKey configuration and reduces the risk of key leakage. For more information, see Configure an EAS RAM role. |
Ensure service stability and high availability
Use the group feature in the Basic Information section to group service versions or services that use heterogeneous resources, then apply traffic management policies for phased releases. For more information, see Phased Release.
To ensure the stability and reliability of services in your production environment, configure the following parameters in the Features section:
| Parameter | Description |
| --- | --- |
| Service Response Timeout Period | Set the per-request timeout (default: 5 seconds) to prevent slow requests from blocking resources. |
| Health Check | Configure a Health Check to periodically check instance health and automatically launch a new instance to replace an abnormal one. For more information, see Configure health checks. |
| Compute monitoring & fault tolerance | Monitors the real-time computing health of distributed inference services, with automatic fault detection and intelligent self-healing. For more information, see Compute monitoring and fault tolerance. |
| Graceful Shutdown | Set the Graceful Shutdown Time so that instances finish in-flight requests during updates or scale-ins instead of being interrupted. Enable Send SIGTERM for application-level exit handling. For more information, see Rolling updates and graceful exit. |
| Rolling Update | Configure Exceeds the expected number of replicas and Maximum Unavailable Replicas for fine-grained control over instance replacement, so that version upgrades proceed without downtime. For more information, see Rolling updates and graceful exit. |
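When Send SIGTERM is enabled, the application itself decides how to exit. The following sketch shows one common way to do this in Python with the standard `signal` module: the handler only sets a flag, and the work loop stops accepting new work once the flag is set. The names `shutdown_requested` and `serve_until_shutdown` are hypothetical, and the loop stands in for whatever serving framework you use.

```python
import signal
import threading

# Set when SIGTERM arrives; the work loop checks it between work items.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Keep the handler minimal: record the request and return immediately.
    shutdown_requested.set()


# Signal handlers must be registered from the main thread.
signal.signal(signal.SIGTERM, handle_sigterm)


def serve_until_shutdown(handle_one, poll_interval=0.1):
    """Run handle_one() repeatedly until SIGTERM is received.

    handle_one returns True if it processed a work item and False if
    nothing was pending. Returns the number of items processed.
    """
    handled = 0
    while not shutdown_requested.is_set():
        if handle_one():
            handled += 1
        else:
            # Idle: sleep briefly, waking early if shutdown is requested.
            shutdown_requested.wait(poll_interval)
    return handled
```

Because the flag is checked only between work items, an item that is already in progress finishes before the loop exits, which is the behavior the Graceful Shutdown Time is meant to leave room for.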
Performance optimization
These configurations accelerate startup, increase throughput, and reduce latency, especially for resource-intensive applications such as large models.
| Category | Parameter | Description |
| --- | --- | --- |
| Storage Acceleration | Distributed cache acceleration | Caches mounted storage files locally to improve read speeds and reduce I/O latency. For more information, see Cache data to local directories. |
| Storage Acceleration | Model Weight Service (MoWS) | Improves scaling efficiency and service startup speed in large-scale instance deployments through local caching and cross-instance sharing of model weights. For more information, see Model Weight Service. |
| Resource Configuration | Shared Memory | Allocates shared memory for efficient inter-process communication within the container. Multiple processes can read and write the same memory region directly, which avoids data copy overhead. |
| Resource Configuration | Distributed Inference | Deploys a single inference instance across multiple machines for models that exceed single-machine capacity. For more information, see Multi-node distributed inference. |
| Intelligent Scheduling | LLM Intelligent Router | For multi-instance LLM services, distributes requests based on backend load to balance computing power and GPU memory usage across instances. For more information, see LLM intelligent router. |
Service observation and diagnostics
To monitor service health and troubleshoot issues, enable the following in the Features section:
| Parameter | Description |
| --- | --- |
| Save Call Records | Persistently save all request and response records to MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting. |
| Tracing Analysis | Some official images include a built-in collection component, allowing you to enable tracing with a single click. For other images, integrate an ARMS probe for end-to-end monitoring. For more information, see Enable Tracing for LLM-based Applications in EAS. |
Asynchronous and elastic services
- Asynchronous inference: For long-running tasks such as AIGC and video processing, enable the Asynchronous Queue. This allows a client to receive an immediate response and obtain the final result through polling or a callback. For more information, see Deploy an asynchronous inference service.
- Elastic Job service: In the Features section, enable Task Mode to run inference workloads as on-demand, serverless jobs. Resources are automatically provisioned for the task and released upon completion to save costs. For more information, see Elastic Job Service.
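The polling pattern mentioned for asynchronous inference can be sketched generically as follows. This is not the EAS queue API: `fetch` is a hypothetical callable that you would implement against the asynchronous service, returning `None` while the result is pending and the result once it is ready.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_for_result(fetch: Callable[[], Optional[T]],
                    interval: float = 1.0,
                    timeout: float = 60.0) -> T:
    """Call fetch() until it returns a non-None result or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("result was not ready before the timeout")
        time.sleep(interval)
```

A callback avoids the repeated requests that polling makes, so polling is usually the simpler choice only when the client cannot expose an endpoint for the service to call.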
Modify configuration parameters in the JSON file
In the Service Configuration section, view and directly edit the complete JSON for the current UI configuration.
For automated and fine-grained configuration scenarios, you can also use a JSON file to directly define and deploy services. For more information, see JSON deployment.
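As an illustration, the quick-start service could be described in JSON roughly as generated by the sketch below. The field names (`metadata`, `cloud.computing.instance_type`, `containers`, `storage`) are an assumption about the EAS JSON format and should be checked against the JSON deployment reference; the service name and the image address placeholder are hypothetical.

```python
import json
import posixpath


def build_service_config(name: str, image: str, oss_path: str,
                         mount_path: str = "/mnt/data/", port: int = 8000,
                         instance_type: str = "ecs.c7.large",
                         replicas: int = 1) -> dict:
    """Assemble a service definition mirroring the quick-start settings."""
    return {
        # Assumed field names; verify against the JSON deployment reference.
        "metadata": {"name": name, "instance": replicas},
        "cloud": {"computing": {"instance_type": instance_type}},
        "containers": [{
            "image": image,
            "script": "python " + posixpath.join(mount_path, "app.py"),
            "port": port,
        }],
        "storage": [{"oss": {"path": oss_path}, "mount_path": mount_path}],
    }


if __name__ == "__main__":
    config = build_service_config(
        name="quickstart_demo",            # hypothetical service name
        image="<your-image-address>",      # placeholder, fill in your own
        oss_path="oss://examplebucket/code/",
    )
    print(json.dumps(config, indent=2))
```

Generating the JSON from code keeps values such as the mount path and startup command consistent with each other when you deploy many similar services.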