Custom deployment in Elastic Algorithm Service (EAS) provides flexible and comprehensive hosting for AI inference services, allowing you to deploy any algorithm or model as an online service.
First, try Scenario-based deployment for use cases such as LLMs and ComfyUI. If this option does not meet your needs, use a custom deployment.
Quick start: Deploy a simple web service
This section shows how to quickly deploy a simple web service using the image-based deployment method.
Step 1: Prepare the code file
Save the following Flask application code as an app.py file. Note that the service listens on port 8000.
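The original code listing is not reproduced here; a minimal sketch of such an app.py is shown below. The /predict route and the echo-style response are illustrative choices, not part of the official sample; the only hard requirement is that the app listens on port 8000.

```python
# app.py - minimal Flask web service for EAS image-based deployment.
# The /predict route and the echo-style response are illustrative; any
# HTTP handler that listens on port 8000 works the same way.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["GET"])
def health():
    # Simple liveness response so the service root returns HTTP 200.
    return "Service is running"

@app.route("/predict", methods=["POST"])
def predict():
    # Echo the request body back; replace with real inference logic.
    data = request.get_data(as_text=True)
    return {"input": data, "result": "ok"}

if __name__ == "__main__":
    # EAS routes traffic to the port configured for the service; this
    # example listens on 8000, matching the deployment configuration.
    app.run(host="0.0.0.0", port=8000)
```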
Step 2: Upload the code to OSS
Upload the app.py file to an OSS Bucket. Make sure that the OSS Bucket and the EAS workspace are in the same region. For example, upload the file to the oss://examplebucket/code/ directory.
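Uploading through the OSS console works. If you prefer to script the upload, a sketch with the oss2 Python SDK looks like the following; the endpoint, bucket name, and credentials are placeholders that you must replace with your own values.

```python
# Upload app.py to OSS with the oss2 SDK (pip install oss2).
# Replace the endpoint, bucket name, and credentials with your own values.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
# Use the endpoint of the region that also hosts your EAS workspace.
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "examplebucket")

# Writes the local app.py to oss://examplebucket/code/app.py.
bucket.put_object_from_file("code/app.py", "app.py")
```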
Step 3: Configure and deploy the service
1. Log on to the PAI console. Select a region at the top of the page, then select the desired workspace and click Elastic Algorithm Service (EAS).
2. On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
3. On the configuration page, configure the key parameters in the Environment Information and Resource Information sections as follows:
- Deployment Method: Select Image-based Deployment.
- Image Configuration: Select the Alibaba Cloud image python-inference:3.9-ubuntu2004.
- Mount Storage: Click OSS and mount the OSS directory that contains app.py to the /mnt/data/ path in the container.
  - Uri: The OSS directory that contains the code, such as oss://examplebucket/code/.
  - Mount Path: The path in the container to which the directory is mounted. For example, /mnt/data/.
- Command: Set the startup command to python /mnt/data/app.py, because app.py is mounted to the /mnt/data/ directory of the container.
- Third-party Library Settings: The sample code depends on the flask library, which is not included in the official image. Add flask to Third-party Libraries so that it is installed automatically on startup.
- Resource Configuration: Configure the computing resources for the service. A small CPU instance is sufficient for this example.
  - Resource Type: Public Resources.
  - Instance Type: ecs.c7.large.
After completing the configuration, click Deploy. The deployment is complete when the service status changes to Running. Then proceed to invoke the service.
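Once the service is Running, you can find its endpoint URL and token in the service's invocation information in the console. A minimal invocation sketch with the requests library follows; the endpoint and token below are placeholders, and the /predict path matches the illustrative app.py above.

```python
# Call the deployed EAS service over HTTP (pip install requests).
# The endpoint URL and token are placeholders; copy the real values from
# the service's invocation information in the console.
import requests

url = "http://<service-endpoint>/predict"   # /predict matches the sample app.py
token = "<service-token>"

resp = requests.post(url, headers={"Authorization": token}, data="hello EAS", timeout=10)
print(resp.status_code, resp.text)
```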
References
Manage environment and dependencies
In the Environment Information section, configure the environments and dependencies for the service.
Parameter | Description |
Image Configuration | The image serves as the basic runtime environment for the service. Use an official image provided by PAI or a custom-built image by selecting Custom Image or entering an Image Address. For more information, see Custom Images. Note If the image contains a WebUI, select Enable Web App, and EAS automatically starts a web server to provide direct access to the frontend page. |
Mount storage | Mounting models, code, or data from cloud storage services such as OSS and NAS to a local path in a container decouples the code or data from the environment, allowing for independent updates. For more information, see Storage mounting. |
Mount dataset | If you want to perform version management for your models or data, use the Dataset to mount them. For more information, see Create and manage datasets. |
Command | Set the image startup command, such as python /mnt/data/app.py in the quick-start example. |
Port Number | Specifies the listening port for the service. This parameter is optional in some scenarios. For example, the port number is optional if a service receives messages by subscribing to a message queue in the runtime image instead of relying on traffic from the EAS gateway. Important The EAS engine reserves ports 8080 and 9090. To prevent port conflicts that can cause service startup failures, do not use these ports when you deploy a service. |
Third-party Library Settings | If you only need to install a few additional Python libraries, add the library names directly or specify the path to a requirements.txt file to avoid rebuilding the image. |
Environment Variables | Set environment variables for the service instance as key-value pairs. A small example of reading these settings from service code follows this table. |
For GPU-accelerated instances, specify the GPU driver version through Features > Resource Configuration > GPU driver version to meet the runtime requirements of specific models or frameworks.
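As a small illustration of how the Mount Storage and Environment Variables settings surface inside the container, the service code can read them like any local path or process environment variable. The variable name MODEL_DIR and the config.json file below are examples, not settings defined by EAS.

```python
# Reading deployment configuration from inside the container.
# /mnt/data is the mount path configured in the console; MODEL_DIR is an
# example environment variable you might define in Environment Variables.
import os

model_dir = os.environ.get("MODEL_DIR", "/mnt/data")
config_path = os.path.join(model_dir, "config.json")

if os.path.exists(config_path):
    print("Found config at", config_path)
else:
    print("No config file; falling back to defaults")
```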
Configure computing resources
In the Resource Information section, configure a service's computing resources.
Parameter | Description |
Resource Type | Select public resources, EAS resource groups, or resource quotas. Note Enable GPU Sharing to deploy multiple model services on a single GPU card and improve GPU resource utilization by sharing computing power. This feature is suitable for use cases with small models or low inference payloads. It can be enabled only when you use an EAS resource group or a resource quota. For more information, see GPU Sharing. |
Replicas | Configure multiple instances to avoid the risk of a single point of failure. |
Deployment | When using public resources for supported specifications, enable bidding and set a maximum bid price to preempt idle resources at a price much lower than that of regular instances. This feature is suitable for inference tasks that are not sensitive to interruptions. |
Configure a System Disk |
Elastic Resource Pool | This feature enables cloud bursting. When traffic exceeds the capacity of your dedicated resources (EAS resource groups or quotas), the service automatically bursts to on-demand public resources to handle the spike. During scale-in, these public instances are released first to minimize costs. For more information, see Elastic resource pool. |
Specify Node Scheduling | This setting applies only when you use EAS resource groups or resource quotas. |
High-priority Resource Rescheduling | After you enable this feature, the system periodically tries to migrate service instances from low-priority resources, such as public resources or regular instances, to high-priority resources, such as dedicated resource groups or spot instances. This optimizes costs and resource allocation. |
Service registration and network
EAS provides flexible service registration and network configuration options to meet different business integration needs. For more information, see Service Invocation.
Parameter | Description |
Select Gateway | By default, services are exposed through a complimentary Shared Gateway. For advanced capabilities, including custom domain names and fine-grained access control, an upgrade to a Dedicated Gateway is available as a paid option. For more information, see Invoke through a dedicated gateway. Important When you call a service through a gateway, the request body cannot exceed 1 MB. |
VPC | Configure a VPC, vSwitch, and security group to enable direct access to the service from within the VPC, or to allow the service to access Internet resources. For more information, see Enable EAS to access the internet and private resources. |
Associate NLB | Associate a service with a Network Load Balancer (NLB) instance to achieve more flexible and fine-grained load balancing. For more information, see Invoke services using associated NLB instances. |
Service Discovery Nacos | Register services to the Microservices Registry to enable automatic service discovery and synchronization in a microservices model. For more information, see Call a service using Nacos. |
To enable high-performance RPC communication, activate gRPC support for the service gateway through Features > Advanced Networking > Enable gRPC.
Service security
To enhance service security, configure the following parameters in the Features section:
Parameter | Description |
Custom Authentication | If you do not want to use the system-generated token, customize the authentication token for service access here. |
Configure Secure Encryption Environment | Integrates with confidential computing services to run inference within a secure enclave, encrypting data, models, and code while in use. This feature is primarily for mounted storage files. You must mount the storage before you enable this feature. For more information, see Secure Encrypted Inference Service. |
Instance RAM Role | After associating a RAM role with an instance, the code within the service can use STS temporary credentials to access other cloud resources. This eliminates the need to configure a fixed AccessKey and reduces the risk of key leakage. For more information, see Configure an EAS RAM role. |
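As an illustration of what an instance RAM role enables, code inside the service can build cloud clients from STS temporary credentials instead of a fixed AccessKey. The sketch below assumes the temporary credentials have already been obtained by the mechanism described in Configure an EAS RAM role and, as a placeholder, reads them from environment variables; the variable names and bucket details are illustrative.

```python
# Access OSS from inside the service with STS temporary credentials.
# How the credentials reach your code depends on the RAM role setup; here
# they are read from environment variables as a placeholder mechanism.
import os
import oss2

sts_auth = oss2.StsAuth(
    os.environ["STS_ACCESS_KEY_ID"],        # temporary AccessKey ID
    os.environ["STS_ACCESS_KEY_SECRET"],    # temporary AccessKey secret
    os.environ["STS_SECURITY_TOKEN"],       # session token
)
bucket = oss2.Bucket(sts_auth, "https://oss-cn-hangzhou.aliyuncs.com", "examplebucket")
print(bucket.get_object("code/app.py").read()[:100])
```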
Ensure service stability and high availability
The group feature in the Basic Information section lets you group multiple service versions or services that use heterogeneous resources. These groups can then be used with traffic management policies to implement phased releases. For more information, see Phased Release.
To ensure the stability and reliability of services in your production environment, configure the following parameters in the Features sections:
Parameter | Description |
Service Response Timeout Period | Configure an appropriate timeout for each request. The default is 5 seconds. This prevents slow requests from occupying service resources for a long time. |
Health Check | When you configure a Health Check for a service, the system periodically checks the health status of its instances. If an instance becomes abnormal, the system automatically launches a new instance to enable self-healing. For more information, see Health check. |
Compute monitoring & fault tolerance | The platform monitors the real-time health of the computing power for distributed inference services to automatically detect faults and perform intelligent self-healing. This ensures high service availability and stability. For more information, see Computing power check and fault tolerance. |
Graceful Shutdown | Configure the Graceful Shutdown Time to ensure that, during service updates or scale-ins, an instance has sufficient time to finish processing the requests it has already received before it exits. This prevents request processing from being interrupted. Enable Send SIGTERM for more fine-grained exit handling at the application layer, as illustrated in the sketch after this table. For more information, see Rolling updates and graceful exit. |
Rolling Update | By configuring Exceeds the expected number of replicas and Maximum Unavailable Replicas, you gain fine-grained control over the instance replacement policy during a service update and can complete a version upgrade without interrupting the service. For more information, see Rolling updates and graceful exit. |
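For the Graceful Shutdown row above, a minimal sketch of application-level SIGTERM handling in the sample Flask service might look like this; the flag-based draining logic is illustrative, and the actual cleanup depends on your application.

```python
# Application-level handling of SIGTERM for graceful shutdown.
# When Send SIGTERM is enabled, the instance is signaled before it exits;
# the handler below stops accepting new work and lets in-flight requests finish.
import signal
import threading

shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    # Mark the process as draining; a health-check route can start returning
    # failure so no new requests are routed here while existing ones complete.
    shutting_down.set()
    print("SIGTERM received, draining in-flight requests")

signal.signal(signal.SIGTERM, handle_sigterm)
```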
Performance optimization
The following configurations are crucial for improving service performance. These configurations can accelerate startup, increase throughput, and reduce latency, especially for resource-intensive applications such as large models.
Category | Parameter | Description |
Storage Acceleration | Distributed cache acceleration | Caches models or data files from mounted storage, such as OSS, to the local instance to improve read speeds and reduce I/O latency. For more information, see Cache data to local directories. |
Storage Acceleration | Model Weight Service (MoWS) | Significantly improves scaling efficiency and service startup speed in large-scale instance deployment scenarios through local caching and cross-instance sharing of model weights. For more information, see Model Weight Service. |
Resource Configuration | Shared Memory | Configure shared memory for an instance. This allows multiple processes within the container to directly read and write to the same memory region, avoiding the overhead of data copying and transmission. This is suitable for scenarios that require efficient inter-process communication. |
Resource Configuration | Distributed Inference | A single inference instance is deployed on multiple machines to jointly complete an inference task. This solves the problem that ultra-large models cannot be deployed on a single machine. For more information, see Multi-machine distributed inference. |
Intelligent Scheduling | LLM Intelligent Router | When an LLM service has multiple backend instances, the LLM Intelligent Router dynamically distributes requests based on backend load to balance computing power and GPU memory usage across all instances and improve cluster resource utilization. For more information, see LLM intelligent router. |
Service observation and diagnostics
To gain insights into your service status and quickly troubleshoot problems, enable the following features in the Features section:
Parameter | Description |
Save Call Records | Persistently save all request and response records of the service in MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting. |
Tracing Analysis | Some official images include a built-in collection component, allowing you to enable tracing with a single click. For other images, integrate an ARMS probe through simple configuration to achieve end-to-end monitoring of service invocations. For more information, see Enable Tracing for LLM-based Applications in EAS. |
Asynchronous and elastic services
Asynchronous inference: For long-running inference scenarios, such as AIGC and video processing, enable the Asynchronous Queue. This allows a client to receive an immediate response after initiating a call and obtain the final result using polling or a callback. For more information, see Deploy an asynchronous inference service.
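The client-side pattern for an asynchronous service is submit-then-poll. The sketch below is purely illustrative of that pattern: the URLs, headers, and response field names are placeholders, and the actual protocol, endpoints, and callback option are defined in Deploy an asynchronous inference service.

```python
# Illustrative submit-then-poll pattern for an asynchronous inference service.
# All URLs and response fields are placeholders; consult
# "Deploy an asynchronous inference service" for the actual protocol.
import time
import requests

headers = {"Authorization": "<service-token>"}

# 1. Submit the task and record its identifier.
submit = requests.post("http://<async-service-endpoint>/submit",
                       headers=headers, data="input payload", timeout=10)
task_id = submit.json()["task_id"]          # placeholder field name

# 2. Poll until the result is ready (a callback URL is the alternative).
while True:
    result = requests.get(f"http://<async-service-endpoint>/result/{task_id}",
                          headers=headers, timeout=10)
    if result.json().get("status") == "done":   # placeholder field name
        print(result.json()["output"])
        break
    time.sleep(5)
```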
Elastic Job service: In the Features section, enable Task Mode to run inference workloads as on-demand, serverless jobs. Resources are automatically provisioned for the task and released upon completion to save costs. For more information, see Elastic Job Service.
Modify configuration parameters in the JSON file
In the Service Configuration section of the deployment page, you can view and directly edit the complete JSON that corresponds to the current UI configuration.
For automated and fine-grained configuration scenarios, you can also use a JSON file to directly define and deploy services. For more information, see JSON deployment.
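As a rough illustration of what such a configuration contains for the quick-start service, the sketch below builds an approximate payload in Python and prints it as JSON. The field names are an assumption that mirrors the UI settings above, not the authoritative schema; verify them against the JSON deployment reference before use.

```python
# Approximate shape of an image-based EAS service configuration.
# Field names here mirror the UI settings from the quick start but are an
# assumption; check the JSON deployment reference for the authoritative schema.
import json

service_config = {
    "metadata": {"name": "flask_demo", "instance": 1},
    "cloud": {"computing": {"instance_type": "ecs.c7.large"}},
    "containers": [{
        "image": "<python-inference image address>",
        "script": "python /mnt/data/app.py",
        "port": 8000,
    }],
    "storage": [{
        "oss": {"path": "oss://examplebucket/code/"},
        "mount_path": "/mnt/data/",
    }],
}

print(json.dumps(service_config, indent=2))
```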