Platform for AI: Console-based custom deployment

Last Updated: May 27, 2026

EAS custom deployment lets you package any algorithm or model as an online inference service using custom containers, processors, or framework-specific configurations.

Note

For use cases such as LLMs and ComfyUI, first try the option described in Deploy pre-built AI services. Use a custom deployment only if that option does not meet your needs.

Quick start: Deploy a simple web service

Use image-based deployment to quickly deploy a simple web service.

Step 1: Prepare the code file

Save the following Flask application code as an app.py file. The service listens on port 8000.

app.py

from flask import Flask

app = Flask(__name__)

@app.route('/hello')
def hello_world():
    # TODO: Implement model inference or other business logic
    return 'Hello World'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

Step 2: Upload the code to OSS

Upload the app.py file to an OSS Bucket. Ensure the OSS Bucket and EAS workspace are in the same region. For example, upload to oss://examplebucket/code/.

Step 3: Configure and deploy the service

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. Configure the key parameters in the Environment Information and Resource Information sections:

    • Deployment Method: Select Image-based Deployment.

    • Image Configuration: Select the Alibaba Cloud Image python-inference:3.9-ubuntu2004.

    • Mount Storage: Click OSS and mount the OSS directory that contains app.py to the /mnt/data/ path in the container.

      • Uri: The path to the OSS directory that contains the code, such as oss://examplebucket/code/.

      • Mount Path: The path in the container to the mounted directory. For example, /mnt/data/.

    • Command: Enter python /mnt/data/app.py as the startup command, because app.py is mounted at /mnt/data/.

    • Third-party Library Settings: The sample code depends on the flask library, which is not included in the official image. Add flask to the third-party libraries so that it is installed automatically at startup.

    • Resource Configuration: Configure the computing resources for the service. For this example, a small CPU instance is sufficient.

      • Resource Type: Public Resources.

      • Resource Specification: ecs.c7.large.

  4. After completing configuration, click Deploy. Deployment is complete when the service status changes to Running. Then invoke the service.
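Once the service is Running, a call to the /hello route might look like the minimal sketch below. It assumes the service token is sent in the Authorization header. The endpoint and token are placeholders, so copy the real values from the service's invocation information in the console. The helper names build_request and call_service are illustrative only.

```python
import urllib.request


def build_request(endpoint: str, token: str, path: str = "/hello") -> urllib.request.Request:
    """Build a GET request for a route of the deployed service.

    endpoint and token are placeholders supplied by the caller; the
    Authorization header format is an assumption to verify in the console.
    """
    url = endpoint.rstrip("/") + path
    return urllib.request.Request(url, headers={"Authorization": token}, method="GET")


def call_service(endpoint: str, token: str, timeout: float = 5.0) -> str:
    """Send the request and return the response body as text."""
    request = build_request(endpoint, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the sample app.py, call_service(endpoint, token) would be expected to return the string produced by hello_world.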

References

Manage environment and dependencies

In the Environment Information section, configure service environments and dependencies.

Parameter

Description

Image Configuration

Provides the runtime environment. Use an official PAI image, or use a custom-built image by selecting Custom Image or entering an Image Address. For more information, see Custom Images.

Note

If the image contains a WebUI, select Enable Web App, and EAS automatically starts a web server to provide direct access to the frontend page.

Mount storage

Mount models, code, or data from OSS or NAS to container paths so that they can be updated independently of the image. For more information, see Storage mounting.

Mount dataset

For model or data versioning, use Dataset mounting. For more information, see Create and manage datasets.

Command

Set the image startup command, such as python /run.py.

Port Number

The listening port. Optional in some scenarios.

For example, skip this if the service receives messages through a message queue in the runtime image instead of the EAS gateway.

Important

The EAS engine reserves ports 8080 and 9090. To prevent port conflicts that can cause service startup failures, do not use these ports when you deploy a service.

Third-party Library Settings

To install additional Python libraries without rebuilding the image, add library names directly or specify a Path of requirements.txt.

Environment Variables

Set environment variables for the service instance as key-value pairs.

For GPU-accelerated instances, specify the GPU driver version through Features > Resource Configuration > GPU driver version to match your model or framework requirements.
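The Port Number and Environment Variables settings above can be combined in application code. The following sketch reads the listening port from an environment variable and rejects the two ports that the EAS engine reserves. The variable name SERVICE_PORT and the function resolve_port are hypothetical and not defined by EAS. Set the variable yourself in Environment Variables if you adopt this pattern.

```python
import os

# Ports reserved by the EAS engine, as stated in the Port Number description.
RESERVED_PORTS = {8080, 9090}


def resolve_port(default: int = 8000, env_name: str = "SERVICE_PORT") -> int:
    """Return the port to listen on, taken from env_name or the default.

    SERVICE_PORT is a hypothetical variable name chosen for this sketch.
    """
    port = int(os.environ.get(env_name, default))
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} is outside the valid range 1-65535")
    if port in RESERVED_PORTS:
        raise ValueError(f"port {port} is reserved by the EAS engine")
    return port
```

In the quick start application, this could replace the hard-coded value, for example app.run(host='0.0.0.0', port=resolve_port()).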

Configure computing resources

In the Resource Information section, configure a service's computing resources.

Parameter

Description

Resource Type

Select public resources, EAS resource groups, or resource quotas.

Note

Enable GPU Sharing to run multiple services on a single GPU. This is suitable for small models or light inference workloads and is available only with EAS resource groups or resource quotas. For more information, see GPU Sharing.

Replicas

Configure multiple instances to avoid the risk of a single point of failure.

Deployment

When using public resources for supported specifications, enable bidding and set a maximum bid price to preempt idle resources at a price much lower than that of regular instances. Suitable for tasks that tolerate interruptions.

Configure a System Disk

  • If you use public resources, EAS provides a 30 GiB system disk for free. If you need more capacity, you are charged for the actual usage.

  • If you use an EAS resource group or a resource quota, a 60 GiB system disk is enabled by default. If you change the capacity, the space is allocated from the host.

Elastic Resource Pool

Enables cloud bursting. When traffic exceeds the capacity of your dedicated resources (EAS resource groups or quotas), the service bursts to public resources. During scale-in, burst instances are released first to minimize costs. For more information, see Elastic resource pool.

Specify Node Scheduling

This setting applies only when you use EAS resource groups or resource quotas.

  • If you specify nodes, the service uses only those nodes.

  • If you do not specify any nodes, the service can use any non-excluded node.

High-priority Resource Rescheduling

Periodically migrates instances from low-priority resources (public or regular) to high-priority resources (dedicated groups or spot instances) to optimize costs. Use cases:

  • Prevents new instances from being scheduled to the public resource group during rolling updates.

  • Migrates regular instances to spot instances when available to reduce costs.

Service registration and network

EAS supports flexible service registration and network options. For more information, see Service Invocation.

Parameter

Description

Select Gateway

By default, services use a free Shared Gateway. For custom domains and fine-grained access control, upgrade to a paid Dedicated Gateway. For more information, see Invoke through a dedicated gateway.

Important

When you call a service through a gateway, the request body cannot exceed 1 MB.

VPC

Configure a VPC, vSwitch, and security group to enable direct VPC access or allow services to reach Internet resources. For more information, see Configure network access.

Associate NLB

Associate a service with a Network Load Balancer (NLB) instance for flexible load balancing. For more information, see Invoke services using associated NLB instances.

Service Discovery Nacos

Register services to the Microservices Registry to enable automatic service discovery and synchronization in a microservices model. For more information, see Call a service using Nacos.

To enable high-performance RPC communication, activate gRPC support for the service gateway through Features > Advanced Networking > Enable gRPC.
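Because requests sent through a gateway are limited to a 1 MB body, a client can check the encoded payload size before sending. The sketch below is one way to do that. It treats 1 MB as 1,000,000 bytes, which is a conservative assumption because the exact byte threshold is not stated here. The function name encode_body is illustrative.

```python
import json

# Conservative reading of the 1 MB gateway limit (an assumption; the exact
# byte threshold enforced by the gateway is not specified in this document).
MAX_BODY_BYTES = 1_000_000


def encode_body(payload: dict) -> bytes:
    """Serialize payload to UTF-8 JSON and fail early if it is too large."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(
            f"request body is {len(body)} bytes, above the {MAX_BODY_BYTES}-byte limit"
        )
    return body
```

For larger inputs, one common approach is to upload the data to storage such as OSS and pass only a reference in the request body.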

Service security

Configure security parameters in the Features section:

Parameter

Description

Custom Authentication

Customize the authentication token instead of using the system-generated one.

Configure Secure Encryption Environment

Integrates with confidential computing services to run inference within a secure enclave, encrypting data, models, and code in use. Storage must be mounted before you enable this feature. For more information, see Secure Encrypted Inference Service.

Instance RAM Role

Associate a RAM role with an instance to use STS temporary credentials for accessing other cloud resources. This eliminates the need to configure a fixed AccessKey and reduces the risk of key leakage. For more information, see Configure an EAS RAM role.

Ensure service stability and high availability

Use the group feature in the Basic Information section to group service versions or services that use heterogeneous resources, then apply traffic management policies for phased releases. For more information, see Phased Release.

To ensure the stability and reliability of services in your production environment, configure the following parameters in the Features section:

Parameter

Description

Service Response Timeout Period

Set the per-request timeout (default: 5 seconds) to prevent slow requests from blocking resources.

Health Check

Configure a Health Check to periodically check instance health and automatically launch a new instance to replace abnormal ones. For more information, see Configure health checks.

Compute monitoring & fault tolerance

Monitors the computing health of distributed inference services in real time, with automatic fault detection and self-healing. For more information, see Compute monitoring and fault tolerance.

Graceful Shutdown

Set the Graceful Shutdown Time so that instances can finish in-flight requests during updates or scale-ins instead of being interrupted. Enable Send SIGTERM if the application handles its own exit logic. For more information, see Rolling updates and graceful exit.

Rolling Update

Configure the "Exceeds the expected number of replicas" and "Maximum Unavailable Replicas" parameters to control how instances are replaced during a version upgrade, so that the upgrade completes without downtime. For more information, see Rolling updates and graceful exit.
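When Send SIGTERM is enabled under Graceful Shutdown, the application needs its own handler to exit cleanly. The following is a minimal sketch of that pattern using Python's standard signal module. The worker loop, the process_next callback, and the function names are hypothetical stand-ins for real request handling.

```python
import signal
import threading

# Set once SIGTERM arrives; the worker loop checks it between requests.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: record the request to stop, do not exit immediately."""
    shutdown_requested.set()


def install_handler() -> None:
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(process_next, idle_wait: float = 0.1) -> int:
    """Call process_next() until shutdown is requested.

    process_next is a hypothetical callback that handles one request and
    returns True if it did work. Returns the number of requests handled.
    """
    handled = 0
    while not shutdown_requested.is_set():
        if process_next():
            handled += 1
        else:
            # Nothing to do: sleep briefly, waking early on shutdown.
            shutdown_requested.wait(idle_wait)
    return handled
```

The request that is running when the signal arrives is allowed to finish, and the loop stops before starting another one. The Graceful Shutdown Time should be at least as long as the slowest request the loop is expected to complete.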

Performance optimization

These configurations accelerate startup, increase throughput, and reduce latency, especially for resource-intensive applications such as large models.

Parameter

Description

Storage Acceleration

Distributed cache acceleration

Caches mounted storage files locally to improve read speeds and reduce I/O latency. For more information, see Cache data to local directories.

Model Weight Service (MoWS)

Improves scaling efficiency and service startup speed in large-scale deployments by caching model weights locally and sharing them across instances. For more information, see Model Weight Service.

Resource Configuration

Shared Memory

Allocates shared memory for efficient inter-process communication within the container, allowing multiple processes to directly read and write to the same memory region and avoiding data copy overhead.

Distributed Inference

Deploys a single inference instance across multiple machines for models that exceed single-machine capacity. For more information, see Multi-node distributed inference.

Intelligent Scheduling

LLM Intelligent Router

For multi-instance LLM services, the LLM Intelligent Router distributes requests based on backend load to balance computing power and GPU memory usage across instances. For more information, see LLM intelligent router.
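To illustrate what the Shared Memory setting above makes room for, the sketch below uses Python's multiprocessing.shared_memory module to write bytes into a named block and read them back through a second handle, the way another process in the same container would. It is a generic illustration rather than an EAS API, and whether a given inference framework uses this mechanism depends on that framework. The helper names are illustrative.

```python
from multiprocessing import shared_memory


def write_block(data: bytes) -> shared_memory.SharedMemory:
    """Create a shared memory block holding data (data must be non-empty)."""
    block = shared_memory.SharedMemory(create=True, size=len(data))
    block.buf[:len(data)] = data
    return block


def read_block(name: str, length: int) -> bytes:
    """Attach to an existing block by name and copy out length bytes.

    length is passed explicitly because the operating system may round the
    block size up to a page boundary.
    """
    view = shared_memory.SharedMemory(name=name)
    try:
        return bytes(view.buf[:length])
    finally:
        view.close()
```

The creator of a block is responsible for calling close() and then unlink() when the data is no longer needed. If the container's shared memory allocation is smaller than the requested size, creating or writing to the block can fail.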

Service observation and diagnostics

To monitor service health and troubleshoot issues, enable the following in the Features section:

Parameter

Description

Save Call Records

Persistently save all request and response records to MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting.

  • MaxCompute

    • MaxCompute Project: Select an existing project. If the list is empty, click Create MaxCompute Project to create one. For more information, see Create a project in the MaxCompute console.

    • MaxCompute Table: Specify the data table name. The system automatically creates this table in the selected project when you deploy the service.

  • Simple Log Service

    • Simple Log Service Project: Select an existing log project. If the list is empty, click Go to Create SLS Project to create one. For more information, see Manage Project.

    • Logstore: Specify a name for the Logstore. When you deploy the service, the system automatically creates this Logstore in the selected project.

Tracing Analysis

Some official images include a built-in collection component, allowing you to enable tracing with a single click. For other images, integrate an ARMS probe for end-to-end monitoring. For more information, see Enable Tracing for LLM-based Applications in EAS. To integrate the probe, configure the following:

  • Add aliyun-bootstrap -a install && aliyun-instrument python app.py to the command to install the probe and start the application using the ARMS Python probe. Replace app.py with the main file used to provide the prediction service.

  • Add aliyun-bootstrap to the Third-party Library Settings to download the probe installer from the PyPI repository.

Asynchronous and elastic services

  • Asynchronous inference: For long-running tasks such as AIGC and video processing, enable the Asynchronous Queue. This allows a client to receive an immediate response and obtain the final result through polling or a callback. For more information, see Deploy an asynchronous inference service.

  • Elastic Job service: In the Features section, enable Task Mode to run inference workloads as on-demand, serverless jobs. Resources are automatically provisioned for the task and released upon completion to save costs. For more information, see Elastic Job Service.
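The polling side of asynchronous inference can be sketched generically as follows. The fetch callable is a hypothetical stand-in for whatever call retrieves a result from the output queue, and it is assumed to return None until the result is ready. The actual EAS queue API is not shown in this document and is not modeled here.

```python
import time
from typing import Callable, Optional


def poll_result(
    fetch: Callable[[], Optional[str]],
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns a result or attempts run out.

    fetch is a hypothetical callable that returns None while the task is
    still running. sleep is injectable so the loop can be tested quickly.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"no result after {max_attempts} attempts")
```

A fixed interval keeps the sketch short. For long-running tasks, a callback or a longer, increasing interval reduces the number of wasted requests.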

Modify configuration parameters in the JSON file

In the Service Configuration section, view and directly edit the complete JSON for the current UI configuration.

Note

For automated and fine-grained configuration scenarios, you can also use a JSON file to directly define and deploy services. For more information, see JSON deployment.