Platform for AI (PAI) is a one-stop platform for model development, training, and deployment. PAI provides Elastic Algorithm Service (EAS) for online inference: you can deploy model services in the public resource group or in dedicated resource groups. EAS loads models on heterogeneous hardware (CPUs and GPUs) and responds to data requests in real time.
EAS service architecture
EAS is an online model service platform that allows you to deploy models as online inference services or AI-powered web applications in a few clicks. EAS provides features such as automatic scaling and blue-green deployment, which allow you to run highly concurrent and stable online model services at lower costs. EAS also supports resource group management, versioning, and resource monitoring, which help you operate model services in production. You can use EAS in various AI inference scenarios, such as real-time inference and near-real-time asynchronous inference. EAS also provides comprehensive O&M capabilities.
Infrastructure: EAS supports heterogeneous hardware (CPUs or GPUs) and provides General Unit (GU) specifications and preemptible instances intended for AI scenarios. This helps you reduce costs and improve efficiency.
Container scheduling: EAS helps you manage cluster resources in a more efficient manner during business peaks and off-peaks through the following features:
Automatic scaling: The system automatically adjusts the number of service instances to handle business peaks and off-peaks. Automatic scaling helps you manage the computing resources of online services to avoid resource waste.
Scheduled scaling: The system automatically adjusts the number of service instances to a specified number at scheduled times. Scheduled scaling helps you avoid resource waste in scenarios where the service load can be estimated in advance.
Elastic resource pool: If the dedicated resource group that you use to deploy services is fully occupied, the system automatically launches pay-as-you-go instances in the public resource group during scale-outs. An elastic resource pool helps ensure service stability.
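The automatic scaling behavior described above can be illustrated with a small sketch. The function below is an illustrative decision rule only; the function name and the QPS-based threshold are assumptions for this example, not EAS's actual scaling algorithm:

```python
import math

def desired_instances(current_qps: float, target_qps_per_instance: float,
                      min_instances: int, max_instances: int) -> int:
    """Return the instance count needed to serve current_qps, clamped to the
    configured range, so idle capacity is released during off-peak hours."""
    if current_qps <= 0:
        return min_instances
    needed = math.ceil(current_qps / target_qps_per_instance)
    return max(min_instances, min(max_instances, needed))
```

With a target of 100 QPS per instance and a range of 2 to 20 instances, a load of 950 QPS yields 10 instances, while zero load falls back to the minimum of 2.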
Model deployment: EAS provides multiple features that simplify deployment, monitor services in real time, and help you manage resources efficiently. The following features related to service deployment and publishing are supported:
One-click stress testing: During stress testing, the system automatically adds loads to find the upper load limit for a service. You can also view real-time monitoring data that is accurate to seconds and the testing report.
Canary release: You can add multiple services to a single canary release group, in which certain services are used in the production environment, while others are used in the canary release environment. You can also switch the traffic that is distributed to each service. This way, you can perform canary release in a more flexible manner.
Real-time monitoring: After you deploy a service, you can view the metrics on the Monitoring page to obtain the service status. The metrics include the queries per second (QPS), response time, and CPU utilization.
Traffic mirroring: You can use traffic mirroring to mirror the traffic of the current service to the destination service in proportion. The current service is not interrupted during this process. Traffic mirroring is used to test the performance and reliability of new services.
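The proportional mirroring idea can be sketched as follows. This is an illustrative simulation of the concept, not the EAS implementation; serve and mirror are hypothetical stand-ins for the current service and the destination service:

```python
import random

def route_with_mirror(request, serve, mirror, mirror_ratio=0.2):
    """Serve every request from the current service; additionally copy a fraction
    of requests to the destination service. The mirror copy never affects the
    response, so the current service is not interrupted."""
    response = serve(request)            # production path is always served
    if random.random() < mirror_ratio:   # e.g. mirror 20% of traffic
        mirror(request)                  # fire-and-forget copy for testing
    return response
```

Setting mirror_ratio to 0 disables mirroring entirely, and 1.0 copies every request to the destination service.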
Inference: EAS supports the following inference capabilities:
Real-time synchronous inference: suitable for scenarios such as personalized search and recommendation and intelligent conversation. Real-time synchronous inference features high throughput and low latency without affecting your online business. The system can also adapt the deployment mode to your business requirements to achieve optimal results.
Near-real-time asynchronous inference: suitable for scenarios such as text-to-image generation and video processing. A queue service is integrated into the inference service, which enables the service to scale based on business requirements and reduces O&M overhead.
You can deploy a model by using an image or a processor in EAS.
Model deployment by using an image (recommended)
When you use an image to deploy a model, EAS pulls the environment image from Container Registry (ACR) and mounts storage services, such as Object Storage Service (OSS) and Apsara File Storage NAS (NAS), to obtain the required files. The files include the runtime environment, the model, and related files such as the code that is used to process the model.
The following figure describes the workflow of deploying a model by using an image in EAS.
Take note of the following items:
EAS supports two methods of model deployment by using an image: Deploy Service by Using Image and Deploy Web App by Using Image.
Deploy Service by Using Image: Use an image to deploy a model service. After you deploy the service, you can call the service by calling an API operation.
Deploy Web App by Using Image: Use an image to deploy a web application. After you deploy the application, you can access the application by using a link.
For more information about the two deployment methods, see the Step 2: Deploy the service section of this topic.
PAI provides multiple official images for model deployment. You can also develop a model and create an image based on your business requirements. You must upload the created image to ACR for easy deployment.
We recommend that you upload the model and the code files that are used to process the models to storage services in the cloud and mount the storage services instead of packaging the model into a custom image. This allows you to update the model in a convenient manner.
When you use an image to deploy model services, we recommend that you build an HTTP server. After you deploy services in EAS, EAS forwards the call requests to the HTTP server that you developed. You cannot specify ports 8080 and 9090 for the HTTP server because the EAS engine listens on ports 8080 and 9090.
If you use a custom image for deployment, you must upload the image to ACR before you use the image. Otherwise, the system may fail to pull the image during service deployment. If you use Data Science Workshop (DSW) for model development and training, you must also upload the resulting image to ACR before you can use it in EAS.
If you want to manage custom images or prefetch data in other scenarios, you can use AI Computing Asset Management in PAI for centralized management. Note that CPFS datasets of NAS are not supported in EAS.
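A minimal HTTP server for image-based deployment might look like the following sketch, built with only the Python standard library. The /predict path and port 8000 are example choices; the only hard requirement from the notes above is to avoid ports 8080 and 9090, which the EAS engine reserves:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    """Minimal inference endpoint: parses a JSON payload and returns a JSON result."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Replace this stub with real model inference.
        result = {"prediction": payload.get("input")}
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen on port 8000; do not use 8080 or 9090, which EAS listens on.
    HTTPServer(("0.0.0.0", 8000), PredictHandler).serve_forever()
```

After deployment, EAS forwards incoming calls to this server, so the handler's request and response format defines your service's API.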
Model deployment by using a processor
After you prepare the model and processor file, you can upload the files to the storage services, such as OSS or NAS, and mount the storage services to EAS. EAS can obtain the files for deployment.
The following figure describes the workflow of deploying a model by using a processor in EAS.
Take note of the following items:
PAI provides multiple official processors for model deployment. You can also develop a model and a custom processor file based on your business requirements and upload the model and the processor to OSS or NAS.
We recommend that you develop and store the model and the processor file separately. You can configure the mount path when you deploy the model and use the get_model_path parameter in the processor file to obtain the specified model path. This allows you to update the model in a convenient manner.
When you use a processor to deploy a model service, EAS automatically pulls an official image based on your inference framework to deploy the service, and deploys an HTTP server based on the processor file to receive service calls.
If you use a processor to deploy a model service, make sure that the inference framework of the model and the processor file match the requirements of the runtime environment. This deployment method is not as flexible or efficient as image-based deployment. Therefore, we recommend that you use an image to deploy the model.
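The recommended separation of model file and processor logic can be sketched in plain Python. This is an illustration of the pattern only, not the actual EAS processor interface; the class name, method names, and file names here are hypothetical:

```python
import json
import os

class SketchProcessor:
    """Illustrative processor: loads a model from a mounted path, then serves requests."""

    def __init__(self, model_dir: str):
        self.model_dir = model_dir
        self.model = None

    def get_model_path(self) -> str:
        # In EAS, the mount path comes from the storage mount you configure at
        # deployment time, so updating the model does not require a new processor.
        return os.path.join(self.model_dir, "model.json")

    def initialize(self):
        # Called once at service startup to load the model into memory.
        with open(self.get_model_path()) as f:
            self.model = json.load(f)

    def process(self, request: dict) -> dict:
        # Called per request; here the "model" is just a lookup table.
        return {"output": self.model.get(str(request.get("input")), "unknown")}
```

Because the model path is resolved at startup, you can swap the model file in the mounted storage without touching the processor code.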
EAS uses resource groups to isolate resources in a cluster. When you create a model service, you can choose to deploy the model service in the default public resource group or in a dedicated resource group that you purchased.
For more information about EAS resource groups, see Overview of EAS resource groups.
Model services are resident services that are deployed based on model files and online prediction logic. You can create, update, start, stop, scale out, and scale in model services.
Model files are offline models that are obtained after offline training. Different frameworks provide models in different formats. In general, a model file is deployed together with a processor to provide a model service.
A processor is a package that contains online prediction logic. A processor is deployed together with a model file to provide a model service. EAS provides built-in processors for Predictive Model Markup Language (PMML), TensorFlow SavedModel, and Caffe models.
If the built-in processors of EAS cannot meet your service deployment requirements, you can use custom processors to flexibly deploy services. EAS allows you to develop custom processors for C++, Java, and Python.
Service instances are entities that allow you to deploy and manage model services. You can deploy multiple service instances for each service. This helps increase the maximum number of concurrent requests that a service can handle. Service instances are deployed in resource groups. If a resource group contains multiple Elastic Compute Service (ECS) instances, EAS automatically distributes the service instances to different ECS instances. This ensures high service availability.
High-speed direct connection
High-speed direct connection is a network access mode supported by EAS. After the resource groups of EAS are connected to your virtual private cloud (VPC), clients can directly access the model services in EAS through your VPC without passing through gateways. This greatly improves access performance and reduces latency.
Limits on regions
EAS is supported in the zones in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Heyuan), China (Chengdu), China (Hong Kong), Singapore, Indonesia (Jakarta), India (Mumbai), US (Silicon Valley), US (Virginia), and Germany (Frankfurt).
EAS resource group
EAS allows you to deploy model services in the public resource group or dedicated resource groups. For more information, see Billing of EAS.
If you use the public resource group, you are charged based on the amount of resources that are used by your model services.
If you use a dedicated resource group, you are charged based on the ECS instances in the resource group. Both the subscription and pay-as-you-go billing methods are supported.
(Optional) Related cloud services:
You can use OSS or NAS to permanently store data during model service deployment in EAS. For more information about the billing of related cloud services, see OSS billing overview and NAS billing overview.
Internet NAT gateway:
Access to the model service over its Internet endpoint is free of charge. However, if the model service that you deploy requires access to the Internet, you must activate Internet NAT Gateway. For more information about how to connect to the Internet and configure a whitelist, see Configure Internet access and a whitelist. For more information about the billing of Internet NAT gateways, see NAT Gateway billing overview.
Step 1: Preparations
Prepare inference resources.
You can select an EAS resource group based on your business requirements. EAS provides a public resource group and dedicated resource groups. To use dedicated resource groups, you must first purchase and configure resource groups. For more information, see Overview of EAS resource groups.
Prepare the model and the code files that are used to process the model.
You need to prepare files such as the trained model and the code that processes it, and upload the files to the appropriate cloud services based on the deployment method. EAS allows you to deploy a model service by using an image or a processor, and where you store the prepared files varies based on the deployment method that you use. For more information, see the Deployment methods section in this topic.
Step 2: Deploy the service
In terms of deployment tools, EAS allows you to deploy and manage model services in the console or by using the command-line interface (CLI). The deployment procedure varies based on the tool that you use. The following sections provide detailed instructions.
Use the PAI console
Use the CLI
Manage online model services
You can manage model services in EAS. For more information, see Model service deployment by using the PAI console.
The following operations are supported:
View model calling information.
View logs, monitoring information, and service deployment information.
Scale in, scale out, start, stop, and delete model services.
Use the EASCMD client to manage model services. For more information, see Run commands to use the EASCMD client.
When you use a dedicated resource group to deploy a model service, you can configure a storage mount to store the required data. For more information, see Mount storage to services (advanced).
In terms of deployment methods, EAS allows you to deploy model services by using an image or a processor. The following sections provide detailed instructions.
Deploy Service by Using Image (recommended)
Scenario: Use an image to deploy a model service.
You can use an image to ensure consistency between the model development and training environments and the deployment and running environments.
EAS provides official images that are suitable for various scenarios. You can use an official image to implement push-button deployment.
You can also use a custom image without modification to deploy a model service in a convenient manner.
Deploy Web App by Using Image (recommended)
Scenario: Use an image to deploy an AI-powered web application.
EAS provides multiple preset official images, such as Stable-Diffusion-Webui and Chat-LLM-Webui, and provides frameworks such as Gradio, Flask, and FastAPI to build HTTP servers.
Deploy Service by Using Model and Processor
EAS provides built-in processors for commonly used model frameworks, such as PMML and XGBoost. You can use a built-in processor to start the service quickly.
You can also build custom processors to implement more flexible business logic.
Step 3: Debug and perform stress testing
After you deploy the service, you can use the online debugging feature to send HTTP requests to the service and verify that it runs as expected.
For more information about how to debug and perform stress testing, see Debug a service online.
Step 4: Monitor services and service scaling
After the model service is up and running, you can activate service monitoring and alerting to monitor the resource usage of the service.
You can also enable the horizontal or scheduled auto-scaling feature to manage the computing resources of the service.
For more information, see Service monitoring.
Step 5: Call the service
Deploy a model as an API service: You can call the service to perform online or asynchronous model inference. EAS allows you to call a service over a public endpoint, a VPC endpoint, or a VPC direct connection channel. You can also build custom request data based on the processor. We recommend that you use the SDK provided by PAI to test and call model services. For more information, see SDK for Java.
Deploy a model as a web UI application: You can use the console to open the web application page in a browser and interact with the model inference service.
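For the API service option, a service call is a plain HTTP POST with the service token carried in the Authorization header. The following sketch builds such a request with the Python standard library; the endpoint URL and token are placeholders that you obtain from the service's invocation information in the console:

```python
import json
import urllib.request

def build_eas_request(endpoint: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an HTTP POST request for an EAS service endpoint.

    The endpoint URL and token below are placeholders; the real values come
    from the deployed service's invocation information.
    """
    return urllib.request.Request(
        url=endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": token,               # service token from the console
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values; actually sending this requires a deployed service.
req = build_eas_request(
    "http://example-endpoint/api/predict/my_service", "MY_TOKEN", {"input": [1, 2, 3]}
)
```

To send the request, pass it to urllib.request.urlopen (or use the PAI SDK, which wraps this plumbing and adds retries).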
Step 6: Perform asynchronous inference
The queue service and asynchronous inference features are useful in scenarios where inference takes a long time or where the service receives a large number of requests. You can create a queue service to buffer incoming requests. After the inference service processes the requests, the results are written to the output queue and returned through asynchronous queries. This prevents requests from being discarded when traffic spikes. EAS also allows you to send request data to the queue service in multiple ways, and automatically scales the inference service based on the amount of data in the queue, which effectively controls the number of service instances. For more information, see Asynchronous inference services.
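The input-queue/output-queue flow can be simulated locally with Python's standard library. This sketch only illustrates the pattern; it is not the EAS queue service API, and the function and parameter names are made up for this example:

```python
import queue

def run_async_inference(requests, infer, max_backlog=100):
    """Simulate asynchronous inference: buffer requests in an input queue,
    process them, and publish results to an output queue for later polling."""
    input_q = queue.Queue(maxsize=max_backlog)   # requests wait here instead of being dropped
    output_q = queue.Queue()
    for r in requests:
        input_q.put(r)
    while not input_q.empty():
        output_q.put(infer(input_q.get()))       # a real worker would run concurrently
    return [output_q.get() for _ in range(output_q.qsize())]

results = run_async_inference([1, 2, 3], lambda x: x * 10)
```

In EAS, the queue depth (the analogue of input_q.qsize() here) is the signal that drives automatic scaling of the inference service.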
For more EAS use cases, see EAS use cases.
PAI provides DSW as an integrated development environment (IDE) in the cloud. DSW provides interactive development environments for developers of different levels. For more information, see DSW overview.
PAI provides Machine Learning Designer as a visualized modeling service that supports large-scale distributed training for traditional machine learning, deep learning, reinforcement learning, and unified stream and batch processing. Machine Learning Designer encapsulates hundreds of machine learning algorithms. For more information, see Overview of Designer.