Platform for AI (PAI) provides a one-stop platform for model development and deployment. The Elastic Algorithm Service (EAS) module of PAI allows you to deploy models as online inference services by using the public resource group or dedicated resource groups. The models are loaded on heterogeneous hardware (CPUs and GPUs) to generate responses in real time.
EAS architecture
EAS is a model serving platform that allows you to deploy models as online inference services or AI-powered web applications with a few clicks. EAS provides features such as automatic scaling and blue-green deployment, which reduce the cost of developing stable online model services that can handle a large number of concurrent requests. EAS also provides features that help you manage model services, including resource group management, model versioning, and resource monitoring. You can use EAS in various AI inference scenarios, such as real-time synchronous inference and near-real-time asynchronous inference, and leverage the comprehensive O&M capabilities of EAS.
Deployment methods
You can deploy a model by using an image or a processor in EAS.
Use an image (recommended)
If you deploy a model by using an image, EAS pulls the image that contains the runtime environment from Container Registry (ACR) and mounts model files and code from storage services such as Object Storage Service (OSS) and Apsara File Storage NAS (NAS).
The following figure shows the workflow of deploying a model by using an image in EAS.
Take note of the following items:
You can use one of the following methods when you deploy a model by using an image:
Deploy Service by Using Image: You can call the service by using API operations after deployment.
Deploy Web App by Using Image: You can access the web application by using a link after deployment.
For information about the differences between the two methods, see the "Step 2: Deploy a service" section of this topic.
PAI provides multiple prebuilt images to accelerate model deployment. You can also create a custom image and upload the image to ACR.
We recommend that you upload the model files and the code files that contain the preprocessing or postprocessing logic to storage services. This way, you can mount the files to the runtime environment. Compared with packaging the files into a custom image, this method allows you to update the model in a convenient manner.
When you deploy a model by using an image, we recommend that you build an HTTP server to receive requests that are forwarded by EAS. Do not configure the HTTP server to listen on port 8080 or 9090, because the EAS engine already listens on these ports.
If you use a custom image, you must upload the image to ACR before you use the image during deployment. Otherwise, EAS may fail to pull the image. If you use Data Science Workshop (DSW) to develop a model, you must upload the image to ACR before you use the image in EAS.
If you want to reuse your custom images or warm-up data in other scenarios, you can manage the images or data in a centralized manner by using the AI Computing Asset Management module of PAI. EAS does not support mounting CPFS datasets from NAS.
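The HTTP server recommendation above can be sketched with the Python standard library. This is a minimal illustration, not the server that any PAI prebuilt image ships: the default port 8000, the echo "inference", and the names check_port, PredictHandler, make_server, and main are assumptions made for the example. Only the restriction on ports 8080 and 9090 comes from this topic.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Ports that the EAS engine listens on, according to the note in this topic.
RESERVED_PORTS = {8080, 9090}


def check_port(port: int) -> int:
    """Return the port unchanged, or raise if it collides with the EAS engine."""
    if port in RESERVED_PORTS:
        raise ValueError(f"port {port} is reserved by the EAS engine")
    return port


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body that is forwarded to the container.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Placeholder inference: echo the payload. Replace with a real model call.
        reply = b"echo: " + body
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    # Port 8000 is an arbitrary example; any port other than 8080 and 9090 works.
    return ThreadingHTTPServer(("0.0.0.0", check_port(port)), PredictHandler)


def main() -> None:
    # Blocks until the process is stopped.
    make_server().serve_forever()
```

Calling main() in the container entry point starts the server. In a real deployment, the handler would load the model from the mounted storage path instead of echoing the request.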
Use a processor
If you deploy a model by using a processor, prepare the model files and processor files, upload the files to storage services such as OSS or NAS before deployment, and then mount the files to EAS during deployment.
The following figure shows the workflow of deploying a model by using a processor in EAS.
Take note of the following items:
PAI provides multiple prebuilt images to accelerate model deployment. You can also create a custom image based on your business requirements and upload the image to ACR.
We recommend that you develop and store the model file and the processor file separately. You can call the get_model_path() method in the processor file to obtain the path of the model file. This allows you to update the model in a convenient manner.
When you deploy a model by using a processor, EAS automatically pulls an official image based on the inference framework of the model and deploys an HTTP server based on the processor file to receive service requests.
When you deploy a model by using a processor, make sure that the inference framework of the model and the processor file meet the requirements of the development environment. Compared with image-based deployment, this method is less flexible and less efficient, so we recommend that you deploy a model by using an image.
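The benefit of keeping the model file and the processor file separate can be illustrated with a small stand-in. The sketch below does not use the EAS processor SDK: StubProcessorBase is a hypothetical base class written for this example, and the initialize and process method names, the model.json file, and the "factor" field are assumptions. Only the get_model_path() method name comes from this topic.

```python
import json
import os
import tempfile


class StubProcessorBase:
    """Hypothetical stand-in for the base class that a processor SDK would supply."""

    def __init__(self, model_path: str):
        self._model_path = model_path

    def get_model_path(self) -> str:
        # On the platform this path is supplied at deployment time;
        # here it is injected by hand for the demo.
        return self._model_path


class ScalingProcessor(StubProcessorBase):
    def initialize(self) -> None:
        # Resolve the model location at startup instead of hardcoding a path,
        # so the model can be replaced without touching the processor code.
        model_file = os.path.join(self.get_model_path(), "model.json")
        with open(model_file) as f:
            self.factor = json.load(f)["factor"]

    def process(self, data: bytes) -> bytes:
        # Toy "inference": multiply the input number by the stored factor.
        return str(float(data.decode()) * self.factor).encode()


def demo() -> bytes:
    # Write a throwaway "model" to a temporary directory and serve one request.
    with tempfile.TemporaryDirectory() as model_dir:
        with open(os.path.join(model_dir, "model.json"), "w") as f:
            json.dump({"factor": 2.0}, f)
        processor = ScalingProcessor(model_dir)
        processor.initialize()
        return processor.process(b"21")
```

Because the processor asks for the model path instead of embedding it, swapping in a new model only requires changing the mounted files.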
Terms
Term | Description |
Resource group | EAS uses resource groups to isolate resources in a cluster. You can deploy a model by using the default public resource group or a dedicated resource group that you purchased. For more information about EAS resource groups, see Overview of EAS resource groups. |
Model service | A model service consists of model files and online prediction logic. You can create, update, start, stop, and scale a model service. |
Model file | A model file contains an offline model that is obtained after offline training. The format of the model file varies based on the framework. In most cases, a model file is deployed together with a processor to provide a model service. |
Processor | A processor is a package that contains online prediction logic. In most cases, a processor is deployed together with a model file to provide a model service. EAS provides prebuilt processors for common model frameworks, such as Predictive Model Markup Language (PMML), TensorFlow, and Caffe. |
Custom processor | If the prebuilt processors of EAS cannot meet your business requirements, you can develop custom processors by using C++, Java, or Python. |
Service instance | Each service instance independently handles requests. You can deploy multiple service instances to increase the maximum number of concurrent requests that a service can handle. If your resource group contains multiple machines, EAS automatically distributes the service instances to different machines to ensure service availability. |
High-speed direct connection | EAS supports high-speed direct connection, which is enabled by connecting the resource group that is used for deployment to your virtual private cloud (VPC). After you enable high-speed direct connection, clients can bypass gateways and directly access the model service. This significantly improves performance and reduces latency. |
Supported regions
EAS is available in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Heyuan), China (Chengdu), China (Hong Kong), Singapore, Indonesia (Jakarta), US (Silicon Valley), US (Virginia), and Germany (Frankfurt).
Billing methods
EAS resource groups:
EAS allows you to deploy a model service by using the public resource group or a dedicated resource group. For more information, see Billing of EAS.
If you use the public resource group, you are charged based on the amount of resources that are used by your model services.
If you use a dedicated resource group, you are charged for the dedicated resources based on the subscription or pay-as-you-go billing method.
(Optional) Related Alibaba Cloud services:
Storage:
You can use OSS or NAS to permanently store data that is mounted to the runtime environment during deployment in EAS. For information about the billing methods, see Billing overview (OSS) and Billing overview (NAS).
NAT Gateway:
You can use a public endpoint to access the model service free of charge. However, if the model service requires access to the Internet, you must activate NAT Gateway. For information about how to configure Internet access and a whitelist, see Configure Internet access and a whitelist. For information about the billing rules of Internet NAT gateways, see Billing overview.
Procedure
Step 1: Prepare for deployment
Prepare computing resources.
Select a resource group based on your business requirements. EAS provides the public resource group and dedicated resource groups. To use dedicated resource groups, you must purchase and configure the resources. For more information, see Overview of EAS resource groups.
Prepare the required files.
Prepare the files that contain the trained model and the processing logic, and then upload the files to storage services based on the deployment method you use. For information about the recommended storage services for each deployment method provided by EAS, see the "Deployment methods" section of this topic.
Step 2: Deploy a service
The following table describes the deployment tools.
Operation | GUI tools | CLI tools |
Deploy services | Use the PAI console or Machine Learning Designer to deploy a service with a few clicks. For more information, see Deploy a model service in the PAI console or Deploy a model service by using Machine Learning Designer. | Use DSW or the EASCMD client to deploy a service. For more information, see Deploy model services by using EASCMD or DSW. |
Manage services | Manage model services on the EAS-Online Model Services page. For more information, see Deploy a model service in the PAI console. The following operations are supported: view invocation information; view logs, monitoring information, and service deployment information; scale, start, stop, and delete model services. | Use the EASCMD client to manage model services. For more information, see Run commands to use the EASCMD client. |
If you use a dedicated resource group to deploy a model service, you can mount the required data from storage services. For more information, see Mount storage to services (advanced).
The following table describes the deployment methods.
Deployment method | Description | Reference |
Deploy Service by Using Image (recommended) | Scenario: Use an image to deploy a model service. Benefits: Images ensure consistency between the model development environment and the runtime environments. Prebuilt images for common scenarios allow you to complete deployment with a few clicks. Custom images can be used for deployment without the need for modification. | |
Deploy Web App by Using Image (recommended) | Scenario: Use an image to deploy a web application. Benefits: Prebuilt images for common scenarios, such as Stable-Diffusion-Webui and Chat-LLM-Webui, allow you to complete deployment with a few clicks. You can build an HTTP server by using frameworks such as Gradio, Flask, and FastAPI. Custom images can be used for deployment without the need for modification. | |
Deploy Service by Using Model and Processor | EAS provides prebuilt processors for common model frameworks, such as PMML and XGBoost, to accelerate deployment. If the prebuilt processors cannot meet your business requirements, you can build custom processors to obtain greater flexibility. | |
Step 3: Debug and perform stress testing
After you deploy a service, you can use the online debugging feature to send HTTP requests to verify the service performance.
For information about how to debug and perform stress testing, see Debug a service online.
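To make the idea of stress testing concrete, the following sketch measures latency under concurrent load. It is not the stress testing feature of EAS: run_load, timed_call, and fake_service are hypothetical names, and fake_service is a local stand-in that sleeps instead of sending an HTTP request. In practice, the send callable would issue a request to the service endpoint.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def timed_call(send: Callable[[], object]) -> float:
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def run_load(send: Callable[[], object], total_requests: int, concurrency: int) -> Dict[str, float]:
    """Send total_requests calls with at most `concurrency` in flight at once."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: timed_call(send), range(total_requests)))
    # Nearest-rank 95th percentile on the sorted latencies.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": latencies[p95_index],
        "max_s": latencies[-1],
    }


def fake_service() -> None:
    # Stand-in for a real request; replace with an HTTP call to the service.
    time.sleep(0.01)
```

For example, run_load(fake_service, total_requests=200, concurrency=20) returns the request count together with the mean, 95th percentile, and maximum latency in seconds.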
Step 4: Monitor a service
After you complete debugging and stress testing, you can use the service monitoring feature to monitor the resource usage of the service.
You can also enable the automatic scaling or scheduled scaling feature to manage the computing resources of the service.
For more information, see Service monitoring.
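As a rough illustration of what an automatic scaling rule computes, the sketch below derives an instance count from a load metric. This is a generic target-tracking heuristic, not the documented scaling algorithm of EAS. The function name, the parameters, and the choice of requests per second as the metric are assumptions for the example.

```python
import math


def desired_instances(current_load: float, target_per_instance: float,
                      min_instances: int, max_instances: int) -> int:
    """Return the instance count that keeps per-instance load at or below the target."""
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    # Round up so that no instance is pushed above the target.
    needed = math.ceil(current_load / target_per_instance)
    # Clamp to the configured bounds.
    return max(min_instances, min(max_instances, needed))
```

With a target of 100 requests per second per instance and bounds of 1 to 10 instances, a load of 450 requests per second yields 5 instances, and a load of 5,000 is capped at 10.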
Step 5: Call a service
If you deploy a model as an API service, you can call API operations to perform real-time inference and asynchronous inference. EAS allows you to call a service by using a public endpoint, a VPC endpoint, or the VPC direct connection channel. You can also create custom request data based on the processor of the model service. We recommend that you use the SDKs provided by PAI to test and call a service. For more information, see SDK for Java.
If you deploy a model as a web application, you can find the link to the application in the PAI console, open the link in a browser, and use the UI to access the model service in an interactive manner.
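For a service that is deployed as an API, a call over HTTP might look like the following sketch. It uses only the Python standard library rather than the SDKs provided by PAI. The assumption that the service token is passed in the Authorization header, the JSON request body, and the placeholder endpoint and token values are not stated in this topic, so verify them against the invocation information of your service.

```python
import json
import urllib.request


def build_request(endpoint: str, token: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for a deployed service without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            # Assumption: the service token is passed as the Authorization header.
            "Authorization": token,
            # Assumption: the service accepts JSON; the real format depends
            # on the processor or image behind the service.
            "Content-Type": "application/json",
        },
    )


def call_service(endpoint: str, token: str, payload: dict, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body."""
    request = build_request(endpoint, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Replace the endpoint and token with the values shown for your service, for example call_service("<endpoint>", "<token>", {"input": "hello"}). Whether you use the public endpoint or the VPC endpoint only changes the URL.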
Step 6: Perform asynchronous inference
You can use the queue service to implement asynchronous inference based on your business requirements. When your inference service receives a large number of requests, create an input queue to store the requests. After the requests are processed, save the results to the output queue and asynchronously return the results. This prevents unprocessed requests from being discarded. EAS supports multiple methods of sending request data to the queue service and automatically scales the inference service by monitoring the amount of data in the queue. This effectively controls the number of service instances. For more information about asynchronous inference, see Asynchronous inference services.
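The input queue and output queue pattern described above can be sketched in a single process. This is an illustration of the pattern only and does not use the EAS queue service or its API. The names worker and run_async, the thread-based workers, and the use of request IDs to match results are assumptions for the example.

```python
import queue
import threading
from typing import Callable, Dict, Iterable

# Sentinel that tells a worker to exit.
_STOP = object()


def worker(input_q: queue.Queue, output_q: queue.Queue, infer: Callable) -> None:
    """Take requests from the input queue and put results on the output queue."""
    while True:
        item = input_q.get()
        if item is _STOP:
            break
        request_id, payload = item
        output_q.put((request_id, infer(payload)))


def run_async(payloads: Iterable, infer: Callable, num_workers: int = 2) -> Dict[int, object]:
    input_q: queue.Queue = queue.Queue()
    output_q: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(input_q, output_q, infer))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    # Requests are buffered in the input queue instead of being dropped
    # when the workers are busy.
    for request_id, payload in enumerate(payloads):
        input_q.put((request_id, payload))
    # One stop signal per worker, queued behind all requests.
    for _ in threads:
        input_q.put(_STOP)
    for t in threads:
        t.join()
    # Results may arrive out of order, so match them by request ID.
    results: Dict[int, object] = {}
    while not output_q.empty():
        request_id, result = output_q.get()
        results[request_id] = result
    return results
```

For example, run_async(["a", "b", "c"], str.upper) returns the results keyed by request ID. In a hosted queue service, the number of workers would be adjusted based on the backlog in the input queue rather than fixed in advance.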
References
For information about EAS use cases, see EAS use cases.
The DSW module of PAI is a cloud-based and interactive integrated development environment (IDE) for machine learning. You can use Notebooks to read data, develop algorithms, and train and deploy models in an efficient manner. For more information, see DSW overview.
The Machine Learning Designer module of PAI is a visualized modeling tool that provides hundreds of algorithm components. The module supports large-scale distributed training for traditional machine learning, deep learning, and reinforcement learning. The module also supports combining streaming training and batch training. For more information, see Overview of Machine Learning Designer.