EAS architecture and usage - Platform For AI - Alibaba Cloud Documentation Center

Once a model is trained, you can use Elastic Algorithm Service (EAS) to quickly deploy it as an online inference service or an AI web application. EAS supports heterogeneous resources and combines features like Automatic scaling, One-click stress testing, Canary release, and Real-time monitoring to ensure stable, continuous service in high-concurrency scenarios at a lower cost.

EAS features

Details of EAS features

Infrastructure layer: Supports heterogeneous hardware (CPU or GPU), provides AI-dedicated GU instance types, and offers preemptible (Spot) instances to balance performance and cost-effectiveness.
Container scheduling layer: Uses features like automatic scaling, scheduled scaling, and elastic resource pool to dynamically adapt to changing workloads and improve resource utilization.
- Automatic scaling: Automatically adjusts the number of instances based on real-time load to handle unpredictable traffic spikes and prevent resource idling or overload.
- Scheduled scaling: Suitable for services with regular business cycles, such as morning peaks or promotional events. You can configure scaling policies in advance to precisely control resource allocation.
- Elastic resource pool: If resources in a dedicated resource group are fully occupied, the system automatically schedules new instances to a pay-as-you-go public resource group to ensure service stability.
Model deployment layer: Integrates a full-lifecycle workflow of deployment, stress testing, and monitoring to simplify O&M and improve deployment reliability.
- One-click stress testing: Supports dynamic pressure ramping, automatically detects service limits, and provides real-time, second-level monitoring data and stress test reports to quickly evaluate service capabilities.
- Canary release: Allows you to add multiple services to the same canary group and flexibly allocate traffic between production and canary versions to safely validate new versions.
- Real-time monitoring: After deployment, you can view key metrics such as QPS, response time, and CPU utilization in the console to get a comprehensive overview of service health and status.
- Traffic mirroring: Copies a proportion of online traffic to a test service to validate the performance and reliability of the new service without affecting real user requests.
Inference capability: Provides three inference modes:
- Real-time synchronous inference: Suitable for scenarios like search recommendations and conversational bots that require high throughput and low latency. The system can also recommend suitable instance types based on business needs to achieve optimal performance.
- Near-real-time asynchronous inference: Designed for long-running tasks in online services that are processed one request at a time, such as text-to-image generation and video processing. It features a built-in message queue, supports automatic scaling, and is maintenance-free.
- Offline batch inference: Ideal for scenarios that are not sensitive to response time and require batch processing, such as converting large volumes of audio data to text. It also supports preemptible instances to control costs.

Billing

Billing overview

When you use EAS to deploy services, you may be charged for computing resources, system disks, and dedicated gateways.

Computing resources: includes public resources, dedicated resources, and Lingjun resources.
(Optional) System disks: provides free quotas, including 30 GB for each instance created by using public resources and 200 GB for each instance created by using dedicated resources. You are charged for additional system disks.
(Optional) Dedicated gateways: By default, a free shared gateway is used. If you require security isolation, access control, or custom domain names, you can purchase a dedicated gateway. To use a dedicated gateway, you must manually configure it.

EAS provides the following billing methods:

Pay-as-you-go: You are charged based on service run time (not the number of requests). This billing method is suitable for uncertain and fluctuating demand scenarios.
Subscription: This billing method is more cost-effective. It is suitable for long-term stable business scenarios.

EAS provides Stable Diffusion web UI Serverless Edition and ComfyUI Serverless Edition that you can use to deploy services free of charge. You are charged only based on the actual inference duration when the service is called.

Important

If you use other Alibaba Cloud services, such as Elastic IP Address (EIP), Object Storage Service (OSS), and File Storage NAS, fees are generated.

For more information, see Billing of Elastic Algorithm Service (EAS).

Usage workflow

Step 1: Prepare

Prepare inference resources
Choose the appropriate EAS resource type based on your model size, concurrency requirements, and budget. Dedicated EAS resources or Lingjun intelligent computing resources must be purchased before use. For more information about resource selection and purchase, see Overview of EAS deployment resources.
Prepare model and code files
Prepare your trained model, code files, and other dependencies. Upload these files to a designated cloud storage service, such as Object Storage Service (OSS). You can then access the data required for service deployment by using storage mounting.

Step 2: Deploy the service

Deployment tools: You can deploy and manage services by using the PAI-EAS console, the EASCMD command line, or an SDK.
- Console: Provides custom deployment and scenario-based deployment methods. The console is user-friendly and ideal for beginners.
- EASCMD command-line tool: Supports service creation, updates, viewing, and more. It is suitable for algorithm engineers familiar with EAS deployment.
- SDK: Suitable for large-scale, unified scheduling and O&M.
Deployment methods: Supports image-based deployment (recommended) and Processor-based deployment. For the differences, see Deployment principles.

Step 3: Invoke and test the service

Deploy the model as a WebUI application: Open an interactive page in your browser from the console to directly experience the model's capabilities.
Deploy the model as an API service:
- You can send HTTP requests using online service debugging to verify that the inference feature works as expected.
- Make synchronous or asynchronous calls via an API. EAS supports multiple service invocation methods, including through a shared gateway, a dedicated gateway, and high-speed direct connections.
Use the built-in universal stress testing tool in EAS to perform One-click stress testing on the deployed service. This helps you evaluate the service's performance under pressure and understand its inference capacity. For more information about stress testing, see Automatic stress testing.

Step 4: Monitor and scale the service

After the service is running, enable service monitoring and alerting to stay informed about resource usage, performance metrics, and potential anomalies, ensuring the service runs smoothly.rvice runs smoothly.
Enable horizontal or scheduled auto-scaling to achieve real-time, dynamic management of online service compute resources. For more information, see Auto Scaling.

Step 5: Use asynchronous inference services

For time-consuming requests, such as text-to-image generation or video processing, enable Asynchronous inference services. A queue service receives requests, and after processing, the results are written to an output queue. The client then asynchronously queries the results. This prevents request backlogs and data loss, improving system throughput. EAS supports automatic scaling based on queue backlog to intelligently adjust the number of instances. For more information, see Asynchronous inference services.

Step 6: Update the service

In the inference service list, click Update in the Actions column of the target service to update the service version.

Warning

Service is temporarily interrupted during an update, which can cause dependent requests to fail. Proceed with caution.

After the service update is complete, click the current version to view Version Information or switch the service version.

Quick Start

See Quick Start for Elastic Algorithm Service (EAS).

Scenarios and examples

LLM: Deploy large language models (LLMs) | Deploy MoE models using expert parallelism and PD separation
AIGC (AI-Generated Content): AI video generation - ComfyUI deployment | AI art - SDWebUI deployment
Others: Best practices for accessing a dedicated gateway across VPCs using CEN | Best practices for PAI-EAS Spot instances

FAQ

Q: What's the difference between dedicated and public resources in EAS?

The primary difference lies in performance isolation, cost, and availability guarantees.

Public resources: Use these for development, testing, or small-scale applications where cost is a primary concern and some performance fluctuation is acceptable. These are shared resources, so you might experience resource contention during peak hours.
Dedicated resources: Use these for production-level, core business applications that demand high stability and performance. These resources are physically isolated, eliminating the risk of preemption. You must also purchase dedicated resources to lock in specific instance types with limited inventory.

The Elastic Resource Pool feature provides a hybrid approach: if your dedicated resources are fully utilized, EAS can automatically scale out to public resources to handle traffic spikes, balancing cost with service stability.

Q: Why should I use EAS instead of self-managing my model inference services?

EAS is a fully managed service that handles the operational overhead of deploying and maintaining model inference infrastructure.

By using EAS, you offload the following tasks:

Resource scheduling, fault recovery, and real-time monitoring.
Implementing complex features like auto-scaling and canary releases from scratch.

This allows your team to focus on model development rather than infrastructure management, which reduces O&M costs and accelerates time-to-market.

Q: How can I troubleshoot common errors when my EAS service fails?

For a comprehensive guide to diagnosing and resolving common deployment and runtime issues, see EAS FAQ.

Platform For AI:EAS overview