Configure deployment resources

EAS provides three resource types (public resources, EAS resource groups, and resource quotas) to support scenarios ranging from testing to production. This topic describes how to select a resource type and configure compute resources and scheduling policies.

Choose a resource type

| Resource type | Subtype | Use cases | Billing | Feature comparison |
| --- | --- | --- | --- | --- |
| Public resources | | Ideal for testing or for services with fluctuating traffic (when dedicated resources are combined with an elastic resource pool). | On-demand activation. Billed on a pay-as-you-go basis. For more information, see EAS billing. | Uses shared compute resources with no separate purchase required. Resource availability is not guaranteed during peak hours. Supports CPU and GPU instances (A10, P4, P100, T4, and V100). |
| EAS resource group | Dedicated resource group | Suitable for scenarios that require high security or exclusive resources. Purchase a dedicated resource group to reserve scarce resources. | Purchase before use. Supports subscription and pay-as-you-go billing. For more information, see EAS billing. | Provides exclusive compute resources with resource isolation for enhanced security. Supports CPU and GPU instances (A10, P4, P100, T4, and V100). Supports GPU sharing. |
| EAS resource group | Virtual resource group | A logical resource group that combines multiple resource types, such as public resources, resource quotas, and dedicated resource groups. | Billing is based on the resources scheduled and used. | Deploys a single service across multiple resource types. Supports scheduling priorities. |
| Resource quota | General-purpose computing 2.0 | Suitable for production scenarios that require dedicated resources and resource isolation. | Purchase before use. Subscription-based billing. For more information, see Billing of AI computing resources. | Integrated training and inference resource management to improve resource utilization. Supports GPU sharing. |
| Resource quota | Lingjun intelligent computing | Ideal for large models or scenarios that require high-performance hardware, such as RDMA high-speed interconnects and CPFS storage for intelligent computing. | Purchase before use. Subscription-based billing. For more information, see Billing of AI computing resources. | |

Recommendations:

Testing and development: Use public resources for pay-as-you-go billing with no upfront investment. Resource availability may be limited during peak hours. For more information, see What do I do if public resources are insufficient?
Production environments (stable traffic): Use an EAS dedicated resource group or a resource quota (General-purpose computing 2.0). These options provide dedicated resources and stable performance, and they support subscription billing to lower costs.
Production environments (fluctuating traffic): Use a virtual resource group. A dedicated resource group or resource quota provides a baseline, and public resources handle traffic spikes.
Large models or specialized hardware: Use a resource quota (Lingjun intelligent computing) to access high-performance hardware.

Choose an instance type

Choose a CPU or GPU instance type based on your model size and inference workload.

Spot instance: When you use public resources, you can enable spot mode and set a bid cap to use idle resources at a lower price than regular instances. Spot instances may be reclaimed, so they are best suited for inference tasks that can tolerate interruptions.
GPU driver version: When you choose a GPU instance, you can specify a GPU driver version in the Features > Resource Configuration section to meet the runtime requirements of a specific model or framework.

Configure the system disk

The system disk stores temporary data generated at runtime. The default configurations vary based on the resource type:

Public resources: A free system disk of 30 GiB is provided. Usage beyond 30 GiB is billed on a pay-as-you-go basis.
EAS resource group or resource quota: The default system disk size is 60 GiB. If you change the capacity, the disk is allocated from the host machine.
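
As a rough illustration of the public-resources rule above, the billable part of a system disk is whatever exceeds the 30 GiB free allowance. The helper name `billable_system_disk_gib` is hypothetical, and the sketch only computes a size in GiB; it makes no assumption about unit prices.

```python
FREE_SYSTEM_DISK_GIB = 30  # free allowance on public resources, per the text above


def billable_system_disk_gib(disk_gib: int) -> int:
    """Return the number of GiB above the free allowance (hypothetical helper)."""
    if disk_gib < 0:
        raise ValueError("disk size must be non-negative")
    # Anything at or below the allowance is free; only the excess is billable.
    return max(0, disk_gib - FREE_SYSTEM_DISK_GIB)


print(billable_system_disk_gib(30))   # 0
print(billable_system_disk_gib(100))  # 70
```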

Configure shared memory

Shared memory allows multiple processes within a container to read from and write to the same memory area. This avoids data-copying overhead and is suitable for scenarios that require efficient inter-process communication.

If you use a multi-process inference framework, such as vLLM with tensor parallelism or multi-worker concurrent inference, we recommend configuring sufficient shared memory based on your model size.

Set the replica count

The replica count is the number of instances running your service. We recommend configuring multiple replicas to avoid a single point of failure.

Configure scheduling policies

When you use an EAS resource group or a resource quota, you can use the following policies to optimize resource scheduling:

Elastic Resource Pool: When your own resources are insufficient, the system automatically scales out by using pay-as-you-go public resources to handle traffic spikes. During scale-in, instances that use public resources are released first to reduce costs. For more information, see Elastic resource pool.
Specify Node Scheduling: This policy restricts the service to run on specified nodes. If no nodes are specified, all non-excluded nodes are eligible for scheduling.
High-priority Resource Rescheduling: If you enable this feature, the system periodically migrates instances from lower-priority resources, such as public resources, to higher-priority resources, such as a dedicated resource group, to optimize costs. This is useful when rolling updates temporarily schedule instances to public resources, or when you want to migrate regular instances to spot instances to reduce costs.
Resource Affinity Scheduling: When you use Lingjun intelligent computing resources from a public resource group for multi-node distributed inference, we recommend enabling resource affinity scheduling in the Features section. This schedules instances to your specified hyper-node network domain in the HPN Zone to ensure RDMA high-speed interconnection.
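
The scale-in behavior described for the Elastic Resource Pool (instances on public resources are released first) can be pictured with a small ordering sketch. This is only an illustration of the stated rule, not the EAS scheduler: the `Instance` class, the `"public"` and `"dedicated"` labels, and the `pick_for_release` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    name: str
    resource: str  # hypothetical labels: "public" or "dedicated"


def pick_for_release(instances: List[Instance], count: int) -> List[Instance]:
    """Choose `count` instances to release, preferring public-resource instances."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # sorted() is stable: public instances (key False) come first,
    # and the original order is kept within each group.
    ordered = sorted(instances, key=lambda inst: inst.resource != "public")
    return ordered[:count]


fleet = [
    Instance("d1", "dedicated"),
    Instance("p1", "public"),
    Instance("d2", "dedicated"),
    Instance("p2", "public"),
]
print([inst.name for inst in pick_for_release(fleet, 3)])  # ['p1', 'p2', 'd1']
```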

GPU sharing and distributed inference

GPU sharing: Splits the computing power and memory of a single GPU among multiple service instances to improve GPU utilization and reduce deployment costs. This feature is ideal for smaller models or light inference workloads. GPU sharing can be enabled only when you use an EAS resource group or a resource quota.
Multi-node distributed inference: Deploys a single service instance across multiple machines. This overcomes the hardware limits of a single node and supports the deployment and operation of ultra-large models.

FAQ

Resource usage and limitations

Q: Why is the 1-vCPU, 2-GB instance unavailable?

The 1 vCPU, 2 GB memory instance type is unavailable to ensure service stability. System components consume a portion of each node's resources. On smaller instances, this leaves insufficient resources for your service.

Q: How many models can I deploy on a single PAI-EAS instance?

The number of models you can deploy on a single PAI-EAS instance depends on the resource requirements of each model, such as CPU cores, GPU memory, and system memory. There is no predefined limit. We recommend choosing an instance type based on your model's actual needs, or deploying different models on separate instances.

Q: What is the maximum number of services that can be deployed in EAS?

The maximum number of service instances that you can deploy depends on the remaining available resources. You can view the remaining capacity for each machine in the machine list of your resource group in the console. For more information, see Use an EAS resource group.

If you allocate tasks based on CPU cores, the maximum number of instances you can deploy is (Total CPU Cores - 1) / Cores used per instance.
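
The formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `max_instances` is hypothetical, and rounding down to a whole number of instances is an assumption, since the formula itself does not say how fractional results are handled.

```python
def max_instances(total_cpu_cores: int, cores_per_instance: int) -> int:
    """Apply (Total CPU Cores - 1) / Cores used per instance, rounded down."""
    if total_cpu_cores < 1:
        raise ValueError("total_cpu_cores must be at least 1")
    if cores_per_instance < 1:
        raise ValueError("cores_per_instance must be at least 1")
    # Assumption: a partial instance cannot be deployed, so use floor division.
    return (total_cpu_cores - 1) // cores_per_instance


print(max_instances(32, 4))  # 7
print(max_instances(16, 2))  # 7
```

For example, with 32 total cores and 4 cores per instance, 31 cores remain after subtracting one, which leaves room for 7 instances.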

Q: Which EAS instance is comparable to an RTX 4090?

ecs.gn8ia-2x.8xlarge provides performance close to that of an RTX 4090.

Q: What is the maximum concurrency for a deployed model?

The maximum concurrency for a model service depends on multiple factors, including the model, use case, and resource configuration. We recommend performing stress testing to measure your service's performance.

Dedicated resource group management

Q: Why has my dedicated resource group been in the "Scaling Out" state for a long time?

This usually happens because of insufficient capacity in the current region. For subscription instances, if creation fails due to insufficient capacity, the system automatically creates a refund order and returns the payment to the original payment method.

Q: How do I delete a subscription instance?

Go to the Alibaba Cloud Unsubscribe Resources page to unsubscribe from EAS subscription-based dedicated machines that you no longer need. Configure the following parameters:

Type: Select Partial refund.
Product name: Select EAS Dedicated Machine Subscription.

Click Search to find the resource you want to unsubscribe from. Then, click Unsubscribe resource in the Actions column and follow the on-screen instructions to complete the process.

Q: Is instance data kept after unsubscribing?

No, service instance data is not retained.

System disk management

Q: How do I increase the size of the system disk?

You can configure or expand the system disk for a service in one of the following ways:

Console configuration: When you create or update a service, in the Resource Information section, set the System Disk size under Configure a system disk.
JSON configuration: In the service's JSON configuration file, modify the metadata field's disk value.
```
"metadata": {"disk": "40Gi"}
```

Note

If you are using a dedicated resource group, the configured system disk size cannot exceed the system disk size of the node. If you need a larger system disk, you must release the current node and repurchase a node with a larger system disk.
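
If you maintain the service configuration as a JSON file, the `disk` change described above can also be scripted. The following is a minimal sketch that only edits a configuration held in memory: the helper name `set_system_disk` is hypothetical, and the starting value `30Gi` is a placeholder. Only the `metadata.disk` field comes from the example above.

```python
import json


def set_system_disk(config: dict, size: str) -> dict:
    """Return a copy of the service config with metadata.disk set to `size`."""
    updated = dict(config)
    # Copy the nested metadata dict so the caller's config is left unchanged.
    metadata = dict(updated.get("metadata", {}))
    metadata["disk"] = size
    updated["metadata"] = metadata
    return updated


service_config = {"metadata": {"disk": "30Gi"}}  # placeholder starting config
print(json.dumps(set_system_disk(service_config, "40Gi")))
# {"metadata": {"disk": "40Gi"}}
```

The function returns a new dictionary rather than mutating its input, so the original configuration can still be compared against the updated one before you apply it.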