If you do not have sufficient computing power, you can use the preemptible job feature of Platform for AI (PAI), which allocates computing resources by using a bidding system. In most cases, preemptible resources offer a price advantage over public pay-as-you-go resources. This allows cost-effective access to AI computing power and reduces the total cost of jobs. This topic describes how to use preemptible resources when you create a Deep Learning Containers (DLC) job.
Limits
Preemptible resources have the following limits:
Type | Lingjun resources | General-purpose computing resources
Supported regions | |
Framework type | | PyTorch
AIMaster-based automatic fault tolerance | Supported | Not supported
Limits on features | |
Features
Using preemptible resources
You can use general-purpose computing resources or Lingjun resources to create DLC jobs. The market prices of preemptible resources change based on supply and demand, and preemptible instances can cost up to 90% less than pay-as-you-go instances. Preemptible resources are shared among all Alibaba Cloud users and can be reclaimed after their protection period ends. Take note of the following behavior when you use preemptible resources to submit DLC jobs:
If DLC cannot obtain preemptible instances because the inventory is insufficient, the jobs enter the waiting state and DLC continues to apply for preemptible resources.
After the preemptible resources are obtained, the DLC jobs are created and run.
If the preemptible resources are reclaimed, the DLC jobs fail and stop running.
Applying for preemptible resources
When you use preemptible resources to create DLC jobs, DLC starts to preempt instance resources after you submit the jobs. If you want to create preemptible instances, the following requirements must be met:
The maximum bidding price that you configured for preemptible resources must be greater than or equal to the market price.
The inventory of preemptible resources is sufficient.
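The two requirements above can be expressed as a simple predicate. This is illustrative only; the function and parameter names are hypothetical, and PAI evaluates these conditions server-side:

```python
def can_allocate(max_bid: float, market_price: float, inventory: int) -> bool:
    """Hypothetical sketch of the two allocation conditions."""
    # Condition 1: your maximum bidding price must be >= the market price.
    # Condition 2: there must be preemptible inventory left.
    return max_bid >= market_price and inventory > 0

assert can_allocate(0.5, 0.4, 3)        # bid above market price, stock left
assert not can_allocate(0.3, 0.4, 3)    # bid below market price
assert not can_allocate(0.5, 0.4, 0)    # no inventory
```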
Releasing preemptible resources
Preemptible resources can be interrupted and released based on the market price, the resource inventory, the maximum bidding price that you configured during job creation, and the usage duration. In the following scenarios, preemptible resources may be released without notice:
Lingjun resources: The resource is released if its maximum bidding price falls below the average market price or its inventory becomes insufficient.
General-purpose computing resources: The resource is released if its maximum bidding price falls below the current market price or its inventory becomes insufficient.
To ensure that your preemptible jobs can continuously and stably run, you can perform the following operations:
Turn on Automatic Fault Tolerance when you create a job by using Lingjun resources. After you turn on Automatic Fault Tolerance, your job automatically enters the queue to bid for preemptible resources. For more information, see AIMaster: Elastic fault tolerance engine.
When you use general-purpose computing resources or Lingjun resources to create a job, you can use the EasyCkpt framework to train a PyTorch large language model (LLM). The job frequently saves checkpoints and can therefore tolerate interruptions. For more information, see Use EasyCkpt to save and resume foundation model trainings.
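The save-and-resume pattern that makes a job interruption-tolerant can be sketched as follows. This is a minimal illustration using only the standard library, not the EasyCkpt API; the file name and step-based layout are assumptions for the example:

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint file for this sketch

def load_checkpoint() -> int:
    # Resume from the last saved step, or start from 0 on a fresh run.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step: int) -> None:
    with open(CKPT, "w") as f:
        json.dump({"step": step}, f)

start = load_checkpoint()          # a preempted run resumes here
for step in range(start, 10):
    # ... one training step would go here ...
    save_checkpoint(step + 1)      # checkpoint frequently so little work is lost

print(load_checkpoint())  # 10
```

If the job is preempted mid-run, the next run starts from the last saved step instead of from scratch.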
Billing
Pricing
Preemptible jobs use a bidding mode in which you specify a maximum bidding price through the SpotWithPriceLimit strategy. When you create DLC jobs by using preemptible resources, the market prices of the resources fluctuate based on supply and demand. If you submit multiple DLC jobs that use the same preemptible resources within the same period of time, the fees for the jobs may be the same. The following bidding types are supported.
Note: Lingjun resources support only the spot discount bidding type.
Spot discount bidding type: The maximum bidding price is a percentage of the market price of the instance type and ranges from 10% to 90% of that price in 10% increments.
Spot price bidding type: The maximum bidding price is a specific price within the market price range.
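The relationship between the two bidding types and the resulting maximum bid can be sketched as follows. The function names and prices are made-up examples, not real PAI rates:

```python
def max_bid_spot_discount(instance_price: float, discount: float) -> float:
    """Spot discount bidding: bid a fixed fraction of the instance price."""
    # The discount must be one of 0.1, 0.2, ..., 0.9 (10% increments).
    assert round(discount * 10) in range(1, 10)
    return instance_price * discount

def max_bid_spot_price(price_limit: float, floor: float, ceiling: float) -> float:
    """Spot price bidding: bid a specific price inside the market range."""
    assert floor <= price_limit <= ceiling
    return price_limit

print(max_bid_spot_discount(1.0, 0.4))         # 0.4
print(max_bid_spot_price(0.011, 0.005, 0.02))  # 0.011
```

These fractions and limits correspond to the SpotDiscountLimit and SpotPriceLimit parameters used in the SDK examples later in this topic.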
In the Resource Information section of the Create Job page in DLC, set Source to Preemptible Resources and view the preemptible resources and the market price ranges in the Job Resource section.
Note: For the pricing of specific resource specifications, see the console.
General-purpose computing resources

Lingjun resources

Billing mode
You are charged for preemptible resources on a pay-as-you-go basis at the current market price.
Viewing bills
After you run a job, go to the Billing Details page in Expenses and Costs on the next day to view the billing details of the job that uses preemptible resources. The pay-as-you-go fees are generated by DLC, and the instances are tagged with key: acs:pai:dlc:payType, value: spot. For more information about how to view billing details, see View billing details.
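If you export the billing details, you can locate the preemptible-resource fees by filtering on this tag. The record layout below is hypothetical and only illustrates the filtering logic:

```python
# Hypothetical exported billing-detail rows; only the tag value is real.
bills = [
    {"instance_id": "dlc-1", "tag": "key:acs:pai:dlc:payType value:spot", "fee": 0.12},
    {"instance_id": "dlc-2", "tag": "", "fee": 0.50},
]

# Keep only rows that carry the DLC preemptible-resource tag.
spot_bills = [b for b in bills if "acs:pai:dlc:payType" in b["tag"]
              and "spot" in b["tag"]]
spot_total = sum(b["fee"] for b in spot_bills)

print(len(spot_bills), spot_total)  # 1 0.12
```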
Scenarios
Supported scenarios
To reduce costs, we recommend that you use preemptible resources in the following scenarios:
Short-term computing jobs.
Computing jobs in the debug state.
Computing jobs that are fault-tolerant.
Computing jobs that allow interruption. For example, in scenarios in which the EasyCkpt framework is used to train a PyTorch LLM, you can frequently save checkpoints and restore data from the checkpoints. For more information, see Use EasyCkpt to save and resume foundation model trainings.
Unsupported scenarios
Services that require high stability
Procedure
You can submit a preemptible job by using one of the following methods:
Use the PAI console
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
On the Create Job page, configure the parameters described in the following table. For information about other parameters, see Submit training jobs.
Parameter
Description
Resource Information
Resource Type
Select Lingjun AI Computing Service or General Computing.
Note: This parameter is available only if the workspace allows you to use Lingjun resources and general-purpose computing resources.
Source
Select Preemptible Resources.
Job Resource
In the Resource Type column, click the icon to select a preemptible resource and configure the Bid Price parameter. The bid price is the maximum bidding price, which is based on the original price of the instance type and ranges from 10% to 90% of that price in 10% increments. You can obtain the preemptible resource only if your bid meets or exceeds the market price and the inventory is sufficient.
VPC
Configure a virtual private cloud (VPC) if you use Lingjun resources to submit DLC jobs. Select a VPC, a vSwitch, and a security group from the drop-down lists.
Security Group
vSwitch
Fault Tolerance and Diagnosis
Automatic Fault Tolerance
If you use Lingjun resources to submit DLC jobs, we recommend that you enable Automatic Fault Tolerance. The Automatic Fault Tolerance feature allows preemptible jobs to automatically re-enter the bidding queue after resource revocation. The jobs can resume when the average market price falls below your maximum bidding price. For more information about AIMaster, see AIMaster: Elastic fault tolerance engine.
Lingjun resources
Note: For the pricing of specific resource specifications, see the console.

General-purpose computing resources
Note: For the pricing of specific resource specifications, see the console.

After you configure the parameters, click Confirm.
After you submit the job, DLC applies for preemptible resources to create and run the job. If no preemptible resources can be obtained, the job enters the waiting state.
Use the SDK
Step 1: Install the SDK for Python
Install the workspace SDK:
pip install alibabacloud_aiworkspace20210204==3.0.1
Install the DLC SDK:
pip install alibabacloud_pai_dlc20201203==1.4.17
Step 2: Submit a preemptible job
SpotDiscountLimit
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,  # Bid at most 40% of the price of the instance type.
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
SpotPriceLimit
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,  # The maximum bidding price.
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
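After you submit a job, it may wait for preemptible capacity before it runs. A simple status poll can be sketched as follows; get_status here is a stub standing in for a real status query (such as the DLC GetJob operation), and the state names are assumptions for the example:

```python
import time

def wait_for_job(get_status, max_polls: int = 12, interval_s: float = 5) -> str:
    """Poll a status callable until the job reaches a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed", "Stopped"):
            return status
        time.sleep(interval_s)  # wait before polling again
    return "Timeout"

# Example with a stub that reports the job finishing on the third poll:
states = iter(["Queuing", "Running", "Succeeded"])
print(wait_for_job(lambda: next(states), interval_s=0))  # Succeeded
```

In a real script you would replace the stub with a call that queries the job by the job_id printed by the submission code above.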
The following table describes the key parameters. For information about other parameters, see Use SDK for Python.
Parameter | Description |
SpotStrategy | The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit. |
SpotDiscountLimit | The spot discount bidding type. The value is the maximum discount relative to the price of the instance type. For example, 0.4 indicates a bid of at most 40% of that price. Note: Lingjun resources support only this bidding type. |
SpotPriceLimit | The spot price bidding type. The value is the maximum bidding price. |
UserVpc | This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides. |