If you do not have sufficient computing power, you can use the preemptible job feature of Platform for AI (PAI), which allocates computing resources by using a bidding system. In most cases, preemptible resources offer a price advantage over public pay-as-you-go resources. This allows cost-effective access to AI computing power and reduces the total cost of jobs. This topic describes how to use preemptible resources when you create a Deep Learning Containers (DLC) job.
Limits
Preemptible resources have the following limits:
Type | Lingjun resources | General-purpose computing resources
Supported regions | |
Framework type | | PyTorch
AIMaster-based automatic fault tolerance | Supported | Not supported
Limits on features | |
Features
Using preemptible resources
You can use general-purpose computing resources or Lingjun resources to create DLC jobs. The market prices of preemptible resources change based on supply and demand, and preemptible instances can cost up to 90% less than pay-as-you-go instances. Preemptible resources are shared among all Alibaba Cloud users and can be reclaimed after their protection period ends. Take note of the following behavior when you use preemptible resources to submit DLC jobs:
If DLC cannot obtain preemptible instances because the inventory is insufficient, the jobs enter the waiting state and DLC continues to apply for preemptible resources.
After the preemptible resources are obtained, the DLC jobs are created and run.
If the preemptible resources are reclaimed, the DLC jobs fail and stop running.
Applying for preemptible resources
When you use preemptible resources to create DLC jobs, DLC starts to preempt instance resources after you submit the jobs. If you want to create preemptible instances, the following requirements must be met:
The maximum bidding price that you configured for preemptible resources must be greater than or equal to the market price.
The inventory of preemptible resources is sufficient.
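The two requirements above can be expressed as a simple predicate. This is illustrative only; the function and parameter names are hypothetical, and PAI evaluates these conditions server-side:

```python
def can_allocate(max_bid: float, market_price: float, inventory: int) -> bool:
    """Hypothetical sketch of the two allocation conditions."""
    # Condition 1: your maximum bidding price must be >= the market price.
    # Condition 2: there must be preemptible inventory left.
    return max_bid >= market_price and inventory > 0

assert can_allocate(0.5, 0.4, 3)        # bid above market price, stock left
assert not can_allocate(0.3, 0.4, 3)    # bid below market price
assert not can_allocate(0.5, 0.4, 0)    # no inventory
```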
Releasing preemptible resources
Preemptible resources can be interrupted and released based on the market price, the resource inventory, the maximum bidding price that you configured during job creation, and the usage duration. In the following scenarios, preemptible resources may be released without notice:
Lingjun resources: The resource is released if its maximum bidding price falls below the average market price or its inventory becomes insufficient.
General-purpose computing resources: The resource is released if its maximum bidding price falls below the current market price or its inventory becomes insufficient.
To ensure that your preemptible jobs can continuously and stably run, you can perform the following operations:
Turn on Automatic Fault Tolerance when you create a job by using Lingjun resources. After you turn on Automatic Fault Tolerance, your job automatically enters the queue to bid for preemptible resources. For more information, see AIMaster: Elastic fault tolerance engine.
When you use general-purpose computing resources or Lingjun resources to create a job, you can use the EasyCkpt framework to train a PyTorch large language model (LLM). The job frequently saves checkpoints and can therefore tolerate interruptions. For more information, see Use EasyCkpt to save and resume foundation model trainings.
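The save-and-resume pattern that makes a job interruption-tolerant can be sketched as follows. This is a minimal illustration using only the standard library, not the EasyCkpt API; the file name and step-based layout are assumptions for the example:

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint file for this sketch

def load_checkpoint() -> int:
    # Resume from the last saved step, or start from 0 on a fresh run.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step: int) -> None:
    with open(CKPT, "w") as f:
        json.dump({"step": step}, f)

start = load_checkpoint()          # a preempted run resumes here
for step in range(start, 10):
    # ... one training step would go here ...
    save_checkpoint(step + 1)      # checkpoint frequently so little work is lost

print(load_checkpoint())  # 10
```

If the job is preempted mid-run, the next run starts from the last saved step instead of from scratch.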
Billing
Pricing
Preemptible jobs use a bidding mode in which you specify a maximum bidding price through the SpotWithPriceLimit strategy. When you create DLC jobs by using preemptible resources, the market prices of the resources fluctuate based on supply and demand. If you submit multiple DLC jobs that use the same preemptible resources within the same period of time, the fees for the jobs may be the same. The following bidding types are supported.
Note: Lingjun resources support only the spot discount bidding type.
Spot discount bidding type: The maximum bidding price is a percentage of the market price of the instance type and ranges from 10% to 90% of that price in 10% increments.
Spot price bidding type: The maximum bidding price is a specific price within the market price range.
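The relationship between the two bidding types and the resulting maximum bid can be sketched as follows. The function names and prices are made-up examples, not real PAI rates:

```python
def max_bid_spot_discount(instance_price: float, discount: float) -> float:
    """Spot discount bidding: bid a fixed fraction of the instance price."""
    # The discount must be one of 0.1, 0.2, ..., 0.9 (10% increments).
    assert round(discount * 10) in range(1, 10)
    return instance_price * discount

def max_bid_spot_price(price_limit: float, floor: float, ceiling: float) -> float:
    """Spot price bidding: bid a specific price inside the market range."""
    assert floor <= price_limit <= ceiling
    return price_limit

print(max_bid_spot_discount(1.0, 0.4))         # 0.4
print(max_bid_spot_price(0.011, 0.005, 0.02))  # 0.011
```

These fractions and limits correspond to the SpotDiscountLimit and SpotPriceLimit parameters used in the SDK examples later in this topic.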
In the Resource Information section of the Create Job page in DLC, set Source to Preemptible Resources and view the preemptible resources and the market price ranges in the Job Resource section.
Note: For the pricing of specific resource specifications, see the console.
General-purpose computing resources

Lingjun resources

Billing mode
You are charged for preemptible resources on a pay-as-you-go basis at the current market price.
Viewing bills
After you run a job, go to the Billing Details page in Expenses and Costs on the next day to view the billing details of the job that uses preemptible resources. The pay-as-you-go fees are generated by DLC, and the instances are tagged with key: acs:pai:dlc:payType, value: spot. For more information about how to view billing details, see View billing details.
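If you export the billing details, you can locate the preemptible-resource fees by filtering on this tag. The record layout below is hypothetical and only illustrates the filtering logic:

```python
# Hypothetical exported billing-detail rows; only the tag value is real.
bills = [
    {"instance_id": "dlc-1", "tag": "key:acs:pai:dlc:payType value:spot", "fee": 0.12},
    {"instance_id": "dlc-2", "tag": "", "fee": 0.50},
]

# Keep only rows that carry the DLC preemptible-resource tag.
spot_bills = [b for b in bills if "acs:pai:dlc:payType" in b["tag"]
              and "spot" in b["tag"]]
spot_total = sum(b["fee"] for b in spot_bills)

print(len(spot_bills), spot_total)  # 1 0.12
```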
Scenarios
Supported scenarios
To reduce costs, we recommend that you use preemptible resources in the following scenarios:
Short-term computing jobs.
Computing jobs in the debug state.
Computing jobs that are fault-tolerant.
Computing jobs that allow interruption. For example, in scenarios in which the EasyCkpt framework is used to train a PyTorch LLM, you can frequently save checkpoints and restore data from the checkpoints. For more information, see Use EasyCkpt to save and resume foundation model trainings.
Unsupported scenarios
Services that require high stability
Procedure
You can submit a preemptible job by using one of the following methods:
Use the PAI console
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
On the Create Job page, configure the parameters described in the following table. For information about other parameters, see Submit training jobs.
Parameter
Description
Resource Information
Resource Type
Select Lingjun AI Computing Service or General Computing.
Note: This parameter is available only if the workspace allows you to use Lingjun resources and general-purpose computing resources.
Source
Select Preemptible Resources.
Job Resource
In the Resource Type column, click the icon to select a preemptible resource and configure the Bid Price parameter. The bid price is the maximum bidding price, which is based on the original price of the instance type and ranges from 10% to 90% of that price in 10% increments. You can obtain the preemptible resource only if your bid meets or exceeds the market price and the inventory is sufficient.
VPC
Configure a virtual private cloud (VPC) if you use Lingjun resources to submit DLC jobs. Select a VPC, a vSwitch, and a security group from the drop-down lists.
Security Group
vSwitch
Fault Tolerance and Diagnosis
Automatic Fault Tolerance
If you use Lingjun resources to submit DLC jobs, we recommend that you enable Automatic Fault Tolerance. The Automatic Fault Tolerance feature allows preemptible jobs to automatically re-enter the bidding queue after resource revocation. The jobs can resume when the average market price falls below your maximum bidding price. For more information about AIMaster, see AIMaster: Elastic fault tolerance engine.
Lingjun resources
Note: For the pricing of specific resource specifications, see the console.

General-purpose computing resources
Note: For the pricing of specific resource specifications, see the console.

After you configure the parameters, click Confirm.
After you submit the job, DLC applies for preemptible resources to create and run the job. If no preemptible resources can be obtained, the job enters the waiting state.
Use the SDK
Step 1: Install the SDK for Python
Install the workspace SDK:
pip install alibabacloud_aiworkspace20210204==3.0.1
Install the DLC SDK:
pip install alibabacloud_pai_dlc20201203==1.4.17
Step 2: Submit a preemptible job
SpotDiscountLimit
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,  # Bid at most 40% of the price of the instance type.
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
SpotPriceLimit
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,  # The maximum bidding price.
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
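After you submit a job, it may wait for preemptible capacity before it runs. A simple status poll can be sketched as follows; get_status here is a stub standing in for a real status query (such as the DLC GetJob operation), and the state names are assumptions for the example:

```python
import time

def wait_for_job(get_status, max_polls: int = 12, interval_s: float = 5) -> str:
    """Poll a status callable until the job reaches a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed", "Stopped"):
            return status
        time.sleep(interval_s)  # wait before polling again
    return "Timeout"

# Example with a stub that reports the job finishing on the third poll:
states = iter(["Queuing", "Running", "Succeeded"])
print(wait_for_job(lambda: next(states), interval_s=0))  # Succeeded
```

In a real script you would replace the stub with a call that queries the job by the job_id printed by the submission code above.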
The following table describes the key parameters. For information about other parameters, see Use SDK for Python.
Parameter | Description |
SpotStrategy | The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit. |
SpotDiscountLimit | The spot discount bidding type. The value is the maximum discount relative to the price of the instance type. For example, 0.4 indicates a bid of at most 40% of that price. Note: Lingjun resources support only this bidding type. |
SpotPriceLimit | The spot price bidding type. The value is the maximum bidding price. |
UserVpc | This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides. |