Platform for AI: Submit training jobs

Last Updated: May 21, 2025

Deep Learning Containers (DLC) in the Platform for AI (PAI) console allows you to quickly create standalone or distributed training jobs. Kubernetes automatically starts the compute nodes at the underlying layer of DLC, so you do not need to manually purchase instances or configure environments, and your usage habits remain unchanged. DLC is suitable for users who need to start training jobs quickly, supports a variety of deep learning frameworks, and provides flexible resource configurations.

Prerequisites

  • PAI is activated, and a workspace is created by using your Alibaba Cloud account. To activate PAI, log on to the PAI console, select a region in the top navigation bar, and click Activate after authorization.

  • The account that you use to perform operations is granted the required permissions. If you use an Alibaba Cloud account, you can ignore this prerequisite. If you use a RAM user, you must assign one of the following roles to the RAM user: algorithm developer, algorithm O&M engineer, or workspace administrator.

Submit a job in the PAI console

If this is the first time you use DLC, we recommend that you submit a job in the PAI console. You can also submit a job by using SDK for Python or the command line.

  1. Go to the Create Job page.

    1. Log on to the PAI console, select the target region and workspace, and then click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. Configure the parameters in the following sections.

    • Basic Information

      In this section, configure the Job Name and Tag parameters.

    • Environment Information

      Parameter

      Description

      Node Image

      The node image. You can select Alibaba Cloud Image, or select one of the following values:

      • Custom Image: a custom image that you uploaded to PAI. Make sure that you can pull images from the image repository, or that the images are stored on a Container Registry Enterprise Edition instance.

      • Image Address: the address of a custom or Alibaba Cloud image that can be accessed over the Internet.

        Note
        • If you enter a private image address, click Enter the username and password and specify the Username and Password parameters to grant permissions on the private image repository.

        • You can also use an accelerated image in PAI.

      Data Set

      The dataset that provides data files required in model training. You can use one of the following dataset types.

      • Custom Dataset: Create a custom dataset to store data files required in model training. You can configure the Read/Write Permission parameter and select the required version in the Versions panel.

      • Public Dataset: Select an existing public dataset provided by PAI. Public datasets only support read-only mounting.

      Mount Path: the path in the DLC container, such as /mnt/data. You can run commands to query datasets based on the mount path you specified. For more information about mounting configuration, see Use cloud storage for a DLC training job.

      Important

      If you select a Cloud Parallel File Storage (CPFS) dataset, you must configure a virtual private cloud (VPC) for the DLC job. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.

      Directly Mount

      You can directly mount data sources to read data or to store intermediate files and result files.

      • Supported data sources: OSS, General-purpose NAS, Extreme NAS, and BMCPFS. BMCPFS is available only for jobs that use Lingjun resources.

      • Advanced Settings: You can configure this parameter for different data sources to implement specific features. Examples:

        • OSS: You can add the {"mountType":"ossfs"} configuration in Advanced Settings to mount an OSS bucket by using ossfs.

        • General-purpose NAS and CPFS: You can specify the nconnect parameter in Advanced Settings to improve the throughput of NAS access in DLC containers. Sample configuration: {"nconnect":"<Sample value>"}. Replace <Sample value> with a positive integer.

      For more information, see Use cloud storage for a DLC training job.

      Startup Command

      The commands that the job runs. Shell commands are supported. DLC automatically injects general PyTorch and TensorFlow environment variables, such as MASTER_ADDR and WORLD_SIZE. You can obtain the variables by using $<Environment variable name> (see the sketch after the sample commands below). Sample commands:

      • Run Python: python -c "print('Hello World')"

      • Start PyTorch multi-machine, multi-GPU distributed training:

        python -m torch.distributed.launch \
            --nproc_per_node=2 \
            --master_addr=${MASTER_ADDR} \
            --master_port=${MASTER_PORT} \
            --nnodes=${WORLD_SIZE} \
            --node_rank=${RANK} \
            train.py --epochs=100

      • Set a shell script path as the startup command: /ml/input/config/launch.sh
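
      The following minimal sketch shows how a training script might read the injected variables. The file name train.py and the printed message are hypothetical; note that in the sample launch command above, WORLD_SIZE is the number of nodes and RANK is the rank of the current node.

        # train.py (hypothetical): read the environment variables that DLC injects.
        import os

        master_addr = os.environ.get('MASTER_ADDR')
        master_port = os.environ.get('MASTER_PORT')
        node_rank = os.environ.get('RANK')
        node_count = os.environ.get('WORLD_SIZE')

        print(f'node {node_rank}/{node_count} connects to {master_addr}:{master_port}')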

      Click Show More to configure Environment Variable, Third-party Libraries, and Code Builds

      Environment Variable

      The custom environment variables in addition to PyTorch and TensorFlow general environment variables that are automatically injected. The format is Key:Value. You can configure up to 20 environment variables.

      Third-party Libraries

      If specific third-party libraries do not exist in the configured container image, you can configure the Third-party Libraries parameter to add them. The following methods are supported:

      • Select from List: Enter the name of a third-party library in the field.

      • Directory of requirements.txt: You must upload the requirements.txt file to the DLC container by using code builds, datasets, or direct mounting. Then, enter the path of the requirements.txt file in the DLC container in the field.

      Code Builds

      You must upload the code build that is required for training to the DLC container. The following methods are supported:

      • Online configuration: If you have the access permissions on a Git repository, create a code build to associate the repository with the DLC job. This way, the DLC job can use the code build.

      • Local Upload: Click the upload icon to upload a local code build. After the upload succeeds, set the Mount Path parameter to the specified path in the container, such as /mnt/data.

    • Resource Information

      Parameter

      Description

      Resource Type

      The resource type. Default value: General Computing Resources. You can select Lingjun Resources only for the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions.

      Source

      • Public Resources:

        • Billing method: Pay-as-you-go.

        • Scenarios: Training jobs that run on public resources may encounter queuing delays. We recommend that you use public resources in time-insensitive scenarios that involve a small number of jobs.

        • Limits: Public resources can provide up to two GPUs and eight vCPUs. To increase the resource quota, contact your sales manager.

      • Resource Quota: includes general computing resources and Lingjun resources.

        • Billing method: Subscription.

        • Scenarios: Resource quotas are suitable for scenarios that require high assurance and involve a large number of jobs.

        • Special parameters:

          • Resource Quota: Specifies the number of GPUs, vCPUs, and other resources. For more information, see Create a resource quota.

          • Priority: Specifies the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority.

      • Preemptible Resources:

        • Billing method: Pay-as-you-go.

        • Scenarios: Preemptible resources are suitable for scenarios that require cost reduction. In most cases, preemptible resources offer additional discounts.

        • Limits: The high availability and stability of preemptible resources are not guaranteed. Resources may not be immediately available or may be released by the system. For more information, see Use a preemptible job.

      Framework

      The deep learning training framework and tool. Valid values: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, and Ray.

      Note

      If you set the Resource Quota parameter to Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray.

      Job Resource

      Configure the resources of the following nodes based on the framework you selected: worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes. If you select the Ray framework, you can click Add Role to create a custom Worker role. This enables different types of computing resources to work together seamlessly.

      • Use public resources: You can configure the following parameters:

        • Number of Nodes: the number of nodes on which the DLC job runs.

        • Resource Type: Select an instance type. The prices of different instance types are displayed in the Instance Type panel. For information about the billing, see Billing of DLC.

      • Use resource quotas: In addition to the Number of Nodes, vCPUs, GPUs, Memory (GiB), and Shared Memory (GiB) parameters, you can also configure the following special parameters:

        • Node-Specific Scheduling: You can specify a computing node to run the job.

        • Idle Resources: If you enable this feature, jobs can run on idle resources of quotas that are allocated for other business jobs. This effectively improves resource utilization. However, when the idle resources need to be returned, jobs that run on the idle resources are terminated and the resources are automatically returned.

        • CPU Affinity: If you enable this feature, processes in a container or pod can be bound to a specific CPU core for execution. This prevents issues such as CPU cache misses and context switches, and improves CPU utilization and application performance. This feature is suitable for scenarios that have high requirements on performance and timeliness.

      • Use preemptible resources: In addition to the Number of Nodes and Resource Type parameters, you can also configure the Bid Price parameter to specify the maximum bid price to apply for the preemptible resources. You can click the image icon to switch the bidding method.

        • Bid Price (Discount): The maximum bid price ranges from 10% to 90% of the market price with a 10% interval. You can get the preemptible resources if your bid meets or exceeds the market price and inventory is available.

        • Bid Price ($/Minutes): The maximum bid price range is based on the market price range.
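
        For example, a maximum bid of 40% of the market price corresponds to the SpotDiscountLimit value of 0.4 that is used in the preemptible resource examples in the Appendix of this topic.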

      Click Show More to configure Maximum Duration and Retention Period

      Maximum Duration

      The maximum duration for which a job runs. The job is automatically stopped when the running duration of the job exceeds the maximum duration. Default value: 30. Unit: days.

      Retention Period

      The retention period of successful or failed jobs. During the retention period, the jobs continue to occupy resources. After the retention period ends, the jobs are deleted.

      Important

      DLC jobs that are deleted cannot be restored. Exercise caution when you delete a job.

    • VPC

      • If you do not configure a VPC, Internet connection and public gateways are used. Due to the limited bandwidth of public gateways, the job may be stuck or may not run as expected.

      • If you configure a VPC and select a vSwitch and a security group, the network bandwidth is increased, and both the performance stability and security are enhanced. After the configuration takes effect, the cluster on which the job runs directly accesses the services in the VPC.

        Important
        • If you use VPCs, you must make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.

        • If you use a CPFS dataset, you must configure a VPC. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.

        • If you use Lingjun preemptible resources to submit a DLC job, you must configure a VPC.

        You can also configure the Internet Gateway parameter. Valid values:

        • Public Gateway: The public bandwidth is limited. The download rate may not meet your business requirements in high concurrency or large file downloading scenarios.

        • Private Gateway: To increase the limited public bandwidth, you can use a private gateway. You need to create an Internet NAT gateway, associate an elastic IP address (EIP), and configure SNAT in the VPC that is configured for the job. For more information, see Improve Internet access rate by using a private gateway.

    • Fault Tolerance and Diagnosis

      Parameter

      Description

      Automatic Fault Tolerance

      After you turn on Automatic Fault Tolerance and configure the related parameters, the system checks the jobs to identify and handle algorithmic errors of the jobs. This helps improve GPU utilization. For more information, see AIMaster: Elastic fault tolerance engine.

      Note

      After you enable Automatic Fault Tolerance, the system starts an AIMaster instance that runs together with the job instance and occupies the following resources:

      • Resource quotas: 1 CPU core and 1 GB of memory.

      • Public resources: ecs.c6.large.

      Sanity Check

      After you turn on Sanity Check, the system detects the resources that are used to run the jobs, isolates faulty nodes, and triggers automated O&M processes in the background. Sanity check effectively reduces job failures in the early stage of training and improves the training success rate.

      Note

      You can enable sanity check only for PyTorch jobs that run on Lingjun resources and use GPUs.

    • Roles and Permissions

      The following table describes how to configure the Instance RAM Role parameter. For more information, see Associate a RAM role with a DLC job.

      Instance RAM Role

      Description

      Default Roles of PAI

      The default roles of PAI are developed based on the AliyunPAIDLCDefaultRole role and have only the permissions to access MaxCompute and OSS. You can use the default roles to implement fine-grained permission management. If you have the temporary credentials issued by the default roles of PAI:

      • You are granted the same permissions as the owner of a DLC job when you access MaxCompute tables.

      • When you access OSS, you can access only the bucket that is configured as the default storage path for the current workspace.

      Custom Roles

      Select or create a custom Resource Access Management (RAM) role. You are granted the same permissions as the custom role you select when you call API operations of other Alibaba Cloud services by using Security Token Service (STS) temporary credentials.

      Does Not Associate Role

      Do not associate a RAM role with the DLC job. By default, this option is selected.

  3. After you configure the parameters, click Confirm.

What to do next

After you submit the training job, you can perform the following operations:

  • View the basic information, resource views, and logs of the job. For more information, see View training jobs.

  • Manage the training job: clone, stop, or delete the job.

  • View the analysis report of model training results by using TensorBoard.

  • Monitor the training job and configure alert rules. For more information, see Training monitoring and alerting.

  • View detailed information about your job execution bills. For more information, see Bill details.

  • Forward logs of the DLC job from the workspace to a specific Logstore for custom analysis. For more information, see Subscribe to job logs.

  • Create a notification rule for the workspace on the Configure Event Notification tab in the PAI console to track and monitor the status of the DLC job.

  • If you have other questions about DLC jobs, see FAQ about DLC.

  • View DLC use cases.

Appendix

Submit a job by using SDK for Python or the command line

Use SDK for Python

Step 1: Install the Alibaba Cloud SDK Credentials tool

You must configure valid credential information before you call API operations to manage cloud resources by using Alibaba Cloud SDKs. Prerequisites:

  • Python 3.7 or later is installed.

  • Alibaba Cloud SDK V2.0 is installed.

pip install alibabacloud_credentials
Step 2: Obtain an AccessKey pair

In this example, an AccessKey pair is used as the access credential. To prevent account information leaks, we recommend that you configure the AccessKey pair as environment variables. The variable names for the AccessKey ID and AccessKey secret are ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET.
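
The following minimal sketch (an illustration, not part of the official samples) checks that both environment variables are set before the Credentials SDK reads them:

# Verify that the AccessKey pair is available as environment variables.
# CredClient() in the samples below reads these variables automatically.
import os

for name in ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'):
    if not os.environ.get(name):
        raise EnvironmentError(f'{name} is not set')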

Step 3: Install SDK for Python
  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.17
Step 4: Submit the job
Use public resources to submit the job

The following sample code provides an example on how to create and submit a job:

Sample code for creating and submitting a job

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account has the required permissions on DLC. 
    region_id = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variables to perform identity verification. 
    cred = CredClient()

    # 1. create client;
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter. 
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the image list. 
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the dataset. 
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # Create a dataset if the specified dataset does not exist. 
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system. 
            # General-purpose NAS: 31a8e4****. 
            # Extreme NAS: The ID must start with extreme-. Example: extreme-0015****. 
            # CPFS: The ID must start with cpfs-. Example: cpfs-125487****. 
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the source code file list. 
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the DLC node specification list. 
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job. 
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the DLC job list. 
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()
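
After the environment variables from Step 2 are set, save the sample code to a file such as submit_job.py (the file name is your choice) and run it with Python 3.7 or later. The script lists your workspaces, images, and datasets, creates a NAS dataset if none exists, submits the job, waits until the job reaches a terminal state, and then lists the jobs in the workspace.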
Use subscription resource quotas to submit the job
  1. Log on to the PAI console.

  2. On the Workspaces page, obtain your workspace ID.

  3. Obtain the resource quota ID of your dedicated resource group.

  4. Run the following code to create and submit the job. For information about the available public images, see Step 2: Prepare an image.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API operations. 
    region = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variables to perform identity verification. 
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Specify the resource configurations of the job. You can select a public image or specify an image address. For information about the available public images, see the reference documentation. 
    spec = JobSpec(
        type='Worker',
        image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Specify the execution information for the job. 
    req = CreateJobRequest(
            resource_id='<Your resource quota ID>',
            workspace_id='<Your workspace ID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job. 
    response = client.create_job(req)
    # Obtain the job ID. 
    job_id = response.body.job_id
    
    # Query the job status. 
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the commands that the job runs. 
    print('job command:', job.user_command)
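
    To wait for the job to finish, you can poll the job status in a loop, in the same way as the wait_for_job_to_terminate function in the previous example. A minimal sketch, assuming the client and job_id variables from the preceding code:

    import time

    # Poll the job until it reaches a terminal state.
    while True:
        status = client.get_job(job_id, GetJobRequest()).body.status
        print('job({}) is {}'.format(job_id, status))
        if status in ('Succeeded', 'Failed', 'Stopped'):
            break
        time.sleep(5)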
Use preemptible resources to submit jobs
  • SpotDiscountLimit (Spot discount)
    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou. 
    cred = CredClient()
    workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs. 
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-spot-job',
        'JobType': 'PyTorchJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.g7.xlarge',
                "SpotSpec": {
                    "SpotStrategy": "SpotWithPriceLimit",
                    "SpotDiscountLimit": 0.4,
                }
            },
        ],
        'UserVpc': {
            "VpcId": "vpc-0jlq8l7qech3m2ta2****",
            "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
            "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
        },
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    
  • SpotPriceLimit (Spot price)
#!/usr/bin/env python3

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient

from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'
cred = CredClient()
workspace_id = '12****'

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')

The following table describes the key configurations.

Parameter

Description

SpotStrategy

The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit.

SpotDiscountLimit

The spot discount bidding type.

Note
  • You cannot specify the SpotDiscountLimit and SpotPriceLimit parameters at the same time.

  • The SpotDiscountLimit parameter is valid only for Lingjun resources.

SpotPriceLimit

The spot price bidding type.

UserVpc

This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides.

Use the command line

Step 1: Download the DLC client and perform user authentication

Download the DLC client for your operating system and verify your credentials. For more information, see Before you begin.

Step 2: Submit the job
  1. Log on to the PAI console.

  2. On the Workspaces page, obtain your workspace ID.

  3. Obtain the resource quota ID.

  4. Create a parameter file named tfjob.params and copy the following content into the file. For information about commands that are used to submit jobs, see Commands used to submit jobs.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Your resource quota ID>
    workspace_id=<Your workspace ID>
  5. Run the following command to submit the DLC job to the specified workspace and resource quota by using the parameter file.

    ./dlc submit tfjob --job_file ./tfjob.params
  6. Run the following command to query the DLC job that you submitted.

    ./dlc get job <jobID>