
Platform for AI: Submit training jobs

Last Updated: Jul 05, 2024

After you complete the preparations, you can submit Deep Learning Containers (DLC) jobs in the Platform for AI (PAI) console or by using SDK for Python or command lines. This topic describes how to submit a DLC job.

Prerequisites

The preparations for submitting a training job are complete. For more information, see Before you begin.

Submit a job in the PAI console

Step 1: Go to the Create Job page

  1. Log on to the PAI Console.

  2. In the left-side navigation pane, click Workspaces. Find the workspace that you want to manage and click the workspace ID.

  3. In the left-side navigation pane of the Workspace page, choose Model Development and Training > Deep Learning Containers (DLC). On the Distributed Training Jobs page, click Create Job. The Create Job page appears.

Step 2: Configure the parameters for the training job

Environment configuration

In the Environment Information section, configure the key parameters described below.

Node Image

The worker node image. You can select one of the following node images:

  • Alibaba Cloud image: an image provided by Alibaba Cloud PAI. Such images support different Python versions and deep learning frameworks, such as TensorFlow and PyTorch. For more information, see Before you begin.

  • Custom Image: a custom image that you uploaded to PAI. Before you select this option, you must upload your custom image to PAI. For more information about how to upload an image, see Custom images.

    Note

    If you want to use Lingjun resources, install Remote Direct Memory Access (RDMA) support so that the job can use the high-performance RDMA network of Lingjun resources. For more information, see RDMA: high-performance networks for distributed training.

  • Image Address: a custom, community, or Alibaba Cloud image that can be accessed by using the image address. If you select Image Address, you must also specify the public URL of the Docker registry image that you want to access.

    If you want to specify the private URL of an image, click Enter and configure the Image Repository Username and Image Repository Password parameters to grant permissions on the private image registry.

    You can also use an accelerated image to accelerate model training. For more information, see Use an accelerated image in PAI.

Datasets

The location where job data is stored when the job runs. The dataset provides the storage space that the training job uses to read and write data.

Select the dataset that you prepared. For more information, see Step 3: Prepare a dataset.

Important
  • If you select an Object Storage Service (OSS) dataset or an Apsara File Storage NAS (NAS) dataset, you must grant PAI the permissions to access OSS or NAS. Otherwise, PAI cannot read or write data. For more information, see the "Grant PAI the permissions to access OSS and NAS" section in the Grant the permissions that are required to use DLC topic.

  • If you select a CPFS dataset, you must configure a virtual private cloud (VPC). The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, exceptions may occur and the training job may remain in the Dequeued state after you submit it.

Startup Command

The command that the job runs. Shell commands are supported. For example, you can use the python -c "print('Hello World')" command to run Python.

When you submit a training job, PAI automatically injects multiple general environment variables. To obtain the value of an environment variable, reference it in the command in the $<Environment variable name> format. For more information about the general environment variables provided by DLC, see General environment variables.

Note
  • If you configure a dataset, the training results are stored in the directory to which the dataset is mounted.

  • If you specify the output path by using variables in the command, the training results are stored in the specified path.

Environment Variable

Additional configuration information or parameters. The format is key:value. You can configure up to 20 environment variables.
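
In your training code, you can read both the general environment variables injected by DLC and the custom key:value pairs that you configure here from the process environment. The following Python sketch is for illustration only; MY_PARAM and MASTER_ADDR are example names, not variables that are guaranteed to exist in your job.

    import os

    # Read a custom environment variable that you configured in the key:value format.
    # MY_PARAM is a hypothetical name used only for illustration.
    my_param = os.environ.get('MY_PARAM', 'default-value')

    # Read a general environment variable injected by DLC.
    # MASTER_ADDR is shown as an example; see General environment variables
    # for the variables that are injected for your framework.
    master_addr = os.environ.get('MASTER_ADDR')

    print('MY_PARAM =', my_param)
    print('MASTER_ADDR =', master_addr)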

Third-party Libraries

This parameter supports the following configurations:

  • Select from List: enter the name of a third-party library in the field.

  • Directory of requirements.txt: enter the path of the requirements.txt file in the field. List the third-party libraries that the job requires in the requirements.txt file, as shown in the example after this list.
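
For example, a requirements.txt file might contain entries such as the following. The package names and versions are placeholders for illustration; replace them with the libraries that your training code actually requires.

    numpy==1.24.4
    pandas==2.0.3
    scikit-learn==1.3.2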

Code Builds

This parameter supports the following configurations:

  • Online Configuration

    Specify the location of the repository that stores the code file of the job. In this example, select the code build that you prepared. For information about how to create a code build, see the "Step 4: Prepare a code build" section in the Before you begin topic.

    Note

    DLC automatically downloads the code to the specified working path. Make sure that your account has permissions to access the repository.

  • Local Upload

    Click the upload icon and follow the on-screen instructions to upload the code build. After the upload succeeds, set the Mount Path parameter to a path in the container, such as /mnt/data.

Resource configuration

In the Resource Information section, configure the key parameters described below.

Resource Type

This parameter is available only if the workspace allows you to use Lingjun resources and general computing resources to submit training jobs in DLC.

  • Lingjun Resources

    Note

    Lingjun resources are available only in the China (Ulanqab) and Singapore regions.

  • General Computing Resources

Resource Source

Valid values: Public Resources and Resource Quota. Resource Quota includes general computing resources and Lingjun resources.

Note

The public resources can provide up to two GPUs and eight vCPUs. To increase the resource quota, contact your account manager.

Resource Quota

This parameter is required only if you set the Resource Source parameter to Resource Quota. Select the resource quota that you prepared. For more information about how to prepare a resource quota, see Resource quota overview.

Priority

This parameter is available only if you set the Resource Source parameter to Resource Quota.

Specify the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority.

Framework

Specify the deep learning training framework and tool. The framework provides rich features and operations that you can use to build, train, and optimize deep learning models.

  • TensorFlow

  • PyTorch

  • ElasticBatch

  • XGBoost

  • OneFlow

  • MPIJob

Note

If you use Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, and MPIJob.

Job Resource

Configure the following nodes based on the framework you selected: worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes.

  • Use the public resources

    Configure the following parameters:

    • Nodes: the number of nodes on which the DLC training job runs.

    • Resource Type: Select an instance type. For information about the billing of resource specifications, see Billing of DLC.

  • Use general computing resources or Lingjun resources

    Configure the following parameters for the nodes: Nodes, vCPUs, GPUs, Memory (GB), and Shared Memory (GiB).

Maximum Duration

You can specify the maximum duration for which a job runs. The job is automatically stopped when the uptime of the job exceeds the maximum duration. Default value: 30. Unit: days.

Retention Period

The retention period of the job instance after the job is completed. After the retention period ends, the job is deleted.

Important

DLC jobs that are deleted cannot be restored. Exercise caution when you delete the jobs.

VPC Settings

This parameter is available only if you set the Resource Source parameter to Public Resources (Pay-as-you-go).

  • If you do not configure a VPC, Internet connection is used. Due to the limited bandwidth of the Internet, the job may be stuck or may not run as expected.

  • To ensure sufficient network bandwidth and stable performance, we recommend that you configure a VPC.

    Select a VPC, a vSwitch, and a security group in the current region. When the configuration takes effect, the cluster on which the job runs directly accesses the services in the VPC and performs access control based on the selected security group.

    Important
    • Before you run a DLC job, make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.

    • If you select a CPFS dataset, you must configure a VPC. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, exceptions may occur and the DLC training job may remain in the Dequeued state after you submit it.

Fault Tolerance and Diagnosis

In the Fault Tolerance and Diagnosis section, configure the key parameters described below.

Automatic Fault Tolerance

After you turn on Automatic Fault Tolerance, the system checks the job to identify errors in the job's algorithm and to improve GPU utilization. For more information, see AIMaster: elastic fault tolerance engine.

Sanity Check

After you turn on Sanity Check, the system detects the resources that are used to run the training jobs, isolates faulty nodes, and triggers the automated O&M processes in the background. This prevents job failure in the early stage of training and improves the training success rate. For more information, see Sanity Check.

Note

You can enable the sanity check feature only for training jobs that run on Lingjun resources.

Step 3: Submit the training job

Click Submit to submit the training job. You can go to the jobs list to view the status of the job. For more information about the status of the DLC job, see Appendix: DLC job status.

Submit a job by using SDK for Python or command lines

Use SDK for Python

Step 1: Install SDK for Python

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.0
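
To confirm that both packages are installed in the current Python environment, you can run a quick import check. This is only a convenience sketch; the module names match the packages installed above.

    # Verify that both SDK packages can be imported.
    import alibabacloud_aiworkspace20210204
    import alibabacloud_pai_dlc20201203

    print('Workspace SDK and DLC SDK imported successfully.')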

Step 2: Submit the job

  • If you want to submit a job that runs on pay-as-you-go resources, you can use public resources. Training jobs that run on public resources may encounter queuing delays. We recommend that you use public resources in time-insensitive scenarios that involve a small number of tasks.

  • If you want to submit a job that runs on subscription resources, you can use dedicated resources, such as general computing resources or Lingjun resources. You can use dedicated resources to ensure resource availability in high workload scenarios.

Use public resources to submit jobs

The following sample code provides an example on how to create and submit a DLC job:

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account has the required permissions on DLC. 
    region_id = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a Resource Access Management (RAM) user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked. This may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variable to perform identity verification. 
    cred = CredClient()

    # 1. create client;
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter. 
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the image list. 
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the dataset. 
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # Create a dataset if the specified dataset does not exist. 
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system. 
            # General-purpose NAS: 31a8e4****. 
            # Extreme NAS: The ID must start with extreme-. Example: extreme-0015****. 
            # CPFS: The ID must start with cpfs-. Example: cpfs-125487****. 
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the source code file list. 
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the DLC node specification list. 
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job. 
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
                "UseSpotInstance": False,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the DLC job list. 
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()
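
The CredClient() in the preceding script reads your AccessKey pair from environment variables, so export the credentials before you run it. The following pre-flight check is a sketch that assumes the commonly used ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET variable names; adjust it if you use a different credential mechanism.

    import os

    # Fail fast if the AccessKey environment variables are not set.
    for var in ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'):
        if not os.environ.get(var):
            raise RuntimeError(f'{var} is not set. Export it before running the script.')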

Use subscription resources to submit jobs

  1. Log on to the PAI console.

  2. On the Workspaces page, obtain your workspace ID.

  3. On the General Computing Resources page, obtain the resource quota ID of your dedicated resource group.

  4. The following sample code provides an example on how to create and submit a job. For information about the available public images, see the "Step 2: Prepare an image" section in the Before you begin topic.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API operations. 
    region = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. To prevent security risks, we recommend that you call API operations or perform routine O&M as a RAM user. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources under your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variable to perform identity verification. 
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Specify the resource configurations of the job. You can select a public image or specify an image address. For information about the available public images, see the reference documentation. 
    spec = JobSpec(
        type='Worker',
        image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Specify the execution information for the job. 
    req = CreateJobRequest(
            resource_id='<Your resource quota ID>',
            workspace_id='<Your workspace ID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job. 
    response = client.create_job(req)
    # Obtain the job ID. 
    job_id = response.body.job_id
    
    # Query the job status. 
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the commands that the job runs. 
    print('job command:', job.user_command)
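
    The preceding snippet queries the job status only once. To wait until the job reaches a terminal state, you can poll get_job in a loop, similar to the wait_for_job_to_terminate function in the public-resource example:

    import time

    # Poll the job until it reaches a terminal state (sketch based on the
    # wait_for_job_to_terminate function shown earlier in this topic).
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            break
        time.sleep(5)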

Use command lines

Step 1: Download the DLC client and perform user authentication

Download the DLC client for your operating system and verify your credentials. For more information, see Before you begin.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. On the Workspaces page, obtain your workspace ID.

  3. On the General Computing Resources page, obtain the resource quota ID.

  4. Create a parameter file named ./tfjob.params and copy the following content into the file. Change the parameter values based on your business requirements. For information about how to use command lines in the DLC client, see Supported commands.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Your resource quota ID> # If you use the public resource group, you can leave this parameter empty.  
    workspace_id=<Your workspace ID> 
  5. The following sample code provides an example on how to submit a DLC job to the specified workspace and resource group by using the parameter file.

    dlc submit tfjob --job_file ./tfjob.params
  6. The following command provides an example on how to query the DLC job that you created.

    dlc get job <jobID>

What to do next

After you submit the job, you can perform the following operations:

  • View the basic information, resource view, and operation logs of the job. For more information, see View training jobs.

  • Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.

  • View the training results on TensorBoard. For more information, see Use TensorBoard to view training results in DLC.

  • View the billing details when the job is completed. For more information, see Bill details.

  • Enable the log forwarding feature to forward logs of DLC jobs from the current workspace to a specific Logstore for custom analysis. For more information, see Subscribe to job logs.

Appendix: DLC job status

In most cases, a DLC job goes through the following states in sequence: Creating -> Queuing -> Dequeued -> Running -> Succeeded, Failed, or Stopped.

  • What do I do if the job status is Dequeued?

    • When the DLC job status becomes Dequeued, the system starts to schedule resources for the job. After approximately 5 minutes, the job status changes to Running.

    • If the job remains in the Dequeued state for a long time, a possible cause is that a CPFS dataset is configured for the distributed training job that you created, but the VPC is not configured. Create another distributed training job and configure a CPFS dataset and a VPC. The selected VPC must be the same as the VPC configured for the CPFS dataset.

  • What do I do if the job status is Failed?

    You can identify the cause of the job failure by moving the pointer over the icon next to the job status on the job details page, or by checking the instance operation logs.