
Platform for AI: Submit training jobs

Last Updated: Apr 15, 2024

After you complete the preparations, you can submit Deep Learning Containers (DLC) jobs in the Platform for AI (PAI) console, or by using SDK for Python or command lines. This topic describes how to submit a DLC job.

Prerequisites

The preparations are complete. For more information, see Before you begin, (Optional) Prepare a dataset, and (Optional) Prepare a code build.

Submit a job in the console

Step 1: Go to the Create Job page

  1. Log on to the PAI console.

  2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

  3. In the left-side navigation pane of the Workspace page, choose Model Development and Training > Deep Learning Containers (DLC). On the Distributed Training Jobs page, click Create Job. The Create Job page appears.

Step 2: Configure the parameters for the training job

Basic Information

In the Basic Information section, configure the following parameters.

Node Image

The image used by the nodes. Valid values:

  • Alibaba Cloud Images: an image provided by Alibaba Cloud PAI. These images support different Python versions and deep learning frameworks, such as TensorFlow and PyTorch. For more information, see Before you begin.

  • Custom Image: a custom image that you uploaded to PAI. For more information about how to upload an image, see Custom images.

  • Image Address: a custom, community, or Alibaba Cloud image that can be accessed by using the image address. If you select Image Address, you must also specify the publicly accessible URL of the Docker registry image that you want to use.

    If you want to specify the URL of a private image, click Enter and configure the Image Repository Username and Image Repository Password parameters to grant permissions on the private image registry.

    You can also use an accelerated image to accelerate model training. For more information, see Use accelerated image in PAI.

Datasets

The location where job data is stored when the job runs. The dataset provides additional storage space for the training job.

Select the dataset that you prepared. For information about how to create a dataset, see (Optional) Prepare a dataset.

Important
  • If you select an Object Storage Service (OSS) dataset or an Apsara File Storage NAS dataset, you need to authorize PAI to access OSS or NAS. Otherwise, PAI fails to read or write data. For more information, see the "Grant PAI the permissions to access OSS and NAS" section in the Grant the permissions that are required to use DLC topic.

  • If you select a CPFS dataset, you also need to configure a virtual private cloud (VPC). The VPC must be the same as the one configured for the CPFS dataset. Otherwise, exceptions may occur and the DLC training jobs are dequeued after you submit the jobs.

Code Builds

Valid values:

  • Online configuration

    Specify the location of the repository that stores the code file of the job. In this example, select the code build that you prepared. For information about how to create a code build, see (Optional) Prepare a code build.

    Note

    DLC automatically downloads the code to the specified working path. Make sure that your account has permissions to access the repository.

  • Local Upload

    Click the icon and follow the on-screen instructions to upload the code build. After you upload the code build, set the Mount Path parameter to a specific path in the container. Example: /mnt/data.

Third-party Libraries

Valid values:

  • Select from list: enter the name of a third-party library in the field.

  • Directory of requirements.txt: enter the path of the requirements.txt file in the text field. The requirements.txt file must list the third-party libraries to install, as shown in the following example.
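
    For example, a minimal requirements.txt file might contain entries such as the following. The package names and versions are placeholders for illustration only.

    numpy==1.24.4
    pandas>=1.5
    scikit-learn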

Environment Variable

Additional configuration information or parameters in the key:value format. You can configure up to 20 environment variables.
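
For example, you can add entries such as the following. The keys and values are placeholders for illustration only.

    NCCL_DEBUG:INFO
    MY_BATCH_SIZE:64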

Job Command

The command that the job runs. Shell commands are supported. For example, you can use the python -c "print('Hello World')" command to run Python.

When you submit a training job, PAI automatically injects multiple general environment variables. To obtain the value of a specific environment variable, reference it in your command in the $<environment variable name> format. For more information about the default environment variables that DLC provides, see General environment variables.

Note
  • If you configured a dataset, the training results are stored in the directory where the dataset is mounted.

  • If you specified the output path by using variables in the command, the training results are stored in the specified path.
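
The following minimal sketch shows how a training script might read the injected environment variables and write results to a dataset mount path. The variable names (such as MASTER_ADDR and RANK) and the mount path /mnt/data are assumptions for this example; see General environment variables for the authoritative list.

    import os

    # Read environment variables that DLC injects into the container.
    # The names used here are common distributed-training variables and are
    # assumptions for this sketch.
    master_addr = os.environ.get('MASTER_ADDR', 'localhost')
    rank = int(os.environ.get('RANK', '0'))

    # Write training results to the dataset mount path (assumed to be /mnt/data).
    output_dir = '/mnt/data/output'
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, f'result_rank{rank}.txt'), 'w') as f:
        f.write(f'master={master_addr}, rank={rank}\n')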

Resource Configuration

In the Resource Configuration section, configure the following parameters.

Resource Quota

You can select the public resource group, general computing resources, or Lingjun resources that you prepared. For information about how to reserve resource quotas, see Overview.

Note

The public resource group can provide up to 2 GPUs and 8 vCPUs. To increase the resource quota, contact your account manager.

Priority

This parameter is available when you set the Resource Quota parameter to general computing resources or Lingjun resources.

Specify the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority.

Framework

Specify the deep learning training framework and tool. The frameworks provide rich features and operations for you to build, train, and optimize deep learning models. Valid values:

  • TensorFlow

  • PyTorch

  • ElasticBatch

  • XGBoost

  • OneFlow

  • MPIJob

Note

If you set the Resource Quota parameter to Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, and MPIJob.

Job Resource

Configure worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes based on the framework that you select.

  • If you use the public resource group

    Configure the following parameters:

    • Nodes: the number of nodes on which the DLC job is run.

    • Resource Type: Click the icon to select an instance type. For information about resource fees, see Billing of general computing resources.

  • If you use general computing resources or Lingjun resources

    Configure the Nodes, vCPUs, GPUs, Memory (GB), and Shared Memory (GiB) parameters for the nodes.
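
If you submit jobs by using the SDK instead of the console, the same node configuration is expressed as JobSpecs. The following minimal sketch, based on the SDK samples later in this topic, shows a TFJob with one parameter server node and two worker nodes. The image URI, node types, and resource sizes are placeholders.

    from alibabacloud_pai_dlc20201203.models import JobSpec, ResourceConfig

    # One PS node and two worker nodes. Replace the image URI and resource
    # sizes with values that match your job.
    job_specs = [
        JobSpec(type='PS', image='<image URI>', pod_count=1,
                resource_config=ResourceConfig(cpu='4', memory='8Gi')),
        JobSpec(type='Worker', image='<image URI>', pod_count=2,
                resource_config=ResourceConfig(cpu='8', memory='16Gi')),
    ]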

Automatic Fault Tolerance

After you turn on Automatic Fault Tolerance in the Resource Configuration section, the system monitors the training jobs to detect errors in their algorithms and improves GPU utilization. For more information, see AIMaster: elastic automatic fault tolerance engine.

Sanity Check

After you turn on Sanity Check in the Resource Configuration section, the system checks the resources that are used to run the training jobs, isolates faulty nodes, and triggers automated O&M processes in the background. This helps prevent job failures in the early stage of training and improves the training success rate. For more information, see Sanity Check.

Note

You can enable the sanity check feature only for training jobs that run on Lingjun resources.

Maximum Duration

Specify the maximum duration for which the job runs. The job is automatically stopped if the uptime of the job exceeds the maximum duration. Default value: 30. Unit: days.

Instance Retention Period

The period for which the instance is retained after the job is completed. After the retention period ends, the job is deleted.

Important

DLC jobs that are deleted cannot be restored. Perform the delete operation with caution.

VPC

This parameter is available if you set the Resource Quota parameter to the public resource group.

  • If you do not configure a VPC, the job connects over the Internet. Due to the limited bandwidth of the Internet, the job may be stuck or may not run as expected.

  • We recommend that you configure a VPC to ensure sufficient network bandwidth and stable performance.

    Select a VPC, a vSwitch, and a security group in the current region. When the configuration takes effect, the cluster on which the job runs directly accesses the services in this VPC and performs access control based on the security group.

    Important
    • Before you run a DLC job, make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.

    • If you select a CPFS dataset, you also need to configure a VPC. The VPC must be the same as the one configured for the CPFS dataset. Otherwise, exceptions may occur and the DLC training jobs are dequeued after you submit the jobs.

Step 3: Submit the training job

Click Submit to submit the training job. You can go to the jobs list to view the status of the job. For more information about the status of the DLC job, see Appendix: DLC job status.

Submit a job by using SDK for Python or command lines

Use SDK for Python

Step 1: Install SDK for Python

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.0
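
The samples in Step 2 authenticate by using the Credentials SDK, which reads the AccessKey pair from environment variables by default. Make sure that the variables are set before you run the samples. The following sketch shows one way to set them for a quick local test; do not hard-code real credentials in project code.

    import os

    # The Credentials SDK reads the AccessKey pair from these environment
    # variables by default. This is for a local test only. In production,
    # export the variables in your shell or use a more secure credential method.
    os.environ.setdefault('ALIBABA_CLOUD_ACCESS_KEY_ID', '<your AccessKey ID>')
    os.environ.setdefault('ALIBABA_CLOUD_ACCESS_KEY_SECRET', '<your AccessKey secret>')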

Step 2: Submit a job

Use a public resource group to submit the job

You can use the following sample code to create and submit a DLC job:

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account is granted the required permissions on DLC. 
    region_id = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. To prevent security risks, we recommend that you call API operations or perform routine O&M as a RAM user. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and the security of resources within your account may be compromised. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variable to implement identity verification. 
    cred = CredClient()

    # 1. create client;
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter. 
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the image list. 
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the dataset. 
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # Create a dataset when the specified dataset does not exist. 
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system. 
            # General-purpose NAS: 31a8e4****. 
            # Extreme NAS: The ID must start with extreme-. Example: extreme-0015****. 
            # CPFS: The ID must start with cpfs-. Example: cpfs-125487****. 
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the source code file list. 
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the DLC node specification list. 
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job. 
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
                "UseSpotInstance": False,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the DLC job list. 
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()

Use general computing resources to submit the job

  1. Log on to the PAI console.

  2. Obtain your workspace ID on the Workspaces page.

  3. Obtain the resource quota ID of your dedicated resource group on the General Computing Resources page.

  4. Use the following code to create and submit a job. For information about the available public images, see Step 2: Prepare an image.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API operations. 
    region = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. To prevent security risks, we recommend that you call API operations or perform routine O&M as a RAM user. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and the security of resources within your account may be compromised. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variable to implement identity verification. 
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Specify the resource configurations of the job. You can select a public image or specify an image address. For more information about available public images, see the reference documentation. 
    spec = JobSpec(
        type='Worker',
        image=f'registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Specify the execution information for the job. 
    req = CreateJobRequest(
            resource_id='<Replace with your resource quota ID>',
            workspace_id='<Replace with your Workspace ID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job. 
    response = client.create_job(req)
    # Obtain the job ID. 
    job_id = response.body.job_id
    
    # Query the status of the job. 
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the commands that the job runs. 
    print(job.user_command)
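
    If you want to wait for the job to reach a terminal state, you can poll the status in the same way as the wait_for_job_to_terminate function in the preceding public resource group sample:

    import time

    # Poll until the job reaches a terminal state.
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            break
        time.sleep(5)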

Use command lines

Step 1: Download the client and perform user authentication

Download the DLC client for your operating system and authenticate your credentials. For more information, see Before you begin.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. Obtain your workspace ID on the Workspaces page.

  3. Obtain the resource quota ID on the General Computing Resources page.

  4. Create a parameter file named ./tfjob.params and copy the following content into the file. Replace the parameters as required. For more information about how to use command lines in the DLC client, see Supported commands.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Your resource quota ID> # This parameter can be left empty if you use a public resource group. 
    workspace_id=<Your workspace ID> 
  5. Run the following command to submit a DLC job to the specified workspace and resource group by using the parameter file. 

    dlc submit tfjob --job_file ./tfjob.params
  6. Run the following command to query the DLC job that you created.

    dlc get job <jobID>

References

After you submit a job, you can view the job status and logs on the Distributed Training Jobs page.

Appendix: DLC job status

In most cases, a DLC job goes through the following states in sequence: Creating -> Queuing -> Dequeued -> Running -> Successful, Failed, or Stopped.

  • What do I do if the job status is Dequeued?

    • If the DLC job status is Dequeued, the system starts to schedule resources for the job. It takes approximately 5 minutes for the job status to change to Running.

    • If the job stays in the Dequeued state for a long time, it may be because a CPFS dataset is configured for the distributed training job that you created, but the VPC is not configured. You need to create another distributed training job and configure a CPFS dataset and a VPC. The selected VPC must be the same as the one configured for the CPFS dataset.

  • What do I do if the job status is Failed?

    You can move the pointer over the icon next to the job status on the job details page or view the instance operation logs to identify the cause of the job failure.