
Platform for AI: Submit training jobs

Last Updated: Jul 05, 2024

After you complete the preparations, you can submit Deep Learning Containers (DLC) jobs in the Platform for AI (PAI) console or by using SDK for Python or command lines. This topic describes how to submit a DLC job.

Prerequisites

The preparations for submitting a training job are complete. For more information, see Before you begin.

Submit a job in the PAI console

Step 1: Go to the Create Job page

  1. Log on to the PAI Console.

  2. In the left-side navigation pane, click Workspaces. Find the workspace that you want to manage and click the workspace ID.

  3. In the left-side navigation pane of the Workspace page, choose Model Development and Training > Deep Learning Containers (DLC). On the Distributed Training Jobs page, click Create Job. The Create Job page appears.

Step 2: Configure the parameters for the training job

Environment configuration

In the Environment Information section, configure the key parameters described below.

Node Image

The worker node image. You can select one of the following node images:

  • Alibaba Cloud image: an image provided by Alibaba Cloud PAI. Such images support different Python versions and deep learning frameworks, such as TensorFlow and PyTorch. For more information, see Before you begin.

  • Custom Image: a custom image that you uploaded to PAI. Before you select this option, you must upload your custom image to PAI. For more information about how to upload an image, see Custom images.

    Note

    If you want to use Lingjun resources, install Remote Direct Memory Access (RDMA) support so that the job can use the high-performance RDMA network of Lingjun resources. For more information, see RDMA: high-performance networks for distributed training.

  • Image Address: a custom, community, or Alibaba Cloud image that can be accessed by using the image address. If you select Image Address, you must also specify the public URL of the Docker registry image that you want to access.

    If you want to specify the private URL of an image, click Enter and configure the Image Repository Username and Image Repository Password parameters to grant permissions on the private image registry.

    You can also use an accelerated image to accelerate model training. For more information, see Use an accelerated image in PAI.

Datasets

The location where job data is stored when the job runs. The dataset provides the storage space that the training job uses to read and write data.

Select the dataset that you prepared. For more information, see Step 3: Prepare a dataset.

Important
  • If you select an Object Storage Service (OSS) dataset or an Apsara File Storage NAS (NAS) dataset, you must grant PAI the permissions to access OSS or NAS. Otherwise, PAI cannot read or write data. For more information, see the "Grant PAI the permissions to access OSS and NAS" section in the Grant the permissions that are required to use DLC topic.

  • If you select a CPFS dataset, you must configure a virtual private cloud (VPC). The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, exceptions may occur and the training job may remain in the Dequeued state after you submit it.

Startup Command

The command that the job runs. Shell commands are supported. For example, you can use the python -c "print('Hello World')" command to run Python.

When you submit a training job, PAI automatically injects multiple general environment variables. To obtain the value of an environment variable, reference it in the command in the $<Environment variable name> format. For more information about the general environment variables provided by DLC, see General environment variables.

Note
  • If you configure a dataset, the training results are stored in the directory to which the dataset is mounted.

  • If you specify the output path by using variables in the command, the training results are stored in the specified path.

Environment Variable

Additional configuration information or parameters. The format is key:value. You can configure up to 20 environment variables.
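
In your training code, you can read both the general environment variables injected by DLC and the custom key:value pairs that you configure here from the process environment. The following Python sketch is for illustration only; MY_PARAM and MASTER_ADDR are example names, not variables that are guaranteed to exist in your job.

    import os

    # Read a custom environment variable that you configured in the key:value format.
    # MY_PARAM is a hypothetical name used only for illustration.
    my_param = os.environ.get('MY_PARAM', 'default-value')

    # Read a general environment variable injected by DLC.
    # MASTER_ADDR is shown as an example; see General environment variables
    # for the variables that are injected for your framework.
    master_addr = os.environ.get('MASTER_ADDR')

    print('MY_PARAM =', my_param)
    print('MASTER_ADDR =', master_addr)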

Third-party Libraries

This parameter supports the following configurations:

  • Select from List: enter the name of a third-party library in the field.

  • Directory of requirements.txt: enter the path of the requirements.txt file in the field. List the third-party libraries that the job requires in the requirements.txt file, as shown in the example after this list.
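
For example, a requirements.txt file might contain entries such as the following. The package names and versions are placeholders for illustration; replace them with the libraries that your training code actually requires.

    numpy==1.24.4
    pandas==2.0.3
    scikit-learn==1.3.2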

Code Builds

This parameter supports the following configurations:

  • Online Configuration

    Specify the location of the repository that stores the code file of the job. In this example, select the code build that you prepared. For information about how to create a code build, see the "Step 4: Prepare a code build" section in the Before you begin topic.

    Note

    DLC automatically downloads the code to the specified working path. Make sure that your account has permissions to access the repository.

  • Local Upload

    Click the upload icon and follow the on-screen instructions to upload the code build. After the upload succeeds, set the Mount Path parameter to a path in the container, such as /mnt/data.

Resource configuration

In the Resource Information section, configure the key parameters described below.

Resource Type

This parameter is available only if the workspace allows you to use Lingjun resources and general computing resources to submit training jobs in DLC.

  • Lingjun Resources

    Note

    Lingjun resources are available only in the China (Ulanqab) and Singapore regions.

  • General Computing Resources

Resource Source

Valid values: Public Resources and Resource Quota. Resource Quota includes general computing resources and Lingjun resources.

Note

The public resources can provide up to two GPUs and eight vCPUs. To increase the resource quota, contact your account manager.

Resource Quota

This parameter is required only if you set the Resource Source parameter to Resource Quota. Select the resource quota that you prepared. For more information about how to prepare a resource quota, see Resource quota overview.

Priority

This parameter is available only if you set the Resource Source parameter to Resource Quota.

Specify the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority.

Framework

Specify the deep learning training framework and tool. The framework provides rich features and operations that you can use to build, train, and optimize deep learning models.

  • TensorFlow

  • PyTorch

  • ElasticBatch

  • XGBoost

  • OneFlow

  • MPIJob

Note

If you use Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, and MPIJob.

Job Resource

Configure the following nodes based on the framework you selected: worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes.

  • Use the public resources

    Configure the following parameters:

    • Nodes: the number of nodes on which the DLC training job runs.

    • Resource Type: Select an instance type. For information about the billing of resource specifications, see Billing of DLC.

  • Use general computing resources or Lingjun resources

    Configure the following parameters for the nodes: Nodes, vCPUs, GPUs, Memory (GB), and Shared Memory (GiB).

Maximum Duration

You can specify the maximum duration for which a job runs. The job is automatically stopped when the uptime of the job exceeds the maximum duration. Default value: 30. Unit: days.

Retention Period

The retention period of the job instance after the job is completed. After the retention period ends, the job is deleted.

Important

DLC jobs that are deleted cannot be restored. Exercise caution when you delete the jobs.

VPC Settings

This parameter is available only if you set the Resource Source parameter to Public Resources (Pay-as-you-go).

  • If you do not configure a VPC, Internet connection is used. Due to the limited bandwidth of the Internet, the job may be stuck or may not run as expected.

  • To ensure sufficient network bandwidth and stable performance, we recommend that you configure a VPC.

    Select a VPC, a vSwitch, and a security group in the current region. When the configuration takes effect, the cluster on which the job runs directly accesses the services in the VPC and performs access control based on the selected security group.

    Important
    • Before you run a DLC job, make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.

    • If you select a CPFS dataset, you must configure a VPC. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, exceptions may occur and the DLC training job may remain in the Dequeued state after you submit it.

Fault Tolerance and Diagnosis

In the Fault Tolerance and Diagnosis section, configure the key parameters described below.

Automatic Fault Tolerance

After you turn on Automatic Fault Tolerance, the system checks the job to identify errors in the job's algorithm and to improve GPU utilization. For more information, see AIMaster: elastic fault tolerance engine.

Sanity Check

After you turn on Sanity Check, the system detects the resources that are used to run the training jobs, isolates faulty nodes, and triggers the automated O&M processes in the background. This prevents job failure in the early stage of training and improves the training success rate. For more information, see Sanity Check.

Note

You can enable the sanity check feature only for training jobs that run on Lingjun resources.

Step 3: Submit the training job

Click Submit to submit the training job. You can go to the jobs list to view the status of the job. For more information about the status of the DLC job, see Appendix: DLC job status.

Submit a job by using SDK for Python or command lines

Use SDK for Python

Step 1: Install SDK for Python

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.0
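
To confirm that both packages are installed in the current Python environment, you can run a quick import check. This is only a convenience sketch; the module names match the packages installed above.

    # Verify that both SDK packages can be imported.
    import alibabacloud_aiworkspace20210204
    import alibabacloud_pai_dlc20201203

    print('Workspace SDK and DLC SDK imported successfully.')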

Step 2: Submit the job

  • If you want to submit a job that runs on pay-as-you-go resources, you can use public resources. Training jobs that run on public resources may encounter queuing delays. We recommend that you use public resources in time-insensitive scenarios that involve a small number of tasks.

  • If you want to submit a job that runs on subscription resources, you can use dedicated resources, such as general computing resources or Lingjun resources. You can use dedicated resources to ensure resource availability in high workload scenarios.

Use public resources to submit jobs

The following sample code provides an example on how to create and submit a DLC job:

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account has the required permissions on DLC. 
    region_id = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a Resource Access Management (RAM) user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked. This may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variable to perform identity verification. 
    cred = CredClient()

    # 1. create client;
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter. 
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the image list. 
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the dataset. 
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # Create a dataset if the specified dataset does not exist. 
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system. 
            # General-purpose NAS: 31a8e4****. 
            # Extreme NAS: The ID must start with extreme-. Example: extreme-0015****. 
            # CPFS: The ID must start with cpfs-. Example: cpfs-125487****. 
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the source code file list. 
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the DLC node specification list. 
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job. 
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
                "UseSpotInstance": False,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the DLC job list. 
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()
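
The CredClient() in the preceding script reads your AccessKey pair from environment variables, so export the credentials before you run it. The following pre-flight check is a sketch that assumes the commonly used ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET variable names; adjust it if you use a different credential mechanism.

    import os

    # Fail fast if the AccessKey environment variables are not set.
    for var in ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'):
        if not os.environ.get(var):
            raise RuntimeError(f'{var} is not set. Export it before running the script.')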

Use subscription resources to submit jobs

  1. Log on to the PAI console.

  2. On the Workspaces page, obtain your workspace ID.

  3. On the General Computing Resources page, obtain the resource quota ID of your dedicated resource group.

  4. The following sample code provides an example on how to create and submit a job. For information about the available public images, see the "Step 2: Prepare an image" section in the Before you begin topic.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API operations. 
    region = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. To prevent security risks, we recommend that you call API operations or perform routine O&M as a RAM user. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources under your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variable to perform identity verification. 
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Specify the resource configurations of the job. You can select a public image or specify an image address. For information about the available public images, see the reference documentation. 
    spec = JobSpec(
        type='Worker',
        image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Specify the execution information for the job. 
    req = CreateJobRequest(
            resource_id='<Your resource quota ID>',
            workspace_id='<Your workspace ID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job. 
    response = client.create_job(req)
    # Obtain the job ID. 
    job_id = response.body.job_id
    
    # Query the job status. 
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the commands that the job runs. 
    print('job command:', job.user_command)
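
    The preceding snippet queries the job status only once. To wait until the job reaches a terminal state, you can poll get_job in a loop, similar to the wait_for_job_to_terminate function in the public-resource example:

    import time

    # Poll the job until it reaches a terminal state (sketch based on the
    # wait_for_job_to_terminate function shown earlier in this topic).
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            break
        time.sleep(5)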

Use command lines

Step 1: Download the DLC client and perform user authentication

Download the DLC client for your operating system and verify your credentials. For more information, see Before you begin.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. On the Workspaces page, obtain your workspace ID.

  3. On the General Computing Resources page, obtain the resource quota ID.

  4. Create a parameter file named ./tfjob.params and copy the following content into the file. Change the parameter values based on your business requirements. For information about how to use command lines in the DLC client, see Supported commands.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Your resource quota ID> # If you use the public resource group, you can leave this parameter empty.  
    workspace_id=<Your workspace ID> 
  5. The following sample code provides an example on how to submit a DLC job to the specified workspace and resource group by using the parameter file.

    dlc submit tfjob --job_file ./tfjob.params
  6. The following command provides an example on how to query the DLC job that you created.

    dlc get job <jobID>

What to do next

After you submit the job, you can perform the following operations:

  • View the basic information, resource view, and operation logs of the job. For more information, see View training jobs.

  • Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.

  • View the training results on TensorBoard. For more information, see Use TensorBoard to view training results in DLC.

  • View the billing details when the job is completed. For more information, see Bill details.

  • Enable the log forwarding feature to forward logs of DLC jobs from the current workspace to a specific Logstore for custom analysis. For more information, see Subscribe to job logs.

Appendix: DLC job status

In most cases, a DLC job goes through the following states in sequence: Creating -> Queuing -> Dequeued -> Running -> Succeeded, Failed, or Stopped.

  • What do I do if the job status is Dequeued?

    • When the DLC job status becomes Dequeued, the system starts to schedule resources for the job. After approximately 5 minutes, the job status changes to Running.

    • If the job remains in the Dequeued state for a long time, a possible cause is that a CPFS dataset is configured for the distributed training job that you created, but the VPC is not configured. Create another distributed training job and configure a CPFS dataset and a VPC. The selected VPC must be the same as the VPC configured for the CPFS dataset.

  • What do I do if the job status is Failed?

    You can identify the cause of the job failure by moving the pointer over the icon next to the job status on the job details page, or by checking the instance operation logs.