Platform For AI:Create a training job

Last Updated:Jun 03, 2026

PAI-DLC creates single-node or distributed training jobs on Kubernetes, removing the need to provision instances or configure environments. It supports multiple deep learning frameworks and flexible resource configurations.

Quick start

For an MNIST-based walkthrough of single-GPU or multi-node multi-GPU distributed training, see the Distributed Training DLC Quick Start.

Console parameters

Basic information

Configure the Job Name and Tag.

Environment information

Parameter

Description

Image Configuration

In addition to selecting an Alibaba Cloud Image, you can use the following image types:

  • Custom Image: You can use a custom image added to PAI. The image must be stored in Container Registry (ACR) or in a publicly accessible repository. For more information, see Custom images.

    Note

    If you use a custom image with Lingjun AI computing resources, you must manually install RDMA to take full advantage of the high-performance RDMA network. For more information, see RDMA: Use a high-performance network for distributed training.

  • Image Address: Specify the URL of a custom or official image that is accessible over the internet.

    • For a private image URL, click Input Username and Password and enter the repository's username and password.

    • To accelerate image pulling, see Image acceleration.

Mount dataset

Datasets provide the data files required for model training. PAI supports two types of datasets:

  • Custom Dataset: You can create a custom dataset to store your training data. You can set the dataset as Read-only and select a dataset version from the version list.

  • Public Dataset: PAI provides public datasets. Only the read-only mount mode is supported.

Mount Path: The path in the DLC container where the dataset is mounted, for example, /mnt/data. Your code reads the dataset from this path. For mount configuration details, see Use cloud storage.

Important

If you configure a CPFS dataset, you must configure a VPC for DLC and ensure that the VPC is the same as the VPC of the CPFS file system. Otherwise, the submitted job may remain in the "Preparing" state for a long time.

Mount storage

You can also mount a data source path to read data or store intermediate files and results.

  • Supported data source types: Object Storage Service (OSS), General-purpose NAS, Extreme NAS, CPFS, and BMCPFS (available only for Lingjun AI computing resources).

  • Advanced Settings: You can use advanced settings to enable specific features for different data source types. Examples:

    • OSS: In the advanced settings, set {"mountType":"ossfs"} to mount OSS storage by using ossfs.

    • General-purpose NAS and CPFS: In the advanced settings, set the nconnect parameter to improve throughput when the DLC container accesses NAS. For more information, see How do I resolve poor performance when I access NAS from a Linux OS?. Example: {"nconnect":"<example_value>"}. Replace <example_value> with a positive integer.

For more information, see Use cloud storage.
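The advanced settings described above are small JSON objects. The following is a minimal sketch of how they could be generated and validated before being pasted into the console. The helper names are hypothetical, and the nconnect value 4 is only an illustration; the mountType and nconnect keys are the only parts taken from this topic.

```python
import json


def oss_advanced_settings():
    """Return the JSON string that selects the ossfs mount type for OSS."""
    return json.dumps({"mountType": "ossfs"}, separators=(",", ":"))


def nas_advanced_settings(nconnect):
    """Return the JSON string that sets nconnect for General-purpose NAS or CPFS.

    This topic requires a positive integer, written as a string value.
    """
    if isinstance(nconnect, bool) or not isinstance(nconnect, int) or nconnect <= 0:
        raise ValueError("nconnect must be a positive integer")
    return json.dumps({"nconnect": str(nconnect)}, separators=(",", ":"))


if __name__ == "__main__":
    print(oss_advanced_settings())
    print(nas_advanced_settings(4))  # 4 is an arbitrary example value
```

Generating the string with json.dumps avoids quoting mistakes that are easy to make when the JSON is typed by hand.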

Startup Command

Set the startup command for the job. Shell commands are supported. DLC automatically injects common environment variables for PyTorch and TensorFlow, such as MASTER_ADDR and WORLD_SIZE. You can access them by using the $variable_name format. The following examples show common startup commands:

  • Run Python: python -c "print('Hello World')"

  • PyTorch multi-node, multi-GPU distributed training:

    python -m torch.distributed.launch \
        --nproc_per_node=2 \
        --master_addr=${MASTER_ADDR} \
        --master_port=${MASTER_PORT} \
        --nnodes=${WORLD_SIZE} \
        --node_rank=${RANK} \
        train.py --epochs=100

  • Set a shell file path as the startup command: /ml/input/config/launch.sh
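The injected variables can also be read inside the training script itself. The following is a minimal sketch, assuming the variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK used in the startup command above are present in the container. The function name and the fallback values are assumptions added so that the script also runs outside DLC.

```python
import os


def read_dist_env(environ=None):
    """Collect the distributed-training variables from the environment.

    The defaults describe a single-process local run; they are illustrative
    fallbacks, not values that DLC guarantees.
    """
    env = os.environ if environ is None else environ
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "23456")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }


if __name__ == "__main__":
    cfg = read_dist_env()
    print("node {rank} of {world_size}, master at {master_addr}:{master_port}".format(**cfg))
```

Passing a plain dictionary as environ makes the function easy to exercise in a unit test without touching the real environment.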

Advanced configurations

Environment Variable

In addition to the automatically injected common environment variables for PyTorch and TensorFlow, you can provide custom environment variables in the Key:Value format. A maximum of 20 environment variables are supported.

Third-party Libraries

If the configured container image is missing third-party libraries, add them in the Third-party Libraries section. Two methods are supported:

  • Select from List: Enter the names of the third-party libraries in the text box.

  • Directory of requirements.txt: Add the third-party libraries to a requirements.txt file, upload the file to the DLC container via Code Builds, a dataset, or direct mount, and then specify the path of the file in the container.

Code Builds

Upload your training code to the DLC container. Two methods are supported:

  • Online configuration: If you have access to a Git repository, you can associate the repository by creating a code source. This allows DLC to obtain the job code.

  • Local Upload: Click the upload button to upload local code files. After the upload is complete, set the Mount path to a path within the container, for example, /mnt/data.

Resource information

Parameter

Description

Resource Type

The default value is General Computing. Lingjun Intelligence Resources are available in the following regions: China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), China (Hangzhou), China (Guangzhou), China (Hong Kong), Malaysia (Kuala Lumpur), Germany (Frankfurt), and Atlanta.

Source

  • Public Resources:

    • Billing method: pay-as-you-go.

    • Scenarios: Best for low-volume, non-time-sensitive jobs. Queuing delays may occur.

    • Limits: 2 GPUs and 8 CPU cores. Contact your business manager to increase the limit.

  • Resource Quota: Includes general-purpose computing resources or Lingjun AI computing resources.

    • Billing method: subscription.

    • Scenarios: Recommended for high-volume jobs that require reliable execution.

    • Specific parameters:

      • Resource Quota: You can set the number of resources, such as GPUs and CPUs. To create a resource quota, see Add a resource quota.

      • Priority: The execution priority for concurrent jobs. The value can be an integer from 1 to 9, where 1 is the lowest priority.

    • Pre-check: Verifies compatibility between resources and official images before a job starts, preventing failures from configuration errors.

  • Preemptible Resources:

    • Billing method: pay-as-you-go.

    • Scenarios: Reduces costs with discounted resources.

    • Limits: Availability is not guaranteed. Resources may not be immediately available and may be reclaimed. For more information, see Use preemptible jobs.

Framework

Supported deep learning training frameworks and tools: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, Ray, Custom, DataJuicer, and MPI.

Note

When you select Resource Quota and use Lingjun AI computing resources, you can submit only TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray jobs.

Job Resource

Based on the selected Framework, you can configure resources for Worker, PS, Chief, Evaluator, and GraphLearn node types. If you select the Ray framework, you can click Add Role to customize Worker roles and run jobs on heterogeneous resources.

  • Use public resources: You can configure the following parameters:

    • Number of Nodes: The number of nodes for the DLC job.

    • Resource Type: Select a resource specification. The console displays the corresponding price. For billing details, see DLC billing.

  • Use resource quota: You can configure the number of nodes, CPU (cores), GPU (cards), Memory (GiB), and Shared Memory (GiB) for each node type. You can also configure the following parameters:

    • Node-Specific Scheduling: You can run the job on specified compute nodes.

    • Idle Resources: Allows jobs to run on idle resources from other quotas to improve utilization. If the original quota needs those resources back, the job is terminated and the resources are automatically returned. For more information, see Use idle resources.

    • CPU Affinity: Binds processes in a container or pod to specific CPU cores, reducing cache misses and context switches. Suitable for performance-sensitive and real-time workloads.

  • Use preemptible resources: In addition to the number of nodes and the resource specification, you can configure the Bid Price parameter, which sets the maximum price for requesting preemptible resources. Click the icon to select a bidding method:

    • By Discount: The maximum bid is based on the market price of the resource specification, with discrete discount options from 10% to 90% off. This indicates the upper limit for bidding. You can request a preemptible resource if your maximum bid is at or above the market price and stock is sufficient.

    • By Price: The maximum bid is within the market price range.

Advanced configurations

Maximum Duration

Maximum run duration for a job. Jobs exceeding this duration are stopped. Default: 30 days.

Retention Period

Retention period for completed jobs. Retained jobs continue to occupy resources and are deleted after expiry.

Important

Deleted DLC jobs cannot be recovered. Proceed with caution.

Start Developer Machine

If the resource source is a resource quota, you can start a Developer Machine (DSW) for online debugging. On the job overview page, go to the instance list and click Developer Machine (DSW) in the Actions column.

Advanced Framework Configuration

For a supported parameter list and descriptions, see Advanced parameter list.

  • The parameters ReleaseResourcePolicy, EnableNvidiaIBGDA, EnableNvidiaGDRCopy, EnablePaiNUMACoreBinding, and EnableResourcePreCheck are supported by all frameworks.

  • If the Framework is PyTorch, the following parameters are available: createSvcForAllWorkers, customPortList, and customPortNumPerWorker.

    Important

    Lingjun AI computing resources do not provide custom port capabilities. Therefore, you cannot configure the customPortNumPerWorker parameter when you submit a DLC job that uses Lingjun AI computing resources.

  • If the Framework is Ray, the following parameters are available: RayRuntimeEnv, RayRedisAddress, RayRedisUsername, RayRedisPassword, RaySubmitterBackoffLimit, and RayObjectStoreMemoryBytes. Note: The environment variable and third-party library configurations are overridden by the RayRuntimeEnv configuration.

The following configuration formats are supported:

  • Plaintext: Enter a comma-separated list of strings, with each string in the key=value format. The key is a supported advanced parameter, and the value is the value of the parameter.

  • JSON
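The two formats carry the same information. The following is a minimal sketch of converting the plaintext form into a JSON object. The function name is hypothetical, and keeping every value as a string is an assumption based on the string values used in the JSON samples in this topic.

```python
import json


def plaintext_to_json(plaintext):
    """Convert 'k1=v1,k2=v2' into a JSON object string with string values."""
    config = {}
    for item in plaintext.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError("expected key=value, got {!r}".format(item))
        config[key.strip()] = value.strip()
    return json.dumps(config)


if __name__ == "__main__":
    print(plaintext_to_json("createSvcForAllWorkers=true,customPortNumPerWorker=100"))
```

Splitting on commas is safe for customPortList values because that parameter separates ports with semicolons.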

Typical configuration scenarios:

  • Scenario 1: PyTorch advanced configuration

    Use advanced configuration parameters to enable network communication between Workers. For example, open extra ports to launch frameworks such as Ray in the DLC container and coordinate with PyTorch for advanced distributed training. Sample configuration:

    createSvcForAllWorkers=true,customPortNumPerWorker=100

    Then, in the Startup Command, you can use the $JOB_NAME and $CUSTOM_PORTS environment variables to get the domain name and available port numbers to launch and connect to frameworks such as Ray.

  • Scenario 2: Manually configure RayRuntimeEnv for the Ray framework (including dependent libraries and environment variables)

    Sample configuration:

    {"RayRuntimeEnv": "{pip: requirements.txt, env_vars: {key: value}}"}

  • Scenario 3: Custom resource release rule

    Currently, only the pod-exit release policy is supported, which automatically releases resources when your pod exits. Sample configuration:

    {
      "ReleaseResourcePolicy": "pod-exit"
    }
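For Scenario 1, the JOB_NAME and CUSTOM_PORTS variables can be turned into a connection target inside the container. The following is a minimal sketch, assuming CUSTOM_PORTS is a semicolon-separated list of port numbers and that the master's domain name follows the dlcxxxxx-master-0 pattern given in the advanced parameter list. The function names are hypothetical.

```python
import os


def parse_custom_ports(value):
    """Split the semicolon-separated CUSTOM_PORTS value into integers."""
    return [int(part) for part in value.split(";") if part.strip()]


def master_host(job_name):
    """Build the master's domain name from the job name (<job name>-master-0)."""
    return "{}-master-0".format(job_name)


if __name__ == "__main__":
    # "dlcxxxxx" is a placeholder job name used only when JOB_NAME is unset.
    job_name = os.environ.get("JOB_NAME", "dlcxxxxx")
    ports = parse_custom_ports(os.environ.get("CUSTOM_PORTS", ""))
    if ports:
        print("connect to {}:{}".format(master_host(job_name), ports[0]))
    else:
        print("CUSTOM_PORTS is empty; configure customPortNumPerWorker first")
```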

VPC configuration

  • Without a VPC, the job uses a Public Gateway with limited bandwidth, which may slow down the job or cause it to fail.

  • Configure a VPC with a vSwitch and security group to improve bandwidth, stability, and security. The job cluster can directly access services within the VPC.

    Important

    • If you use a VPC, ensure that the job's resource group instances and dataset storage (OSS) are in a VPC in the same region, and that the VPC is connected to the network of the code repository.

    • If you use a CPFS dataset, you must configure a VPC and ensure that the selected VPC is the same as the VPC of the CPFS file system. Otherwise, the submitted DLC training job may remain in the "Preparing" state for a long time.

    • You must configure a VPC when you submit a DLC job that uses preemptible Lingjun AI computing resources.

    You can also configure an Internet Access Gateway by using one of the following methods:

    • Public Gateway: Has limited bandwidth that may be insufficient during high-concurrency access or large file downloads.

    • Private Gateway: To overcome Public Gateway bandwidth limits, create an Internet NAT Gateway in the DLC VPC, bind an EIP, and configure SNAT entries. For more information, see Improve public network access speed by using a private gateway.

Fault tolerance and diagnosis

Parameter

Description

Automatic Fault Tolerance

Enable Automatic Fault Tolerance and configure the required parameters to detect and mitigate algorithm-level errors, which improves GPU utilization. For more information, see AIMaster: An elastic automatic fault tolerance engine.

Note

When you enable automatic fault tolerance, an AIMaster instance starts and runs with the job instance. This consumes a specific amount of computing resources. The AIMaster instance uses the following resources:

  • Resource quota: 1 CPU core and 1 GiB of memory.

  • Public resources: Uses the ecs.c6.large specification.

Sanity Check

Enable Sanity Check to run a comprehensive check on training resources, isolate faulty nodes, and trigger automated O&M processes in the backend. This reduces failures early in training and improves the job success rate. For more information, see SanityCheck: Compute resource health check.

Note

The health check feature is supported only for PyTorch training jobs that are submitted using a Lingjun AI computing resource quota and have a GPU count greater than 0.

Roles and permissions

The following table describes the instance RAM role options. For more information, see Configure a DLC RAM role.

Instance RAM role

Description

Default Role of PAI

The PAI default role grants the following permissions via STS temporary credentials:

  • When accessing a MaxCompute table, you have the same permissions as the DLC instance owner.

  • When accessing OSS, you can only access the default OSS bucket that is configured for the current workspace.

Custom Role

Select or enter a custom RAM role. The instance assumes this role's permissions when accessing cloud services through STS temporary credentials.

Does Not Associate Role

No RAM role is associated with the DLC job. This is the default option.

Appendix

Create a job via SDK or CLI

Python SDK

Step 1: Install the Credentials tool

Install the Credentials tool for SDK authentication. Requirements:

  • Python 3.7 or later.

  • Alibaba Cloud SDK 2.0 series.

pip install alibabacloud_credentials

Step 2: Obtain an AccessKey

This example uses an AccessKey pair. Store AccessKey values as environment variables to prevent security risks. The environment variable for the AccessKey ID is ALIBABA_CLOUD_ACCESS_KEY_ID, and the environment variable for the AccessKey secret is ALIBABA_CLOUD_ACCESS_KEY_SECRET.
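Before running the SDK samples, it can help to confirm that both variables are set. The following is a minimal sketch of such a check. The function name is hypothetical; the two variable names are the ones given above.

```python
import os

REQUIRED_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Set these environment variables first: " + ", ".join(missing))
    print("AccessKey environment variables are set.")
```

The check only reports variable names and never prints the AccessKey values.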

Step 3: Install the Python SDKs

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1

  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.17

Step 4: Submit the job

Public resources

The following sample code creates and submits a job.

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)

def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset.
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id

def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset.
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id

def wait_for_job_to_terminate(client, job_id):
    '''Poll the job every 5 seconds until it reaches a terminal status.
    '''
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)

def main():

    # Make sure that your Alibaba Cloud account is authorized to use DLC and has sufficient permissions.
    region_id = 'cn-hangzhou'
    # An AccessKey pair provides full API access. For security purposes, we recommend that you use a RAM user for API access and daily O&M.
    # Do not hard-code your AccessKey ID and AccessKey secret in your code. This may lead to AccessKey leakage and compromise the security of all resources in your account.
    # This example shows how to use the Credentials SDK to read the AccessKey from environment variables for authentication.
    cred = CredClient()

    # Create the workspace client and the DLC client.
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the list of workspaces. You can also specify your workspace name in the workspace_name parameter.
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the list of images.
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    if len(images.body.images) == 0:
        raise RuntimeError('found no images')

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the datasets.
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # If the dataset does not exist, create one.
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system.
            # General-purpose NAS: 31a8e4****.
            # Extreme NAS: Must start with extreme-, for example, extreme-0015****.
            # CPFS: Must start with cpfs-, for example, cpfs-125487****.
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the list of code sources.
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the list of DLC node specifications.
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job.
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the list of DLC jobs.
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)

if __name__ == '__main__':
    main()
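The wait_for_job_to_terminate function in the sample above polls indefinitely. The following is a minimal sketch of a variant that gives up after a deadline. It is not part of the SDK: the function name, the default timeout, and the injectable sleep and clock parameters are assumptions for illustration, while the terminal statuses are the ones used in the sample.

```python
import time

TERMINAL_STATUSES = ("Succeeded", "Failed", "Stopped")


def wait_with_timeout(get_status, timeout_s=3600.0, interval_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or time runs out.

    get_status is any zero-argument callable that returns the current status
    string. TimeoutError is raised if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                "job still {!r} after {} seconds".format(status, timeout_s))
        sleep(interval_s)
```

With the clients from the sample, the status getter could be written as lambda: dlc_client.get_job(job_id, GetJobRequest()).body.status.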

Subscription resource quota

  1. Log on to the PAI console.

  2. To view your workspace ID: In the left-side navigation pane, click Workspaces. Find the target workspace, click the ⓘ icon next to its name, then view and copy the Workspace ID from the information card that appears.

  3. To view the ID of your resource quota for the dedicated resource group: In the left-side navigation pane, choose AI Computing Resources > Resource Quotas. Click the General-purpose Computing Resources tab and obtain the Quota ID from the Name/ID column in the resource quota list.

  4. Use the following code to create and submit a job. For a list of available public images, see Step 2: Prepare an image.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API.
    region = 'cn-hangzhou'
    # An AccessKey pair provides full API access. For security purposes, we recommend that you use a RAM user for API access and daily O&M.
    # Do not hard-code your AccessKey ID and AccessKey secret in your code. This may lead to AccessKey leakage and compromise the security of all resources in your account.
    # This example shows how to use the Credentials SDK to read the AccessKey from environment variables for authentication.
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Declare the resource configuration for the job. For image selection, you can refer to the public image list in the documentation or provide your own image URL.
    spec = JobSpec(
        type='Worker',
        image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Declare the job's execution details.
    req = CreateJobRequest(
            resource_id='<Replace with the ID of your resource quota>',
            workspace_id='<Replace with your WorkspaceID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job.
    response = client.create_job(req)
    # Get the job ID.
    job_id = response.body.job_id
    
    # Query the job status.
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the command executed by the job.
    print('user command:', job.user_command)

Spot instances

  • SpotDiscountLimit (spot discount)

    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou. 
    cred = CredClient()
    workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs. 
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-spot-job',
        'JobType': 'PyTorchJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.g7.xlarge',
                "SpotSpec": {
                    "SpotStrategy": "SpotWithPriceLimit",
                    "SpotDiscountLimit": 0.4,
                }
            },
        ],
        'UserVpc': {
            "VpcId": "vpc-0jlq8l7qech3m2ta2****",
            "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
            "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
        },
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    
  • SpotPriceLimit (spot price)

    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'
    cred = CredClient()
    workspace_id = '12****'
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-spot-job',
        'JobType': 'PyTorchJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.g7.xlarge',
                "SpotSpec": {
                    "SpotStrategy": "SpotWithPriceLimit",
                    "SpotPriceLimit": 0.011,
                }
            },
        ],
        'UserVpc': {
            "VpcId": "vpc-0jlq8l7qech3m2ta2****",
            "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
            "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
        },
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    

The following table describes the key parameters.

Parameter

Description

SpotStrategy

The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit.

SpotDiscountLimit

The spot discount bidding type.

Note

  • You cannot specify the SpotDiscountLimit and SpotPriceLimit parameters at the same time.

  • The SpotDiscountLimit parameter is valid only for Lingjun resources.

SpotPriceLimit

The spot price bidding type.

UserVpc

This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides.
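Because SpotDiscountLimit and SpotPriceLimit cannot be specified together, the SpotSpec section of the request can be built with a small guard. The following is a minimal sketch. The function name is hypothetical, and rejecting a call that sets neither limit is an assumption rather than a documented rule.

```python
def build_spot_spec(discount_limit=None, price_limit=None):
    """Return a SpotSpec dict with exactly one bidding limit set."""
    if (discount_limit is None) == (price_limit is None):
        raise ValueError("set exactly one of discount_limit or price_limit")
    spec = {"SpotStrategy": "SpotWithPriceLimit"}
    if discount_limit is not None:
        spec["SpotDiscountLimit"] = discount_limit
    else:
        spec["SpotPriceLimit"] = price_limit
    return spec


if __name__ == "__main__":
    print(build_spot_spec(discount_limit=0.4))
    print(build_spot_spec(price_limit=0.011))
```

The returned dictionary can be used as the "SpotSpec" value in the JobSpecs entries shown in the samples above.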

CLI

Step 1: Download the client and authenticate

Download the client tool for Linux (64-bit) or macOS and complete authentication. For more information, see Preparations.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. To view your workspace ID:

    In the left-side navigation pane, click Workspaces. Find the target workspace, click the ⓘ icon next to its name and view the Workspace ID in the information card that appears.

  3. To view your resource quota ID:

    In the left-side navigation pane, choose AI Computing Resources > Resource Quotas. Select the tab of the target resource type, such as General-purpose Computing Resources, and obtain the resource quota ID from the Name/ID column.

  4. Create a parameter file named tfjob.params with the following content. For parameter file details, see Submit command.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Replace with your resource quota ID> 
    workspace_id=<Replace with your WorkspaceID>

  5. Run the following command to submit the DLC job to the specified workspace and resource quota. The --job_file parameter specifies the path to your parameter file.

    ./dlc submit tfjob --job_file ./tfjob.params

  6. Run the following command to view the DLC job that you submitted.

    ./dlc get job <jobID>
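When many jobs are submitted with similar settings, the parameter file from step 4 can be generated instead of edited by hand. The following is a minimal sketch that renders a dictionary as key=value lines. The function name is hypothetical, the keys are the ones shown in tfjob.params, and the placeholder values must still be replaced with your own IDs.

```python
def render_params(params):
    """Render a dict as key=value lines, one per line, in insertion order."""
    lines = []
    for key, value in params.items():
        text = str(value)
        if "\n" in text:
            raise ValueError("value for {!r} must be a single line".format(key))
        lines.append("{}={}".format(key, text))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    content = render_params({
        "name": "test_cli_tfjob_001",
        "workers": 1,
        "worker_cpu": 4,
        "worker_gpu": 0,
        "worker_memory": "4Gi",
        "worker_shared_memory": "4Gi",
        "command": "echo good && sleep 120",
        "resource_id": "<Replace with your resource quota ID>",
        "workspace_id": "<Replace with your WorkspaceID>",
    })
    with open("tfjob.params", "w") as f:
        f.write(content)
```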

Advanced parameters

Parameter

Supported frameworks

Description

Value

ReleaseResourcePolicy

ALL

By default, all pod resources are released after the job completes. The only other supported value is 'pod-exit', which releases a pod's resources as soon as the pod exits.

pod-exit

EnableNvidiaIBGDA

ALL

Specifies whether to enable the IBGDA feature when the GPU driver is loaded.

true or false

EnableNvidiaGDRCopy

ALL

Specifies whether to install the GDRCopy kernel module. (Version: 2.4.4)

true or false

EnablePaiNUMACoreBinding

ALL

Specifies whether to enable NUMA core binding.

true or false

EnableResourcePreCheck

ALL

Specifies whether to check if the total resources (node specifications) in the quota can meet the specifications of all roles in the job upon submission.

true or false

createSvcForAllWorkers

PyTorch

Specifies whether to allow network communication between workers.

  • If set to true, network communication is allowed between all PyTorch workers.

  • If the value is false or is not configured, only the master can be accessed by default.

After this feature is enabled, the domain name of each worker is the same as its worker name, such as dlcxxxxx-master-0. The job name, such as dlcxxxxx, is passed into the worker through the JOB_NAME environment variable. You can then determine the domain name of the specific worker that you want to access.

true or false

customPortList

PyTorch

Allows you to specify the network ports to open on each worker, which can be used with createSvcForAllWorkers to enable network communication between workers.

If this parameter is not configured, only port 23456 on the master worker is opened by default. Therefore, make sure that port 23456 is not included in this custom port list.

Important

This parameter and customPortNumPerWorker are mutually exclusive and must not be set at the same time.

A set of strings separated by semicolons, where each string is a port number or a port range connected by a hyphen, such as 10000;10001-10010 (which represents the 11 consecutive port numbers from 10000 to 10010).

customPortNumPerWorker

PyTorch

Allows you to request a number of network ports for each worker. It can be used with createSvcForAllWorkers to enable network communication between workers.

If this setting is not configured, only port 23456 is opened on the master node by default. DLC randomly assigns ports to worker nodes based on the number of ports that you specify. The assigned port numbers are passed to the worker nodes through the CUSTOM_PORTS environment variable, which you can query. The value of this variable is a semicolon-separated list of port numbers.

Important

  • This parameter and customPortList are mutually exclusive. Do not set them at the same time.

  • Lingjun AI computing resources do not provide the custom port feature. Therefore, the customPortNumPerWorker parameter is not supported when you submit a DLC job that uses Lingjun AI computing resources.

An integer up to 65536.

RayRuntimeEnv

Ray

When the framework is Ray, you can manually configure RayRuntimeEnv to define the runtime environment.

Important

This configuration overrides other environment variable and third-party library settings.

Configure environment variables and third-party libraries ({pip: requirements.txt, env_vars: {key: value}})

RayRedisAddress

Ray

The address of the external GCS Redis server.

String

RayRedisUsername

Ray

The username for the external GCS Redis server.

String

RayRedisPassword

Ray

The password for the external GCS Redis server.

String

RaySubmitterBackoffLimit

Ray

The number of submitter retries.

Positive integer (int)

RayObjectStoreMemoryBytes

Ray

Configures shared memory for a node. For example, to configure 1 GiB of shared memory for each node, use the following configuration:

{
  "RayObjectStoreMemoryBytes": "1073741824"
}

Positive integer (int)
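Two of the value formats in the table above lend themselves to small helpers. The following is a minimal sketch, with hypothetical function names, that expands a customPortList string into individual ports and converts a GiB amount into the byte count expected by RayObjectStoreMemoryBytes. Rejecting port 23456 follows the requirement above that this port must not appear in the custom list.

```python
def expand_port_list(value):
    """Expand a customPortList string such as '10000;10001-10010' into ports."""
    ports = []
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError("invalid range: {}".format(part))
            ports.extend(range(start, end + 1))
        else:
            ports.append(int(part))
    if 23456 in ports:
        raise ValueError("port 23456 must not be in the custom port list")
    return ports


def gib_to_bytes(gib):
    """Convert GiB to bytes (1 GiB = 1024 ** 3 bytes)."""
    return gib * 1024 ** 3


if __name__ == "__main__":
    print(len(expand_port_list("10000;10001-10010")))  # 11 ports
    print(gib_to_bytes(1))                             # 1073741824
```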