Platform for AI: Create a training job

Last Updated:Apr 07, 2026

Create single-node or distributed training jobs in DLC by using the console, Python SDK, or CLI.

Quick start

For an MNIST-based walkthrough of single-GPU or multi-node multi-GPU training, see Distributed training DLC quick start.

Console parameters

Basic information

Configure the Job Name and Tag.

Environment information

Parameter

Description

Image Configuration

In addition to Alibaba Cloud Image, the following image types are available:

  • Custom Image: Use a custom image added to PAI. The image must be stored in Container Registry (ACR) or in a repository that allows public pulls. For more information, see Custom images.

    Note

    If you use a custom image with Lingjun AI computing resources, install RDMA manually to use the high-performance RDMA network. For more information, see RDMA: Use a high-performance network for distributed training.

  • Image Address: Specify the URL of a custom image or an Alibaba Cloud image.

    • For a private image address, click Input username and password to provide the credentials for the image repository.

    • To accelerate image pulling, see Accelerate image pulling.

Mount dataset

Mount data files for model training. Supported dataset types:

  • Custom Dataset: Create a custom dataset to store training data files. Set the dataset as read-only and select a dataset version from the version list.

  • Public Dataset: PAI provides public datasets. Public datasets are read-only.

Mount Path: The mount path in the DLC container, such as /mnt/data. Access the dataset from this path in your code. For more information, see Use cloud storage in DLC training jobs.

Important

If you use a CPFS dataset, configure a VPC for your DLC job. The VPC must match the VPC of the CPFS file system. Otherwise, the job may remain in the Preparing state.

Mount storage

Mount a data source path to read data or store output files.

  • Supported data source types: Object Storage Service (OSS), General-purpose NAS, Extreme NAS, and BMCPFS, which is available only for Lingjun AI computing resources.

  • Advanced Settings: Use advanced settings to enable specific features for different data source types. Examples:

    • OSS: In the advanced settings, specify {"mountType":"ossfs"} to mount OSS storage by using ossfs.

    • General-purpose NAS and CPFS: In the advanced settings, set the nconnect parameter to improve the throughput when the DLC container accesses NAS. For more information, see How do I resolve poor performance when I access NAS from a Linux OS?. Example: {"nconnect":"<example_value>"}. Replace <example_value> with a positive integer.

For more information, see Use cloud storage in DLC training jobs.
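
The advanced settings value is a JSON object. As a minimal sketch, the two example values above can be generated with the standard json module. The helper names are hypothetical and are not part of any PAI SDK, and the nconnect value 4 is only an illustration.

```python
import json


def oss_advanced_settings() -> str:
    """Return the advanced-settings JSON that mounts OSS by using ossfs."""
    return json.dumps({"mountType": "ossfs"}, separators=(",", ":"))


def nas_advanced_settings(nconnect: int) -> str:
    """Return the advanced-settings JSON that sets nconnect for NAS or CPFS."""
    if isinstance(nconnect, bool) or not isinstance(nconnect, int) or nconnect <= 0:
        raise ValueError("nconnect must be a positive integer")
    # The documented example quotes the value, so it is emitted as a string.
    return json.dumps({"nconnect": str(nconnect)}, separators=(",", ":"))


print(oss_advanced_settings())
print(nas_advanced_settings(4))
```

Paste the printed string into the Advanced Settings field of the corresponding data source.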

Startup Command

Startup command for the job (shell commands supported). DLC automatically injects common environment variables for PyTorch and TensorFlow, such as MASTER_ADDR and WORLD_SIZE. Access them with the $variable_name format. Common startup commands:

  • Run Python: python -c "print('Hello World')"

  • PyTorch multi-node, multi-GPU distributed training: python -m torch.distributed.launch --nproc_per_node=2 --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} --nnodes=${WORLD_SIZE} --node_rank=${RANK} train.py --epochs=100

  • Set a shell file path as the startup command: /ml/input/config/launch.sh
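
The injected variables can also be read inside the training script. The following sketch shows one way to do this with os.environ. The function name read_dist_env is hypothetical, and the fallback values are assumptions that only make the script runnable outside DLC. They are not values that DLC guarantees.

```python
import os


def read_dist_env(env=None):
    """Collect the distributed-training variables used in the startup command."""
    env = os.environ if env is None else env
    return {
        # Fallbacks are assumptions for local, single-process runs.
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "23456")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }


if __name__ == "__main__":
    settings = read_dist_env()
    print("rank {rank} of {world_size}, master at {master_addr}:{master_port}".format(**settings))
```

In the launch command above, WORLD_SIZE is passed as the node count and RANK as the node rank, so the script should treat them as per-node values.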

Environment, libraries, and code

Environment Variable

In addition to the automatically injected common environment variables for PyTorch and TensorFlow, define custom environment variables in the Key:Value format. Maximum: 20 environment variables.

Third-party Libraries

If the container image is missing libraries, add them with Third-party Libraries. Two methods:

  • Select from List: Enter the names of the third-party libraries in the text box.

  • Directory of requirements.txt: Add the third-party libraries to a requirements.txt file, upload the file to the DLC container by using code configuration, dataset, or direct mount, and then specify the path of the file in the container.

Code Builds

Upload training code files to the DLC container. Two methods:

  • Online configuration: Create a code source to associate a Git repository. DLC pulls the code for the job.

  • Local Upload: Click the upload button to upload local code files. After the upload completes, set the Mount path to a path within the container, such as /mnt/data.

Resource information

Parameter

Description

Resource Type

The default value is General Computing. You can select Lingjun Intelligence Resources only in the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions.

Source

  • Public Resources:

    • Billing method: pay-as-you-go.

    • Use cases: Public resources may be subject to queuing delays. Suitable for low-volume, non-time-sensitive jobs.

    • Limits: Default resource limit is 2 GPUs and 8 vCPUs per job. To increase this limit, contact your business manager.

  • Resource Quota: Includes general-purpose computing resources or Lingjun AI computing resources.

    • Billing method: subscription.

    • Use cases: Recommended for high-volume jobs that require reliable execution.

    • Specific parameters:

      • Resource Quota: You can set the number of resources, such as GPUs and CPUs. To prepare a resource quota, see Add a resource quota.

      • Priority: The execution priority for concurrent jobs. The value can be an integer from 1 to 9. A smaller value indicates a lower priority.

  • Preemptible Resources:

    • Billing method: pay-as-you-go.

    • Use cases: Use preemptible resources to reduce costs.

    • Limits: Availability is not guaranteed. Resources may not be immediately available or may be reclaimed. For more information, see Use preemptible jobs.

Framework

Supported deep learning training frameworks: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, and Ray.

Note

If you select Resource Quota and use Lingjun AI computing resources, you can submit only TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray jobs.

Job Resource

Based on the selected Framework, configure resources for Worker, PS, Chief, Evaluator, and GraphLearn nodes. For the Ray framework, click Add Role to customize Worker roles and run jobs on heterogeneous resources.

  • Use public resources: Configure the following parameters:

    • Number of Nodes: The number of nodes for the DLC job.

    • Resource Type: Select a resource specification. The console displays the corresponding price. For billing details, see DLC billing.

  • Use resource quota: Configure the number of nodes, CPU (cores), GPU (cards), Memory (GiB), and Shared Memory (GiB) for each node type. Additional parameters:

    • Node-Specific Scheduling: Run the job on specified compute nodes.

    • Idle Resources: Runs jobs on idle resources from other quotas to improve utilization. If the original quota reclaims resources, the job is preempted and terminated. For more information, see Use idle resources.

    • CPU Affinity: Binds processes in a container or pod to specific CPU cores to reduce cache misses and context switches. Suitable for performance-sensitive and real-time scenarios.

  • Use preemptible resources: In addition to the number of nodes and resource specifications, configure the Bid Price parameter. This sets the maximum price for requesting preemptible resources. Select one of the following bidding methods:

    • By discount: The maximum bid is based on the market price, with options from 10% to 90% off. Resources are allocated when your bid meets or exceeds the current spot price, subject to availability.

    • By price: Set a specific maximum price for the resource.
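
As a worked example of the By discount method, the sketch below derives a maximum bid from a market price and checks it against a spot price. It assumes that "N% off" means the bid equals the market price multiplied by (100 - N)/100. The function names and all prices are hypothetical, and actual allocation also depends on availability.

```python
def max_bid_from_discount(market_price: float, percent_off: int) -> float:
    """Return the maximum bid for a discount between 10% and 90% off."""
    if not 10 <= percent_off <= 90:
        raise ValueError("percent_off must be between 10 and 90")
    return market_price * (100 - percent_off) / 100


def bid_is_sufficient(max_bid: float, spot_price: float) -> bool:
    """A bid qualifies when it meets or exceeds the current spot price."""
    return max_bid >= spot_price


# Hypothetical prices, for illustration only.
bid = max_bid_from_discount(market_price=2.5, percent_off=40)
print(bid, bid_is_sufficient(bid, spot_price=1.2))
```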

Advanced job settings

Maximum Duration

Set the maximum run duration for a job. The system stops jobs that exceed this duration. Default: 30 days.

Retention Period

Retention period for completed (succeeded or failed) jobs. Retained jobs consume storage for metadata and logs. Jobs are deleted automatically after this period.

Important

A deleted DLC job cannot be recovered. Proceed with caution.

Advanced Framework Configuration

Supported parameters and descriptions: Advanced parameter list.

  • The parameters ReleaseResourcePolicy, EnableNvidiaIBGDA, EnableNvidiaGDRCopy, EnablePaiNUMACoreBinding, and EnableResourcePreCheck are supported by all frameworks.

  • If the Framework is PyTorch, the following additional parameters are available: createSvcForAllWorkers, customPortList, and customPortNumPerWorker.

    Important

    Lingjun AI computing resources do not support custom ports, so the customPortNumPerWorker parameter is unavailable for jobs that use these resources.

  • If the Framework is Ray, the following additional parameters are available: RayRuntimeEnv, RayRedisAddress, RayRedisUsername, RayRedisPassword, RaySubmitterBackoffLimit, and RayObjectStoreMemoryBytes. Note: The environment variable and third-party library configurations are overridden by the RayRuntimeEnv configuration.

Supported configuration formats:

  • Plaintext: Enter a comma-separated list of strings, with each string in the key=value format.

  • JSON
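
The two formats carry the same key-value pairs. The following sketch converts the plaintext form into a JSON object. The function name parse_plaintext_config is hypothetical, and the sketch keeps every value as a string because the source does not state whether the JSON form expects typed values.

```python
import json


def parse_plaintext_config(text: str) -> dict:
    """Parse 'key=value,key=value' into a dict of string values."""
    config = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError("expected key=value, got {!r}".format(item))
        config[key.strip()] = value.strip()
    return config


sample = "createSvcForAllWorkers=true,customPortNumPerWorker=100"
print(json.dumps(parse_plaintext_config(sample), indent=2))
```

Splitting on commas works here because customPortList values use semicolons as separators.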

Typical configuration scenarios:

  • Scenario 1: PyTorch advanced configuration

    Use advanced configuration parameters to enable network communication between Workers for flexible training methods. For example, use extra open ports to launch frameworks such as Ray in the DLC container and coordinate with PyTorch for advanced distributed training. Sample configuration:

    createSvcForAllWorkers=true,customPortNumPerWorker=100

    Then, in the Startup Command, you can use the $JOB_NAME and $CUSTOM_PORTS environment variables to get the domain name and available port numbers to launch and connect to frameworks such as Ray.

  • Scenario 2: Manually configure RayRuntimeEnv for the Ray framework (including dependent libraries and environment variables)

    Sample configuration:

    {"RayRuntimeEnv": "{pip: requirements.txt, env_vars: {key: value}}"}

  • Scenario 3: Custom resource release rule

    Currently, only the pod-exit release policy is supported, which automatically releases resources when the job's pod exits. Sample configuration:

    {
      "ReleaseResourcePolicy": "pod-exit"
    }
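
For Scenario 1, the JOB_NAME and CUSTOM_PORTS variables can be consumed from Python as sketched below. The sketch relies on two details from the advanced parameter list: the master's domain name has the form <job name>-master-0, and CUSTOM_PORTS is a semicolon-separated list. The function names are hypothetical, and the naming of non-master workers is not covered because the source gives only the master example.

```python
import os


def master_domain(job_name: str) -> str:
    """Build the master's domain name from the job name."""
    return "{}-master-0".format(job_name)


def parse_custom_ports(raw: str) -> list:
    """Split the semicolon-separated CUSTOM_PORTS value into integers."""
    return [int(part) for part in raw.split(";") if part.strip()]


if __name__ == "__main__":
    # Both variables are expected to be set inside the DLC container.
    job_name = os.environ.get("JOB_NAME", "")
    ports = parse_custom_ports(os.environ.get("CUSTOM_PORTS", ""))
    if job_name and ports:
        print("connect to {}:{}".format(master_domain(job_name), ports[0]))
```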

VPC configuration

  • Without a VPC, the job uses a public gateway with limited bandwidth, which may slow down or fail job execution.

  • Configure a VPC with the corresponding vSwitch and security group to improve network bandwidth, stability, and security. The job cluster can directly access services within the VPC.

    Important
    • When using a VPC, the resource group instances, dataset storage (OSS), and code repository must all be in the same VPC.

    • For CPFS datasets, configure the job to use the same VPC as the CPFS file system. Otherwise, the DLC training job may remain in the Preparing state.

    • DLC jobs using preemptible Lingjun AI computing resources require a VPC.

    Configure the Internet Access Gateway. Two methods are supported:

    • Public Gateway: Limited bandwidth that may be insufficient during high-concurrency access or large file downloads.

    • Private Gateway: To overcome the bandwidth limitations of the public gateway, you can create an Internet NAT Gateway in the VPC of the DLC job, bind an Elastic IP Address (EIP), and configure SNAT entries. For more information, see Improve public network access speed by using a private gateway.

Fault tolerance and diagnosis

Parameter

Description

Automatic Fault Tolerance

Enable Automatic Fault Tolerance to detect and handle algorithm-level errors and improve GPU utilization. For more information, see AIMaster: An elastic automatic fault tolerance engine.

Note

Enabling automatic fault tolerance starts an AIMaster instance that runs alongside the job and consumes the following resources:

  • Resource quota: 1 CPU core and 1 GiB of memory.

  • Public resource: Uses the ecs.c6.large specification.

Sanity Check

Enable Sanity Check to check training resource health. Automatically isolates faulty nodes and triggers backend O&M to prevent early-stage failures. For more information, see SanityCheck: Compute resource health check.

Note

Sanity Check is supported only for PyTorch jobs that are submitted with a Lingjun AI computing resource quota and that request more than 0 GPUs.

Roles and permissions

Configure the instance RAM role. For more information, see Configure a DLC RAM role.

Instance RAM role

Description

Default Role of PAI

Uses the AliyunPAIDLCDefaultRole service-linked role, which has fine-grained MaxCompute and OSS permissions. The temporary credentials issued for this role grant the following access:

  • MaxCompute tables: the same permissions as the DLC instance owner.

  • OSS: limited to the default OSS bucket configured for the current workspace.

Custom Role

Select or enter a custom RAM role. The instance assumes this role's permissions when accessing Alibaba Cloud services through STS temporary credentials.

Does Not Associate Role

No RAM role is associated with the DLC job. This is the default option.

References

Appendix

Create a job using an SDK or CLI

Python SDK

Step 1: Install the Alibaba Cloud Credentials tool

Install the Credentials tool to configure your credentials for Alibaba Cloud SDK API calls. Prerequisites:

  • Python 3.7 or later.

  • Alibaba Cloud SDK 2.0 or later.

pip install alibabacloud_credentials

Step 2: Obtain an AccessKey

This example uses an AccessKey pair to configure access credentials. To mitigate security risks, we recommend that you set your AccessKey ID and AccessKey secret in the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables.
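
Before you create a client, it can help to confirm that both variables are set. The following check is a minimal sketch that uses only the standard library. The function name missing_credential_vars is hypothetical and is not part of the Credentials tool.

```python
import os

REQUIRED_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def missing_credential_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credential_vars()
if missing:
    # Report only the variable names, never the values.
    print("Set these environment variables first:", ", ".join(missing))
```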

Step 3: Install the Python SDK

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.17

Step 4: Submit the job

Using public resources

The following sample code creates and submits a job.

#!/usr/bin/env python3

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset.
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset.
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    '''Poll the job every 5 seconds until it reaches a terminal status.
    '''
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account is authorized to use DLC and has sufficient permissions.
    region_id = 'cn-hangzhou'
    # An AccessKey provides full API access. For better security, we recommend using a RAM user for API access and daily O&M.
    # Do not hard-code your AccessKey ID and AccessKey secret in your code. This can leak your credentials and compromise the security of your resources.
    # This example authenticates by reading the AccessKey from environment variables using the Credentials SDK.
    cred = CredClient()

    # 1. Create clients.
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can also enter the name of the workspace that you created in the workspace_name parameter.
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Get the image list.
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    if len(images.body.images) == 0:
        raise RuntimeError('found no images')

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Get the datasets.
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # If the dataset does not exist, create one.
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # NAS file system ID.
            # General-purpose NAS: 31a8e4****.
            # Extreme NAS: Must start with extreme-, for example, extreme-0015****.
            # CPFS: Must start with cpfs-, for example, cpfs-125487****.
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Get the code source list.
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Get the list of DLC node specifications.
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job.
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Get the DLC job list.
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()
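
The wait_for_job_to_terminate function in the sample above polls with no upper bound. The following sketch shows one way to add a timeout. It is written against a generic get_status callable so that it runs without the SDK. The function name, the default timeout, and the injectable sleep and clock parameters are assumptions of this sketch and are not part of the DLC SDK.

```python
import time

# Terminal statuses taken from the sample code above.
TERMINAL_STATUSES = ("Succeeded", "Failed", "Stopped")


def wait_for_terminal_status(get_status, timeout_s=3600.0, interval_s=5.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal status is returned or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("job still {!r} after {} seconds".format(status, timeout_s))
        sleep(interval_s)


# Demonstration with a fake status source instead of a real job.
fake_statuses = iter(["Running", "Running", "Succeeded"])
print(wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda s: None))
```

With the SDK, the callable can wrap the same call that the sample uses, for example lambda: dlc_client.get_job(job_id, GetJobRequest()).body.status.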

Using a subscription quota

  1. Log on to the PAI console.

  2. Find your workspace ID on the Workspaces page.

  3. Find the resource quota ID of your dedicated resource group.

  4. Use the following sample code to create and submit a job. For a list of available public images, see Step 2: Prepare an image.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API.
    region = 'cn-hangzhou'
    # An AccessKey provides full API access. For better security, we recommend using a RAM user for API access and daily O&M.
    # Do not hard-code your AccessKey ID and AccessKey secret in your code. This can leak your credentials and compromise the security of your resources.
    # This example authenticates by reading the AccessKey from environment variables using the Credentials SDK.
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Declare the resource configuration for the job. For image selection, you can refer to the public image list in the documentation or provide your own image URL.
    spec = JobSpec(
        type='Worker',
        image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Define the job.
    req = CreateJobRequest(
            resource_id='<Replace with your resource quota ID>',
            workspace_id='<Replace with your WorkspaceID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job.
    response = client.create_job(req)
    # Get the job ID.
    job_id = response.body.job_id
    
    # Query the job status.
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the job's command.
    print('job command:', job.user_command)

Using preemptible instances

  • SpotDiscountLimit (Spot discount)

    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou. 
    cred = CredClient()
    workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs. 
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-spot-job',
        'JobType': 'PyTorchJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.g7.xlarge',
                "SpotSpec": {
                    "SpotStrategy": "SpotWithPriceLimit",
                    "SpotDiscountLimit": 0.4,
                }
            },
        ],
        'UserVpc': {
            "VpcId": "vpc-0jlq8l7qech3m2ta2****",
            "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
            "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
        },
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    
  • SpotPriceLimit (Spot price)

    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'
    cred = CredClient()
    workspace_id = '12****'
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-spot-job',
        'JobType': 'PyTorchJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.g7.xlarge',
                "SpotSpec": {
                    "SpotStrategy": "SpotWithPriceLimit",
                    "SpotPriceLimit": 0.011,
                }
            },
        ],
        'UserVpc': {
            "VpcId": "vpc-0jlq8l7qech3m2ta2****",
            "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
            "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
        },
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    

Key parameters for preemptible instances:

Parameter

Description

SpotStrategy

The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit.

SpotDiscountLimit

The spot discount bidding type.

Note
  • You cannot specify the SpotDiscountLimit and SpotPriceLimit parameters at the same time.

  • The SpotDiscountLimit parameter is valid only for Lingjun resources.

SpotPriceLimit

The spot price bidding type.

UserVpc

This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides.
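
The mutual exclusivity of the two bidding parameters can be enforced before the request is built. The following helper is a sketch. The function name build_spot_spec is hypothetical, and the requirement that exactly one limit be set is an assumption of this sketch. The source states only that the two parameters cannot be specified together.

```python
def build_spot_spec(discount_limit=None, price_limit=None):
    """Return a SpotSpec dict with exactly one bidding parameter."""
    if (discount_limit is None) == (price_limit is None):
        raise ValueError("set exactly one of SpotDiscountLimit or SpotPriceLimit")
    # The bidding parameters take effect only with this strategy.
    spec = {"SpotStrategy": "SpotWithPriceLimit"}
    if discount_limit is not None:
        spec["SpotDiscountLimit"] = discount_limit
    else:
        spec["SpotPriceLimit"] = price_limit
    return spec


print(build_spot_spec(discount_limit=0.4))
print(build_spot_spec(price_limit=0.011))
```

The returned dict can be used as the "SpotSpec" value in the JobSpecs entries of the samples above.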

CLI

Step 1: Download the client and authenticate

Download the client for Linux (64-bit) or macOS and complete user authentication. For details, see Preparations.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. Find your workspace ID on the Workspaces page.

  3. Find your resource quota ID.

  4. Prepare a parameter file named tfjob.params as shown in the following example. For more information about how to configure the parameter file, see Submit command.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Replace with your resource quota ID> 
    workspace_id=<Replace with your WorkspaceID>
  5. Run the following command to submit the DLC job to the specified workspace and resource quota by using the parameter file.

    ./dlc submit tfjob --job_file  ./tfjob.params
  6. Run the following command to view the submitted DLC job.

    ./dlc get job <jobID>
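
If you generate parameter files from a script, the key=value layout shown in step 4 can be rendered from a dictionary. This is a minimal sketch. The function name render_params is hypothetical, and the sketch assumes that values need no quoting or escaping, which the source does not specify.

```python
def render_params(params: dict) -> str:
    """Render a dict as one key=value line per entry."""
    lines = []
    for key, value in params.items():
        text = str(value)
        if "\n" in text:
            raise ValueError("value for {!r} must be a single line".format(key))
        lines.append("{}={}".format(key, text))
    return "\n".join(lines) + "\n"


content = render_params({
    "name": "test_cli_tfjob_001",
    "workers": 1,
    "worker_cpu": 4,
    "worker_memory": "4Gi",
    "command": "echo good && sleep 120",
})
print(content, end="")
```

Write the returned string to tfjob.params, and then submit it with the command in step 5.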

Advanced parameters

Parameter (key)

Supported frameworks

Description

Value

ReleaseResourcePolicy

ALL

Custom resource release policy. Default: all pod resources are released after job completion. Set to pod-exit to release a pod's resources immediately when that pod exits.

pod-exit

EnableNvidiaIBGDA

ALL

Enable or disable IBGDA when the GPU driver is loaded.

true or false

EnableNvidiaGDRCopy

ALL

If set to true, installs the GDRCopy kernel module (version 2.4.4).

true or false

EnablePaiNUMACoreBinding

ALL

Enable or disable NUMA binding.

true or false

EnableResourcePreCheck

ALL

Checks whether the quota has sufficient total resources (node specifications) for all roles in the job at submission time.

true or false

createSvcForAllWorkers

PyTorch

Enable or disable network communication between workers.

  • If this parameter is set to true, all PyTorch workers can communicate with each other.

  • When set to false or not configured, only the master is accessible by default.

When enabled, the domain name of each worker is its worker name, such as dlcxxxxx-master-0. The job name (dlcxxxxx) is passed to the worker in the JOB_NAME environment variable. You can then derive the domain name of the specific worker that you want to access.

true or false

customPortList

PyTorch

Defines the network ports to open on each worker. This parameter can be used with createSvcForAllWorkers to enable network communication between workers.

If this parameter is not configured, only port 23456 is opened on the master. Because port 23456 is already in use by default, do not include it in the custom port list.

Important

This parameter is mutually exclusive with customPortNumPerWorker. Do not configure both parameters at the same time.

A set of strings separated by semicolons (;). Each string can be a port number or a port range specified by a hyphen (-). Example: 10000;10001-10010, which represents 11 consecutive port numbers from 10000 to 10010.

customPortNumPerWorker

PyTorch

Number of network ports to open for each worker. This parameter can be used with createSvcForAllWorkers to enable network communication between workers.

If this parameter is not configured, only port 23456 is opened on the master by default. DLC randomly assigns the specified number of ports to each worker. The assigned ports, in a semicolon-separated format, are passed to the worker in the CUSTOM_PORTS environment variable.

Important
  • This parameter is mutually exclusive with customPortList. Do not configure both parameters at the same time.

  • The customPortNumPerWorker parameter is not supported for DLC jobs that use Lingjun AI computing resources because these resources do not support custom ports.

Integer (up to 65536)

RayRuntimeEnv

Ray

For Ray, define the runtime environment by configuring RayRuntimeEnv.

Important

The environment variable and third-party library configurations are overwritten by this configuration.

Configure environment variables and third-party libraries ({pip: requirements.txt, env_vars: {key: value}}).

RayRedisAddress

Ray

The address of the external GCS Redis.

String

RayRedisUsername

Ray

The username of the external GCS Redis.

String

RayRedisPassword

Ray

The password of the external GCS Redis.

String

RaySubmitterBackoffLimit

Ray

The number of retries for the submitter.

Positive integer (int)

RayObjectStoreMemoryBytes

Ray

Sets the shared memory for a node. For example, to configure 1 GiB of shared memory per node:

{
  "RayObjectStoreMemoryBytes": "1073741824"
}

Positive integer (int)
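
The customPortList value format described in the table above can be checked before submission. The following sketch expands a value such as 10000;10001-10010 into individual ports and rejects port 23456. The function name expand_port_list is hypothetical, and the 1 to 65535 bound is the general TCP port range, not a limit stated in this document.

```python
def expand_port_list(spec: str) -> list:
    """Expand 'port;start-end;...' into a list of port numbers."""
    ports = []
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError("invalid range {!r}".format(part))
            ports.extend(range(start, end + 1))
        else:
            ports.append(int(part))
    for port in ports:
        if not 1 <= port <= 65535:
            raise ValueError("port {} is out of range".format(port))
    if 23456 in ports:
        raise ValueError("port 23456 is opened on the master by default")
    return ports


print(len(expand_port_list("10000;10001-10010")))
```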