
Platform for AI: Create a training job

Last Updated: Dec 09, 2025

Platform for AI Deep Learning Containers (PAI-DLC) helps you quickly create single-node or distributed training jobs. PAI-DLC uses Kubernetes to launch compute nodes, which eliminates the need to manually purchase machines and configure runtimes while maintaining your existing development workflow. This service supports multiple deep learning frameworks and provides flexible resource configuration options, making it ideal for users who need to start training jobs rapidly.

Prerequisites

  1. Activate PAI and create a Workspace. Log on to the PAI console, select a region at the top of the page, and follow the prompts to authorize and activate the service. For more information, see Activate PAI and create a workspace.

  2. Grant permissions to your account. You can skip this step if you use your Alibaba Cloud account. If you use a RAM user, the user must be assigned one of the following roles: Algorithm Developer, Algorithm O&M, or Workspace Administrator. For more information, see the Configure Member and Role section in Manage workspaces.

Create a job in the console

If you are new to PAI-DLC, we recommend creating a job by using the console. PAI-DLC also provides options to create jobs using an SDK or the command line.

  1. Go to the Create Job page.

    1. Log on to the PAI console, select the target region and workspace at the top of the page, and then click Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. Configure the parameters for the training job in the following sections.

    • Basic Information

      Configure the Job Name and Tag.

    • Environment Information

      Image config

      In addition to Alibaba Cloud Image, the following image types are also supported:

      • Custom Image: You can use a custom image that is added to PAI. The image repository must be set to allow public pulls, or the image must be stored in Container Registry (ACR). For more information, see Custom images.

        Note

        When you select Lingjun resources for the resource quota and use a custom image, you must manually install RDMA to fully use the high-performance RDMA network of Lingjun. For more information, see RDMA: Use a high-performance network for distributed training.

      • Image Address: You can configure the URL of a custom image or an official image that is accessible over the Internet.

        • If the image URL is private, click enter the username and password, and then provide the username and password for the image repository.

        • To improve image pull speeds, see Accelerate image pulling.

      Mount dataset

      Datasets provide the data files required for model training. The following two types are supported:

      • Custom Dataset: Create a custom dataset to store your training data. You can set it as Read-only and select a specific dataset version from the Version List.

      • Public Dataset: PAI provides pre-built public datasets that can only be mounted in read-only mode.

      Mount Path: Specifies the path where the dataset is mounted inside the PAI-DLC container, such as /mnt/data. Access the dataset from your code by using this path. For more information about mount configurations, see Use cloud storage in DLC training jobs.

      Important

      If you configure a CPFS dataset, the PAI-DLC job must be configured with a Virtual Private Cloud (VPC) that is the same as the CPFS VPC. Otherwise, the job might not proceed past the "Environment Preparation" stage.
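
      For reference, the following is a minimal Python sketch of reading mounted data from inside the container. It assumes the example mount path /mnt/data above; adjust the path to match your configuration.

        import os

        mount_path = '/mnt/data'  # the Mount Path configured for the dataset
        # List a few entries to confirm that the dataset is visible in the container.
        for name in sorted(os.listdir(mount_path))[:10]:
            print(os.path.join(mount_path, name))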

      Mount storage

      Directly mount a data source path to read data or to store intermediate and final result files.

      • Supported data source types: OSS, General-purpose NAS, Extreme NAS, and BMCPFS (only for Lingjun AI Computing Service resources).

      • Advanced Settings: Use advanced configurations to enable specific features for different data source types. For example:

        • OSS: Set {"mountType":"ossfs"} in the advanced configuration to mount OSS storage using ossfs.

        • General-purpose NAS and CPFS: Set the nconnect parameter in the advanced configuration to improve throughput when the PAI-DLC container accesses NAS. For more information, see How do I resolve poor performance when accessing NAS on a Linux OS?. For example, {"nconnect":"<example_value>"}. Replace <example_value> with a positive integer.

      For more information, see Use cloud storage in DLC training jobs.

      Startup Command

      Set the startup command for the job. Shell commands are supported. PAI-DLC automatically injects common environment variables for PyTorch and TensorFlow, such as MASTER_ADDR and WORLD_SIZE. Access them using the format $variable_name. Examples of startup commands:

      • Run Python: python -c "print('Hello World')"

      • PyTorch multi-node, multi-GPU distributed training:

        python -m torch.distributed.launch \
            --nproc_per_node=2 \
            --master_addr=${MASTER_ADDR} \
            --master_port=${MASTER_PORT} \
            --nnodes=${WORLD_SIZE} \
            --node_rank=${RANK} \
            train.py --epochs=100

      • Set a shell file path as the startup command: /ml/input/config/launch.sh
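
      For reference, the following minimal Python sketch (a hypothetical snippet in train.py) reads the injected environment variables:

        import os

        # PAI-DLC injects these for PyTorch and TensorFlow jobs; print them to verify.
        for key in ('MASTER_ADDR', 'MASTER_PORT', 'WORLD_SIZE', 'RANK'):
            print(key, '=', os.environ.get(key))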

      Show more configurations

      Environment Variable

      In addition to the automatically injected common environment variables for PyTorch and TensorFlow, you can provide custom environment variables in the Key:Value format. You can configure up to 20 environment variables.

      Third-party Libraries

      If the container image is missing third-party libraries, add them here. The following two methods are supported:

      • Select from List: Directly enter the names of the third-party libraries in the text box below.

      • Directory of requirements.txt: Write the third-party libraries into a requirements.txt file, upload the file to the PAI-DLC container, and then specify the file's path within the container in the text box.

      Code Builds

      Upload the code files required for training into the PAI-DLC container. The following two methods are supported:

      • Online configuration: If you have a Git code repository and access permissions, create a code source to allow PAI-DLC to access the job code.

      • Local Upload: Click the upload button to upload local code files. After the upload succeeds, set the Mount Path to a path inside the container, such as /mnt/data.

    • Resource information

      Resource Type

      The default value is General Computing. Only the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions support selecting Lingjun AI Computing Service.

      Source

      • Public Resource:

        • Billing method: Pay-as-you-go.

        • Use cases: Public resources may experience queuing. Recommended for scenarios with a small number of jobs and no strict time constraints.

        • Limitations: The maximum supported resources are 2 GPU cards and 8 CPU cores. To exceed this limit, contact your business manager.

      • Resource Quota: Includes General Computing or Lingjun AI Computing Service resources.

        • Billing method: Subscription.

        • Use cases: Suitable for scenarios with a large number of jobs that require highly reliable execution.

        • Special parameters:

          • Resource Quota: Set the quantity of resources such as GPUs and CPUs. To prepare a resource quota, see Add a resource quota.

          • Priority: Specifies the execution priority for concurrent jobs. The value ranges from 1 to 9, where 1 is the lowest priority.

      • Preemptible Resources:

        • Billing method: Pay-as-you-go.

        • Use cases: To reduce resource costs, use preemptible resources, which are usually offered at a discount.

        • Limitations: Stable availability is not guaranteed. You may not obtain preemptible resources immediately, and they may be reclaimed. For more information, see Use preemptible jobs.

      Framework

      Supports the following deep learning training frameworks and tools: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, and Ray.

      Note

      When you select Lingjun AI Computing Service for your Resource Quota, only TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray jobs are supported.

      Job Resources

      Based on your selected Framework, configure resources for Worker, PS, Chief, Evaluator, and GraphLearn nodes. When you select the Ray framework, click Add Role to customize Worker roles, enabling the mixed execution of heterogeneous resources.

      • Use public resources: The following parameters can be configured:

        • Quantity: The number of nodes to run the DLC job.

        • Resource Type: Select the resource specifications. The console displays the corresponding price. For more billing information, see DLC billing.

      • Use resource quota: Configure the number of nodes, CPU (cores), GPU (cards), Memory (GiB), and Shared Memory (GiB) for each node type. You can also configure the following special parameters:

        • Schedule On Specified Nodes: Execute the job on specified compute nodes.

        • Idle Resources: Jobs can run on the idle resources of other quotas, which improves resource utilization. However, when the original quota's jobs need those resources, the job running on idle resources is preempted and terminated, and the resources are returned to the original quota. For more information, see Use idle resources.

        • CPU Affinity: Enabling CPU affinity binds the processes in a container or Pod to specific CPU cores. This reduces CPU cache misses and context switches, thereby improving CPU utilization and application performance. It is suitable for performance-sensitive and real-time scenarios.

      • Use preemptible resources: In addition to the number of nodes and resource specifications, configure the Bid parameter, which sets a maximum bid for requesting preemptible instances. Select a bidding method:

        • By discount: The maximum price is based on the market price of the resource specification, with discrete options from a 10% to 90% discount, representing the upper limit for bidding. When the maximum bid for the preemptible instance is greater than or equal to the market price and there is sufficient inventory, the instance can be requested.

        • By price: The maximum bid is within the market price range.

      Expand for more configurations: Maximum Duration, Retention Period, Advanced Framework Configuration

      Maximum Duration

      Set the maximum duration for the job to run. A job that exceeds this duration is stopped. The default is 30 days.

      Retention Period

      Configure the retention period for jobs that complete successfully or fail. A retained job continues to occupy resources. Jobs that exceed the retention period are deleted.

      Important

      Deleted DLC jobs cannot be recovered. Proceed with caution.

      Advanced Framework Configuration

      For a list of configurable parameters and their value descriptions, see Advanced parameter list.

      • The parameters ReleaseResourcePolicy, EnableNvidiaIBGDA, EnableNvidiaGDRCopy, EnablePaiNUMACoreBinding, and EnableResourcePreCheck are supported by all frameworks.

      • When the Framework is PyTorch, the following parameters can be configured: createSvcForAllWorkers, customPortList, and customPortNumPerWorker.

        Important

        Because Lingjun AI Computing Service does not provide custom port capabilities, the customPortNumPerWorker parameter is not supported when you submit DLC jobs with Lingjun AI Computing Service resources.

      • When the Framework is Ray, the following parameters can be configured: RayRuntimeEnv, RayRedisAddress, RayRedisUsername, RayRedisPassword, RaySubmitterBackoffLimit, and RayObjectStoreMemoryBytes. Note that the Environment Variable and Third-party Libraries configurations are overridden by the RayRuntimeEnv configuration.

      The following configuration formats are supported:

      • Plaintext: Must be configured as a comma-separated string of key=value pairs. The key is a supported advanced parameter, and the value is its corresponding setting.

      • JSON

      Typical configuration scenarios:

      • Scenario 1: PyTorch advanced configuration

        Use advanced configuration parameters to enable network communication between Workers for more flexible training methods. For example, use the extra open ports to launch frameworks like Ray within the DLC container and coordinate with PyTorch for more advanced distributed training. Configuration example:

        createSvcForAllWorkers=true,customPortNumPerWorker=100

        You can then obtain the domain name and available port numbers in the Startup Command through the $JOB_NAME and $CUSTOM_PORTS environment variables to launch and connect to frameworks like Ray.
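
        For reference, the following minimal Python sketch consumes these variables inside a worker. The worker host-name pattern used here is hypothetical and follows the naming convention described in the Advanced parameter list in the appendix:

        import os

        job_name = os.environ['JOB_NAME']              # the job name, such as dlcxxxxx
        ports = os.environ['CUSTOM_PORTS'].split(';')  # the assigned ports, semicolon-separated
        world_size = int(os.environ.get('WORLD_SIZE', '1'))
        # Hypothetical worker domain names derived from the job name.
        workers = ['{}-worker-{}'.format(job_name, i) for i in range(world_size)]
        print('first open port on each worker:', [(w, ports[0]) for w in workers])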

      • Scenario 2: Manually configure RayRuntimeEnv for the Ray framework (including dependent libraries and environment variables)

        Configuration example:

        {"RayRuntimeEnv": "{pip: requirements.txt, env_vars: {key: value}}"}
      • Scenario 3: Custom resource release rule

        Currently, the only supported release policy is pod-exit, which automatically releases a pod's resources when the pod exits. Configuration example:

        {
          "ReleaseResourcePolicy": "pod-exit"
        }
    • VPC configuration

      • If you do not configure a VPC, the job uses the public network and a public gateway. The limited bandwidth of the public gateway can cause slowdowns or interruptions during job execution.

      • Configuring a VPC and selecting the corresponding vSwitch and security group improves network bandwidth, stability, and security. Additionally, the cluster where the job runs can directly access services within this VPC.

        Important
        • When using a VPC, ensure that the job's resource group, dataset storage (OSS), and code repository are in the same region, and that their respective VPCs can communicate with each other.

        • When using a CPFS dataset, you must configure a VPC, and the selected VPC must be the same as the one used by CPFS. Otherwise, the submitted DLC training job will fail to mount the dataset and will remain in the Preparing state until it times out.

        • When submitting a DLC job with Lingjun AI Computing Service preemptible instances, you must configure a VPC.

        Additionally, you can configure an Internet gateway in one of two ways:

        • Public Gateway: Its network bandwidth is limited, and the network speed may not meet your needs during high-concurrency access or when downloading large files.

        • Private Gateway: To address the bandwidth limitations of the public gateway, create an Internet NAT Gateway in the DLC's VPC, bind an Elastic IP (EIP), and configure SNAT entries. For more information, see Improve public network access speed through a private gateway.

    • Fault tolerance and diagnosis

      Automatic Fault Tolerance

      Turn on the Automatic Fault Tolerance switch and configure the parameters. The system then monitors the job to promptly detect and avoid errors at the algorithm layer, which improves GPU utilization. For more information, see AIMaster: An elastic and automatic fault tolerance engine.

      Note

      When you enable automatic fault tolerance, the system starts an AIMaster instance that runs alongside the job instance, which consumes computing resources. The AIMaster instance resource usage is as follows:

      • Resource quota: 1 CPU core and 1 GiB of memory.

      • Public resources: Uses the ecs.c6.large specification.

      Health Check

      When you enable Health Check, a comprehensive check is performed on the resources involved in the training. It automatically isolates faulty nodes and triggers automated backend O&M processes, which reduces the likelihood of encountering issues in the early stages of job training and improves the training success rate. For more information, see SanityCheck: Computing power health check.

      Note

      Health check is only supported for PyTorch training jobs submitted with Lingjun AI Computing Service resource quotas and with a GPU count greater than 0.

    • Roles and permissions

      The following describes how to configure the instance RAM role. For more information about this feature, see Configure a DLC RAM role.

      Default Role of PAI

      Uses the AliyunPAIDLCDefaultRole service-linked role, which has only fine-grained permissions to access MaxCompute (ODPS) and OSS. The temporary access credentials issued based on the PAI default role have the following permissions:

      • When accessing MaxCompute tables, it has permissions equivalent to the DLC instance owner.

      • When accessing OSS, it can only access the default OSS Bucket configured for the current workspace.

      Custom Role

      Select or enter a custom RAM role. When accessing cloud products within the instance using STS temporary credentials, the permissions are consistent with those of this custom role.

      Does Not Associate Role

      Does not associate a RAM role with the DLC job. This is the default option.

After you configure the parameters, click OK.

References

After you submit the training job, you can perform the following operations:

  • View basic information, resource views, and operation logs for the job. For more information, see View training details.

  • Manage jobs, including cloning, stopping, and deleting them. For more information, see Manage training jobs.

  • View analysis reports through TensorBoard. For more information, see Visualization tool Tensorboard.

  • Set up monitoring and alerts for jobs. For more information, see Training monitoring and alerts.

  • View the bill for job execution. For more information, see Bill details.

  • Forward DLC job logs from the current workspace to a specified SLS Logstore for custom analysis. For more information, see Subscribe to task logs.

  • Create notification rules in the PAI workspace's event center to track and monitor the status of DLC jobs. For more information, see Notifications.

  • For potential issues and their solutions during DLC job execution, see DLC FAQ.

  • For use cases of DLC, see DLC use cases.

Appendix

Create a job using an SDK or the command line

Python SDK

Step 1: Install the Alibaba Cloud Credentials tool

When you use an Alibaba Cloud SDK to call OpenAPI for resource operations, you must install the Credentials tool to configure your credentials. Requirements:

  • Python 3.7 or later.

  • Alibaba Cloud SDK V2.0.

pip install alibabacloud_credentials

Step 2: Obtain an AccessKey

This example uses an AccessKey pair to configure credentials. To prevent your AccessKey information from being leaked, configure your AccessKey ID and AccessKey secret as environment variables. The environment variable names should be ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET, respectively.
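
Before running the samples below, you can optionally verify that the variables are set. A minimal sketch:

import os

# Fail fast if the credentials are not configured in the environment.
for key in ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'):
    if not os.environ.get(key):
        raise SystemExit(f'{key} is not set. Export it before running the SDK samples.')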

Step 3: Install the Python SDK

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.17

Step 4: Submit the job

Submit a job using public resources

The following code shows how to create and submit a job.

Sample code for creating and submitting a job

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset.
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset.
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    # Poll the job status until it reaches a terminal state.
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account has granted permissions to DLC and has sufficient permissions.
    region_id = 'cn-hangzhou'
    # The AccessKey of an Alibaba Cloud account has permissions to access all APIs. We recommend that you use a RAM user for API access or daily O&M.
    # We strongly recommend that you do not save your AccessKey ID and AccessKey secret in your project code. This can lead to an AccessKey leak and threaten the security of all resources in your account.
    # This example shows how to use the Credentials SDK to read the AccessKey from environment variables for identity verification.
    cred = CredClient()

    # 1. create client;
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Get the list of workspaces. You can also enter the name of the workspace you created in the workspace_name parameter.
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Get the list of images.
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Get the dataset.
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # If the current dataset does not exist, create a dataset.
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # NAS file system ID.
            # General-purpose NAS: 31a8e4****.
            # Extreme NAS: Must start with extreme-, for example, extreme-0015****.
            # CPFS: Must start with cpfs-, for example, cpfs-125487****.
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Get the list of code sources.
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Get the list of node specifications for DLC.
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job.
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Get the list of DLC jobs.
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()

Submit a job using a subscription resource quota

  1. Log on to the PAI console.

  2. On the workspace list page, find your workspace ID.

  3. Find the resource quota ID of your dedicated resource group.

  4. Use the following code to create and submit a job. For a list of available public images, see Step 2: Prepare an image.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API.
    region = 'cn-hangzhou'
    # The AccessKey of an Alibaba Cloud account has permissions to access all APIs. We recommend that you use a RAM user for API access or daily O&M.
    # We strongly recommend that you do not save your AccessKey ID and AccessKey secret in your project code. This can lead to an AccessKey leak and threaten the security of all resources in your account.
    # This example shows how to use the Credentials SDK to read the AccessKey from environment variables for identity verification.
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Declare the resource configuration for the job. For image selection, you can refer to the public image list in the documentation or provide your own image URL.
    spec = JobSpec(
        type='Worker',
        image=f'registry-vpc.{region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Declare the execution content of the job.
    req = CreateJobRequest(
            resource_id='<Replace with your resource quota ID>',
            workspace_id='<Replace with your WorkspaceID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job.
    response = client.create_job(req)
    # Get the job ID.
    job_id = response.body.job_id
    
    # Query the job status.
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the command executed by the job.
    print('job command:', job.user_command)
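
    To block until the job finishes, you can poll the status in a loop. The following minimal sketch reuses the client, job_id, and GetJobRequest objects from the sample above; the terminal statuses match those in the public-resource sample:

    import time

    # Poll every 10 seconds until the job reaches a terminal state.
    while True:
        status = client.get_job(job_id, GetJobRequest()).body.status
        print('job status:', status)
        if status in ('Succeeded', 'Failed', 'Stopped'):
            break
        time.sleep(10)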

Submit a job using preemptible resources

  • SpotDiscountLimit (Spot discount)

    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou. 
    cred = CredClient()
    workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs. 
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-spot-job',
        'JobType': 'PyTorchJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.g7.xlarge',
                "SpotSpec": {
                    "SpotStrategy": "SpotWithPriceLimit",
                    "SpotDiscountLimit": 0.4,
                }
            },
        ],
        'UserVpc': {
            "VpcId": "vpc-0jlq8l7qech3m2ta2****",
            "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
            "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
        },
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    
  • SpotPriceLimit (Spot price)

    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'
    cred = CredClient()
    workspace_id = '12****'
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-spot-job',
        'JobType': 'PyTorchJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.g7.xlarge',
                "SpotSpec": {
                    "SpotStrategy": "SpotWithPriceLimit",
                    "SpotPriceLimit": 0.011,
                }
            },
        ],
        'UserVpc': {
            "VpcId": "vpc-0jlq8l7qech3m2ta2****",
            "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
            "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
        },
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    

The following list describes the key parameters:

  • SpotStrategy: The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit.

  • SpotDiscountLimit: The spot discount bidding type.

    Note
    • You cannot specify the SpotDiscountLimit and SpotPriceLimit parameters at the same time.

    • The SpotDiscountLimit parameter is valid only for Lingjun resources.

  • SpotPriceLimit: The spot price bidding type.

  • UserVpc: Required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides.

Command line

Step 1: Download the client and perform user authentication

Download the Linux 64-bit or macOS version of the client tool based on your operating system and complete user authentication. For more information, see Preparations.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. On the workspace list page, view your workspace ID.

  3. View the resource quota ID of your dedicated resource group.

  4. Prepare the parameter file tfjob.params by referring to the following content. For more information about how to configure the parameter file, see Submit command.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Replace with your resource quota ID> 
    workspace_id=<Replace with your WorkspaceID>
  5. Use the following code example to pass the params_file parameter to submit the job. You can submit the DLC job to the specified workspace and resource quota.

    ./dlc submit tfjob --job_file ./tfjob.params
  6. Use the following code to view the DLC job that you submitted.

    ./dlc get job <jobID>

Advanced parameter list

Each entry below lists the parameter key, the frameworks that support it, its description, and its accepted values.

  • ReleaseResourcePolicy

    Supported frameworks: ALL

    Description: Configures a custom resource release rule. This parameter is optional. If you do not configure it, all pod resources are released when the job ends. If configured, the only supported value is pod-exit, which releases a pod's resources when the pod exits.

    Value: pod-exit

  • EnableNvidiaIBGDA

    Supported frameworks: ALL

    Description: Specifies whether to enable the IBGDA feature when loading the GPU driver.

    Value: true or false

  • EnableNvidiaGDRCopy

    Supported frameworks: ALL

    Description: Specifies whether to install the GDRCopy kernel module. Version 2.4.4 is currently installed.

    Value: true or false

  • EnablePaiNUMACoreBinding

    Supported frameworks: ALL

    Description: Specifies whether to enable NUMA core binding.

    Value: true or false

  • EnableResourcePreCheck

    Supported frameworks: ALL

    Description: When you submit a job, checks whether the total resources (node specifications) in the quota can meet the specifications of all roles in the job.

    Value: true or false

  • createSvcForAllWorkers

    Supported frameworks: PyTorch

    Description: Specifies whether to allow network communication between workers.

    • If true, network communication is allowed between all PyTorch workers.

    • If false or not configured, only the master can be accessed by default.

    When enabled, the domain name of each worker is its name, such as dlcxxxxx-master-0. The job name (dlcxxxxx) is passed to the workers through the JOB_NAME environment variable, allowing you to determine a specific worker's domain name for access.

    Value: true or false

  • customPortList

    Supported frameworks: PyTorch

    Description: Defines the network ports to open on each worker. Can be used with createSvcForAllWorkers to enable network communication between workers. If not configured, only port 23456 on the master is opened by default. Avoid using port 23456 in the custom port list.

    Important: This parameter is mutually exclusive with customPortNumPerWorker. Do not set both at the same time.

    Value: A semicolon-separated string, where each item is a port number or a hyphen-connected port range, such as 10000;10001-10010 (which expands to the 11 consecutive ports from 10000 to 10010)

  • customPortNumPerWorker

    Supported frameworks: PyTorch

    Description: Requests a number of network ports to open on each worker. Can be used with createSvcForAllWorkers to enable network communication between workers. If not configured, only port 23456 on the master is opened by default. DLC randomly assigns the requested number of ports to each worker and passes them through the CUSTOM_PORTS environment variable in a semicolon-separated format for you to query.

    Important:

    • This parameter is mutually exclusive with customPortList. Do not set both at the same time.

    • Because Lingjun AI Computing Service does not provide custom port capabilities, this parameter is not supported when you submit DLC jobs with Lingjun AI Computing Service resources.

    Value: Integer (maximum 65536)

  • RayRuntimeEnv

    Supported frameworks: Ray

    Description: When the framework is Ray, defines the runtime environment.

    Important: The Environment Variable and Third-party Libraries configurations are overridden by this configuration.

    Value: Environment variables and third-party libraries, for example {pip: requirements.txt, env_vars: {key: value}}

  • RayRedisAddress

    Supported frameworks: Ray

    Description: The address of the external GCS Redis.

    Value: String

  • RayRedisUsername

    Supported frameworks: Ray

    Description: The username for the external GCS Redis.

    Value: String

  • RayRedisPassword

    Supported frameworks: Ray

    Description: The password for the external GCS Redis.

    Value: String

  • RaySubmitterBackoffLimit

    Supported frameworks: Ray

    Description: The number of retries for the submitter.

    Value: Positive integer

  • RayObjectStoreMemoryBytes

    Supported frameworks: Ray

    Description: Configures shared memory for a node. For example, to configure 1 GiB of shared memory for each node:

    {
      "RayObjectStoreMemoryBytes": "1073741824"
    }

    Value: Positive integer
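
For convenience, the following small Python helper (hypothetical, not part of the SDK) builds the plaintext comma-separated key=value string described in the Advanced Framework Configuration section:

def to_advanced_config(params):
    """Build the comma-separated key=value string for Advanced Framework Configuration."""
    return ','.join('{}={}'.format(k, v) for k, v in params.items())

print(to_advanced_config({
    'createSvcForAllWorkers': 'true',
    'customPortNumPerWorker': 100,
}))
# -> createSvcForAllWorkers=true,customPortNumPerWorker=100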