Resource Orchestration Service:ALIYUN::PAIDLC::Job

Last Updated:Sep 05, 2023

ALIYUN::PAIDLC::Job is used to create a Machine Learning Platform for AI (PAI) job to run in a cluster.

Syntax

{
  "Type": "ALIYUN::PAIDLC::Job",
  "Properties": {
    "ThirdpartyLibs": List,
    "Options": String,
    "Priority": Integer,
    "Envs": String,
    "JobMaxRunningTimeMinutes": Integer,
    "WorkspaceId": String,
    "CodeSource": Map,
    "UserVpc": Map,
    "JobSpecs": List,
    "UserCommand": String,
    "DataSources": List,
    "JobType": String,
    "ResourceId": String,
    "ThirdpartyLibDir": String,
    "DisplayName": String,
    "SuccessPolicy": String,
    "Settings": Map
  }
}

Properties

| Property | Type | Required | Editable | Description | Constraint |
| --- | --- | --- | --- | --- | --- |
| ThirdpartyLibs | List | No | No | The third-party Python libraries and their versions. | Example: numpy==1.16.1. |
| Options | String | No | No | The additional configurations of the job. | You can use this property to adjust the behavior of the attached data source. For example, if the attached data source of the job is of the Object Storage Service (OSS) type, you can set this property to fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16 to override the default parameters of JindoFS. |
| Priority | Integer | No | Yes | The priority of the job. | Default value: 1. Valid values: 1 to 9, where 1 is the lowest priority and 9 is the highest priority. |
| Envs | String | No | No | The environment variable configurations. | None. |
| JobMaxRunningTimeMinutes | Integer | No | No | The maximum running duration of the job. | Unit: minutes. |
| WorkspaceId | String | Yes | No | The workspace ID. | None. |
| CodeSource | Map | No | No | The code source of the job. | Before the nodes of the job start to run, Deep Learning Containers (DLC) automatically downloads the configured code from the code source and mounts the code to a local path of the container. For more information, see CodeSource properties. |
| UserVpc | Map | No | No | The configurations of the user virtual private cloud (VPC). | For more information, see UserVpc properties. |
| JobSpecs | List | Yes | No | The running configurations of the job. | For more information, see JobSpecs properties. |
| UserCommand | String | Yes | No | The startup command for all nodes of the job. | None. |
| DataSources | List | No | No | All data sources of the job. | Each data source represents a distributed file system. Each data source is mounted to a local path of the container that runs on each node, and the MountPath property in DataSources specifies that path. The processes started by the startup command of the job can directly access the distributed file system at the path specified by MountPath. For more information, see DataSources properties. |
| JobType | String | Yes | No | The job type. | The value is case-sensitive. Valid values: TFJob, PyTorchJob, XGBoostJob, OneFlowJob, and ElasticBatch. |
| ResourceId | String | No | No | The ID of the resource group to which the job belongs. | If you leave this property empty, the job is added to a public resource group. If the workspace of the job is associated with a dedicated resource group, you can set the value to the ID of the dedicated resource group. For more information about how to create a dedicated resource group and query its ID, see Create and manage general training resources. |
| ThirdpartyLibDir | String | No | No | The name of the folder in which the requirements.txt file of the third-party Python libraries resides. | Before each node runs the startup command specified by UserCommand, DLC fetches the requirements.txt file from the folder and runs pip install -r to install the required packages and libraries. |
| DisplayName | String | Yes | No | The name of the job. | The name can be up to 256 characters in length and can contain digits, letters, underscores (_), periods (.), and hyphens (-). |
| SuccessPolicy | String | No | No | The policy that is used to check whether a distributed multi-node job is successful. | Only TensorFlow distributed multi-node jobs are supported. Valid values: ChiefWorker (the job is considered successful when the pod on the chief node completes operations) and AllWorkers (default; the job is considered successful when all worker nodes complete operations). |
| Settings | Map | No | No | The additional parameter configurations of the job. | None. |
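
As a minimal sketch of how the properties above fit together, the following resource definition fills in the required properties and a few optional ones. All values are illustrative assumptions: the angle-bracket placeholders must be replaced with IDs from your own account, and the display name, startup command, priority, and running time are hypothetical.

```json
{
  "Type": "ALIYUN::PAIDLC::Job",
  "Properties": {
    "WorkspaceId": "<your-workspace-id>",
    "DisplayName": "pytorch-demo_job.v1",
    "JobType": "PyTorchJob",
    "UserCommand": "python /mnt/code/train.py",
    "Priority": 5,
    "JobMaxRunningTimeMinutes": 120,
    "ThirdpartyLibs": ["numpy==1.16.1"],
    "Options": "fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16",
    "JobSpecs": [
      {
        "Type": "Worker",
        "PodCount": 1,
        "UseSpotInstance": false,
        "EcsSpec": "<ecs-instance-type>",
        "Image": "<image-address>"
      }
    ]
  }
}
```

ResourceId is omitted in this sketch, so the job would be added to a public resource group.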

CodeSource syntax

"CodeSource": {
  "MountPath": String,
  "Commit": String,
  "Branch": String,
  "CodeSourceId": String
}

CodeSource properties

| Property | Type | Required | Editable | Description | Constraint |
| --- | --- | --- | --- | --- | --- |
| MountPath | String | No | No | The path to which the code is mounted. | By default, the mount path that is configured in the code source is used. |
| Commit | String | No | No | The commit ID of the code to download for the job. | By default, the commit ID that is configured in the code source is used. |
| Branch | String | No | No | The branch of the code repository that is referenced when the job is running. | By default, the branch that is configured in the code source is used. |
| CodeSourceId | String | Yes | No | The ID of the code source. | None. |
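
For illustration, a CodeSource value that overrides the branch and mount path of an existing code source might look like the following. The ID placeholder, branch name, and mount path are assumptions, not values from this reference.

```json
{
  "CodeSource": {
    "CodeSourceId": "<your-code-source-id>",
    "Branch": "main",
    "MountPath": "/mnt/code"
  }
}
```

Commit is left out here, so the commit ID configured in the code source would apply.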

UserVpc syntax

"UserVpc": {
  "VpcId": String,
  "SecurityGroupId": String,
  "SwitchId": String,
  "ExtendedCIDRs": List
}

UserVpc properties

| Property | Type | Required | Editable | Description | Constraint |
| --- | --- | --- | --- | --- | --- |
| VpcId | String | Yes | No | The VPC ID. | None. |
| SecurityGroupId | String | No | No | The ID of the security group. | None. |
| SwitchId | String | No | No | The vSwitch ID. | If you leave this property empty, the system automatically selects a vSwitch based on the inventory status. You can also specify a vSwitch ID. |
| ExtendedCIDRs | List | No | No | The extended CIDR blocks. | If you leave SwitchId and ExtendedCIDRs empty, the system automatically obtains all CIDR blocks in the specified VPC. If you specify SwitchId, you must specify ExtendedCIDRs. We recommend that you specify all CIDR blocks in the VPC. |
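
The dependency between SwitchId and ExtendedCIDRs can be sketched as follows. Because SwitchId is set, ExtendedCIDRs is set as well. The IDs are placeholders and the CIDR blocks are hypothetical examples; use the CIDR blocks of your own VPC.

```json
{
  "UserVpc": {
    "VpcId": "<your-vpc-id>",
    "SecurityGroupId": "<your-security-group-id>",
    "SwitchId": "<your-vswitch-id>",
    "ExtendedCIDRs": ["192.168.0.0/24", "192.168.1.0/24"]
  }
}
```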

JobSpecs syntax

"JobSpecs": [
  {
    "PodCount": Integer,
    "ImageConfig": Map,
    "UseSpotInstance": Boolean,
    "Type": String,
    "EcsSpec": String,
    "ResourceConfig": Map,
    "Image": String,
    "ExtraPodSpec": Map
  }
]

JobSpecs properties

| Property | Type | Required | Editable | Description | Constraint |
| --- | --- | --- | --- | --- | --- |
| PodCount | Integer | Yes | No | The number of replicas. | None. |
| ImageConfig | Map | No | No | The private image configurations. | None. |
| UseSpotInstance | Boolean | Yes | No | Specifies whether to use preemptible instances. | Valid values: true and false. |
| Type | String | Yes | No | The node type. | The valid values of Type depend on the value of JobType. TFJob: Chief, PS, Worker, Evaluator, and GraphLearn. PyTorchJob: Worker and Master. XGBoostJob: Worker and Master. The master node for a PyTorch or XGBoost job is optional. If you do not specify the master node for a PyTorch or XGBoost job, the system automatically uses the first worker node as the master node. |
| EcsSpec | String | Yes | No | The Elastic Compute Service (ECS) instance specifications of the worker node. | The price varies based on instance specifications. For more information, see Billing of DLC. |
| ResourceConfig | Map | No | No | The resource configurations. | None. |
| Image | String | Yes | No | The address of the image that is run by the worker node. | You can call the ListImages operation to query community images provided by PAI and images optimized by PAI. You can also specify a third-party public image. |
| ExtraPodSpec | Map | No | No | The additional pod configurations. | None. |
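
To show how Type pairs with JobType, here is a rough sketch of a TensorFlow job with one Chief, one PS, and two Worker nodes, combined with the ChiefWorker success policy. The pod counts are arbitrary, and the instance type and image address are placeholders that you would need to fill in.

```json
{
  "JobType": "TFJob",
  "SuccessPolicy": "ChiefWorker",
  "JobSpecs": [
    {
      "Type": "Chief",
      "PodCount": 1,
      "UseSpotInstance": false,
      "EcsSpec": "<ecs-instance-type>",
      "Image": "<image-address>"
    },
    {
      "Type": "PS",
      "PodCount": 1,
      "UseSpotInstance": false,
      "EcsSpec": "<ecs-instance-type>",
      "Image": "<image-address>"
    },
    {
      "Type": "Worker",
      "PodCount": 2,
      "UseSpotInstance": false,
      "EcsSpec": "<ecs-instance-type>",
      "Image": "<image-address>"
    }
  ]
}
```

Each entry describes one node type, and all nodes of that type share the same configuration.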

DataSources syntax

"DataSources": [
  {
    "MountPath": String,
    "DataSourceId": String
  }
]

DataSources properties

| Property | Type | Required | Editable | Description | Constraint |
| --- | --- | --- | --- | --- | --- |
| MountPath | String | No | No | The path to which the data source is mounted. | By default, the mount path that is configured in the data source is used. |
| DataSourceId | String | Yes | No | The ID of the data source. | None. |
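
As an illustrative sketch, the following DataSources value attaches two data sources. The first overrides the mount path, and the second falls back to the mount path configured in the data source itself. Both IDs are placeholders and the path is hypothetical.

```json
{
  "DataSources": [
    {
      "DataSourceId": "<your-first-data-source-id>",
      "MountPath": "/mnt/data"
    },
    {
      "DataSourceId": "<your-second-data-source-id>"
    }
  ]
}
```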

Return values

Fn::GetAtt

JobId: the job ID.

Examples

  • YAML format

    ROSTemplateFormatVersion: '2015-09-01'
    Parameters:
      CodeSource:
        Description: The code source of the job. Before the nodes of the job start,
          DLC automatically downloads the code configured in the code source and
          mounts it to a local path of the container.
        Type: Json
      DataSources:
        Description: The data sources used to run the job.
        Type: Json
      DisplayName:
        Description: 'The name of the job. The name must meet the following requirements:
    
          It can be up to 256 characters in length.
    
          It can contain digits, letters, underscores (_), periods (.), and hyphens
          (-).'
        Type: String
      Envs:
        Description: The environment variable configurations.
        Type: String
      JobMaxRunningTimeMinutes:
        Description: The maximum running duration of the job, in minutes.
        Type: Number
      JobSpecs:
        Description: 'The running configurations of the job, such as the image address,
          startup command, node resource declaration, and number of replicas.
    
          A DLC job consists of different types of nodes. Nodes of the same type share
          exactly the same configuration, which is called a JobSpec. JobSpecs is an
          array of JobSpec objects that describes the configurations of all node types.'
        Type: Json
      JobType:
        AllowedValues:
        - TFJob
        - PyTorchJob
        - XGBoostJob
        - OneFlowJob
        - ElasticBatch
        Description: 'The job type. Valid values: TFJob, PyTorchJob, XGBoostJob, OneFlowJob,
          and ElasticBatch.'
        Type: String
      Options:
        Description: The additional configurations of the job. You can use this parameter
          to adjust the behavior of the attached data source. For example, if the
          job has an attached data source of the OSS type, you can set this parameter
          to fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16 to
          override the default parameters of JindoFS.
        Type: String
      Priority:
        Description: 'The priority of the job. This parameter is optional. Default
          value: 1. Valid values: 1 to 9.
    
          1 is the lowest priority.
    
          9 is the highest priority.'
        Type: Number
      ResourceId:
        Description: 'The ID of the resource group. This parameter is optional.
    
          If you leave this parameter empty, the job is submitted to a public resource
          group.
    
          If the workspace is associated with a dedicated resource group, you can
          specify the ID of that resource group. For more information about how to
          create a dedicated resource group and query its ID, see Create and manage
          general training resources.'
        Type: String
      Settings:
        Description: The additional parameter configurations of the job.
        Type: Json
      SuccessPolicy:
        Description: 'The policy that is used to check whether a distributed multi-node
          job is successful. Only TensorFlow distributed multi-node jobs are supported.
    
          ChiefWorker: The job is considered successful when the pod of the chief node
          succeeds.
    
          AllWorkers: The job is considered successful only when all worker nodes succeed.'
        Type: String
      ThirdpartyLibDir:
        Description: The name of the folder in which the requirements.txt file resides.
          Before each node runs the command specified by UserCommand, DLC fetches
          the requirements.txt file from the folder and runs pip install -r to install
          the packages.
        Type: String
      ThirdpartyLibs:
        Description: The third-party Python libraries to install.
        Type: Json
      UserCommand:
        Description: The startup command for all nodes of the job.
        Type: String
      UserVpc:
        Description: The configurations of the user VPC.
        Type: Json
      WorkspaceId:
        Description: The workspace ID. To obtain the workspace ID, call the ListWorkspaces
          operation.
        Type: String
    Resources:
      Job:
        Properties:
          CodeSource:
            Ref: CodeSource
          DataSources:
            Ref: DataSources
          DisplayName:
            Ref: DisplayName
          Envs:
            Ref: Envs
          JobMaxRunningTimeMinutes:
            Ref: JobMaxRunningTimeMinutes
          JobSpecs:
            Ref: JobSpecs
          JobType:
            Ref: JobType
          Options:
            Ref: Options
          Priority:
            Ref: Priority
          ResourceId:
            Ref: ResourceId
          Settings:
            Ref: Settings
          SuccessPolicy:
            Ref: SuccessPolicy
          ThirdpartyLibDir:
            Ref: ThirdpartyLibDir
          ThirdpartyLibs:
            Ref: ThirdpartyLibs
          UserCommand:
            Ref: UserCommand
          UserVpc:
            Ref: UserVpc
          WorkspaceId:
            Ref: WorkspaceId
        Type: ALIYUN::PAIDLC::Job
    Outputs:
      JobId:
        Description: The ID of the created job.
        Value:
          Fn::GetAtt:
          - Job
          - JobId
  • JSON format

    {
      "ROSTemplateFormatVersion": "2015-09-01",
      "Parameters": {
        "ThirdpartyLibs": {
          "Type": "Json",
          "Description": "The third-party Python libraries to install."
        },
        "Options": {
          "Type": "String",
          "Description": "The additional configurations of the job. You can use this parameter to adjust the behavior of the attached data source. For example, if the job has an attached data source of the OSS type, you can set this parameter to fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16 to override the default parameters of JindoFS."
        },
        "Priority": {
          "Type": "Number",
          "Description": "The priority of the job. This parameter is optional. Default value: 1. Valid values: 1 to 9.\n1 is the lowest priority.\n9 is the highest priority."
        },
        "Envs": {
          "Type": "String",
          "Description": "The environment variable configurations."
        },
        "JobMaxRunningTimeMinutes": {
          "Type": "Number",
          "Description": "The maximum running duration of the job, in minutes."
        },
        "WorkspaceId": {
          "Type": "String",
          "Description": "The workspace ID. To obtain the workspace ID, call the ListWorkspaces operation."
        },
        "CodeSource": {
          "Type": "Json",
          "Description": "The code source of the job. Before the nodes of the job start, DLC automatically downloads the code configured in the code source and mounts it to a local path of the container."
        },
        "UserVpc": {
          "Type": "Json",
          "Description": "The configurations of the user VPC."
        },
        "JobSpecs": {
          "Type": "Json",
          "Description": "The running configurations of the job, such as the image address, startup command, node resource declaration, and number of replicas.\nA DLC job consists of different types of nodes. Nodes of the same type share exactly the same configuration, which is called a JobSpec. JobSpecs is an array of JobSpec objects that describes the configurations of all node types."
        },
        "UserCommand": {
          "Type": "String",
          "Description": "The startup command for all nodes of the job."
        },
        "DataSources": {
          "Type": "Json",
          "Description": "The data sources used to run the job."
        },
        "JobType": {
          "Type": "String",
          "Description": "The job type. Valid values: TFJob, PyTorchJob, XGBoostJob, OneFlowJob, and ElasticBatch.",
          "AllowedValues": [
            "TFJob",
            "PyTorchJob",
            "XGBoostJob",
            "OneFlowJob",
            "ElasticBatch"
          ]
        },
        "ResourceId": {
          "Type": "String",
          "Description": "The ID of the resource group. This parameter is optional.\nIf you leave this parameter empty, the job is submitted to a public resource group.\nIf the workspace is associated with a dedicated resource group, you can specify the ID of that resource group. For more information about how to create a dedicated resource group and query its ID, see Create and manage general training resources."
        },
        "ThirdpartyLibDir": {
          "Type": "String",
          "Description": "The name of the folder in which the requirements.txt file resides. Before each node runs the command specified by UserCommand, DLC fetches the requirements.txt file from the folder and runs pip install -r to install the packages."
        },
        "DisplayName": {
          "Type": "String",
          "Description": "The name of the job. The name must meet the following requirements:\nIt can be up to 256 characters in length.\nIt can contain digits, letters, underscores (_), periods (.), and hyphens (-)."
        },
        "SuccessPolicy": {
          "Type": "String",
          "Description": "The policy that is used to check whether a distributed multi-node job is successful. Only TensorFlow distributed multi-node jobs are supported.\nChiefWorker: The job is considered successful when the pod of the chief node succeeds.\nAllWorkers: The job is considered successful only when all worker nodes succeed."
        },
        "Settings": {
          "Type": "Json",
          "Description": "The additional parameter configurations of the job."
        }
      },
      "Resources": {
        "Job": {
          "Type": "ALIYUN::PAIDLC::Job",
          "Properties": {
            "ThirdpartyLibs": {
              "Ref": "ThirdpartyLibs"
            },
            "Options": {
              "Ref": "Options"
            },
            "Priority": {
              "Ref": "Priority"
            },
            "Envs": {
              "Ref": "Envs"
            },
            "JobMaxRunningTimeMinutes": {
              "Ref": "JobMaxRunningTimeMinutes"
            },
            "WorkspaceId": {
              "Ref": "WorkspaceId"
            },
            "CodeSource": {
              "Ref": "CodeSource"
            },
            "UserVpc": {
              "Ref": "UserVpc"
            },
            "JobSpecs": {
              "Ref": "JobSpecs"
            },
            "UserCommand": {
              "Ref": "UserCommand"
            },
            "DataSources": {
              "Ref": "DataSources"
            },
            "JobType": {
              "Ref": "JobType"
            },
            "ResourceId": {
              "Ref": "ResourceId"
            },
            "ThirdpartyLibDir": {
              "Ref": "ThirdpartyLibDir"
            },
            "DisplayName": {
              "Ref": "DisplayName"
            },
            "SuccessPolicy": {
              "Ref": "SuccessPolicy"
            },
            "Settings": {
              "Ref": "Settings"
            }
          }
        }
      },
      "Outputs": {
        "JobId": {
          "Description": "The ID of the created job.",
          "Value": {
            "Fn::GetAtt": [
              "Job",
              "JobId"
            ]
          }
        }
      }
    }