After you complete the preparations, you can submit Deep Learning Containers (DLC) jobs in the Platform for AI (PAI) console, or by using SDK for Python or command lines. This topic describes how to submit a DLC job.
Prerequisites
The required resources, images, datasets, and code sets are prepared. For more information, see Before you begin.
Environment variables are configured if you use the SDK for Python to submit a training job. For more information, see the "Install the Credentials tool" section in the Manage access credentials topic and the "Step 2: Configure environment variables" section in the Get started with Alibaba Cloud Darabonba SDK for Python topic.
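The Credentials tool reads the AccessKey pair from environment variables. Before you run the sample code in this topic, you can verify that the variables are set. The following is a minimal check; the variable names are the standard ones that the Alibaba Cloud Credentials tool reads.

import os

# Verify that the AccessKey environment variables that the Credentials tool
# reads are present before you submit jobs by using the SDK.
for name in ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'):
    assert os.environ.get(name), f'{name} is not set'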
Submit a job in the console
Step 1: Go to the Create Job page
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane of the workspace page, choose Model Development and Training > Distributed Training (DLC). On the Distributed Training Jobs page, click Create Job. The Create Job page appears.
Step 2: Configure the parameters for the training job
Basic Information
In the Basic Information section, configure the following parameters.
| Parameter | Description |
| --- | --- |
| Node Image | The image that the job nodes run. You can select an image provided by PAI, a community image, or a custom image, or enter the address of an image that you can access. |
| Datasets | The storage that the job reads from and writes to at runtime. The dataset provides a larger storage space for the training job. Select the dataset that you prepared. For information about how to create a dataset, see (Optional) Prepare a dataset. |
| Code Builds | The code that the job runs. Select the code source that you prepared. |
| Third-party Libraries | The third-party Python libraries to install before the job starts. You can enter the library names or specify the path of a requirements.txt file. |
| Environment Variable | Additional configuration information or parameters for the job, in the key:value format. |
| Job Command | The command that the job runs. Shell commands are supported. For example, you can use a python command to start your training script. When you submit a training job, PAI automatically injects multiple general environment variables, which your command and code can read at runtime. For a sketch of how a training script reads these variables, see the example after this table. |
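For example, a training script can read the injected environment variables at runtime. The following is a minimal sketch; the variable names (MASTER_ADDR, WORLD_SIZE, and RANK) are illustrative examples of distributed-training variables and may differ by job type, so check the environment variables that PAI injects for your framework.

import os

# Illustrative only: the exact variable names depend on the job type.
master_addr = os.environ.get('MASTER_ADDR', 'localhost')
world_size = int(os.environ.get('WORLD_SIZE', '1'))
rank = int(os.environ.get('RANK', '0'))
print(f'node rank {rank} of {world_size}, master at {master_addr}')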
Resource Configuration
In the Resource Configuration section, configure the parameters described in the following table.

| Parameter | Description |
| --- | --- |
| Resource Quota | You can select the public resource group, general computing resources, or Lingjun resources that you prepared. For information about how to reserve resource quotas, see Overview. Note: The public resource group provides up to 2 GPUs and 8 vCPUs. To increase the resource quota, contact your account manager. |
| Priority | This parameter is available only if you set the Resource Quota parameter to general computing resources or Lingjun resources. Specify the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority. |
| Framework | The deep learning training framework and tool, such as TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, or MPIJob. The frameworks provide rich features and operations that you can use to build, train, and optimize deep learning models. Note: If you set the Resource Quota parameter to Lingjun resources, you can submit only TensorFlow, PyTorch, ElasticBatch, and MPIJob jobs. |
| Job Resource | Configure worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes based on the framework that you select. For an SDK sketch of how these settings map to a job request, see the example after this table. |
| Automatic Fault Tolerance | If you turn on Automatic Fault Tolerance, the system monitors the job to detect errors in the algorithm of the training job and improves GPU utilization. For more information, see AIMaster: elastic automatic fault tolerance engine. |
| Sanity Check | If you turn on Sanity Check, the system checks the resources that are used to run the training job, isolates faulty nodes, and triggers automated O&M processes in the background. This helps prevent job failures in the early stage of training and improves the training success rate. For more information, see Sanity Check. Note: You can enable the sanity check feature only for training jobs that run on Lingjun resources. |
| Maximum Duration | The maximum duration for which the job can run. The job is automatically stopped if its uptime exceeds the maximum duration. Default value: 30. Unit: days. |
| Instance Retention Period | The period for which the job instance is retained after the job is completed. After the retention period ends, the job is deleted. Important: Deleted DLC jobs cannot be restored. Proceed with caution. |
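When you submit jobs by using the SDK, the Framework parameter corresponds to the JobType field, and the Job Resource parameter corresponds to the JobSpecs list, with one entry per node type. The following is a minimal sketch for a TFJob with one PS node and two worker nodes; the image URI and instance type are placeholders, and you can list valid instance types by calling ListEcsSpecs as shown in the SDK sample later in this topic.

from alibabacloud_pai_dlc20201203.models import CreateJobRequest

# Sketch: one parameter server node and two worker nodes for a TFJob.
# '<image URI>' and 'ecs.c6.large' are placeholders.
req = CreateJobRequest().from_map({
    'WorkspaceId': '<your workspace ID>',
    'DisplayName': 'sample-ps-worker-job',
    'JobType': 'TFJob',
    'JobSpecs': [
        {'Type': 'PS', 'Image': '<image URI>', 'PodCount': 1, 'EcsSpec': 'ecs.c6.large'},
        {'Type': 'Worker', 'Image': '<image URI>', 'PodCount': 2, 'EcsSpec': 'ecs.c6.large'},
    ],
    'UserCommand': 'python train.py',
})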
VPC
This section is available only if you set the Resource Quota parameter to the public resource group.
If you do not configure a VPC, the job connects over the Internet. Because Internet bandwidth is limited, the job may be stuck or may not run as expected. We recommend that you configure a VPC to ensure sufficient network bandwidth and stable performance.
Select a VPC, a vSwitch, and a security group in the current region. After the configuration takes effect, the cluster on which the job runs directly accesses the services in this VPC and performs access control based on the security group.
Important: Before you run a DLC job, make sure that the instances in the resource group and the OSS bucket of the dataset reside in VPCs in the same region, and that the VPCs are connected to the network of the code repository.
If you select a CPFS dataset, you must also configure a VPC, and it must be the same VPC that is configured for the CPFS dataset. Otherwise, exceptions may occur and the DLC training jobs may remain queued after you submit them.
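If you submit jobs by using the SDK instead of the console, you can attach the same VPC settings to the job request. The following is a minimal sketch; it assumes that the CreateJob operation accepts a UserVpc field that carries the VPC, vSwitch, and security group IDs, and all IDs are placeholders.

from alibabacloud_pai_dlc20201203.models import CreateJobRequest

# Sketch: attach VPC settings to a job submitted through the SDK.
# Assumption: CreateJob accepts a UserVpc field; all IDs are placeholders.
req = CreateJobRequest().from_map({
    'WorkspaceId': '<your workspace ID>',
    'DisplayName': 'sample-dlc-job-with-vpc',
    'JobType': 'TFJob',
    'JobSpecs': [
        {'Type': 'Worker', 'Image': '<image URI>', 'PodCount': 1, 'EcsSpec': '<instance type>'},
    ],
    'UserCommand': "echo 'Hello World'",
    'UserVpc': {
        'VpcId': 'vpc-***',
        'SwitchId': 'vsw-***',
        'SecurityGroupId': 'sg-***',
    },
})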
Step 3: Submit the training job
Click Submit to submit the training job. You can go to the jobs list to view the status of the job. For more information about the status of the DLC job, see Appendix: DLC job status.
Submit a job by using SDK for Python or command lines
Use SDK for Python
Step 1: Install SDK for Python
Install the workspace SDK.
pip install alibabacloud_aiworkspace20210204==3.0.1
Install the DLC SDK.
pip install alibabacloud_pai_dlc20201203==1.4.0
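To confirm that both packages are installed, you can print the installed versions:

from importlib.metadata import version

# Print the installed versions of the workspace and DLC SDKs.
for pkg in ('alibabacloud_aiworkspace20210204', 'alibabacloud_pai_dlc20201203'):
    print(pkg, version(pkg))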
Step 2: Submit a job
Use a public resource group to submit the job
You can use the following sample code to create and submit a DLC job:
#!/usr/bin/env python3
from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)
from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest,
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset.'''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset.'''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def wait_for_job_to_terminate(client, job_id):
    # Poll the job status every 5 seconds until the job terminates.
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():
    # Make sure that your Alibaba Cloud account is granted the required permissions on DLC.
    region_id = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. To prevent security risks, we recommend that you call API operations or perform routine O&M as a RAM user.
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and the security of resources within your account may be compromised.
    # In this example, the Credentials SDK reads the AccessKey pair from environment variables to implement identity verification.
    cred = CredClient()

    # 1. Create the clients.
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='aiworkspace.{}.aliyuncs.com'.format(region_id),
        )
    )
    dlc_client = DLCClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
        )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter.
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)
    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')
    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the image list.
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))
    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the dataset.
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)
    if len(datasets.body.datasets) == 0:
        # Create a dataset if the specified dataset does not exist.
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system.
            # General-purpose NAS: 31a8e4****.
            # Extreme NAS: The ID must start with extreme-. Example: extreme-0015****.
            # CPFS: The ID must start with cpfs-. Example: cpfs-125487****.
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the code source list.
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the DLC node specification list.
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job.
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                'Type': 'Worker',
                'Image': image_uri,
                'PodCount': 1,
                'EcsSpec': ecs_specs.body.ecs_specs[0].instance_type,
                'UseSpotInstance': False,
            },
        ],
        'UserCommand': "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                'DataSourceId': dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id
    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the DLC job list.
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()
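The sample above defines create_oss_dataset but does not call it. If your training data is stored in OSS instead of NAS, you can create the dataset in the same way. The following snippet reuses the workspace_client, region_id, and workspace_id variables from the sample; the bucket name and paths are placeholders.

# Placeholder values: replace the bucket, endpoint, and paths with your own.
dataset_id = create_oss_dataset(
    client=workspace_client,
    region=region_id,
    workspace_id=workspace_id,
    name='example-oss-data',
    oss_bucket='<your-bucket>',
    oss_endpoint=f'oss-{region_id}.aliyuncs.com',
    oss_path='/datasets/',
    mount_path='/mnt/data/oss')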
Use general computing resources to submit the job
Log on to the PAI console.
On the Workspaces page, obtain the ID of your workspace.
On the General Computing Resources page, obtain the resource quota ID of your dedicated resource group.
Use the following code to create and submit a job. For information about the available public images, see Step 2: Prepare an image.
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import (
    CreateJobRequest,
    JobSpec,
    ResourceConfig,
    GetJobRequest,
)

# Initialize a client to access the DLC API operations.
region = 'cn-hangzhou'
# The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. To prevent security risks, we recommend that you call API operations or perform routine O&M as a RAM user.
# We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and the security of resources within your account may be compromised.
# In this example, the Credentials SDK reads the AccessKey pair from environment variables to implement identity verification.
cred = CredClient()
client = Client(
    config=Config(
        credential=cred,
        region_id=region,
        endpoint=f'pai-dlc.{region}.aliyuncs.com',
    )
)

# Specify the resource configuration of the job. You can select a public image or specify an image address. For more information about available public images, see the reference documentation.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    resource_config=ResourceConfig(cpu='1', memory='2Gi')
)

# Specify the execution information for the job.
req = CreateJobRequest(
    resource_id='<Replace with your resource quota ID>',
    workspace_id='<Replace with your workspace ID>',
    display_name='sample-dlc-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)

# Submit the job.
response = client.create_job(req)
# Obtain the job ID.
job_id = response.body.job_id

# Query the status of the job.
job = client.get_job(job_id, GetJobRequest()).body
print('job status:', job.status)

# View the command that the job runs.
print(job.user_command)
Use command lines
Step 1: Download the client and perform user authentication
Download the DLC client for your operating system and authenticate your credentials. For more information, see Before you begin.
Step 2: Submit the job
Log on to the PAI console.
On the Workspaces page, obtain the ID of your workspace.
On the General Computing Resources page, obtain the resource quota ID.
Create a parameter file named ./tfjob.params and copy the following content into the file. Replace the parameter values as required. For more information about how to use command lines in the DLC client, see Supported commands.

name=test_cli_tfjob_001
workers=1
worker_cpu=4
worker_gpu=0
worker_memory=4Gi
worker_shared_memory=4Gi
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=echo good && sleep 120
resource_id=<Your resource quota ID> # This parameter can be left empty if you use a public resource group.
workspace_id=<Your workspace ID>
The following sample command shows how to pass the parameter file to the --job_file option to submit a DLC job to the specified workspace and resource quota.
dlc submit tfjob --job_file ./tfjob.params
Run the following command to query the DLC job that you created.
dlc get job <jobID>
References
After you submit a job, you can perform the following operations:
Monitor the status of the job. You can also view the billing details when the job is completed. For more information, see Billing details.
View the basic information, resource view, and operation logs of the job. For more information, see View training details.
Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.
View the training result analysis on TensorBoard. For more information, see Use TensorBoard to view training results in DLC.