
Platform For AI: Use NAS to submit standalone PyTorch transfer learning jobs

Last Updated:Jan 30, 2024

This topic describes how to use Deep Learning Containers (DLC) of Platform for AI (PAI), together with Data Science Workshop (DSW) and Apsara File Storage NAS (NAS), to perform PyTorch-based offline transfer learning training.

Prerequisites

A general-purpose NAS file system is created in a region. For more information, see Create a General-purpose NAS file system in the NAS console.

Limits

The operations described in this topic are applicable only to clusters that use general-purpose computing resources and are deployed in the public resource group.

Step 1: Create datasets

  1. Go to the Dataset Management page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspace list page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Datasets.

  2. On the Dataset management page, click Create dataset.

  3. On the Create dataset panel, configure the following parameters:

    • Name: the name of the dataset.

    • Description: the description of the dataset, which helps distinguish it from other datasets.

    • Select data store: select General-purpose NAS.

    • Select File System: the ID of the existing NAS file system. Select the ID from the drop-down list. You can also log on to the NAS console to view the IDs of the NAS file systems in the region.

    • File System Path: the path in the NAS file system that you want to mount. In this example, set the parameter to /.

    • Default Mount Path: the path at which the NAS data is mounted in the job. In this example, set the parameter to /mnt/data.

    Important

    The DSW instance must reside in the same region as the NAS file system that stores the training data and code.

  4. Click Submit.

Step 2: Create a DSW instance

When you create a DSW instance, click Shared Datasets in the Storage section, select the NAS dataset that you created in Step 1, and set the Mount Path parameter to /mnt/workspace/. For information about other parameters, see Create and manage DSW instances.
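After the instance starts, you can verify that the NAS dataset is mounted at the expected path before you prepare data. The following check is a minimal sketch that you can run in a notebook cell; it assumes the /mnt/workspace/ mount path configured above.

    # Run in a DSW notebook cell to confirm that the NAS mount is visible.
    import os

    mount_path = "/mnt/workspace"  # the Mount Path configured for the dataset
    print(os.listdir(mount_path))  # lists the contents of the NAS file system

If the path does not exist, check the dataset and mount path configuration in the instance settings.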

Step 3: Prepare data

The data used in this topic is available for public access. You download it by using the wget command as described in the following steps, and then decompress and use the data.

  1. Go to the development environment of Data Science Workshop (DSW).

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspace list page, click the name of the workspace that you want to manage.

    3. In the upper-left corner of the page, select the region where you want to use the service.

    4. In the left-side navigation pane, choose Model Training > Notebook Service (DSW).

    5. Optional: On the Notebook Service (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.

    6. Find the DSW instance and click Launch in the Actions column.

  2. In the DSW development environment, click the Notebook tab in the top navigation bar.

  3. Download data.

    1. Click the Create Folder icon in the upper-left toolbar to create a folder. In this example, pytorch_transfer_learning is used as the folder name.

    2. In the DSW development environment, click the Terminal tab in the top navigation bar.

    3. On the Terminal tab, run the following commands: use the cd command to go to the folder that you created, and the wget command to download the dataset.

      cd /mnt/workspace/pytorch_transfer_learning/
      wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz

      https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz is the URL for downloading the dataset file.

    4. Run the tar -xf ./data.tar.gz command to decompress the dataset.

    5. Click the Notebook tab. Go to the pytorch_transfer_learning directory, right-click the extracted hymenoptera_data folder, and then click Rename to rename the folder to input. You can optionally verify the extracted data as shown below.
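The extracted data is expected to follow the torchvision ImageFolder layout: a train folder and a val folder, each containing one subfolder per class. The following optional sanity check is a sketch that assumes this layout; run it in a notebook cell to confirm that the classes and image counts look reasonable.

    # Optional sanity check: load each split with ImageFolder and print
    # the detected classes and the number of images.
    from torchvision import datasets

    root = "/mnt/workspace/pytorch_transfer_learning/input"
    for split in ("train", "val"):
        ds = datasets.ImageFolder(f"{root}/{split}")
        print(split, ds.classes, len(ds), "images")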

Step 4: Prepare the training code and the model storage folder

  1. On the Terminal tab of the DSW instance, run the wget command to download the training code to the pytorch_transfer_learning folder.

    cd /mnt/workspace/pytorch_transfer_learning/
    wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py

    https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py is the URL for downloading the training code.

  2. In the pytorch_transfer_learning folder, create a folder named output to store the trained model.

    mkdir output
  3. View the content contained in the pytorch_transfer_learning folder.

    The folder contains the following content:

    • input: the folder that stores the training data.

    • main.py: the training code file.

    • output: the folder that stores the trained model.

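The downloaded main.py is the actual training script used in this topic. For orientation only, the following is an illustrative sketch of what a standalone transfer learning script with the same -i/-o command-line interface could look like; the model choice (ResNet-18), the hyperparameters, and the saved file name model.pth are assumptions, not the contents of the real main.py.

    # Illustrative sketch of a transfer learning script with a -i/-o interface.
    import argparse
    import os

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("-i", "--input", required=True, help="dataset root with train/ and val/")
        parser.add_argument("-o", "--output", required=True, help="folder for the trained model")
        args = parser.parse_args()

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Standard ImageNet preprocessing for the pretrained backbone.
        transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])
        train_ds = datasets.ImageFolder(os.path.join(args.input, "train"), transform)
        loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=2)

        # Transfer learning: freeze the pretrained backbone and retrain
        # only a new classification head sized to the dataset's classes.
        model = models.resnet18(pretrained=True)
        for p in model.parameters():
            p.requires_grad = False
        model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))
        model = model.to(device)

        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)

        model.train()
        for epoch in range(5):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()
            print(f"epoch {epoch}: last-batch loss {loss.item():.4f}")

        os.makedirs(args.output, exist_ok=True)
        # The file name is an assumption for this sketch.
        torch.save(model, os.path.join(args.output, "model.pth"))

    if __name__ == "__main__":
        main()

You can smoke-test a script like this directly on the Terminal tab with python main.py -i ./input -o ./output before you submit a DLC job.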

Step 5: Create a training job

  1. Go to the Create Job page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane of the workspace page, choose Model Development and Training > Deep Learning Containers (DLC). On the Distributed Training Jobs page, click Create Job. The Create Job page appears.

  2. On the Create Job page, configure the key parameters described below.

    • Basic Information

      • Job Name: the name of the job.

      • Node Image: select Alibaba Cloud Image and then select a PyTorch image from the drop-down list. In this example, the pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04 image is used.

      • Datasets: select the NAS dataset that you created in Step 1.

      • Code Builds: you do not need to configure this parameter.

      • Job Command: set this parameter to python /mnt/data/pytorch_transfer_learning/main.py -i /mnt/data/pytorch_transfer_learning/input -o /mnt/data/pytorch_transfer_learning/output. The dataset is mounted at /mnt/data in the job, so these paths point to the same NAS directory that is mounted at /mnt/workspace in the DSW instance.

      • Third-party Libraries: click Select from List and then enter the following content in the text box:

        numpy==1.16.4
        absl-py==0.11.0

    • Resource Configuration

      • Resource Quota: select Public Resource Group.

      • Framework: select PyTorch.

      • Job Resource: select an instance type and specify the number of instances. For example, select ecs.g6.xlarge on the CPU tab of the Resource Type page and set the Nodes parameter to 1.

  3. Click OK.

Step 6: View the job details and logs

  1. On the Distributed Training Jobs page, click the name of the job that you want to view.

  2. On the Details page, view the Basic Information and Resources of the job.

  3. On the lower part of the Details page of the job, click the Instance tab, find the instance that you want to manage, and then click Log in the Actions column to view the logs.

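When the job succeeds, the trained model is written to the output folder, which resides on the NAS file system and is therefore also visible in the DSW instance at /mnt/workspace/pytorch_transfer_learning/output. The following is a minimal sketch of loading the model for inference; the file name model.pth is an assumption, so use whatever name your training script actually writes.

    # Load the trained model from NAS, for example back in the DSW instance.
    import torch

    model = torch.load(
        "/mnt/workspace/pytorch_transfer_learning/output/model.pth",
        map_location="cpu",  # load onto CPU regardless of where it was trained
    )
    model.eval()  # switch to inference mode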