This topic describes how to use Deep Learning Containers (DLC) and Data Science Workshop (DSW) of Platform for AI (PAI) together with Apsara File Storage NAS (NAS) to perform offline transfer learning with PyTorch.
Prerequisites
A general-purpose NAS file system is created in a region. For more information, see Create a General-purpose NAS file system in the NAS console.
Limits
The operations described in this topic apply only to clusters that use general computing resources and are deployed in the public resource group.
Step 1: Create datasets
Go to the Datasets page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose .
On the Dataset management page, click Create dataset.
On the Create dataset panel, configure the parameters. The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| Name | The name of the dataset. |
| Description | The description of the dataset. The description helps distinguish the dataset from other datasets. |
| Select data store | Select General-purpose NAS. |
| Select File System | The ID of the existing NAS file system. You can log on to the NAS console to view the ID of the NAS file system in the region. You can also view the ID of the NAS file system from the drop-down list. |
| File System Path | The mount path of the NAS file system. In this example, set the parameter to /. |
| Default Mount Path | The path of the NAS data in the job. In this example, set the parameter to /mnt/data. |

Important: The region of the DSW instance must be the same as the region of the NAS file system in which training data and code are stored.
Click Submit.
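The relationship between File System Path and Default Mount Path can be illustrated with a short sketch. It assumes ordinary mount semantics: the directory given as File System Path appears inside the job at Default Mount Path. The helper name path_in_job is hypothetical and is not part of any PAI SDK.

```python
import posixpath


def path_in_job(nas_path: str,
                file_system_path: str = "/",
                mount_path: str = "/mnt/data") -> str:
    """Map an absolute path on the NAS file system to the path seen in a job.

    Hypothetical helper for illustration only. The defaults are the values
    used in this example (File System Path "/" and mount path "/mnt/data").
    """
    # Path of the file relative to the directory that is mounted.
    relative = posixpath.relpath(nas_path, start=file_system_path)
    if relative == ".." or relative.startswith("../"):
        raise ValueError(f"{nas_path} is outside {file_system_path}")
    # Re-root the relative path under the mount path.
    return posixpath.normpath(posixpath.join(mount_path, relative))


print(path_in_job("/pytorch_transfer_learning/main.py"))
```

With the values in this example, a file stored at /pytorch_transfer_learning/main.py on the NAS file system would be visible to a job at /mnt/data/pytorch_transfer_learning/main.py.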
Step 2: Create a DSW instance
When you create a DSW instance, click Add to add Dataset in the Environment information section, select the NAS dataset that you created in Step 1, and set Mount Path to /mnt/data/. Set Working Directory to dataset-/mnt/data/.
For information about other parameters, see Create a DSW instance.
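After the instance starts, it can be useful to confirm that the dataset is available at the mount path. The following is a minimal sketch that uses only the Python standard library. The function name describe_mount is hypothetical, and whether the path is reported as a mount point depends on how the instance attaches the dataset.

```python
import os


def describe_mount(path: str = "/mnt/data") -> dict:
    """Report basic facts about a directory that should hold the dataset.

    Hypothetical helper for illustration; "/mnt/data" is the mount path
    used in this example.
    """
    return {
        "exists": os.path.isdir(path),           # the directory is present
        "is_mount_point": os.path.ismount(path),  # a file system is mounted here
        "writable": os.access(path, os.W_OK),     # the current user can write to it
    }


print(describe_mount())
```

You can run the snippet in a notebook cell or from the Terminal tab of the DSW instance.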
Step 3: Prepare data
The data used in this topic is publicly available. Download the data, and then decompress and use it as described in the following steps.
Go to the development environment of a DSW instance.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the upper-left corner of the page, select the region where you want to use PAI.
In the left-side navigation pane, choose .
Optional: On the Data Science Workshop (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.
Click Open in the Actions column of the instance.
In the DSW development environment, click the Notebook tab in the top navigation bar.
Download data.
Click the icon in the upper-left toolbar to create a folder. In this example, pytorch_transfer_learning is used as the folder name.
In the DSW development environment, click the Terminal tab in the top navigation bar.
On the Terminal tab, run the following commands. Use the cd command to go to the folder that you created, and the wget command to download the dataset.
cd /mnt/workspace/pytorch_transfer_learning/
wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz
https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz is the URL for downloading the dataset file.
Run the tar -xf ./data.tar.gz command to decompress the dataset.
Click the Notebook tab. Go to the pytorch_transfer_learning directory, right-click the extracted hymenoptera_data folder, and then click Rename to rename the folder to input.
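If you prefer to script the download, extraction, and rename instead of using the terminal and the Rename menu, the same steps can be sketched in Python with the standard library. The function names download and extract_and_rename are hypothetical. The sketch assumes that the archive extracts to a top-level hymenoptera_data folder, as described above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL of the dataset file, as given in this topic.
DATA_URL = ("https://pai-public-data.oss-cn-beijing.aliyuncs.com/"
            "hol-pytorch-transfer-cv/data.tar.gz")


def download(url: str, target: Path) -> Path:
    """Download url to target, similar to the wget command."""
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))
    return target


def extract_and_rename(archive: Path,
                       extracted_name: str = "hymenoptera_data",
                       new_name: str = "input") -> Path:
    """Extract the archive next to itself, then rename the extracted folder."""
    work_dir = archive.parent
    # Equivalent to: tar -xf ./data.tar.gz (only extract archives you trust).
    with tarfile.open(archive) as tar:
        tar.extractall(work_dir)
    renamed = work_dir / new_name
    (work_dir / extracted_name).rename(renamed)
    return renamed


# Example usage on the DSW instance (performs a network download):
# work_dir = Path("/mnt/workspace/pytorch_transfer_learning")
# archive = download(DATA_URL, work_dir / "data.tar.gz")
# print(extract_and_rename(archive))
```

The usage lines are commented out so that nothing is downloaded until you choose to run them.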
Step 4: Prepare the training code and the model storage folder
On the Terminal tab of the DSW instance, run the wget command to download the training code to the pytorch_transfer_learning folder.
cd /mnt/workspace/pytorch_transfer_learning/
wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py
https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py is the URL for downloading the training code.
In the pytorch_transfer_learning folder, create a folder named output to store the trained model.
mkdir output
View the content contained in the pytorch_transfer_learning folder.
The folder contains the following content:
input: the folder that stores the training data.
main.py: the training code file.
output: the folder that stores the trained model.
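The expected folder layout can also be checked programmatically before you submit a job. The following is a minimal sketch. The helper name missing_entries is hypothetical, and the expected names are taken from the list above.

```python
from pathlib import Path
from typing import List

# Entries that the pytorch_transfer_learning folder should contain.
EXPECTED = {"input": "dir", "main.py": "file", "output": "dir"}


def missing_entries(root: Path) -> List[str]:
    """Return the names of expected entries that are absent under root."""
    missing = []
    for name, kind in EXPECTED.items():
        path = root / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            missing.append(name)
    return missing


# Example usage on the DSW instance:
# print(missing_entries(Path("/mnt/workspace/pytorch_transfer_learning")))
```

An empty list means that the training data folder, the training code, and the model output folder are all in place.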
Step 5: Create a scheduled job
Go to the Create Job page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. Find the workspace that you want to manage and click the workspace ID.
In the left-side navigation pane of the Workspace page, choose . On the Distributed Training Jobs page, click Create Job. The Create Job page appears.
On the Create Job page, configure the required parameters. The following table describes key parameters.
| Section | Parameter | Description |
| --- | --- | --- |
| Basic Information | Job Name | Specify the name of the job. |
| | Node Image | Select Alibaba Cloud Image and then select a PyTorch image from the drop-down list. In this example, the pytorch-training:1.12pai-gpu-py38-cu113-ubuntu20.04 image is used. |
| | Datasets | Select the NAS dataset that you created in Step 1. |
| | Code Builds | You do not need to configure this parameter. |
| | Startup Command | Set this parameter to python /mnt/data/pytorch_transfer_learning/main.py -i /mnt/data/pytorch_transfer_learning/input -o /mnt/data/pytorch_transfer_learning/output. |
| | Third-party Libraries | Click Select from List and then enter the following content in the text box: numpy==1.16.4 absl-py==0.11.0 |
| Resource Configuration | Resource Quota | Select Public Resources. |
| | Framework | Select PyTorch. |
| | Job Resource | Select an instance type and specify the number of instances. Example: select ecs.g6.xlarge on the CPU tab of the Resource Type page. Set the Nodes parameter to 1. |
Click OK.
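The startup command is built from the dataset mount path and the folder layout prepared in Step 4. The following sketch shows how the three paths relate to each other. The helper name startup_command is hypothetical. If you use a different mount path or project folder, the arguments change accordingly.

```python
import posixpath
import shlex


def startup_command(mount_path: str = "/mnt/data",
                    project: str = "pytorch_transfer_learning") -> str:
    """Assemble the job startup command from the mount path and folder name."""
    root = posixpath.join(mount_path, project)
    args = [
        "python",
        posixpath.join(root, "main.py"),       # training code
        "-i", posixpath.join(root, "input"),   # training data folder
        "-o", posixpath.join(root, "output"),  # trained model folder
    ]
    # shlex.join quotes an argument only if it contains unsafe characters.
    return shlex.join(args)


print(startup_command())
```

With the default values, the function returns the same command that is shown for the Startup Command parameter.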
Step 6: View the job details and logs
On the Distributed Training Jobs page, click the name of the job that you want to view.
On the Details page, view the Basic Information and Resources of the job.
In the lower part of the Details page of the job, click the Instance tab, find the instance that you want to manage, and then click Log in the Actions column to view the logs.
The following figure shows an example of the logs.