This topic describes how to use Deep Learning Containers (DLC), Data Science Workshop (DSW), and NAS to perform offline transfer learning based on PyTorch.
Prerequisites
Create a General-purpose NAS file system in your desired region. For more information, see Create a General-purpose NAS file system.
Limitations
The steps in this topic apply only to jobs that run on compute clusters in a public resource group.
Step 1: Create a dataset
-
Go to the Datasets page.
-
Log on to the PAI console.
-
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
-
In the left-side navigation pane, choose .
-
Create a basic dataset and set Storage Type to NAS.
Step 2: Create a DSW instance
Create a DSW instance and configure the following key parameters. For details about the other parameters, see Create a DSW instance.
| Category | Parameter | Description |
| Environment Information | Dataset Mounting | Click Custom Dataset, select the NAS-type dataset that you created in Step 1, and specify the mount path. |
| | Working Directory | Select |
| Network Information | VPC Settings | No VPC configuration is required. |
Step 3: Prepare the data
The data used in this topic is stored in a public location. Download the data and decompress the file before you use it.
-
Go to the development environment of a DSW instance.
-
Log on to the PAI console.
-
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
-
In the upper-left corner of the page, select the region where you want to use PAI.
-
In the left-side navigation pane, choose .
-
Optional: On the Data Science Workshop (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.
-
Click Open in the Actions column of the instance.
-
In the top menu bar of the DSW environment, click the Notebook tab.
-
Download the data.
-
In the upper-left toolbar, click the icon for creating a folder. For this example, name the folder pytorch_transfer_learning.
-
In the top menu bar of the DSW environment, click the Terminal tab to open a terminal.
-
In the terminal, use the cd command to change to the pytorch_transfer_learning folder, then use the wget command to download the dataset:
    cd /mnt/workspace/pytorch_transfer_learning/
    wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz
The URL https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz specifies the download location for the dataset. The output is similar to the following:
    ~/workspace> cd pytorch_transfer_learning/
    ~/workspace/pytorch_transfer_learning> wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz
    --2021-01-28 10:55:55-- https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz
    Resolving pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)...
    Connecting to pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)|xxx|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 47237380 (45M) [application/x-gzip]
    Saving to: 'data.tar.gz'
    data.tar.gz 100%[=======================================================================>] 45.05M 16.0MB/s in 2.8s
    2021-01-28 10:55:58 (16.0 MB/s) - 'data.tar.gz' saved [47237380/47237380]
    ~/workspace/pytorch_transfer_learning> ls
    data.tar.gz hol-transfer_learning_tutorial.py input LICENSE main.py output README.md
    ~/workspace/pytorch_transfer_learning>
-
Use the tar -xf ./data.tar.gz command to decompress the dataset.
-
Switch to the Notebook tab. In the directory tree on the left, navigate to the pytorch_transfer_learning directory. Right-click the extracted data folder (hymenoptera_data), select Rename from the context menu, and rename the folder to input.
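The download, decompress, and rename steps above can also be scripted. The following is a minimal Python sketch that uses only the standard library. The helper names download_archive and prepare_input_dir are hypothetical, and the sketch assumes that the archive unpacks to a top-level hymenoptera_data folder, as described above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Download location of the dataset, as given in the steps above.
DATA_URL = "https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz"


def download_archive(url: str, workdir: Path) -> Path:
    """Download the archive into workdir and return its local path (hypothetical helper)."""
    workdir.mkdir(parents=True, exist_ok=True)
    archive = workdir / "data.tar.gz"
    urllib.request.urlretrieve(url, str(archive))
    return archive


def prepare_input_dir(archive: Path, workdir: Path,
                      extracted_name: str = "hymenoptera_data",
                      target_name: str = "input") -> Path:
    """Extract the archive into workdir and rename the extracted folder to target_name."""
    with tarfile.open(archive) as tar:
        tar.extractall(workdir)
    extracted = workdir / extracted_name
    target = workdir / target_name
    if not extracted.is_dir():
        # Assumption: the archive contains a top-level folder named extracted_name.
        raise FileNotFoundError(f"expected folder {extracted} after extraction")
    if target.exists():
        raise FileExistsError(f"{target} already exists; remove or rename it first")
    extracted.rename(target)
    return target
```

In a DSW instance, the two helpers could be chained with workdir set to Path("/mnt/workspace/pytorch_transfer_learning"). Nothing is downloaded until download_archive is called.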
Step 4: Prepare the code and output folder
-
In the terminal of your DSW instance, use the wget command to download the training code into the pytorch_transfer_learning folder:
    cd /mnt/workspace/pytorch_transfer_learning/
    wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py
The URL https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py is the storage location of the training code.
-
In the pytorch_transfer_learning folder, create a new folder named output to store the trained model:
    mkdir output
-
Ensure that the pytorch_transfer_learning folder contains the following files and directories:
-
input: the folder for the training data.
-
main.py: the training script.
-
output: the folder for storing the output model.
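Before you create the job, the layout listed above can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library; the function name check_layout is hypothetical.

```python
from pathlib import Path


def check_layout(root: Path) -> list:
    """Return the names of the expected items that are missing under root."""
    expected = {
        "input": Path.is_dir,     # folder for the training data
        "main.py": Path.is_file,  # training script
        "output": Path.is_dir,    # folder for storing the output model
    }
    return [name for name, exists_as in expected.items()
            if not exists_as(root / name)]
```

For example, check_layout(Path("/mnt/workspace/pytorch_transfer_learning")) returns an empty list when all three items are in place.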
Step 5: Create a job
-
Log on to the PAI console. In the top navigation bar, select your target region and workspace, and then click Go to DLC.
-
On the DLC page, click Create Task.
-
On the Create job page, configure the following parameters.
| Category | Parameter | Description |
| Basic Information | Job Name | Enter a name for the deep learning training job. |
| Environment Information | Node Image | Select Alibaba Cloud Image and choose a PyTorch image. For example, you can select pytorch-training:1.12-gpu-py39-cu113-ubuntu20.04. |
| | Dataset | Click Custom Dataset and select the NAS dataset that you created in Step 1. |
| | Start Command | Set this parameter to the following command: python /mnt/data/pytorch_transfer_learning/main.py -i /mnt/data/pytorch_transfer_learning/input -o /mnt/data/pytorch_transfer_learning/output |
| | Third-party Libraries | Select Third-Party Libraries and enter the following content in the text box: numpy==1.16.4 absl-py==0.11.0 |
| | Code Build | No configuration is required. |
| Resource Information | Source | Select Public Resources. |
| | Framework | Select PyTorch. |
| | Job Resource | For Job resources, select a server. For example, set Resource Type to ecs.g6.xlarge under CPU and set Nodes to 1. |
-
Click Confirm.
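The Start Command passes the input and output folders to main.py through the -i and -o options. The source of main.py is not shown in this topic, so the following is only a sketch of how a script could accept those two options with argparse. The function build_parser and the long option names --input and --output are hypothetical and are not taken from the actual training script.

```python
import argparse
import shlex

# Start command as given for the Start Command parameter above.
START_COMMAND = (
    "python /mnt/data/pytorch_transfer_learning/main.py "
    "-i /mnt/data/pytorch_transfer_learning/input "
    "-o /mnt/data/pytorch_transfer_learning/output"
)


def build_parser() -> argparse.ArgumentParser:
    """Parser for the -i/-o options (assumed interface, not the actual main.py)."""
    parser = argparse.ArgumentParser(description="Transfer learning job (sketch)")
    parser.add_argument("-i", "--input", required=True,
                        help="folder that contains the training data")
    parser.add_argument("-o", "--output", required=True,
                        help="folder in which to store the trained model")
    return parser


# Drop "python" and the script path, then parse the remaining arguments.
argv = shlex.split(START_COMMAND)[2:]
args = build_parser().parse_args(argv)
print(args.input)
print(args.output)
```

Both paths point into /mnt/data/pytorch_transfer_learning, which is how the folder prepared in Steps 3 and 4 would appear to the job, assuming the NAS dataset is mounted at /mnt/data.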
Step 6: View job details and logs
-
On the Deep Learning Containers (DLC) page, click the name of your job.
-
On the job details page, you can view the Basic Information and Resource Information of the job.
-
At the bottom of the job details page, find the target instance in the Instance section and click Log in the Actions column.
The following is a sample of the log output.
    Epoch 5/9
    ----------
    train Loss: 0.4959 Acc: 0.7951
    val Loss: 0.2213 Acc: 0.9150
    Epoch 6/9
    ----------
    train Loss: 0.6845 Acc: 0.7664
    val Loss: 0.5303 Acc: 0.8301
    Epoch 7/9
    ----------
    train Loss: 0.4233 Acc: 0.8156
    val Loss: 0.2569 Acc: 0.9150
    Epoch 8/9
    ----------
    train Loss: 0.4147 Acc: 0.8443
    val Loss: 0.2397 Acc: 0.9346
    Epoch 9/9
    ----------
    train Loss: 0.3133 Acc: 0.8770
    val Loss: 0.2333 Acc: 0.9346
    Training complete in 3m 50s
    Best val Acc: 0.934641
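To compare epochs without reading the log by eye, the per-epoch metrics can be extracted with regular expressions. The following is a minimal sketch, assuming the log is saved as text with one entry per line in the format of the sample above. The function name parse_log is hypothetical, and SAMPLE_LOG repeats only the last two epochs of the sample.

```python
import re

# Last two epochs of the sample log above, one entry per line.
SAMPLE_LOG = """\
Epoch 8/9
----------
train Loss: 0.4147 Acc: 0.8443
val Loss: 0.2397 Acc: 0.9346
Epoch 9/9
----------
train Loss: 0.3133 Acc: 0.8770
val Loss: 0.2333 Acc: 0.9346
Training complete in 3m 50s
Best val Acc: 0.934641
"""

EPOCH_RE = re.compile(r"Epoch (\d+)/(\d+)")
METRIC_RE = re.compile(r"(train|val) Loss: ([\d.]+) Acc: ([\d.]+)")


def parse_log(text: str) -> dict:
    """Map each epoch number to {"train": (loss, acc), "val": (loss, acc)}."""
    metrics = {}
    epoch = None
    for line in text.splitlines():
        match = EPOCH_RE.search(line)
        if match:
            epoch = int(match.group(1))
            metrics[epoch] = {}
            continue
        match = METRIC_RE.search(line)
        if match and epoch is not None:
            phase = match.group(1)
            metrics[epoch][phase] = (float(match.group(2)), float(match.group(3)))
    return metrics


metrics = parse_log(SAMPLE_LOG)
# Pick the epoch with the highest validation accuracy (first one wins on ties).
best_epoch = max(metrics, key=lambda e: metrics[e]["val"][1])
print(best_epoch, metrics[best_epoch]["val"])
```

The summary line "Best val Acc" is skipped because it does not match the per-epoch pattern.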