Platform For AI: Use NAS to submit standalone PyTorch transfer learning jobs

Last Updated:Sep 05, 2024

This topic describes how to use Deep Learning Containers (DLC) and Data Science Workshop (DSW) of Platform for AI (PAI) together with Apsara File Storage NAS (NAS) to run an offline PyTorch transfer learning job.

Prerequisites

A General-purpose NAS file system is created in the region where you plan to use PAI. For more information, see Create a General-purpose NAS file system in the NAS console.

Limits

The operations described in this topic apply only to clusters that use general computing resources and are deployed in the public resource group.

Step 1: Create datasets

  1. Go to the Datasets page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Asset Management > Datasets.

  2. On the Dataset management page, click Create dataset.

  3. On the Create dataset panel, configure the parameters. The following table describes the parameters.

    Parameter          | Description
    -------------------+------------
    Name               | The name of the dataset.
    Description        | The description of the dataset. The description helps distinguish the dataset from other datasets.
    Select data store  | Select General-purpose NAS.
    Select File System | The ID of the existing NAS file system. You can log on to the NAS console to view the ID of the NAS file system in the region. You can also view the ID of the NAS file system from the drop-down list.
    File System Path   | The mount path of the NAS file system. In this example, set the parameter to /.
    Default Mount path | The path of the NAS data in the job. In this example, set the parameter to /mnt/data.

    Important

    The region of the DSW instance must be the same as the region of the NAS file system in which training data and code are stored.

  4. Click Submit.
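
The relationship between File System Path and Default Mount path can be sketched in Python. This is a minimal illustration, assuming that the directory given by File System Path is what appears at the mount path inside the job. The function name job_path is hypothetical and is not part of any PAI SDK.

```python
import posixpath


def job_path(nas_path, file_system_path="/", mount_path="/mnt/data"):
    """Map a path on the NAS file system to the path seen inside the job.

    Assumption: the directory `file_system_path` on the NAS file system is
    mounted at `mount_path` in the job container.
    """
    root = posixpath.normpath(file_system_path)
    target = posixpath.normpath(nas_path)
    rel = posixpath.relpath(target, root)
    # Reject paths that are not located under the mounted directory.
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{nas_path} is outside {file_system_path}")
    return posixpath.normpath(posixpath.join(mount_path, rel))


print(job_path("/pytorch_transfer_learning/main.py"))
```

With the values used in this example (/ and /mnt/data), a file stored at /pytorch_transfer_learning/main.py on the NAS file system maps to /mnt/data/pytorch_transfer_learning/main.py, which matches the path used in the startup command in Step 5.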

Step 2: Create a DSW instance

When you create the DSW instance, click Add in the Environment information section to add a dataset, select the NAS dataset that you created in Step 1, and set Mount Path to /mnt/data/. Then set Working Directory to dataset-/mnt/data/.

For information about other parameters, see Create a DSW instance.

Step 3: Prepare data

The data used in this topic is publicly available. You can download it from https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz and then decompress it, as described in the following steps.

  1. Go to the development environment of a DSW instance.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the upper-left corner of the page, select the region where you want to use PAI.

    4. In the left-side navigation pane, choose Model Training > Data Science Workshop (DSW).

    5. Optional: On the Data Science Workshop (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.

    6. Click Open in the Actions column of the instance.

  2. In the DSW development environment, click the Notebook tab in the top navigation bar.

  3. Download data.

    1. Click the Create Folder icon in the upper-left toolbar to create a folder. In this example, the folder is named pytorch_transfer_learning.

    2. In the DSW development environment, click the Terminal tab in the top navigation bar.

    3. On the Terminal tab, run the following commands. The cd command switches to the folder that you created, and the wget command downloads the dataset.

      cd /mnt/workspace/pytorch_transfer_learning/
      wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz

      https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz is the URL for downloading the dataset file.

    4. Run the tar -xf ./data.tar.gz command to decompress the dataset.

    5. Click the Notebook tab. Go to the pytorch_transfer_learning directory, right-click the extracted hymenoptera_data folder, and then click Rename to rename the folder to input.
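
The download, decompress, and rename steps above can also be scripted. The following Python sketch uses only the standard library. The helper names download and extract_and_rename are hypothetical, and the sketch assumes that the archive extracts to a top-level hymenoptera_data folder, as described in this step.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz"


def download(url, dest_dir):
    """Download `url` into `dest_dir` and return the local archive path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))
    return archive


def extract_and_rename(archive, extracted_name="hymenoptera_data", new_name="input"):
    """Extract the archive next to itself, then rename the extracted folder."""
    archive = Path(archive)
    dest_dir = archive.parent
    target = dest_dir / new_name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    # Use the safer "data" extraction filter where the interpreter supports it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive) as tar:
        tar.extractall(dest_dir, **extra)
    (dest_dir / extracted_name).rename(target)
    return target


# Usage inside the DSW instance (requires network access):
# extract_and_rename(download(DATA_URL, "/mnt/workspace/pytorch_transfer_learning"))
```

The script refuses to overwrite an existing input folder, so it is safe to rerun after a partial attempt once the stale folder is removed.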

Step 4: Prepare the training code and the model storage folder

  1. On the Terminal tab of the DSW instance, run the wget command to download the training code to the pytorch_transfer_learning folder.

    cd /mnt/workspace/pytorch_transfer_learning/
    wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py

    https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py is the URL for downloading the training code.

  2. In the pytorch_transfer_learning folder, create a folder named output to store the trained model.

    mkdir output

  3. View the content contained in the pytorch_transfer_learning folder.

    The folder contains the following content:

    • input: the folder that stores the training data.

    • main.py: the training code file.

    • output: the folder that stores the trained model.
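
Before you submit the job, you can verify this layout with a short script. The following Python sketch is one possible check and is not part of the tutorial code. The function name missing_entries is hypothetical, and the path is the DSW-side folder used in the preceding steps.

```python
from pathlib import Path

# Entries that this topic expects in the pytorch_transfer_learning folder.
EXPECTED = {"input": "dir", "main.py": "file", "output": "dir"}


def missing_entries(root):
    """Return the expected entries that are absent or of the wrong kind."""
    root = Path(root)
    missing = []
    for name, kind in EXPECTED.items():
        path = root / name
        ok = path.is_dir() if kind == "dir" else path.is_file()
        if not ok:
            missing.append(name)
    return missing


# An empty list means that the folder is ready for the job.
print(missing_entries("/mnt/workspace/pytorch_transfer_learning"))
```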


Step 5: Create a job

  1. Go to the Create Job page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. Find the workspace that you want to manage and click the workspace ID.

    3. In the left-side navigation pane of the Workspace page, choose Model Development and Training > Deep Learning Containers (DLC). On the Distributed Training Jobs page, click Create Job. The Create Job page appears.

  2. On the Create Job page, configure the required parameters. The following table describes key parameters.

    Section                | Parameter             | Description
    -----------------------+-----------------------+------------
    Basic Information      | Job Name              | Specify the name of the job.
                           | Node Image            | Select Alibaba Cloud Image and then select a PyTorch image from the drop-down list. In this example, the pytorch-training:1.12pai-gpu-py38-cu113-ubuntu20.04 image is used.
                           | Datasets              | Select the NAS dataset that you created in Step 1.
                           | Code Builds           | You do not need to configure this parameter.
                           | Startup Command       | Set this parameter to python /mnt/data/pytorch_transfer_learning/main.py -i /mnt/data/pytorch_transfer_learning/input -o /mnt/data/pytorch_transfer_learning/output.
                           | Third-party Libraries | Click Select from List and then enter the following content in the text box:
                           |                       | numpy==1.16.4
                           |                       | absl-py==0.11.0
    Resource Configuration | Resource Quota        | Select Public Resources.
                           | Framework             | Select PyTorch.
                           | Job Resource          | Select an instance type and specify the number of instances. Example: select ecs.g6.xlarge on the CPU tab of the Resource Type page. Set the Nodes parameter to 1.

  3. Click OK.
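
The startup command is built from the dataset mount path and the folder created in Step 3 and Step 4. The following Python sketch shows how the pieces fit together, which can help if you use a different mount path or folder name. The function name startup_command is hypothetical; only the resulting command string and the library pins come from this topic.

```python
import posixpath
import shlex

MOUNT_PATH = "/mnt/data"  # Default Mount path of the NAS dataset (Step 1)
PROJECT = "pytorch_transfer_learning"  # folder created in Step 3

# Version pins entered in the Third-party Libraries text box.
REQUIREMENTS = ["numpy==1.16.4", "absl-py==0.11.0"]


def startup_command(mount_path=MOUNT_PATH, project=PROJECT):
    """Build the job startup command from the mount path and project folder."""
    root = posixpath.join(mount_path, project)
    args = [
        "python", posixpath.join(root, "main.py"),
        "-i", posixpath.join(root, "input"),    # training data folder
        "-o", posixpath.join(root, "output"),   # trained model folder
    ]
    # shlex.join quotes any argument that contains shell-unsafe characters.
    return shlex.join(args)


print(startup_command())
print("\n".join(REQUIREMENTS))
```

The job reads the code and data through /mnt/data because that is where the NAS dataset is mounted in the job, so the paths differ from the /mnt/workspace paths used in the DSW terminal.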

Step 6: View the job details and logs

  1. On the Distributed Training Jobs page, click the name of the job that you want to view.

  2. On the Details page, view the Basic Information and Resources of the job.

  3. In the lower part of the Details page, click the Instance tab, find the instance that you want to view, and then click Log in the Actions column to view the logs.

    [Figure: example of the job logs]
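
After the job finishes, you can also confirm from the DSW instance that the job wrote files to the output folder. The following Python sketch lists those files. It assumes that the folder is reachable at the DSW path used earlier in this topic, and it makes no assumption about the names of the files that main.py writes. The function name list_outputs is hypothetical.

```python
from pathlib import Path


def list_outputs(output_dir):
    """Return (relative path, size in bytes) for every file under output_dir."""
    output_dir = Path(output_dir)
    if not output_dir.is_dir():
        return []
    files = [p for p in output_dir.rglob("*") if p.is_file()]
    return sorted((p.relative_to(output_dir).as_posix(), p.stat().st_size) for p in files)


for name, size in list_outputs("/mnt/workspace/pytorch_transfer_learning/output"):
    print(f"{name}\t{size} bytes")
```

An empty listing after a job that reports success suggests that the -o path in the startup command does not point to the mounted output folder.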