Platform For AI: Submit a standalone training job that uses PyTorch

Last Updated: Sep 05, 2023

This topic describes how to use Deep Learning Containers (DLC) of Machine Learning Platform for AI (PAI) to run a standalone (single-node) training job that trains a transfer learning model with the PyTorch framework.

Step 1: Prepare data

In this topic, the training data is pre-stored in a publicly accessible location, and the job command in Step 3 downloads it. You do not need to prepare additional data.
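To inspect the dataset locally before submitting the job, the download and unpack stages of the job command in Step 3 can be approximated in Python. The following is a minimal sketch that uses only the standard library. The helper names `download` and `unpack_dataset` are illustrative and not part of PAI, and the sketch assumes, as the `mv` stage of the job command implies, that the archive unpacks to a top-level `hymenoptera_data` directory.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the job command in Step 3.
DATA_URL = (
    "https://pai-public-data.oss-cn-beijing.aliyuncs.com/"
    "hol-pytorch-transfer-cv/data.tar.gz"
)


def download(url: str, dest: Path) -> Path:
    """Download url to dest (the counterpart of the wget stage)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def unpack_dataset(
    archive: Path,
    workdir: Path,
    extracted_name: str = "hymenoptera_data",  # assumed top-level directory
    target_name: str = "input",
) -> Path:
    """Extract the archive, then rename the extracted directory (tar -xf, mv)."""
    with tarfile.open(archive) as tar:
        tar.extractall(workdir)
    target = workdir / target_name
    (workdir / extracted_name).rename(target)
    return target
```

For example, `unpack_dataset(download(DATA_URL, Path("data.tar.gz")), Path("."))` would leave the dataset in `./input`, the directory that the job command passes to the training script with `-i`.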

Step 2: Prepare the training code and model storage file

In this topic, the training script (main.py) is pre-stored in a publicly accessible location, and the job command in Step 3 downloads it. You do not need to write additional code. The job command also creates an output directory in which the trained model is stored.
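The contents of the downloaded main.py are not shown in this topic. As an illustration only, a transfer learning script that accepts the same `-i` and `-o` options might look like the following sketch. Everything beyond those two options is an assumption: the `--epochs` option, the `train/<class name>/` dataset layout, the choice of ResNet-18, the hyperparameters, and the `model.pth` file name are hypothetical. The sketch requires torchvision 0.13 or later for the `weights=` argument.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the -i/-o options that the job command passes to the script."""
    parser = argparse.ArgumentParser(description="Transfer learning sketch")
    parser.add_argument("-i", "--input", required=True, help="dataset directory")
    parser.add_argument("-o", "--output", required=True, help="model output directory")
    parser.add_argument("--epochs", type=int, default=2, help="hypothetical option")
    return parser.parse_args(argv)


def build_model(num_classes):
    """Load a pretrained ResNet-18, freeze it, and attach a new classifier head."""
    # Imported lazily so that argument parsing works without PyTorch installed.
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():
        param.requires_grad = False  # keep the pretrained features fixed
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head
    return model


def train(args):
    """Train only the new head and save its weights under the output directory."""
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    # Assumed layout: <input>/train/<class name>/<image files>
    train_set = datasets.ImageFolder(str(Path(args.input) / "train"), preprocess)
    loader = DataLoader(train_set, batch_size=16, shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = build_model(len(train_set.classes)).to(device)
    optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(args.epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")

    out_dir = Path(args.output)
    out_dir.mkdir(parents=True, exist_ok=True)
    torch.save(model.state_dict(), out_dir / "model.pth")  # hypothetical file name
```

To run the sketch as a script, add a call to `train(parse_args())` at the end of the file. Freezing the pretrained layers and optimizing only `model.fc` is one common form of transfer learning; fine-tuning all layers is an equally valid alternative.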

Step 3: Create a job

  1. Go to the Create Job page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane of the Workspace page, choose Model Development and Training > Deep Learning Containers (DLC). Click Create Job on the Distributed Training Jobs page. The Create Job page appears.

  2. On the Create Job page, set the parameters in the following table, and use the default values for the remaining parameters.

    | Parameter | Description |
    | --- | --- |
    | Resource Group | Select Public Resource Group. |
    | Job Name | Enter a name for the job. Example: torch-sample. |
    | Job Type | Select PyTorch. |
    | Job Command | Enter the following command. It downloads and extracts the training data, creates the output directory, downloads the training script, runs the training, and lists the generated model files: wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz && tar -xf ./data.tar.gz && mv ./hymenoptera_data/ ./input && mkdir output && wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py && python main.py -i ./input -o ./output && ls ./output |
    | Node Image | Select Alibaba Cloud Image and select a PyTorch image from the drop-down list. |
    | Number of Nodes | Set the value to 1. |
    | Node Configuration | Click GPU Instance and then select ecs.gn6e-c12g1.3xlarge. |

  3. Click Submit.

    The Distributed Training Jobs page appears.
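The Job Command value in step 2 is a single line that chains seven shell stages with `&&`, so each stage runs only if the previous one succeeded. The following sketch rebuilds that line from its stages to make the structure easier to read and adapt. The names `BASE_URL`, `STEPS`, and `build_job_command` are illustrative and not part of PAI.

```python
BASE_URL = (
    "https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv"
)

# One entry per stage of the job command, in execution order.
STEPS = [
    f"wget {BASE_URL}/data.tar.gz",           # download the dataset archive
    "tar -xf ./data.tar.gz",                  # unpack the archive
    "mv ./hymenoptera_data/ ./input",         # rename the dataset directory
    "mkdir output",                           # create the model output directory
    f"wget {BASE_URL}/main.py",               # download the training script
    "python main.py -i ./input -o ./output",  # run the training
    "ls ./output",                            # list the files that training produced
]


def build_job_command(steps):
    """Join the stages with && so that a failing stage stops the chain."""
    return " && ".join(steps)


print(build_job_command(STEPS))
```

Because of the `&&` chaining, the final `ls ./output` stage runs only when the training stage exits successfully, so its output in the job log is a quick indication that training completed.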

Step 4: View the details and logs of the training job

  1. On the Distributed Training Jobs page, click the name of the job that you want to view.

  2. On the Details page, view the Basic Information and Resources of the job.

  3. In the Instances section of the Job Details page, find the instance whose logs you want to view and click Log in the Actions column.