This topic describes how to use the PyTorch component.

Procedure

Log on to the PAI console and navigate to the Algorithm Platform of Machine Learning Platform for AI (PAI). In the left-side navigation pane of the Algorithm Platform, click Components. Find the PyTorch component in the Deep Learning folder and the Read File Data component in the Data Source/Target folder. PyTorch can read only OSS data.

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    Tab: Parameters Setting
      • Python Version: The Python version. Valid values: 2.7 and 3.6.
      • Python Code Files: The path of the code file. If you upload a tar.gz package, you must specify Primary Python File. If you upload a single PY file, Primary Python File is not required.
      • Primary Python File: Optional. The name of the primary code file. If you upload a single file, this parameter is not required. If you upload a package, you must set this parameter to the path in which the file is stored within the package. Example: train/train.py.
      • Data Source Directory: The OSS path in which data sources are stored.
      • Configuration File Hyperparameters and Custom Parameters: The hyperparameter file, in key-value pair format.
      • Checkpoint Output Directory/Model Input Directory: The output directory of the model.
      • Limit Job Runtime: Specifies whether to limit the time for job execution.
      • Maximum Scheduled Job Runtime: The maximum execution time. Unit: hours. Default value: 24.
    Tab: Tuning
      • GPUs per Worker: The number of GPUs for each worker.
      • Workers: The number of distributed machines.
  • PAI command
    PAI -name pytorch_ext  -DossHost="oss-cn-beijing-internal.aliyuncs.com"
          -Dcluster="{\"worker\":{\"gpu\":100}}" -DworkerCount="2"
          -Dpython="3.6"
          -Dinputs="oss://${The name of the OSS bucket}.oss-cn-beijing-internal.aliyuncs.com/mnist/"
          -Darn="acs:ram::168069136******:role/aliyunodpspaidefaultrole"
          -Dscript="oss://${The name of the OSS bucket}.oss-cn-beijing-internal.aliyuncs.com/pytorch/pytorch_dist_mnist.py"
          -DcheckpointDir="oss://${The name of the OSS bucket}.oss-cn-beijing-internal.aliyuncs.com/pytorch/";
    Parameter descriptions:
      • DossHost: The endpoint of the OSS bucket.
      • Dcluster: The number of GPUs for each worker. The value 100 indicates one GPU, and the value 200 indicates two GPUs.
      • DworkerCount: The number of workers.
      • Dpython: The Python version. Valid values: 2.7 and 3.6.
      • Dinputs: The input path of data sources.
      • Darn: The ARN of the RAM role that is used to access OSS.
      • Dscript: The OSS path of the code file.
      • DcheckpointDir: The OSS path in which the model is stored.
      • DhyperParameters: The hyperparameter file.
    Note: To obtain the parameters configured in the console, you must use a parser in your code. For example, if the OSS path of the data source is specified in the component parameters, you can parse the inputs argument in your code to obtain the path. For more information, see Example.
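The note above can be sketched in Python with the standard argparse module. This is a minimal illustration, not the component's guaranteed interface: the flag names mirror the -Dinputs, -DcheckpointDir, and -DhyperParameters command parameters, and the one-key=value-per-line hyperparameter file format is an assumption; verify both against the sample code and your PAI version.

```python
import argparse


def parse_component_args(argv=None):
    """Receive the paths that the component passes to the script.

    The flag names mirror the -Dinputs/-DcheckpointDir/-DhyperParameters
    command parameters; confirm the exact names for your PAI version.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--inputs", default="",
                        help="OSS path of the input data")
    parser.add_argument("--checkpointDir", default="",
                        help="OSS output path for model checkpoints")
    parser.add_argument("--hyperParameters", default="",
                        help="path of the hyperparameter file")
    # Ignore any extra flags that the platform may append.
    args, _ = parser.parse_known_args(argv)
    return args


def read_hyperparameters(path):
    """Parse a hyperparameter file of key=value lines (assumed format)."""
    params = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                params[key.strip()] = value.strip()
    return params
```

For example, parse_component_args(["--inputs", "oss://bucket/mnist/"]).inputs returns the data path that was configured in the console.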

Example

  1. Download the distributed MNIST sample code for PyTorch and upload the file to OSS. You must enter your AccessKey pair.
  2. Download the MNIST training file and MNIST test file, and upload them to the OSS folder.
  3. Configure the parameters as needed.
    The following figure shows an example of parameter settings.
  4. Configure resources.
    Because the sample code is distributed, the number of workers must be greater than 1. The following figure shows an example of parameter settings.
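In a distributed run, each worker needs to know its rank and the total number of workers. As a minimal sketch, assuming the launcher exposes these through the conventional PyTorch RANK and WORLD_SIZE environment variables (whether the PAI PyTorch component sets exactly these variables should be verified against the downloaded sample code):

```python
import os


def distributed_settings(env=os.environ):
    """Read this worker's rank and the world size.

    RANK/WORLD_SIZE are the conventional PyTorch launcher variables;
    that PAI sets these exact names is an assumption.
    """
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    if world_size < 2:
        # Mirrors the requirement above: the distributed sample
        # needs more than one worker.
        raise ValueError("set Workers to a value greater than 1")
    return rank, world_size
```

With two workers configured, the worker whose rank is 0 typically acts as the coordinating process.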