This topic describes what you need to prepare before submitting training jobs, including compute resources, an image, a dataset, and a code build. Platform for AI (PAI) allows you to specify datasets stored in Apsara File Storage NAS (NAS) file systems, Cloud Parallel File Storage (CPFS) file systems, or Object Storage Service (OSS) buckets and code builds stored in Git repositories.
Prerequisites
If you use OSS to store data, make sure that the role that you use is granted the permissions to access OSS. Otherwise, I/O errors may occur when the system accesses the data stored in your OSS bucket. For more information about how to grant a service-linked role the permissions to access OSS, see Grant the permissions that are required to use DLC.
Limits
OSS is a distributed object storage service, not a file system. When you use OSS to store data, some file system features are not supported. For example, you cannot append data to or overwrite existing files in OSS buckets.
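Because existing files in an OSS bucket cannot be appended to or overwritten, one possible workaround is to write each result to a new file. The following is a minimal sketch under that assumption. The function name `write_once`, the file naming scheme, and the temporary directory that stands in for an OSS mount path are all hypothetical.

```python
import os
import tempfile


def write_once(directory, prefix, step, payload):
    """Write payload to a new, uniquely named file instead of appending."""
    # One file per step, for example metrics-000003.txt.
    path = os.path.join(directory, "{}-{:06d}.txt".format(prefix, step))
    # Mode "x" creates the file and raises FileExistsError if it already exists,
    # so an existing file is never reopened or overwritten.
    with open(path, "x") as handle:
        handle.write(payload)
    return path


# Hypothetical stand-in for the directory where an OSS dataset is mounted.
output_dir = tempfile.mkdtemp()
paths = [
    write_once(output_dir, "metrics", step, "loss={}\n".format(1.0 / (step + 1)))
    for step in range(3)
]
print([os.path.basename(p) for p in paths])
```

Each call produces a separate file, so the training code never needs append or overwrite semantics on the storage backend.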
Step 1: Prepare resources
Before you submit a training job, you need to prepare computing resources for the training. Select one of the following resources:
The public resource group
After you complete Deep Learning Containers (DLC) authorization, the system automatically prepares a public resource group for you. You do not need to manually create a resource group. For more information, see Grant the permissions that are required to use DLC. You can select the public resource group when you configure a job on the Create Job page in your workspace.
General computing resources
You can create a dedicated resource group, purchase the required general computing resources, and allocate computing resources in the dedicated resource group by creating resource quotas and associating them with workspaces. After you associate a resource quota with a workspace, you can use the resource quota to run training jobs in the workspace. For more information, see Resource quota for general computing resources.
Intelligent computing LINGJUN resources
If you want to leverage the high performance offered by LINGJUN resources, you need to prepare the intelligent computing LINGJUN resources for the training jobs and associate the resources with the workspace. For more information, see Resource quota for intelligent computing LINGJUN resources.
Step 2: Prepare an image
Before you submit a training job, you need to prepare the image for the training environment. Select one of the following image types:
Community image: If you use a general development environment, you can select a public standard image from open source communities without further configuration.
Alibaba Cloud image: PAI provides official images based on different frameworks that are optimized for Alibaba Cloud services. These images are suitable for training jobs that use Alibaba Cloud services and help you achieve improved compatibility and performance.
Custom image: If you have specific requirements on training environments or dependencies, you can create a custom image to meet your business requirements.
The following table lists the available community images and Alibaba Cloud images when you submit a distributed training job.
Community image
Images
Standard images provided by the community. They support resources of various types.
registry.${region}.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:2.3.0-cpu-py36-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:2.3.0-gpu-py36-cu101-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4-cpu-py36-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4-gpu-py36-cu100-ubuntu18.04
Replace ${region} with a specific region. Example values:
cn-hangzhou
cn-shanghai
cn-qingdao
cn-beijing
cn-zhangjiakou
cn-huhehaote
cn-shenzhen
cn-chengdu
cn-hongkong
ap-southeast-1
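The substitution of ${region} can be sketched in Python as follows. This is only an illustration: the helper name `image_url` is hypothetical, and the template string is one of the community image URLs listed above.

```python
from string import Template

# Image URL template copied from the list above; ${region} is the placeholder.
IMAGE_TEMPLATE = Template(
    "registry.${region}.aliyuncs.com/pai-dlc/"
    "pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04"
)


def image_url(region):
    """Return the image URL for the given region ID, such as cn-hangzhou."""
    return IMAGE_TEMPLATE.substitute(region=region)


url = image_url("cn-hangzhou")
print(url)
```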
The following table lists the URLs of the community images when ${region} is set to cn-hangzhou.
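${region} | Framework | CPU/GPU | Python version | Image URL |
cn-hangzhou | TensorFlow 2.3 | CPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:2.3.0-cpu-py36-ubuntu18.04 |
cn-hangzhou | TensorFlow 2.3 | GPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:2.3.0-gpu-py36-cu101-ubuntu18.04 |
cn-hangzhou | TensorFlow 1.15 | CPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4-cpu-py36-ubuntu18.04 |
cn-hangzhou | TensorFlow 1.15 | GPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4-gpu-py36-cu100-ubuntu18.04 |
cn-hangzhou | PyTorch 1.6 | GPU | 3.7 (py37) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04 |
cn-hangzhou | PyTorch 1.7 | GPU | 3.7 (py37) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04 |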
Image versions
This section describes the operating systems, Python versions, and third-party libraries supported by each community image.
tensorflow-training:2.3-cpu-py36-ubuntu18.04
tensorflow-training:2.3-gpu-py36-cu101-ubuntu18.04
tensorflow-training:1.15-cpu-py36-ubuntu18.04
tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04
pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04
Alibaba Cloud image
Images
Official images provided by Alibaba Cloud.
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-mkl-cpu-py27-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py27-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py36-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-mkl-cpu-py36-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py36-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI-gpu-py27-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI-gpu-py36-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/pytorch-training:1.3.1PAI-gpu-py37-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/pytorch-training:1.4.0PAI-gpu-py37-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/pytorch-training:1.5.1PAI-gpu-py37-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/pytorch-training:1.6.0PAI-gpu-py37-cu100-ubuntu16.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py27-cu101-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py36-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py36-cu101-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4PAI-cpu-py36-ubuntu18.04
registry.${region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4PAI-gpu-py36-cu101-ubuntu18.04
Replace ${region} with a specific region. Example values:
cn-hangzhou
cn-shanghai
cn-qingdao
cn-beijing
cn-zhangjiakou
cn-huhehaote
cn-shenzhen
cn-chengdu
cn-hongkong
ap-southeast-1
The following table lists the URLs of the PAI images when ${region} is set to cn-hangzhou.
${region} | Framework | CPU/GPU | Python version | Image URL |
cn-hangzhou | TensorFlow 1.12 | CPU | 2.7 (py27) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-cpu-py27-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-cpu-py27-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu18.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu18.04 |
cn-hangzhou | TensorFlow 1.12 | MKL-CPU | 2.7 (py27) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-mkl-cpu-py27-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-mkl-cpu-py27-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-mkl-cpu-py27-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-mkl-cpu-py27-ubuntu16.04 |
cn-hangzhou | TensorFlow 1.12 | GPU | 2.7 (py27) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py27-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py27-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-gpu-py27-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-gpu-py27-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py27-cu101-ubuntu18.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py27-cu101-ubuntu18.04 |
cn-hangzhou | TensorFlow 1.12 | CPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py36-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py36-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-cpu-py36-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-cpu-py36-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py36-ubuntu18.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py36-ubuntu18.04 |
cn-hangzhou | TensorFlow 1.12 | MKL-CPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-mkl-cpu-py36-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-mkl-cpu-py36-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-mkl-cpu-py36-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-mkl-cpu-py36-ubuntu16.04 |
cn-hangzhou | TensorFlow 1.12 | GPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py36-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py36-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-gpu-py36-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI2011-gpu-py36-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py36-cu101-ubuntu18.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-gpu-py36-cu101-ubuntu18.04 |
cn-hangzhou | TensorFlow 1.15 | GPU | 2.7 (py27) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI-gpu-py27-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI-gpu-py27-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI2011-gpu-py27-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI2011-gpu-py27-cu100-ubuntu16.04 |
cn-hangzhou | TensorFlow 1.15 | CPU | 3.6 (py36) | |
cn-hangzhou | TensorFlow 1.15 | GPU | 3.6 (py36) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI-gpu-py36-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI-gpu-py36-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI2011-gpu-py36-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0PAI2011-gpu-py36-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4PAI-gpu-py36-cu101-ubuntu18.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15.4PAI-gpu-py36-cu101-ubuntu18.04 |
cn-hangzhou | PyTorch 1.3 | GPU | 3.7 (py37) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.3.1PAI-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.3.1PAI-gpu-py37-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.3.1PAI2011-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.3.1PAI2011-gpu-py37-cu100-ubuntu16.04 |
cn-hangzhou | PyTorch 1.4 | GPU | 3.7 (py37) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.4.0PAI-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.4.0PAI-gpu-py37-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.4.0PAI2011-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.4.0PAI2011-gpu-py37-cu100-ubuntu16.04 |
cn-hangzhou | PyTorch 1.5 | GPU | 3.7 (py37) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.5.1PAI-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.5.1PAI-gpu-py37-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.5.1PAI2011-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.5.1PAI2011-gpu-py37-cu100-ubuntu16.04 |
cn-hangzhou | PyTorch 1.6 | GPU | 3.7 (py37) | registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.6.0PAI-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.6.0PAI-gpu-py37-cu100-ubuntu16.04 registry.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.6.0PAI2011-gpu-py37-cu100-ubuntu16.04 registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.6.0PAI2011-gpu-py37-cu100-ubuntu16.04 |
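In the table, each `registry.` URL is paired with a `registry-vpc.` URL that differs only in the host prefix. Assuming that naming pattern holds for the image you use, the VPC variant could be derived as sketched below. The helper name `to_vpc_url` is hypothetical.

```python
def to_vpc_url(public_url):
    """Map registry.<region>... to registry-vpc.<region>... as paired in the table."""
    prefix = "registry."
    if not public_url.startswith(prefix):
        raise ValueError("not a public registry URL: " + public_url)
    return "registry-vpc." + public_url[len(prefix):]


public = (
    "registry.cn-hangzhou.aliyuncs.com/pai-dlc/"
    "pytorch-training:1.6.0PAI-gpu-py37-cu100-ubuntu16.04"
)
vpc = to_vpc_url(public)
print(vpc)
```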
Image versions
This section describes the operating systems, Python versions, and third-party libraries supported by each Alibaba Cloud image.
tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 2.7.18 Anaconda
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.15 | aliyun-python-sdk-kms 2.14.0 | astor 0.8.1 |
backports.weakref 1.0.post1 | certifi 2020.6.20 | crcmod 1.7 | Cython 0.29.14 |
enum34 1.1.6 | funcsigs 1.0.2 | futures 3.3.0 | gast 0.4.0 |
grpcio 1.27.2 | h5py 2.10.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 |
Keras-Preprocessing 1.1.2 | Markdown 3.1.1 | mkl-fft 1.0.15 | mkl-random 1.1.0 |
mkl-service 2.3.0 | mock 3.0.5 | numpy 1.16.4 | opencv-python 4.2.0.32 |
oss2 2.9.1 | paiio 0.1.0 | pip 9.0.1 | protobuf 3.14.0 |
pycryptodome 3.9.7 | pyodps 0.10.4 | pypai 1.1.0+tensorflow.1.12.2pai2011 | requests 2.13.0 |
setuptools 36.4.0 | six 1.15.0 | tensorboard 1.12.2 | tensorflow 1.12.2PAI2011 |
termcolor 1.1.0 | toposort 1.5 | Werkzeug 1.0.1 | wheel 0.35.1 |
tensorflow-training:1.12.2PAI-mkl-cpu-py27-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 2.7.18 Anaconda
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.15 | aliyun-python-sdk-kms 2.14.0 | astor 0.8.1 |
backports.weakref 1.0.post1 | certifi 2020.6.20 | crcmod 1.7 | Cython 0.29.14 |
enum34 1.1.6 | funcsigs 1.0.2 | futures 3.3.0 | gast 0.4.0 |
grpcio 1.27.2 | h5py 2.10.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 |
Keras-Preprocessing 1.1.2 | Markdown 3.1.1 | mkl-fft 1.0.15 | mkl-random 1.1.0 |
mkl-service 2.3.0 | mock 3.0.5 | numpy 1.16.4 | opencv-python 4.2.0.32 |
oss2 2.9.1 | paiio 0.1.0 | pip 9.0.1 | protobuf 3.14.0 |
pycryptodome 3.9.7 | pyodps 0.10.4 | pypai 1.1.0+tensorflow.1.12.2pai2011 | requests 2.13.0 |
setuptools 36.4.0 | six 1.15.0 | tensorboard 1.12.2 | tensorflow 1.12.2PAI2011 |
termcolor 1.1.0 | toposort 1.5 | Werkzeug 1.0.1 | wheel 0.35.1 |
tensorflow-training:1.12.2PAI-gpu-py27-cu100-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 2.7.18 Anaconda
CUDA version: 10.0
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.15 | aliyun-python-sdk-kms 2.14.0 | astor 0.8.1 |
backports.weakref 1.0.post1 | certifi 2020.6.20 | crcmod 1.7 | Cython 0.29.14 |
enum34 1.1.6 | funcsigs 1.0.2 | futures 3.3.0 | gast 0.4.0 |
grpcio 1.27.2 | h5py 2.10.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 |
Keras-Preprocessing 1.1.2 | Markdown 3.1.1 | mkl-fft 1.0.15 | mkl-random 1.1.0 |
mkl-service 2.3.0 | mock 3.0.5 | numpy 1.16.4 | opencv-python 4.2.0.32 |
oss2 2.9.1 | paiio 0.1.0 | pip 9.0.1 | protobuf 3.14.0 |
pycryptodome 3.9.7 | pyodps 0.10.4 | pypai 1.1.0+tensorflow.gpu.1.12.2pai2011 | requests 2.13.0 |
setuptools 36.4.0 | six 1.15.0 | tensorboard 1.12.2 | tensorflow-gpu 1.12.2PAI2011 |
termcolor 1.1.0 | toposort 1.5 | Werkzeug 1.0.1 | wheel 0.35.1 |
subprocess32 3.5.4 | tao-wrapper 0.1.1 | whale 0.0.2 |
tensorflow-training:1.12.2PAI-cpu-py36-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 3.6.12 Anaconda
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.29 | aliyun-python-sdk-core-v3 2.13.11 | aliyun-python-sdk-kms 2.14.0 |
astor 0.8.1 | cached-property 1.5.2 | certifi 2020.12.5 | crcmod 1.7 |
Cython 0.29.21 | gast 0.4.0 | grpcio 1.31.0 | h5py 3.1.0 |
importlib-metadata 3.4.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 | Keras-Preprocessing 1.1.2 |
Markdown 3.3.3 | mkl-fft 1.2.0 | mkl-random 1.1.1 | mkl-service 2.3.0 |
numpy 1.16.4 | opencv-python 4.2.0.32 | oss2 2.12.1 | paiio 0.1.0 |
pip 20.2.4 | protobuf 3.14.0 | pycryptodome 3.9.9 | pyodps 0.10.4 |
pypai 1.1.0+tensorflow.1.12.2pai2011 | requests 2.13.0 | setuptools 50.3.1.post20201107 | six 1.15.0 |
tensorboard 1.12.2 | tensorflow 1.12.2PAI2011 | termcolor 1.1.0 | toposort 1.5 |
typing-extensions 3.7.4.3 | Werkzeug 1.0.1 | wheel 0.35.1 | zipp 3.4.0 |
tensorflow-training:1.12.2PAI-mkl-cpu-py36-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 3.6.12 Anaconda
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.29 | aliyun-python-sdk-core-v3 2.13.11 | aliyun-python-sdk-kms 2.14.0 |
astor 0.8.1 | cached-property 1.5.2 | certifi 2020.12.5 | crcmod 1.7 |
Cython 0.29.21 | gast 0.4.0 | grpcio 1.31.0 | h5py 3.1.0 |
importlib-metadata 3.4.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 | Keras-Preprocessing 1.1.2 |
Markdown 3.3.3 | mkl-fft 1.2.0 | mkl-random 1.1.1 | mkl-service 2.3.0 |
numpy 1.16.4 | opencv-python 4.2.0.32 | oss2 2.12.1 | paiio 0.1.0 |
pip 20.2.4 | protobuf 3.14.0 | pycryptodome 3.9.9 | pyodps 0.10.4 |
pypai 1.1.0+tensorflow.1.12.2pai2011 | requests 2.13.0 | setuptools 50.3.1.post20201107 | six 1.15.0 |
tensorboard 1.12.2 | tensorflow 1.12.2PAI2011 | termcolor 1.1.0 | toposort 1.5 |
typing-extensions 3.7.4.3 | Werkzeug 1.0.1 | wheel 0.35.1 | zipp 3.4.0 |
tensorflow-training:1.12.2PAI-gpu-py36-cu100-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 3.6.12 Anaconda
CUDA version: 10.0
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.29 | aliyun-python-sdk-core-v3 2.13.11 | aliyun-python-sdk-kms 2.14.0 |
astor 0.8.1 | cached-property 1.5.2 | certifi 2020.12.5 | crcmod 1.7 |
Cython 0.29.21 | gast 0.4.0 | grpcio 1.31.0 | h5py 3.1.0 |
importlib-metadata 3.4.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 | Keras-Preprocessing 1.1.2 |
Markdown 3.3.3 | mkl-fft 1.2.0 | mkl-random 1.1.1 | mkl-service 2.3.0 |
numpy 1.16.4 | opencv-python 4.2.0.32 | oss2 2.12.1 | paiio 0.1.0 |
pip 20.2.4 | protobuf 3.14.0 | pycryptodome 3.9.9 | pyodps 0.10.4 |
pypai 1.1.0+tensorflow.gpu.1.12.2pai2011 | requests 2.13.0 | setuptools 50.3.1.post20201107 | six 1.15.0 |
tensorboard 1.12.2 | tensorflow-gpu 1.12.2PAI2011 | termcolor 1.1.0 | toposort 1.5 |
typing-extensions 3.7.4.3 | Werkzeug 1.0.1 | wheel 0.35.1 | zipp 3.4.0 |
subprocess32 3.5.4 | tao-wrapper 0.1.1 | whale 0.0.2 |
tensorflow-training:1.15.0PAI-gpu-py27-cu100-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 2.7.18 Anaconda
CUDA version: 10.0
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.15 | aliyun-python-sdk-kms 2.14.0 | astor 0.8.1 |
backports.weakref 1.0.post1 | certifi 2020.6.20 | crcmod 1.7 | Cython 0.29.14 |
enum34 1.1.6 | funcsigs 1.0.2 | functools32 3.2.3.post2 | futures 3.3.0 |
gast 0.2.2 | google-pasta 0.2.0 | opt-einsum 2.3.2 | tensorflow-estimator 1.15.1 |
grpcio 1.27.2 | h5py 2.10.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 |
Keras-Preprocessing 1.1.2 | Markdown 3.1.1 | mkl-fft 1.0.15 | mkl-random 1.1.0 |
mkl-service 2.3.0 | mock 3.0.5 | numpy 1.16.4 | opencv-python 4.2.0.32 |
oss2 2.9.1 | paiio 0.1.0 | pip 9.0.1 | protobuf 3.14.0 |
pycryptodome 3.9.7 | pyodps 0.10.4 | pypai 1.1.0+tensorflow.gpu.1.15.0 | requests 2.13.0 |
setuptools 44.1.1 | six 1.15.0 | tensorboard 1.15.0 | tensorflow-gpu 1.15.0 |
termcolor 1.1.0 | toposort 1.5 | Werkzeug 1.0.1 | wheel 0.35.1 |
subprocess32 3.5.4 | tao-wrapper 0.1.1 | whale 0.0.2 | wrapt 1.12.1 |
tensorflow-training:1.15.0PAI-gpu-py36-cu100-ubuntu16.04
Operating system: Ubuntu 16.04.6 LTS
Python version: 3.6.12 Anaconda
CUDA version: 10.0
Third-party libraries: The following table lists the third-party libraries and versions.
Third-party library and version |
absl-py 0.11.0 | aliyun-python-sdk-core 2.13.29 | aliyun-python-sdk-core-v3 2.13.11 | aliyun-python-sdk-kms 2.14.0 |
astor 0.8.1 | cached-property 1.5.2 | certifi 2020.12.5 | crcmod 1.7 |
Cython 0.29.21 | gast 0.2.2 | grpcio 1.31.0 | h5py 3.1.0 |
importlib-metadata 3.4.0 | jmespath 0.10.0 | Keras-Applications 1.0.8 | Keras-Preprocessing 1.1.2 |
Markdown 3.3.3 | mkl-fft 1.2.0 | mkl-random 1.1.1 | mkl-service 2.3.0 |
numpy 1.16.4 | opencv-python 4.2.0.32 | oss2 2.12.1 | paiio 0.1.0 |
pip 20.2.4 | protobuf 3.14.0 | pycryptodome 3.9.9 | pyodps 0.10.4 |
pypai 1.1.0+tensorflow.gpu.1.15.0 | requests 2.13.0 | setuptools 50.3.1.post20201107 | six 1.15.0 |
tensorboard 1.15.0 | tensorflow-gpu 1.15.0 | termcolor 1.1.0 | toposort 1.5 |
typing-extensions 3.7.4.3 | Werkzeug 1.0.1 | wheel 0.35.1 | zipp 3.4.0 |
subprocess32 3.5.4 | tao-wrapper 0.1.1 | whale 0.0.2 | google-pasta 0.2.0 |
opt-einsum 3.3.0 | tensorflow-estimator 1.15.1 | wrapt 1.12.1 |
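To confirm that a running container matches the versions listed in a table like the one above, a check along the following lines may help. This is a sketch, not a PAI tool: the function names and the `EXPECTED` subset are illustrative, and the fallback import assumes the importlib-metadata backport that the py36 tables list.

```python
try:
    from importlib import metadata as importlib_metadata  # Python 3.8 and later
except ImportError:
    import importlib_metadata  # backport listed in the py36 image tables


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return importlib_metadata.version(name)
    except importlib_metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, installed)} for every version that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        installed = lookup(name)
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# A few versions copied from the tensorflow-training:1.15.0PAI-gpu-py36 table.
EXPECTED = {"numpy": "1.16.4", "protobuf": "3.14.0", "six": "1.15.0"}

if __name__ == "__main__":
    print(find_mismatches(EXPECTED))
```

An empty result means that every listed package is installed at the expected version.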
pytorch-training:1.3.1PAI-gpu-py37-cu100-ubuntu16.04
pytorch-training:1.4.0PAI-gpu-py37-cu100-ubuntu16.04
pytorch-training:1.5.1PAI-gpu-py37-cu100-ubuntu16.04
pytorch-training:1.6.0PAI-gpu-py37-cu100-ubuntu16.04
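The image tags in this topic appear to follow a common layout: framework version, CPU/GPU variant, Python version, an optional CUDA version, and the operating system. The following sketch parses a tag under that assumption. The pattern and the field names are inferred from the tags listed here and are not an official specification.

```python
import re

# Layout inferred from the tags in this topic, for example
# pytorch-training:1.6.0PAI-gpu-py37-cu100-ubuntu16.04
TAG_PATTERN = re.compile(
    r"^(?P<repository>[a-z-]+):"
    r"(?P<version>[^-]+)-"
    r"(?P<device>mkl-cpu|cpu|gpu)-"
    r"(?P<python>py\d+)"
    r"(?:-(?P<cuda>cu\d+))?-"
    r"(?P<os>ubuntu[\d.]+)$"
)


def parse_image_tag(tag):
    """Split an image tag into its components; cuda is None for CPU images."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError("unrecognized image tag: " + tag)
    return match.groupdict()


info = parse_image_tag("pytorch-training:1.6.0PAI-gpu-py37-cu100-ubuntu16.04")
print(info)
```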
Custom image
Custom images that you uploaded to PAI. If you choose to use a custom image, we recommend that you go to the page and add the custom image as an AI asset. This way, the image can be used by multiple training jobs. For more information, see View and add images.
Step 3: Prepare a dataset
Before you submit a deep learning job, you need to upload the dataset required by the job to an OSS bucket or a NAS file system and register the dataset so that the job can use the dataset.
Supported dataset types
Datasets of the following types are supported: OSS, General-purpose NAS, Extreme NAS, CPFS, and CPFS for Lingjun.
You can enable the dataset acceleration feature for OSS and CPFS datasets. When you submit a distributed training job, this feature can improve data read efficiency.
If you use LINGJUN resources to run DLC jobs, you can enable dataset acceleration only for OSS datasets.
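The acceleration rules above can be expressed as a small check. This is only a sketch of the stated rules: the function name `can_accelerate` and the string labels for dataset types are hypothetical and do not correspond to a PAI API.

```python
# Dataset types for which this topic says acceleration can be enabled.
ACCELERATED_TYPES = {"OSS", "CPFS"}


def can_accelerate(dataset_type, uses_lingjun):
    """Return True if dataset acceleration can be enabled under the rules above."""
    if uses_lingjun:
        # With LINGJUN resources, only OSS datasets can be accelerated.
        return dataset_type == "OSS"
    return dataset_type in ACCELERATED_TYPES


print(can_accelerate("CPFS", uses_lingjun=False))
print(can_accelerate("CPFS", uses_lingjun=True))
```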
Create a dataset
For information about how to configure the parameters, see Create and manage datasets. Take note of the following items:
When you create a dataset for training jobs, you need to select Alibaba Cloud Storage Service and set Property to Folder.
Unlike NAS, OSS is a distributed object storage service rather than a file system. When you use OSS to store data, some file system features are not supported. For example, you cannot append data to or overwrite existing files in OSS buckets.
If you select a CPFS dataset, you must also configure a virtual private cloud (VPC) for the job. The VPC must be the same as the one configured for the CPFS dataset. Otherwise, exceptions may occur and the DLC training jobs may be removed from the queue after you submit them.
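A pre-submission check for the CPFS VPC requirement might look like the following sketch. The dictionary keys `type` and `vpc_id`, the function name, and the sample VPC IDs are hypothetical placeholders rather than fields of a PAI API.

```python
def check_cpfs_vpc(job_vpc_id, dataset):
    """Return an error message if a CPFS dataset and the job use different VPCs."""
    if dataset.get("type") != "CPFS":
        return None  # The VPC requirement above applies only to CPFS datasets.
    if dataset.get("vpc_id") != job_vpc_id:
        return "VPC mismatch: job uses {!r}, CPFS dataset uses {!r}".format(
            job_vpc_id, dataset.get("vpc_id")
        )
    return None


# Hypothetical dataset description and VPC IDs.
dataset = {"type": "CPFS", "vpc_id": "vpc-example-a"}
print(check_cpfs_vpc("vpc-example-b", dataset))
```

Running such a check before submission surfaces a mismatch early instead of after the job has been queued.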
Step 4: Prepare a code build
Before you submit a deep learning job, you need to add the code required by the job to a code build. We recommend that you go to the page and add the code build as an AI asset. This way, the code build can be used by multiple training jobs. For more information, see Code builds.
References
After you complete the preparations, you can submit a training job. For more information, see Submit a training job.