Deep Learning Containers (DLC) lets you quickly create distributed or single-node training jobs. Built on Kubernetes, DLC eliminates the need to manually purchase machines and configure runtime environments, allowing you to use it without changing your existing workflow. This topic uses the MNIST handwriting recognition task as an example to demonstrate how to use DLC for single-node, single-GPU training and multi-node, multi-GPU distributed training.
MNIST handwriting recognition is a classic introductory task in deep learning. The goal is to build a machine learning model to recognize 10 handwritten digits (0 to 9).
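The training scripts used in this topic are provided as downloads and are not reproduced here. As a rough orientation, a minimal MNIST classifier in PyTorch might look like the following sketch (the architecture and layer sizes are illustrative assumptions, not the provided code):

```python
# Minimal sketch of an MNIST classifier (illustrative; not the provided mnist_train.py).
import torch
from torch import nn

class MnistNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),             # 1x28x28 image -> 784-dimensional vector
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),       # one logit per digit 0-9
        )

    def forward(self, x):
        return self.net(x)

model = MnistNet()
logits = model(torch.randn(1, 1, 28, 28))  # batch of one fake image
print(logits.shape)                        # torch.Size([1, 10])
```

The provided scripts additionally handle data loading, checkpointing, and TensorBoard logging, which the later steps rely on.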

Prerequisites
Use your Alibaba Cloud account to activate PAI and create a workspace: log on to the PAI console, select a region in the upper-left corner, grant the required permissions, and activate the service.
Billing
The examples in this topic use public resources to create DLC jobs. The billing method is pay-as-you-go. For more information about billing rules, see Billing of Deep Learning Containers (DLC).
Single-node, single-GPU training
Create a dataset
Datasets store the code, data, and results for model training. This topic uses an Object Storage Service (OSS) dataset as an example.
In the navigation pane on the left of the PAI console, click Datasets > Custom Datasets > Create Dataset.
Configure the dataset parameters. The key parameters are described below. You can use the default values for other parameters.
Name: For example, dataset_mnist.
Storage Type: Alibaba Cloud Object Storage Service (OSS).
OSS Path: Click the icon, select a bucket, and create a new folder, such as dlc_mnist. If you have not activated OSS, or if no bucket is available in the current region, activate OSS and create a bucket first.
Click Confirm to create the dataset.
Upload the training code and data.
Download the provided training code by clicking mnist_train.py. To simplify the process, the code is configured to automatically download the training data to the dataSet folder of the dataset at runtime (see the sketch after these steps).
For production use, upload both the code and the training data to your PAI dataset in advance.
Upload the code. On the dataset details page, click View Data to open the OSS console. Then, click Upload Object > Select Files > Upload Object to upload the training code to OSS.
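As mentioned above, mnist_train.py downloads the training data itself. A hedged sketch of what that step typically looks like with torchvision (the exact arguments in the provided script are assumptions; only the dataSet folder name comes from the description above):

```python
# Sketch of the auto-download step described above (assumed, not the exact provided code).
import os
from torchvision import datasets, transforms

# When the dataset is mounted at /mnt/data, this resolves to /mnt/data/dataSet.
data_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "dataSet")

# download=True fetches MNIST on the first run and reuses the local copy afterwards.
train_set = datasets.MNIST(data_dir, train=True, download=True,
                           transform=transforms.ToTensor())
print(len(train_set))  # 60000 training images
```

Because the dataset is backed by OSS, the downloaded files persist in the bucket and are reused by later jobs that mount the same dataset.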

Create a DLC job
In the navigation pane on the left of the PAI console, click Deep Learning Containers (DLC) > Create Job.

Configure the DLC job parameters. The key parameters are described below. You can use the default values for other parameters. For more information about all parameters, see Create a training job.
Image config: Select Image Address and enter the image URL for your Region.

China (Beijing): dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Shanghai): dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Hangzhou): dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
Other regions: Replace <Region ID> in the following URL with your region ID to get the full link: dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
This image has been verified to be compatible with the environment in the Quick Start for Interactive Modelling (DSW). The typical workflow for modeling with PAI is to first verify the environment and develop code in DSW, and then use DLC for training.
Mount dataset: Select Custom Dataset and choose the dataset you created in the previous step. The default Mount Path is /mnt/data.
Startup Command: python /mnt/data/mnist_train.py
This startup command is the same as the one used when running in DSW or locally, except that mnist_train.py is now mounted under /mnt/data/, so the script path must be /mnt/data/mnist_train.py.
Source: Select Public Resources. For Resource Type, select ecs.gn7i-c8g1.2xlarge. If this instance type is out of stock, select another GPU-accelerated instance type.
Click Confirm to create the job. The job takes about 15 minutes to complete. You can monitor the training process by clicking Logs.

After the job is complete, the best model checkpoint and TensorBoard logs are saved to the output path of the mounted dataset.
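The saving behavior described above follows the standard PyTorch pattern. A minimal, hedged sketch (the file name best_model.pt, the selection metric, and the evaluate helper are illustrative assumptions; only the /mnt/data/output path and the TensorBoard tags come from this topic):

```python
# Sketch of how the best checkpoint and TensorBoard logs end up under /mnt/data/output.
import os
import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter

output_dir = "/mnt/data/output"          # resolves into the OSS dataset through the mount
os.makedirs(output_dir, exist_ok=True)
writer = SummaryWriter(os.path.join(output_dir, "runs", "mnist_experiment"))

model = nn.Linear(28 * 28, 10)           # stand-in model for the sketch

def evaluate(model):
    # Hypothetical stand-in: the real script computes losses on actual MNIST data.
    return torch.rand(1).item(), torch.rand(1).item()

best_val_loss = float("inf")
for epoch in range(3):
    train_loss, val_loss = evaluate(model)
    writer.add_scalar("train_loss", train_loss, epoch)       # curves shown in TensorBoard
    writer.add_scalar("validation_loss", val_loss, epoch)
    if val_loss < best_val_loss:                             # keep only the best weights
        best_val_loss = val_loss
        torch.save(model.state_dict(), os.path.join(output_dir, "best_model.pt"))
writer.close()
```

Because /mnt/data is the mounted OSS dataset, everything written there survives after the job's containers are released.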
(Optional) View TensorBoard
You can use the TensorBoard visualization tool to view the loss curve and understand the training details.
To use TensorBoard for a DLC job, you must configure a dataset.
On the DLC job details page, click the TensorBoard tab and then click Create TensorBoard.

Set Configuration Type to By Task. For Summary Path, enter the path where the training code stores the summaries: /mnt/data/output/runs/. Click Confirm to start.
This corresponds to the following line in the training code:
writer = SummaryWriter('/mnt/data/output/runs/mnist_experiment')
Click View TensorBoard to view the loss curves for the training set (train_loss) and the validation set (validation_loss).

Deploy the trained model
For more information, see Use EAS to deploy the model as an online service.
Single-node multi-GPU or multi-node multi-GPU distributed training
If the video memory of a single GPU is insufficient for your training needs, or if you want to accelerate the training process, you can create a single-node multi-GPU or multi-node multi-GPU distributed training job.
This topic uses an example of two instances, each with one GPU. This example also applies to other configurations for single-node multi-GPU or multi-node multi-GPU training.
Create a dataset
If you already created a dataset during the single-node, single-GPU training, you only need to download the mnist_train_distributed.py code and upload it to that dataset. Otherwise, first create a dataset and then upload the code.
Create a DLC job
In the navigation pane on the left of the PAI console, click Deep Learning Containers (DLC) > Create Job.

Configure the DLC job parameters. The key parameters are described below. You can use the default values for other parameters. For more information about all parameters, see Create a training job.
Image config: Select Image Address and enter the image URL for your Region.

China (Beijing): dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Shanghai): dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Hangzhou): dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
Other regions: Replace <Region ID> in the following URL with your region ID to get the full link: dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
This image has been verified to be compatible with the environment in the Quick Start for Interactive Modelling (DSW). The typical workflow for modeling with PAI is to first verify the environment and develop code in DSW, and then use DLC for training.
Mount dataset: Select Custom Dataset and choose the dataset you created in the previous step. The default Mount Path is /mnt/data.
Startup Command: torchrun --nproc_per_node=1 --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} /mnt/data/mnist_train_distributed.py
DLC automatically injects common environment variables such as MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into each instance, so the startup command can reference them as ${VARIABLE_NAME}. A sketch of how the training script itself consumes these values follows these steps.
Source: Select Public Resources. Set Quantity to 2 so that the job uses two instances. For Resource Type, select ecs.gn7i-c8g1.2xlarge. If this instance type is out of stock, select another GPU-accelerated instance type.
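The torchrun command above launches one process per GPU on each instance and wires the instances together through the injected variables. A minimal sketch of the standard PyTorch DistributedDataParallel recipe such a script follows (this illustrates the pattern, not the exact contents of mnist_train_distributed.py):

```python
# Minimal torchrun-compatible DDP skeleton (assumed shape of mnist_train_distributed.py).
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it spawns,
    # derived from the --nnodes/--node_rank/--master_addr values DLC injected above.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(28 * 28, 10).cuda(local_rank), device_ids=[local_rank])

    # Stand-in dataset; the real script loads MNIST from the mounted path.
    data = TensorDataset(torch.randn(256, 28 * 28), torch.randint(0, 10, (256,)))
    sampler = DistributedSampler(data)   # gives each rank a disjoint shard of the data
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(2):
        sampler.set_epoch(epoch)         # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()  # DDP all-reduces gradients across ranks here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With Quantity set to 2 and --nproc_per_node=1, this yields two processes in total, one per instance, matching the two-instance example in this topic.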
Click Confirm to create the job. The job takes about 10 minutes to run. While the job runs, you can view the training logs of both instances on the Overview page.

After the job is complete, the best model checkpoint and TensorBoard logs are saved to the output_distributed path of the mounted dataset.
(Optional) View TensorBoard
You can use the TensorBoard visualization tool to view the loss curve and understand the training details.
To use TensorBoard for a DLC job, you must configure a dataset.
On the DLC job details page, click the TensorBoard tab and then click Create TensorBoard.

Set Configuration Type to By Task. For Summary Path, enter the path where the training code stores the summaries: /mnt/data/output_distributed/runs. Click Confirm to start.
This corresponds to the following line in the training code:
writer = SummaryWriter('/mnt/data/output_distributed/runs/mnist_experiment')
Click View TensorBoard to view the loss curves for the training set (train_loss) and the validation set (validation_loss).
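In a distributed job every instance executes the script, so having all ranks write event files to the same path could produce duplicate or interleaved logs. A common convention, assumed here rather than confirmed by the provided script, is to create the writer on rank 0 only:

```python
# Assumed rank-0-only logging pattern for the distributed script.
import os
from torch.utils.tensorboard import SummaryWriter

rank = int(os.environ.get("RANK", 0))
writer = None
if rank == 0:  # only one process writes TensorBoard event files
    writer = SummaryWriter("/mnt/data/output_distributed/runs/mnist_experiment")

if writer is not None:
    writer.add_scalar("train_loss", 0.5, 0)  # placeholder values for the sketch
    writer.close()
```

Guarding checkpoint saving the same way avoids two instances writing best-model files to the shared OSS path at once.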

Deploy the trained model
For more information, see Use EAS to deploy the model as an online service.
References
For more information about DLC features, see Deep Learning Containers (DLC).
