This tutorial uses MNIST handwriting recognition to show how to create single-node or distributed training jobs on Kubernetes without setting up infrastructure.
MNIST handwriting recognition is a classic introductory deep learning task: building a model that recognizes the 10 handwritten digits (0 to 9).

Prerequisites
Activate PAI and create a workspace using your Alibaba Cloud account. Log on to the PAI console, select a region in the upper-left corner, grant the required permissions, and activate the product.
Billing
The examples in this tutorial use public resources to create DLC jobs. The billing method is pay-as-you-go. For billing rules, see Billing of DLC.
Single-node, single-GPU training
Create a dataset
Datasets store code, data, and results for model training. This tutorial uses an Object Storage Service (OSS) dataset.
- In the left-side navigation pane of the PAI console, click Datasets > Custom Datasets > Create Dataset.
- Configure the dataset parameters. Key parameters:
  - Name: For example, dataset_mnist.
  - Storage Type: Alibaba Cloud Object Storage Service (OSS).
  - OSS Path: Click the icon, select a bucket, and create a new folder, such as dlc_mnist. If OSS is not activated or no bucket is available in the current region, activate OSS and create a bucket first.
- Click Confirm to create the dataset.
- Upload the training code and data.
  - Download the provided training code by clicking mnist_train.py. To simplify the process, the code automatically downloads the training data to the dataSet folder at runtime. For production use, upload the code and training data to your PAI dataset in advance.
  - Upload the code. On the dataset details page, click View Data to open the OSS console. Click Upload Object > Select Files > Upload Object to upload the training code to OSS.
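For orientation, a minimal single-GPU MNIST training script might look like the sketch below. This is a hypothetical illustration, not the contents of the downloadable mnist_train.py: the model architecture, function names, and the model.pt filename are assumptions, while the /mnt/data/dataSet and /mnt/data/output paths follow the tutorial's dataset mount layout.

```python
# Hypothetical sketch of a single-GPU MNIST training script; the real
# mnist_train.py also selects the best checkpoint and writes TensorBoard
# logs to the mounted dataset's output path.
import torch
import torch.nn as nn


class MNISTNet(nn.Module):
    """A small CNN that classifies 28x28 grayscale digits into 10 classes."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Two 2x2 poolings reduce 28x28 to 7x7 feature maps.
        self.classifier = nn.Linear(64 * 7 * 7, 10)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


def train_one_epoch(model, loader, optimizer, device):
    """Run one pass over the training data."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()


def main():
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # As in the tutorial, the data is downloaded at runtime to the dataSet
    # folder under the dataset mount path.
    train_set = datasets.MNIST("/mnt/data/dataSet", train=True, download=True,
                               transform=transforms.ToTensor())
    model = MNISTNet().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    train_one_epoch(model, DataLoader(train_set, batch_size=64, shuffle=True),
                    optimizer, device)
    torch.save(model.state_dict(), "/mnt/data/output/model.pt")


# Inside the DLC job, where /mnt/data is mounted, you would call:
#   main()
```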

Create a DLC job
- In the left-side navigation pane of the PAI console, click Deep Learning Containers (DLC) > Create Job.

- Configure the DLC job parameters. Key parameters (use the defaults for others):
  - Image config: Select Image Address and enter the image URL for your region.

| Region | Image URL |
| --- | --- |
| China (Beijing) | dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
| China (Shanghai) | dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
| China (Hangzhou) | dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
| Other regions | Find your region ID and replace <Region ID> in the following template to get the full URL: dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
This image has been verified to be compatible with the environment in Quick Start for Interactive Modeling (DSW). The standard PAI workflow is to first verify the environment and develop code in DSW, then use DLC for training.
  - Mount dataset: Select Custom Dataset and choose the dataset created in the previous step. The default Mount Path is /mnt/data.
  - Startup Command: python /mnt/data/mnist_train.py. This startup command matches what you would use in DSW or locally. However, because mnist_train.py is mounted to /mnt/data/, the code path must be /mnt/data/mnist_train.py.
  - Source: Select Public Resources. For Resource Type, select ecs.gn7i-c8g1.2xlarge. Select another GPU-accelerated instance type if this one is unavailable.
- Click Confirm to create the job. The job takes about 15 minutes to complete. Monitor the training process by clicking Logs.
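The image URLs in the table above differ only in the region ID, so the URL for any region can be derived from a single template. The helper below is purely illustrative; the template string is copied from the table.

```python
# Illustrative helper: derive the full image URL from the region ID.
TEMPLATE = ("dsw-registry-vpc.{region_id}.cr.aliyuncs.com/pai/"
            "modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04")


def image_url(region_id: str) -> str:
    """Return the image URL for a region ID such as 'cn-beijing'."""
    return TEMPLATE.format(region_id=region_id)


print(image_url("cn-beijing"))
```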

After the job completes, the best model checkpoint and TensorBoard logs are saved to the output path of the mounted dataset.
(Optional) View TensorBoard
Use the TensorBoard visualization tool to view the loss curves and understand training details. To use TensorBoard for a DLC job, you must configure a dataset.
- On the DLC job details page, click the TensorBoard tab and then click Create TensorBoard.

- Set Configuration Type to By Task. For Summary Path, enter the path where summaries are stored in the training code: /mnt/data/output/runs/. Click Confirm to start. This corresponds to the following line in the training code:
  writer = SummaryWriter('/mnt/data/output/runs/mnist_experiment')
- Click View TensorBoard to view the loss curves for the training dataset (train_loss) and the validation set (validation_loss).
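For context, the snippet below sketches how a script produces the event files that TensorBoard reads from the summary path. It is not the tutorial's code: the loss values are placeholders, and a temporary directory stands in for /mnt/data/output/runs/mnist_experiment so the sketch runs anywhere.

```python
# Sketch: write scalar summaries that TensorBoard can plot as loss curves.
import tempfile

from torch.utils.tensorboard import SummaryWriter

# In the DLC job the tutorial uses '/mnt/data/output/runs/mnist_experiment';
# a temporary directory is used here so the sketch runs outside DLC.
log_dir = tempfile.mkdtemp()
writer = SummaryWriter(log_dir)
for epoch in range(3):
    # Placeholder values; the real script logs the actual computed losses.
    writer.add_scalar("train_loss", 1.0 / (epoch + 1), epoch)
    writer.add_scalar("validation_loss", 1.2 / (epoch + 1), epoch)
writer.close()
```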

Deploy the trained model
For more information, see Use EAS to deploy the model as an online service.
Single-node multi-GPU or multi-node multi-GPU distributed training
Use distributed training when the memory of a single GPU is insufficient or when you want to accelerate training.
This example uses two instances with one GPU each. The same approach applies to other single-node multi-GPU or multi-node multi-GPU configurations.
Create a dataset
If you already created a dataset during single-node, single-GPU training, download the mnist_train_distributed.py code and upload it to that dataset. Otherwise, first create a dataset and then upload the code.
Create a DLC job
- In the left-side navigation pane of the PAI console, click Deep Learning Containers (DLC) > Create Job.

- Configure the DLC job parameters. Key parameters (use the defaults for others):
  - Image config: Select Image Address and enter the image URL for your region.

| Region | Image URL |
| --- | --- |
| China (Beijing) | dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
| China (Shanghai) | dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
| China (Hangzhou) | dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
| Other regions | Find your region ID and replace <Region ID> in the following template to get the full URL: dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04 |
This image has been verified to be compatible with the environment in Quick Start for Interactive Modeling (DSW). The standard PAI workflow is to first verify the environment and develop code in DSW, then use DLC for training.
  - Mount dataset: Select Custom Dataset and choose the dataset created in the previous step. The default Mount Path is /mnt/data.
  - Startup Command: torchrun --nproc_per_node=1 --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} /mnt/data/mnist_train_distributed.py
    DLC automatically injects common environment variables such as MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK, which you can reference in the startup command in the $VARIABLE_NAME format.
  - Source: Select Public Resources. Set Quantity to 2. For Resource Type, select ecs.gn7i-c8g1.2xlarge. Select another GPU-accelerated instance type if this one is unavailable.
- Click Confirm to create the job. The job takes about 10 minutes. View the training Log for both instances on the Overview page.
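The torchrun startup command above passes DLC-injected environment variables on the command line, but a script can also read them directly. The sketch below is illustrative: the variable names come from the command above, while the local-run defaults (single node, 127.0.0.1, port 29500) are assumptions. Note that in this setup WORLD_SIZE is the number of nodes and RANK is the node index.

```python
# Sketch: read the rendezvous settings that DLC injects into each node,
# with defaults (assumed here) that allow a local single-node dry run.
import os


def dist_config():
    """Collect the distributed-training settings from the environment."""
    return {
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),       # number of nodes
        "rank": int(os.environ.get("RANK", "0")),                   # this node's index
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),  # rank-0 address
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }


if __name__ == "__main__":
    print(dist_config())
```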

After the job completes, the best model checkpoint and TensorBoard logs are saved to the output_distributed path of the mounted dataset.
(Optional) View TensorBoard
Use the TensorBoard visualization tool to view the loss curves and understand training details. To use TensorBoard for a DLC job, you must configure a dataset.
- On the DLC job details page, click the TensorBoard tab and then click Create TensorBoard.

- Set Configuration Type to By Task. For Summary Path, enter the path where summaries are stored in the training code: /mnt/data/output_distributed/runs. Click Confirm to start. This corresponds to the following line in the training code:
  writer = SummaryWriter('/mnt/data/output_distributed/runs/mnist_experiment')
- Click View TensorBoard to view the loss curves for the training dataset (train_loss) and the validation set (validation_loss).

Deploy the trained model
For more information, see Use EAS to deploy the model as an online service.
References
- For DLC features, see Deep Learning Containers (DLC).
