Quick Start for Deep Learning Containers (DLC) - Platform For AI

Deep Learning Containers (DLC) lets you quickly create distributed or single-node training jobs. Because DLC is built on Kubernetes, you do not need to purchase machines or configure runtime environments manually, and you can use it without changing your existing workflow. This topic uses the MNIST handwriting recognition task as an example to demonstrate how to use DLC for single-node, single-GPU training and for multi-node, multi-GPU distributed training.

Note

MNIST handwriting recognition is a classic introductory task in deep learning. The goal is to build a machine learning model that classifies images of handwritten digits into 10 classes (0 to 9).

Prerequisites

Activate PAI and create a workspace using your Alibaba Cloud account. Log on to the PAI console. In the upper-left corner, select a region. Then, grant the required permissions and activate the product.

Billing

The examples in this topic use public resources to create DLC jobs. The billing method is pay-as-you-go. For more information about billing rules, see Billing of Deep Learning Containers (DLC).

Single-node, single-GPU training

Create a dataset

Datasets store the code, data, and results for model training. This topic uses an Object Storage Service (OSS) dataset as an example.

In the navigation pane on the left of the PAI console, click Datasets > Custom Datasets > Create Dataset.
Configure the dataset parameters. The key parameters are described below. You can use the default values for other parameters.
- Name: For example, dataset_mnist.
- Storage Type: Alibaba Cloud Object Storage Service (OSS).
- OSS Path: Click the icon, select a Bucket, and create a new folder, such as dlc_mnist.
  If you have not activated OSS, or if no bucket is available in the current region, follow these steps to activate OSS and create a bucket:
  (Optional) Activate OSS and create a bucket
  1. Activate OSS.
  2. Log on to the OSS console. Click Create Bucket. Enter a Bucket Name. For Region, select the same region as PAI. Use the default values for other parameters. Then, click Create.
Click Confirm to create the dataset.

Upload the training code and data.

Download the provided training code by clicking mnist_train.py. To simplify the process, the code is configured to automatically download the training data to the dataSet folder of the dataset at runtime.

For production use, you can upload the code and training data to your PAI dataset in advance.

Single-node, single-GPU training code example: mnist_train.py

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torch.utils.tensorboard import SummaryWriter

# Hyperparameters
batch_size = 64  # Amount of data for each training batch
learning_rate = 0.01  # Learning rate
num_epochs = 20  # Number of training epochs

# Check if a GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data pre-processing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=True, download=True, transform=transform)
val_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=False, download=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)


# Define a simple neural network
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # First convolutional layer: 1 input channel (grayscale image), 10 output channels, 5x5 kernel
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        # Second convolutional layer: 10 input channels, 20 output channels, 3x3 kernel
        self.conv2 = nn.Conv2d(10, 20, kernel_size=3)
        # Fully connected layer: Input is 20 × 5 × 5 (feature map size after convolution and pooling), output is 128
        self.fc1 = nn.Linear(20 * 5 * 5, 128)
        # Output layer: 128 -> 10 (corresponding to 10 digit classes)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Input x shape: [batch, 1, 28, 28]
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # [batch, 10, 12, 12]
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # [batch, 20, 5, 5]
        x = x.view(-1, 20 * 5 * 5)  # Flatten to [batch, 500]
        x = F.relu(self.fc1(x))      # [batch, 128]
        x = self.fc2(x)              # [batch, 10]
        return x


# Instantiate the model and move it to the GPU if available
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Create a TensorBoard SummaryWriter to visualize the model training process
writer = SummaryWriter('/mnt/data/output/runs/mnist_experiment')

# Variable to save the model with the highest accuracy
best_val_accuracy = 0.0

# Train the model and record the loss and accuracy
for epoch in range(num_epochs):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)  # Move the data and target to the GPU

        # Zero the gradients
        optimizer.zero_grad()
        # Forward propagation
        output = model(data)
        # Calculate the loss
        loss = criterion(output, target)
        # Backward propagation
        loss.backward()
        # Update the parameters
        optimizer.step()

        # Log the training loss to TensorBoard
        if batch_idx % 100 == 0:  # Log every 100 batches
            writer.add_scalar('Loss/train', loss.item(), epoch * len(train_loader) + batch_idx)
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}')

    # Validate the model and record the validation loss and accuracy
    model.eval()
    val_loss = 0
    correct = 0
    with torch.no_grad():  # Do not calculate gradients
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)  # Move the data and target to the GPU
            output = model(data)
            val_loss += criterion(output, target).item()  # Accumulate the validation loss
            pred = output.argmax(dim=1, keepdim=True)  # Get the predicted label
            correct += pred.eq(target.view_as(pred)).sum().item()  # Accumulate the number of correct predictions

    val_loss /= len(val_loader)  # Calculate the average validation loss
    val_accuracy = 100. * correct / len(val_loader.dataset)  # Calculate the validation accuracy
    print(f'Validation Loss: {val_loss:.4f}, Accuracy: {correct}/{len(val_loader.dataset)} ({val_accuracy:.0f}%)')

    # Log the validation loss and accuracy to TensorBoard
    writer.add_scalar('Loss/validation', val_loss, epoch)
    writer.add_scalar('Accuracy/validation', val_accuracy, epoch)

    # Save the model with the highest validation accuracy
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        torch.save(model.state_dict(), '/mnt/data/output/best_model.pth')
        print(f'Model saved with accuracy: {best_val_accuracy:.2f}%')

# Close the SummaryWriter
writer.close()
print('Training complete.')
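The size of the first fully connected layer (20 × 5 × 5) follows from the shape comments in SimpleCNN. As a rough check, the arithmetic can be sketched in plain Python. The helper names below are hypothetical and are not part of mnist_train.py; the sketch assumes no padding, a convolution stride of 1, and pooling with a stride equal to its kernel size, which matches the defaults used in the code above.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a square convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int) -> int:
    """Output width/height of max pooling whose stride equals its kernel size."""
    return (size - kernel) // kernel + 1


def simple_cnn_flat_features(input_size: int = 28, out_channels: int = 20) -> int:
    """Number of features that reach fc1 for a square grayscale input."""
    size = conv_out(input_size, kernel=5)  # 28 -> 24
    size = pool_out(size, kernel=2)        # 24 -> 12
    size = conv_out(size, kernel=3)        # 12 -> 10
    size = pool_out(size, kernel=2)        # 10 -> 5
    return out_channels * size * size      # 20 * 5 * 5


print(simple_cnn_flat_features())  # 500
```

If you change a kernel size or the input resolution, recompute this value and update both nn.Linear(20 * 5 * 5, 128) and the x.view(-1, 20 * 5 * 5) call to match.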

Upload the code. On the dataset details page, click View Data to open the OSS console. Then, click Upload Object > Select Files > Upload Object to upload the training code to OSS.

Create a DLC job

In the navigation pane on the left of the PAI console, click Deep Learning Containers (DLC) > Create Job.

Configure the DLC job parameters. The key parameters are described below. You can use the default values for other parameters. For more information about all parameters, see Create a training job.

Image config: Select Image Address and enter the image URL for your Region.

Region	Image URL
China (Beijing)	dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Shanghai)	dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Hangzhou)	dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
Other regions	Find your region ID and replace <Region ID> in the image URL to get the full link: dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
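The "Other regions" row describes a simple string substitution. A minimal sketch of that substitution is shown below; the function name dlc_image_url is hypothetical, and the sketch does not check whether the region ID actually exists or whether the image is available in that region.

```python
# Image tag copied from the table above.
IMAGE_TAG = "1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04"


def dlc_image_url(region_id: str) -> str:
    """Fill <Region ID> in the image URL template from the table."""
    region_id = region_id.strip()
    if not region_id:
        raise ValueError("region_id must not be empty")
    return f"dsw-registry-vpc.{region_id}.cr.aliyuncs.com/pai/modelscope:{IMAGE_TAG}"


print(dlc_image_url("cn-beijing"))
```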

This image has been verified to be compatible with the environment in the Quick Start for Interactive Modelling (DSW). The typical workflow for modeling with PAI is to first verify the environment and develop code in DSW, and then use DLC for training.

Mount dataset: Select Custom Dataset and choose the dataset you created in the previous step. The default Mount Path is /mnt/data.
Startup Command: python /mnt/data/mnist_train.py
This startup command is the same as the one you would use in DSW or locally, except for the script path: because the dataset is mounted at /mnt/data, mnist_train.py is available at /mnt/data/mnist_train.py.
Source: Select Public Resources. For Resource Type, select ecs.gn7i-c8g1.2xlarge.
If this instance specification is out of stock, you can select another GPU-accelerated instance.
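To see how files in the dataset map to paths inside the job, the following sketch translates an OSS object key into the path that the training container sees. It assumes that the dataset points at the dlc_mnist folder from the earlier example and that the default mount path /mnt/data is used; the function name mounted_path is hypothetical.

```python
import posixpath


def mounted_path(object_key: str,
                 dataset_prefix: str = "dlc_mnist/",
                 mount_path: str = "/mnt/data") -> str:
    """Map an OSS object key under the dataset folder to its path in the container."""
    prefix = dataset_prefix.strip("/")
    prefix = prefix + "/" if prefix else ""
    key = object_key.lstrip("/")
    if not key.startswith(prefix):
        raise ValueError(f"{object_key!r} is not inside the dataset folder {dataset_prefix!r}")
    # Everything after the dataset folder is appended to the mount path.
    return posixpath.join(mount_path, key[len(prefix):])


print(mounted_path("dlc_mnist/mnist_train.py"))        # /mnt/data/mnist_train.py
print(mounted_path("dlc_mnist/output/best_model.pth"))  # /mnt/data/output/best_model.pth
```

The same mapping works in reverse: anything the job writes under /mnt/data, such as the checkpoint and the TensorBoard logs, appears under the dataset folder in OSS.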

Click Confirm to create the job. The job takes about 15 minutes to complete. You can monitor the training process by clicking Logs.

After the job is complete, the best model checkpoint and the TensorBoard logs are saved to the output folder of the mounted dataset (/mnt/data/output).

(Optional) View TensorBoard

You can use the TensorBoard visualization tool to view the loss curve and understand the training details.

Important

To use TensorBoard for a DLC job, you must configure a dataset.

On the DLC job details page, click the TensorBoard tab and then click Create TensorBoard.
Set Configuration Type to By Task. For Summary Path, enter the path where the summaries are stored in the training code: /mnt/data/output/runs/. Click Confirm to start.
This corresponds to the code snippet: writer = SummaryWriter('/mnt/data/output/runs/mnist_experiment')
Click View TensorBoard to view the loss curves for the training set (train_loss, logged under the tag Loss/train) and the validation set (validation_loss, logged under the tag Loss/validation).
(Optional) Adjust hyperparameters based on the loss graph to improve model performance
You can evaluate the model's training performance by observing the trend of the loss value:
- If both train_loss and validation_loss are still decreasing when the training ends, this indicates underfitting.
  To address this, you can increase `num_epochs` (the number of passes over the training data) or slightly increase `learning_rate` and then retrain the model.
- If train_loss continues to decrease while validation_loss starts to increase, this indicates overfitting.
  To address this, you can decrease `num_epochs` or slightly decrease `learning_rate` and then retrain the model.
- If both train_loss and validation_loss stabilize before the training ends, this indicates a good fit.
  If the model has a good fit, you can proceed to the next steps.
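The three cases above can also be expressed as a small heuristic over per-epoch loss values. This is only an illustrative sketch: the function name diagnose_fit and the window and tol thresholds are assumptions, not values recommended by PAI, and real loss curves are often noisy enough that a visual check in TensorBoard remains more reliable.

```python
from typing import Sequence


def diagnose_fit(train_loss: Sequence[float],
                 val_loss: Sequence[float],
                 window: int = 3,
                 tol: float = 0.01) -> str:
    """Classify the end-of-training trend of two per-epoch loss curves."""
    if len(train_loss) <= window or len(val_loss) <= window:
        raise ValueError("need more than `window` epochs of loss values")

    def relative_change(values: Sequence[float]) -> float:
        # Relative change over the last `window` epochs; negative means decreasing.
        start, end = values[-window - 1], values[-1]
        return (end - start) / max(abs(start), 1e-12)

    train_delta = relative_change(train_loss)
    val_delta = relative_change(val_loss)

    if train_delta < -tol and val_delta > tol:
        return "overfitting"
    if train_delta < -tol and val_delta < -tol:
        return "underfitting"
    if abs(train_delta) <= tol and abs(val_delta) <= tol:
        return "good fit"
    return "inconclusive"


print(diagnose_fit([1.0, 0.8, 0.6, 0.4, 0.3], [1.1, 0.9, 0.7, 0.5, 0.4]))  # underfitting
print(diagnose_fit([1.0, 0.5, 0.3, 0.2, 0.1], [1.0, 0.6, 0.5, 0.6, 0.7]))  # overfitting
print(diagnose_fit([1.0, 0.2, 0.2, 0.2, 0.2], [1.0, 0.3, 0.3, 0.3, 0.3]))  # good fit
```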

Deploy the trained model

For more information, see Use EAS to deploy the model as an online service.

Single-node multi-GPU or multi-node multi-GPU distributed training

If the memory of a single GPU is insufficient for your training needs, or if you want to accelerate the training process, you can create a single-node multi-GPU or multi-node multi-GPU distributed training job.

The example in this section uses two instances, each with one GPU. The same approach applies to other single-node multi-GPU or multi-node multi-GPU configurations.

Create a dataset

If you have already created a dataset during the single-node, single-GPU training, you only need to download and upload the mnist_train_distributed.py code. Otherwise, you must first create a dataset and then upload the code.

Single-node multi-GPU or multi-node multi-GPU training code example: mnist_train_distributed.py

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
from torch.utils.tensorboard import SummaryWriter

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=3)
        self.fc1 = nn.Linear(20 * 5 * 5, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, 20 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

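def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker process
    rank = int(os.environ["RANK"])              # Global rank of this process
    world_size = int(os.environ["WORLD_SIZE"])  # Total number of processes
    local_rank = int(os.environ["LOCAL_RANK"])  # Index of this process on its node
    # Initialize the NCCL process group for communication between GPUs
    dist.init_process_group(backend='nccl')
    # Bind this process to its own GPU
    torch.cuda.set_device(local_rank)
    device = torch.device('cuda', local_rank)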

    batch_size = 64
    learning_rate = 0.01
    num_epochs = 20

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,))
    ])

    # Only the main process (rank=0) needs to download. Other processes must wait for it to finish.
    # Let processes with rank!=0 wait at the barrier first.
    if rank != 0:
        dist.barrier()

    # All processes execute the dataset creation.
    # However, only the process with rank=0 will actually perform the download.
    train_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=True, download=(rank == 0), transform=transform)

    # After the process with rank=0 finishes downloading, it also reaches the barrier, releasing all processes.
    if rank == 0:
        dist.barrier()

    # At this point, all processes are synchronized and can continue executing the subsequent code.
    val_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=False, download=False, transform=transform)

    # DistributedSampler gives each process its own shard of the training set
    train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, num_workers=4, pin_memory=True)
    # The validation loader has no sampler, so every process evaluates the full validation set
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)

    model = SimpleCNN().to(device)
    # Wrap the model in DDP so that gradients are averaged across processes during backward()
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)

    if rank == 0:
        writer = SummaryWriter('/mnt/data/output_distributed/runs/mnist_experiment')
    best_val_accuracy = 0.0

    for epoch in range(num_epochs):
        # Reseed the sampler so that each epoch uses a different shuffle
        train_sampler.set_epoch(epoch)
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0:
                # Each rank and local_rank prints its own loss.
                print(f"Rank: {rank}, Local_Rank: {local_rank} -- Train Epoch: {epoch} "
                      f"[{batch_idx * len(data) * world_size}/{len(train_loader.dataset)} "
                      f"({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}")

                if rank == 0:
                    writer.add_scalar('Loss/train', loss.item(), epoch * len(train_loader) + batch_idx)

        # Validation
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)
                output = model(data)
                val_loss += criterion(output, target).item() * data.size(0)
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.view_as(pred)).sum().item()
                total += target.size(0)
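        # Pack the local sums into tensors so that they can be reduced across processes
        val_loss_tensor = torch.tensor([val_loss], dtype=torch.float32, device=device)
        correct_tensor = torch.tensor([correct], dtype=torch.float32, device=device)
        total_tensor = torch.tensor([total], dtype=torch.float32, device=device)
        # Sum the values from all processes. Every process evaluates the full validation set,
        # so each sum is scaled by world_size, but the ratios below are unaffected.
        dist.all_reduce(val_loss_tensor, op=dist.ReduceOp.SUM)
        dist.all_reduce(correct_tensor, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_tensor, op=dist.ReduceOp.SUM)

        # Sample-weighted average loss and accuracy
        val_loss = val_loss_tensor.item() / total_tensor.item()
        val_accuracy = 100. * correct_tensor.item() / total_tensor.item()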

        if rank == 0:
            print(f'Validation Loss: {val_loss:.4f}, Accuracy: {int(correct_tensor.item())}/{int(total_tensor.item())} ({val_accuracy:.0f}%)')
            writer.add_scalar('Loss/validation', val_loss, epoch)
            writer.add_scalar('Accuracy/validation', val_accuracy, epoch)
            if val_accuracy > best_val_accuracy:
                best_val_accuracy = val_accuracy
                torch.save(model.module.state_dict(), '/mnt/data/output_distributed/best_model.pth')
                print(f'Model saved with accuracy: {best_val_accuracy:.2f}%')
    if rank == 0:
        writer.close()
    dist.destroy_process_group()
    if rank == 0:
        print('Training complete.')


if __name__ == "__main__":
    main()
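To illustrate how the training data is divided between processes, the following sketch reproduces the index split that DistributedSampler performs when shuffling is disabled and drop_last is left at its default of False. It is a simplified plain-Python approximation rather than the PyTorch implementation, and the helper names are hypothetical.

```python
import math
from typing import List


def shard_indices(dataset_len: int, world_size: int, rank: int) -> List[int]:
    """Indices assigned to one process (unshuffled, drop_last=False)."""
    num_samples = math.ceil(dataset_len / world_size)
    total_size = num_samples * world_size
    indices = list(range(dataset_len))
    # Pad by repeating indices from the start so that every process gets the same count.
    indices += indices[: total_size - dataset_len]
    # Each process takes every world_size-th index, starting at its own rank.
    return indices[rank:total_size:world_size]


def batches_per_process(dataset_len: int, world_size: int, batch_size: int) -> int:
    """Number of batches each process iterates over per epoch."""
    return math.ceil(math.ceil(dataset_len / world_size) / batch_size)


print(shard_indices(10, world_size=2, rank=0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, world_size=2, rank=1))  # [1, 3, 5, 7, 9]
print(batches_per_process(60000, world_size=2, batch_size=64))  # 469
```

With the 60,000 MNIST training images, two processes, and batch_size = 64, each process runs 469 batches per epoch instead of the 938 batches of the single-GPU job, while the effective global batch size doubles to 128.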

Create a DLC job

In the navigation pane on the left of the PAI console, click Deep Learning Containers (DLC) > Create Job.

Configure the DLC job parameters. The key parameters are described below. You can use the default values for other parameters. For more information about all parameters, see Create a training job.

Image config: Select Image Address and enter the image URL for your Region.

Region	Image URL
China (Beijing)	dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Shanghai)	dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
China (Hangzhou)	dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
Other regions	Find your region ID and replace <Region ID> in the image URL to get the full link: dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04

This image has been verified to be compatible with the environment in the Quick Start for Interactive Modelling (DSW). The typical workflow for modeling with PAI is to first verify the environment and code in DSW, and then use DLC for training.

Mount dataset: Select Custom Dataset and choose the dataset you created in the previous step. The default Mount Path is /mnt/data.
Startup Command: torchrun --nproc_per_node=1 --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} /mnt/data/mnist_train_distributed.py
DLC automatically injects common environment variables such as MASTER_ADDR and WORLD_SIZE. Reference them in the startup command as $VARIABLE_NAME.
Source: Select Public Resources. Set Quantity to 2. For Resource Type, select ecs.gn7i-c8g1.2xlarge.
If this instance specification is out of stock, you can select another GPU-accelerated instance.
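In the startup command, ${WORLD_SIZE} and ${RANK} are passed to torchrun as the node count and the node index. torchrun then sets its own RANK, WORLD_SIZE, and LOCAL_RANK for each worker process, and these are the values that mnist_train_distributed.py reads. The sketch below shows the relationship, assuming a static launch in which every node runs the same number of processes; the function name torchrun_worker_env is hypothetical.

```python
from typing import Dict, List


def torchrun_worker_env(node_rank: int, nnodes: int, nproc_per_node: int) -> List[Dict[str, str]]:
    """Return the RANK, WORLD_SIZE, and LOCAL_RANK values for each worker on one node."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in the range [0, nnodes)")
    world_size = nnodes * nproc_per_node
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "RANK": str(node_rank * nproc_per_node + local_rank),  # Global rank
            "WORLD_SIZE": str(world_size),                         # Total number of processes
            "LOCAL_RANK": str(local_rank),                         # GPU index on this node
        })
    return envs


# The job in this topic: two instances with one GPU each.
for node_rank in range(2):
    print(node_rank, torchrun_worker_env(node_rank, nnodes=2, nproc_per_node=1))
```

For instances with more than one GPU, set --nproc_per_node to the number of GPUs per instance. The training script itself does not need to change because it reads all three values from the environment.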

Click Confirm to create the job. The job takes about 10 minutes to run. While the job runs, you can view the training logs for both instances on the Overview page.

After the job is complete, the best model checkpoint and the TensorBoard logs are saved to the output_distributed folder of the mounted dataset (/mnt/data/output_distributed).