Platform for AI: Run a training job on DLC

Last Updated: Mar 11, 2026

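This tutorial uses MNIST handwritten digit recognition to show how to create single-node and distributed training jobs in Deep Learning Containers (DLC). DLC runs the jobs on Kubernetes, so you do not need to set up any infrastructure.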

Note

MNIST handwritten digit recognition is a classic introductory deep learning task: build a model that classifies images of the 10 handwritten digits (0 to 9).

Prerequisites

Activate PAI and create a workspace using your Alibaba Cloud account: log on to the PAI console, select a region in the upper-left corner, grant the required permissions, and activate the product.

Billing

The examples in this tutorial use public resources to create DLC jobs. The billing method is pay-as-you-go. For billing rules, see Billing of DLC.

Single-node, single-GPU training

Create a dataset

Datasets store code, data, and results for model training. This tutorial uses an Object Storage Service (OSS) dataset.

  1. In the navigation pane on the left of the PAI console, click Datasets > Custom Datasets > Create Dataset.

  2. Configure dataset parameters. Key parameters:

    • Name: For example, dataset_mnist.

    • Storage Type: Alibaba Cloud Object Storage Service (OSS).

    • OSS Path: Click the icon, select a bucket, and create a new folder, such as dlc_mnist.

      If OSS is not activated or no bucket is available in the current region, follow these steps to activate OSS and create a bucket:

      (Optional) Activate OSS and create a bucket

      1. Activate OSS.

      2. Log on to the OSS console. Click Create Bucket. Enter a Bucket Name. For Region, select the same region as PAI. Use the default values for the other parameters. Click Create.

    Click Confirm to create the dataset.

  3. Upload the training code and data.

    1. Download the provided training code by clicking mnist_train.py. To simplify the process, the code automatically downloads the training data to the dataSet folder of the mounted dataset at runtime.

      For production use, upload the code and the training data to your PAI dataset in advance.

      Single-node, single-GPU training code example: mnist_train.py

      import torch
      import torch.nn as nn
      import torch.nn.functional as F
      import torch.optim as optim
      from torch.utils.data import DataLoader
      from torchvision import datasets, transforms
      from torch.utils.tensorboard import SummaryWriter
      
      # Hyperparameters
      batch_size = 64  # Amount of data for each training batch
      learning_rate = 0.01  # Learning rate
      num_epochs = 20  # Number of training epochs
      
      # Check if a GPU is available
      device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
      
      # Data pre-processing
      transform = transforms.Compose([
          transforms.ToTensor(),
          transforms.Normalize((0.5,), (0.5,))
      ])
      
      train_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=True, download=True, transform=transform)
      val_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=False, download=False, transform=transform)
      
      train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
      val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
      
      
      # Define a simple neural network
      class SimpleCNN(nn.Module):
          def __init__(self):
              super(SimpleCNN, self).__init__()
              # First convolutional layer: 1 input channel (grayscale image), 10 output channels, 5x5 kernel
              self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
              # Second convolutional layer: 10 input channels, 20 output channels, 3x3 kernel
              self.conv2 = nn.Conv2d(10, 20, kernel_size=3)
              # Fully connected layer: Input is 20 × 5 × 5 (feature map size after convolution and pooling), output is 128
              self.fc1 = nn.Linear(20 * 5 * 5, 128)
              # Output layer: 128 -> 10 (corresponding to 10 digit classes)
              self.fc2 = nn.Linear(128, 10)
      
          def forward(self, x):
              # Input x shape: [batch, 1, 28, 28]
              x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # [batch, 10, 12, 12]
              x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # [batch, 20, 5, 5]
              x = x.view(-1, 20 * 5 * 5)  # Flatten to [batch, 500]
              x = F.relu(self.fc1(x))      # [batch, 128]
              x = self.fc2(x)              # [batch, 10]
              return x
      
      
      # Instantiate the model and move it to the GPU if available
      model = SimpleCNN().to(device)
      criterion = nn.CrossEntropyLoss()
      optimizer = optim.SGD(model.parameters(), lr=learning_rate)
      
      # Create a TensorBoard SummaryWriter to visualize the model training process
      writer = SummaryWriter('/mnt/data/output/runs/mnist_experiment')
      
      # Variable to save the model with the highest accuracy
      best_val_accuracy = 0.0
      
      # Train the model and record the loss and accuracy
      for epoch in range(num_epochs):
          model.train()
          for batch_idx, (data, target) in enumerate(train_loader):
              data, target = data.to(device), target.to(device)  # Move the data and target to the GPU
      
              # Zero the gradients
              optimizer.zero_grad()
              # Forward propagation
              output = model(data)
              # Calculate the loss
              loss = criterion(output, target)
              # Backward propagation
              loss.backward()
              # Update the parameters
              optimizer.step()
      
              # Log the training loss to TensorBoard
              if batch_idx % 100 == 0:  # Log every 100 batches
                  writer.add_scalar('Loss/train', loss.item(), epoch * len(train_loader) + batch_idx)
                  print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}')
      
          # Validate the model and record the validation loss and accuracy
          model.eval()
          val_loss = 0
          correct = 0
          with torch.no_grad():  # Do not calculate gradients
              for data, target in val_loader:
                  data, target = data.to(device), target.to(device)  # Move the data and target to the GPU
                  output = model(data)
                  val_loss += criterion(output, target).item()  # Accumulate the validation loss
                  pred = output.argmax(dim=1, keepdim=True)  # Get the predicted label
                  correct += pred.eq(target.view_as(pred)).sum().item()  # Accumulate the number of correct predictions
      
          val_loss /= len(val_loader)  # Calculate the average validation loss
          val_accuracy = 100. * correct / len(val_loader.dataset)  # Calculate the validation accuracy
          print(f'Validation Loss: {val_loss:.4f}, Accuracy: {correct}/{len(val_loader.dataset)} ({val_accuracy:.0f}%)')
      
          # Log the validation loss and accuracy to TensorBoard
          writer.add_scalar('Loss/validation', val_loss, epoch)
          writer.add_scalar('Accuracy/validation', val_accuracy, epoch)
      
          # Save the model with the highest validation accuracy
          if val_accuracy > best_val_accuracy:
              best_val_accuracy = val_accuracy
              torch.save(model.state_dict(), '/mnt/data/output/best_model.pth')
              print(f'Model saved with accuracy: {best_val_accuracy:.2f}%')
      
      # Close the SummaryWriter
      writer.close()
      print('Training complete.')
    2. Upload the code. On the dataset details page, click View Data to open the OSS console. Click Upload Object > Select Files > Upload Object to upload the training code to OSS.

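The feature-map sizes quoted in the comments of mnist_train.py (12 x 12 after the first convolution and pooling block, 5 x 5 after the second) can be checked by hand. The sketch below is an illustrative helper, not part of the tutorial code. It assumes stride-1 convolutions without padding and 2 x 2 max pooling with a stride equal to the kernel size, which is what SimpleCNN uses through the PyTorch defaults.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a convolution (nn.Conv2d defaults: stride 1, no padding)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int) -> int:
    """Output width/height of max pooling whose stride equals its kernel size."""
    return (size - kernel) // kernel + 1


def simple_cnn_shapes(input_size: int = 28):
    """Return (side after block 1, side after block 2, flattened feature count)."""
    side1 = pool_out(conv_out(input_size, kernel=5), kernel=2)  # 28 -> 24 -> 12
    side2 = pool_out(conv_out(side1, kernel=3), kernel=2)       # 12 -> 10 -> 5
    flat = 20 * side2 * side2                                   # conv2 has 20 output channels
    return side1, side2, flat


print(simple_cnn_shapes())  # (12, 5, 500)
```

The flattened count of 500 is the input size of fc1 (20 * 5 * 5). If you change a kernel size or the input resolution, this number changes and fc1 must be adjusted to match.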

Create a DLC job

  1. In the navigation pane on the left of the PAI console, click Deep Learning Containers (DLC) > Create Job.

  2. Configure DLC job parameters. Key parameters (use defaults for others):

    • Image config: Select Image Address and enter the image URL for your Region.

      Region              Image URL
      China (Beijing)     dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
      China (Shanghai)    dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
      China (Hangzhou)    dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
      Other regions       dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
                          (find your region ID and replace <Region ID> to get the full URL)

      This image has been verified to be compatible with the environment in Quick Start for Interactive Modelling (DSW). The standard PAI workflow is to verify the environment and develop the code in DSW first, and then use DLC for training.
    • Mount dataset: Select Custom Dataset and choose the dataset created in the previous step. The default Mount Path is /mnt/data.

    • Startup Command: python /mnt/data/mnist_train.py

      This startup command is the same one you would use in DSW or locally, except for the script path: because the dataset that contains mnist_train.py is mounted at /mnt/data, the path becomes /mnt/data/mnist_train.py.
    • Source: Select Public Resources. For Resource Type, select ecs.gn7i-c8g1.2xlarge.

      If this instance type is unavailable, select another GPU-accelerated instance type.

    Click Confirm to create the job. The job takes about 15 minutes to complete. To monitor the training process, click Logs.

    After the job completes, the best model checkpoint and the TensorBoard logs are saved to the output folder of the mounted dataset.

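To confirm that a finished job wrote what the training code promises, you can look for the checkpoint and the TensorBoard event files under the mount path. The following is a minimal sketch, assuming it runs somewhere the dataset is mounted (for example at /mnt/data). The helper name find_job_outputs is hypothetical. The file names come from mnist_train.py, and events.out.tfevents. is the prefix that SummaryWriter gives its log files.

```python
from pathlib import Path


def find_job_outputs(mount_root, output_dir="output"):
    """Return (checkpoint path or None, sorted list of TensorBoard event files)."""
    out = Path(mount_root) / output_dir
    checkpoint = out / "best_model.pth"
    runs = out / "runs"
    # SummaryWriter writes files named events.out.tfevents.<suffix> inside the run folder.
    event_files = sorted(runs.rglob("events.out.tfevents.*")) if runs.is_dir() else []
    return (checkpoint if checkpoint.is_file() else None), event_files


if __name__ == "__main__":
    ckpt, events = find_job_outputs("/mnt/data")
    print("checkpoint:", ckpt)
    print("event files found:", len(events))
```

Passing output_dir="output_distributed" checks the folder used by the distributed example later in this tutorial.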

(Optional) View TensorBoard

Use the TensorBoard visualization tool to view the loss curves and understand the training details.

Important

To use TensorBoard for a DLC job, you must configure a dataset for the job.

  1. On the DLC job details page, click the TensorBoard tab and then click Create TensorBoard.

  2. Set Configuration Type to By Task. For Summary Path, enter the path where the training code stores its summaries: /mnt/data/output/runs/. Click Confirm to start.

    This corresponds to the code snippet: writer = SummaryWriter('/mnt/data/output/runs/mnist_experiment')
  3. Click View TensorBoard to view the loss curves for the training set (train_loss) and the validation set (validation_loss). In mnist_train.py these are logged under the tags Loss/train and Loss/validation.

    (Optional) Adjust hyperparameters based on the loss graph to improve model performance

    Evaluate the training performance by observing how the loss values change over time:

    • If both train_loss and validation_loss are still decreasing when training ends, this indicates underfitting.

      To address this, increase `num_epochs` (the number of passes over the training set) or slightly increase `learning_rate`, and then retrain.

    • If train_loss continues to decrease while validation_loss starts to increase, this indicates overfitting.

      To address this, decrease `num_epochs` or slightly decrease `learning_rate`, and then retrain.

    • If both train_loss and validation_loss stabilize before training ends, this indicates a good fit.

      If the model has a good fit, proceed to the next steps.
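
The three cases above can be turned into a rough automated check. The sketch below is a heuristic for illustration only, not a feature of PAI or TensorBoard. The function name, the 3-epoch window, and the 0.01 tolerance are all assumptions, and the per-epoch loss lists are values you would collect yourself.

```python
def diagnose_fit(train_loss, val_loss, window=3, tol=0.01):
    """Classify the end-of-training trend from per-epoch loss lists.

    Compares the last value with the value `window` epochs earlier.
    `window` and `tol` are illustrative defaults, not recommended settings.
    """
    if len(train_loss) <= window or len(val_loss) <= window:
        raise ValueError("need more than `window` epochs of losses")
    train_delta = train_loss[-1] - train_loss[-1 - window]
    val_delta = val_loss[-1] - val_loss[-1 - window]
    if train_delta < -tol and val_delta > tol:
        return "overfitting"      # train keeps falling, validation rises
    if train_delta < -tol and val_delta < -tol:
        return "underfitting"     # both still falling at the end
    if abs(train_delta) <= tol and abs(val_delta) <= tol:
        return "good fit"         # both have flattened out
    return "inconclusive"


print(diagnose_fit([1.0, 0.8, 0.6, 0.5, 0.4], [1.0, 0.85, 0.7, 0.6, 0.5]))
```

Real loss curves are noisy, so treat the result as a prompt to look at the graph rather than as a verdict.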

Deploy the trained model

For more information, see Use EAS to deploy the model as an online service.

Single-node multi-GPU or multi-node multi-GPU distributed training

Use distributed training when a single GPU does not have enough memory or when you want to accelerate training.

This example uses two instances with one GPU each. The same approach applies to other single-node multi-GPU or multi-node multi-GPU configurations.
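
To make the "two instances with one GPU each" layout concrete, the sketch below enumerates the worker processes such a job runs. It is plain arithmetic, not a PAI or PyTorch API. The function name is hypothetical, and it assumes the usual torchrun convention with identical nodes: global rank = node rank * processes per node + local rank.

```python
def worker_layout(nnodes: int, nproc_per_node: int):
    """List the rank information of every worker process in the job."""
    world_size = nnodes * nproc_per_node
    return [
        {
            "node_rank": node,
            "local_rank": local,  # index of the GPU on that node
            "rank": node * nproc_per_node + local,  # global rank across the job
            "world_size": world_size,
        }
        for node in range(nnodes)
        for local in range(nproc_per_node)
    ]


for worker in worker_layout(nnodes=2, nproc_per_node=1):
    print(worker)
```

With two nodes and one process each, both workers have local_rank 0 and the global ranks are 0 and 1. A single node with four GPUs would instead give local ranks 0 to 3 on node 0.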

Create a dataset

If you already created a dataset for single-node, single-GPU training, you only need to download mnist_train_distributed.py and upload it to that dataset. Otherwise, create a dataset first and then upload the code.

Single-node multi-GPU or multi-node multi-GPU training code example: mnist_train_distributed.py

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
from torch.utils.tensorboard import SummaryWriter

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=3)
        self.fc1 = nn.Linear(20 * 5 * 5, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, 20 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(local_rank)
    device = torch.device('cuda', local_rank)

    batch_size = 64
    learning_rate = 0.01
    num_epochs = 20

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,))
    ])

    # Only the main process (rank=0) needs to download. Other processes must wait for it to finish.
    # Let processes with rank!=0 wait at the barrier first.
    if rank != 0:
        dist.barrier()

    # All processes execute the dataset creation.
    # However, only the process with rank=0 will actually perform the download.
    train_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=True, download=(rank == 0), transform=transform)

    # After the process with rank=0 finishes downloading, it also reaches the barrier, releasing all processes.
    if rank == 0:
        dist.barrier()

    # At this point, all processes are synchronized and can continue executing the subsequent code.
    val_dataset = datasets.MNIST(root='/mnt/data/dataSet', train=False, download=False, transform=transform)

    train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, num_workers=4, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)

    model = SimpleCNN().to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)

    if rank == 0:
        writer = SummaryWriter('/mnt/data/output_distributed/runs/mnist_experiment')
    best_val_accuracy = 0.0

    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0:
                # Each rank and local_rank prints its own loss.
                print(f"Rank: {rank}, Local_Rank: {local_rank} -- Train Epoch: {epoch} "
                      f"[{batch_idx * len(data) * world_size}/{len(train_loader.dataset)} "
                      f"({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}")

                if rank == 0:
                    writer.add_scalar('Loss/train', loss.item(), epoch * len(train_loader) + batch_idx)

        # Validation
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)
                output = model(data)
                val_loss += criterion(output, target).item() * data.size(0)
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.view_as(pred)).sum().item()
                total += target.size(0)
        val_loss_tensor = torch.tensor([val_loss], dtype=torch.float32, device=device)
        correct_tensor = torch.tensor([correct], dtype=torch.float32, device=device)
        total_tensor = torch.tensor([total], dtype=torch.float32, device=device)
        dist.all_reduce(val_loss_tensor, op=dist.ReduceOp.SUM)
        dist.all_reduce(correct_tensor, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_tensor, op=dist.ReduceOp.SUM)

        val_loss = val_loss_tensor.item() / total_tensor.item()
        val_accuracy = 100. * correct_tensor.item() / total_tensor.item()

        if rank == 0:
            print(f'Validation Loss: {val_loss:.4f}, Accuracy: {int(correct_tensor.item())}/{int(total_tensor.item())} ({val_accuracy:.0f}%)')
            writer.add_scalar('Loss/validation', val_loss, epoch)
            writer.add_scalar('Accuracy/validation', val_accuracy, epoch)
            if val_accuracy > best_val_accuracy:
                best_val_accuracy = val_accuracy
                torch.save(model.module.state_dict(), '/mnt/data/output_distributed/best_model.pth')
                print(f'Model saved with accuracy: {best_val_accuracy:.2f}%')
    if rank == 0:
        writer.close()
    dist.destroy_process_group()
    if rank == 0:
        print('Training complete.')


if __name__ == "__main__":
    main()
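
DistributedSampler is what keeps the workers from training on the same samples. The following is a simplified pure-Python imitation of its index assignment, for illustration only: it omits shuffling and the drop_last option, and the function name is hypothetical. It pads the index list by wrapping around so that every rank receives the same number of samples, then gives each rank every world_size-th index.

```python
import math


def shard_indices(dataset_len: int, world_size: int, rank: int):
    """Indices one rank would see in an epoch (no shuffling, no drop_last)."""
    num_samples = math.ceil(dataset_len / world_size)  # samples per rank
    total_size = num_samples * world_size
    indices = list(range(dataset_len))
    indices += indices[: total_size - dataset_len]     # wrap-around padding
    return indices[rank:total_size:world_size]         # strided slice per rank


# The MNIST training set has 60000 images; the tutorial uses 2 workers, batch size 64.
rank0 = shard_indices(60000, world_size=2, rank=0)
rank1 = shard_indices(60000, world_size=2, rank=1)
batches_per_rank = math.ceil(len(rank0) / 64)
print(len(rank0), len(rank1), batches_per_rank)  # 30000 30000 469
```

Because each worker iterates over only its own shard, len(train_loader) in mnist_train_distributed.py is 469 rather than the 938 batches of the single-GPU script, which is why the progress message multiplies the sample count by world_size.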

Create a DLC job

  1. In the navigation pane on the left of the PAI console, click Deep Learning Containers (DLC) > Create Job.

  2. Configure DLC job parameters. Key parameters (use defaults for others):

    • Image config: Select Image Address and enter the image URL for your Region.

      Region              Image URL
      China (Beijing)     dsw-registry-vpc.cn-beijing.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
      China (Shanghai)    dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
      China (Hangzhou)    dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
      Other regions       dsw-registry-vpc.<Region ID>.cr.aliyuncs.com/pai/modelscope:1.28.0-pytorch2.3.1tensorflow2.16.1-gpu-py311-cu121-ubuntu22.04
                          (find your region ID and replace <Region ID> to get the full URL)

      This image has been verified to be compatible with the environment in Quick Start for Interactive Modelling (DSW). The standard PAI workflow is to verify the environment and the code in DSW first, and then use DLC for training.
    • Mount dataset: Select Custom Dataset and choose the dataset created in the previous step. The default Mount Path is /mnt/data.

    • Startup Command: torchrun --nproc_per_node=1 --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} /mnt/data/mnist_train_distributed.py

      DLC automatically injects common environment variables such as MASTER_ADDR and WORLD_SIZE. Reference them in the startup command as $VARIABLE_NAME.
    • Source: Select Public Resources. Set Quantity to 2. For Resource Type, select ecs.gn7i-c8g1.2xlarge.

      If this instance type is unavailable, select another GPU-accelerated instance type.

    Click Confirm to create the job. The job takes about 10 minutes to complete. On the Overview page, you can view the training logs for both instances.

    After the job completes, the best model checkpoint and the TensorBoard logs are saved to the output_distributed folder of the mounted dataset.

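Inside the training script, the rank information arrives as environment variables. mnist_train_distributed.py reads RANK, WORLD_SIZE, and LOCAL_RANK with os.environ[...], which raises KeyError if the script is started with plain python instead of torchrun. The sketch below shows a more forgiving reader, assuming you want single-process defaults for local debugging. The helper name and the default values are assumptions, not DLC behavior.

```python
import os


def read_dist_env(environ=None):
    """Read rank settings, falling back to a single-process layout when unset."""
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", "0")),              # global rank of this process
        "world_size": int(env.get("WORLD_SIZE", "1")),  # total number of processes
        "local_rank": int(env.get("LOCAL_RANK", "0")),  # GPU index on this node
    }


print(read_dist_env({"RANK": "1", "WORLD_SIZE": "2", "LOCAL_RANK": "0"}))
```

torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every worker process it launches, so under the startup command above the defaults are never used. They matter only when the file is run directly.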

(Optional) View TensorBoard

Use the TensorBoard visualization tool to view the loss curves and understand the training details.

Important

To use TensorBoard for a DLC job, you must configure a dataset for the job.

  1. On the DLC job details page, click the TensorBoard tab and then click Create TensorBoard.

  2. Set Configuration Type to By Task. For Summary Path, enter the path where the training code stores its summaries: /mnt/data/output_distributed/runs. Click Confirm to start.

    This corresponds to the code snippet: writer = SummaryWriter('/mnt/data/output_distributed/runs/mnist_experiment')
  3. Click View TensorBoard to view the loss curves for the training set (train_loss) and the validation set (validation_loss). In mnist_train_distributed.py these are logged under the tags Loss/train and Loss/validation.

    (Optional) Adjust hyperparameters based on the loss graph to improve model performance

    You can evaluate the model's training performance by observing the trend of the loss value:

    • If both train_loss and validation_loss are still decreasing when the training ends, this indicates underfitting.

      To address this, you can increase `num_epochs` (the number of passes over the training set) or slightly increase `learning_rate` and then retrain the model.

    • If train_loss continues to decrease while validation_loss starts to increase, this indicates overfitting.

      To address this, you can decrease `num_epochs` or slightly decrease `learning_rate` and then retrain the model.

    • If both train_loss and validation_loss stabilize before the training ends, this indicates a good fit.

      If the model has a good fit, you can proceed to the next steps.
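
The validation curves of the distributed job are computed from sums that are reduced across all workers before dividing. The sketch below reproduces that arithmetic in plain Python, with a list of per-worker tuples standing in for dist.all_reduce with ReduceOp.SUM. The function name and the sample numbers are made up for illustration.

```python
def aggregate_validation(per_rank):
    """Combine per-worker (summed_loss, correct, total) tuples into global metrics.

    summed_loss is the sum over batches of (mean batch loss * batch size),
    matching how the distributed script accumulates val_loss.
    """
    loss_sum = sum(r[0] for r in per_rank)
    correct = sum(r[1] for r in per_rank)
    total = sum(r[2] for r in per_rank)
    return loss_sum / total, 100.0 * correct / total


# Two workers that each evaluated 10000 validation images.
val_loss, val_accuracy = aggregate_validation([(500.0, 9800, 10000), (500.0, 9800, 10000)])
print(f"{val_loss:.4f} {val_accuracy:.1f}")
```

In mnist_train_distributed.py the validation DataLoader has no DistributedSampler, so every worker evaluates the full 10000-image validation set. The reduced total is therefore world_size * 10000, but the loss and accuracy ratios are unaffected.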

Deploy the trained model

For more information, see Use EAS to deploy the model as an online service.

References