Platform for AI: Best practices for DLC MNIST training

Last Updated: Jun 21, 2026

This topic shows you how to submit an AutoML experiment for hyperparameter tuning on DLC compute resources. Using the PyTorch framework and the torchvision.datasets.MNIST module, this solution automatically downloads and loads the MNIST dataset, then trains a model to find the optimal hyperparameter configuration. You can choose from three training modes: standalone, distributed, and nested parameter training.

Prerequisites

Step 1: Create a dataset

  1. Upload the script file mnist.py to your OSS bucket. For more information, see Quick start.

  2. Create an OSS dataset to store data files from the experiment. For more information, see Create and manage datasets.

    Configure the following key parameters and leave the other parameters at their default settings:

    • Dataset Name: Enter a custom dataset name.

    • Select Data Storage: Select the OSS directory where the script file is stored.

    • Property: Select Property.

Step 2: Create an experiment

Go to the New page and configure the key parameters as described in the following steps. For more information about other parameters, see Create an experiment. After you complete the configuration, click Submit.

  1. Configure the execution settings.

    This solution provides three training modes: standalone training, distributed training, and nested parameter training.

    Standalone training

    • Job Type: Select DLC.

    • Resource Group: Select Public Resource Group.

    • Framework: Select PyTorch.

    • Datasets: Select the dataset that you created in Step 1.

    • Node Image: Select PAI Image > pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04.

    • Instance Type: Select CPU > ecs.g6.4xlarge.

    • Nodes: Set to 1.

    • Startup Command: Set to the following command:

      python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}

    • Hyperparameters: Configure the following hyperparameters:

      • batch_size

        • Constraint type: Select choice.

        • Search space: Add three enumerated values: 16, 32, and 64.

      • lr

        • Constraint type: Select choice.

        • Search space: Add three enumerated values: 0.0001, 0.001, and 0.01.

    This search space contains nine hyperparameter combinations (3 batch_size values × 3 lr values). Each trial runs one combination, and the number of trials is capped by the Maximum trials setting.
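The nine combinations and the way a combination fills the startup command can be sketched in a few lines of Python. This is only an illustration of the placeholder substitution, not how the AutoML service itself is implemented. The `exp_id` and `trial_id` values used below are hypothetical, and `string.Template` is used simply because it shares the `${name}` placeholder syntax.

```python
from itertools import product
from string import Template

# Search space copied from the Hyperparameters setting above.
search_space = {
    "batch_size": [16, 32, 64],
    "lr": [0.0001, 0.001, 0.01],
}

# Startup command for standalone training, with its ${...} placeholders.
command = Template(
    "python3 /mnt/data/mnist.py "
    "--save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} "
    "--batch_size=${batch_size} --lr=${lr}"
)


def expand_grid(space):
    """Return every combination of the enumerated values as a dict."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


def render(params, exp_id, trial_id):
    """Fill the command placeholders for one trial (IDs are hypothetical)."""
    return command.substitute(params, exp_id=exp_id, trial_id=trial_id)


combinations = expand_grid(search_space)

if __name__ == "__main__":
    print(len(combinations))
    print(render(combinations[0], exp_id="exp-demo", trial_id=1))
```

Each rendered command differs only in the model path suffix and the two hyperparameter flags, which is why the trials can be compared directly.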

    Distributed training

    • Job Type: Select DLC.

    • Resource Group: Select Public Resource Group.

    • Framework: Select PyTorch.

    • Datasets: Select the dataset that you created in Step 1.

    • Node Image: Select PAI Image > pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04.

    • Instance Type: Select CPU > ecs.g6.4xlarge.

    • Nodes: Set to 3.

    • Startup Command: Set to the following command:

      python -m torch.distributed.launch --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK /mnt/data/mnist.py --data_dir=/mnt/data/examples/search/data --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}

    • Hyperparameters: Configure the following hyperparameters:

      • batch_size

        • Constraint type: Select choice.

        • Search space: Add three enumerated values: 16, 32, and 64.

      • lr

        • Constraint type: Select choice.

        • Search space: Add three enumerated values: 0.0001, 0.001, and 0.01.

    This search space contains nine hyperparameter combinations (3 batch_size values × 3 lr values). Each trial runs one combination, and the number of trials is capped by the Maximum trials setting.
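The distributed startup command takes its launcher flags from the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables on each node. The sketch below shows how the launcher part of that command might expand on one of the three nodes. The helper names, the fallback defaults, and the host name and port in the usage section are assumptions made for illustration only.

```python
import os


def read_dist_env(environ=None):
    """Read the launcher variables referenced by the startup command.

    The fallback values are assumptions so the sketch also runs outside a cluster.
    """
    env = os.environ if environ is None else environ
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }


def build_launch_args(cfg, script, script_args):
    """Mirror the launcher flags of the startup command as an argv list."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--master_addr={cfg['master_addr']}",
        f"--master_port={cfg['master_port']}",
        "--nproc_per_node=1",               # one process per node, as in the command
        f"--nnodes={cfg['world_size']}",    # the command passes WORLD_SIZE as the node count
        f"--node_rank={cfg['rank']}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Hypothetical values for the second of three nodes.
    cfg = read_dist_env({
        "MASTER_ADDR": "dlc-master-0",
        "MASTER_PORT": "23456",
        "WORLD_SIZE": "3",
        "RANK": "1",
    })
    print(" ".join(build_launch_args(cfg, "/mnt/data/mnist.py", ["--batch_size=32", "--lr=0.001"])))
```

All nodes receive the same master address, port, and node count. Only the node rank differs from node to node.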

    Nested parameter training

    • Job Type: Select DLC.

    • Resource Group: Select Public Resource Group.

    • Framework: Select PyTorch.

    • Datasets: Select the dataset that you created in Step 1.

    • Node Image: Select PAI Image > pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04.

    • Instance Type: Select CPU > ecs.g6.4xlarge.

    • Nodes: Set to 1.

    • Startup Command: Set to the following command:

      python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${nested_params}.{batch_size} --lr=${nested_params}.{lr} --gamma=${gamma}

    • Hyperparameters: Configure the following hyperparameters:

      • nested_params

        • Constraint type: Select choice.

        • Search space: Add the following two enumerated values:

          {"_name":"large","{lr}":{"_type":"choice","_value":[0.02,0.2]},"{batch_size}":{"_type":"choice","_value":[256,128]}}

          {"_name":"small","{lr}":{"_type":"choice","_value":[0.01,0.1]},"{batch_size}":{"_type":"choice","_value":[64,32]}}

      • gamma

        • Constraint type: Select choice.

        • Search space: Add three enumerated values: 0.8, 0.7, and 0.9.

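    This search space contains 24 hyperparameter combinations (2 nested groups × 2 lr values × 2 batch_size values × 3 gamma values). Each trial runs one combination, and the number of trials is capped by the Maximum trials setting.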
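To see how a nested search space unfolds, the two nested_params values and the gamma values can be enumerated directly. The sketch below assumes the usual nested-choice reading: first one group ("large" or "small") is picked, then one value for each sub-parameter inside that group. The `expand_nested` helper is hypothetical and is not part of any PAI SDK.

```python
import json
from itertools import product

# The two enumerated values of nested_params, copied from the search space above.
nested_params = [
    json.loads('{"_name":"large","{lr}":{"_type":"choice","_value":[0.02,0.2]},'
               '"{batch_size}":{"_type":"choice","_value":[256,128]}}'),
    json.loads('{"_name":"small","{lr}":{"_type":"choice","_value":[0.01,0.1]},'
               '"{batch_size}":{"_type":"choice","_value":[64,32]}}'),
]
gamma_values = [0.8, 0.7, 0.9]


def expand_nested(groups, gammas):
    """Enumerate every (group, sub-parameter values, gamma) combination."""
    combos = []
    for group in groups:
        # Every key except "_name" is a sub-parameter with its own choice list.
        sub_names = [key for key in group if key != "_name"]
        sub_values = [group[key]["_value"] for key in sub_names]
        for values, gamma in product(product(*sub_values), gammas):
            combo = {"_name": group["_name"], "gamma": gamma}
            # "{lr}" -> "lr", matching the ${nested_params}.{lr} placeholder.
            combo.update({name.strip("{}"): value for name, value in zip(sub_names, values)})
            combos.append(combo)
    return combos


combinations = expand_nested(nested_params, gamma_values)

if __name__ == "__main__":
    print(len(combinations))
    print(combinations[0])
```

Under this reading, the "large" group never combines with the small batch sizes, which is the point of nesting: lr and batch_size are searched together as a matched pair of ranges.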

  2. Configure the trial settings.

    • Metric

      • Metric type: Select stdout. This setting extracts the final metric from the standard output (stdout) during the run.

      • Method: Select best.

      • Metric weight: Configure the following items:

        • Key: validation: accuracy=([0-9\\.]+)

        • Value: 1

      • Metric source: Set the command keyword to cmd1.

      • Optimization direction: Select Maximize.

    • Model storage path: Set to the OSS path for saving the model. This solution uses oss://examplebucket/examples/model/model_${exp_id}_${trial_id}.
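The metric settings above amount to scanning stdout with the regular expression from the Metric weight key and keeping the best captured value. A minimal sketch of that idea follows. The log lines are hypothetical examples of what a training script might print, and the function is an illustration, not the service's actual extraction code.

```python
import re

# Regular expression from the Metric weight key, written as a Python raw string.
METRIC_PATTERN = re.compile(r"validation: accuracy=([0-9\\.]+)")


def extract_best_metric(stdout_text, weight=1.0, maximize=True):
    """Return the weighted best metric found in stdout, or None if nothing matches.

    Assumes every captured group is a plain decimal number such as 0.9856.
    """
    values = [float(match) for match in METRIC_PATTERN.findall(stdout_text)]
    if not values:
        return None
    best = max(values) if maximize else min(values)
    return weight * best


if __name__ == "__main__":
    # Hypothetical stdout from one trial.
    sample_stdout = (
        "epoch 1 validation: accuracy=0.9712\n"
        "epoch 2 validation: accuracy=0.9856\n"
        "epoch 3 validation: accuracy=0.9831\n"
    )
    print(extract_best_metric(sample_stdout))
```

For the pattern to match, the training script has to print the metric in exactly the form "validation: accuracy=<number>".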

  3. Configure the search settings.

    • Search algorithm: Select TPE. For more information about the algorithm, see Supported search algorithms.

    • Maximum trials: Set to 3.

    • Maximum concurrent trials: Set to 2.

    • Enable early stopping: Turn on this switch to stop a trial early if it performs poorly.

    • Start step: Set to 5. A trial can be stopped early only after it completes at least five evaluations.
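The role of Start step can be illustrated with a small early-stopping check. This topic does not say which rule decides that a trial performs poorly, so the sketch below uses a median stopping rule as a stand-in. The function name and all metric values are hypothetical.

```python
from statistics import median


def should_stop_early(trial_history, peer_histories, start_step=5):
    """Median stopping rule sketch for a metric that is maximized.

    trial_history: metric values reported so far by the running trial.
    peer_histories: metric histories of other trials in the experiment.
    """
    step = len(trial_history)
    if step < start_step:
        # Before Start step, a trial is never stopped early.
        return False
    # Best value each peer had reached after the same number of evaluations.
    peer_best = [max(history[:step]) for history in peer_histories if len(history) >= step]
    if not peer_best:
        return False
    return max(trial_history) < median(peer_best)


if __name__ == "__main__":
    peers = [[0.90, 0.93, 0.95, 0.96, 0.97], [0.85, 0.88, 0.90, 0.91, 0.92]]
    print(should_stop_early([0.30, 0.31, 0.31, 0.32], peers))        # only four evaluations
    print(should_stop_early([0.30, 0.31, 0.31, 0.32, 0.32], peers))  # five evaluations
```

Whatever rule is actually used, the start step gives each trial a warm-up period so that a slow start alone does not end it.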

Step 3: View experiment details and results

  1. In the experiment list, click the experiment name to open the Experiment Details page.

    The Experiment Details page has four main areas:

    • Basic configuration: Displays the experiment ID, name, visibility, status, creator, creation time, and update time.

    • Trial status statistics: Shows a donut chart with the number of trials that are completed, failed, running, or in other states.

    • Trial configuration: Includes the metric type, calculation method, metric weight regular expression, metric source, and model storage path.

    • Search configuration: Includes the search algorithm, maximum trials, maximum concurrent trials, optimization direction, and early stopping settings.

    On this page, you can monitor the progress and status of all trials. The experiment creates three trials based on your search algorithm and maximum trial settings.

  2. Click the Trials tab. You can view a list of all trials generated for the experiment, including the status, final metric, and hyperparameter combination for each trial.
