This topic describes how to submit an AutoML experiment that runs hyperparameter tuning on DLC compute resources. The example uses PyTorch: the training script loads MNIST through the torchvision.datasets.MNIST module, which downloads the dataset automatically, and the experiment trains the model with different hyperparameter values to find the best-performing configuration. Three training modes are available: standalone, distributed, and nested parameter training.
Prerequisites
-
If you are using AutoML for the first time, grant the required permissions. For more information, see Cloud product dependencies and authorization: AutoML.
-
You have granted the required permissions to DLC. For more information, see Cloud product dependencies and authorization: DLC.
-
You have created a workspace and associated it with a public resource group. For more information, see Create and manage workspaces.
-
You have activated Object Storage Service (OSS) and created an OSS bucket. For more information, see Quick start.
Step 1: Create a dataset
-
Upload the script file mnist.py to your OSS bucket. For more information, see Quick start.
-
Create an OSS dataset to store data files from the experiment. For more information, see Create and manage datasets.
Configure the following key parameters and leave the other parameters at their default settings:
-
Dataset Name: Enter a custom dataset name.
-
Select Data Storage: Select the OSS directory where the script file is stored.
-
Property: Select Property.
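The startup commands used later in this topic pass --save_model, --batch_size, --lr and, in some modes, --data_dir and --gamma to mnist.py, and the trial settings look for a line of the form validation: accuracy=<value> in standard output. The following is a minimal sketch of the command-line surface that such a script might expose. It is not the actual mnist.py: the default values, the helper names, and the four-decimal metric format are assumptions for illustration, and the training loop is omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for a script like mnist.py (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="MNIST training sketch")
    parser.add_argument("--data_dir", default="./data",
                        help="directory that receives the downloaded dataset")
    parser.add_argument("--save_model", default="./model",
                        help="output path for the trained model")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--gamma", type=float, default=0.7)
    # In PyTorch 1.x, torch.distributed.launch passes --local_rank to the
    # script by default, so the parser accepts it.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser


def format_metric(accuracy: float) -> str:
    """Build the stdout line that the metric key is expected to match."""
    return f"validation: accuracy={accuracy:.4f}"


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--batch_size", "32", "--lr", "0.001"])
print(args.batch_size, args.lr)
print(format_metric(0.9876))
```

A real script would call format_metric (or an equivalent print statement) after each validation pass so that the metric can be read from the trial's standard output.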
Step 2: Create an experiment
Go to the New page and configure the key parameters as described in the following steps. For more information about other parameters, see Create an experiment. After you complete the configuration, click Submit.
-
Configure the execution settings.
This solution provides three training modes: standalone training, distributed training, and nested parameter training.
Standalone training
Parameter
Description
Job Type
Select DLC.
Resource Group
Select Public Resource Group.
Framework
Select PyTorch.
Datasets
Select the dataset that you created in Step 1.
Node Image
Select PAI Image > pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04
Instance Type
Select CPU > ecs.g6.4xlarge
Nodes
Set to 1.
Startup Command
Set to the following command:
python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}
Hyperparameters
-
batch_size
-
Constraint type: Select choice.
-
Search space: Add three enumerated values: 16, 32, and 64.
-
lr
-
Constraint type: Select choice.
-
Search space: Add three enumerated values: 0.0001, 0.001, and 0.01.
This configuration defines nine hyperparameter combinations (3 batch_size values × 3 lr values). The experiment creates one trial for each combination that the search algorithm samples, up to the Maximum trials limit.
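The nine combinations and the placeholder substitution in the startup command can be illustrated with a short sketch. It uses Python's string.Template only because that class happens to accept the same ${name} syntax. This is not how AutoML itself is implemented, and the exp_id and trial_id values are hypothetical.

```python
from itertools import product
from string import Template

# Startup command from the standalone configuration.
COMMAND = Template(
    "python3 /mnt/data/mnist.py "
    "--save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} "
    "--batch_size=${batch_size} --lr=${lr}"
)

# Enumerated values from the Hyperparameters section.
SEARCH_SPACE = {
    "batch_size": [16, 32, 64],
    "lr": [0.0001, 0.001, 0.01],
}


def enumerate_combinations(space):
    """Return every combination of the enumerated values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


def render(params, exp_id, trial_id):
    """Fill the ${...} placeholders for one trial."""
    return COMMAND.substitute(exp_id=exp_id, trial_id=trial_id, **params)


combos = enumerate_combinations(SEARCH_SPACE)
print(len(combos))
# "exp-demo" and "trial-0" are made-up identifiers.
print(render(combos[0], exp_id="exp-demo", trial_id="trial-0"))
```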
Distributed training
Parameter
Description
Job Type
Select DLC.
Resource Group
Select Public Resource Group.
Framework
Select PyTorch.
Datasets
Select the dataset that you created in Step 1.
Node Image
Select PAI Image > pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04
Instance Type
Select CPU > ecs.g6.4xlarge
Nodes
Set to 3.
Startup Command
Set to the following command:
python -m torch.distributed.launch --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK /mnt/data/mnist.py --data_dir=/mnt/data/examples/search/data --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}
Hyperparameters
-
batch_size
-
Constraint type: Select choice.
-
Search space: Add three enumerated values: 16, 32, and 64.
-
lr
-
Constraint type: Select choice.
-
Search space: Add three enumerated values: 0.0001, 0.001, and 0.01.
This configuration defines nine hyperparameter combinations (3 batch_size values × 3 lr values). The experiment creates one trial for each combination that the search algorithm samples, up to the Maximum trials limit.
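The distributed startup command forwards MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK to the launcher. As a rough sketch of how a training script could turn variables like these into rendezvous settings, the following function builds keyword arguments that match the signature of torch.distributed.init_process_group. The function name, the fallback defaults, the sample address, and the choice of the gloo backend (picked here because the instance type is CPU) are assumptions, and the actual mnist.py may initialize the process group differently.

```python
import os


def read_dist_env(env=None):
    """Collect rendezvous settings from environment variables.

    Returns keyword arguments that could be passed to
    torch.distributed.init_process_group. The "gloo" backend and the
    fallback defaults are assumptions for illustration.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK={rank} is outside 0..{world_size - 1}")
    addr = env.get("MASTER_ADDR", "127.0.0.1")
    port = int(env.get("MASTER_PORT", "29500"))
    return {
        "backend": "gloo",
        "init_method": f"tcp://{addr}:{port}",
        "rank": rank,
        "world_size": world_size,
    }


# Made-up values for a three-node job, seen from the second node.
example_env = {
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "3",
    "RANK": "1",
}
kwargs = read_dist_env(example_env)
print(kwargs)
# With PyTorch installed, a script could then call:
#   import torch.distributed as dist
#   dist.init_process_group(**kwargs)
```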
Nested parameter training
Parameter
Description
Job Type
Select DLC.
Resource Group
Select Public Resource Group.
Framework
Select PyTorch.
Datasets
Select the dataset that you created in Step 1.
Node Image
Select PAI Image > pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04
Instance Type
Select CPU > ecs.g6.4xlarge
Nodes
Set to 1.
Startup Command
Set to the following command:
python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${nested_params}.{batch_size} --lr=${nested_params}.{lr} --gamma=${gamma}
Hyperparameters
-
nested_params
-
Constraint type: Select choice.
-
Search space: Add the following two enumerated values:
{"_name":"large","{lr}":{"_type":"choice","_value":[0.02,0.2]},"{batch_size}":{"_type":"choice","_value":[256,128]}}
{"_name":"small","{lr}":{"_type":"choice","_value":[0.01,0.1]},"{batch_size}":{"_type":"choice","_value":[64,32]}}
-
gamma
-
Constraint type: Select choice.
-
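Search space: Add three enumerated values: 0.8, 0.7, and 0.9.
This configuration defines 24 hyperparameter combinations: each of the two nested_params groups contributes 2 lr values × 2 batch_size values, and each of the resulting eight pairs can be combined with any of the 3 gamma values. The experiment creates one trial for each combination that the search algorithm samples, up to the Maximum trials limit.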
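The nested search space can be expanded by hand to see which combinations it allows. The following sketch walks the two nested_params groups, pairs each group's own lr and batch_size values, and combines them with gamma, so that values from the large and small groups are never mixed. The function names and the output format are hypothetical. AutoML performs its own substitution of ${nested_params}.{batch_size} and ${nested_params}.{lr}, and this code only mirrors the structure of the search space.

```python
from itertools import product

# The two enumerated values of nested_params, as entered in the search space.
NESTED_PARAMS = [
    {"_name": "large",
     "{lr}": {"_type": "choice", "_value": [0.02, 0.2]},
     "{batch_size}": {"_type": "choice", "_value": [256, 128]}},
    {"_name": "small",
     "{lr}": {"_type": "choice", "_value": [0.01, 0.1]},
     "{batch_size}": {"_type": "choice", "_value": [64, 32]}},
]
GAMMA = [0.8, 0.7, 0.9]


def expand_nested(groups, gammas):
    """Enumerate every (group, lr, batch_size, gamma) combination."""
    combos = []
    for group in groups:
        lrs = group["{lr}"]["_value"]
        batch_sizes = group["{batch_size}"]["_value"]
        for lr, batch_size, gamma in product(lrs, batch_sizes, gammas):
            combos.append({"group": group["_name"], "lr": lr,
                           "batch_size": batch_size, "gamma": gamma})
    return combos


def render_args(combo):
    """Hypothetical rendering of the three tuned flags for one combination."""
    return (f"--batch_size={combo['batch_size']} "
            f"--lr={combo['lr']} --gamma={combo['gamma']}")


combos = expand_nested(NESTED_PARAMS, GAMMA)
print(len(combos))
print(render_args(combos[0]))
```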
-
Configure the trial settings.
Parameter
Description
Metric
Metric type
Select stdout. This setting extracts the final metric from the standard output (stdout) during the run.
Method
Select best.
Metric weight
Configure the following items:
-
Key: validation: accuracy=([0-9\\.]+) (a regular expression; its capture group extracts the metric value)
-
Value: 1
Metric source
Set the command keyword to cmd1.
Optimization direction
Select Maximize.
Model storage path
Set to the OSS path where the model is saved. This solution uses the following path:
oss://examplebucket/examples/model/model_${exp_id}_${trial_id}
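To make the metric settings concrete, the following sketch applies the key's regular expression to captured standard output and reduces the matches to one value. Several points are assumptions rather than documented behavior: the doubled backslash in the key is treated as an escaping convention for a single backslash, the best method is interpreted as the maximum matched value (consistent with the Maximize direction), and the weight is applied as a simple multiplier. The sample log is invented.

```python
import re

# Single-backslash form of the key; treating the doubled backslash in the
# console field as an escaping convention is an assumption.
METRIC_PATTERN = re.compile(r"validation: accuracy=([0-9\.]+)")


def extract_metric(stdout_text, weight=1.0):
    """Return weight * best matched value, or None if nothing matches.

    "Best" is interpreted as the maximum, which is an assumption.
    """
    values = []
    for raw in METRIC_PATTERN.findall(stdout_text):
        cleaned = raw.rstrip(".")  # drop a sentence-ending period, if any
        try:
            values.append(float(cleaned))
        except ValueError:
            continue  # skip matches that are not valid numbers, such as "1.2.3"
    if not values:
        return None
    return weight * max(values)


# Invented sample of what a trial might print.
sample_log = """epoch 1
validation: accuracy=0.91
epoch 2
validation: accuracy=0.97
epoch 3
validation: accuracy=0.95
"""
print(extract_metric(sample_log))  # prints 0.97
```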
-
Configure the search settings.
Parameter
Description
Search algorithm
Select TPE. For more information about the algorithm, see Supported search algorithms.
Maximum trials
Set to 3.
Maximum concurrent trials
Set to 2.
Enable early stopping
Turn on this switch to stop a trial early if it performs poorly.
Start step
Set to 5. A trial can be stopped early only after it completes at least five evaluations.
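The role of the start step can be sketched as a guard in front of whatever early-stopping rule is in use. The source does not state which rule AutoML applies, so the median comparison below is a hypothetical stand-in, as are the function name and the sample values. Only the guard, which refuses to stop a trial that has reported fewer than five evaluations, reflects the setting described above.

```python
from statistics import median


def should_stop_early(history, peer_bests, start_step=5):
    """Decide whether a running trial may be stopped.

    history: metric values reported so far by this trial (higher is better).
    peer_bests: best metric reached by each of the other trials.
    The median comparison is a hypothetical criterion for illustration.
    """
    if len(history) < start_step:
        return False  # fewer evaluations than the start step: never stop
    if not peer_bests:
        return False  # nothing to compare against
    return max(history) < median(peer_bests)


peers = [0.90, 0.95]
print(should_stop_early([0.2, 0.3, 0.3, 0.3], peers))        # prints False
print(should_stop_early([0.2, 0.3, 0.3, 0.3, 0.3], peers))   # prints True
print(should_stop_early([0.2, 0.3, 0.3, 0.3, 0.96], peers))  # prints False
```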
Step 3: View experiment details and results
-
In the experiment list, click the experiment name to open the Experiment Details page.
The Experiment Details page has four main areas. The Basic configuration area displays the experiment ID, name, visibility, status, creator, creation time, and update time. The Trial status statistics area shows a donut chart with the number of trials that are completed, failed, running, or in other states. The Trial configuration area includes the metric type, calculation method, metric weight regular expression, metric source, and model storage path. Finally, the Search configuration area includes the search algorithm, maximum trials, maximum concurrent trials, optimization direction, and early stopping settings.
On this page, you can monitor the progress and status of all trials. The experiment creates three trials based on your search algorithm and maximum trial settings.
-
Click the Trials tab. You can view a list of all trials generated for the experiment, including the status, final metric, and hyperparameter combination for each trial.
Related documents
-
You can also submit a hyperparameter tuning experiment on MaxCompute compute resources. For more information, see Best practices for MaxCompute k-means clustering.
-
For more information about how to use AutoML and how it works, see AutoML.