Submit a hyperparameter tuning experiment on DLC compute resources - Platform For AI

Prerequisites

You have granted the required permissions to AutoML. This is required the first time you use AutoML. For more information, see Cloud product dependencies and authorization: AutoML.
You have granted the required permissions to DLC. For more information, see Cloud product dependencies and authorization: DLC.
You have created a workspace and associated it with a public resource group. For more information, see Create and manage workspaces.
You have activated Object Storage Service (OSS) and created an OSS bucket. For more information, see Quick start.

Step 1: Create a dataset

Upload the script file mnist.py to your OSS bucket. For more information, see Quick start.
Create an OSS dataset to store data files from the experiment. For more information, see Create and manage datasets.

Configure the following key parameters and leave the other parameters at their default settings:
- Dataset Name: Enter a custom dataset name.
- Select Data Storage: Select the OSS directory where the script file is stored.
- Property: Select Property.
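
The contents of mnist.py are not shown in this topic. As a rough sketch of the interface such a script would need, the following example parses the flags that the startup commands in Step 2 pass (`--data_dir`, `--save_model`, `--batch_size`, `--lr`, `--gamma`) and formats a metric line that matches the regular expression used in the trial settings. The default values and the helper names `build_parser` and `format_metric` are assumptions made for illustration, not part of the actual script.

```python
import argparse


def build_parser():
    """Build a parser for the flags used by the startup commands (hypothetical CLI)."""
    parser = argparse.ArgumentParser(description="Hypothetical mnist.py interface")
    # Default values below are illustrative assumptions.
    parser.add_argument("--data_dir", default="/mnt/data/examples/search/data")
    parser.add_argument("--save_model", required=True)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--gamma", type=float, default=0.7)
    return parser


def format_metric(accuracy):
    """Format the line that the stdout metric regular expression is meant to match."""
    return f"validation: accuracy={accuracy:.4f}"


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--save_model=/tmp/model_demo", "--batch_size=16", "--lr=0.01"]
)
print(args.batch_size, args.lr)
print(format_metric(0.9876))
```

In a real training script, the metric line would be printed after each validation pass so that the experiment can read it from stdout.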

Step 2: Create an experiment

Go to the New page and configure the key parameters as described in the following steps. For more information about other parameters, see Create an experiment. After you complete the configuration, click Submit.

Configure the execution settings.

This solution provides three training modes: standalone training, distributed training, and nested parameter training.

Standalone training

Parameter	Description
Job Type	Select DLC.
Resource Group	Select Public Resource Group.
Framework	Select PyTorch.
Datasets	Select the dataset that you created in Step 1.
Node Image	Select PAI Image > `pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04`.
Instance Type	Select CPU > `ecs.g6.4xlarge`.
Nodes	Set to 1.
Startup Command	Set it to `python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}`.
Hyperparameters	Add two hyperparameters. batch_size: for Constraint type, select choice; for Search space, add three enumerated values: 16, 32, and 64. lr: for Constraint type, select choice; for Search space, add three enumerated values: 0.0001, 0.001, and 0.01. This configuration defines nine hyperparameter combinations. Each trial runs one of these combinations; the number of trials that the experiment actually creates is capped by Maximum trials in the search settings.
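
To make the search space concrete, the following sketch enumerates the combinations of the two choice lists and fills the placeholders of the standalone startup command for one of them. The `exp_id` and `trial_id` values are made up for illustration, and the use of `string.Template` is only a stand-in for whatever substitution the platform performs.

```python
from itertools import product
from string import Template

# Search space copied from the Hyperparameters row.
search_space = {
    "batch_size": [16, 32, 64],
    "lr": [0.0001, 0.001, 0.01],
}

# Startup command from the table; ${...} placeholders are filled per trial.
command = Template(
    "python3 /mnt/data/mnist.py "
    "--save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} "
    "--batch_size=${batch_size} --lr=${lr}"
)


def enumerate_combinations(space):
    """Return every combination of the choice lists as a list of dicts."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]


combinations = enumerate_combinations(search_space)

# exp_id and trial_id are hypothetical values chosen for this example.
rendered = command.substitute(
    exp_id="exp-demo", trial_id="trial-0", **combinations[0]
)
print(len(combinations))
print(rendered)
```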

Distributed training

Parameter	Description
Job Type	Select DLC.
Resource Group	Select Public Resource Group.
Framework	Select PyTorch.
Datasets	Select the dataset that you created in Step 1.
Node Image	Select PAI Image > `pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04`.
Instance Type	Select CPU > `ecs.g6.4xlarge`.
Nodes	Set to 3.
Startup Command	Set the command to `python -m torch.distributed.launch --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK /mnt/data/mnist.py --data_dir=/mnt/data/examples/search/data --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}`.
Hyperparameters	Add two hyperparameters. batch_size: for Constraint type, select choice; for Search space, add three enumerated values: 16, 32, and 64. lr: for Constraint type, select choice; for Search space, add three enumerated values: 0.0001, 0.001, and 0.01. This configuration defines nine hyperparameter combinations. Each trial runs one of these combinations; the number of trials that the experiment actually creates is capped by Maximum trials in the search settings.
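
The distributed startup command mixes two kinds of variables: `${batch_size}`, `${lr}`, `${exp_id}`, and `${trial_id}` are experiment placeholders, while `$MASTER_ADDR`, `$MASTER_PORT`, `$WORLD_SIZE`, and `$RANK` are read from each node's environment when the command runs. As a minimal sketch of that second part, the following example assembles the same launcher arguments from an environment mapping. The helper name `build_launch_args` and the sample address and port are assumptions made for illustration.

```python
def build_launch_args(env, script, script_args):
    """Assemble torch.distributed.launch arguments from environment variables."""
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
        "--nproc_per_node=1",
        f"--nnodes={env['WORLD_SIZE']}",
        f"--node_rank={env['RANK']}",
        script,
        *script_args,
    ]


# Illustrative environment for node 0 of a three-node job (values are made up).
example_env = {
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "3",
    "RANK": "0",
}
launch_args = build_launch_args(
    example_env, "/mnt/data/mnist.py", ["--batch_size=16", "--lr=0.001"]
)
print(" ".join(launch_args))
```

On a real node, `os.environ` would be passed instead of `example_env`.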

Nested parameter training

Parameter	Description
Job Type	Select DLC.
Resource Group	Select Public Resource Group.
Framework	Select PyTorch.
Datasets	Select the dataset that you created in Step 1.
Node Image	Select PAI Image > `pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04`.
Instance Type	Select CPU > `ecs.g6.4xlarge`.
Nodes	Set to 1.
Startup Command	Set it to `python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${nested_params}.{batch_size} --lr=${nested_params}.{lr} --gamma=${gamma}`.
Hyperparameters	Add two hyperparameters. nested_params: for Constraint type, select choice; for Search space, add two enumerated values: `{"_name":"large","{lr}":{"_type":"choice","_value":[0.02,0.2]},"{batch_size}":{"_type":"choice","_value":[256,128]}}` and `{"_name":"small","{lr}":{"_type":"choice","_value":[0.01,0.1]},"{batch_size}":{"_type":"choice","_value":[64,32]}}`. gamma: for Constraint type, select choice; for Search space, add three enumerated values: 0.8, 0.7, and 0.9. This configuration defines 24 hyperparameter combinations (2 nested groups × 2 lr values × 2 batch_size values × 3 gamma values). Each trial runs one of these combinations; the number of trials that the experiment actually creates is capped by Maximum trials in the search settings.
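
The nested search space can be easier to read when it is expanded. The following sketch assumes that each nested group is expanded into the Cartesian product of its inner choice lists and then combined with every gamma value. The `render` helper is a guess at how `${nested_params}.{batch_size}` and `${nested_params}.{lr}` might be resolved (by replacing the placeholder plus the braced key with the chosen value); neither helper is taken from the platform.

```python
from itertools import product

# Search space copied from the Hyperparameters row.
nested_params = [
    {
        "_name": "large",
        "{lr}": {"_type": "choice", "_value": [0.02, 0.2]},
        "{batch_size}": {"_type": "choice", "_value": [256, 128]},
    },
    {
        "_name": "small",
        "{lr}": {"_type": "choice", "_value": [0.01, 0.1]},
        "{batch_size}": {"_type": "choice", "_value": [64, 32]},
    },
]
gamma_values = [0.8, 0.7, 0.9]


def expand_group(group):
    """Yield one dict per combination of the inner choices of a nested group."""
    keys = [key for key in group if key != "_name"]
    for values in product(*(group[key]["_value"] for key in keys)):
        yield {"_name": group["_name"], **dict(zip(keys, values))}


def expand(groups, gammas):
    """Combine every expanded nested group entry with every gamma value."""
    return [
        {"nested_params": inner, "gamma": g}
        for group in groups
        for inner in expand_group(group)
        for g in gammas
    ]


def render(command, combo):
    """Fill nested and plain placeholders (assumed substitution scheme)."""
    for key, value in combo["nested_params"].items():
        if key != "_name":
            command = command.replace("${nested_params}." + key, str(value))
    return command.replace("${gamma}", str(combo["gamma"]))


combos = expand(nested_params, gamma_values)
flags = "--batch_size=${nested_params}.{batch_size} --lr=${nested_params}.{lr} --gamma=${gamma}"
print(len(combos))
print(render(flags, combos[0]))
```

Under this reading, the expanded grid has 24 entries: four per nested group, times two groups, times three gamma values.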

Configure the trial settings.

Parameter		Description
Metric	Metric type	Select stdout. This setting extracts the final metric from the standard output (stdout) during the run.
	Method	Select best.
	Metric weight	Add one entry. Key: `validation: accuracy=([0-9\\.]+)`. Value: 1.
	Metric source	Set the command keyword to cmd1.
	Optimization direction	Select Maximize.
Model storage path		Set to the OSS path for saving the model. This solution is configured as `oss://examplebucket/examples/model/model_${exp_id}_${trial_id}`.
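
As an illustration of the stdout metric settings, the following sketch applies the regular expression from the Metric weight key to sample output and picks a value. The sample log lines are made up, and reading the best method with the Maximize direction as "take the largest extracted value" is an assumption about how the platform aggregates the metric.

```python
import re

# Pattern copied from the Metric weight key; the capture group holds the value.
METRIC_PATTERN = re.compile(r"validation: accuracy=([0-9\\.]+)")

# Made-up stdout from a training run, one metric line per epoch.
sample_stdout = """\
epoch 1 train loss=0.412
validation: accuracy=0.9012
epoch 2 train loss=0.233
validation: accuracy=0.9534
epoch 3 train loss=0.201
validation: accuracy=0.9471
"""


def extract_metrics(text):
    """Return every captured metric value in order of appearance."""
    return [float(value) for value in METRIC_PATTERN.findall(text)]


values = extract_metrics(sample_stdout)
best_value = max(values)  # assumed meaning of "best" when maximizing
last_value = values[-1]   # shown for contrast with a "final" style method

print(values)
print(best_value, last_value)
```

If the script prints the metric in a different format, the regular expression in the Metric weight key has to be adjusted to match.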

Configure the search settings.

Parameter	Description
Search algorithm	Select TPE. For more information about the algorithm, see Supported search algorithms.
Maximum trials	Set to 3.
Maximum concurrent trials	Set to 2.
Enable early stopping	Turn on this switch to stop a trial early if it performs poorly.
Start step	Set to 5. A trial can be stopped early only after it completes at least five evaluations.
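
The effect of Start step can be sketched as a gate in front of whatever stopping criterion is used. This topic does not state the criterion, so the example below substitutes a simple median rule purely for illustration: a trial is stopped only if it has reported at least five evaluations and its latest value is below the median of its peers at the same evaluation. The function name and the rule itself are assumptions.

```python
from statistics import median

START_STEP = 5  # value from the search settings


def should_stop_early(trial_scores, peer_scores, start_step=START_STEP):
    """Decide whether a trial may be stopped under a hypothetical median rule.

    trial_scores: metric values the trial has reported so far (higher is better).
    peer_scores: values that other trials reported at the same evaluation index.
    """
    if len(trial_scores) < start_step:
        return False  # fewer evaluations than Start step: never stop
    if not peer_scores:
        return False  # nothing to compare against
    return trial_scores[-1] < median(peer_scores)


# A weak trial is protected until it has completed five evaluations.
print(should_stop_early([0.10, 0.11, 0.12, 0.12], [0.90, 0.80]))
print(should_stop_early([0.10, 0.11, 0.12, 0.12, 0.13], [0.90, 0.80]))
print(should_stop_early([0.91, 0.92, 0.93, 0.94, 0.95], [0.90, 0.80]))
```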

Step 3: View experiment details and results

In the experiment list, click the experiment name to open the Experiment Details page.

The Experiment Details page has four main areas:
- Basic configuration: Displays the experiment ID, name, visibility, status, creator, creation time, and update time.
- Trial status statistics: Shows a donut chart with the number of trials that are completed, failed, running, or in other states.
- Trial configuration: Includes the metric type, calculation method, metric weight regular expression, metric source, and model storage path.
- Search configuration: Includes the search algorithm, maximum trials, maximum concurrent trials, optimization direction, and early stopping settings.

On this page, you can monitor the progress and status of all trials. The experiment creates three trials based on your search algorithm and maximum trial settings.
Click the Trials tab. You can view a list of all trials generated for the experiment, including the status, final metric, and hyperparameter combination for each trial.
