Train a model for multi-label image classification

Platform for AI (PAI) provides a multi-label image classification algorithm that you can use to train a model based on tens of millions of images. This topic describes how to use PAI commands to train a model for multi-label image classification.

Sample PAI commands

You can run PAI commands by using the SQL Script component. For more information, see SQL Script. You can also run PAI commands by using the MaxCompute client or DataWorks nodes. For more information, see MaxCompute client (odpscmd) or Create and manage ODPS nodes.

Training for single-label image classification on a single server

pai -name easy_vision_ext
           -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
           -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
           -DossHost='{oss_host}'
           -DgpuRequired=100
           -Dcmd=train
           -Dparam_config='--model_type Classification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4 --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'

Training for single-label image classification on multiple servers

pai -name easy_vision_ext
           -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
           -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
           -DossHost='{oss_host}'
           -Dcmd=train
           -Dcluster='{
             \"ps\": {
                 \"count\" : 1,
                 \"cpu\" : 600
             },
             \"worker\" : {
                 \"count\" : 3,
                 \"cpu\" : 800,
                 \"gpu\" : 100
             }
           }'
           -Dparam_config='--model_type Classification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4_dis --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'

Training for multi-label image classification on a single server

pai -name easy_vision_ext
           -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
           -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
           -DossHost='{oss_host}'
           -DgpuRequired=100
           -Dcmd=train
           -Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4 --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'

Training for multi-label image classification on multiple servers

pai -name easy_vision_ext
           -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
           -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
           -DossHost='{oss_host}'
           -Dcmd=train
           -Dcluster='{
             \"ps\": {
                 \"count\" : 1,
                 \"cpu\" : 600
             },
             \"worker\" : {
                 \"count\" : 3,
                 \"cpu\" : 800,
                 \"gpu\" : 100
             }
           }'
           -Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4_dis --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'

Command parameters

Parameter	Required	Description	Type or example	Default Value
buckets	Yes	The URL of the Object Storage Service (OSS) bucket that you want to use. The URL must end with a forward slash (/).	Example: oss://{bucket_name}.{oss_host}/{path}	N/A
arn	Yes	The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that has the permissions to access the OSS bucket. To obtain the ARN, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS, click View Authorization in the Actions column, and then copy the value of the Role Name parameter. For more information, see Grant the permissions that are required to use Machine Learning Designer.	Example: acs:ram::*:role/AliyunODPSPAIDefaultRole	N/A
ossHost	No	The domain name of the OSS bucket. For more information, see Regions and endpoints. If you do not specify this parameter, the system sets this parameter to the domain name that is specified in the buckets parameter.	Example: oss-{region}.aliyuncs.com	The domain name that is specified in the buckets parameter
cluster	No	The configuration of distributed training.	Type: JSON string	""
gpuRequired	No	The GPU resources that each worker uses, in units of one hundredth of a GPU. A value of 100 specifies that each worker uses one GPU, and a value of 200 specifies that each worker uses two GPUs.	Example: 100	100
cmd	Yes	The type of the task that is run on EasyVision. Set this parameter to train.	Example: train	N/A
param_config	Yes	The model training parameters in the command line format that is required by the Python argparse module. For more information, see the "param_config" section of this topic.	Type: STRING	N/A
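
The cluster value used in the distributed samples is a JSON string, so it can help to generate it programmatically instead of typing it by hand. The following Python sketch builds that string and converts a GPU count into the per-worker value. The helper names are hypothetical, and the sketch assumes that the gpu field of a worker uses the same units as gpuRequired (100 per GPU). The cpu values are passed through unchanged.

```python
import json


def gpu_units(num_gpus):
    """Convert a GPU count into the value used by gpuRequired (200 means two GPUs)."""
    return int(round(num_gpus * 100))


def build_cluster(ps_count, ps_cpu, worker_count, worker_cpu, gpus_per_worker):
    """Return the cluster configuration as a JSON string."""
    cluster = {
        "ps": {"count": ps_count, "cpu": ps_cpu},
        "worker": {
            "count": worker_count,
            "cpu": worker_cpu,
            # Assumption: the worker gpu field uses the same units as gpuRequired.
            "gpu": gpu_units(gpus_per_worker),
        },
    }
    return json.dumps(cluster)


def escape_double_quotes(text):
    """Escape double quotation marks the way the sample commands do."""
    return text.replace('"', '\\"')


cluster_json = build_cluster(ps_count=1, ps_cpu=600, worker_count=3,
                             worker_cpu=800, gpus_per_worker=1)
json.loads(cluster_json)  # Raises ValueError if the string is not valid JSON.
print("-Dcluster='{}'".format(escape_double_quotes(cluster_json)))
```

Whether the double quotation marks must be escaped depends on where you run the command. The sketch follows the escaped form that the samples in this topic use.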

param_config

This parameter specifies the model training parameters in the command line format that is required by the Python argparse module. Example:

-Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 200 --model_dir oss://your/bucket/exp_dir'

Note

When you specify this parameter, do not enclose the value of a string parameter in double quotation marks (") or single quotation marks (').
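
A long param_config value is easy to mistype, so one option is to render it from a dictionary. The following Python sketch is only an illustration: build_param_config is a hypothetical helper, not part of PAI or EasyVision. It writes Boolean values as true or false, joins list values with spaces, and rejects any value that contains quotation marks, in line with the preceding note.

```python
def build_param_config(params):
    """Render a dict as '--key value' pairs in command line style."""
    tokens = []
    for key, value in params.items():
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # List parameters such as decay_epochs are space-separated.
            text = " ".join(str(item) for item in value)
        else:
            text = str(value)
        if '"' in text or "'" in text:
            raise ValueError("quotation marks are not allowed in " + key)
        tokens.append("--{} {}".format(key, text))
    return " ".join(tokens)


config = build_param_config({
    "model_type": "MultiLabelClassification",
    "backbone": "inception_v4",
    "num_classes": 200,
    "model_dir": "oss://your/bucket/exp_dir",
})
print("-Dparam_config='{}'".format(config))
```

The printed line matches the example above and can be pasted into a PAI command.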

Parameter	Required	Description	Type or example	Default Value
model_type	Yes	The type of the model that you want to train. Set this parameter to MultiLabelClassification.	Type: STRING	N/A
backbone	No	The name of the neural network that is used by the model. Valid values: lenet cifarnet alexnet vgg_16 vgg_19 inception_v1 inception_v2 inception_v3 inception_v4 mobilenet_v1 mobilenet_v2 resnet_v1_50 resnet_v1_101 resnet_v1_152	Type: STRING	inception_v4
num_classes	Yes	The number of classes that are used to categorize the images.	Example: 100	N/A
image_size	No	The size of the images after they are resized for training. Unit: pixels.	Type: INT	224
use_crop	No	Specifies whether to crop the images to achieve data augmentation.	Type: BOOL	true
eval_each_category	No	Specifies whether to separately evaluate the model for each class.	Type: BOOL	false
optimizer	No	The type of the optimizer. Valid values: momentum, which uses the stochastic gradient descent (SGD) algorithm with momentum, and adam, which uses the Adam optimization algorithm.	Type: STRING	momentum
lr_type	No	The policy that is used to adjust the learning rate. Valid values: exponential_decay: The learning rate is subject to exponential decay. polynomial_decay: The learning rate is subject to polynomial decay. In this policy, the number of training steps is automatically specified by using the num_steps parameter, and the end_learning_rate parameter is automatically set to one thousandth of the value of the initial_learning_rate parameter. manual_step: The learning rate is manually adjusted. In this policy, use the decay_epochs parameter to specify the epochs for which you want to adjust the learning rate and the learning_rates parameter to specify the learning rates for the specified epochs. cosine_decay: The learning rate is adjusted based on the cosine curve. For more information, see SGDR: Stochastic Gradient Descent with Warm Restarts.	Type: STRING	exponential_decay
initial_learning_rate	No	The initial learning rate.	Type: FLOAT	0.01
decay_epochs	No	If you set the lr_type parameter to exponential_decay, this parameter plays the role of the decay_steps parameter of the tf.train.exponential_decay function, expressed in epochs. The system converts the value of this parameter to decay_steps based on the total number of samples that are used for training. In this case, you can set this parameter to half of the total number of epochs. Example: 10. If you set the lr_type parameter to manual_step, this parameter specifies the epochs at which you want to adjust the learning rate. For example, a value of 16 18 specifies that the learning rate is adjusted in the 16th and 18th epochs. In most cases, you can set this parameter to 8/10 × N and 9/10 × N, where N is the total number of epochs.	Type: INTEGER list. Example: 20 20 40 60	20
decay_factor	No	The factor by which the learning rate decreases. This parameter is equivalent to the decay_rate parameter of the tf.train.exponential_decay function.	Type: FLOAT	0.95
staircase	No	Specifies whether to decrease the learning rate at discrete intervals. This parameter is equivalent to the staircase parameter of the tf.train.exponential_decay function.	Type: BOOL	true
power	No	The power of the polynomial. This parameter is equivalent to the power parameter of the tf.train.polynomial_decay function. This parameter is valid only if you set the lr_type parameter to polynomial_decay.	Type: FLOAT	0.9
learning_rates	No	The learning rate for each of the specified epochs. If you set the lr_type parameter to manual_step, this parameter is required. The number of elements in this parameter must be the same as the number of elements in the decay_epochs parameter. For example, if you set the decay_epochs parameter to 20 40, you must specify two learning rates for this parameter, such as 0.001 0.0001. This specifies that the learning rate of the 20th epoch is adjusted to 0.001 and the learning rate of the 40th epoch is adjusted to 0.0001. We recommend that you adjust the learning rate to 1/10, 1/100, and 1/1000 of the initial learning rate in sequence.	Type: FLOAT list	N/A
train_data	Yes	The OSS path of the training dataset.	Example: oss://path/to/train_*.tfrecord	N/A
test_data	Yes	The OSS path of the evaluation dataset.	Example: oss://path/to/test_*.tfrecord	N/A
train_batch_size	Yes	The number of samples that are used to train the model per iteration.	Type: INT. Example: 32	N/A
test_batch_size	Yes	The number of samples that are used to evaluate the model per iteration.	Type: INT. Example: 32	N/A
train_num_readers	No	The number of concurrent threads used to read the samples that are used for training.	Type: INT	4
model_dir	Yes	The OSS path of the trained model.	Example: oss://path/to/model	N/A
pretrained_model	No	The OSS path of a pretrained model. If you specify this parameter, the output model is a fine-tuned version of the pretrained model.	Example: oss://pai-vision-data-sh/pretrained_models/inception_v4.ckpt	""
use_pretrained_model	No	Specifies whether to use a pretrained model.	Type: BOOL	true
num_epochs	Yes	The number of epochs. An epoch is a full cycle of exposing each sample in the training dataset to the model. A value of 1 specifies that each sample in the training dataset is processed once.	Type: INT. Example: 40	N/A
num_test_example	No	The number of samples that are used for model evaluation. A value of -1 specifies that the entire evaluation dataset is used for model evaluation.	Type: INT. Example: 2000	-1
num_visualizations	No	The number of samples that can be visualized during model evaluation.	Type: INT	10
save_checkpoint_epochs	No	The interval at which a checkpoint is saved. Unit: epoch. A value of 1 specifies that a checkpoint is saved each time an epoch is completed.	Type: INT	1
num_train_images	No	The total number of samples that are used for training. If you use custom TFRecord files, this parameter is required.	Type: INT	0
label_map_path	No	The path of the .pbtxt file that defines the mapping between label IDs and label names. If you use custom TFRecord files, this parameter is required.	Type: STRING	""
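
To make the learning rate parameters more concrete, the following Python sketch reproduces two of the policies in plain arithmetic. It is a rough model rather than the EasyVision implementation. It assumes that one epoch takes num_train_images / train_batch_size steps (rounded up) on a single worker, that the exponential policy follows the standard formula initial_learning_rate × decay_factor ^ (step / decay_steps), and that the manual_step policy switches to a new rate as soon as the listed epoch is reached. The function names are hypothetical.

```python
import math


def steps_per_epoch(num_train_images, train_batch_size):
    """Number of iterations needed to process every training sample once."""
    return math.ceil(num_train_images / train_batch_size)


def exponential_decay_lr(step, initial_learning_rate, decay_steps,
                         decay_factor=0.95, staircase=True):
    """Learning rate at a given step under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        # Decrease the rate at discrete intervals instead of continuously.
        exponent = math.floor(exponent)
    return initial_learning_rate * decay_factor ** exponent


def manual_step_lr(epoch, initial_learning_rate, decay_epochs, learning_rates):
    """Learning rate at a given epoch under the manual_step policy."""
    if len(decay_epochs) != len(learning_rates):
        raise ValueError("decay_epochs and learning_rates must have the same length")
    rate = initial_learning_rate
    for boundary, new_rate in zip(decay_epochs, learning_rates):
        if epoch >= boundary:  # Assumption: the change applies from this epoch on.
            rate = new_rate
    return rate


# Hypothetical dataset: 50,000 images, batch size 32, decay_epochs set to 20.
decay_steps = 20 * steps_per_epoch(50000, 32)
print(decay_steps)
print(exponential_decay_lr(decay_steps, 0.01, decay_steps))
print(manual_step_lr(25, 0.01, [20, 40], [0.001, 0.0001]))
```

With staircase enabled, the rate stays at the initial value until decay_steps iterations have passed and then drops by one factor of decay_factor.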

Reference

The trained model for multi-label image classification assigns labels to an image based on a predefined probability threshold. You can deploy the trained model as an online service in Elastic Algorithm Service (EAS). For more information, see Model service deployment by using the PAI console.
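
Threshold-based label assignment can be sketched in a few lines of Python. The example assumes that the model returns one probability per class. The label names, the probabilities, and the threshold of 0.5 are made up for illustration, and this topic does not describe the response format of a deployed EAS service.

```python
def assign_labels(probabilities, label_names, threshold=0.5):
    """Return every label whose probability reaches the threshold."""
    if len(probabilities) != len(label_names):
        raise ValueError("expected one probability per label")
    return [name for name, probability in zip(label_names, probabilities)
            if probability >= threshold]


# Hypothetical output for one image with three classes.
labels = assign_labels([0.91, 0.08, 0.62], ["cat", "dog", "outdoor"])
print(labels)
```

Unlike single-label classification, which keeps only the highest-scoring class, this rule can return several labels or none for a given image.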