Machine Learning Platform for AI (PAI) provides an algorithm for multi-label image classification. You can use the algorithm to train models on tens of millions of images. This topic describes how to use PAI commands to train a model for multi-label image classification.

In image classification models, an image has only one label. In models for multi-label image classification, an image can have multiple labels. You can use models for multi-label image classification to recognize images and obtain the labels whose recognition probability reaches a specific threshold. You can use Elastic Algorithm Service (EAS) of PAI to deploy the trained models as RESTful API operations. These operations can be called by using the MaxCompute console or ODPS SQL nodes of DataWorks. For more information, see Calling method.
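
The thresholding step described above can be illustrated with a minimal Python sketch. The function name select_labels, the sample scores, and the 0.5 threshold are hypothetical and are not part of PAI or EAS; the sketch only shows how labels are kept when their probability reaches the threshold.

```python
def select_labels(probabilities, threshold=0.5):
    """Return the labels whose probability reaches the threshold.

    probabilities: dict that maps a label name to a probability in [0, 1].
    threshold: hypothetical cut-off chosen by the caller.
    """
    return sorted(label for label, p in probabilities.items() if p >= threshold)


# Hypothetical per-label probabilities for one image.
scores = {"cat": 0.92, "outdoor": 0.61, "dog": 0.08}
print(select_labels(scores, threshold=0.5))
```

Unlike single-label classification, where only the highest-scoring label is returned, this selection can return zero, one, or several labels for the same image.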

Training of image classification

  • Training of image classification on a single server
    pai -name easy_vision_ext
               -Dbuckets='{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -DgpuRequired=100
               -Dcmd=train
               -Dparam_config='
                 --model_type Classification
                 --backbone  inception_v4
                 --num_classes 10
                 --num_epochs 1
                 --model_dir oss://pai-vision-data-sh/test/cifar_inception_v4
                 --use_pretrained_model true
                 --train_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --test_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --num_test_example 20
                 --train_batch_size 32
                 --test_batch_size=32
                 --image_size 299
                 --initial_learning_rate 0.01
                 --staircase true   '
  • Training of image classification on multiple servers
    pai -name easy_vision_ext
               -Dbuckets='{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -Dcmd=train
               -Dcluster='{
                 \"ps\": {
                     \"count\" : 1,
                     \"cpu\" : 600
                 },
                 \"worker\" : {
                     \"count\" : 3,
                     \"cpu\" : 800,
                     \"gpu\" : 100
                 }
               }'
               -Dparam_config='
                 --model_type Classification
                 --backbone  inception_v4
                 --num_classes 10
                 --num_epochs 1
                 --model_dir oss://pai-vision-data-sh/test/cifar_inception_v4_dis
                 --use_pretrained_model true
                 --train_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --test_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --num_test_example 20
                 --train_batch_size 32
                 --test_batch_size=32
                 --image_size 299
                 --initial_learning_rate 0.01
                 --staircase true
               '
  • Training of multi-label image classification on a single server
    pai -name easy_vision_ext
               -Dbuckets='{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -DgpuRequired=100
               -Dcmd=train
               -Dparam_config='
                 --model_type MultiLabelClassification
                 --backbone  inception_v4
                 --num_classes 10
                 --num_epochs 1
                 --model_dir oss://pai-vision-data-sh/test/cifar_inception_v4
                 --use_pretrained_model true
                 --train_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --test_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --num_test_example 20
                 --train_batch_size 32
                 --test_batch_size=32
                 --image_size 299
                 --initial_learning_rate 0.01
                 --staircase true   '
  • Training of multi-label image classification on multiple servers
    pai -name easy_vision_ext
               -Dbuckets='{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -Dcmd=train
               -Dcluster='{
                 \"ps\": {
                     \"count\" : 1,
                     \"cpu\" : 600
                 },
                 \"worker\" : {
                     \"count\" : 3,
                     \"cpu\" : 800,
                     \"gpu\" : 100
                 }
               }'
               -Dparam_config='
                 --model_type MultiLabelClassification
                 --backbone  inception_v4
                 --num_classes 10
                 --num_epochs 1
                 --model_dir oss://pai-vision-data-sh/test/cifar_inception_v4_dis
                 --use_pretrained_model true
                 --train_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --test_data oss://pai-vision-data-sh/data/test/cifar10/*.tfrecord
                 --num_test_example 20
                 --train_batch_size 32
                 --test_batch_size=32
                 --image_size 299
                 --initial_learning_rate 0.01
                 --staircase true
               '
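
The commands above share one shape: a set of -D options followed by a param_config block of --key value arguments. The following Python sketch renders such a command as text. The helper build_pai_command is hypothetical and is not provided by PAI, and the quoting rule it applies (string values in single quotation marks, numbers left bare) is an assumption modeled on the examples above.

```python
def build_pai_command(options, param_config):
    """Render a `pai -name easy_vision_ext` command as text.

    options: dict of -D options such as buckets, arn, cmd, and gpuRequired.
    param_config: dict of training arguments rendered as `--key value`.
    """
    lines = ["pai -name easy_vision_ext"]
    for key, value in options.items():
        # Assumption: string values are single-quoted, numbers are left bare.
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        lines.append(f"    -D{key}={rendered}")
    lines.append("    -Dparam_config='")
    for key, value in param_config.items():
        # Values inside param_config are written without quotation marks.
        lines.append(f"      --{key} {value}")
    lines.append("    '")
    return "\n".join(lines)


command = build_pai_command(
    {"buckets": "{bucket_name}.{oss_host}/{path}", "cmd": "train", "gpuRequired": 100},
    {"model_type": "MultiLabelClassification", "backbone": "inception_v4", "num_classes": 10},
)
print(command)
```

Pass Boolean arguments as the strings true or false so that they are rendered in lowercase, as in the examples above.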

Parameters

| Parameter | Required | Description | Value format or example value | Default value |
| --- | --- | --- | --- | --- |
| buckets | Yes | The endpoint of the Object Storage Service (OSS) bucket. | oss://{bucket_name}.{oss_host}/{path} | N/A |
| arn | Yes | The Alibaba Cloud Resource Name (ARN) of the RAM role that has the permissions to access OSS resources. For more information about how to obtain the ARN, see the "I/O parameters" section of the Parameters of PAI-TensorFlow tasks topic. | acs:ram::*:role/aliyunodpspaidefaultrole | N/A |
| host | No | The domain name of the OSS bucket. If you do not specify this parameter, the domain name of the OSS bucket is obtained from the buckets parameter. | oss-{region}.aliyuncs.com | The domain name that is specified in the buckets parameter. |
| cluster | No | The configuration of parameters that are used for distributed training. | JSON string | "" |
| gpuRequired | No | Specifies whether to use GPUs. Each worker uses one GPU by default. If you set this parameter to 200, each worker uses two GPUs. | 100 | 100 |
| cmd | Yes | The type of the EasyVision task. Set this parameter to train when you train a model. | train | N/A |
| param_config | Yes | The configuration of parameters that are used for model training. The format of the param_config parameter is the same as that of the ArgumentParser() object in Python. For more information, see param_config. | STRING | N/A |
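
The cluster value is a JSON string, so it can be generated instead of typed by hand. The following Python sketch builds a value with the same fields as the multi-server examples in this topic. The helper build_cluster is hypothetical. The sketch also assumes that the gpu field of a worker uses the same unit as gpuRequired (100 for one GPU, 200 for two GPUs), which this topic states only for gpuRequired.

```python
import json


def build_cluster(worker_count, gpus_per_worker=1, worker_cpu=800,
                  ps_count=1, ps_cpu=600):
    """Return a JSON string for the cluster parameter.

    The default cpu values are copied from the multi-server examples.
    Assumption: 100 units of "gpu" correspond to one GPU per worker.
    """
    cluster = {
        "ps": {"count": ps_count, "cpu": ps_cpu},
        "worker": {
            "count": worker_count,
            "cpu": worker_cpu,
            "gpu": gpus_per_worker * 100,
        },
    }
    return json.dumps(cluster)


print(build_cluster(worker_count=3))
```

Whether the double quotation marks in the generated JSON must be escaped, as they are in the examples in this topic, depends on where the command is submitted.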

param_config

The param_config parameter contains several parameters that are used for model training. It uses the same argument format that the ArgumentParser() object in Python parses. The following example shows the configuration of the param_config parameter:
-Dparam_config='
--model_type MultiLabelClassification
--backbone inception_v4
--num_classes 200
--model_dir oss://your/bucket/exp_dir
'
Note: The values of all string parameters in the param_config parameter are not enclosed in double quotation marks (") or single quotation marks (').
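
To illustrate the ArgumentParser() format, the following Python sketch parses a param_config value with a stand-in parser. The parser is hypothetical and declares only a few of the parameters listed below; it is not the parser that EasyVision uses internally. The sketch shows that argparse accepts both the --key value form and the --key=value form.

```python
import argparse

# Hypothetical stand-in parser that declares a few of the documented parameters.
parser = argparse.ArgumentParser()
parser.add_argument("--model_type", required=True)
parser.add_argument("--backbone", default="inception_v4")
parser.add_argument("--num_classes", type=int, required=True)
parser.add_argument("--model_dir", required=True)
parser.add_argument("--test_batch_size", type=int)

param_config = """
--model_type MultiLabelClassification
--num_classes 200
--model_dir oss://your/bucket/exp_dir
--test_batch_size=32
"""

# Split on whitespace, which is why values are written without quotation marks.
args = parser.parse_args(param_config.split())
print(args.model_type, args.backbone, args.num_classes, args.test_batch_size)
```

Because --backbone is omitted from the value, the stand-in parser falls back to its declared default, in the same way that optional parameters in the following table fall back to their default values.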
| Parameter | Required | Description | Value format or example value | Default value |
| --- | --- | --- | --- | --- |
| model_type | Yes | The type of the model to train. Set this parameter to MultiLabelClassification when you train a model for multi-label image classification. | STRING | N/A |
| backbone | No | The name of the backbone network that is used by the model. Valid values: lenet, cifarnet, alexnet, vgg_16, vgg_19, inception_v1, inception_v2, inception_v3, inception_v4, mobilenet_v1, mobilenet_v2, resnet_v1_50, resnet_v1_101, and resnet_v1_152. | STRING | inception_v4 |
| num_classes | Yes | The number of labels. | 100 | N/A |
| image_size | No | The size of the images after they are resized for the training. Unit: pixels. | INT | 224 |
| use_crop | No | Specifies whether to crop images for data augmentation. | BOOL | true |
| eval_each_category | No | Specifies whether to separately evaluate each label. | BOOL | false |
| optimizer | No | The type of the optimizer. Valid values: momentum (stochastic gradient descent (SGD) with momentum) and adam. | STRING | momentum |
| lr_type | No | The policy that is used to adjust the learning rate. Valid values: exponential_decay (exponential decay); polynomial_decay (polynomial decay; the num_steps parameter is automatically set to the total number of training iterations, and the end_learning_rate parameter is automatically set to one thousandth of the value of the initial_learning_rate parameter); manual_step (manually adjusts the learning rates for specified epochs; you must set the decay_epochs parameter to specify the epochs and the learning_rates parameter to specify the learning rates); cosine_decay (adjusts the learning rate by following the cosine curve; for more information, see SGDR: Stochastic Gradient Descent with Warm Restarts). | STRING | exponential_decay |
| initial_learning_rate | No | The initial learning rate. | FLOAT | 0.01 |
| decay_epochs | No | If you set the lr_type parameter to exponential_decay, the decay_epochs parameter is equivalent to the decay_steps parameter of tf.train.exponential_decay. In this case, the decay_epochs parameter specifies the epoch interval at which the learning rate is adjusted. The system automatically converts the value of the decay_epochs parameter to the value of the decay_steps parameter based on the total number of training data entries. Typically, you can set the decay_epochs parameter to half of the total number of epochs. For example, you can set this parameter to 10 if the total number of epochs is 20. If you set the lr_type parameter to manual_step, the decay_epochs parameter specifies the epochs at which the learning rate is adjusted. For example, the value 16 18 indicates that the learning rate is adjusted at the 16th and 18th epochs. Typically, if the total number of epochs is N, you can set the two values of the decay_epochs parameter to 8/10 × N and 9/10 × N. | INTEGER list. Example value: 20 20 40 60. | 20 |
| decay_factor | No | The decay rate. This parameter is equivalent to the decay_rate parameter of tf.train.exponential_decay. | FLOAT | 0.95 |
| staircase | No | Specifies whether the learning rate changes based on the decay_epochs parameter. This parameter is equivalent to the staircase parameter of tf.train.exponential_decay. | BOOL | true |
| power | No | The power of the polynomial. This parameter is equivalent to the power parameter of tf.train.polynomial_decay. | FLOAT | 0.9 |
| learning_rates | No | The learning rates that you want to set for the specified epochs. This parameter is required when you set the lr_type parameter to manual_step. Specify one learning rate for each epoch in the decay_epochs parameter. For example, if the decay_epochs parameter is set to 20 40, you must specify two learning rates in the learning_rates parameter, such as 0.001 0.0001. This indicates that the learning rate is adjusted to 0.001 at the 20th epoch and to 0.0001 at the 40th epoch. We recommend that you adjust the learning rates to one tenth, one hundredth, and one thousandth of the initial learning rate in sequence. | FLOAT list | N/A |
| train_data | Yes | The OSS path of the data that is used to train the model. | oss://path/to/train_*.tfrecord | N/A |
| test_data | Yes | The OSS path of the data that is evaluated during the training. | oss://path/to/test_*.tfrecord | N/A |
| train_batch_size | Yes | The number of data entries in each training batch. | INT. Example value: 32. | N/A |
| test_batch_size | Yes | The number of data entries in each evaluation batch. | INT. Example value: 32. | N/A |
| train_num_readers | No | The number of concurrent threads that are used to read the training data. | INT | 4 |
| model_dir | Yes | The OSS path of the model. | oss://path/to/model | N/A |
| pretrained_model | No | The OSS path of the pretrained model. If this parameter is specified, the model is fine-tuned based on the pretrained model. | oss://pai-vision-data-sh/pretrained_models/inception_v4.ckpt | "" |
| use_pretrained_model | No | Specifies whether to use a pretrained model. | BOOL | true |
| num_epochs | Yes | The number of times the data is iterated for the training. The value 1 indicates that all data is iterated once for the training. | INT. Example value: 40. | N/A |
| num_test_example | No | The number of data entries that are evaluated during the training. The value -1 indicates that all training data is evaluated. | INT. Example value: 2000. | -1 |
| num_visualizations | No | The number of data entries that can be visualized during the evaluation. | INT | 10 |
| save_checkpoint_epochs | No | The epoch interval at which a checkpoint is saved. The value 1 indicates that a checkpoint is saved each time an epoch is complete. | INT | 1 |
| num_train_images | No | The total number of data entries that are used for the training. This parameter is required if you use custom TFRecord files to train the model. | INT | 0 |
| label_map_path | No | The label mapping file. This parameter is required if you use custom TFRecord files to train the model. | STRING | "" |
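
The learning rate policies selected by the lr_type parameter can be sketched in plain Python. The functions below are illustrative and are not the EasyVision implementation. They work in epochs rather than in training steps, use the default values from the table, and assume that a manual_step rate takes effect from the listed epoch onward. The exponential and polynomial formulas follow the documented behavior of tf.train.exponential_decay and tf.train.polynomial_decay.

```python
import math


def exponential_decay(initial_lr, epoch, decay_epochs=20, decay_factor=0.95,
                      staircase=True):
    """lr = initial_lr * decay_factor ** (epoch / decay_epochs)."""
    exponent = epoch / decay_epochs
    if staircase:
        # With staircase enabled, the rate changes only at whole intervals.
        exponent = math.floor(exponent)
    return initial_lr * decay_factor ** exponent


def polynomial_decay(initial_lr, epoch, num_epochs, power=0.9):
    """Decay toward one thousandth of the initial rate over num_epochs."""
    end_lr = initial_lr / 1000
    progress = min(epoch, num_epochs) / num_epochs
    return (initial_lr - end_lr) * (1 - progress) ** power + end_lr


def manual_step(initial_lr, epoch, decay_epochs, learning_rates):
    """Use the rate paired with the last listed epoch that has been reached."""
    lr = initial_lr
    for boundary, rate in zip(decay_epochs, learning_rates):
        if epoch >= boundary:
            lr = rate
    return lr


for epoch in (0, 19, 20, 40):
    print(epoch,
          exponential_decay(0.01, epoch),
          polynomial_decay(0.01, epoch, num_epochs=40),
          manual_step(0.01, epoch, [20, 40], [0.001, 0.0001]))
```

Printing the three schedules side by side for a planned number of epochs is a quick way to check the values of decay_epochs and learning_rates before a training job is submitted.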