Platform for AI (PAI) provides a multi-label image classification algorithm that you can use to train a model based on tens of millions of images. This topic describes how to use PAI commands to train a model for multi-label image classification.
Sample PAI commands
You can run PAI commands by using the SQL Script component. For more information, see SQL Script. You can also run PAI commands by using the MaxCompute client or DataWorks nodes. For more information, see MaxCompute client (odpscmd) or Create and manage ODPS nodes.
Training for single-label image classification on a single server
pai -name easy_vision_ext -Dbuckets='oss://{bucket_name}.{oss_host}/{path}' -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole' -DossHost='{oss_host}' -DgpuRequired=100 -Dcmd=train -Dparam_config='--model_type Classification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4 --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size=32 --image_size 299 --initial_learning_rate 0.01 --staircase true'
Training for single-label image classification on multiple servers
pai -name easy_vision_ext -Dbuckets='oss://{bucket_name}.{oss_host}/{path}' -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole' -DossHost='{oss_host}' -Dcmd=train -Dcluster='{ \"ps\": { \"count\" : 1, \"cpu\" : 600 }, \"worker\" : { \"count\" : 3, \"cpu\" : 800, \"gpu\" : 100 } }' -Dparam_config='--model_type Classification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4_dis --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size=32 --image_size 299 --initial_learning_rate 0.01 --staircase true'
Training for multi-label image classification on a single server
pai -name easy_vision_ext -Dbuckets='oss://{bucket_name}.{oss_host}/{path}' -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole' -DossHost='{oss_host}' -DgpuRequired=100 -Dcmd=train -Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4 --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size=32 --image_size 299 --initial_learning_rate 0.01 --staircase true'
Training for multi-label image classification on multiple servers
pai -name easy_vision_ext -Dbuckets='oss://{bucket_name}.{oss_host}/{path}' -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole' -DossHost='{oss_host}' -Dcmd=train -Dcluster='{ \"ps\": { \"count\" : 1, \"cpu\" : 600 }, \"worker\" : { \"count\" : 3, \"cpu\" : 800, \"gpu\" : 100 } }' -Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4_dis --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size=32 --image_size 299 --initial_learning_rate 0.01 --staircase true'
Command parameters
Parameter | Required | Description | Type or example | Default Value |
buckets | Yes | The URL of the Object Storage Service (OSS) bucket that you want to use. The URL must end with a forward slash (/). | Example: oss://{bucket_name}.{oss_host}/{path} | N/A |
arn | Yes | The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that has the permissions to access the OSS bucket. To obtain the ARN, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS, click View Authorization in the Actions column, and then copy the value of the Role Name parameter. For more information, see Grant the permissions that are required to use Machine Learning Designer. | Example: acs:ram::*:role/AliyunODPSPAIDefaultRole | N/A |
ossHost | No | The domain name of the OSS bucket. For more information, see Regions and endpoints. If you do not specify this parameter, the system sets this parameter to the domain name that is specified in the buckets parameter. | Example: oss-{region}.aliyuncs.com | The domain name that is specified in the buckets parameter |
cluster | No | The configuration of distributed training. | Type: JSON string | "" |
gpuRequired | No | The number of GPUs that each worker uses. A value of 100 specifies that each worker uses one GPU, and a value of 200 specifies that each worker uses two GPUs. | Example: 100 | 100 |
cmd | Yes | The type of the task that is run on EasyVision. Set this parameter to train. | Example: train | N/A |
param_config | Yes | The model training parameters in the command line format that is required by the Python argparse module. For more information, see the "param_config" section of this topic. | Type: STRING | N/A |
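The cluster value in the multi-server samples is a JSON string whose double quotation marks are escaped with backslashes. The following Python sketch shows one way to generate such a value. The helper names build_cluster_config and escape_for_pai are hypothetical, and the field values only mirror the samples in this topic.

```python
import json


def build_cluster_config(ps_count, ps_cpu, worker_count, worker_cpu, worker_gpu):
    """Return the cluster configuration as a JSON string (hypothetical helper)."""
    cluster = {
        "ps": {"count": ps_count, "cpu": ps_cpu},
        "worker": {"count": worker_count, "cpu": worker_cpu, "gpu": worker_gpu},
    }
    return json.dumps(cluster)


def escape_for_pai(cluster_json):
    """Escape double quotation marks the way the sample commands do."""
    return cluster_json.replace('"', '\\"')


# Same values as the multi-server samples: 1 parameter server, 3 workers.
cluster_json = build_cluster_config(1, 600, 3, 800, 100)
print("-Dcluster='" + escape_for_pai(cluster_json) + "'")
```

Building the string from a dictionary avoids unbalanced braces or missing commas, which are easy to introduce when the JSON is typed by hand inside a long command.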
param_config
This parameter specifies the model training parameters in the command line format that is required by the Python argparse module. Example:
-Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 200 --model_dir oss://your/bucket/exp_dir'
When you specify this parameter, do not enclose the value of a string parameter in double quotation marks (") or single quotation marks (').
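The following Python sketch assembles a param_config value from a dictionary and then parses it back with argparse to illustrate the expected format. The build_param_config helper is hypothetical, and the parser is only an illustration; it is not the parser that EasyVision uses internally.

```python
import argparse


def build_param_config(params):
    """Join parameters into '--name value' pairs (hypothetical helper)."""
    parts = []
    for name, value in params.items():
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # List parameters such as decay_epochs are space-separated.
            text = " ".join(str(item) for item in value)
        else:
            text = str(value)
        # String values must not be wrapped in quotation marks.
        if '"' in text or "'" in text:
            raise ValueError("quotation marks are not allowed in " + name)
        parts.append("--" + name + " " + text)
    return " ".join(parts)


params = {
    "model_type": "MultiLabelClassification",
    "backbone": "inception_v4",
    "num_classes": 200,
    "model_dir": "oss://your/bucket/exp_dir",
    "use_pretrained_model": True,
    "decay_epochs": [20, 40],
}
param_config = build_param_config(params)
print("-Dparam_config='" + param_config + "'")

# Illustrative parser: shows how argparse tokenizes the string.
parser = argparse.ArgumentParser()
parser.add_argument("--model_type")
parser.add_argument("--backbone")
parser.add_argument("--num_classes", type=int)
parser.add_argument("--model_dir")
parser.add_argument("--use_pretrained_model")
parser.add_argument("--decay_epochs", type=int, nargs="+")
args = parser.parse_args(param_config.split())
```

Because the whole value is wrapped in single quotation marks on the command line, a quotation mark inside it would end the value early, which is why the helper rejects such values.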
Parameter | Required | Description | Type or example | Default Value |
model_type | Yes | The type of the model that you want to train. Set this parameter to MultiLabelClassification. | Type: STRING | N/A |
backbone | No | The name of the neural network that is used by the model. | Type: STRING | inception_v4 |
num_classes | Yes | The number of classes that are used to categorize the images. | Example: 100 | N/A |
image_size | No | The size of the images after they are resized for training. Unit: pixels. | Type: INT | 224 |
use_crop | No | Specifies whether to crop the images to achieve data augmentation. | Type: BOOL | true |
eval_each_category | No | Specifies whether to separately evaluate the model for each class. | Type: BOOL | false |
optimizer | No | The type of the optimizer. | Type: STRING | momentum |
lr_type | No | The policy that is used to adjust the learning rate. The values that are referenced in this topic are exponential_decay, polynomial_decay, and manual_step. | Type: STRING | exponential_decay |
initial_learning_rate | No | The initial learning rate. | Type: FLOAT | 0.01 |
decay_epochs | No | If you set the lr_type parameter to exponential_decay, this parameter is equivalent to the decay_steps parameter of the tf.train.exponential_decay function. The system automatically converts the value of this parameter to the value of the decay_steps parameter based on the total number of samples that are used for training. In this case, you can set this parameter to half of the total number of epochs. Example: 10. If you set the lr_type parameter to manual_step, this parameter specifies the epochs at which you want to adjust the learning rate. For example, a value of 16 18 specifies that the learning rate is adjusted in the 16th and 18th epochs. In most cases, you can set this parameter to 8/10 × N and 9/10 × N, where N is the total number of epochs. | Type: INTEGER list. Example: 20 20 40 60 | 20 |
decay_factor | No | The factor by which the learning rate decreases. This parameter is equivalent to the decay_rate parameter of the tf.train.exponential_decay function. | Type: FLOAT | 0.95 |
staircase | No | Specifies whether to decrease the learning rate at discrete intervals. This parameter is equivalent to the staircase parameter of the tf.train.exponential_decay function. | Type: BOOL | true |
power | No | The power of the polynomial. This parameter is equivalent to the power parameter of the tf.train.polynomial_decay function. This parameter is valid only if you set the lr_type parameter to polynomial_decay. | Type: FLOAT | 0.9 |
learning_rates | No | The learning rate for each of the specified epochs. If you set the lr_type parameter to manual_step, this parameter is required. The number of elements in this parameter must be the same as the number of elements in the decay_epochs parameter. For example, if you set the decay_epochs parameter to 20 40, you must specify two learning rates for this parameter, such as 0.001 0.0001. This specifies that the learning rate of the 20th epoch is adjusted to 0.001 and the learning rate of the 40th epoch is adjusted to 0.0001. We recommend that you adjust the learning rate to 1/10, 1/100, and 1/1000 of the initial learning rate in sequence. | Type: FLOAT list | N/A |
train_data | Yes | The OSS path of the training dataset. | Example: oss://path/to/train_*.tfrecord | N/A |
test_data | Yes | The OSS path of the evaluation dataset. | Example: oss://path/to/test_*.tfrecord | N/A |
train_batch_size | Yes | The number of samples that are used to train the model per iteration. | Type: INT. Example: 32 | N/A |
test_batch_size | Yes | The number of samples that are used to evaluate the model per iteration. | Type: INT. Example: 32 | N/A |
train_num_readers | No | The number of concurrent threads used to read the samples that are used for training. | Type: INT | 4 |
model_dir | Yes | The OSS path of the trained model. | Example: oss://path/to/model | N/A |
pretrained_model | No | The OSS path of a pretrained model. If you specify this parameter, the output model is a fine-tuned version of the pretrained model. | Example: oss://pai-vision-data-sh/pretrained_models/inception_v4.ckpt | "" |
use_pretrained_model | No | Specifies whether to use a pretrained model. | Type: BOOL | true |
num_epochs | Yes | The number of epochs. An epoch is a full cycle of exposing each sample in the training dataset to the model. A value of 1 specifies that each sample in the training dataset is processed once. | Type: INT. Example: 40 | N/A |
num_test_example | No | The number of samples that are used for model evaluation. A value of -1 specifies that the entire evaluation dataset is used for model evaluation. | Type: INT. Example: 2000 | -1 |
num_visualizations | No | The number of samples that can be visualized during model evaluation. | Type: INT | 10 |
save_checkpoint_epochs | No | The interval at which a checkpoint is saved. Unit: epoch. A value of 1 specifies that a checkpoint is saved each time an epoch is completed. | Type: INT | 1 |
num_train_images | No | The total number of samples that are used for training. If you use custom TFRecord files, this parameter is required. | Type: INT | 0 |
label_map_path | No | The .pbtxt file that defines the mapping between label ID and label name. If you use custom TFRecord files, this parameter is required. | Type: STRING | "" |
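The learning rate parameters in the preceding table interact with each other. The following Python sketch reproduces the arithmetic for the exponential_decay and manual_step policies without TensorFlow. It rests on assumptions that this topic does not confirm: the number of steps per epoch is num_train_images divided by train_batch_size and rounded up, and a manual_step rate takes effect from the listed epoch onward. All function names are hypothetical.

```python
import math


def steps_per_epoch(num_train_images, train_batch_size):
    """Assumed conversion from epochs to training steps."""
    return math.ceil(num_train_images / train_batch_size)


def exponential_decay_lr(step, initial_lr, decay_steps, decay_factor, staircase=True):
    """initial_lr * decay_factor ** (step / decay_steps), as in tf.train.exponential_decay."""
    exponent = step / decay_steps
    if staircase:
        # Decay at discrete intervals instead of continuously.
        exponent = math.floor(exponent)
    return initial_lr * decay_factor ** exponent


def manual_step_lr(epoch, initial_lr, decay_epochs, learning_rates):
    """Return the learning rate in effect at the given epoch."""
    if len(decay_epochs) != len(learning_rates):
        raise ValueError("decay_epochs and learning_rates must have the same length")
    lr = initial_lr
    for boundary, rate in zip(decay_epochs, learning_rates):
        if epoch >= boundary:
            lr = rate
    return lr


# decay_epochs=20 with 50,000 images and a batch size of 32 (example values).
decay_steps = 20 * steps_per_epoch(50000, 32)
print(exponential_decay_lr(decay_steps, 0.01, decay_steps, 0.95))
print(manual_step_lr(25, 0.01, [20, 40], [0.001, 0.0001]))
```

With staircase set to true, the learning rate stays constant within each block of decay_steps steps and drops by the decay factor at each boundary. With staircase set to false, it decreases smoothly at every step.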
Reference
The trained model for multi-label image classification assigns labels to an image based on a predefined probability threshold. You can deploy the trained model as an online service in Elastic Algorithm Service (EAS). For more information, see Model service deployment by using the PAI console.
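The following Python sketch illustrates threshold-based label assignment. It assumes that the model returns one probability per label and uses a threshold of 0.5. Both the threshold value and the select_labels helper are hypothetical and are not taken from the PAI API.

```python
def select_labels(probabilities, threshold=0.5):
    """Keep every label whose probability reaches the threshold.

    probabilities: dict that maps a label name to a probability in [0, 1].
    """
    return sorted(
        label for label, prob in probabilities.items() if prob >= threshold
    )


# Example scores for one image (made-up values for illustration).
scores = {"cat": 0.91, "dog": 0.12, "outdoor": 0.57}
print(select_labels(scores))
```

Unlike single-label classification, which keeps only the highest-scoring class, this rule can return zero, one, or several labels for the same image.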