
Platform for AI: Train a model for multi-label image classification

Last Updated: Sep 26, 2024

Platform for AI (PAI) provides a multi-label image classification algorithm that you can use to train a model based on tens of millions of images. This topic describes how to use PAI commands to train a model for multi-label image classification.

Sample PAI commands

You can run PAI commands by using the SQL Script component. For more information, see SQL Script. You can also run PAI commands by using the MaxCompute client or DataWorks nodes. For more information, see MaxCompute client (odpscmd) or Create and manage ODPS nodes.

  • Training for single-label image classification on a single server

    pai -name easy_vision_ext
               -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -DgpuRequired=100
               -Dcmd=train
               -Dparam_config='--model_type Classification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4 --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'
  • Training for single-label image classification on multiple servers

    pai -name easy_vision_ext
               -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -Dcmd=train
               -Dcluster='{
                 \"ps\": {
                     \"count\" : 1,
                     \"cpu\" : 600
                 },
                 \"worker\" : {
                     \"count\" : 3,
                     \"cpu\" : 800,
                     \"gpu\" : 100
                 }
               }'
               -Dparam_config='--model_type Classification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4_dis --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'
  • Training for multi-label image classification on a single server

    pai -name easy_vision_ext
               -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -DgpuRequired=100
               -Dcmd=train
               -Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4 --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'
  • Training for multi-label image classification on multiple servers

    pai -name easy_vision_ext
               -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
               -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
               -DossHost='{oss_host}'
               -Dcmd=train
               -Dcluster='{
                 \"ps\": {
                     \"count\" : 1,
                     \"cpu\" : 600
                 },
                 \"worker\" : {
                     \"count\" : 3,
                     \"cpu\" : 800,
                     \"gpu\" : 100
                 }
               }'
               -Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 10 --num_epochs 1 --model_dir oss://examplebucket/test/cifar_inception_v4_dis --use_pretrained_model true --train_data oss://examplebucket/data/test/cifar10/*.tfrecord --test_data oss://examplebucket/data/test/cifar10/*.tfrecord --num_test_example 20 --train_batch_size 32 --test_batch_size 32 --image_size 299 --initial_learning_rate 0.01 --staircase true'

Command parameters

buckets
  Required: Yes
  Description: The URL of the Object Storage Service (OSS) bucket that you want to use. The URL must end with a forward slash (/).
  Type or example: oss://{bucket_name}.{oss_host}/{path}
  Default value: N/A

arn
  Required: Yes
  Description: The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that has the permissions to access the OSS bucket. To obtain the ARN, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS, click View Authorization in the Actions column, and then copy the value of the Role Name parameter. For more information, see Grant the permissions that are required to use Machine Learning Designer.
  Type or example: acs:ram::*:role/AliyunODPSPAIDefaultRole
  Default value: N/A

ossHost
  Required: No
  Description: The domain name of the OSS bucket. For more information, see Regions and endpoints. If you do not specify this parameter, the system uses the domain name that is specified in the buckets parameter.
  Type or example: oss-{region}.aliyuncs.com
  Default value: The domain name that is specified in the buckets parameter

cluster
  Required: No
  Description: The configuration of distributed training.
  Type or example: JSON string
  Default value: ""

gpuRequired
  Required: No
  Description: The number of GPUs that each worker uses, expressed in units of 100. A value of 100 specifies one GPU, and a value of 200 specifies two GPUs per worker.
  Type or example: 100
  Default value: 100

cmd
  Required: Yes
  Description: The type of the task that is run on EasyVision. Set this parameter to train.
  Type or example: train
  Default value: N/A

param_config
  Required: Yes
  Description: The model training parameters, in the command-line format that is required by the Python argparse module. For more information, see the "param_config" section of this topic.
  Type or example: STRING
  Default value: N/A

param_config

This parameter specifies the model training parameters in the command-line format that is required by the Python argparse module. Example:

-Dparam_config='--model_type MultiLabelClassification --backbone inception_v4 --num_classes 200 --model_dir oss://your/bucket/exp_dir'
Note

When you specify this parameter, do not enclose the value of a string parameter in double quotation marks (") or single quotation marks (').
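
The flags in param_config follow standard argparse conventions. The following sketch shows how a flag string like the preceding example would be parsed. The parser definition and the subset of flags are illustrative assumptions for this sketch, not the actual EasyVision parser.

    # Illustrative only: a minimal argparse parser that accepts a few of the
    # flags shown above. The real EasyVision parser defines many more options.
    import argparse
    import shlex

    parser = argparse.ArgumentParser()
    parser.add_argument('--model_type', type=str)
    parser.add_argument('--backbone', type=str, default='inception_v4')
    parser.add_argument('--num_classes', type=int)
    parser.add_argument('--model_dir', type=str)

    param_config = ('--model_type MultiLabelClassification --backbone inception_v4 '
                    '--num_classes 200 --model_dir oss://your/bucket/exp_dir')
    args = parser.parse_args(shlex.split(param_config))
    print(args.model_type, args.num_classes)  # MultiLabelClassification 200

Because the whole string is split on whitespace before parsing, string values must not be wrapped in extra quotation marks, which is why the preceding note applies.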

model_type
  Required: Yes
  Description: The type of the model that you want to train. Set this parameter to MultiLabelClassification.
  Type or example: STRING
  Default value: N/A

backbone
  Required: No
  Description: The name of the backbone network that is used by the model. Valid values:
    • lenet
    • cifarnet
    • alexnet
    • vgg_16
    • vgg_19
    • inception_v1
    • inception_v2
    • inception_v3
    • inception_v4
    • mobilenet_v1
    • mobilenet_v2
    • resnet_v1_50
    • resnet_v1_101
    • resnet_v1_152
  Type or example: STRING
  Default value: inception_v4

num_classes
  Required: Yes
  Description: The number of classes that are used to categorize the images.
  Type or example: 100
  Default value: N/A

image_size
  Required: No
  Description: The size of the images after they are resized for training. Unit: pixels.
  Type or example: INT
  Default value: 224

use_crop
  Required: No
  Description: Specifies whether to crop the images for data augmentation.
  Type or example: BOOL
  Default value: true

eval_each_category
  Required: No
  Description: Specifies whether to separately evaluate the model for each class.
  Type or example: BOOL
  Default value: false

optimizer
  Required: No
  Description: The type of the optimizer. Valid values:
    • momentum: uses the stochastic gradient descent (SGD) algorithm with momentum.
    • adam: uses the Adam optimization algorithm.
  Type or example: STRING
  Default value: momentum

lr_type
  Required: No
  Description: The policy that is used to adjust the learning rate. Valid values:
    • exponential_decay: The learning rate decays exponentially.
    • polynomial_decay: The learning rate decays polynomially. In this policy, the num_steps parameter is automatically set to the total number of training steps, and the end_learning_rate parameter is automatically set to one thousandth of the value of the initial_learning_rate parameter.
    • manual_step: The learning rate is adjusted at the epochs that you specify. In this policy, use the decay_epochs parameter to specify the epochs in which to adjust the learning rate and the learning_rates parameter to specify the learning rate for each of those epochs.
    • cosine_decay: The learning rate is adjusted based on a cosine curve. For more information, see SGDR: Stochastic Gradient Descent with Warm Restarts.
  A schematic sketch of the exponential_decay and manual_step schedules is provided after this parameter list.
  Type or example: STRING
  Default value: exponential_decay

initial_learning_rate
  Required: No
  Description: The initial learning rate.
  Type or example: FLOAT
  Default value: 0.01

decay_epochs
  Required: No
  Description: If you set the lr_type parameter to exponential_decay, this parameter is equivalent to the decay_steps parameter of the tf.train.exponential_decay function. The system converts the value into decay_steps based on the total number of training samples. In this case, you can set this parameter to approximately half of the total number of epochs, for example, 10. If you set the lr_type parameter to manual_step, this parameter specifies the epochs in which the learning rate is adjusted. For example, a value of 16 18 specifies that the learning rate is adjusted in the 16th and 18th epochs. In most cases, you can set the values to 8/10 × N and 9/10 × N, where N is the total number of epochs.
  Type or example: INTEGER list. Example: 20, or 20 40 60
  Default value: 20

decay_factor
  Required: No
  Description: The factor by which the learning rate decays. This parameter is equivalent to the decay_rate parameter of the tf.train.exponential_decay function.
  Type or example: FLOAT
  Default value: 0.95

staircase
  Required: No
  Description: Specifies whether to decrease the learning rate at discrete intervals. This parameter is equivalent to the staircase parameter of the tf.train.exponential_decay function.
  Type or example: BOOL
  Default value: true

power
  Required: No
  Description: The power of the polynomial. This parameter is equivalent to the power parameter of the tf.train.polynomial_decay function. This parameter takes effect only if you set the lr_type parameter to polynomial_decay.
  Type or example: FLOAT
  Default value: 0.9

learning_rates
  Required: No
  Description: The learning rate for each of the specified epochs. This parameter is required if you set the lr_type parameter to manual_step. The number of values must be the same as the number of values in the decay_epochs parameter. For example, if you set the decay_epochs parameter to 20 40, specify two learning rates, such as 0.001 0.0001. In this case, the learning rate is adjusted to 0.001 in the 20th epoch and to 0.0001 in the 40th epoch. We recommend that you adjust the learning rate to 1/10, 1/100, and 1/1000 of the initial learning rate in sequence.
  Type or example: FLOAT list
  Default value: N/A

train_data
  Required: Yes
  Description: The OSS path of the training dataset.
  Type or example: oss://path/to/train_*.tfrecord
  Default value: N/A

test_data
  Required: Yes
  Description: The OSS path of the evaluation dataset.
  Type or example: oss://path/to/test_*.tfrecord
  Default value: N/A

train_batch_size
  Required: Yes
  Description: The number of samples that are used to train the model in each iteration.
  Type or example: INT. Example: 32
  Default value: N/A

test_batch_size
  Required: Yes
  Description: The number of samples that are used to evaluate the model in each iteration.
  Type or example: INT. Example: 32
  Default value: N/A

train_num_readers
  Required: No
  Description: The number of concurrent threads that read the training samples.
  Type or example: INT
  Default value: 4

model_dir
  Required: Yes
  Description: The OSS path in which the trained model is stored.
  Type or example: oss://path/to/model
  Default value: N/A

pretrained_model
  Required: No
  Description: The OSS path of a pretrained model. If you specify this parameter, the output model is fine-tuned from the pretrained model.
  Type or example: oss://pai-vision-data-sh/pretrained_models/inception_v4.ckpt
  Default value: ""

use_pretrained_model
  Required: No
  Description: Specifies whether to use a pretrained model.
  Type or example: BOOL
  Default value: true

num_epochs
  Required: Yes
  Description: The number of training epochs. An epoch is a full pass over the training dataset. A value of 1 specifies that each sample in the training dataset is processed once.
  Type or example: INT. Example: 40
  Default value: N/A

num_test_example
  Required: No
  Description: The number of samples that are used for model evaluation. A value of -1 specifies that all samples in the evaluation dataset are used.
  Type or example: INT. Example: 2000
  Default value: -1

num_visualizations
  Required: No
  Description: The number of samples that are visualized during model evaluation.
  Type or example: INT
  Default value: 10

save_checkpoint_epochs
  Required: No
  Description: The interval, in epochs, at which a checkpoint is saved. A value of 1 specifies that a checkpoint is saved each time an epoch is completed.
  Type or example: INT
  Default value: 1

num_train_images
  Required: No
  Description: The total number of training samples. This parameter is required if you use custom TFRecord files.
  Type or example: INT
  Default value: 0

label_map_path
  Required: No
  Description: The .pbtxt file that defines the mapping between label IDs and label names. This parameter is required if you use custom TFRecord files.
  Type or example: STRING
  Default value: ""

Reference

The trained model for multi-label image classification assigns labels to an image based on a predefined probability threshold. You can deploy the trained model as an online service in Elastic Algorithm Service (EAS). For more information, see Model service deployment by using the PAI console.
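
As a schematic illustration of the thresholding behavior described above, the following sketch selects every label whose predicted probability meets a threshold. The variable names, scores, and threshold value are assumptions for this example and are not part of the EAS deployment.

    # Illustrative only: keep all labels whose predicted probability reaches
    # the threshold, as is typical for multi-label classification output.
    probabilities = {"cat": 0.92, "outdoor": 0.71, "night": 0.18}  # example scores
    threshold = 0.5  # assumed value; the actual threshold is model-specific

    predicted_labels = [label for label, p in probabilities.items() if p >= threshold]
    print(predicted_labels)  # ['cat', 'outdoor']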