
Platform for AI: Train a model for semantic image segmentation

Last Updated: Jul 24, 2024

EasyVision of Platform for AI (PAI) allows you to train models for semantic image segmentation and use the trained models to make predictions. This topic describes how to use PAI commands to train a model for semantic image segmentation.

Train a model for semantic image segmentation

The image segmentation component of EasyVision provides the DeepLab-V3 semantic segmentation model. For more information, see Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. You can use the SQL Script component, the MaxCompute client, or SQL nodes in DataWorks to run PAI commands. For more information, see MaxCompute client (odpscmd) or Develop a MaxCompute SQL task. The following sample code shows how to use PAI commands to train a model for semantic image segmentation on a single GPU.

pai -name easy_vision_ext
           -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
           -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
           -DossHost='{oss_host}'
           -DgpuRequired=100
           -Dcmd=train
           -Dparam_config='--model_type DeeplabV3 --backbone resnet_v1_50 --backbone_feature_stride 16 --bn_trainable true --num_classes 21 --num_epochs 1 --model_dir oss://YOUR_BUCKET_NAME/test/test_deeplabv3 --train_data oss://YOUR_BUCKET_NAME/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord --test_data oss://YOUR_BUCKET_NAME/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord --num_test_example 2 --train_batch_size 6 --test_batch_size 1 --image_crop_size 513 --lr_type polynomial_decay --initial_learning_rate 0.007 --power 0.9'

Parameters

| Parameter | Required | Description | Format | Default Value |
| --- | --- | --- | --- | --- |
| buckets | Yes | The URL of the Object Storage Service (OSS) bucket. The URL must end with a forward slash (/). | oss://{bucket_name}.{oss_host}/{path} | N/A |
| arn | Yes | The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that has the permissions to access the OSS bucket. To obtain the ARN, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS, click View Authorization in the Actions column, and copy the value of the Role Name parameter. For more information, see Grant the permissions that are required to use Machine Learning Designer. | acs:ram::*:role/AliyunODPSPAIDefaultRole | N/A |
| ossHost | No | The domain name of the OSS bucket. For more information, see Regions and endpoints. | oss-{region}.aliyuncs.com | By default, the domain name is parsed from the buckets parameter. |
| cluster | No | The configuration of distributed training. | JSON string | "" |
| gpuRequired | No | The number of GPUs that each worker uses, multiplied by 100. The default value 100 indicates that each worker uses one GPU. A value of 200 indicates that each worker uses two GPUs. | 100 | 100 |
| cmd | Yes | The type of the job that runs on EasyVision. Set this parameter to train. | train | N/A |
| param_config | Yes | The configuration for model training. The format of the param_config parameter is the same as that of an ArgumentParser() object in Python. For more information, see param_config. | STRING | N/A |

param_config

The param_config parameter specifies the model training parameters in the command-line format that is required by the Python argparse module. Example:

-Dparam_config='--backbone resnet_v1_50 --num_classes 200 --model_dir oss://YOUR_BUCKET_NAME/exp_dir'
Note

When you specify this parameter, do not enclose the value of a string parameter in double quotation marks (") or single quotation marks (').
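Because the param_config format follows Python's argparse conventions, its behavior can be illustrated with a short sketch. This is not EasyVision source code: the parser below is a hypothetical stand-in that defines only three of the flags, to show how a param_config string is tokenized and parsed.

```python
import argparse
import shlex

# Hypothetical parser that mirrors a few param_config flags.
# EasyVision defines its own full argument set; this sketch only
# shows how the command-line format maps to parsed values.
parser = argparse.ArgumentParser()
parser.add_argument("--backbone", type=str)
parser.add_argument("--num_classes", type=int)
parser.add_argument("--model_dir", type=str)

param_config = ("--backbone resnet_v1_50 --num_classes 200 "
                "--model_dir oss://YOUR_BUCKET_NAME/exp_dir")
args = parser.parse_args(shlex.split(param_config))

print(args.backbone)     # resnet_v1_50
print(args.num_classes)  # 200
```

Note that the string values (resnet_v1_50 and the OSS path) are passed without quotation marks, which matches the rule stated in the note above.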

| Parameter | Required | Description | Format | Default Value |
| --- | --- | --- | --- | --- |
| model_type | Yes | The type of the model that you want to train. To train a model for semantic image segmentation, set this parameter to DeeplabV3. | STRING | N/A |
| backbone | Yes | The name of the backbone network that is used by the model. Valid values: resnet_v1_50, resnet_v1_101, resnet_v1a_18, resnet_v1a_34, resnet_v1d_50, resnet_v1d_101, xception_41, xception_65, xception_71. | STRING | N/A |
| weight_decay | No | The coefficient of L2 regularization. | FLOAT | 1e-4 |
| num_classes | Yes | The number of segmentation classes, including the background class. | 21 | N/A |
| backbone_feature_stride | No | The stride of feature downsampling in the backbone network. | INT | 16 |
| bn_trainable | No | Specifies whether the batch normalization (BN) layers are trainable. If the value of the train_batch_size parameter is greater than 8, set this parameter to true. | BOOL | true |
| image_crop_size | No | The size to which images are cropped. | INT | 513 |
| optimizer | No | The type of the optimizer. Valid values: momentum (the stochastic gradient descent (SGD) algorithm with momentum) and adam. | STRING | momentum |
| lr_type | No | The policy that is used to adjust the learning rate. Valid values: exponential_decay (the learning rate decays exponentially); polynomial_decay (the learning rate decays polynomially; the num_steps parameter is automatically set to the total number of training iterations, and the end_learning_rate parameter is automatically set to one thousandth of the value of the initial_learning_rate parameter); manual_step (the learning rate is manually adjusted for specific epochs; you must configure the decay_epochs parameter to specify the epochs in which the learning rate is adjusted and the learning_rates parameter to specify the learning rate for each of these epochs); cosine_decay (the learning rate is adjusted along a cosine curve; for more information, see SGDR: Stochastic Gradient Descent with Warm Restarts; you must configure the decay_epochs parameter to specify the epochs in which the learning rate is adjusted). | STRING | exponential_decay |
| initial_learning_rate | No | The initial learning rate. | FLOAT | 0.01 |
| decay_epochs | No | If the lr_type parameter is set to exponential_decay, this parameter is equivalent to the decay_steps parameter of the tf.train.exponential_decay function. The system converts this parameter to decay_steps based on the total number of training samples. In most cases, you can set this parameter to half of the total number of epochs, for example, 10. If the lr_type parameter is set to manual_step, this parameter specifies the epochs in which the learning rate is adjusted. For example, the value 16 18 specifies that the learning rate is adjusted in the 16th and 18th epochs. In most cases, you can set the values to 8/10 × N and 9/10 × N, where N is the total number of epochs. | INTEGER list | 20 |
| decay_factor | No | The factor by which the learning rate decays. This parameter is equivalent to the decay_rate parameter of the tf.train.exponential_decay function. | FLOAT | 0.95 |
| staircase | No | Specifies whether to decrease the learning rate at discrete intervals. This parameter is equivalent to the staircase parameter of the tf.train.exponential_decay function. | BOOL | true |
| power | No | The power of the polynomial. This parameter is equivalent to the power parameter of the tf.train.polynomial_decay function. This parameter takes effect only if the lr_type parameter is set to polynomial_decay. | FLOAT | 0.9 |
| learning_rates | No | The learning rate for each of the epochs that are specified by the decay_epochs parameter. This parameter is required if the lr_type parameter is set to manual_step. For example, if the decay_epochs parameter is set to 20 40, you must specify two learning rates, such as 0.001 0.0001. In this case, the learning rate is adjusted to 0.001 in the 20th epoch and to 0.0001 in the 40th epoch. We recommend that you adjust the learning rate to one tenth, one hundredth, and one thousandth of the initial learning rate in sequence. | FLOAT list | N/A |
| lr_warmup | No | Specifies whether to warm up the learning rate. | BOOL | false |
| lr_warm_up_epochs | No | The number of epochs over which the learning rate is warmed up. | FLOAT | 1 |
| train_data | Yes | The OSS path of the training dataset. | oss://path/to/train_*.tfrecord | N/A |
| test_data | Yes | The OSS path of the evaluation dataset. | oss://path/to/test_*.tfrecord | N/A |
| train_batch_size | Yes | The batch size for training. | INT | N/A |
| test_batch_size | Yes | The batch size for evaluation. | INT | N/A |
| train_num_readers | No | The number of concurrent threads that read training samples. | INT | 4 |
| model_dir | Yes | The OSS path in which the model is saved. | oss://path/to/model | N/A |
| pretrained_model | No | The OSS path of the pretrained model. If you specify this parameter, the model is fine-tuned based on the pretrained model. | oss://examplebucket/pretrained_models/inception_v4.ckpt | "" |
| use_pretrained_model | No | Specifies whether to use a pretrained model. | BOOL | true |
| num_epochs | Yes | The number of training epochs. If you set this parameter to 1, each sample in the training dataset is processed once. | INT | N/A |
| num_test_example | No | The number of samples that are used for model evaluation. If you set this parameter to -1, all samples in the evaluation dataset are used. | INT | -1 |
| num_visualizations | No | The number of samples that are visualized during model evaluation. | INT | 10 |
| save_checkpoint_epochs | No | The interval at which checkpoints are saved. Unit: epoch. If you set this parameter to 1, a checkpoint is saved each time an epoch is completed. | INT | 1 |
| save_summary_epochs | No | The interval at which summaries are saved. Unit: epoch. If you set this parameter to 0.01, a summary is saved each time 1% of the training data is iterated. | FLOAT | 0.01 |
| num_train_images | No | The total number of training samples. This parameter is required if you use custom TFRecord files. | INT | 0 |
| label_map_path | No | The .pbtxt file that defines the mappings between label IDs and label names. This parameter is required if you use custom TFRecord files. | STRING | "" |