Train a model for semantic image segmentation

EasyVision of Machine Learning Platform for AI (PAI) allows you to train models for semantic image segmentation and use the trained models to make predictions. This topic describes how to use PAI commands to train a model for semantic image segmentation.

EasyVision implements a semantic segmentation model based on DeepLabv3. For more information, see Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. This example trains the model on a single GPU. You can use the following sample command to train a model for semantic image segmentation:

pai -name easy_vision_ext
           -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
           -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
           -DossHost='{oss_host}'
           -DgpuRequired=100
           -Dcmd train
           -Dparam_config '
             --model_type DeeplabV3
             --backbone  resnet_v1_50
             --backbone_feature_stride 16
             --bn_trainable true
             --num_classes 21
             --num_epochs 1
             --model_dir oss://{bucket_name}/test/test_deeplabv3
             --train_data oss://pai-vision-data-sh/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord
             --test_data oss://pai-vision-data-sh/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord
             --num_test_example 2
             --train_batch_size 6
             --test_batch_size 1
             --image_crop_size 513
             --lr_type polynomial_decay
             --initial_learning_rate 0.007
             --power 0.9
           '

Parameters


Parameter	Required	Description	Value format or example value	Default value
buckets	Yes	The endpoint of the Object Storage Service (OSS) bucket.	oss://{bucket_name}.{oss_host}/{path}	N/A
arn	Yes	The Alibaba Cloud Resource Name (ARN) of the RAM role that has the permissions to access OSS resources. For more information about how to obtain the ARN, see the "I/O parameters" section of the Task parameters of PAI-TensorFlow topic.	acs:ram::*:role/aliyunodpspaidefaultrole	N/A
host	No	The domain name of the OSS bucket.	oss-{region}.aliyuncs.com	The domain name that is specified in the buckets parameter.
cluster	No	The configuration of parameters that are used for distributed training.	JSON string	""
gpuRequired	No	Specifies whether to use GPUs and how many GPUs each worker uses. The default value 100 indicates that each worker uses one GPU. The value 200 indicates that each worker uses two GPUs.	100	100
cmd	Yes	The type of the EasyVision task. Set this parameter to train when you train a model.	train	N/A
param_config	Yes	The configuration of parameters that are used for model training. Specify the value as command-line style arguments, in the same format that an ArgumentParser() object in Python parses. For more information, see param_config.	STRING	N/A
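
The gpuRequired row gives two data points: 100 for one GPU per worker and 200 for two. The following Python sketch expresses that arithmetic. It assumes that the value scales linearly at 100 per GPU beyond those two documented values, which the table does not state, and the helper name gpu_required is hypothetical.

```python
def gpu_required(gpus_per_worker: int) -> int:
    """Return a -DgpuRequired value for the given number of GPUs per worker.

    Assumption: the value is 100 per GPU, extrapolated from the two
    documented cases (100 -> one GPU, 200 -> two GPUs). Values below
    one GPU are not covered by the table, so they are rejected here.
    """
    if gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be at least 1")
    return 100 * gpus_per_worker


print(gpu_required(1))  # 100
print(gpu_required(2))  # 200
```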

param_config

The param_config parameter contains several parameters that are used for model training. Specify them as command-line style arguments (--name value), in the same format that an ArgumentParser() object in Python parses. The following example shows the configuration of the param_config parameter:

-Dparam_config = '
--backbone resnet_v1_50
--num_classes 200
--model_dir oss://your/bucket/exp_dir
'

Note: Do not enclose the values of string parameters in the param_config parameter in double quotation marks (") or single quotation marks (').
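
If you generate the command from a script, the text of the param_config parameter can be assembled from a dictionary. The following Python sketch is one way to do this under stated assumptions: the helper name build_param_config is hypothetical, Boolean values are written as lowercase true or false as in the sample command, and list values such as decay_epochs are written as space-separated items. The helper rejects values that contain quotation marks, in line with the preceding note.

```python
def build_param_config(params: dict) -> str:
    """Render a dict as the text placed between the quotes of -Dparam_config."""
    lines = []
    for name, value in params.items():
        if isinstance(value, bool):
            # Check bool before anything else: bool is a subclass of int.
            text = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Lists such as decay_epochs become space-separated items.
            text = " ".join(str(item) for item in value)
        else:
            text = str(value)
        if '"' in text or "'" in text:
            raise ValueError(f"{name}: values must not contain quotation marks")
        lines.append(f"--{name} {text}")
    return "\n".join(lines)


config = build_param_config({
    "backbone": "resnet_v1_50",
    "num_classes": 200,
    "model_dir": "oss://your/bucket/exp_dir",
})
print(config)
```

The printed text matches the three argument lines of the preceding example and can be pasted between the single quotation marks of the -Dparam_config parameter.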


Parameter	Required	Description	Value format or example value	Default value
model_type	Yes	The type of the model to train. Set this parameter to DeeplabV3 when you train a model for semantic image segmentation.	STRING	N/A
backbone	Yes	The name of the backbone network that is used by the model. Valid values: resnet_v1_50, resnet_v1_101, resnet_v1a_18, resnet_v1a_34, resnet_v1d_50, resnet_v1d_101, xception_41, xception_65, and xception_71.	STRING	N/A
weight_decay	No	The value of L2 regularization.	FLOAT	1e-4
num_classes	Yes	The number of categories, including background categories.	21	N/A
backbone_feature_stride	No	The feature downsampling stride of the backbone network.	INT. Example value: 8 or 16.	16
bn_trainable	No	Specifies whether the batch normalization (BN) layer is trainable. Set this parameter to true if the value of the train_batch_size parameter is greater than 8.	BOOL	true
image_crop_size	No	The size of the image after cropping.	INT	513
optimizer	No	The type of the optimizer. Valid values: momentum: stochastic gradient descent (SGD) with momentum. adam: the Adam optimizer.	STRING	momentum
lr_type	No	The policy that is used to adjust the learning rate. Valid values: exponential_decay: exponential decay. polynomial_decay: polynomial decay. If you set the lr_type parameter to polynomial_decay, the num_steps parameter is automatically set to the total number of training iterations, and the end_learning_rate parameter is automatically set to one thousandth of the value of the initial_learning_rate parameter. manual_step: manually adjusts the learning rate at specified epochs. If you set the lr_type parameter to manual_step, you must set the decay_epochs parameter to specify the epochs at which the learning rate is adjusted, and the learning_rates parameter to specify the learning rates to use. cosine_decay: adjusts the learning rate by following a cosine curve. For more information, see SGDR: Stochastic Gradient Descent with Warm Restarts. If you set the lr_type parameter to cosine_decay, you must set the decay_epochs parameter to specify the epochs at which the learning rate is adjusted.	STRING	exponential_decay
initial_learning_rate	No	The initial learning rate.	FLOAT	0.01
decay_epochs	No	If you set the lr_type parameter to exponential_decay, the decay_epochs parameter is equivalent to the decay_steps parameter of tf.train.exponential_decay. In this case, the decay_epochs parameter specifies the epoch interval at which the learning rate is adjusted. The system automatically converts the value of the decay_epochs parameter to the value of the decay_steps parameter based on the total number of training data entries. Typically, you can set the decay_epochs parameter to half of the total number of epochs. For example, you can set this parameter to 10 if the total number of epochs is 20. If you set the lr_type parameter to manual_step, the decay_epochs parameter specifies the epochs at which the learning rate is adjusted. For example, the value 16 18 indicates that the learning rate is adjusted at the 16th and 18th epochs. Typically, if the total number of epochs is N, you can set the two values of the decay_epochs parameter to 8/10 × N and 9/10 × N.	INTEGER list. Example value: 20 20 40 60.	20
decay_factor	No	The decay rate. This parameter is equivalent to the decay_rate parameter of tf.train.exponential_decay.	FLOAT	0.95
staircase	No	Specifies whether the learning rate changes based on the decay_epochs parameter. This parameter is equivalent to the staircase parameter of tf.train.exponential_decay.	BOOL	true
power	No	The power of the polynomial. This parameter is equivalent to the power parameter of tf.train.polynomial_decay.	FLOAT	0.9
learning_rates	No	The learning rates that you want to set for the specified epochs. This parameter is required when you set the lr_type parameter to manual_step. Specify one learning rate for each epoch in the decay_epochs parameter. For example, if the decay_epochs parameter is set to 20 40, you must specify two learning rates in the learning_rates parameter, such as 0.001 0.0001. This indicates that the learning rate is adjusted to 0.001 at the 20th epoch and to 0.0001 at the 40th epoch. We recommend that you adjust the learning rates to one tenth, one hundredth, and one thousandth of the initial learning rate in sequence.	FLOAT list	N/A
lr_warmup	No	Specifies whether to warm up the learning rate.	BOOL	false
lr_warm_up_epochs	No	The number of epochs for which you want to warm up the learning rate.	FLOAT	1
train_data	Yes	The OSS path of the data that is used to train the model.	oss://path/to/train_*.tfrecord	N/A
test_data	Yes	The OSS path of the data that is evaluated during the training.	oss://path/to/test_*.tfrecord	N/A
train_batch_size	Yes	The number of data entries in each training batch.	INT. Example value: 32.	N/A
test_batch_size	Yes	The number of data entries in each evaluation batch.	INT. Example value: 32.	N/A
train_num_readers	No	The number of concurrent threads that are used to read the training data.	INT	4
model_dir	Yes	The OSS path in which the model is stored.	oss://path/to/model	N/A
pretrained_model	No	The OSS path of the pretrained model. If this parameter is specified, the model is fine-tuned based on the pretrained model.	oss://pai-vision-data-sh/pretrained_models/inception_v4.ckpt	""
use_pretrained_model	No	Specifies whether to use a pretrained model.	BOOL	true
num_epochs	Yes	The number of training epochs. The value 1 indicates that all training data is iterated over once.	INT. Example value: 40.	N/A
num_test_example	No	The number of data entries that are evaluated during the training. The value -1 indicates that all data in the test dataset is evaluated.	INT. Example value: 2000.	-1
num_visualizations	No	The number of data entries that can be visualized during the evaluation.	INT	10
save_checkpoint_epochs	No	The epoch interval at which a checkpoint is saved. The value 1 indicates that a checkpoint is saved each time an epoch is complete.	INT	1
save_summary_epochs	No	The epoch interval at which a summary is saved. The value of 0.01 indicates that a summary is saved each time 1% of the training data is iterated.	FLOAT	0.01
num_train_images	No	The total number of data entries that are used for the training. This parameter is required if you use custom TFRecord files to train the model.	INT	0
label_map_path	No	The category mapping file. This parameter is required if you use custom TFRecord files to train the model.	STRING	""
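
The lr_type, decay_epochs, decay_factor, staircase, and power rows describe how the learning rate changes during training. The following Python sketch reproduces the arithmetic for the polynomial_decay and exponential_decay policies, using the formulas of tf.train.polynomial_decay (without cycling) and tf.train.exponential_decay in TensorFlow 1.x. It is an illustration, not the EasyVision implementation. The conversion from epochs to steps (entries divided by batch size, rounded up) is an assumption, the function names are hypothetical, and the dataset size of 600 entries is made up for the example.

```python
import math


def steps_per_epoch(num_train_images, train_batch_size):
    # Assumption: one epoch covers every training entry once, rounded up.
    return math.ceil(num_train_images / train_batch_size)


def polynomial_decay(step, total_steps, initial_lr, power=0.9):
    # end_learning_rate is one thousandth of initial_learning_rate,
    # as described in the lr_type row.
    end_lr = initial_lr / 1000.0
    step = min(step, total_steps)
    return (initial_lr - end_lr) * (1 - step / total_steps) ** power + end_lr


def exponential_decay(step, decay_steps, initial_lr,
                      decay_factor=0.95, staircase=True):
    # With staircase, the rate drops only at whole multiples of decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_factor ** exponent


# Values from the sample command: train_batch_size 6, num_epochs 1,
# initial_learning_rate 0.007, power 0.9. The 600 entries are hypothetical.
per_epoch = steps_per_epoch(num_train_images=600, train_batch_size=6)
total_steps = per_epoch * 1
for step in (0, total_steps // 2, total_steps):
    print(step, polynomial_decay(step, total_steps, initial_lr=0.007))

# exponential_decay with decay_epochs 20: one decay every 20 epochs.
decay_steps = 20 * per_epoch
print(exponential_decay(decay_steps, decay_steps, initial_lr=0.01))
```

With polynomial_decay, the learning rate starts at the initial value and reaches one thousandth of it at the last training step. With exponential_decay and staircase enabled, the learning rate stays constant within each decay_epochs interval and is multiplied by decay_factor at the end of each interval.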