Train a model for semantic image segmentation

EasyVision of Platform for AI (PAI) allows you to train models for semantic image segmentation and use the trained models to make predictions. This topic describes how to use PAI commands to train a model for semantic image segmentation.

Semantic image segmentation in EasyVision uses the DeepLab-V3 model. For more information, see Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. You can run PAI commands by using the SQL Script component, the MaxCompute client, or SQL nodes in DataWorks. For more information, see MaxCompute client (odpscmd) or Develop a MaxCompute SQL task. The following sample code shows how to use PAI commands to train a model for semantic image segmentation on a single GPU.

pai -name easy_vision_ext
           -Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
           -Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
           -DossHost='{oss_host}'
           -DgpuRequired=100
           -Dcmd=train
           -Dparam_config='--model_type DeeplabV3 --backbone resnet_v1_50 --backbone_feature_stride 16 --bn_trainable true --num_classes 21 --num_epochs 1 --model_dir oss://YOUR_BUCKET_NAME/test/test_deeplabv3 --train_data oss://YOUR_BUCKET_NAME/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord --test_data oss://YOUR_BUCKET_NAME/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord --num_test_example 2 --train_batch_size 6 --test_batch_size 1 --image_crop_size 513 --lr_type polynomial_decay --initial_learning_rate 0.007 --power 0.9'

Parameters

Parameter	Required	Description	Format	Default Value
buckets	Yes	The URL of the Object Storage Service (OSS) bucket. The URL must end with a forward slash (/).	oss://{bucket_name}.{oss_host}/{path}	N/A
arn	Yes	The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that has the permissions to access the OSS bucket. To obtain the ARN, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS, click View Authorization in the Actions column, and then copy the value of the Role Name parameter. For more information, see Grant the permissions that are required to use Machine Learning Designer.	acs:ram::*:role/AliyunODPSPAIDefaultRole	N/A
ossHost	No	The domain name of the OSS bucket. For more information, see Regions and endpoints.	oss-{region}.aliyuncs.com	By default, the domain name is specified by the buckets parameter.
cluster	No	The configuration of distributed training.	JSON string	""
gpuRequired	No	The GPU resources that each worker requests. A value of 100 indicates one GPU, which is the default. A value of 200 indicates that each worker requests two GPUs.	100	100
cmd	Yes	The type of the job that runs on EasyVision. Set this parameter to train.	train	N/A
param_config	Yes	The configuration for model training. The value uses the same command-line argument format that the ArgumentParser() object in Python accepts. For more information, see param_config.	STRING	N/A
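The parameters in the preceding table can be assembled programmatically before the command is submitted. The following Python snippet is a minimal sketch of that idea, not part of EasyVision: build_pai_command is a hypothetical helper, the placeholder values are taken from the sample command, and the snippet assumes the -Dname=value form for every parameter. It applies two rules stated in the table: the buckets URL must end with a forward slash, and gpuRequired is 100 for each GPU that a worker requests.

```python
def build_pai_command(buckets, arn, param_config, num_gpus=1, oss_host=None):
    """Assemble an easy_vision_ext training command (hypothetical helper)."""
    # The buckets URL must end with a forward slash (/).
    if not buckets.endswith("/"):
        raise ValueError("buckets must end with a forward slash (/)")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    parts = [
        "pai -name easy_vision_ext",
        f"-Dbuckets='{buckets}'",
        f"-Darn='{arn}'",
    ]
    # ossHost is optional; by default it is derived from the buckets parameter.
    if oss_host:
        parts.append(f"-DossHost='{oss_host}'")
    parts += [
        # 100 indicates one GPU per worker, 200 indicates two GPUs.
        f"-DgpuRequired={100 * num_gpus}",
        "-Dcmd=train",
        f"-Dparam_config='{param_config}'",
    ]
    return "\n    ".join(parts)


command = build_pai_command(
    buckets="oss://{bucket_name}.{oss_host}/{path}/",
    arn="acs:ram::*********:role/aliyunodpspaidefaultrole",
    param_config="--model_type DeeplabV3 --backbone resnet_v1_50 --num_classes 21",
    num_gpus=2,
)
print(command)
```

The generated text can then be pasted into the SQL Script component or the MaxCompute client after the placeholders are replaced with real values.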

param_config

The param_config parameter specifies the model training parameters in the command-line format that is required by the Python argparse module. Example:

-Dparam_config='--backbone resnet_v1_50 --num_classes 200 --model_dir oss://YOUR_BUCKET_NAME/exp_dir'

Note

When you specify this parameter, do not enclose the value of a string parameter in double quotation marks (") or single quotation marks (').
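To illustrate the format, the following Python snippet parses a param_config string with argparse. It is a sketch under stated assumptions, not the parser that EasyVision uses: parse_param_config is a hypothetical function, only a few of the parameters are declared, and the string is assumed to be split on whitespace. Under that assumption, list parameters such as decay_epochs are written as space-separated values, and quotation marks around a string value would become part of the value itself, which is why they must be omitted.

```python
import argparse


def parse_param_config(param_config):
    """Parse a param_config string (hypothetical, covers a subset of parameters)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_type", required=True)
    parser.add_argument("--backbone", required=True)
    parser.add_argument("--num_classes", type=int, required=True)
    parser.add_argument("--model_dir", required=True)
    parser.add_argument("--lr_type", default="exponential_decay")
    # List parameters accept one or more space-separated values.
    parser.add_argument("--decay_epochs", type=int, nargs="+", default=[20])
    parser.add_argument("--learning_rates", type=float, nargs="+")
    # Assumption: tokens are separated by whitespace, so quotes are not stripped.
    return parser.parse_args(param_config.split())


args = parse_param_config(
    "--model_type DeeplabV3 --backbone resnet_v1_50 --num_classes 21 "
    "--model_dir oss://YOUR_BUCKET_NAME/exp_dir "
    "--lr_type manual_step --decay_epochs 16 18 --learning_rates 0.001 0.0001"
)
print(args.backbone, args.decay_epochs, args.learning_rates)
```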

Parameter	Required	Description	Format	Default Value
model_type	Yes	The type of the model that you want to train. When you train a model for semantic image segmentation, set this parameter to DeeplabV3.	STRING	N/A
backbone	Yes	The name of the backbone network that is used by the model. Valid values: resnet_v1_50, resnet_v1_101, resnet_v1a_18, resnet_v1a_34, resnet_v1d_50, resnet_v1d_101, xception_41, xception_65, and xception_71.	STRING	N/A
weight_decay	No	The coefficient of L2 regularization.	FLOAT	1e-4
num_classes	Yes	The number of segmented classes, including background classes.	21	N/A
backbone_feature_stride	No	The feature downsampling step size of the backbone network.	INT	16
bn_trainable	No	Specifies whether the batch normalization (BN) layer is trainable. If the value of the train_batch_size parameter is greater than 8, set this parameter to true.	BOOL	true
image_crop_size	No	The size of the image after cropping.	INT	513
optimizer	No	The type of the optimizer. Valid values: momentum: The stochastic gradient descent (SGD) algorithm with momentum is used. adam: The Adam algorithm is used.	STRING	momentum
lr_type	No	The policy that is used to adjust the learning rate. Valid values: exponential_decay: The learning rate is subject to exponential decay. polynomial_decay: The learning rate is subject to polynomial decay. If you set the lr_type parameter to polynomial_decay, the num_steps parameter is automatically set to the total number of training iterations, and the end_learning_rate parameter is automatically set to one thousandth of the value of the initial_learning_rate parameter. manual_step: The learning rate is manually adjusted at specified epochs. If you set the lr_type parameter to manual_step, you must configure the decay_epochs parameter to specify the epochs at which the learning rate is adjusted, and the learning_rates parameter to specify the learning rate for each of these epochs. cosine_decay: The learning rate is adjusted along a cosine curve. For more information, see SGDR: Stochastic Gradient Descent with Warm Restarts. If you set the lr_type parameter to cosine_decay, you must configure the decay_epochs parameter to specify the epochs at which the learning rate is adjusted.	STRING	exponential_decay
initial_learning_rate	No	The initial learning rate.	FLOAT	0.01
decay_epochs	No	If you set the lr_type parameter to exponential_decay, this parameter corresponds to the decay_steps parameter of the tf.train.exponential_decay function, expressed in epochs. The system automatically converts the value of this parameter to the value of the decay_steps parameter based on the total number of samples that are used for training. In this case, you can set this parameter to half of the total number of epochs. Example: 10. If you set the lr_type parameter to manual_step, this parameter specifies the epochs at which the learning rate is adjusted. For example, a value of 16 18 specifies that the learning rate is adjusted in the 16th and 18th epochs. In most cases, you can set this parameter to 8/10 × N and 9/10 × N, where N is the total number of epochs.	INTEGER list	20
decay_factor	No	The factor by which the learning rate decreases. This parameter is equivalent to the decay_rate parameter of the tf.train.exponential_decay function.	FLOAT	0.95
staircase	No	Specifies whether to decrease the learning rate at discrete intervals. This parameter is equivalent to the staircase parameter of the tf.train.exponential_decay function.	BOOL	true
power	No	The power of the polynomial. This parameter is equivalent to the power parameter of the tf.train.polynomial_decay function. This parameter is valid only if you set the lr_type parameter to polynomial_decay.	FLOAT	0.9
learning_rates	No	The learning rate for each of the specified epochs. If you set the lr_type parameter to manual_step, this parameter is required. The number of learning rates must match the number of epochs in the decay_epochs parameter. For example, if the decay_epochs parameter is set to 20 40, you must specify two learning rates in the learning_rates parameter, such as 0.001 0.0001. This indicates that the learning rate is adjusted to 0.001 in the 20th epoch and to 0.0001 in the 40th epoch. We recommend that you adjust the learning rate to one tenth, one hundredth, and one thousandth of the initial learning rate in sequence.	FLOAT list	N/A
lr_warmup	No	Specifies whether to warm up the learning rate.	BOOL	false
lr_warm_up_epochs	No	The number of epochs for which you want to warm up the learning rate.	FLOAT	1
train_data	Yes	The OSS path of the training dataset.	oss://path/to/train_*.tfrecord	N/A
test_data	Yes	The OSS path of the evaluation dataset.	oss://path/to/test_*.tfrecord	N/A
train_batch_size	Yes	The number of samples in each training batch.	INT	N/A
test_batch_size	Yes	The number of samples in each evaluation batch.	INT	N/A
train_num_readers	No	The number of concurrent threads used to read the samples that are used for training.	INT	4
model_dir	Yes	The OSS path of the model.	oss://path/to/model	N/A
pretrained_model	No	The OSS path of the pretrained model. If you specify this parameter, the model is fine-tuned based on the pretrained model.	oss://examplebucket/pretrained_models/inception_v4.ckpt	""
use_pretrained_model	No	Specifies whether to use a pretrained model.	BOOL	true
num_epochs	Yes	The number of training epochs. If you set this parameter to 1, each sample in the training dataset is processed once.	INT	N/A
num_test_example	No	The number of samples that are used for model evaluation. If you set this parameter to -1, the entire evaluation dataset is used for model evaluation.	INT	-1
num_visualizations	No	The number of samples that can be visualized during model evaluation.	INT	10
save_checkpoint_epochs	No	The interval at which a checkpoint is saved. Unit: epoch. If you set this parameter to 1, a checkpoint is saved each time an epoch is completed.	INT	1
save_summary_epochs	No	The epoch interval at which a summary is saved. If you set this parameter to 0.01, a summary is saved each time 1% of the training data is iterated.	FLOAT	0.01
num_train_images	No	The total number of samples that are used for training. If you use custom TFRecord files, this parameter is required.	INT	0
label_map_path	No	The .pbtxt file that defines the mapping between the label ID and label name. If you use custom TFRecord files, this parameter is required.	STRING	""
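The learning rate policies described for the lr_type parameter can be made concrete with a short calculation. The following Python snippet is a minimal sketch, not the EasyVision implementation: the function names are hypothetical, the formulas follow the documented behavior of tf.train.exponential_decay and tf.train.polynomial_decay, and the epoch-to-step conversion (rounding up the number of samples divided by the batch size) and the epoch numbering used for manual_step are assumptions.

```python
import math


def steps_per_epoch(num_train_images, train_batch_size):
    # Assumption: one epoch is ceil(samples / batch size) training steps.
    return math.ceil(num_train_images / train_batch_size)


def exponential_decay(step, initial_lr, decay_steps, decay_factor=0.95, staircase=True):
    # With staircase=True, the rate drops at discrete intervals of decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_factor ** exponent


def polynomial_decay(step, initial_lr, total_steps, power=0.9):
    # end_learning_rate is one thousandth of initial_learning_rate.
    end_lr = initial_lr / 1000.0
    step = min(step, total_steps)
    return (initial_lr - end_lr) * (1 - step / total_steps) ** power + end_lr


def manual_step(epoch, initial_lr, decay_epochs, learning_rates):
    # decay_epochs and learning_rates must contain the same number of values.
    if len(decay_epochs) != len(learning_rates):
        raise ValueError("decay_epochs and learning_rates must have the same length")
    lr = initial_lr
    for boundary, rate in zip(decay_epochs, learning_rates):
        # Assumption: the new rate applies from the listed epoch onward.
        if epoch >= boundary:
            lr = rate
    return lr


# Hypothetical dataset size; batch size 6 and 0.007 come from the sample command.
total_steps = 1 * steps_per_epoch(num_train_images=1200, train_batch_size=6)
for step in (0, total_steps // 2, total_steps):
    print(step, polynomial_decay(step, initial_lr=0.007, total_steps=total_steps))
```

With these definitions, polynomial_decay starts at the initial learning rate and ends at one thousandth of it, and manual_step with decay_epochs 20 40 and learning_rates 0.001 0.0001 keeps the initial rate until epoch 20.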