EasyVision of Platform for AI (PAI) allows you to train models for semantic image segmentation and use the trained models to make predictions. This topic describes how to use PAI commands to train a model for semantic image segmentation.
Train a model for semantic image segmentation
The image segmentation component provides the DeepLab-V3 semantic segmentation model. For more information, see Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. You can run PAI commands by using the SQL Script component, the MaxCompute client, or SQL nodes in DataWorks. For more information, see MaxCompute client (odpscmd) or Develop a MaxCompute SQL task. The following sample code shows how to use PAI commands to train a model for semantic image segmentation on a single GPU.
pai -name easy_vision_ext
-Dbuckets='oss://{bucket_name}.{oss_host}/{path}'
-Darn='acs:ram::*********:role/aliyunodpspaidefaultrole'
-DossHost='{oss_host}'
-DgpuRequired=100
-Dcmd=train
-Dparam_config='--model_type DeeplabV3 --backbone resnet_v1_50 --backbone_feature_stride 16 --bn_trainable true --num_classes 21 --num_epochs 1 --model_dir oss://YOUR_BUCKET_NAME/test/test_deeplabv3 --train_data oss://YOUR_BUCKET_NAME/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord --test_data oss://YOUR_BUCKET_NAME/data/test/pascal_voc_seg_aug/voc_ev_train.tfrecord --num_test_example 2 --train_batch_size 6 --test_batch_size 1 --image_crop_size 513 --lr_type polynomial_decay --initial_learning_rate 0.007 --power 0.9'
Parameters
Parameter | Required | Description | Format | Default Value |
buckets | Yes | The URL of the Object Storage Service (OSS) bucket. The URL must end with a forward slash (/). | oss://{bucket_name}.{oss_host}/{path} | N/A |
arn | Yes | The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role that has the permissions to access the OSS bucket. To obtain the ARN, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS, click View Authorization in the Actions column, and then copy the value of the Role Name parameter. For more information, see Grant the permissions that are required to use Machine Learning Designer. | acs:ram::*:role/AliyunODPSPAIDefaultRole | N/A |
ossHost | No | The domain name of the OSS bucket. For more information, see Regions and endpoints. | oss-{region}.aliyuncs.com | By default, the domain name is specified by the buckets parameter. |
cluster | No | The configuration of distributed training. | JSON string | "" |
gpuRequired | No | The number of GPUs that each worker uses, expressed in units of 100 per GPU. The default value 100 indicates that each worker uses one GPU. If you set this parameter to 200, each worker requests two GPUs. | 100 | 100 |
cmd | Yes | The type of the job that runs on EasyVision. Set this parameter to train. | train | N/A |
param_config | Yes | The configuration for model training. The format of the param_config parameter is the same as the format of the ArgumentParser() object in Python. For more information, see param_config. | STRING | N/A |
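The constraints in the table above (required parameters, the trailing forward slash in buckets, and gpuRequired counted in units of 100 per GPU) can be checked before a command is submitted. The following Python snippet is a minimal sketch of such a check. The helper names build_pai_command and gpu_required are hypothetical and are not part of PAI or EasyVision, and the quoting rule (single quotation marks around any string that is not purely alphanumeric) is an assumption based on the sample command in this topic.

```python
REQUIRED = ("buckets", "arn", "cmd", "param_config")


def gpu_required(num_gpus):
    """Convert a GPU count per worker to the gpuRequired value (100 per GPU)."""
    return num_gpus * 100


def build_pai_command(name, params):
    """Render a PAI command string from a dict of -D parameters.

    Hypothetical helper for illustration only; it only formats text and
    does not submit anything to MaxCompute.
    """
    for key in REQUIRED:
        if not params.get(key):
            raise ValueError(f"missing required parameter: {key}")
    if not params["buckets"].endswith("/"):
        raise ValueError("buckets must end with a forward slash (/)")

    lines = [f"pai -name {name}"]
    for key, value in params.items():
        if isinstance(value, str) and not value.isalnum():
            if "'" in value:
                raise ValueError(f"{key} must not contain single quotation marks")
            # Assumption: quote any string that is not a plain token.
            lines.append(f"-D{key}='{value}'")
        else:
            lines.append(f"-D{key}={value}")
    return "\n".join(lines)


command = build_pai_command(
    "easy_vision_ext",
    {
        "buckets": "oss://{bucket_name}.{oss_host}/{path}/",
        "arn": "acs:ram::*********:role/aliyunodpspaidefaultrole",
        "ossHost": "{oss_host}",
        "gpuRequired": gpu_required(2),
        "cmd": "train",
        "param_config": "--model_type DeeplabV3 --num_classes 21",
    },
)
print(command)
```

The rendered text can then be pasted into the SQL Script component or the MaxCompute client. Placeholders such as {bucket_name} are kept as-is and must be replaced with real values.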
param_config
The param_config parameter specifies the model training parameters in the command-line format that is required by the Python argparse module. Example:
-Dparam_config='--backbone resnet_v1_50 --num_classes 200 --model_dir oss://YOUR_BUCKET_NAME/exp_dir'
When you specify this parameter, do not enclose the value of a string parameter in double quotation marks (") or single quotation marks (').
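Because param_config follows the argparse command-line format, the value can be assembled from a dictionary and checked locally before it is used. The following Python snippet is a minimal sketch. The helper build_param_config is hypothetical, and the parser defined here is only an illustration of the format, not the parser that EasyVision uses internally.

```python
import argparse
import shlex


def build_param_config(options):
    """Join options into an argparse-style string such as '--key value'.

    Lists become space-separated values, booleans become true/false, and
    values that contain quotation marks are rejected.
    """
    parts = []
    for key, value in options.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        tokens = []
        for item in values:
            token = str(item).lower() if isinstance(item, bool) else str(item)
            if '"' in token or "'" in token:
                raise ValueError(f"{key} must not contain quotation marks")
            tokens.append(token)
        parts.append(f"--{key} " + " ".join(tokens))
    return " ".join(parts)


param_config = build_param_config(
    {
        "backbone": "resnet_v1_50",
        "num_classes": 200,
        "model_dir": "oss://YOUR_BUCKET_NAME/exp_dir",
        "decay_epochs": [16, 18],
    }
)

# Illustrative parser: shows how argparse reads the same string.
parser = argparse.ArgumentParser()
parser.add_argument("--backbone")
parser.add_argument("--num_classes", type=int)
parser.add_argument("--model_dir")
parser.add_argument("--decay_epochs", type=int, nargs="+")
args = parser.parse_args(shlex.split(param_config))

print(param_config)
print(args.num_classes, args.decay_epochs)
```

List-valued parameters such as decay_epochs are written as space-separated values after a single flag, which matches the 16 18 notation used in the parameter descriptions.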
Parameter | Required | Description | Format | Default Value |
model_type | Yes | The type of the model that you want to train. When you train a model for semantic image segmentation, set this parameter to DeeplabV3. | STRING | N/A |
backbone | Yes | The name of the backbone network that is used by the model. Valid values:
| STRING | N/A |
weight_decay | No | The value of L2 regularization. | FLOAT | 1e-4 |
num_classes | Yes | The number of segmentation classes, including the background class. | 21 | N/A |
backbone_feature_stride | No | The feature downsampling step size of the backbone network. | INT | 16 |
bn_trainable | No | Specifies whether the batch normalization (BN) layer is trainable. If the value of the train_batch_size parameter is greater than 8, set this parameter to true. | BOOL | true |
image_crop_size | No | The size of the image after cropping. | INT | 513 |
optimizer | No | The type of the optimizer. Valid values:
| STRING | momentum |
lr_type | No | The policy that is used to adjust the learning rate. Valid values:
| STRING | exponential_decay |
initial_learning_rate | No | The initial learning rate. | FLOAT | 0.01 |
decay_epochs | No | If you set the lr_type parameter to exponential_decay, this parameter corresponds to the decay_steps parameter of the tf.train.exponential_decay function. The system converts the value from epochs to decay_steps based on the total number of training samples. In this case, you can set this parameter to half of the total number of epochs. Example: 10. If you set the lr_type parameter to manual_step, this parameter specifies the epochs at which the learning rate is adjusted. For example, a value of 16 18 specifies that the learning rate is adjusted at the 16th and 18th epochs. In most cases, you can set the two values to 8/10 × N and 9/10 × N, where N is the total number of epochs. | INTEGER list | 20 |
decay_factor | No | The factor by which the learning rate decreases. This parameter is equivalent to the decay_rate parameter of the tf.train.exponential_decay function. | FLOAT | 0.95 |
staircase | No | Specifies whether to decrease the learning rate at discrete intervals. This parameter is equivalent to the staircase parameter of the tf.train.exponential_decay function. | BOOL | true |
power | No | The power of the polynomial. This parameter is equivalent to the power parameter of the tf.train.polynomial_decay function. This parameter is valid only if you set the lr_type parameter to polynomial_decay. | FLOAT | 0.9 |
learning_rates | No | The learning rate for each of the specified epochs. If you set the lr_type parameter to manual_step, this parameter is required. The number of learning rates must match the number of epochs in the decay_epochs parameter. For example, if the decay_epochs parameter is set to 20 40, you must specify two learning rates in the learning_rates parameter, such as 0.001 0.0001. This indicates that the learning rate is adjusted to 0.001 at the 20th epoch and to 0.0001 at the 40th epoch. We recommend that you adjust the learning rate to one tenth, one hundredth, and one thousandth of the initial learning rate in sequence. | FLOAT list | N/A |
lr_warmup | No | Specifies whether to warm up the learning rate. | BOOL | false |
lr_warm_up_epochs | No | The number of epochs for which you want to warm up the learning rate. | FLOAT | 1 |
train_data | Yes | The OSS path of the training dataset. | oss://path/to/train_*.tfrecord | N/A |
test_data | Yes | The OSS path of the evaluation dataset. | oss://path/to/test_*.tfrecord | N/A |
train_batch_size | Yes | The training batch size, which is the number of samples that are used in each training iteration. | INT | N/A |
test_batch_size | Yes | The evaluation batch size, which is the number of samples that are evaluated in each iteration. | INT | N/A |
train_num_readers | No | The number of concurrent threads used to read the samples that are used for training. | INT | 4 |
model_dir | Yes | The OSS path of the model. | oss://path/to/model | N/A |
pretrained_model | No | The OSS path of the pretrained model. If you specify this parameter, the model is fine-tuned from the pretrained model. | oss://examplebucket/pretrained_models/inception_v4.ckpt | "" |
use_pretrained_model | No | Specifies whether to use a pretrained model. | BOOL | true |
num_epochs | Yes | The number of training epochs. If you set this parameter to 1, each sample in the training dataset is processed once. | INT | N/A |
num_test_example | No | The number of samples that are used for model evaluation. If you set this parameter to -1, the entire evaluation dataset is used for model evaluation. | INT | -1 |
num_visualizations | No | The number of samples that can be visualized during model evaluation. | INT | 10 |
save_checkpoint_epochs | No | The interval at which a checkpoint is saved. Unit: epoch. If you set this parameter to 1, a checkpoint is saved each time an epoch is completed. | INT | 1 |
save_summary_epochs | No | The epoch interval at which a summary is saved. If you set this parameter to 0.01, a summary is saved each time 1% of the training data is iterated. | FLOAT | 0.01 |
num_train_images | No | The total number of samples that are used for training. If you use custom TFRecord files, this parameter is required. | INT | 0 |
label_map_path | No | The .pbtxt file that defines the mapping between the label ID and label name. If you use custom TFRecord files, this parameter is required. | STRING | "" |
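The learning rate parameters in the table above (lr_type, initial_learning_rate, decay_epochs, decay_factor, staircase, power, and learning_rates) describe three schedules. The following Python snippet is a minimal sketch of how each schedule behaves, written in plain Python after the formulas of tf.train.exponential_decay and tf.train.polynomial_decay. It is not EasyVision code. The function names, the end_lr argument, the use of the total step count as the polynomial decay horizon, the 1-based epoch numbering, and the rounding used in steps_per_epoch are assumptions made for illustration.

```python
import math


def steps_per_epoch(num_train_images, train_batch_size):
    """Assumed conversion from epochs to steps: one epoch covers all samples once."""
    return math.ceil(num_train_images / train_batch_size)


def exponential_decay(initial_lr, step, decay_steps, decay_factor=0.95, staircase=True):
    """lr = initial_lr * decay_factor ** (step / decay_steps)."""
    exponent = step / decay_steps
    if staircase:
        # Decay at discrete intervals instead of continuously.
        exponent = math.floor(exponent)
    return initial_lr * decay_factor ** exponent


def polynomial_decay(initial_lr, step, total_steps, power=0.9, end_lr=0.0):
    """lr = (initial_lr - end_lr) * (1 - step / total_steps) ** power + end_lr."""
    step = min(step, total_steps)
    return (initial_lr - end_lr) * (1 - step / total_steps) ** power + end_lr


def manual_step(initial_lr, epoch, decay_epochs, learning_rates):
    """Return the learning rate in effect at a given epoch (1-based, assumed)."""
    if len(decay_epochs) != len(learning_rates):
        raise ValueError("decay_epochs and learning_rates must have the same length")
    lr = initial_lr
    for boundary, rate in zip(decay_epochs, learning_rates):
        if epoch >= boundary:
            lr = rate
    return lr


# decay_epochs=10 with 1000 samples (hypothetical count) and a batch size of 6.
decay_steps = 10 * steps_per_epoch(1000, 6)
for epoch in (1, 20, 40):
    print(epoch, manual_step(0.01, epoch, [20, 40], [0.001, 0.0001]))
print(decay_steps, exponential_decay(0.01, decay_steps, decay_steps))
print(polynomial_decay(0.007, 50, 100))
```

With decay_epochs set to 20 40 and learning_rates set to 0.001 0.0001, the manual_step sketch keeps the initial learning rate through the 19th epoch, switches to 0.001 at the 20th epoch, and switches to 0.0001 at the 40th epoch, which matches the example in the learning_rates description.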