EasyCV Mask2Former easily implements image segmentation


Image segmentation is the pixel-level classification of an image. Depending on the granularity of classification, it falls into three categories: semantic segmentation, instance segmentation, and panoptic segmentation. Image segmentation is one of the main research directions in computer vision, with important applications in medical image analysis, autonomous driving, video surveillance, augmented reality, image compression, and other fields. We have integrated SOTA algorithms for all three types of segmentation into the EasyCV framework and provide the corresponding model weights. With EasyCV you can easily predict segmentation maps for images and train customized segmentation models. This article introduces how to use EasyCV for instance, panoptic, and semantic segmentation, and explains the ideas behind the algorithm.
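To make the difference in granularity concrete, here is a toy sketch of how the three label formats could look for the same tiny image. The class ids, the 1x6 image, and the data layout are made up for illustration and are not EasyCV's output format.

```python
# Toy 1x6 "image": two sky pixels, then two people standing side by side.
# Class names and ids are hypothetical.
SKY, PERSON = 0, 1

# Semantic segmentation: one class id per pixel; the two people are not told apart.
semantic = [SKY, SKY, PERSON, PERSON, PERSON, PERSON]

# Instance segmentation: one binary mask per countable object; sky is ignored.
instances = [
    {"class": PERSON, "mask": [0, 0, 1, 1, 0, 0]},
    {"class": PERSON, "mask": [0, 0, 0, 0, 1, 1]},
]

# Panoptic segmentation: every pixel gets (class id, instance id).
# Background regions such as sky use instance id 0 in this sketch.
panoptic = [(SKY, 0)] * 2 + [(PERSON, 1)] * 2 + [(PERSON, 2)] * 2

# Dropping the instance id from the panoptic labels recovers the semantic map.
semantic_from_panoptic = [class_id for class_id, _ in panoptic]
```

Panoptic segmentation therefore carries the information of both other tasks: the class of every pixel and the identity of every object.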

Use EasyCV to predict segmentation maps

EasyCV provides instance segmentation and panoptic segmentation models trained on the COCO dataset, as well as a semantic segmentation model trained on ADE20K. After setting up the environment by following the EasyCV quick start (https://github.com/alibaba/EasyCV/blob/master/docs/source/quick_start.md), you can use these models directly to predict segmentation maps for images. The relevant model links are given in Reference.

Instance segmentation prediction

Since the Mask2Former algorithm in this example uses Deformable attention (using this operator in DETR-series algorithms effectively improves convergence speed and computational efficiency), the operator must be compiled separately:

cd thirdparty/deformable_attention
python setup.py build install

Predict the instance segmentation map of an image with Mask2formerPredictor:

import cv2
from easycv.predictors.segmentation import Mask2formerPredictor

# Load the exported instance segmentation model
predictor = Mask2formerPredictor(model_path='mask2former_instance_export.pth', task_mode='instance')
# Read the image for visualization and run prediction on the image path
img = cv2.imread('000000123213.jpg')
predict_out = predictor(['000000123213.jpg'])
# Draw the predicted instances on the image
instance_img = predictor.show_instance(img, **predict_out[0])


The output results are shown as follows:

Panoptic segmentation prediction

Predict the panoptic segmentation map of an image with Mask2formerPredictor:

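import cv2
from easycv.predictors.segmentation import Mask2formerPredictor

# Load the exported panoptic segmentation model
predictor = Mask2formerPredictor(model_path='mask2former_pan_export.pth', task_mode='panoptic')
# Read the image for visualization and run prediction on the image path
img = cv2.imread('000000123213.jpg')
predict_out = predictor(['000000123213.jpg'])
# Draw the predicted panoptic segments on the image
pan_img = predictor.show_panoptic(img, **predict_out[0])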


The output results are shown as follows:

Semantic segmentation prediction

Predict the semantic segmentation map of an image with Mask2formerPredictor:

import cv2
from easycv.predictors.segmentation import Mask2formerPredictor

# Load the exported semantic segmentation model
predictor = Mask2formerPredictor(model_path='mask2former_semantic_export.pth', task_mode='semantic')
# Read the image for visualization and run prediction on the image path
img = cv2.imread('000000123213.jpg')
predict_out = predictor(['000000123213.jpg'])
# Draw the prediction on the image
semantic_img = predictor.show_panoptic(img, **predict_out[0])


Example image source: COCO dataset

Use the Mask2Former model on Alibaba Cloud machine learning platform PAI

PAI-DSW (Data Science Workshop) is a cloud IDE developed by Alibaba Cloud machine learning platform PAI that provides an interactive programming environment for developers of all kinds. The DSW Gallery offers a variety of Notebook examples to help users learn DSW and build machine learning applications. We have also published a Sample Notebook for image segmentation with Mask2Former in the DSW Gallery (see the figure below). You are welcome to try it out!

Interpretation of the Mask2Former algorithm

The model used in the examples above is based on Mask2Former. Mask2Former is a unified segmentation architecture that can handle semantic, instance, and panoptic segmentation with a single design and achieves SOTA results: 57.8 PQ for panoptic segmentation and 50.1 AP for instance segmentation on the COCO dataset, and 57.7 mIoU for semantic segmentation on the ADE20K dataset.

Core ideas

Mask2Former performs segmentation as mask classification: the model predicts a set of binary masks, which are then combined into the final segmentation map. Each binary mask can represent a category or an instance, which is what allows the same model to handle different tasks such as semantic segmentation and instance segmentation.
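As an illustration of how a set of binary masks can be combined into one map, the sketch below weights each query's mask probability by its class probability and takes the best class per pixel. It is a minimal pure-Python sketch, not EasyCV's implementation; the function name semantic_map, the nested-list layout, and the trailing "no object" class slot are assumptions for the example.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def semantic_map(class_logits, mask_logits, num_classes):
    """class_logits: [Q][num_classes + 1], last slot is assumed to be "no object".
    mask_logits: [Q][H][W]. Returns an [H][W] grid of class ids."""
    # Drop the "no object" probability after the softmax.
    class_probs = [softmax(row)[:num_classes] for row in class_logits]
    num_queries = len(mask_logits)
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # P(query is class c) * P(pixel belongs to the query's mask).
            scores = [
                sum(class_probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        result.append(row)
    return result

# Two queries, two classes, a 1x4 image.
class_logits = [[4.0, 0.0, 0.0],   # query 0 is confident about class 0
                [0.0, 4.0, 0.0]]   # query 1 is confident about class 1
mask_logits = [[[5.0, 5.0, -5.0, -5.0]],    # query 0 covers the left half
               [[-5.0, -5.0, 5.0, 5.0]]]    # query 1 covers the right half
seg = semantic_map(class_logits, mask_logits, num_classes=2)
print(seg)  # [[0, 0, 1, 1]]
```

For instance segmentation, each query's thresholded mask would instead be kept as a separate object rather than merged per class.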

In mask classification, a core problem is finding a good representation for learning binary masks. In earlier work, for example, Mask R-CNN uses bounding boxes to restrict the feature regions and predicts a segmentation map within each region. As a consequence, Mask R-CNN can only perform instance segmentation. Mask2Former follows the approach of DETR: it uses a fixed number of object queries to represent binary masks and decodes them with a Transformer Decoder to predict this set of masks. (ps: for an interpretation of DETR, please refer to the article on DETR, DAB-DETR, and the right way to use Object Query with EasyCV.)

A significant shortcoming of DETR-series algorithms is that cross attention in the Transformer Decoder operates on global features. This makes it hard for the model to focus on the regions that actually matter, which slows convergence and lowers final accuracy. To address this, Mask2Former proposes a Transformer Decoder with masked attention. Each Transformer Decoder block predicts an attention mask and binarizes it with a threshold of 0.5. The attention mask is then used as input to the next block, so that the attention module only attends to the foreground of the mask.
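The effect of the binarized attention mask can be sketched in a few lines of pure Python. This is an illustrative sketch, not the EasyCV code: the function name masked_attention_weights and the list-based layout are assumptions, and the fallback for an empty mask is included so that the softmax always has at least one position to normalize over.

```python
import math

NEG_INF = float("-inf")

def masked_attention_weights(scores, prev_mask_probs, threshold=0.5):
    """scores: [Q][N] raw query-key attention scores over N pixel positions.
    prev_mask_probs: [Q][N] mask probabilities predicted by the previous block.
    Returns [Q][N] attention weights restricted to each query's foreground."""
    weights = []
    for query_scores, query_mask in zip(scores, prev_mask_probs):
        # Binarize the previous prediction: True means "foreground, may attend".
        foreground = [p >= threshold for p in query_mask]
        # If the predicted mask is empty, fall back to attending everywhere.
        if not any(foreground):
            foreground = [True] * len(query_mask)
        masked = [s if keep else NEG_INF
                  for s, keep in zip(query_scores, foreground)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[1.0, 2.0, 3.0, 4.0],
          [0.0, 0.0, 0.0, 0.0]]
prev_mask_probs = [[0.9, 0.8, 0.1, 0.2],   # foreground is positions 0 and 1
                   [0.1, 0.2, 0.3, 0.4]]   # empty mask, falls back to global
weights = masked_attention_weights(scores, prev_mask_probs)
```

In the first row, positions 2 and 3 receive zero weight even though their raw scores are the highest, because the previous block marked them as background.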

Model structure

Mask2Former consists of three parts:

1. Backbone (ResNet, Swin Transformer) extracts low-resolution features from the image.

2. Pixel Decoder gradually upsamples and decodes the low-resolution features into a feature pyramid from low to high resolution. The pyramid levels are fed in turn as the K and V inputs of the Transformer Decoder. Multi-scale features help the model predict accurately for targets of different scales.

The Transformer code of one layer is as follows (ps: to further speed up convergence, the Deformable attention module is used in the Pixel Decoder):

3. Transformer Decoder with masked attention uses the object queries and the multi-scale features from the Pixel Decoder to refine the binary masks layer by layer and produce the final result.

The core masked cross attention passes the mask predicted by the previous layer as the attn_mask input of MultiheadAttention, restricting the attention computation to the foreground of each query. The specific implementation code is as follows:


1. Efficient multi-scale strategy

In the pixel decoder, feature pyramid levels at 1/32, 1/16, and 1/8 of the original image resolution are decoded and used as the K and V inputs of the corresponding Transformer Decoder blocks. Following Deformable DETR, a sinusoidal positional embedding and a learnable scale-level embedding are added to each input. The levels are fed in order of resolution from low to high, and the cycle is repeated L times.
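The round-robin assignment of pyramid levels to decoder layers can be written down directly. This is a small sketch: the function name feature_schedule is hypothetical, and num_cycles=3 is only an example value for L.

```python
def feature_schedule(strides=(32, 16, 8), num_cycles=3):
    """Return the feature-map stride consumed by each decoder layer,
    cycling from low to high resolution num_cycles times."""
    num_layers = len(strides) * num_cycles
    return [strides[layer % len(strides)] for layer in range(num_layers)]

schedule = feature_schedule()
print(schedule)  # [32, 16, 8, 32, 16, 8, 32, 16, 8]
```

Each decoder layer thus attends to a single scale, which keeps the cost per layer low while every scale is still visited L times.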


2. PointRend

Memory consumption during training is reduced with PointRend, in two places:

a. When the Hungarian algorithm matches predicted masks to ground-truth labels, the matching cost is computed on K uniformly sampled points instead of the complete mask.

b. When computing the loss, the loss is computed on K points sampled with an importance sampling strategy instead of the complete mask. (ps: experiments show that computing the loss with the PointRend approach effectively improves model accuracy.)
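The two sampling strategies can be sketched as follows. This is a simplified pure-Python sketch under stated assumptions, not the EasyCV or PointRend code: it samples integer pixel indices on a flattened mask rather than interpolating at continuous coordinates, and the names uniform_points, importance_points, dice_cost, and the oversample and important_fraction defaults are illustrative choices.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def uniform_points(num_pixels, k, rng):
    """Pick k pixel indices uniformly at random (for the matching cost)."""
    return [rng.randrange(num_pixels) for _ in range(k)]

def importance_points(mask_logits, k, rng, oversample=3, important_fraction=0.75):
    """Pick k pixel indices, favouring uncertain pixels (logit close to 0)."""
    candidates = uniform_points(len(mask_logits), k * oversample, rng)
    candidates.sort(key=lambda i: abs(mask_logits[i]))  # most uncertain first
    num_important = int(k * important_fraction)
    chosen = candidates[:num_important]
    # Fill the remainder with plain uniform samples.
    chosen += uniform_points(len(mask_logits), k - num_important, rng)
    return chosen

def dice_cost(mask_logits, target, points):
    """Dice cost between prediction and ground truth on the sampled points only."""
    probs = [sigmoid(mask_logits[i]) for i in points]
    truth = [target[i] for i in points]
    overlap = sum(p * t for p, t in zip(probs, truth))
    return 1.0 - (2.0 * overlap + 1.0) / (sum(probs) + sum(truth) + 1.0)

rng = random.Random(0)
num_pixels = 100
target = [1] * 50 + [0] * 50                 # flattened ground-truth mask
good_logits = [6.0] * 50 + [-6.0] * 50       # prediction that agrees with it
bad_logits = [-v for v in good_logits]       # prediction that is inverted

match_points = uniform_points(num_pixels, 16, rng)
good_cost = dice_cost(good_logits, target, match_points)
bad_cost = dice_cost(bad_logits, target, match_points)
loss_points = importance_points(good_logits, 8, rng)
```

Only 16 of the 100 pixels are touched when computing the cost, yet the matching prediction still receives a clearly lower cost than the inverted one; this is the source of the memory savings.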

3. Optimization improvements

a. The order of self attention and cross attention is swapped: self attention -> cross attention becomes cross attention -> self attention.

b. The queries are learnable parameters. Supervised learning of the queries plays a role similar to region proposals. Experiments show that learnable queries can generate mask proposals.

c. Dropout in the Transformer Decoder is removed. Experiments show that dropout reduces accuracy.
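The reordered layer can be sketched as a plain function that takes its sub-layers as arguments. This is an illustrative sketch only: the function and stub names are hypothetical, residual connections and layer normalization are omitted for brevity, and the stubs merely record the call order.

```python
def run_decoder_layer(queries, pixel_features, attn_mask,
                      cross_attention, self_attention, feed_forward):
    """One decoder layer in the reordered form: masked cross attention first,
    then self attention, then the feed-forward network. No dropout is applied."""
    queries = cross_attention(queries, pixel_features, attn_mask)
    queries = self_attention(queries)
    return feed_forward(queries)

# Stub sub-layers that only record the order in which they are called.
calls = []

def cross_attention(queries, pixel_features, attn_mask):
    calls.append("cross_attention")
    return queries

def self_attention(queries):
    calls.append("self_attention")
    return queries

def feed_forward(queries):
    calls.append("feed_forward")
    return queries

out = run_decoder_layer([0.0, 0.0], [1.0, 2.0], None,
                        cross_attention, self_attention, feed_forward)
print(calls)  # ['cross_attention', 'self_attention', 'feed_forward']
```

Running cross attention first means the queries pick up image information before they exchange information with each other in self attention.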
