Reproducing the Self-Supervised Algorithm FastConvMAE with EasyCV

Self-supervised learning uses large amounts of unlabeled data for representation learning and then fine-tunes the parameters on specific downstream tasks, which greatly reduces the labeling effort for image tasks and saves considerable labor cost. In recent years, self-supervised learning has excelled in computer vision and attracted growing attention, with a series of works such as SimCLR, MoCo, SwAV, DINO, MoBY, and MAE. Among them, MAE stands out: its simplicity and efficiency have drawn wide interest and inspired follow-up work such as MixMIM and VideoMAE. For a detailed explanation of MAE, please refer to the previous article: Introduction to MAE Self-Supervised Algorithm and Recurrence Based on EasyCV.
ConvMAE is a work jointly published by Shanghai Artificial Intelligence Laboratory and MMLab at NeurIPS 2022. Compared with MAE trained for the same number of epochs, its fine-tuning accuracy on ImageNet-1K is 1.4% higher. On COCO2017, fine-tuning for only 25 epochs improves AP box by 2.9 and AP mask by 2.2 over MAE fine-tuned for 100 epochs, and on semantic segmentation its mIoU is 3.6% higher than that of MAE. On this basis, the authors propose FastConvMAE, which further optimizes training performance. With only 50 epochs of pre-training, its ImageNet fine-tuning accuracy exceeds that of MAE pre-trained for 1600 epochs by 0.77 points (83.6 vs. 84.37). On the detection task, its accuracy also exceeds ViTDet and Swin.
EasyCV is Alibaba's open-source, PyTorch-based, all-in-one visual algorithm modeling tool with self-supervised learning and Transformer technology at its core. It covers mainstream visual modeling tasks such as image classification, metric learning, object detection, instance/semantic/panoptic segmentation, and keypoint detection. It is easy to use and extend, focuses on performance tuning, and aims to bring more, faster, and stronger algorithms to the community.
Recently, FastConvMAE was open-sourced for the first time within the EasyCV framework. This article introduces the main ideas of ConvMAE and FastConvMAE and the corresponding code implementation, and finally provides a detailed tutorial on FastConvMAE pre-training and fine-tuning on downstream tasks.
ConvMAE shows that, with local inductive bias and a multi-scale pyramid structure, MAE-style training can learn better feature representations. The work proposes:
1. A block-wise masking strategy to preserve computational efficiency.
2. Outputting the multi-scale features of the encoder to capture both fine-grained and coarse-grained image information.
Original reference: https://arxiv.org/abs/2205.03892
The experimental results show that these two strategies are simple and effective, giving ConvMAE significant gains over MAE on multiple vision tasks. Taking ConvMAE-Base versus MAE-Base as an example: on image classification, the fine-tuning accuracy on ImageNet-1K increases by 1.4%; on object detection, fine-tuning on COCO2017 for 25 epochs reaches 53.2% AP box and 47.1% AP mask, which are 2.9% and 2.2% higher than MAE-Base fine-tuned for 100 epochs; on semantic segmentation with a UperNet head, ConvMAE-Base reaches 51.7% mIoU on ADE20K, which is 3.6% higher than MAE-Base.
The overall process of ConvMAE
Unlike MAE, the ConvMAE encoder progressively abstracts the input image into multi-scale token embeddings, while the decoder reconstructs the pixels corresponding to the masked tokens. The high-resolution token embeddings in the early stages are encoded locally with convolution blocks, and the low-resolution token embeddings in the later stage use transformer blocks to aggregate global information. The ConvMAE encoder therefore captures both local and global information at different stages and produces multi-scale features.
Existing masked autoencoding frameworks such as BEiT and SimMIM use masking strategies that cannot be applied directly to ConvMAE, because all tokens must be retained in the subsequent transformer stage. This makes pre-training large models computationally expensive and loses the efficiency that MAE gains by dropping masked tokens in the transformer encoder. In addition, pre-training an encoder with a convolution-transformer structure directly under random masks causes the convolution part to leak information about the masked regions, which also lowers the quality of the pre-trained model.
To address these problems, ConvMAE proposes a hybrid convolution-transformer architecture with a block-wise masking strategy: first, a mask is randomly generated over the tokens of the later transformer stage, and then the mask is progressively upsampled to the higher resolutions of the early convolution stages. In this way, the tokens entering the later stage can be completely separated into masked tokens and visible tokens, so ConvMAE inherits the computational efficiency of MAE's sparse encoder.
The following sections introduce the encoder, the mask strategy, and the decoder in turn.
As shown in the overall flowchart, the encoder has 3 stages whose output feature resolutions are H/4 × W/4, H/8 × W/8, and H/16 × W/16, where H × W is the input image resolution. The first two are convolution stages, which use convolution modules to convert the input into token embeddings E1 ∈ R^(H/4 × W/4 × C1) and E2 ∈ R^(H/8 × W/8 × C2). The convolution module replaces the self-attention operation with a 5 × 5 convolution. The receptive field in the first two stages is small and mainly captures local image features. The third stage uses transformer modules to fuse the coarse-grained features and expand the receptive field to the entire image, producing token embeddings E3 ∈ R^(H/16 × W/16 × C3). Between stages, the tokens are downsampled by a convolution with a stride of 2.
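As a quick check of the resolutions above, the following minimal sketch computes the token grid of each stage for a given input size. The function name `stage_grids` and the 224 × 224 example input are illustrative assumptions, not part of the ConvMAE or EasyCV code.

```python
def stage_grids(height, width, strides=(4, 8, 16)):
    """Return the (rows, cols) token grid of each encoder stage.

    `strides` are the cumulative downsampling factors of stage-1 to
    stage-3 as described in the text (H/4, H/8, H/16).
    """
    grids = []
    for s in strides:
        if height % s or width % s:
            raise ValueError(f"input {height}x{width} is not divisible by {s}")
        grids.append((height // s, width // s))
    return grids


if __name__ == "__main__":
    # 224 x 224 is an assumed example resolution.
    for stage, (rows, cols) in enumerate(stage_grids(224, 224), start=1):
        print(f"stage-{stage}: {rows} x {cols} tokens")
```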
Other hybrid architectures that contain transformers, such as CPT, Container, Uniformer, CMT, and Swin, replace absolute position encoding in the first stage with relative position encoding or zero-padded convolution. The authors found that using absolute position encoding in the third (transformer) stage gives the best performance. The class token is also removed from the encoder.
Mask strategy
MAE, BEiT, and similar methods apply a random mask to the input patches. The same strategy cannot be applied directly to the ConvMAE encoder: if the mask were drawn independently at random from the H/4 × W/4 tokens of stage-1, almost every token of the downsampled stage-3 would contain some visible information, and the encoder would no longer be sparse. The authors therefore generate the mask on the input tokens of stage-3 with the usual ratio (for example, 75%) and then upsample it by 2× and 4× to obtain the masks for stage-2 and stage-1. In this way, only a small fraction (for example, 25%) of the tokens is visible in all three stages, so the efficiency of the encoder during pre-training is not affected. The task of the decoder remains the same: to reconstruct the tokens that were masked out during encoding.
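The block-wise masking described above can be sketched with NumPy as follows. This is a minimal illustration, not the EasyCV implementation: the function name `blockwise_masks`, the 14 × 14 stage-3 grid (a 224 × 224 input), and the use of nearest-neighbour repetition for upsampling are assumptions.

```python
import numpy as np


def blockwise_masks(grid=14, mask_ratio=0.75, seed=0):
    """Sample a stage-3 mask and upsample it to stage-2 and stage-1.

    Returns three boolean arrays (True = masked) of shape
    (4*grid, 4*grid), (2*grid, 2*grid) and (grid, grid).
    """
    rng = np.random.default_rng(seed)
    num_tokens = grid * grid
    num_masked = int(round(num_tokens * mask_ratio))

    # Randomly choose which stage-3 tokens are masked.
    flat = np.zeros(num_tokens, dtype=bool)
    flat[rng.permutation(num_tokens)[:num_masked]] = True
    mask3 = flat.reshape(grid, grid)

    # Each stage-3 token covers a 2x2 block of stage-2 tokens
    # and a 4x4 block of stage-1 tokens.
    mask2 = mask3.repeat(2, axis=0).repeat(2, axis=1)
    mask1 = mask3.repeat(4, axis=0).repeat(4, axis=1)
    return mask1, mask2, mask3


if __name__ == "__main__":
    m1, m2, m3 = blockwise_masks()
    for name, m in (("stage-1", m1), ("stage-2", m2), ("stage-3", m3)):
        print(name, m.shape, f"visible fraction = {1 - m.mean():.2f}")
```

Because the stage-1 and stage-2 masks are exact upsamplings of the stage-3 mask, the visible fraction is the same at every stage, which is what keeps the encoder sparse.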
However, the 5 × 5 convolutions in the first two stages would leak the reconstruction targets of invisible tokens at the edges of masked patches. To avoid this and ensure pre-training quality, the authors use masked convolution in the first two stages, so that the masked regions do not take part in the encoding.
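The leakage problem and its fix can be illustrated with a small NumPy experiment. The sketch below assumes one simple way to realize masked convolution, namely zeroing the masked positions before the convolution; a 5 × 5 mean filter stands in for a learned 5 × 5 convolution. The function names are hypothetical and this is not the authors' exact implementation.

```python
import numpy as np


def conv5x5_mean(x):
    """Naive 5x5 mean filter with zero padding (stand-in for a 5x5 conv)."""
    padded = np.pad(x, 2)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = padded[i:i + 5, j:j + 5].mean()
    return out


def masked_conv5x5(x, masked):
    """Zero the masked positions before the convolution (True = masked)."""
    return conv5x5_mean(np.where(masked, 0.0, x))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    masked = np.zeros((8, 8), dtype=bool)
    masked[:, 4:] = True  # assumed layout: the right half is masked

    a = rng.normal(size=(8, 8))
    b = a.copy()
    b[masked] = rng.normal(size=masked.sum())  # change only masked pixels

    visible = ~masked
    leak = np.abs(conv5x5_mean(a) - conv5x5_mean(b))[visible].max()
    safe = np.abs(masked_conv5x5(a, masked) - masked_conv5x5(b, masked))[visible].max()
    print(f"plain conv, max change at visible tokens:  {leak:.4f}")
    print(f"masked conv, max change at visible tokens: {safe:.4f}")
```

With the plain convolution, visible tokens next to the mask boundary change when the masked pixels change, which means they carry information about the reconstruction target. With the masked variant, the visible outputs are unaffected.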
The original MAE decoder takes the encoder output together with the mask tokens as input and reconstructs the image through stacked transformer blocks. The ConvMAE encoder produces multi-scale features E1, E2, and E3, which capture both fine-grained and coarse-grained image information. For better pre-training, the authors downsample E1 and E2 to the size of E3 with stride-4 and stride-2 convolutions, fuse the multi-scale features, and pass the result through a linear layer to obtain the visible tokens that are fed to the decoder. The objective is the same as in MAE: the loss is the MSE between the predicted vectors and the pixel values of the masked patches, so only the reconstruction of masked patches is considered.
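The reconstruction objective, an MSE computed only over masked patches, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes (one row per patch); the function name `masked_mse` is hypothetical and the real training code operates on batched tensors.

```python
import numpy as np


def masked_mse(pred, target, masked):
    """Mean squared error averaged over masked patches only.

    pred, target: (num_patches, patch_dim) arrays.
    masked:       (num_patches,) boolean array, True = masked patch.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float(per_patch[masked].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed sizes: 196 patches of 16 x 16 x 3 pixels, 75% masked.
    target = rng.normal(size=(196, 768))
    pred = target + 0.1 * rng.normal(size=(196, 768))
    masked = np.zeros(196, dtype=bool)
    masked[rng.permutation(196)[:147]] = True
    print(f"loss on masked patches: {masked_mse(pred, target, masked):.4f}")
```

Visible patches contribute nothing to the loss, however large their error is.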
Downstream tasks
After pre-training, ConvMAE can output multi-scale features for detection and segmentation tasks.
In the detection task, E4 is first obtained by applying 2 × 2 max pooling to the stage-3 output feature E3. Because stage-3 of ConvMAE-Base has 11 global self-attention layers, its computational cost is too high. Following the ViT benchmark, the authors replace all global self-attention layers in stage-3 except the 1st, 4th, 7th, and 11th with local self-attention layers that use a 7 × 7 window. The modified local self-attention layers are still initialized from the pre-trained global self-attention weights. The global relative position bias is shared among the global transformer blocks, and the local relative position bias is shared among the local transformer blocks, which greatly reduces the computation and GPU memory overhead of stage-3. The multi-scale features E1, E2, E3, and E4 are then fed to the Mask R-CNN head for object detection.
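Two details of this detection setup can be sketched in a few lines: which stage-3 layers keep global attention, and how E4 is derived from E3 by 2 × 2 max pooling. The function names and the small feature shape below are illustrative assumptions, not the EasyCV code.

```python
import numpy as np


def stage3_attention_layout(num_layers=11, global_layers=(1, 4, 7, 11)):
    """Label each stage-3 layer (1-indexed) as 'global' or 'local'."""
    return ["global" if i in global_layers else "local"
            for i in range(1, num_layers + 1)]


def max_pool_2x2(feature):
    """2x2 max pooling with stride 2 on an (H, W, C) array; H and W must be even."""
    h, w, c = feature.shape
    return feature.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


if __name__ == "__main__":
    for index, kind in enumerate(stage3_attention_layout(), start=1):
        print(f"layer {index}: {kind} self-attention")

    # Assumed toy feature map: 14 x 14 tokens with 8 channels.
    e3 = np.random.default_rng(0).normal(size=(14, 14, 8))
    e4 = max_pool_2x2(e3)
    print("E3", e3.shape, "-> E4", e4.shape)
```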
The segmentation task retains the original stage-3 structure.
Image classification
ConvMAE is pre-trained on ImageNet-1K with a mask that keeps only 25% of the input tokens visible. The decoder is an 8-layer transformer with an embedding dimension of 512 and 12 attention heads. The pre-training parameters and classification fine-tuning results are as follows:
BEiT, pre-trained for 300 epochs, reaches a fine-tuning accuracy of 83.0% and a linear-probing accuracy of 37.6%. By comparison, ConvMAE needs only 25% of the tokens and a lightweight decoder to reach 85% with fine-tuning and 70.9% with linear probing. Compared with the original MAE pre-trained for the same 1600 epochs, ConvMAE is 1.4 points higher. Compared with SimMIM (with a Swin-B backbone), it is 1 point higher.
The authors replace the backbone of Mask R-CNN with ConvMAE and load the pre-trained ConvMAE weights to train on the COCO dataset.
Compared with ViT fine-tuned for 100 epochs on COCO, ConvMAE improves APbox and APmask by 2.9 and 2.2 points after only 25 epochs of fine-tuning.
Compared with ViTDet and MIMDet, ConvMAE uses fewer fine-tuning epochs and fewer parameters while surpassing them by 2.0% and 1.7%, respectively.
Compared with Swin and MViTv2, it is 4.0%/3.6% and 2.2%/1.4% higher in APbox/APmask, respectively.
The authors replace the backbone of UperNet with ConvMAE and load the pre-trained ConvMAE weights to train on the ADE20K dataset.
The results show that ConvMAE achieves higher performance (51.7%) than DeiT, Swin, MoCo-v3, and other networks, indicating that its multi-scale features greatly narrow the transfer gap between the pre-trained backbone and the downstream network.
Fast ConvMAE
Although ConvMAE improves accuracy on downstream tasks such as classification, detection, and segmentation and reduces the pretraining-finetuning discrepancy, pre-training is still time-consuming: the reported ConvMAE results use 1600 pre-training epochs. The authors therefore optimized ConvMAE further and proposed Fast ConvMAE, which combines complementary masks with decoder fusion for fast masked modeling and shortens pre-training from the original 1600 epochs to 50 epochs. The official FastConvMAE paper will be released in the future.
First, FastConvMAE introduces a Mixture of Reconstructor (MoR) that fuses decoders, allowing masked patches to learn complementary information from different tokenizers, including the self-ensembling property of EMA, the similarity-discrimination capability of DINO, and the multimodal knowledge of CLIP. MoR consists of two parts: the Partially-Shared Decoder (PS-Decoder) and the Mixture of Tokenizer (MoT). The PS-Decoder avoids gradient conflicts between the different kinds of knowledge from different tokenizers. MoT generates the different tokens that serve as targets for the masked patches.
At the same time, the masking uses a complementary strategy. The original mask keeps, for example, only 25% of the tokens visible each time. FastConvMAE splits the tokens into 4 masks, each of which keeps a different 25% visible, so the 4 masks are complementary. This is equivalent to turning 1 image into 4 training samples, which in theory achieves 4 times the learning effect.
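The complementary masks can be sketched by partitioning a random permutation of the token indices into 4 disjoint groups. The following NumPy example is a minimal illustration under assumed sizes (196 stage-3 tokens); the function name `complementary_masks` is hypothetical and does not correspond to an EasyCV API.

```python
import numpy as np


def complementary_masks(num_tokens=196, num_parts=4, seed=0):
    """Split the tokens into `num_parts` disjoint visible sets.

    Returns a boolean array of shape (num_parts, num_tokens) where
    True = masked. Each token is visible in exactly one row.
    """
    if num_tokens % num_parts:
        raise ValueError("num_tokens must be divisible by num_parts")
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_tokens)
    masks = np.ones((num_parts, num_tokens), dtype=bool)
    for part, visible_ids in enumerate(np.split(order, num_parts)):
        masks[part, visible_ids] = False
    return masks


if __name__ == "__main__":
    masks = complementary_masks()
    visible = ~masks
    print("visible tokens per mask:", visible.sum(axis=1))
    print("every token visible exactly once:", bool((visible.sum(axis=0) == 1).all()))
```

Each row keeps a different 25% of the tokens visible, and together the 4 rows cover every token of the image exactly once.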
