By Qianyan and Linzai
In recent years, self-supervised learning and Transformers have taken off in computer vision. Self-supervised pre-training on images greatly reduces the labeling workload of image tasks and the associated labor costs. The success of Transformers in NLP also leaves considerable room for improving the accuracy of CV models. Alibaba Cloud Platform of Artificial Intelligence (PAI) has developed EasyCV, an all-in-one visual modeling tool, to promote the adoption of self-supervised learning and Vision Transformers in Alibaba Group and on Alibaba Cloud. EasyCV builds a rich and comprehensive system of self-supervised algorithms and provides SOTA Vision Transformer pre-trained models. The EasyCV model zoo covers self-supervised training for images, image classification, metric learning, object detection, and key point detection. EasyCV also provides out-of-the-box training and inference capabilities for developers, along with in-depth optimization of training and inference efficiency. In addition, EasyCV is compatible with the Alibaba AI system, so users can use all of its features in the Alibaba Cloud environment.
After EasyCV was refined through use in Alibaba's internal businesses, we hope to bring the framework to the community to serve CV algorithm developers and enthusiasts, so they can try the latest self-supervised learning and Transformer technology quickly and conveniently and apply it in production.
How is the EasyCV framework designed? What features can developers use? What are the future plans? Let's take a closer look at EasyCV.
EasyCV is Alibaba's open-source, PyTorch-based, all-in-one visual algorithm modeling tool with self-supervised learning and Transformer technology at its core. EasyCV supports multiple business units (BUs) in Alibaba Group (such as Search, Amoy, Youku, and Fliggy). It also serves several enterprises on Alibaba Cloud. EasyCV meets customers' needs for customizing models and solving business problems through platform-based components.
The Address of Open-Source EasyCV: https://github.com/alibaba/EasyCV
Self-supervised pre-training for images, which relies on unlabeled training data, has developed rapidly over the past two years. On various visual tasks, its results are now comparable to (or better than) those of supervised training, which requires large amounts of labeled data. At the same time, the Transformer, a great success in NLP, has achieved better SOTA results on various image tasks, and its use has surged. Self-supervised pre-training of Vision Transformers emerged as the combination of the two.
Self-supervised learning and Vision Transformer algorithms are updated and iterated quickly, which causes trouble for CV algorithm developers: open-source code is scattered, and implementation methods and styles are inconsistent. These problems lead to high learning and reproduction costs and poor training and inference performance. With EasyCV, a flexible and easy-to-use algorithm framework, Alibaba Cloud PAI systematically brings together SOTA self-supervised algorithms and Transformer pre-trained models, provides unified and easy-to-use interfaces, and optimizes the performance of self-supervised training on large datasets. This makes it easy for users to try the latest self-supervised pre-training technology and Transformer models and to apply them in business.
In addition, building on PAI's deep learning training and inference acceleration technology, the PAI team integrates features such as I/O optimization, model training acceleration, quantization, and pruning into EasyCV, which gives it a performance advantage. Based on the product ecosystem of Alibaba Cloud PAI, users can easily perform model management, online service deployment, and large-scale offline inference tasks.
EasyCV Architecture Diagram
The underlying engine of EasyCV is based on PyTorch and is connected to the PyTorch training accelerator for training acceleration. The algorithm framework is divided into the following layers:
EasyCV supports running and debugging in the local environment. At the same time, if users want to execute large-scale production tasks, EasyCV supports easy deployment in Alibaba Cloud PAI products.
Self-supervised learning does not require data labeling. Thanks to the introduction of contrastive learning, its results have gradually become comparable to those of supervised learning, so it has become a focus of both academia and industry in recent years. EasyCV provides mainstream self-supervised algorithms based on contrastive learning (such as SimCLR, MoCo v1/v2, SwAV, MoBY, and DINO). The MAE algorithm, which is based on masked image modeling, is also reproduced. In addition, we provide comprehensive benchmarking tools to evaluate self-supervised pre-trained models on ImageNet.
Based on systematic self-supervised algorithms and benchmarking tools, users can easily improve models, compare effects, and innovate models. Users can also train better pre-training models for their businesses based on a large amount of unlabeled data.
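To make the idea of contrastive learning more concrete, here is a minimal sketch of an InfoNCE-style loss, the kind of objective used by contrastive methods such as MoCo. It is written in plain Python for readability and is not EasyCV's implementation. The function names, the toy two-dimensional embeddings, and the temperature value are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(query, positive, negatives, temperature=0.2):
    """InfoNCE-style loss for one query (illustrative helper, not an EasyCV API).

    The loss is the negative log-softmax of the positive pair's similarity
    among the similarities to the positive and all negatives.
    """
    logits = [cosine_similarity(query, positive) / temperature]
    logits += [cosine_similarity(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits))
    return log_sum_exp - logits[0]


# Toy embeddings: two augmented views of the same image should be close.
query = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce_loss(query, [0.9, 0.1], negatives)
loss_misaligned = info_nce_loss(query, [0.1, 0.9], negatives)
print(loss_aligned, loss_misaligned)
```

The loss is small when the positive view is close to the query and grows when the positive drifts toward a negative, which is the signal that lets the model learn without labels.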
The following table shows the pre-training speed of existing self-supervised algorithms based on ImageNet data and the effect of linear eval/finetune on the ImageNet validation set.
Model | DALI TFRecord (samples/s) | JPG (samples/s) | Performance Improvement | Remarks |
dino_deit_small_p16 | 492.3 | 204.8 | 140% | fp16 batch_size=32x8 |
moby_deit_small_p16 | 1312.8 | 1089.3 | 20.5% | fp16 batch_size=128x8 |
mocov2_resnet50 | 2164.9 | 1365.3 | 58.56% | fp16 batch_size=128x8 |
swav_resnet50 | 1024.0 | 853.3 | 20% | fp16 batch_size=128x8 |
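The Performance Improvement column can be checked against the two throughput columns. The sketch below assumes the column is the relative gain of DALI TFRecord throughput over JPG throughput; that definition is an assumption, although it is consistent with the numbers in the table. The helper name is hypothetical.

```python
def relative_improvement(fast, slow):
    """Relative throughput gain in percent (assumed definition of the column)."""
    return (fast / slow - 1.0) * 100.0


# (DALI TFRecord samples/s, JPG samples/s), copied from the table above.
throughput = {
    'dino_deit_small_p16': (492.3, 204.8),
    'moby_deit_small_p16': (1312.8, 1089.3),
    'mocov2_resnet50': (2164.9, 1365.3),
    'swav_resnet50': (1024.0, 853.3),
}

improvements = {
    name: relative_improvement(dali, jpg)
    for name, (dali, jpg) in throughput.items()
}

for name, percent in improvements.items():
    print(f'{name}: {percent:.1f}%')
```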
As the backbone network, a CNN works with the heads of various downstream tasks and is a commonly used structure for CV models. EasyCV provides a variety of traditional CNN structures, including ResNet, ResNeXt, HRNet, DarkNet, Inception, MobileNet, GENet, and MnasNet. With the development of Vision Transformers, the Transformer has replaced the CNN in a growing number of fields and become a backbone network with stronger expressive power. In addition to implementing the commonly used ViT and Swin Transformer, the framework integrates PyTorch Image Models (timm) to support a more comprehensive set of Transformer structures.
Combined with the self-supervised algorithm, all models support self-supervised pre-training and ImageNet supervised data training, providing users with a variety of pre-training backbones. Users can simply configure and use them in the downstream tasks preset by the framework and connect them to the custom downstream tasks.
1. The framework provides parameterized methods and Python API to perform training, evaluation, and model export and provides a complete prediction interface to conduct end-to-end inference.
# Configuration file method
python tools/train.py configs/classification/cifar10/r50.py --work_dir work_dirs/classification/cifar10/r50 --fp16

# Simple method for passing parameters
python tools/train.py --model_type Classification --model.num_classes 10 --data.data_source.type ClsSourceImageList --data.data_source.list data/train.txt

# Python API for training
import easycv.tools
config_path = 'configs/classification/cifar10/r50.py'
easycv.tools.train(config_path, gpus=8, fp16=False, master_port=29527)

# Prediction interface for end-to-end inference
import cv2
from easycv.predictors.classifier import TorchClassifier

output_ckpt = 'work_dirs/classification/cifar10/r50/epoch_350_export.pth'
tcls = TorchClassifier(output_ckpt)

img = cv2.imread('aeroplane_s_000004.png')
# input image should be RGB order
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
output = tcls.predict([img])
print(output)
2. The framework focuses on high-level visual tasks. For the three major tasks of classification, detection, and segmentation, the framework draws on application scenarios (such as content risk control, smart retail, intelligent monitoring, image matching, product category prediction, product detection, product attribute recognition, and industrial quality inspection), on Alibaba's internal business practices, and on the experience of serving external customers of Alibaba Cloud. On that basis, it selects SOTA algorithms whose results have been reproduced, provides pre-trained models, and integrates the training, inference, and on-device deployment processes to facilitate customized development of applications in various scenarios. For example, in the detection field, we reproduce the YOLOX algorithm and integrate PAI-Blade model compression features (such as pruning and quantization), and the model can be exported in MNN format for on-device deployment. Please see the Model Compression and Quantization Tutorial for more information.
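The dotted parameters in the command-line training example (such as --model.num_classes 10) suggest that each flag addresses one field of a nested configuration. The following is a minimal sketch of how such overrides could be merged into a nested dict. It is not EasyCV's actual parsing code, and apply_overrides and parse_value are hypothetical helpers.

```python
import ast


def parse_value(text):
    """Turn '10' into 10 and leave plain strings such as paths untouched."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, argv):
    """Merge ['--a.b', 'value', ...] pairs into a nested dict (hypothetical)."""
    for flag, raw in zip(argv[0::2], argv[1::2]):
        keys = flag.lstrip('-').split('.')
        node = config
        for key in keys[:-1]:
            # Create intermediate levels on demand.
            node = node.setdefault(key, {})
        node[keys[-1]] = parse_value(raw)
    return config


argv = [
    '--model.num_classes', '10',
    '--data.data_source.type', 'ClsSourceImageList',
    '--data.data_source.list', 'data/train.txt',
]
config = apply_overrides({'model': {'type': 'Classification'}}, argv)
print(config)
```

Existing keys are preserved and only the addressed leaves change, so a base configuration can be adjusted without editing the file.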
1. As shown on the right side of the technical architecture diagram, all modules support registration and automatic creation using Builder through configuration files, which enables modules to be flexibly combined and replaced through configuration. Let’s take the model and evaluator configurations as an example. Users can simply change the configuration file to switch between different backbones and different classification heads to adjust the model structure. In terms of evaluation, users can specify multiple datasets and use different evaluators for multi-metric evaluation.
model = dict(
    type='Classification',
    pretrained=None,
    backbone=dict(
        type='ResNet',
        depth=50,
        out_indices=[4],  # 0: conv-1, x: stage-x
        norm_cfg=dict(type='SyncBN')),
    head=dict(
        type='ClsHead', with_avg_pool=True, in_channels=2048,
        num_classes=1000))

eval_config = dict(initial=True, interval=1, gpu_collect=True)
eval_pipelines = [
    dict(
        mode='test',
        data=data['val1'],
        dist_eval=True,
        evaluators=[dict(type='ClsEvaluator', topk=(1, 5))],
    ),
    dict(
        mode='test',
        data=data['val2'],
        dist_eval=True,
        evaluators=[dict(type='RetrivalEvaluator', topk=(1, 5))],
    )
]
2. Based on the registration mechanism, users can write customized modules (such as neck, head, data pipeline, and evaluator), quickly register them into the framework, and create and call them through the specified type field in the configuration file.
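import torch.nn as nn

# NECKS is EasyCV's neck registry; this import path is an assumption.
from easycv.models.registry import NECKS

@NECKS.register_module()
class Projection(nn.Module):
    """Customized neck."""

    def __init__(self, input_size, output_size):
        # nn.Module must be initialized before submodules are assigned.
        super().__init__()
        self.proj = nn.Linear(input_size, output_size)

    def forward(self, input):
        return self.proj(input)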
The configuration file is listed below:
model = dict(
    type='Classification',
    backbone=dict(...),
    neck=dict(
        type='Projection',
        input_size=2048,
        output_size=512
    ),
    head=dict(
        type='ClsHead',
        embedding_size=512,
        num_classes=1000))
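The registration mechanism described above can be pictured with a small, framework-free sketch. This is a simplified illustration of the general registry-and-builder pattern, not EasyCV's source code: the Registry class and build_from_cfg function below are written from scratch, and Projection is a plain class standing in for the nn.Module version.

```python
class Registry:
    """Maps a class name to a class so configs can refer to modules by name."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        """Return a decorator that records the decorated class by its name."""
        def decorator(cls):
            if cls.__name__ in self._modules:
                raise KeyError(
                    f'{cls.__name__} is already registered in {self.name}')
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def get(self, key):
        return self._modules[key]


def build_from_cfg(cfg, registry):
    """Instantiate the class named by cfg['type'], passing other keys as kwargs."""
    args = dict(cfg)  # copy so the caller's config is not mutated
    cls = registry.get(args.pop('type'))
    return cls(**args)


NECKS = Registry('neck')


@NECKS.register_module()
class Projection:
    """Stand-in for the customized neck; it stores sizes instead of layers."""

    def __init__(self, input_size, output_size):
        self.input_size = input_size
        self.output_size = output_size


neck_cfg = dict(type='Projection', input_size=2048, output_size=512)
neck = build_from_cfg(neck_cfg, NECKS)
print(type(neck).__name__, neck.input_size, neck.output_size)
```

Because the configuration only carries a type string and keyword arguments, swapping one module for another means changing the config, not the training code.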
In terms of training, the framework supports multi-machine multi-GPU mode and supports using FP16 to accelerate training and evaluation.
In addition, the framework makes targeted optimizations for specific tasks. For example, since self-supervised pre-training reads a large number of small images, EasyCV packs the small files into TFRecord format and uses DALI to accelerate preprocessing on the GPU, which improves training performance. The following table compares training on DALI + TFRecord data with training on the original JPG images.
Model | DALI TFRecord (samples/s) | JPG (samples/s) | Performance Improvement | Remarks |
dino_deit_small_p16 | 492.3 | 204.8 | 140% | fp16 batch_size=32x8 |
moby_deit_small_p16 | 1312.8 | 1089.3 | 20.5% | fp16 batch_size=128x8 |
mocov2_resnet50 | 2164.9 | 1365.3 | 58.56% | fp16 batch_size=128x8 |
swav_resnet50 | 1024.0 | 853.3 | 20% | fp16 batch_size=128x8 |
Test hardware: 8 x V100 16 GB
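The benefit of packing many small images into larger files can be illustrated with a toy record format. The sketch below is not the TFRecord format (real TFRecord files also store checksums) and does not use DALI. It only shows the general idea of writing length-prefixed payloads into one sequential stream and reading them back. The function names and the fake image bytes are assumptions.

```python
import io
import struct


def write_records(stream, payloads):
    """Write each payload as an 8-byte little-endian length plus the bytes."""
    for payload in payloads:
        stream.write(struct.pack('<Q', len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads back in the order they were written."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError('truncated record header')
        (length,) = struct.unpack('<Q', header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError('truncated record payload')
        yield payload


# Stand-ins for encoded image files; a real pipeline would read JPG bytes.
images = [b'fake-jpeg-bytes-0', b'fake-jpeg-bytes-1', b'']
shard = io.BytesIO()
write_records(shard, images)
shard.seek(0)
restored = list(read_records(shard))
print(len(restored), 'records restored')
```

Reading one large shard sequentially avoids opening a separate file per image, which is typically where small-file datasets lose I/O throughput.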
As mentioned at the beginning, EasyCV supports more than ten BUs and more than 20 businesses in Alibaba Group. At the same time, it meets the needs of customers on the cloud to customize models and solve business problems through platform-based components.
For example, one BU uses 1 million images from its business image library for self-supervised pre-training. The BU then fine-tunes downstream tasks from the pre-trained model and achieves its best results, 1% higher than the baseline model. Several BUs use the self-supervised pre-trained models for feature extraction and use the resulting image features to match identical images, taking advantage of the properties of contrastive learning. We also offer a similar image matching solution on the public cloud.
EasyCV integrates data labeling, model training, and service deployment into a smooth, out-of-the-box user experience that covers algorithms for image classification, object detection, instance segmentation, semantic segmentation, and key point detection. Entry-level users on the public cloud can train a model by specifying data and adjusting parameters, then launch an online service through one-click deployment. For advanced developers, EasyCV provides a notebook development environment. Support for training scheduling on cloud-native clusters allows users to develop customized algorithms with the framework and fine-tune the preset pre-trained models.
In the future, we plan to publish a new release every month. The recent roadmap is listed below:
In addition, we will continue to invest in the following exploratory directions. We welcome feedback, suggestions for improvement, and technical discussion from various dimensions. At the same time, we look forward to the participation of colleagues interested in the construction of an open-source community.
The Address of Open-Source EasyCV: https://github.com/alibaba/EasyCV