EasyCV | Out-of-the-Box Visual Self-Supervision + Transformer Algorithm Library

This article introduces EasyCV and explains how it works, its features, and its application scenarios.

By Qianyan and Linzai

1. Introduction

In recent years, self-supervised learning and Transformers have taken off in computer vision. Self-supervised pre-training on images greatly reduces the labeling workload of image tasks and saves substantial labor costs. The success of Transformers in NLP also leaves considerable room for further improving CV models. To promote the adoption of self-supervised learning and Vision Transformers in Alibaba Group and on Alibaba Cloud, the Alibaba Cloud Platform of Artificial Intelligence (PAI) has developed EasyCV, an all-in-one visual modeling tool. EasyCV builds a rich and comprehensive self-supervised algorithm system and provides SOTA pre-trained Vision Transformer models. Its model zoo covers self-supervised training for images, image classification, metric learning, object detection, and key point detection. EasyCV also provides out-of-the-box training and inference capabilities for developers, with in-depth optimization of training and inference efficiency. In addition, EasyCV is compatible with the Alibaba AI system, so users can use all of its features in the Alibaba Cloud environment.

Now that EasyCV has been refined through use in Alibaba's internal businesses, we hope to bring the framework to the community to serve the many CV algorithm developers and enthusiasts, so they can try the latest self-supervised learning and Transformer technology quickly and conveniently and apply it in production.

How is the algorithm framework of EasyCV designed? What features can developers use? What are the future plans? Let's take a closer look at EasyCV.

2. What Is EasyCV?

EasyCV is Alibaba's open-source, PyTorch-based, all-in-one visual algorithm modeling tool with self-supervised learning and Transformer technology at its core. EasyCV supports multiple business units (BUs) in Alibaba Group (such as Search, Taobao, Youku, and Fliggy) and serves several enterprises on Alibaba Cloud. It meets customers' needs for customizing models and solving business problems through platform-based components.

The Address of Open-Source EasyCV: https://github.com/alibaba/EasyCV

2.1 Project Background


Self-supervised pre-training for images, which uses unlabeled training data, has developed rapidly over the past two years. Its results on various visual tasks are now comparable to (or better than) those of supervised training, which requires a large amount of labeled data. At the same time, Transformer technology, a great success in NLP, has pushed SOTA results further on various image tasks, and its application has surged. Self-supervised pre-training of Vision Transformers emerged as a combination of the two.

Self-supervised learning and Vision Transformer algorithms are updated and iterated quickly, which causes CV algorithm developers many problems (such as scattered open-source code and inconsistent implementation methods and styles). These problems lead to high learning and reproduction costs and poor training and inference performance. With EasyCV, a flexible and easy-to-use algorithm framework, Alibaba Cloud PAI systematically brings together SOTA self-supervised algorithms and pre-trained Transformer models, provides unified, simple, and easy-to-use interfaces, and optimizes performance for self-supervised training on large datasets. This makes it easy for users to try the latest self-supervised pre-training technology and Transformer models and promotes their application in business.

In addition, building on its deep learning training and inference acceleration technology, the PAI team integrates features such as I/O optimization, model training acceleration, quantization, and pruning into EasyCV, which gives it a performance advantage. Based on the Alibaba Cloud PAI product ecosystem, users can easily perform model management, online service deployment, and large-scale offline inference.

2.2 Main Features

  • Rich and Complete Self-Supervised Algorithm System: EasyCV provides the industry's representative image self-supervised algorithms (such as SimCLR, MoCo, SwAV, MoBY, and DINO) and the MAE algorithm, which is based on masked image modeling. It also provides detailed benchmarking tools and reproducible results.
  • Rich Pre-Trained Model Library: EasyCV provides a wide range of pre-trained models. In addition to Transformer models, mainstream CNN models are provided, with both ImageNet supervised pre-training and self-supervised pre-training supported. EasyCV is compatible with PyTorch Image Models (timm), which gives access to a wider range of Vision Transformer backbones.
  • Usability and Extensibility: EasyCV supports training, evaluation, and model export through configuration files and API calls. Its framework adopts a mainstream modular design and is flexible and extensible.
  • High Performance: EasyCV supports multi-machine, multi-GPU training and evaluation and FP16 training acceleration. Because self-supervised scenarios involve large amounts of data, DALI and TFRecord files are used to accelerate I/O. EasyCV can also integrate with Alibaba Cloud PAI for training acceleration and model inference optimization.

3. Major Technical Features

3.1 Technical Architecture

EasyCV Architecture Diagram

The underlying engine of EasyCV is based on PyTorch and is connected to the PyTorch training accelerator for training acceleration. The algorithm framework is divided into the following layers:

  • Framework Layer: The framework layer reuses the openmmlab/mmcv interface, which is widely used in the open-source field. A Trainer controls the main training process, and custom Hooks control the learning rate, log printing, gradient updates, model saving, and evaluation. Distributed training and evaluation are also supported. Evaluators provide evaluation metrics for different tasks and support evaluation on multiple datasets, saving the best checkpoint, and user-defined evaluation metrics. Visualization covers prediction results and input images.
  • Data Layer: This layer abstracts different data sources, supports multiple open-source datasets (such as CIFAR, ImageNet, and COCO), and supports image data in raw and TFRecord formats. For TFRecord data, DALI is used to accelerate data processing. For raw data, a caching mechanism is used to accelerate data reading. Data preprocessing (data augmentation) is abstracted into several independent pipelines, and different preprocessing flows can be flexibly configured through configuration files.
  • Model Layer: The model layer is divided into modules and algorithms. The modules provide basic backbones, commonly used losses, necks, and heads for various downstream tasks. The model zoo provides algorithms for self-supervised learning, image classification, metric learning, object detection, and key point detection. It will continue to expand and support more high-level algorithms in the future.
  • Inference: EasyCV provides end-to-end inference APIs and supports PAI-Blade inference optimization as well as online and offline inference on the cloud.
  • API Layer: This layer provides a unified API for training, evaluation, model export, and prediction.
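The Trainer and Hook split described in the framework layer can be pictured with a small sketch. This is not EasyCV or mmcv code: the class names Hook, StepLrHook, LogHook, and Trainer and their methods are hypothetical, chosen only to illustrate how a training loop might delegate learning-rate control and logging to hooks.

```python
class Hook:
    """Base class: every callback is a no-op unless a subclass overrides it."""

    def before_epoch(self, trainer):
        pass

    def after_iter(self, trainer):
        pass


class StepLrHook(Hook):
    """Multiply the learning rate by `gamma` every `step` epochs."""

    def __init__(self, step, gamma):
        self.step = step
        self.gamma = gamma

    def before_epoch(self, trainer):
        if trainer.epoch > 0 and trainer.epoch % self.step == 0:
            trainer.lr *= self.gamma


class LogHook(Hook):
    """Record (epoch, iteration, lr) after every iteration."""

    def __init__(self):
        self.records = []

    def after_iter(self, trainer):
        self.records.append((trainer.epoch, trainer.iteration, trainer.lr))


class Trainer:
    """Owns the loop; hooks own what happens around each step."""

    def __init__(self, lr, hooks):
        self.lr = lr
        self.hooks = hooks
        self.epoch = 0
        self.iteration = 0

    def _call(self, stage):
        # Invoke the same callback on every hook, in registration order.
        for hook in self.hooks:
            getattr(hook, stage)(self)

    def run(self, num_epochs, iters_per_epoch):
        for epoch in range(num_epochs):
            self.epoch = epoch
            self._call("before_epoch")
            for _ in range(iters_per_epoch):
                self.iteration += 1  # a real trainer would run forward/backward here
                self._call("after_iter")


log = LogHook()
trainer = Trainer(lr=0.1, hooks=[StepLrHook(step=2, gamma=0.5), log])
trainer.run(num_epochs=4, iters_per_epoch=1)
```

The point of the pattern is that adding a new behavior (for example, checkpoint saving) means adding a hook rather than editing the loop itself.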

EasyCV supports running and debugging in the local environment. At the same time, if users want to execute large-scale production tasks, EasyCV supports easy deployment in Alibaba Cloud PAI products.

3.2 Complete Self-Supervised Algorithm System

Self-supervised learning does not require data labeling. Thanks to the introduction of contrastive learning, its results have gradually become comparable to those of supervised learning, which has made it a focus of both academia and industry in recent years. EasyCV provides mainstream self-supervised algorithms based on contrastive learning (such as SimCLR, MoCo v1/v2, SwAV, MoBY, and DINO) and also reproduces the MAE algorithm, which is based on masked image modeling. In addition, we provide comprehensive benchmarking tools to evaluate self-supervised pre-trained models on ImageNet.
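To make the idea of contrastive learning concrete, here is a minimal sketch of one widely used contrastive objective, the InfoNCE loss (the form used by MoCo). It is written in plain Python for a single query and is not taken from EasyCV; the function names and the temperature value of 0.2 are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.2):
    """InfoNCE loss for one query: -log softmax of the positive among all keys."""
    q = normalize(query)
    keys = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, k) / temperature for k in keys]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# A positive that points the same way as the query gives a small loss;
# a positive that is orthogonal to the query gives a larger one.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.1], [-1.0, 0.0]])
```

Minimizing this loss pulls two augmented views of the same image together in feature space and pushes views of different images apart, which is why no labels are needed.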


Based on systematic self-supervised algorithms and benchmarking tools, users can easily improve models, compare effects, and innovate models. Users can also train better pre-training models for their businesses based on a large amount of unlabeled data.

The following table shows the pre-training speed of existing self-supervised algorithms based on ImageNet data and the effect of linear eval/finetune on the ImageNet validation set.

Model                 DALI + TFRecord (samples/s)   JPG (samples/s)   Performance Improvement   Remarks
dino_deit_small_p16   492.3                         204.8             140%                      fp16, batch_size=32x8
moby_deit_small_p16   1312.8                        1089.3            20.5%                     fp16, batch_size=128x8
mocov2_resnet50       2164.9                        1365.3            58.56%                    fp16, batch_size=128x8
swav_resnet50         1024.0                        853.3             20%                       fp16, batch_size=128x8

3.3 Rich Pre-Training Model Library

As a backbone network, a CNN works with the heads of various downstream tasks and is a commonly used structure for CV models. EasyCV provides a variety of traditional CNN structures, including ResNet, ResNeXt, HRNet, Darknet, Inception, MobileNet, GENet, and MnasNet. With the development of Vision Transformers, Transformers have replaced CNNs in a growing number of fields and have become backbone networks with stronger expressive power. In addition to implementing the commonly used ViT and Swin Transformer, the framework integrates PyTorch Image Models (timm) to support a more comprehensive set of Transformer structures.

Combined with the self-supervised algorithms, all models support both self-supervised pre-training and supervised training on ImageNet, which gives users a variety of pre-trained backbones. Users can configure and use them in the downstream tasks preset by the framework or connect them to custom downstream tasks.


3.4 Usability

1.  The framework provides parameterized command-line methods and a Python API for training, evaluation, and model export, along with a complete prediction interface for end-to-end inference.

# Configuration file method
python tools/train.py configs/classification/cifar10/r50.py --work_dir work_dirs/classification/cifar10/r50 --fp16

# Simple method for passing parameters
python tools/train.py --model_type Classification --model.num_classes 10 --data.data_source.type ClsSourceImageList --data.data_source.list data/train.txt
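The dotted flags in the second command (such as --model.num_classes) suggest that command-line values are merged into a nested configuration. The sketch below shows one way such overrides could be parsed. The helpers parse_value and apply_overrides are hypothetical, written only for illustration; they are not EasyCV functions, and the real tool may behave differently.

```python
import ast


def parse_value(text):
    """Interpret '10' as int and '0.1' as float; leave other strings as-is."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, argv):
    """Merge ['--a.b', 'v', ...] pairs into a nested dict, creating levels as needed."""
    for flag, raw in zip(argv[0::2], argv[1::2]):
        keys = flag.lstrip("-").split(".")
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # descend, creating missing levels
        node[keys[-1]] = parse_value(raw)
    return config


config = apply_overrides(
    {"model": {"num_classes": 1000}},  # stand-in for defaults from a config file
    [
        "--model_type", "Classification",
        "--model.num_classes", "10",
        "--data.data_source.type", "ClsSourceImageList",
        "--data.data_source.list", "data/train.txt",
    ],
)
```

With this approach a single flag can change one deeply nested value without the user having to write a full configuration file.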

API Mode

import easycv.tools

config_path = 'configs/classification/cifar10/r50.py'
easycv.tools.train(config_path, gpus=8, fp16=False, master_port=29527)

Inference Example

import cv2
from easycv.predictors.classifier import TorchClassifier

output_ckpt = 'work_dirs/classification/cifar10/r50/epoch_350_export.pth'
tcls = TorchClassifier(output_ckpt)

img = cv2.imread('aeroplane_s_000004.png')
# cv2 loads images in BGR order; the classifier expects RGB
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
output = tcls.predict([img])

2.  The framework focuses on high-level visual tasks, covering the three major tasks of classification, detection, and segmentation. Drawing on application scenarios (such as content risk control, smart retail, intelligent monitoring, image matching, product category prediction, product detection, product attribute recognition, and industrial quality inspection), on Alibaba's internal business practices, and on the experience of serving external customers of Alibaba Cloud, the framework selects and reproduces SOTA algorithms, provides pre-trained models, and integrates the training, inference, and on-device deployment processes to facilitate customized development for various scenarios. For example, in the detection field, we reproduce the YOLOX algorithm and integrate PAI-Blade model compression features (such as pruning and quantization), and an MNN model can be exported for on-device deployment. Please see the Model Compression and Quantization Tutorial for more information.

3.5 Extensibility

1.  As shown on the right side of the technical architecture diagram, all modules are registered with the framework and created automatically by a Builder from configuration files, so modules can be flexibly combined and replaced through configuration. Take the model and evaluator configurations as an example. Users can switch between different backbones and classification heads to adjust the model structure simply by changing the configuration file. For evaluation, users can specify multiple datasets and use a different evaluator on each for multi-metric evaluation.

model = dict(
    type='Classification',
    pretrained=None,
    backbone=dict(
        type='ResNet',
        depth=50,
        out_indices=[4],  # 0: conv-1, x: stage-x
        norm_cfg=dict(type='SyncBN')),
    head=dict(
        type='ClsHead', with_avg_pool=True, in_channels=2048,
        num_classes=1000))

eval_config = dict(initial=True, interval=1, gpu_collect=True)
eval_pipelines = [
    dict(
        mode='test',
        data=data['val1'],
        dist_eval=True,
        evaluators=[dict(type='ClsEvaluator', topk=(1, 5))],
    ),
    dict(
        mode='test',
        data=data['val2'],
        dist_eval=True,
        evaluators=[dict(type='RetrivalEvaluator', topk=(1, 5))],
    )
]
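The topk=(1, 5) setting on the classification evaluator refers to top-k accuracy: a prediction counts as correct if the true label is among the k highest-scoring classes. The following sketch shows how that metric can be computed in plain Python. The function topk_accuracy is a hypothetical helper for illustration and is not the ClsEvaluator implementation.

```python
def topk_accuracy(scores, labels, topk=(1, 5)):
    """Return {k: fraction of samples whose label is among the k highest scores}."""
    hits = {k: 0 for k in topk}
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        for k in topk:
            if label in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(labels) for k in topk}


# Four samples, three classes; each row holds per-class scores.
scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
    [0.6, 0.3, 0.1],  # highest score: class 0
]
labels = [1, 1, 2, 2]
result = topk_accuracy(scores, labels, topk=(1, 2))
```

Here two of the four samples are correct at top-1, and a third becomes correct once the second-ranked class is also accepted.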

2.  Based on the registration mechanism, users can write customized modules (such as neck, head, data pipeline, and evaluator), quickly register them into the framework, and create and call them through the specified type field in the configuration file.

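import torch.nn as nn

# Register this class with the framework's neck registry before use
# (the registration line is not shown here).
class Projection(nn.Module):
    """Customized neck."""

    def __init__(self, input_size, output_size):
        super().__init__()  # required so nn.Module can track submodules
        self.proj = nn.Linear(input_size, output_size)

    def forward(self, input):
        return self.proj(input)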

The configuration file is listed below:

model = dict(
    type='Classification',
    backbone=dict(...),
    neck=dict(
        type='Projection',
        input_size=2048,
        output_size=512
    ),
    head=dict(
        type='ClsHead',
        embedding_size=512,
        num_classes=1000)
)
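The registration mechanism can be illustrated with a small, self-contained sketch of the registry pattern: a decorator records each class under its name, and a builder looks up the type field of a configuration and instantiates the class with the remaining keys. The Registry class and the NECKS instance below are hypothetical stand-ins written for this example; they are not EasyCV's own registry API, and the Projection class here is a plain Python placeholder rather than an nn.Module.

```python
class Registry:
    """Maps a string name to a class so configs can refer to classes by name."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register(self, cls):
        """Class decorator that records `cls` under its own name."""
        if cls.__name__ in self._modules:
            raise KeyError(f"{cls.__name__} is already registered in {self.name}")
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        """Instantiate the class named by cfg['type'] with the remaining keys."""
        args = dict(cfg)  # copy so the caller's config is left untouched
        cls = self._modules[args.pop("type")]
        return cls(**args)


NECKS = Registry("neck")  # hypothetical registry instance for this sketch


@NECKS.register
class Projection:
    """Placeholder neck that only stores its sizes."""

    def __init__(self, input_size, output_size):
        self.input_size = input_size
        self.output_size = output_size


neck_cfg = dict(type="Projection", input_size=2048, output_size=512)
neck = NECKS.build(neck_cfg)
```

Because the configuration only carries a name and keyword arguments, swapping one neck for another is a one-line change to the type field.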

3.6 High Performance

In terms of training, the framework supports multi-machine multi-GPU mode and supports using FP16 to accelerate training and evaluation.

In addition, the framework makes targeted optimizations for specific tasks. For example, since self-supervised pre-training requires a large number of small images, EasyCV packs the small files into TFRecord format and uses DALI to run preprocessing on the GPU, which improves training performance. The following table compares the performance of DALI + TFRecord training with training on the original images.

Model                 DALI + TFRecord (samples/s)   JPG (samples/s)   Performance Improvement   Remarks
dino_deit_small_p16   492.3                         204.8             140%                      fp16, batch_size=32x8
moby_deit_small_p16   1312.8                        1089.3            20.5%                     fp16, batch_size=128x8
mocov2_resnet50       2164.9                        1365.3            58.56%                    fp16, batch_size=128x8
swav_resnet50         1024.0                        853.3             20%                       fp16, batch_size=128x8

Test hardware: 8 × V100 (16 GB)
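The Performance Improvement column can be checked from the two throughput columns, assuming it is the relative gain of DALI + TFRecord over JPG, that is, (accelerated / baseline - 1) × 100. The short script below recomputes it from the figures in the table; the function name improvement_percent is chosen for this example.

```python
def improvement_percent(accelerated, baseline):
    """Relative throughput gain of `accelerated` over `baseline`, in percent."""
    return (accelerated / baseline - 1.0) * 100.0


# (DALI + TFRecord samples/s, JPG samples/s) copied from the table above.
throughput = {
    "dino_deit_small_p16": (492.3, 204.8),
    "moby_deit_small_p16": (1312.8, 1089.3),
    "mocov2_resnet50": (2164.9, 1365.3),
    "swav_resnet50": (1024.0, 853.3),
}

gains = {
    name: improvement_percent(fast, slow)
    for name, (fast, slow) in throughput.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
```

The recomputed values agree with the table to within rounding, which supports reading the column as a relative gain.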

4. Application Scenarios

As mentioned at the beginning, EasyCV supports more than ten BUs and more than 20 businesses in Alibaba Group. At the same time, it meets the needs of customers on the cloud to customize models and solve business problems through platform-based components.

For example, one BU used 1 million images from its business image library for self-supervised pre-training and then fine-tuned downstream tasks on the pre-trained model, achieving its best result, 1% higher than the baseline model. Several BUs use self-supervised pre-trained models for feature extraction and, taking advantage of the properties of contrastive learning, use the image features to match identical images. We have also launched a similar-image matching solution on the public cloud.

EasyCV creates a smooth, out-of-the-box user experience by integrating data labeling, model training, and service deployment. It covers algorithms for image classification, object detection, instance segmentation, semantic segmentation, and key point detection. Entry-level users on the public cloud can complete model training by specifying data and adjusting parameters, then launch an online service through one-click deployment. For advanced developers, EasyCV provides a notebook development environment and supports training scheduling on cloud-native clusters, so users can develop customized algorithms with the framework and fine-tune the preset pre-trained models.

  • A customer on the public cloud uses the object detection component to train a customized model that automatically judges whether installation work done by its workers meets the required standard.
  • A customer in the recommendation domain uses the self-supervised training component and a large number of unlabeled advertisement images to train an image representation model and then feeds the image features into its recommendation model. Combined with optimization of the recommendation model, the CTR is improved by more than 10%.
  • A panel developer uses EasyCV to customize a defect detection model and completes cloud training, on-device deployment, and inference.

5. Roadmap

We plan to publish a new release every month. The near-term roadmap is listed below:

  • Optimize the training performance of Transformer classification tasks and publish benchmarks
  • Add detection and segmentation benchmarks for self-supervised learning
  • Develop more Transformer-based downstream tasks, such as detection and segmentation
  • Support downloading common image task datasets and provide training and access interfaces for them
  • Integrate model inference optimization features
  • Support on-device model deployment in more fields

In addition, we will continue to invest in the following exploratory directions. We welcome feedback, suggestions for improvement, and technical discussion of all kinds, and we look forward to the participation of anyone interested in building the open-source community.

  • Combine self-supervised learning with Transformers to explore more efficient pre-trained models
  • Build lightweight Transformers, based on the joint optimization of training and inference, to make Transformers practical in real business scenarios
  • Explore the application of a unified Transformer to high-level visual multitasking based on multimodal pre-training

The Address of Open-Source EasyCV: https://github.com/alibaba/EasyCV
