FashionBERT: Multimodal Research in the Field of E-commerce
With the development of Web technology, the Internet contains a large amount of multimodal information (text, image, voice, video, etc.). Searching massive multimodal data for important information has long been a focus of academic research. The core of multimodal matching is text-image matching, a fundamental problem with applications in many fields, such as cross-modality information retrieval, image caption generation, visual question answering (VQA), and visual commonsense reasoning (VCR). However, current academic research concentrates on the general domain, and multimodal studies in the e-commerce domain are relatively few, even though e-commerce also needs multimodal matching models and offers many application scenarios. This paper focuses on image-text multimodal technology in the e-commerce field.
A Brief History of Multimodal Matching Research
The core of cross-modal research is how to match multimodal data, that is, how to map multimodal information into a unified representation space. Early research followed two main lines: Canonical Correlation Analysis (CCA) and Visual Semantic Embedding (VSE).
The CCA family of methods
These methods analyze the correlation between images and texts and then map both into the same space. This line of work is theoretically well-grounded, but its performance lags behind deep learning methods. Although a deep-learning-based variant (DCCA) appeared later, a gap remained compared with the subsequent VSE methods.
The VSE family of methods
These methods represent the image and the text as latent embeddings respectively, and then fit the two latent embeddings into the same space. The VSE line has spawned many extensions, such as SCAN and PFAN, which achieve good results in general-domain image-text matching.
With the application of pre-training and self-supervised techniques in CV and NLP, beginning in 2019 some researchers started using pre-trained BERT models to fit image and text information into the same space based on large-scale data. These methods have achieved good results in the general domain; for this line of work, please refer to the VLBERT paper.
The main process of BERT-based pre-trained image-text models is:
1) Use object detection to identify the Regions of Interest (RoIs) in the image.
2) Treat each RoI as an image token and fuse it with the text tokens in BERT. There are two schemes:
Single-stream: represented by VLBERT; image tokens and text tokens are directly fed into BERT for multimodal fusion.
Cross-stream: represented by ViLBERT; image tokens and text tokens first interact with each other and are then fed into BERT.
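The single-stream scheme can be sketched very simply: the image region embeddings are concatenated onto the text token embeddings and the combined sequence is processed by one Transformer. This is an illustrative sketch (random embeddings standing in for real BERT/detector outputs), not VLBERT's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def single_stream_input(text_emb, image_emb):
    """Single-stream fusion: concatenate text token embeddings and image
    region embeddings into ONE sequence; a single Transformer then
    attends over both modalities jointly."""
    return np.concatenate([text_emb, image_emb], axis=0)

text_emb = rng.normal(size=(12, 768))   # 12 text tokens (hidden size 768)
image_emb = rng.normal(size=(6, 768))   # 6 detected RoIs projected to 768-d
fused = single_stream_input(text_emb, image_emb)
print(fused.shape)  # (18, 768)
```

A cross-stream model, by contrast, would keep two separate sequences and let them exchange information through co-attention before (or instead of) this concatenation.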
We tried the ViLBERT method and found that it works well in the general domain. In the e-commerce field, however, because the extracted RoIs are unsatisfactory, the results fall short of expectations. The main reasons are:
1) E-commerce images yield too few RoIs
E-commerce images usually show a single product against a simple background, so few RoIs can be extracted, as shown in Figure 1(c). Statistically, 19.8 RoIs can be extracted per image from the general-domain MsCoCo data, but only 6.4 per e-commerce image. We can of course force a minimum number of RoIs to be extracted: ViLBERT requires 10~36 RoIs, and VLBERT requires 100. However, after setting such a minimum, too many duplicate RoIs are extracted, as shown in Figure 1(e).
2) E-commerce RoIs are not fine-grained enough
E-commerce images are simple, and the extracted RoIs are mainly object-level products (for example, whole dresses or T-shirts). Compared with the text, they are not fine-grained enough: the text can describe very detailed attributes of the item (such as round neck or cropped pants), so the image RoIs cannot match the text tokens well. Compare Figure 1(c) and Figure 1(d) from the e-commerce field against Figure 1(a) and Figure 1(b) from the general field: the general field is simpler, and as long as the subject in the image can be aligned with the text tokens, the result is usually acceptable.
3) E-commerce image RoIs contain too much noise
The model's head, hair, and fingers extracted in Figure 1(f) are of little use for product matching.
This explains why directly adopting the existing RoI-based methods in the e-commerce field cannot yield ideal results. Retraining an RoI extraction model specifically for e-commerce would require a great deal of data labeling. So is there a simple and practical way to perform image-text matching?
FashionBERT image-text matching model
In this paper, we propose the FashionBERT image-text matching model. The core problem is how to extract or represent image features in the e-commerce field. In mid-2019 Google published SELFIE, an image self-supervised learning model whose main idea is to divide an image into sub-images and predict their positions, so that the model learns to understand image features; this work inspired us greatly. We directly split the image into patches of the same size and then use each patch as an image token to fit with the text, as shown in Figure 2. Benefits of using patches:
Image patches retain all the detailed information of the image.
Image patches contain no duplicate RoIs and no useless RoIs.
Image patches naturally carry an order, which solves BERT's requirement for a sequence.
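The patch-splitting step above can be sketched with a plain reshape; this is a minimal illustration (a synthetic 224x224 image, grid size assumed), not the production pipeline:

```python
import numpy as np

def split_into_patches(image, grid=8):
    """Split an H x W x C image into a grid x grid sequence of equal-size
    patches, kept in row-major order (top-left to bottom-right), so the
    patch sequence has a natural order for BERT."""
    h, w, c = image.shape
    ph, pw = h // grid, w // grid
    patches = (image
               .reshape(grid, ph, grid, pw, c)   # group rows/cols into cells
               .transpose(0, 2, 1, 3, 4)         # (grid_row, grid_col, ph, pw, c)
               .reshape(grid * grid, ph, pw, c)) # flatten to a sequence
    return patches

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = split_into_patches(image)
print(patches.shape)  # (64, 28, 28, 3)
```

Each of the 64 patches would then be passed through an image backbone to obtain its feature vector.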
The overall structure of FashionBERT is shown in Figure 2, which mainly includes Text Embedding, Patch Embedding, Cross-modality FashionBERT, and Pretrain Tasks.
Text Embedding: Like the original BERT, the sentence is first split into tokens, and we then use Whole Word Masking to mask entire words. The masking strategy is consistent with the original BERT.
Patch Embedding: Similar to Text Embedding, we divide the image evenly into 8*8 patches, and for each patch we extract a 2048-dimensional image feature with ResNet. For the patch masking strategy, we randomly mask 10% of the patches and replace the masked patches with zeros. In the Segment field, we use "T" and "I" to distinguish text token input from image patch input.
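The patch masking strategy can be sketched as follows; the feature matrix here is random stand-in data, and the exact mask selection in FashionBERT may differ:

```python
import numpy as np

rng = np.random.default_rng(42)

def mask_patch_features(features, mask_ratio=0.1):
    """Randomly mask a fraction of patch feature vectors by replacing
    them with zeros; returns masked features plus the masked indices
    (needed later as reconstruction targets)."""
    n = features.shape[0]
    n_mask = max(1, int(n * mask_ratio))
    idx = rng.choice(n, size=n_mask, replace=False)
    masked = features.copy()
    masked[idx] = 0.0
    return masked, idx

features = rng.normal(size=(64, 2048))   # 64 patches x 2048-d ResNet features
masked, idx = mask_patch_features(features)

# Segment ids distinguish the two modalities in the joint sequence.
segments = ["T"] * 12 + ["I"] * 64       # e.g. 12 text tokens + 64 patches
print(len(idx), float(masked[idx].sum()))
```

The unmasked patch rows are left untouched, so the model must reconstruct only the zeroed positions.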
Cross-modality FashionBERT: A pre-trained BERT is used as the backbone network, so a language model is naturally included in FashionBERT, and the model can focus more on image-text matching and fusion.
The FashionBERT model includes three tasks in the pretrain stage:
1) Masked Language Modeling (MLM)
Predict the masked text tokens. The training setup and parameters of this task are kept consistent with the original BERT.
2) Masked Patch Modeling (MPM)
Predict the masked patches. This task is similar to MLM, but since image patches have no discrete token ids, we use the patch features themselves as the target, hoping that BERT can reconstruct the patch information; we choose KL divergence (KLD) as the loss function.
3) Text and Image Alignment
Similar to BERT's Next Sentence Prediction task, this task predicts whether the text and the image match. Positive samples are a product's title paired with its own image; negative samples pair the title with a randomly sampled image from another product in the same category.
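The same-category negative sampling can be sketched like this; the catalog entries are hypothetical examples, not real data:

```python
import random

random.seed(0)

# Hypothetical catalog rows: (product_id, category, title, image_path)
catalog = [
    ("p1", "dress",  "red round-neck dress",   "img/p1.jpg"),
    ("p2", "dress",  "blue sleeveless dress",  "img/p2.jpg"),
    ("p3", "dress",  "floral midi dress",      "img/p3.jpg"),
    ("p4", "tshirt", "white cotton t-shirt",   "img/p4.jpg"),
]

def make_alignment_pair(product, catalog):
    """Build one positive pair (own title + own image, label 1) and one
    negative pair (own title + an image sampled from ANOTHER product in
    the SAME category, label 0). Same-category negatives are harder than
    fully random ones, which strengthens the alignment task."""
    pid, cat, title, image = product
    same_cat = [p for p in catalog if p[1] == cat and p[0] != pid]
    neg_image = random.choice(same_cat)[3]
    return (title, image, 1), (title, neg_image, 0)

pos, neg = make_alignment_pair(catalog[0], catalog)
print(pos, neg)
```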
This is a multi-task learning problem: how should the learning weights of these tasks be balanced? Moreover, many experiments suggest that the NSP task in BERT is not always effective and its impact on the final result is unclear, whereas for image-text matching the Text and Image Alignment loss is crucial. So how do we balance the learning of these tasks? We propose an adaptive loss algorithm that treats the task weights as a new optimization problem, as shown in Figure 3. The loss of FashionBERT is the weighted sum of the task losses. Since there are only three tasks, we can in fact directly obtain the analytical solution of the task weights W (for the detailed derivation, please refer to our paper).
The learning process of W can be viewed as a student studying three subjects, with W controlling how much attention each subject receives. For the specific adaptive loss algorithm, please refer to the paper. In practice, W shifts its focus across tasks as training iterates, achieving the goal of balancing them.
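To make the idea of "solving for the weights analytically" concrete, here is a toy formulation, not FashionBERT's exact objective: minimize the weighted loss plus an L2 regularizer on the weights, subject to the weights summing to one. Lagrange multipliers give a closed-form solution:

```python
import numpy as np

def adaptive_weights(losses, lam=1.0):
    """Toy analytical task-weighting: minimize
        sum_i w_i * L_i + lam * sum_i w_i**2   s.t.  sum_i w_i = 1.
    Lagrange multipliers give  w_i = 1/n + (mean(L) - L_i) / (2 * lam).
    Illustrative only -- the actual FashionBERT formulation is in the paper."""
    losses = np.asarray(losses, dtype=np.float64)
    n = losses.size
    w = 1.0 / n + (losses.mean() - losses) / (2.0 * lam)
    w = np.clip(w, 0.0, None)   # keep weights on the simplex
    return w / w.sum()

# Made-up current losses for MLM, MPM, and alignment.
w = adaptive_weights([0.9, 0.4, 1.3])
print(w, float(w.sum()))
```

As the three losses change across training iterations, the solved weights shift accordingly, which mirrors the "student balancing attention across subjects" intuition above.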
At present, FashionBERT has begun to be applied in Alibaba's search multimodal vector retrieval. There, the matching task can be viewed as a text-text-image matching task, i.e., a triple matching relationship of User Query (text) - Product Title (text) - Product Image (image). Building on FashionBERT as the base image-text matching model, we performed continued pre-training and added Query, Title, and Image segment distinctions, as shown in Figure 4. The biggest difference from FashionBERT is that we introduce three segment types: "Q", "T", and "I", representing Query, Title, and Image respectively.
The model after continued pre-training can quickly achieve very good results with very little fine-tuning data. Our current vector retrieval model is shown in Figure 5:
In the figure above, we use a two-tower model with parameters shared between the towers, which facilitates online query vector generation and offline product vector generation. On the query side, we use co-occurring queries to enrich the query's feature representation; on the product side, we use extended information to enrich the product's semantic representation.
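The shared-parameter two-tower retrieval setup can be sketched as below; the projection layer and dimensions are illustrative stand-ins for the real encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 128)) / np.sqrt(768)   # SHARED projection weights

def encode(pooled, W):
    """One tower: project a pooled 768-d representation into a 128-d
    retrieval space and L2-normalize. Because both towers share W, query
    vectors (computed online) and product vectors (computed offline)
    live in the same space and can be compared by dot product."""
    v = pooled @ W
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query_repr = rng.normal(size=(1, 768))     # query-side pooled features
product_repr = rng.normal(size=(5, 768))   # 5 candidate products

q = encode(query_repr, W)
p = encode(product_repr, W)
scores = (q @ p.T).ravel()                 # cosine similarity of unit vectors
print(scores.shape, int(scores.argmax()))
```

Decoupling the towers like this is what allows product vectors to be pre-computed and indexed offline while only the query vector is computed at serving time.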
We use the FashionGen dataset to compare against mainstream image-text matching techniques as well as the latest ViLBERT and VLBERT. The results on image-text matching and cross-modality retrieval are as follows; FashionBERT achieves a very significant improvement.
Results on ICBU data
Compared with the BERT model, the improvement is also very obvious. Meanwhile, due to online inference performance constraints, we reduced the fine-tuned model: we use only the first two layers of FashionBERT and introduce a cache together with a dynamic Variable Sequence Length (VSL) strategy, which greatly improves FashionBERT's online serving performance, as shown in the table below.
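The VSL idea can be sketched in a few lines: pad each batch only to the longest sequence in that batch rather than to the model's global maximum. This is an illustrative sketch with made-up token ids:

```python
import numpy as np

PAD = 0
MAX_LEN = 128   # the model's global maximum sequence length

def pad_batch_vsl(token_id_lists):
    """Variable Sequence Length batching: pad every sequence only to the
    longest sequence in THIS batch (capped at MAX_LEN) instead of the
    global maximum, so the Transformer spends no compute on padding
    positions that no sequence in the batch actually uses."""
    batch_len = min(MAX_LEN, max(len(seq) for seq in token_id_lists))
    batch = np.full((len(token_id_lists), batch_len), PAD, dtype=np.int64)
    for i, seq in enumerate(token_id_lists):
        seq = seq[:batch_len]
        batch[i, :len(seq)] = seq
    return batch

batch = pad_batch_vsl([[5, 6, 7], [8, 9], [10]])
print(batch.shape)  # (3, 3) instead of (3, 128)
```

Since queries and titles are typically short, shrinking the padded length directly reduces attention computation, which is quadratic in sequence length.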
The paper has been accepted by the SIGIR 2020 Industry Track, a top international conference in the field of information retrieval. See the preprint here: https://arxiv.org/abs/2005.09801. Interested readers can find a more detailed comparison in the paper.
Although image-text matching has a long research history, methods based on pre-trained BERT are still in the ascendant. In the future, we plan to optimize further in four directions:
Multi-scale transformation of images: apply multi-scale transformations to images to obtain fine-grained image features at different scales.
Text & Image Alignment: Introduce other information or other methods to align text tokens and image regions during pre-training.
Introduction of industry knowledge: Introduce industry knowledge and learn image-text matching models in different industries.
Video understanding: perform multimodal understanding over text, image, and video.
We believe that, based on BERT's powerful fitting ability, the matching and fusion of multimodal information will become more and more intelligent.
Knowledge Base Team