Over the past 10 years, AI technology has developed rapidly. However, in Visual Question Answering (VQA), a high-level cognitive task that requires joint understanding of images and text, AI had long failed to surpass human-level performance.

The VQA Challenge, set up to tackle this problem, has been held at ICCV and CVPR since 2015. It has produced the largest and most widely recognized VQA dataset in the world, containing more than 200,000 real photos and 1.1 million test questions.

In the first VQA Challenge, the best AI accuracy was only 55%. In August this year, DAMO Academy set a global record on the VQA Leaderboard with an accuracy of 81.26%, surpassing the human benchmark of 80.83% for the first time.

This is the first time AI has exceeded human-level performance since the VQA Challenge began, a symbolic breakthrough.

Development of VQA technology since 2015

01 What is VQA?

Integrating natural language processing and computer vision is an important frontier research direction in the multi-modal field. Within it, VQA is one of the most difficult challenges in AI and is of great significance to the development of general AI.

The task of VQA is to generate a correct natural language answer given an image and a natural language question about it.

For example, in the figure below, the AI first extracts the key information in the question, the action figures, and then answers using common-sense knowledge: Star Wars.


What movie franchise are the action figures from?

To complete the VQA Challenge, AI must extract question-relevant information from the image, ranging from the detection of fine-grained objects to reasoning about abstract scenes, and then answer by combining this visual understanding with language understanding and common-sense knowledge. In other words, it must "read the picture and understand its meaning". Understanding information through vision is a basic human ability, but it is a very demanding cognitive task for AI.

The core difficulty of the VQA Challenge is this: a single AI model must integrate complex computer vision and natural language techniques to generate correct answers from a given picture and natural language question.

02 Behind the high score of VQA

To tackle the VQA Challenge, DAMO Academy systematically designed an AI vision-text reasoning system and integrated a large number of algorithmic innovations to optimize the computing process:

Improved image comprehension

In the test, the AI first needs to scan the image information. To improve its image understanding, DAMO Academy applied a number of innovative algorithms.

Diverse visual feature representation: multiple kinds of visual features, such as Region, Grid, and Patch, are used to represent the local and global semantic information of the image from different perspectives.
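To make the difference between these views concrete, here is a minimal sketch, not DAMO Academy's implementation. It assumes a toy grayscale image stored as nested lists and replaces real feature extractors with simple mean pooling; the helper names (`crop`, `mean_pool`, `patch_features`, `region_features`) and the example boxes are hypothetical. Grid features, which are usually read from a convolutional feature map, are left out because they require a trained network.

```python
from typing import List, Tuple

Image = List[List[float]]  # toy grayscale image: rows of pixel values


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the sub-image covered by a box."""
    return [row[left:left + width] for row in image[top:top + height]]


def mean_pool(image: Image) -> float:
    """Collapse a crop to one number; a real system would output a feature vector."""
    values = [v for row in image for v in row]
    return sum(values) / len(values)


def patch_features(image: Image, size: int) -> List[float]:
    """Patch view: split the image into non-overlapping size x size tiles."""
    height, width = len(image), len(image[0])
    return [
        mean_pool(crop(image, top, left, size, size))
        for top in range(0, height - size + 1, size)
        for left in range(0, width - size + 1, size)
    ]


def region_features(image: Image, boxes: List[Tuple[int, int, int, int]]) -> List[float]:
    """Region view: one feature per (top, left, height, width) box.

    In practice the boxes would come from an object detector; here they are given.
    """
    return [mean_pool(crop(image, *box)) for box in boxes]


# 4 x 4 toy image with pixel values 0..15, invented for illustration.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patch_features(image, size=2)
regions = region_features(image, boxes=[(0, 0, 4, 4), (1, 1, 2, 2)])
```

Patches tile the image uniformly regardless of content, while regions follow whatever objects were boxed, so the two views can emphasize different parts of the same picture.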

Better AI understanding of image-text association

The AI needs to combine its understanding of the question text with the image to establish the relevance between the two, that is, multi-modal information fusion.

Multi-modal pre-training models: DAMO Academy proposed SemVLP, Grid-VLP, E2E-VLP, Fusion-VLP, and other pre-training models for multi-modal information fusion and semantic mapping.

Adaptive cross-modal semantic fusion and alignment: to make this fusion more efficient, DAMO Academy developed an adaptive cross-modal semantic fusion and alignment technique and added a Learning to Attend mechanism to the pre-training model. The self-developed multi-modal pre-training models E2E-VLP and StructuralLM were accepted by ACL 2021, a top international conference.
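The internals of the Learning to Attend mechanism are not described here. Purely as a generic illustration of cross-modal fusion, the sketch below uses standard scaled dot-product attention, in which each question token attends over the image features. It is a swapped-in textbook technique, not DAMO Academy's method, and the function names and toy embeddings are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(
    text: List[Vector], image: List[Vector]
) -> Tuple[List[Vector], List[List[float]]]:
    """Each text token attends over all image features.

    The image features serve as both keys and values in this simplified form.
    """
    dim = len(image[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for token in text:
        w = softmax([dot(token, feat) / scale for feat in image])
        weights.append(w)
        # Weighted average of the image features for this token.
        fused.append([sum(wi * feat[d] for wi, feat in zip(w, image)) for d in range(dim)])
    return fused, weights


# Toy 2-dimensional embeddings, invented for illustration.
question_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused, weights = cross_attend(question_tokens, image_features)
```

In this toy setup the first token puts most of its weight on the first image feature and the second token on the second, which is the alignment behaviour that fusion aims for.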

Giving AI more common sense

On top of image-text fusion, more common-sense content is added to the AI to improve its ability to understand and reason about images and text.
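The multi-skill integration described in this section relies on Mixture of Experts routing. As a rough sketch of that general idea only, the code below lets a softmax gate weight the outputs of two hypothetical experts (`counting_expert` and `knowledge_expert`). The gate weights, feature vector, and candidate answers are invented for illustration and do not reflect the actual system.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(
    features: Vector, gate_weights: List[Vector], experts: List[Expert]
) -> Tuple[Vector, List[float]]:
    """Blend expert outputs using a gate that is linear in the input features."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in gate_weights]
    gate = softmax(logits)  # one weight per expert
    outputs = [expert(features) for expert in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gate, outputs))
        for i in range(len(outputs[0]))
    ]
    return mixed, gate


answers = ["2", "Star Wars"]  # candidate answers, invented for illustration


def counting_expert(features: Vector) -> Vector:
    """Hypothetical expert that always favours the numeric answer."""
    return [1.0, 0.0]


def knowledge_expert(features: Vector) -> Vector:
    """Hypothetical expert that always favours the encyclopedic answer."""
    return [0.0, 1.0]


# features[1] is assumed to flag a knowledge-style question.
features = [0.0, 3.0]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
mixed, gate = mixture_of_experts(features, gate_weights, [counting_expert, knowledge_expert])
best_answer = answers[mixed.index(max(mixed))]
```

Because the gate depends on the input, a counting question and a knowledge question can be routed to different experts within one model.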

Knowledge-driven multi-skill AI integration: Mixture of Experts (MoE) technology is used to integrate multiple knowledge-driven skills, which is akin to giving the AI life skills such as counting and clock reading, as well as "human common sense" such as encyclopedic knowledge.

03 The future of VQA

VQA technology has a wide range of application scenarios. It can be used for image-text reading, cross-modal search, visual question answering for blind users, medical consultation, intelligent driving, and more, and it may change the way humans and computers interact.

Currently, VQA technology is already used within Alibaba in scenarios such as intelligent customer service, live video interaction, and cross-modal search.

For example, some Taobao and Tmall stores that use Xiaomi customer service have enabled the VQA function. A product's detail poster usually contains a large amount of valuable product information. When consumers ask questions about the product, the AI customer service can answer by understanding and searching the product posters, for example by cropping out a small image from a poster.

This not only helps consumers resolve their questions quickly but also saves sellers substantial configuration costs. VQA capabilities are also supported in the customer service scenarios of Hema and Kaola and in Xianyu's same-item image-text matching scenario. In the future, once VQA technology has matured through its application in e-commerce, it will be extended to broader fields of social application, such as medical consultation.

