Almost Human: Alibaba AI Ranks First in Visual Dialog Challenge

In this article, we highlight the Alibaba Artificial Intelligence (AI) team's achievement in the recent Visual Dialog Challenge, where its results were comparable to human performance.

Recently, in the second Visual Dialog Challenge, the Alibaba AI team defeated the other nine participants (including Microsoft and Seoul National University) and won the championship.

(Alibaba AI won the Visual Dialog Challenge)

Launched by institutions including the Georgia Institute of Technology and Facebook AI Research (FAIR), together with the premier computer vision conference CVPR, the Visual Dialog Challenge is currently regarded as the most authoritative visual dialog competition.

In this Challenge, an AI agent is required to answer any question about any picture after it has seen nearly 10,000 pictures. According to the competition results, Alibaba AI won the event with an accuracy rate of 74.57%, 16.82% higher than the previous winner. On the same data set, human accuracy on this task is only 64.27%.
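To put the two accuracy figures above side by side, here is a minimal Python sketch that computes the gap between Alibaba AI and the human baseline. The function name `accuracy_gap` and the constant names are illustrative and not part of the Challenge; only the two percentages come from the results reported above.

```python
def accuracy_gap(model_acc, human_acc):
    """Return the gap as (percentage points, relative percent over the baseline)."""
    absolute = model_acc - human_acc          # difference in percentage points
    relative = absolute / human_acc * 100     # difference relative to the baseline
    return round(absolute, 2), round(relative, 2)


# Figures reported in the competition results above.
ALIBABA_ACCURACY = 74.57
HUMAN_ACCURACY = 64.27

points, percent = accuracy_gap(ALIBABA_ACCURACY, HUMAN_ACCURACY)
print(f"Gap over humans: {points} percentage points ({percent}% relative)")
```

Keeping the percentage-point gap and the relative gap separate matters because the two numbers differ noticeably for the same pair of scores.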

Traditional visual AI is mainly about detecting and recognizing targets, for example, detecting whether a picture contains a cat. However, it is not good at understanding and deducing the logical relations between targets in complex scenarios. As a result, traditional AI cannot answer complex questions (for example, what color are the clothes on the boy next to the cat?) and cannot convert picture information into language output that humans can understand.

Alibaba's AI breakthrough lies in its recursive exploration dialog model, which integrates three major capabilities: image recognition, relation inference, and natural language comprehension. By efficiently using the annotation information, Alibaba AI learns and simulates the way humans think in complex scenarios. As a result, Alibaba AI can efficiently recognize the entities in a picture and the relationships between them, deduce the content of the picture, understand the questions humans ask and their real intentions through context-based modeling, and provide accurate answers.

(In the visual dialog, the AI bot on the left side can correctly answer questions from the human on the right side)

Visual dialog is a new AI research trend that has gained significant momentum in recent years. Its objective is to teach machines to discuss visual content with humans in natural language. If visual recognition gives machines the ability to see, visual dialog technology gives them the ability to understand the visual world and make correct inferences. This indicates that AI cognition has reached a new level.

(Visual dialog technology is expected to improve the efficiency of earthquake rescue)

It is reported that this technology will be applied in many human-machine interaction scenarios. Search and rescue robots can search for survivors in ruins in a more timely and efficient manner based on directives and scenario information. Visually impaired people can use Alibaba AI to understand the content of pictures and learn about their surroundings. Autonomous vehicles can gain a more accurate understanding of the intentions of the factors that affect them and provide passengers with a better experience.

Alibaba Clouder
