How to solve the language prior problem in VQA

1. Research motivation

Visual Question Answering (VQA) is a task at the intersection of computer vision and natural language processing: given an image and a question about it, the model must answer the question based on the image content. VQA has broad application prospects in scenarios such as online education and visual assistants for the blind.

For example, a virtual teacher on an online education platform can answer a series of questions from lower-grade students based on the picture in Figure 1(a), such as "How many children are there in the picture?" or "Which season does the picture depict?". A visual assistant for the blind can receive a picture such as the one in Figure 1(b), taken by a blind user, and answer the user's question "Do you see picnic tables across the parking lot?" with "No".

Recent studies [2-4] show that most existing VQA models rely too heavily on superficial correlations between questions and answers (language priors) while ignoring image information. Regardless of the given image, they frequently answer "white" to questions about color, "tennis" to questions about sports, and "yes" to questions beginning with "is there a". Most VQA models consist of three parts: extracting image and question representations, fusing them into a joint representation, and using the joint representation to predict the final answer. However, these models do not explicitly distinguish and utilize the different kinds of information in the question, so they inevitably rely on co-occurrences between interrogative words and answers to infer the answer.

The authors propose to learn decomposed linguistic representations for the different kinds of information in a question and to exploit these decomposed representations to overcome language priors.

A question-answer pair usually contains three kinds of information: the question type, the referent, and the expected concept. For yes/no questions, the expected concept is implied in the question; for other (non-yes/no) questions, the expected concept is contained in the answer. Humans can easily identify and leverage the different information in a question to infer the answer, independent of language priors.

For example, if a person is asked "Is the man's shirt white?", he or she knows it is a yes/no question as soon as the word "is" appears, with possible answers "yes" and "no". He or she can then locate the shirt in the image via the phrase "man's shirt" and judge from the shirt whether it is white (the concept expected in the question).

This process also applies to other questions, such as "What color is the man's shirt?". The only difference is that, from "what color", he or she knows the question asks about color, and several specific concepts such as white, black, and blue come to mind as candidate answers to be verified against the image. Inspired by this, the authors aim to build a model that can flexibly learn and utilize decomposed representations of the different information in a question to alleviate the influence of language priors.

To this end, the authors propose a VQA method based on language attention. As shown in Figure 2, the method includes a language attention module, a question recognition module, an object referencing module and a visual verification module. The language attention module parses the question into three phrase representations: a type representation, an object representation and a concept representation. These decomposed language representations are fed into the subsequent modules. The question recognition module uses the type representation to identify the question type and the set of possible answers (yes/no, or specific concepts such as colors or numbers).

By measuring the correlation between the type representation and the candidate answers, this module generates a Q&A mask expressing how likely each candidate answer is to be the correct one. The object referencing module employs a top-down attention mechanism [5] to focus on relevant image regions guided by the object representation. For yes/no questions, the visual verification module measures the correlation between the attended region and the concept representation and infers the answer by threshold comparison. For other questions, the possible answers discovered by the question recognition module serve as candidate concepts: the module measures the correlation between the attended region and all possible answers, and these scores are fused with the Q&A mask to infer the final answer.

By identifying and utilizing different information in the question, the method decouples language-based concept discovery and vision-based concept verification from the answer reasoning process. Therefore, the superficial correlation between question and answer does not dominate the answer inference process, and the model must leverage the image content to infer the final answer from the set of possible answers. In addition, benefiting from a modular design, the method enables a transparent answering process. The intermediate results of the four modules (decomposed phrases, Q&A masks, regions of interest and visual scores) can be used as explanations for the model to arrive at a particular answer.

2. Method

As shown in Figure 2, the proposed method consists of four modules:

(a) a language attention module that parses the question into a type representation, an object representation, and a concept representation;
(b) a question recognition module that uses the type representation to identify the question type and the possible answers;
(c) an object referencing module that uses the object representation to focus on relevant regions of the image;
(d) a visual verification module that measures the correlation between the attended regions and the concept representation to infer the answer.

2.1 Language attention module

The authors design a language attention module, shown in Figure 3, to obtain the decomposed language representations. The module combines a hard attention mechanism with a soft attention mechanism so that, for yes/no questions, the concept representation can be separated from the type representation. The soft attention mechanism adaptively weights all word embeddings and aggregates them into a phrase representation, while the hard attention mechanism uses only a subset of the word embeddings. As shown in Figure 3, the language attention module uses three kinds of attention, type attention, object attention and concept attention, to learn the three decomposed representations.

For a question Q, the authors use the soft attention mechanism to compute its type representation:
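The paper's exact formulation is not reproduced here; a typical soft-attention form consistent with the description, where $e_i$ denotes the embedding of the $i$-th word and $W_t$, $w_t$ are learnable parameters (symbols assumed for illustration), is:

$$\alpha_i^{t} = \frac{\exp\big(w_t^{\top}\tanh(W_t e_i)\big)}{\sum_{j}\exp\big(w_t^{\top}\tanh(W_t e_j)\big)}, \qquad q^{t} = \sum_{i}\alpha_i^{t}\, e_i,$$

where $q^{t}$ is the type representation of the question.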

To ensure that the type attention focuses on interrogative words, the authors use the interrogative-word-based question categories provided in the VQA dataset as supervision to guide type attention learning, and introduce a question category recognition loss:
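A plausible form of this loss, assuming a category classifier $f_{cat}$ applied to the type representation $q^{t}$ and a ground-truth question category $c^{*}$ (both symbols are assumptions, not taken from the paper), is a standard cross-entropy:

$$\mathcal{L}_{cat} = \mathrm{CE}\big(f_{cat}(q^{t}),\, c^{*}\big).$$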

The authors further introduce a scalar threshold β to filter out the words related to the question type. Comparing each word's type attention weight with β yields the words whose weight is below β; these words are associated with the referent and the expected concept. Two further attention mechanisms are then used to compute the object representation and the concept representation:
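As a hedged sketch (symbols assumed), this can be realized by hard-masking the type-related words and renormalizing a soft attention over the remaining ones, e.g. for the object representation:

$$m_i = \mathbb{1}\big[\alpha_i^{t} < \beta\big], \qquad \alpha_i^{o} = \frac{m_i \exp\big(w_o^{\top}\tanh(W_o e_i)\big)}{\sum_{j} m_j \exp\big(w_o^{\top}\tanh(W_o e_j)\big)}, \qquad q^{o} = \sum_{i}\alpha_i^{o}\, e_i,$$

and analogously for the concept representation $q^{c}$ with its own attention parameters.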

For each candidate answer, a GRU is used to obtain its representation.

2.2 Question Recognition Module

Given the type representation, the question type is first identified; this identification is supervised by a question recognition loss:

where CE stands for the cross-entropy function.
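A hedged sketch of this loss, assuming a classifier $f_{qt}$ that predicts from $q^{t}$ whether the question is a yes/no question, with ground-truth label $y_{qt}$ (symbols assumed), is:

$$\mathcal{L}_{qt} = \mathrm{CE}\big(f_{qt}(q^{t}),\, y_{qt}\big).$$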

For yes/no questions, the possible answers are "yes" and "no". For other questions, the possible answers can be determined by measuring the correlation between the question and the candidate answers. The authors generate a Q&A mask for every other question, where each element represents the likelihood that the corresponding candidate answer is the correct one. Concretely, they compute the correlation between the question Q and all candidate answers and apply the sigmoid function to obtain the Q&A mask.
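A minimal sketch of this step ($W_m$ and the answer encoder $g$ are assumed names), for the $k$-th candidate answer $a_k$:

$$M'_k = \sigma\big(q^{t\top} W_m\, g(a_k)\big),$$

where σ is the sigmoid function and $M'$ is the generated Q&A mask.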

To effectively guide mask generation, the authors search all possible answers for each question category in the dataset to obtain the ground truth M of the Q&A mask: for each question category, possible answers are marked as 1 and all others as 0. The authors use the KL divergence to measure the distance between the ground-truth mask and the generated mask, and propose a mask generation loss:
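One plausible instantiation (assuming both masks are normalized into distributions $\bar{M}$ and $\bar{M}'$; this normalization is an assumption, not taken from the paper) is:

$$\mathcal{L}_{mask} = \mathrm{KL}\big(\bar{M}\,\|\,\bar{M}'\big) = \sum_{k}\bar{M}_k \log\frac{\bar{M}_k}{\bar{M}'_k}.$$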

2.3 Object Referencing Module

The object referencing module uses the object representation, together with a top-down attention mechanism, to focus on the question-relevant regions of the image. Given the local features of an image and the object representation of the question, the authors compute the correlation of each local feature with the object representation and obtain the final visual representation by weighted summation:
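A typical top-down attention form consistent with this description ($v_r$ denotes the feature of the $r$-th image region; $W_v$, $W_q$, $w_v$ are assumed parameter names) is:

$$\gamma_r = \frac{\exp\big(w_v^{\top}\tanh(W_v v_r + W_q q^{o})\big)}{\sum_{r'}\exp\big(w_v^{\top}\tanh(W_v v_{r'} + W_q q^{o})\big)}, \qquad \hat{v} = \sum_{r}\gamma_r\, v_r.$$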

2.4 Visual Verification Module

The visual verification module judges, from the attended region, whether the expected visual concept is present, and infers the final answer accordingly. For yes/no questions, the authors directly compute a visual score between the visual representation of the attended region and the concept representation. For other questions, they compute the visual score between the attended region and every candidate answer and fuse it with the Q&A mask to obtain an overall score for each answer. In particular, given an image I and a question Q, the probability that a candidate answer is correct is:
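A hedged sketch of these scores ($W_s$ and the answer encoder $g$ are assumed names) could be:

$$s = \sigma\big(\hat{v}^{\top} W_s\, q^{c}\big) \ \text{for yes/no questions}, \qquad P(a_k \mid I, Q) \propto \sigma\big(\hat{v}^{\top} W_s\, g(a_k)\big)\cdot M'_k \ \text{for other questions}.$$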

The final overall verification loss is:
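As a sketch only (the binary cross-entropy form and the ground-truth score $y$ are assumptions), the verification loss could take the form

$$\mathcal{L}_{ver} = \mathrm{BCE}\big(s,\, y\big)$$

for yes/no questions, with an analogous term over the candidate answers for other questions.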

During training, all of the above losses work together to guide model learning. At test time, the authors first determine the question type using Equation 4. For yes/no questions, the visual score is computed and compared with 0.5 to obtain the final answer. For other questions, the model takes all candidate answers as input and chooses the one with the highest overall score as the answer.
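The inference procedure described above can be summarized in pseudocode form. The following minimal Python sketch is illustrative only: the module interfaces (language_attention, predict_question_type, qa_mask, object_referencing, visual_score, embed_answer) are assumed names, not the authors' implementation.

```python
def answer(image, question, candidate_answers, model, threshold=0.5):
    """Illustrative inference loop for the decomposed-representation VQA method.

    All module interfaces below are assumed names, not the authors' code.
    """
    # Language attention: decompose the question into three representations.
    q_type, q_obj, q_concept = model.language_attention(question)

    # Question recognition: is this a yes/no question, and which answers are plausible?
    is_yes_no = model.predict_question_type(q_type)
    mask = model.qa_mask(q_type, candidate_answers)   # Q&A mask over candidate answers

    # Object referencing: attend to the question-relevant image region.
    attended_region = model.object_referencing(image, q_obj)

    if is_yes_no:
        # Visual verification of the concept mentioned in the question.
        score = model.visual_score(attended_region, q_concept)
        return "yes" if score > threshold else "no"

    # For other questions, verify every candidate answer and fuse with the mask.
    scores = [model.visual_score(attended_region, model.embed_answer(a))
              for a in candidate_answers]
    overall = [s * m for s, m in zip(scores, mask)]
    best = max(range(len(candidate_answers)), key=lambda k: overall[k])
    return candidate_answers[best]
```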

3. Experiment

The authors evaluate their method with the standard VQA evaluation metric on the VQA-CP v2 dataset. VQA-CP v2 is a dataset designed specifically to evaluate how strongly a model is affected by language priors; its training and test sets are obtained by reorganizing the training and validation sets of VQA v2.

In the VQA-CP v2 dataset, the answer distribution of each question category (such as "what number" and "is there") differs between the training set and the test set. Therefore, VQA models that rely heavily on language priors do not perform well on this dataset. For completeness, the authors also report results on the validation set of the VQA v2 dataset.

Table 1 lists the results of the proposed method and state-of-the-art VQA models on the VQA-CP v2 and VQA v2 datasets. The results show that the proposed method outperforms the compared methods. By learning and utilizing decomposed language representations, the method decouples language-based concept discovery and vision-based concept verification from the answer reasoning process. For yes/no questions, the language attention module explicitly separates the concept representation from the interrogative words, so the model must verify the existence of the visual concept mentioned in the question against the image content to infer the answer, instead of relying on the interrogative words. For other questions, the possible answers identified by the question recognition module are verified against the attended regions, and the answer with the highest score is taken as the prediction; the model therefore needs to leverage the image content to select the most relevant answer. Altogether, the approach ensures that the model must use the visual information of the image to infer the correct answer from the set of possible answers, thereby significantly mitigating the impact of language priors.

In order to evaluate the effectiveness of various important components in the method, the authors obtained different versions of the model by ablating these components. Table 2 lists the results of these models on the VQA-CP v2 dataset.

The authors first investigate the effectiveness of the language attention module. To this end, they replace the language attention module with a plain GRU network and obtain a model denoted "Ours w/o LA", where "w/o" means "without". "Ours w/o LA" obtains sentence-level question representations via the GRU and feeds them into the subsequent modules.

The authors further replace the language attention module with a conventional soft attention mechanism and obtain a model named "Ours w/o threshold". As shown in Table 2, "Ours w/o LA" performs worse than the full model on all three subsets, which demonstrates the effectiveness of learning decomposed representations to overcome language priors. "Ours w/o threshold" performs better than "Ours w/o LA" but still much worse than the full model, which shows that simply using soft attention to learn decomposed representations, without any additional constraint, leaves the model vulnerable to language priors.

The authors then investigate the impact of the mask generation process in the question recognition module. They exclude the mask generation process and denote the resulting model "Ours w/o mask", which directly selects the answer with the highest visual score as the final answer. Table 2 shows that the Q&A mask brings a substantial improvement on the "other" subset; the model without the Q&A mask performs much worse than the full model on the "number" and "other" subsets.

Finally, Figure 4 presents a qualitative comparison between the proposed method and its base model UpDn [5]. For each example, an input question from the VQA-CP v2 test set is shown at the upper left, together with the image and the ground-truth answer (GT). The language attention map of the language attention module is shown at the upper right. The bottom row shows, for the two methods, the visual attention maps used for object referencing and the predicted answers; the regions with the highest weights are marked with green rectangles. These examples show the effectiveness of the proposed model in question parsing and visual grounding: the method correctly identifies the different information in the question, locates relevant image regions more accurately than UpDn, and infers the correct answer.

4. Summary

The authors propose a VQA method based on language attention. The method learns decomposed linguistic representations of the question to overcome language priors. Using the language attention module, a question can be flexibly parsed into three phrase representations, which are then used to decouple language-based concept discovery and vision-based concept verification from the answer inference process. The method thus guarantees that superficial correlations between questions and answers cannot dominate the answering process and that the model must leverage the image to infer the answer.

Furthermore, the method enables a more transparent answering process and provides meaningful intermediate results. Experimental results on the VQA-CP dataset demonstrate the effectiveness of the method.
