Cross-Modal Retrieval Based on Generative Models

Introduction

We have entered an era of big data, and data in different modalities such as text and images are growing at an explosive rate. This heterogeneity across modalities also makes search more challenging for users.

For text-visual cross-modal representation, a common approach is to first encode the data of each modality into a modality-specific feature representation and then map it into a common space. The mapping is optimized with a ranking loss so that the distance between the feature vectors of matching image-text pairs is smaller than the distance between non-matching pairs.
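As a rough illustration of this setup, below is a minimal sketch of a bidirectional hinge-based ranking loss over a batch of image and text embeddings that have already been mapped into the common space. It assumes PyTorch and L2-normalized embeddings; the function name and the margin value are illustrative and not the paper's exact formulation.

# Sketch of a bidirectional hinge-based ranking loss over a batch of
# image/text embeddings already mapped into the common space.
# Generic formulation for illustration, not the exact loss of the paper.
import torch

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings of matched pairs."""
    scores = img_emb @ txt_emb.t()          # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)         # similarity of matched pairs

    # image-to-text: the matched caption should beat every other caption
    cost_i2t = (margin + scores - pos).clamp(min=0)
    # text-to-image: the matched image should beat every other image
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)

    # zero out the diagonal (the positive pairs themselves)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()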

Although the features learned in this way describe the high-level semantics of multimodal data well, the local similarity within images and the sentence-level similarity within text are not fully exploited. For example, when retrieving images from a text query, we also care about details such as the color, texture, and layout of the image; matching only high-level features cannot account for this local similarity.

The idea of this paper comes from reflecting on how humans think. Given a text description and asked to retrieve matching images, a trained painter can find better matches than an ordinary person, because the painter knows what the expected picture should look like; similarly, given a picture and asked to retrieve matching text descriptions, a writer tends to give better descriptions than the average person. We call this process of anticipating the retrieval target "imagining" it. Based on this idea, the paper proposes a generative cross-modal feature learning framework (GXN). The figure below illustrates the idea.

The original Look-and-Match pipeline is extended into three steps: Look, Imagine, and Match, or "what you see, what you think, what you find". Look ("what you see") is understanding the input, which in practice means extracting features. Imagine ("what you think") uses what was seen to imagine the expected matching result, that is, to generate data of the target modality from the extracted local features. Match ("what you find") combines local-level (sentence-level / pixel-level) matching with high-level semantic feature matching to produce the final retrieval result.

Method

GXN consists of three modules: multimodal feature representation (upper path), image-to-text generative feature learning (blue path), and text-to-image generative adversarial feature learning (green path).

The first module (upper path) is similar to standard cross-modal feature representation: it maps data from different modalities into a common space. It contains an image encoder and two sentence encoders. The two sentence encoders are kept separate so that features can be learned at different levels: one produces the high-level semantic feature, and the other produces the local-level feature. The local-level features are learned through the generative models described below.
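As a rough sketch of how such a representation module could be organized (assuming PyTorch; the class name, backbone features, dimensions, and use of GRUs are hypothetical and differ from the paper's actual encoders):

import torch
import torch.nn as nn

class MultimodalEncoders(nn.Module):
    """Sketch: one image encoder and two sentence encoders sharing a common space.
    Hypothetical sizes and backbones; the paper's actual encoders differ."""
    def __init__(self, vocab_size, word_dim=300, hidden_dim=1024, img_feat_dim=2048):
        super().__init__()
        self.img_fc = nn.Linear(img_feat_dim, hidden_dim)   # projects precomputed CNN features
        self.embed = nn.Embedding(vocab_size, word_dim)
        # two separate RNN encoders: one for high-level semantics, one for local-level features
        self.rnn_high = nn.GRU(word_dim, hidden_dim, batch_first=True)
        self.rnn_local = nn.GRU(word_dim, hidden_dim, batch_first=True)

    def forward(self, img_cnn_feat, captions):
        v = nn.functional.normalize(self.img_fc(img_cnn_feat), dim=-1)
        w = self.embed(captions)
        _, t_high = self.rnn_high(w)     # final hidden state as high-level sentence feature
        _, t_local = self.rnn_local(w)   # separate local-level feature, trained via generation
        return v, t_high.squeeze(0), t_local.squeeze(0)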

The second module (blue path) generates a text description from the local visual features. It consists of an image encoder and a sentence decoder. When computing the loss, the method borrows ideas from reinforcement learning: a reward encourages the generated sentence to be as similar as possible to the ground-truth sentence.
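A minimal sketch of such a reinforcement-learning-style objective for the decoder, assuming PyTorch and a REINFORCE-style update with a baseline; the rewards (for example a CIDEr or similarity score between a sampled caption and the ground-truth caption) are treated as precomputed inputs here and are an assumption, not the paper's exact reward design:

# Sketch of a REINFORCE-style update for the image-to-text decoder.
# sampled_reward and baseline_reward are rewards (e.g. CIDEr or similarity
# scores) computed elsewhere; their exact definition is an assumption.
import torch

def rl_caption_loss(log_probs, sampled_reward, baseline_reward):
    """log_probs: (B, T) log-probabilities of the sampled caption tokens.
    sampled_reward / baseline_reward: (B,) rewards for sampled vs. baseline caption."""
    advantage = (sampled_reward - baseline_reward).detach()  # (B,)
    # policy-gradient objective: scale the sequence log-likelihood by the advantage
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()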

The third module (green path) generates an image from the text features using a generator and a discriminator; the discriminator learns to distinguish text-generated images from real images.
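For illustration, standard adversarial objectives of this kind could look as follows (a sketch assuming PyTorch; the paper's exact adversarial losses and text-conditioning details may differ):

# Sketch of adversarial objectives for text-to-image generation: the
# discriminator separates real images from images generated from text
# features, while the generator tries to fool it. Standard GAN losses
# are used here for illustration only.
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def generator_loss(d_fake_logits):
    # the generator is rewarded when the discriminator labels its output as real
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))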

Finally, better cross-modal feature representations are learned through this two-way cross-modal generative feature learning. At test time, we only need to compute the similarity between the image feature and the sentence feature in the common space to perform cross-modal retrieval.
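At retrieval time this reduces to a nearest-neighbor search by similarity; a minimal sketch assuming PyTorch and L2-normalized common-space embeddings (function and parameter names are illustrative):

# Sketch of retrieval at test time: rank all captions for a query image
# (the image-to-text direction; text-to-image is symmetric) by cosine
# similarity of their common-space embeddings.
import torch

def rank_captions(img_emb, txt_embs, top_k=5):
    """img_emb: (D,), txt_embs: (N, D); both assumed L2-normalized."""
    sims = txt_embs @ img_emb                 # (N,) cosine similarities
    return sims.topk(top_k).indices           # indices of the best-matching captions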

Experiment

The proposed method is compared with current state-of-the-art methods on the MSCOCO dataset and achieves state-of-the-art results.

Summary

This paper innovatively introduces image-to-text and text-to-image generation models into traditional cross-modal representation learning, so that the model learns not only high-level abstract representations of multimodal data but also local, low-level representations. Its significantly better performance than state-of-the-art methods confirms the effectiveness of the proposed approach.
