AI Portrait Approaches the Real World

Guided Reading

User Generated Content (UGC) is an important component of multimodal content on the Internet, and the continuous growth of UGC data fuels the prosperity of multimodal content platforms. With the help of massive multimodal data and deep learning models, AI Generated Content (AIGC) is showing explosive growth. Among AIGC tasks, text-to-image generation is a popular cross-modal generation task that aims to generate an image corresponding to a given text. Typical text-to-image models include DALL-E and DALL-E 2 developed by OpenAI. Recently, the industry has also trained larger and newer text-to-image models, such as Parti and Imagen proposed by Google, and Stable Diffusion based on the diffusion model.

However, the above models generally cannot handle Chinese-language needs, and their parameter counts are so large that it is difficult for most users in the open-source community to fine-tune them or run inference with them directly. In addition, the training process of text-to-image generation models lacks an understanding of knowledge, which makes it easy to generate content that violates common sense. This time, on the basis of the Transformer-based text-to-image generation model it previously released (see here), the EasyNLP open-source framework further introduces the text-to-image generation model ARTIST, which integrates rich knowledge from knowledge graphs and can generate more commonsense-consistent pictures under their guidance. We evaluated ARTIST on MUGE, the Chinese text-to-image generation evaluation benchmark, where its results ranked first on the leaderboard. We have also released the checkpoint of the knowledge-enhanced Chinese text-to-image generation model, together with the corresponding fine-tuning and inference interfaces, to the open-source community for free. Users can perform a small amount of domain-related fine-tuning based on our released checkpoint and carry out various artistic creations with one click, without consuming large amounts of computing resources.

EasyNLP (https://github.com/alibaba/EasyNLP) is an easy-to-use and feature-rich Chinese NLP algorithm framework developed by the Alibaba Cloud Machine Learning PAI team based on PyTorch. It includes comprehensive Chinese pre-trained models and technologies for landing Chinese models in production, and provides a one-stop Chinese NLP development experience from training to deployment. EasyNLP provides a simple interface for users to develop NLP models, including the NLP AppZoo and the pre-training ModelZoo, and offers technical support for effectively bringing very large pre-trained models into business applications. As demand for cross-modal understanding grows, EasyNLP is also bringing various cross-modal models, especially those for the Chinese domain, to the open-source community, hoping to serve more NLP and multimodal algorithm developers and researchers, and to work with the community to advance NLP/multimodal technology and the adoption of these models.

This article presents a brief technical interpretation of ARTIST and shows how to use the ARTIST model in the EasyNLP framework.

ARTIST Model Details

The ARTIST model is built on the Transformer model and divides the text-to-image generation task into two stages. In the first stage, the image is vector-quantized with a VQGAN model: for an input image, the encoder encodes it into a fixed-length discrete sequence, and in the decoding stage the discrete sequence is taken as input to output a reconstructed image. In the second stage, the text sequence and the encoded image sequence are used together as input, and a GPT model learns to generate the image sequence conditioned on the text sequence. To strengthen the model's prior knowledge, we designed a Word Lattice Fusion Layer that introduces entity knowledge from the knowledge graph into the model and assists the generation of the corresponding entities in the image, making the entity information of the generated image more accurate. The following figure is the system block diagram of the ARTIST model. We introduce the scheme from two aspects: the overall text-to-image generation process and knowledge injection.

The first stage: VQGAN-based image vector quantization

In the training phase of VQGAN, we use the images in the data to train an image-dictionary codebook with image reconstruction as the training objective, where the codebook stores the vector representation of each image token. In practice, for a picture, an intermediate feature map is obtained after encoding by the CNN encoder, and for each encoding position in the feature map we find the nearest representation in the codebook, thereby converting the image into a discrete sequence of image tokens from the codebook. In the second stage, the GPT model generates an image-token sequence from the text, and this sequence is fed into the VQGAN decoder to reconstruct an image.
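To make the nearest-neighbor lookup concrete, below is a minimal PyTorch sketch of the codebook quantization step described above. The tensor shapes, codebook size, and function name are illustrative assumptions, not the actual VQGAN configuration used by ARTIST.

import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Map each CNN-encoder feature vector to its nearest codebook entry.

    features: (batch, h, w, dim) output of the CNN encoder (shapes assumed)
    codebook: (num_tokens, dim) learned image-token dictionary
    Returns the discrete token ids and the quantized vectors.
    """
    flat = features.reshape(-1, features.shape[-1])            # (batch*h*w, dim)
    # Squared L2 distance between every feature and every codebook vector
    dists = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ codebook.t()
             + codebook.pow(2).sum(1))
    token_ids = dists.argmin(dim=1)                            # nearest entry per position
    quantized = codebook[token_ids].reshape(features.shape)    # vectors for the decoder
    return token_ids.reshape(features.shape[:-1]), quantized

# Illustrative sizes only: a 16x16 grid of 256-dim features, 1024-entry codebook
codebook = torch.randn(1024, 256)
feats = torch.randn(2, 16, 16, 256)
ids, quant = quantize(feats, codebook)   # ids is the discrete image-token sequence

The returned token ids form the discrete sequence that the second-stage GPT learns to generate, while the quantized vectors are what the VQGAN decoder consumes when reconstructing the image.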

The second stage: generating image sequences using GPT with text sequences as input

To integrate knowledge from the knowledge graph into the text-to-image generation model, we first trained TransE on the Chinese knowledge graph CN-DBpedia and obtained the representations of the entities in the knowledge graph. In the GPT training phase, for the text input, we first identify all entities and then combine the trained entity representations with the token embeddings to enhance entity representation. However, because each text token may belong to multiple entities, introducing the representations of all of these entities into the model could cause knowledge noise. We therefore designed an entity representation interaction module: by computing the interaction between each entity representation and the token embedding, we weight all the entity representations and inject knowledge selectively. Specifically, we calculate the importance of each entity representation to the current token embedding, measured by the inner product, and then inject the weighted average of the entity representations into the current token embedding. The calculation process is as follows:
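The formula image from the original post is not reproduced here; the following is a plausible LaTeX reconstruction based purely on the description above, with symbols of our own choosing and a softmax normalization assumed for the inner-product weights:

a_{ij} = \frac{\exp(\mathbf{t}_i^{\top} \mathbf{e}_{ij})}{\sum_{k=1}^{m_i} \exp(\mathbf{t}_i^{\top} \mathbf{e}_{ik})}, \qquad \tilde{\mathbf{t}}_i = \mathbf{t}_i + \sum_{j=1}^{m_i} a_{ij}\,\mathbf{e}_{ij}

Here \mathbf{t}_i is the embedding of the i-th token, \mathbf{e}_{ij} is the TransE representation of its j-th candidate entity, and m_i is the number of candidate entities for that token.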

After obtaining the knowledge-injected token embeddings, we build a Transformer-based GPT model from self-attention networks with layer normalization. The process is as follows:
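The corresponding formula image is also missing from this version; a standard layer-normalized Transformer block is sketched below as a plausible reconstruction (whether ARTIST applies layer norm before or after each sub-layer is not stated here, so treat the exact arrangement as an assumption):

\mathbf{h}'^{(l)} = \mathbf{h}^{(l)} + \mathrm{SelfAttn}\big(\mathrm{LN}(\mathbf{h}^{(l)})\big), \qquad \mathbf{h}^{(l+1)} = \mathbf{h}'^{(l)} + \mathrm{FFN}\big(\mathrm{LN}(\mathbf{h}'^{(l)})\big)

where \mathbf{h}^{(0)} is the sequence of knowledge-injected token embeddings (plus position embeddings) and \mathrm{LN} denotes layer normalization.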

In the training phase of the GPT model, the text sequence and the image sequence are concatenated as input. Assuming the text sequence is w, the probability of the discrete image-token sequence of the generated image is as follows:
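The formula image is missing from this version of the article; under the standard autoregressive factorization, writing the image-token sequence as \mathbf{x} = (x_1, \dots, x_n), the intended form is:

p(\mathbf{x} \mid \mathbf{w}) = \prod_{i=1}^{n} p\big(x_i \mid x_{<i}, \mathbf{w}\big)

and the training objective described in the next paragraph is then the negative log-likelihood over the image part:

\mathcal{L} = -\sum_{i=1}^{n} \log p\big(x_i \mid x_{<i}, \mathbf{w}\big)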

Finally, the model is trained by minimizing the negative log-likelihood of the image part, yielding the values of the model parameters.

ARTIST Model Results

Standard dataset evaluation results

We evaluated the ARTIST model on several Chinese datasets. The statistics of these datasets are as follows:

For baselines, we consider two settings: zero-shot learning and standard fine-tuning. We take the Chinese CogView model with 4 billion parameters as the zero-shot learner. We also consider two models of the same scale as the ARTIST model, namely the open-source DALL-E model and the OFA model. The experimental data are as follows:

As the results above show, our model achieves strong text-to-image generation quality even with a very small number of parameters (202M). To measure the effectiveness of knowledge injection, we further conducted an ablation evaluation with the knowledge module removed. The experimental results are as follows:

The above results clearly show the role of knowledge injection.

Case Analysis

To compare more directly the quality of images generated by ARTIST and the baseline models in different scenarios, we show the images generated by each model in e-commerce product scenes and natural scenery scenes, as shown below:

Comparison of e-commerce scene effects

Comparison of natural scenery scene effects

The figures above show the advantage of ARTIST in image quality. We further compare the results of our previously released model (see here) with those of the knowledge-enhanced ARTIST model. In the first example, "Handmade antique restoration hairpin Han clothing accessories palace hairpin pearl headdress hair crown", the original model's output mainly highlights the pearl hair crown. In the ARTIST model, knowledge injection for words such as "antique style" steers the generated result toward the pearl hairpins of ancient China.

Input: Handmade antique restoration hairpin Han clothing accessories palace hairpin pearl headdress hair crown

No knowledge injection model

ARTIST

The second example is "a green cauliflower is growing". Because the model did not acquire enough knowledge about the appearance of "cauliflower" during training, without the knowledge injection module it generates a single plant with large green leaves based on the cues "green" and "vegetable". With the ARTIST model, the generated object is much closer to the oval, cauliflower-shaped plant.

A green cauliflower is growing

No knowledge injection model

Evaluation results of the ARTIST model on the MUGE leaderboard

MUGE (Multimodal Understanding and Generation Evaluation, Link) is the industry's first large-scale Chinese multimodal evaluation benchmark and includes a text-to-image generation task. We used the ARTIST model introduced in this article to verify its text-to-image generation performance on the Chinese MUGE leaderboard. As the figure below shows, the images generated by the ARTIST model surpass the other results on the leaderboard in the FID metric (Frechet Inception Distance; the lower the value, the better the quality of the generated images).
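For reference, FID compares the statistics of Inception features extracted from real and generated images. With means \mu_r, \mu_g and covariances \Sigma_r, \Sigma_g of the two feature distributions, the standard definition is:

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)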

Implementation of ARTIST model

In the EasyNLP framework, we built the backbone of the ARTIST model at the model level; its core is the GPT. The inputs are the token ids and the embeddings of the entities they contain, and the output is the discrete sequence corresponding to the patches of the image.
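To make this interface concrete, here is a minimal, self-contained PyTorch sketch of a GPT-style backbone with this input/output contract. All module names, dimensions, and the way entity embeddings are fused are illustrative assumptions, not EasyNLP's actual implementation.

import torch
import torch.nn as nn

class ArtistBackboneSketch(nn.Module):
    """Illustrative GPT-style backbone: token ids + entity embeddings in,
    logits over the image-token codebook out (all sizes are assumptions)."""

    def __init__(self, vocab_size=21128, image_vocab=1024, dim=512, layers=6, heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size + image_vocab, dim)
        self.ent_proj = nn.Linear(dim, dim)   # projects fused entity vectors
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.gpt = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, image_vocab)  # predicts the next image token

    def forward(self, token_ids, entity_emb):
        # entity_emb: per-token weighted entity representation (zeros if none)
        h = self.tok_emb(token_ids) + self.ent_proj(entity_emb)
        n = h.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.gpt(h, mask=causal)          # causal mask keeps generation autoregressive
        return self.head(h)                   # (batch, seq, image_vocab) logits

model = ArtistBackboneSketch()
ids = torch.randint(0, 21128, (2, 32))
ents = torch.zeros(2, 32, 512)
logits = model(ids, ents)                     # (2, 32, 1024)

In the real model, the entity embeddings would come from the Word Lattice Fusion Layer described earlier; here they are zero placeholders.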

ARTIST Model Application Tutorial

Here we briefly introduce how to use the ARTIST model in the EasyNLP framework.

Install EasyNLP

Users can install the EasyNLP algorithm framework by following the instructions on GitHub (https://github.com/alibaba/EasyNLP).

Data preparation

  1. Prepare your own data and encode the images in base64 format: applying ARTIST to a specific domain requires fine-tuning, and users need to prepare the training and validation data for the downstream task as tsv files. Each file contains three columns (idx, text, imgbase64) separated by tabs: the first column is the text id, the second column is the text, and the third column is the base64 encoding of the corresponding picture (see the sketch after this list).

  2. Splice the input data with the lattice and entity position information: the output format is several columns separated by tab characters (idx, text, lex_ids, pos_s, pos_e, seq_len, [Optional] imgbase64).
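As referenced in step 1, below is a minimal Python sketch for producing such a tsv file. The file name and the example row are placeholders, and the lattice/entity splicing of step 2 is not covered here.

import base64
import csv

def image_to_base64(path: str) -> str:
    """Read an image file and return its base64-encoded bytes as a string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Placeholder rows of (idx, text, image path); replace with your own data.
rows = [
    (0, "example caption text", "example.jpg"),
]

with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for idx, text, img_path in rows:
        writer.writerow([idx, text, image_to_base64(img_path)])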

Use Transformer to generate images from text on the Alibaba Cloud machine learning platform PAI

PAI-DSW (Data Science Workshop) is a cloud IDE developed by the Alibaba Cloud machine learning platform PAI. It provides an interactive programming (Notebook) environment for developers at all levels. The DSW Gallery provides a variety of Notebook examples so that users can easily learn DSW and build various machine learning applications. We have also launched a sample Notebook in the DSW Gallery (see the figure below) that uses the Transformer model to generate images from Chinese text. Welcome to try it out!

Future outlook

In this phase of work, we extended the EasyNLP framework with Transformer-based Chinese text-to-image generation and released the model checkpoint, so that users in the open-source community can perform lightweight domain-related fine-tuning and various artistic creations even when resources are limited. In the future, we plan to launch more related models in the EasyNLP framework, so stay tuned. We will also integrate more SOTA models (especially Chinese models) into the EasyNLP framework to support various NLP and multimodal tasks. In addition, the Alibaba Cloud Machine Learning PAI team continues to advance self-developed Chinese multimodal models. Users are welcome to keep following us and to join our open-source community to build Chinese NLP and multimodal algorithm libraries together!
