EasyNLP Supports Chinese and English Machine Reading Comprehension

Guide
Machine reading comprehension is one of the most important research directions in natural language processing (NLP), and in particular in natural language understanding (NLU). Since it was first proposed in 1977, machine reading comprehension has developed for nearly 50 years, passing through several stages: hand-crafted rules, traditional machine learning, deep learning, and large-scale pre-trained models.
Machine reading comprehension aims to help humans quickly locate relevant information in large amounts of text, reducing the cost of manual information acquisition and increasing the effectiveness of information retrieval. As a comprehensive test of natural language understanding, the task examines comprehension at every granularity of language, from words to sentences to whole passages, and remains one of the most challenging and hotly contested problems in the field. Reading comprehension datasets, represented by SQuAD, witnessed the innovative solutions of major companies during the boom of deep learning, and later became standard evaluation benchmarks for pre-trained models such as BERT. It is fair to say that over the past decade, machine reading comprehension has both driven and witnessed the prosperity of natural language processing.
Formally, the input of a machine reading comprehension task is a passage of text (context) and a question (question), and the model outputs a predicted answer (answer). According to how the answer is obtained, the industry mainstream divides reading comprehension tasks into four categories: cloze tests, multiple choice, span extraction, and free answering. In span extraction, the model predicts the start and end positions of the answer directly in the context according to the question, and extracts the answer as a span of the passage. Because it is close to real scenarios, moderately difficult, easy to evaluate, and supported by high-quality datasets such as SQuAD, span extraction has become the mainstream reading comprehension task. With the development of pre-trained language models, span-extraction reading comprehension has repeatedly hit new highs in recent years. In English, classical models such as BERT, RoBERTa, and ALBERT can surpass human performance; in Chinese, models such as MacBERT (Pre-Training with Whole Word Masking for Chinese BERT) introduce an error-correcting masked language model (Mac) pre-training task that alleviates the inconsistency between pre-training and downstream tasks and better fits the Chinese language, achieving remarkable improvements on a variety of NLP tasks including machine reading comprehension. We have therefore integrated the MacBERT algorithm and model into the EasyNLP framework, alongside the existing BERT, RoBERTa, and other models, so that users can easily train and predict on Chinese and English machine reading comprehension tasks.
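To make span extraction concrete, the following minimal sketch (not EasyNLP code; the function name and toy logits are purely illustrative) shows how an answer span is typically selected from per-token start/end scores:

```python
def extract_answer_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) token pair with the highest combined score,
    subject to start <= end and a maximum answer length."""
    best_score, best_span = float("-inf"), (0, 0)
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    return best_span

# Toy logits over six context tokens: the model is most confident that
# the answer starts at token 2 and ends at token 4.
start = [0.1, 0.2, 3.0, 0.1, 0.0, 0.1]
end = [0.0, 0.1, 0.2, 0.5, 2.5, 0.1]
print(extract_answer_span(start, end))  # (2, 4)
```

The answer text is then recovered by mapping the token span back to the original context characters.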
EasyNLP (https://github.com/alibaba/EasyNLP) is an easy-to-use and comprehensive Chinese NLP algorithm framework developed by the Alibaba Cloud Machine Learning PAI team on top of PyTorch. It supports commonly used Chinese pre-trained models and technologies for putting large models into production, and provides a one-stop NLP development experience from training to deployment. EasyNLP offers a concise interface for developing NLP models, including the application zoo AppZoo and the pre-trained ModelZoo, and provides tooling to help users efficiently bring very large pre-trained models to business scenarios. As a comprehensive test of natural language understanding, machine reading comprehension is also a basic task underlying text question answering and information extraction, and has high research value. EasyNLP has therefore added support for Chinese and English machine reading comprehension, hoping to serve more NLP/NLU algorithm developers and researchers, and to work with the community to advance the development and adoption of NLU technology.
This article provides a technical interpretation of the MacBERT model and shows how to use MacBERT and other pre-trained language models in the EasyNLP framework to train and predict on Chinese and English machine reading comprehension tasks.
Interpretation of the MacBERT model
Mainstream large-scale pre-trained language models (such as BERT and RoBERTa) were designed primarily for English. When transferred directly to Chinese, they run into differences between the two languages: Chinese has no spaces between words, characters do not need to be split into subword units, and a single meaningful word is often composed of several characters. For example, in the sentence "Use the language model to predict the probability of the next word", masking only one character of a multi-character word weakens both the masking scheme and the language model pre-training signal. In addition, traditional language models use the [MASK] token during pre-training, but this token never appears in downstream task text, which naturally introduces a gap between the two stages. To alleviate these problems, models such as MacBERT modify the traditional MLM task and introduce an error-correcting masked language model (Mac) pre-training task, combining masking schemes such as whole word masking (wwm), n-gram masking (NM), and similar-word replacement. This better fits the Chinese language, reduces the inconsistency between pre-training and downstream tasks, and improves the pre-trained model's performance on various NLP tasks. Moreover, since the main architecture of MacBERT is fully consistent with BERT, it can serve as a drop-in replacement without modifying existing code, which makes code migration very convenient for developers.
Specifically, during MLM pre-training, MacBERT masks all characters of a whole Chinese word at the same time and adopts an n-gram masking strategy, with masking probabilities of 40%, 30%, 20%, and 10% for unigrams up to 4-grams. Instead of the [MASK] token, masked words are replaced with synonyms, obtained with the Synonyms toolkit based on word2vec similarity; in the rare cases where no synonym exists, a random word is used instead. Overall, the model masks 15% of the input words: with 80% probability a masked word is replaced by a synonym, with 10% probability by a random word, and with 10% probability it is kept unchanged. In addition, BERT's original NSP task has long been criticized by researchers; MacBERT replaces NSP with SOP (Sentence Order Prediction), where positive examples are consecutive text segments and negative examples swap their order, which clearly improves performance on multi-sentence textual tasks. Ablation results show that removing any of the MLM modifications above leads to a drop in average performance, indicating that the masking changes all help language model learning; likewise, removing the SOP task causes an obvious drop on machine reading comprehension, indicating the necessity of sentence-level pre-training tasks for discourse-level learning.
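The masking scheme described above can be sketched as follows. This is a toy illustration, not the real MacBERT pre-training code: in the actual model, synonyms come from the Synonyms toolkit via word2vec similarity, whereas here `synonyms` is just a plain dict and `vocab` a word list, both hypothetical.

```python
import random

def mac_mask(words, synonyms, vocab, mask_rate=0.15):
    """Toy sketch of the Mac masking scheme: select roughly 15% of
    positions using n-gram spans (n = 1..4 with probabilities
    40/30/20/10%), then replace 80% of the selected words with a
    synonym, 10% with a random word, and keep 10% unchanged."""
    words = list(words)
    target = max(1, int(len(words) * mask_rate))
    masked = 0
    while masked < target:
        n = random.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1])[0]
        start = random.randrange(len(words))
        for i in range(start, min(start + n, len(words))):
            r = random.random()
            if r < 0.8:                      # synonym replacement
                words[i] = synonyms.get(words[i], random.choice(vocab))
            elif r < 0.9:                    # random-word replacement
                words[i] = random.choice(vocab)
            # else: keep the original word unchanged
            masked += 1
    return words
```

Note that no [MASK] token ever appears in the output, which is exactly how the scheme closes the gap between pre-training and downstream text.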
Machine reading comprehension model tutorial
Below, we briefly introduce how to use MacBERT and other pre-trained language models in the EasyNLP framework to train and predict on machine reading comprehension tasks.
Install EasyNLP
Users can directly refer to the instructions on GitHub (https://github.com/alibaba/EasyNLP) to install the EasyNLP algorithm framework.
Quick experience with the pipeline interface
For developers' convenience, EasyNLP implements an inference pipeline. Without training or fine-tuning any model themselves, users can call the fine-tuned Chinese and English machine reading comprehension models in one step through the pipeline interface, simply by executing the following code:
from easynlp.pipelines import pipeline
# Input data
data = [{
    "query": "When did Hangzhou release the \"Hangzhou Asian Games City Action Plan Outline\"?",
    "answer_text": "April 2020",
    "context": "Xinhua News Agency, Hangzhou, September 22nd (Reporters Shang Yiying and Xia Liang) The competition venues have all been completed and their acceptance checks finished. The \"Welcome to the Asian Games\" urban infrastructure construction has entered the fast lane... On the 23rd, the postponed Hangzhou Asian Games will usher in the one-year countdown. Preparations have yielded fruitful results, and the \"City of Paradise\" is ready to go again. A successful event upgrades a city. In April 2020, Hangzhou released the \"Hangzhou Asian Games City Action Plan Outline\", which includes eight specific actions such as infrastructure improvement, protection of green waters and mountains, and digital governance empowerment. As the Asian Games draws closer, major \"two points and two lines\" projects such as Hangzhou West Railway Station, Xiaoshan International Airport Phase III, the Huzhou-Hangzhou section of the Hefei-Hangzhou High-speed Railway, and the Airport Rail Express (Line 19) have been officially put into operation. According to information released by the Hangzhou Urban and Rural Construction Committee, the total mileage of expressways in the city is expected to reach 480 kilometers by the end of September this year. People living here are experiencing the changes taking place quietly: the traffic is more convenient, the roads more beautiful, and the urban infrastructure ever more complete.",
    "qas_id": "CN_01"
}, {
    "query": "How much will the total mileage of expressways in the city be at the end of September this year?",
    "answer_text": "480 kilometers",
    "context": "Xinhua News Agency, Hangzhou, September 22nd (Reporters Shang Yiying and Xia Liang) The competition venues have all been completed and their acceptance checks finished. The \"Welcome to the Asian Games\" urban infrastructure construction has entered the fast lane... On the 23rd, the postponed Hangzhou Asian Games will usher in the one-year countdown. Preparations have yielded fruitful results, and the \"City of Paradise\" is ready to go again. A successful event upgrades a city. In April 2020, Hangzhou released the \"Hangzhou Asian Games City Action Plan Outline\", which includes eight specific actions such as infrastructure improvement, protection of green waters and mountains, and digital governance empowerment. As the Asian Games draws closer, major \"two points and two lines\" projects such as Hangzhou West Railway Station, Xiaoshan International Airport Phase III, the Huzhou-Hangzhou section of the Hefei-Hangzhou High-speed Railway, and the Airport Rail Express (Line 19) have been officially put into operation. According to information released by the Hangzhou Urban and Rural Construction Committee, the total mileage of expressways in the city is expected to reach 480 kilometers by the end of September this year. People living here are experiencing the changes taking place quietly: the traffic is more convenient, the roads more beautiful, and the urban infrastructure ever more complete.",
    "qas_id": "CN_02"
}]
# The pipeline argument is the name of a fine-tuned model
# EasyNLP currently provides Chinese and English machine reading comprehension pipelines,
# backed by a fine-tuned Chinese MacBERT model and an English BERT model respectively
# To try English reading comprehension, change the model name to 'bert-base-rcen'
generator = pipeline('macbert-base-rczh')
results = generator(data)
for input_dict, result in zip(data, results):
    print("context:", result["context"])
    print("query:", result["query"])
    print("gold_answer:", result["gold_answer"])
    print("pred_answer:", result["best_answer"])
As shown in the code, the input data is a list, where each instance is a dict containing the query, answer, context, and id of one sample. The pipeline argument is the name of a fine-tuned model. EasyNLP currently provides Chinese and English machine reading comprehension pipelines for quick experience, backed by a fine-tuned Chinese MacBERT model and an English BERT model respectively. The Chinese machine reading comprehension model is 'macbert-base-rczh'; to try English reading comprehension, simply change the model name in the pipeline call above to 'bert-base-rcen'.
The output of the above code is shown below. The model accurately understands the text and the question, and gives the correct answers.
context: Xinhua News Agency, Hangzhou, September 22 (Reporters Shang Yiying and Xia Liang) The competition venues have all been completed and the function acceptance of the competition has been completed. The urban infrastructure construction of "Welcome to the Asian Games" has entered the fast lane, and the early opening of the Asian Games venues has set off a national fitness boom... On the 23rd, the postponed Hangzhou Asian Games will usher in the first anniversary of the countdown. Various preparations have also yielded fruitful results, and the "City of Paradise" is ready to go again. Hold a meeting well and upgrade a city. In April 2020, Hangzhou released the "Hangzhou Asian Games City Action Plan Outline", which includes eight specific actions such as infrastructure improvement, green water and green mountains protection, and digital governance empowerment. With the Asian Games approaching, Hangzhou West Railway Station, the third phase of Xiaoshan International Airport, the Hu-Hang section of the He-Hang high-speed railway, and the Airport Railroad Express (Line 19) and other major projects of "two points and two lines" have been officially put into operation. According to the information released by Hangzhou Urban and Rural Construction Committee, it is estimated that the total mileage of expressways in the city will reach 480 kilometers by the end of September this year. The people living here are experiencing the changes that have happened quietly: the traffic is more convenient, the roads have become more beautiful, and the urban infrastructure has become more and more perfect.
query: When did Hangzhou release the "Hangzhou Asian Games City Action Plan Outline"?
gold_answer: April 2020
pred_answer: April 2020
context: Xinhua News Agency, Hangzhou, September 22 (Reporters Shang Yiying and Xia Liang) The competition venues have all been completed and the function acceptance of the competition has been completed. The urban infrastructure construction of "Welcome to the Asian Games" has entered the fast lane, and the early opening of the Asian Games venues has set off a national fitness boom... On the 23rd, the postponed Hangzhou Asian Games will usher in the first anniversary of the countdown. Various preparations have also yielded fruitful results, and the "City of Paradise" is ready to go again. Hold a meeting well and upgrade a city. In April 2020, Hangzhou released the "Hangzhou Asian Games City Action Plan Outline", which includes eight specific actions such as infrastructure improvement, green water and green mountains protection, and digital governance empowerment. With the Asian Games approaching, Hangzhou West Railway Station, the third phase of Xiaoshan International Airport, the Hu-Hang section of the He-Hang high-speed railway, and the Airport Railroad Express (Line 19) and other major projects of "two points and two lines" have been officially put into operation. According to the information released by Hangzhou Urban and Rural Construction Committee, it is estimated that the total mileage of expressways in the city will reach 480 kilometers by the end of September this year. The people living here are experiencing the changes that have happened quietly: the traffic has become more convenient, the roads have become more beautiful, and the urban infrastructure has become more and more perfect.
query: How much will the total mileage of expressways in the city be at the end of September this year?
gold_answer: 480 kilometers
pred_answer: 480 kilometers
Below, we walk through the full implementation of the Chinese and English machine reading comprehension models, from data preparation to model training, evaluation, and prediction, and give several convenient one-step execution methods based on EasyNLP.
Data preparation
When fine-tuning a pre-trained language model for machine reading comprehension, users need to provide task-related training and validation data, both in tsv format. Each line of the file contains several columns separated by tab characters, covering all the information needed for reading comprehension training. From left to right, the columns are: sample ID, passage text (context), question (question), answer text (answer), start position of the answer in the passage, and passage title. A sample line is as follows:
DEV_125_QUERY_3 Glyoxal is an organic compound with the chemical formula OCHCHO, formed by two aldehyde groups (-CHO) joined together. It is the simplest dialdehyde and is a yellow liquid at room temperature. Industrially, glyoxal can be produced by gas-phase oxidation of ethylene glycol under the catalysis of silver or copper, or oxidation of acetaldehyde with nitric acid solution. In the laboratory, glyoxal is prepared by oxidation of acetaldehyde with selenous acid. Anhydrous glyoxal can be prepared by co-heating solid hydrate and phosphorus pentoxide. The applications of glyoxal are: Usually glyoxal is sold as a 40% solution. Similar to other small molecule aldehydes, it can form hydrates, and the hydrates condense to form a series of "oligomers", the structure of which is still unclear. At least the following two hydrates are currently sold: According to estimates, when the concentration of glyoxal in aqueous solution is lower than 1M, it mainly exists in the form of monomer or hydrate, namely OCHCHO, OCHCH(OH) or (HO)CHCH(OH). When the concentration is greater than 1M, it is mainly a dimer type, which may be an acetal/ketone structure, and the molecular formula is [(HO)CH]OCHCHO. How is glyoxal produced industrially? It can be obtained by the gas-phase oxidation of ethylene glycol under the catalysis of silver or copper, or the oxidation of acetaldehyde with nitric acid solution. 59 Glyoxal
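The six-column format can be parsed with a few lines of Python. This is a minimal sketch, assuming one example per line; the function name and dict keys are illustrative, not EasyNLP's own names:

```python
def parse_mrc_line(line):
    """Split one tab-separated training line into the six fields described
    above: ID, context, question, answer, answer start position, title."""
    qas_id, context, query, answer, start_pos, title = line.rstrip("\n").split("\t")
    return {
        "qas_id": qas_id,
        "context": context,
        "query": query,
        "answer_text": answer,
        "start_position": int(start_pos),
        "title": title,
    }

# A toy line in the same format as the sample above
line = "ID_1\tParis is the capital of France.\tWhat is the capital of France?\tParis\t0\tFrance"
example = parse_mrc_line(line)
print(example["answer_text"], example["start_position"])  # Paris 0
```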
The following files contain preprocessed Chinese and English machine reading comprehension training and validation data, which can be used directly for model training and testing:
# Chinese machine reading comprehension data
http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/machine_reading_comprehension/train_cmrc2018.tsv
http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/machine_reading_comprehension/dev_cmrc2018.tsv
# English machine reading comprehension data
http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/machine_reading_comprehension/train_squad.tsv
http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/machine_reading_comprehension/dev_squad.tsv
