EasyNLP Releases CKBERT, a Chinese Pre-trained Model Integrating Linguistic and Factual Knowledge
Pre-trained language models are widely used in NLP applications. However, classical pre-trained language models such as BERT lack an understanding of knowledge, for example the relational triples in a knowledge graph. Knowledge-enhanced pre-trained models use external knowledge (knowledge graphs, dictionaries, text, etc.) or sentence-internal linguistic knowledge for enhancement. We found that knowledge injection is typically accompanied by large-scale knowledge parameters, and that fine-tuning on downstream tasks still needs the support of external data to achieve good results, so such models cannot easily be offered to users in a cloud environment. CKBERT (Chinese Knowledge-enhanced BERT) is a Chinese pre-trained model developed by the EasyNLP team. It injects two types of knowledge into the model (external knowledge graphs and internal linguistic knowledge), and its injection mechanism is designed to make the model easy to scale. Our experiments also show that CKBERT outperforms many classical Chinese models in accuracy. In this framework upgrade, we contribute CKBERT models of various scales to the open-source community; these models are fully compatible with HuggingFace Models. In addition, users can easily use cloud resources to run CKBERT on the Alibaba Cloud machine learning platform PAI.
EasyNLP (https://github.com/alibaba/EasyNLP) is an easy-to-use and feature-rich Chinese NLP algorithm framework developed by the Alibaba Cloud Machine Learning PAI team on top of PyTorch. It bundles mainstream Chinese pre-trained models and model deployment technology, and provides a one-stop Chinese NLP development experience from training to deployment. EasyNLP offers simple interfaces for users to develop NLP models, including an NLP AppZoo and a pre-trained ModelZoo, and provides technical support to help users bring large pre-trained models into production effectively. Given the increasing demand for cross-modal understanding, EasyNLP also brings various cross-modal models, especially those for the Chinese domain, to the open-source community, hoping to serve more NLP and multi-modal algorithm developers and researchers, and to work with the community to advance NLP/multi-modal technology and the deployment of models.
This article gives a brief technical interpretation of CKBERT and shows how to use the CKBERT model in the EasyNLP framework, in HuggingFace Models, and on the Alibaba Cloud machine learning platform PAI.
Overview of Chinese Pre-trained Language Models
In this section, we first briefly review classic Chinese pre-trained language models. At present, Chinese pre-trained language models mainly fall into two types:
• Pre-trained language models for the general domain, mainly including the BERT, MacBERT and PERT models;
• Knowledge-enhanced Chinese pre-trained models, mainly including the ERNIE-Baidu, LatticeBERT, K-BERT and ERNIE-THU models.
Pre-trained Language Models for the General Domain
BERT for Chinese is Google's model trained directly on the Chinese Wikipedia corpus. MacBERT is an improved version of BERT: it introduces the MLM-as-correction (Mac) pre-training task, which alleviates the inconsistency between pre-training and downstream tasks. In the Masked Language Model (MLM), the [MASK] token is introduced for masking, but [MASK] never appears in downstream tasks. In MacBERT, similar words are used to replace the [MASK] token; the similar words are obtained through the Synonyms toolkit, whose algorithm is based on word2vec similarity. MacBERT also introduces Whole Word Masking and N-gram Masking: when an N-gram is masked, a similar word is searched for each word in the N-gram, and when no similar word is found, a random word is used as the replacement. Because shuffled text largely preserves semantics, PERT learns semantic knowledge from shuffled text: it permutes the word order of the original input to form shuffled text (and therefore introduces no extra [MASK] tokens), and its learning objective is to predict the original position of each token.
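The MacBERT-style masking described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MacBERT's actual implementation: the synonym lookup is a toy dictionary standing in for the word2vec-based Synonyms toolkit, and the function name and parameters are our own.

```python
import random

def mac_style_mask(words, synonyms, mask_ratio=0.15, seed=0):
    """Sketch of MacBERT's Mac masking: instead of inserting a [MASK]
    token, a masked word is replaced by a similar word (whole words are
    replaced at once, as in Whole Word Masking). `synonyms` is a toy
    stand-in for the word2vec-based Synonyms toolkit lookup."""
    rng = random.Random(seed)
    out = list(words)
    n_mask = max(1, int(len(words) * mask_ratio))
    for i in rng.sample(range(len(words)), n_mask):
        if words[i] in synonyms:
            out[i] = synonyms[words[i]]   # replace with a similar word
        else:
            out[i] = rng.choice(words)    # no similar word: use a random word
    return out
```

The key point the sketch captures is that the corrupted input contains only ordinary words, so the pre-training input distribution matches what the model sees downstream.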
Knowledge-Enhanced Chinese Pre-trained Models
The data used by BERT during pre-training only masks individual characters. For example, as shown in the figure below, when training BERT, the character "尔" (Er) is predicted from the local co-occurrence of "哈" (Ha) and "滨" (Bin), but the model does not learn any knowledge related to "Harbin" (哈尔滨): it only learns the surface word "Harbin" without knowing what "Harbin" means. ERNIE-Baidu instead masks whole words during pre-training so as to learn representations of words and entities. By masking words such as "Harbin" and "ice and snow", the model can capture the relationship between "Harbin" and "Heilongjiang" and learn that "Harbin" is the capital of "Heilongjiang" and an ice-and-snow city.
Like ERNIE-Baidu, LatticeBERT uses a word-lattice structure to integrate word-level information. Specifically, LatticeBERT designs a lattice position attention mechanism to express word-level information, and proposes the Masked Segment Prediction task to encourage the model to learn from the rich but redundant information inside the lattice.
In addition to linguistic knowledge, more work enriches the representations of Chinese pre-trained models with factual knowledge from knowledge graphs. Among them, K-BERT proposes a knowledge-graph-oriented, knowledge-enhanced language model that injects triples into sentences as domain knowledge. However, injecting too much knowledge leads to knowledge noise and makes sentences deviate from their correct meaning. To overcome knowledge noise, K-BERT introduces soft positions and a visible matrix to limit the influence of the injected knowledge. Because K-BERT can load model parameters from a pre-trained BERT, it can inject domain knowledge into the model without pre-training on the knowledge graph itself. The EasyNLP framework also integrates the K-BERT model and its functionality (see here).
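The visible-matrix idea can be illustrated with a short sketch. This is a simplified, hypothetical reconstruction of the mechanism (index-based, ignoring soft positions; the function name and span format are ours): sentence tokens attend to each other, while each injected triple branch is visible only to its anchor entity and to itself, so knowledge tokens cannot distort the meaning of unrelated sentence tokens.

```python
def build_visible_matrix(n_sent, branches):
    """Sketch of a K-BERT-style visible matrix. Sentence tokens
    0..n_sent-1 are mutually visible; each injected knowledge branch,
    given as (anchor_index, [branch_token_indices]), is visible only to
    its anchor entity token and within itself, which limits knowledge
    noise. Returns an N x N 0/1 matrix."""
    n_total = n_sent + sum(len(b) for _, b in branches)
    M = [[0] * n_total for _ in range(n_total)]
    # sentence tokens all see each other
    for i in range(n_sent):
        for j in range(n_sent):
            M[i][j] = 1
    # each branch sees its anchor and itself only
    for anchor, branch in branches:
        group = [anchor] + list(branch)
        for i in group:
            for j in group:
                M[i][j] = 1
    return M
```

For example, with 3 sentence tokens and a two-token triple branch attached to token 1, the branch tokens can attend to token 1 but not to tokens 0 or 2.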
ERNIE-THU is a pre-trained model that integrates knowledge embeddings. It first uses TAGME to extract entities from the text and link them to the corresponding entity objects in the KG, and then obtains the embeddings of these entities. The entity embeddings are trained with knowledge-representation methods such as TransE. In addition, ERNIE-THU extends the BERT model: besides the MLM and NSP tasks, it adds a new KG-related pre-training objective that masks the alignment between tokens and entities and requires the model to select the appropriate entity from the graph to restore the alignment.
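The entity embeddings mentioned here come from translation-based methods such as TransE, whose core idea is that for a true triple (h, r, t) the embeddings satisfy h + r ≈ t, so the plausibility of a triple is measured by the distance ||h + r − t||. A minimal numerical sketch (the function name and plain-list vectors are illustrative):

```python
def transe_score(head, relation, tail):
    """TransE distance sketch: a triple (h, r, t) is plausible when
    h + r ≈ t, so the score is the L2 distance ||h + r - t||
    (smaller means more plausible)."""
    return sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)) ** 0.5
```

Training then pushes this distance down for observed triples and up for corrupted ones, which is what gives the entity embeddings their relational structure.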
Technical Details of the Self-Developed CKBERT Model
Most current knowledge-enhanced pre-trained models use external knowledge (knowledge graphs, dictionaries, text, etc.) or sentence-internal linguistic knowledge for enhancement. Since knowledge injection is accompanied by large-scale knowledge parameters, and downstream fine-tuning still needs the support of external data to achieve good results, such models are hard to offer to users in a cloud environment. CKBERT (Chinese Knowledge-enhanced BERT) is a Chinese pre-trained model developed by the EasyNLP team. It injects two types of knowledge into the model (external knowledge graphs and internal linguistic knowledge), and its injection mechanism is designed to make the model easy to scale. For practical business needs, we provide three models with different parameter scales.
To make it easy to scale the model parameters, CKBERT only changes the data input and the pre-training tasks, not the model architecture; the model structure is therefore aligned with the community version of BERT. At the data input layer, two kinds of knowledge are processed: external graph triples and sentence-level internal linguistic knowledge. For the linguistic knowledge, we use the LTP platform to process sentence data, label semantic roles and perform dependency parsing, and then tag the important components in the parsing results according to rules. For the external triple knowledge, positive and negative triple samples are constructed for the entities appearing in a sentence: positive samples are drawn from the 1-hop entities in the graph, and negative samples from multi-hop entities. However, negative sampling is restricted to a specified multi-hop range, so that negatives are not too far away in the graph.
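The hop-bounded sampling described above can be sketched with a breadth-first search over the graph. This is an illustrative reconstruction under assumptions (the helper names and the adjacency-list graph format are ours, not EasyNLP's actual data pipeline):

```python
from collections import deque

def entities_by_hop(graph, start, max_hop):
    """BFS over an adjacency-list KG, returning {entity: hop distance}
    for all entities within max_hop hops of start."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        if dist[u] == max_hop:
            continue
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def sample_candidates(graph, entity, max_hop=3):
    """Positive candidates are 1-hop neighbours of the entity; negative
    candidates are multi-hop entities, but only within the specified hop
    range, so negatives are never arbitrarily far away in the graph."""
    dist = entities_by_hop(graph, entity, max_hop)
    positives = [e for e, d in dist.items() if d == 1]
    negatives = [e for e, d in dist.items() if 2 <= d <= max_hop]
    return positives, negatives
```

Bounding the hop range keeps the negatives related enough to the entity to be informative for contrastive learning, rather than trivially distinguishable.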
CKBERT is pre-trained with two tasks: a linguistics-aware masked language model and multi-hop knowledge contrastive learning:
• Linguistics-aware MLM: the subject roles in semantic dependencies (agent AGT and experiencer EXP) are masked with [MASK], and [SDP] and [/SDP] are added before and after the word to carry the word-boundary information. In the dependency syntax relations, the subject-predicate, attribute-head and coordination relations, among others, are masked in the same way with [DEP] and [/DEP]. Overall, 15% of the tokens in a sentence are selected for pre-training; of these, 40% are randomly masked, and 30% each are allocated to semantic dependencies and dependency syntax.
• Multi-hop knowledge contrastive learning: using the positive and negative samples constructed above for the injected entities, one positive sample and four negative samples are built for each entity in the sentence, and external knowledge is learned through the standard InfoNCE loss. The loss function is as follows:
L = -log [ exp(sim(h_e, t⁺)/τ) / ( exp(sim(h_e, t⁺)/τ) + Σ_{i=1..4} exp(sim(h_e, t⁻_i)/τ) ) ]

where h_e is the contextual entity representation generated by the pre-trained model, t⁺ is the representation of the positive-sample triple, and t⁻_i is the representation of the i-th negative-sample triple.
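For one entity with one positive and four negative triples, InfoNCE reduces to a softmax cross-entropy over five similarity scores. A minimal numerical sketch (the function name and the temperature value are illustrative, not CKBERT's actual hyperparameters):

```python
import math

def info_nce_loss(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE loss sketch for one entity: sim_pos is the similarity
    between the contextual entity representation and the positive triple,
    sim_negs the similarities with the negative triples. The loss is the
    negative log-softmax of the positive among all candidates."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    denom = sum(math.exp(l) for l in logits)
    return -math.log(math.exp(logits[0]) / denom)
```

The loss shrinks as the positive similarity grows relative to the negatives, which is exactly the pressure that pulls the entity representation toward its true 1-hop neighbourhood.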
Implementation of the CKBERT Model
In the EasyNLP framework, our model implementation is divided into three parts: data preprocessing, model structure fine-tuning and loss function design. The data preprocessing phase consists of two steps: (1) extraction of NER entities and semantic relations; (2) injection of knowledge graph information. For the extraction of entities and semantic information, LTP (Language Technology Platform) is used to segment and parse the original sentences.
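After parsing, the spans tagged by the parser are turned into marker-annotated token sequences like the [SDP]/[DEP] examples above. The sketch below is a hypothetical illustration of that step (the span format, function name and non-overlapping-span assumption are ours, not the real EasyNLP preprocessing code):

```python
def insert_knowledge_markers(tokens, spans):
    """Wrap parser-tagged spans with boundary markers. Each span is
    (start, end, kind) over token indices, with kind 'SDP' for semantic
    roles or 'DEP' for dependency relations; spans are assumed
    non-overlapping. Produces e.g. ... [SDP] tok [/SDP] ..."""
    opens = {s: k for s, _, k in spans}
    closes = {e: k for _, e, k in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in opens:
            out.append(f"[{opens[i]}]")
        out.append(tok)
        if i in closes:
            out.append(f"[/{closes[i]}]")
    return out
```

The resulting sequence is what the linguistics-aware MLM sees, so the model receives explicit word-boundary signals alongside the masked tokens.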
Accelerated Pre-training of CKBERT
Pre-training CKBERT requires a great deal of time and computing resources, so we need to accelerate it. CKBERT is implemented in the PyTorch framework; compared with the Graph Execution of TensorFlow 1.x, PyTorch's Eager Execution is easy to use, develop and debug, but it lacks a graph IR (Intermediate Representation) of the model and therefore cannot be optimized further. Inspired by LazyTensor and PyTorch/XLA (https://github.com/pytorch/xla), the PAI team developed TorchAccelerator on top of the PyTorch framework, aiming to solve training-optimization problems in PyTorch and to improve training speed while preserving usability and debuggability.
LazyTensor still has many defects in converting Eager Execution to Graph Execution. By encapsulating custom operations as XLA CustomCalls and analyzing Python code through the AST, TorchAccelerator improves the completeness and performance of this conversion, and further improves the compilation-optimization effect through multi-stream optimization, asynchronous tensor transfer and other means.
Using the CKBERT Model on the Alibaba Cloud Machine Learning Platform PAI
PAI-DSW (Data Science Workshop) is a cloud IDE developed by the Alibaba Cloud machine learning platform PAI. It provides an interactive programming environment (documentation) for developers of all levels. The DSW Gallery offers various Notebook examples so that users can easily learn DSW and build machine learning applications. We have also published a sample Notebook in the DSW Gallery (see the figure below) that uses CKBERT for Chinese named entity recognition. Welcome to try it out!
Knowledge Base Team