How Language Models Have Evolved

Language Model Overview

A language model essentially answers one question: is a given sentence reasonable?

Historically, language models have passed through three stages: expert grammar-rule models (until the 1980s), statistical language models (until the 2000s), and neural network language models (up to the present).

Expert grammar-rule models were hand-written grammars for natural language, developed in the early days of computing alongside the grammars of programming languages. However, natural language is diverse and colloquial, it changes across time and regions, and human readers correct errors easily. Covering all of this made the rule sets grow rapidly until the approach became unsustainable.

Statistical language models use a simple method together with a large corpus to get better results. They model the probability distribution of sentences: statistically, a sentence with high probability is more reasonable than one with low probability. In practice, the model predicts the next word of a sentence from the preceding text. If a word has a higher conditional probability than the alternatives given the preceding text, then the preceding text followed by that word is more probable, and therefore more reasonable, than the preceding text followed by any other word.

Building on the statistical language model, the neural network language model stacks layers and extracts features layer by layer, so it can represent not only word forms but also similarity, syntax, semantics, and pragmatics.

Language models can be used for a variety of NLP tasks, such as prediction, error correction, and reasoning. This article focuses on the development of statistical language models and (deep learning) neural network language models.

Statistical Language Model

Statistical language models estimate the probability distribution of sentences from counts, which usually requires a large amount of data. For a sentence $S = w_1 w_2 \dots w_n$, the sequence probability is $P(S) = P(w_1, w_2, \dots, w_n)$, and by the chain rule the probability of the entire sentence is:

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

The probability of each word is estimated from corpus counts, where $c(\cdot)$ is the number of times a word sequence occurs:

$$P(w_i \mid w_1, \dots, w_{i-1}) = \frac{c(w_1, \dots, w_{i-1}, w_i)}{c(w_1, \dots, w_{i-1})}$$

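This approach has two problems: the parameter space is enormous, because the conditioning history can be arbitrarily long, and the data are severely sparse, because most long histories never appear in any corpus.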


The n-gram is the most common statistical language model. Its basic idea is to slide a window of size n over the text to form subsequences of n words, and to count how often each subsequence occurs. Intuitively, an n-gram model shortens the history so that each word depends only on the previous n-1 words:

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

Note that $P(w_1)$ here actually means the probability that $w_1$ is the first word of the sentence. In practice a start symbol should be added, so for a bigram model the first factor is $P(w_1 \mid \langle s \rangle)$. Similarly, an end symbol $\langle /s \rangle$ should be added at the end of the sentence, so that the model covers sequences of arbitrary length.

A small example illustrates this. Suppose we have the following corpus:

Mice (老鼠) are disgusting, mice (老鼠) are ugly, you love your wife (老婆), I hate mice (老鼠).

The example works at the character level in Chinese: "mouse" (老鼠) and "wife" (老婆) both begin with the character 老 ("old"). We want to predict the character that follows "I love 老" (我爱老), using a bigram model and a trigram model in turn.

With the bigram model we need P(w | 老). In the corpus, 老 is followed by 鼠 (giving "mouse") 3 times and by 婆 (giving "wife") once. By maximum likelihood estimation, P(鼠 | 老) = 0.75 and P(婆 | 老) = 0.25, so the bigram prediction for the whole sentence is "I love mice".

With the trigram model we need P(w | 爱老), where 爱 means "love". In the corpus, 爱老 occurs once and is followed by 婆, so maximum likelihood estimation gives P(婆 | 爱老) = 1, and the trigram prediction is "I love my wife". This result is clearly more reasonable.
The example suggests that as n increases, we have more context and can predict the next word more accurately. This also brings a problem: the data become sparser as n increases, so many estimated probabilities are 0. The zero-probability problem caused by n-gram sparsity can be alleviated by smoothing.
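
The bigram and trigram estimates above can be reproduced with a few lines of counting. This is a minimal sketch, not the article's own code: it assumes the example is character-level Chinese, in which "mouse" (老鼠) and "wife" (老婆) share the first character 老, and the strings in `CORPUS` are an assumed rendering of the four clauses. Start and end symbols are omitted for brevity.

```python
from collections import Counter, defaultdict

# Assumed character-level rendering of the example corpus (four clauses).
CORPUS = ["老鼠真讨厌", "老鼠真丑", "你爱老婆", "我讨厌老鼠"]

def ngram_counts(sentences, n):
    """Map each (n-1)-character context to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for s in sentences:
        for i in range(len(s) - n + 1):
            context, nxt = s[i:i + n - 1], s[i + n - 1]
            counts[context][nxt] += 1
    return counts

def mle(counts, context):
    """Maximum likelihood estimate of P(next | context); empty dict if the context is unseen."""
    total = sum(counts[context].values())
    return {ch: c / total for ch, c in counts[context].items()} if total else {}

bigram = mle(ngram_counts(CORPUS, 2), "老")     # context: 老
trigram = mle(ngram_counts(CORPUS, 3), "爱老")  # context: 爱老
print(bigram, trigram)
```

A context that never occurs in the corpus returns an empty distribution, which is the zero-probability problem that smoothing is meant to address.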

The fundamental reason n-gram models need smoothing is data sparsity, which is inherent in natural language. Whenever such sparsity exists, smoothing can be used to alleviate it and improve performance. Even if you believe you have enough data that sparsity is not an issue, you can usually improve performance by moving to a more complex model with more parameters, for example by increasing n. As n increases, the parameter space grows exponentially and sparsity becomes a problem again; at that point a suitable smoothing method again gives better performance.

Smoothing in n-grams

In a finite data set, probabilities estimated for high-frequency events are more reliable than those estimated for low-frequency events, and the effect is more pronounced in high-dimensional spaces. If the distribution of the training set is used directly as the distribution of the test set, every unseen word (sometimes called an "unregistered" or out-of-vocabulary word) gets probability 0. That is clearly wrong, because unseen words can appear in the test set: not having seen a word does not mean it does not exist. Smoothing addresses this by using the frequency distribution of the training data to estimate the probability of unseen words, which gives a better probability estimate on the test set. The commonly used smoothing methods are introduced next.

Laplace Smoothing

For a bigram model with vocabulary size $|V|$ and counts $c(\cdot)$, Laplace smoothing estimates:

$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + |V|}$$

As the formula shows, Laplace smoothing adds 1 to the count of every event. It is the most intuitive and easy-to-understand smoothing method. Adding 1 has little effect on the probability of high-frequency events, and it gives unseen words a nonzero probability.

However, adding 1 directly is also a relatively crude smoothing method. Sometimes it distorts the data distribution significantly and sometimes its effect is negligible. Laplace smoothing also assigns every unseen event the same probability, which is clearly unreasonable. Additive smoothing generalizes Laplace smoothing with a hyperparameter, which makes it more flexible and more effective.

Additive Smoothing

In additive smoothing, we assume that every n-gram event has already occurred δ times, where in general 0 < δ ≤ 1. When δ = 1 this is equivalent to Laplace smoothing; when δ < 1, less probability mass is assigned to unseen words. The method is more general, but it is still widely criticized for its practical performance.
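
To make the effect of δ concrete, here is a minimal sketch of add-δ smoothing for bigrams. The function name, the toy counts, and the four-word vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter

def additive_prob(word, context, bigram_counts, context_counts, vocab_size, delta=1.0):
    """P(word | context) with add-delta smoothing; delta=1 is Laplace smoothing."""
    num = bigram_counts[(context, word)] + delta
    den = context_counts[context] + delta * vocab_size
    return num / den

# Toy counts (hypothetical): context "a" was seen 4 times, followed by "b" 3 times and "c" once.
bigram_counts = Counter({("a", "b"): 3, ("a", "c"): 1})
context_counts = Counter({"a": 4})
VOCAB = ["a", "b", "c", "d"]

# Probability of a bigram that never occurred, under two values of delta.
laplace_unseen = additive_prob("d", "a", bigram_counts, context_counts, len(VOCAB), delta=1.0)
small_unseen = additive_prob("d", "a", bigram_counts, context_counts, len(VOCAB), delta=0.1)
print(laplace_unseen, small_unseen)
```

With δ = 1 the unseen bigram receives probability 1/8; with δ = 0.1 it receives much less, and the observed bigrams keep more of their mass. In both cases the probabilities over the vocabulary still sum to 1.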

Good-Turing Smoothing

Note: Generally speaking, more words appear once than appear twice, and more words appear twice than appear three times. This pattern follows from Zipf's law.
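
As a hedged illustration of how these "counts of counts" are used: the standard Good-Turing estimate replaces a raw count c with c* = (c + 1) N_{c+1} / N_c, where N_c is the number of distinct events seen exactly c times, and it reserves a total probability of N_1 / N for unseen events. The sketch below is general background under that textbook definition; the fallback to the raw count when N_{c+1} = 0 and the toy data are assumptions of this example.

```python
from collections import Counter

def good_turing(counts):
    """Return (adjusted count for each observed count c, probability mass reserved for unseen events)."""
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())   # N_c: number of events seen exactly c times
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        # c* = (c + 1) * N_{c+1} / N_c; assumed fallback: keep the raw count when N_{c+1} is 0
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    unseen_mass = freq_of_freq.get(1, 0) / total   # N_1 / N
    return adjusted, unseen_mass

word_counts = Counter("a a a b b c c d e f".split())
adjusted, unseen_mass = good_turing(word_counts)
print(adjusted, unseen_mass)
```

Practical implementations usually smooth the N_c values first, because N_{c+1} is often zero or noisy for large c.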

n-gram Summary

The n-gram model has many advantages. It is an intuitive way to process natural language, it reduces the parameter space, and it is highly interpretable. It retains all the information in the previous n-1 words without loss or forgetting, and its computation is simple. It also has inherent defects: it cannot model long-range dependencies, and it is still severely affected by data sparsity when n is large, so in practice usually only bigrams or trigrams are used. Because it is based on frequency counts, it also generalizes poorly. For these reasons, neural network language models have gradually replaced traditional statistical language models and become mainstream. The neural network language model is introduced next.

Neural Network Language Models (2003)

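The NNLM works in three steps. Each context word is input as a one-hot vector and multiplied by the shared matrix C, which selects the word vector at that word's index; the word vectors are concatenated into an input x.

The input passes through the nonlinear layer, roughly $h = \tanh(Wx + b)$.

A fully connected layer followed by softmax then gives the probability distribution of the next word: $z = Uh$, $y = \mathrm{softmax}(z)$.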

Compared with n-gram models, the NNLM does not need to compute and store all probability values in advance; it computes them with a function. It introduces word vectors, which can express the similarity of words (that is, semantic and grammatical features). Because the parameters are fitted by a neural network and the output uses softmax, it estimates the joint probability of word sequences more smoothly than n-gram models, and it performs well on sentences containing word combinations not seen in training. However, its computational cost is still very high, concentrated in the nonlinear layer h and the output layer z, that is, in the operations with the matrices W and U and in the softmax.
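
The forward pass described above can be sketched in plain Python. This is a minimal illustration, not the original model: the sizes are toy values, the weights are random rather than trained, and the bias `b` and the omission of the direct input-to-output connection are assumptions of this sketch.

```python
import math
import random

random.seed(0)
V, D, N_CTX, H = 6, 4, 2, 5   # vocabulary size, embedding size, context words, hidden units (toy values)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    m = max(z)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

C = rand_matrix(V, D)            # shared embedding matrix, one row per word
W = rand_matrix(H, N_CTX * D)    # hidden-layer weights
b = [0.0] * H                    # hidden-layer bias (assumed)
U = rand_matrix(V, H)            # output-layer weights

def nnlm_forward(context_ids):
    # Multiplying a one-hot vector by C is a row lookup; concatenate the looked-up rows.
    x = [value for idx in context_ids for value in C[idx]]
    h = [math.tanh(a + bias) for a, bias in zip(matvec(W, x), b)]
    z = matvec(U, h)             # one score per vocabulary word: this is the expensive step
    return softmax(z)

probs = nnlm_forward([1, 3])
print(probs)
```

The output layer computes a score for every word in the vocabulary, which is why the cost is dominated by U and the softmax when the vocabulary is large.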

RNN Language Model (2010)

The problem with the n-gram language model is that its ability to capture long-range dependencies in sentences is very limited. The NNLM requires a fixed-length input (generally 5-10 words); intuitively, it is an n-gram model encoded by a neural network, so it cannot solve the long-range dependency problem either. Language modeling is a sequence prediction problem, and the RNN is a model naturally suited to sequences. In 2010, Mikolov proposed RNNLM, whose history is all the preceding words in the sentence, so it can capture longer context. RNNLM later helped inspire ELMo.

The output of RNNLM, output(t), is the predicted probability distribution of the word at the next position. The input has two parts, the word vector input(t) of the word at the current position and the hidden state context(t-1) from time t-1, which are added together. The hidden layer uses the sigmoid activation function.

Intuitively, the RNN removes the limit of a fixed context window and uses the hidden state to summarize all the preceding context. Compared with NNLM, it can capture longer dependencies and achieves better results in experiments. RNNLM has fewer hyperparameters and is more general; however, because RNNs suffer from vanishing gradients, it is still difficult to capture very long-range dependencies.
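
One recurrent step of such a model can be sketched as follows. The names `W_in`, `W_rec`, and `W_out`, the sizes, and the random weights are assumptions for illustration; only the structure (projected current word plus projected previous state, sigmoid hidden layer, softmax output) follows the description above.

```python
import math
import random

random.seed(0)
V, H = 5, 3   # vocabulary size and hidden size (toy values)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

W_in = rand_matrix(H, V)    # projects the one-hot current word, input(t)
W_rec = rand_matrix(H, H)   # projects the previous hidden state, context(t-1)
W_out = rand_matrix(V, H)   # maps the hidden state to vocabulary scores

def rnn_step(word_id, context_prev):
    one_hot = [1.0 if i == word_id else 0.0 for i in range(V)]
    pre = [a + b for a, b in zip(matvec(W_in, one_hot), matvec(W_rec, context_prev))]
    context = [sigmoid(p) for p in pre]          # context(t)
    output = softmax(matvec(W_out, context))     # output(t): distribution over the next word
    return output, context

context = [0.0] * H
for word_id in [0, 2, 4, 1]:                     # a toy sentence as word ids
    output, context = rnn_step(word_id, context)
print(output)
```

Because `context` is carried from step to step, the prediction at each position depends on every earlier word, not just on a fixed window.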

CBOW & Skip-gram (2013)

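word2vec, proposed by Mikolov, is currently the most popular tool for training word vectors. Its network structure is similar to that of NNLM; the main difference is the training method. It offers two training methods: CBOW, which predicts the center word from its surrounding context, and Skip-gram, which predicts the surrounding context from the center word.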

Both methods differ considerably from NNLM. The main task of NNLM is to predict the probability of sentences; its word vectors use only the preceding text and are an intermediate product. word2vec is designed to obtain distributed representations of words, learning their semantic and grammatical information from the context on both sides.

Besides a network structure that uses context on both sides, the other major contribution of word2vec is computational optimization. word2vec removes the nonlinear hidden layer; CBOW sums the input vectors and feeds the result directly to the softmax, which eliminates most of the hidden-layer computation. The softmax computation itself is also greatly reduced through hierarchical softmax and negative sampling.

Hierarchical Softmax

Unlike the softmax output of a traditional neural network, the hierarchical softmax of word2vec turns the output layer into a Huffman tree. The tree is built in advance, each non-leaf node holds a trainable vector, and the hidden layer output is combined with that vector to compute a sigmoid probability at the node. The leaf nodes represent the |V| words in the vocabulary, and each leaf corresponds to a unique path from the root. The goal is to maximize the probability of the path to the target word. To do so, only the vectors of the nodes on the path from the root to that leaf need to be updated, rather than the output probabilities of all words, which greatly reduces the time for each training update.

The output probability of a word is the product of the branch probabilities along the path from the root node to its leaf node.
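
The path product can be sketched with a tiny hand-built tree. The tree shape, node names, vectors, and the convention that the sigmoid gives the probability of the left branch are all hypothetical choices for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical tree over 4 words with internal nodes "root", "n1", "n2".
NODE_VECTORS = {"root": [0.2, -0.1], "n1": [0.5, 0.3], "n2": [-0.4, 0.1]}

# Path of each word as (internal node, branch) pairs; branch 0 = left, 1 = right.
# As in a Huffman tree, the most frequent word ("the") gets the shortest path.
PATHS = {
    "the": [("root", 0)],
    "cat": [("root", 1), ("n1", 0)],
    "sat": [("root", 1), ("n1", 1), ("n2", 0)],
    "mat": [("root", 1), ("n1", 1), ("n2", 1)],
}

def word_prob(word, hidden):
    """Product of branch probabilities from the root to the word's leaf."""
    p = 1.0
    for node, branch in PATHS[word]:
        p_left = sigmoid(dot(NODE_VECTORS[node], hidden))
        p *= p_left if branch == 0 else 1.0 - p_left
    return p

hidden = [1.0, 2.0]   # toy hidden-layer output
total = sum(word_prob(w, hidden) for w in PATHS)
print({w: word_prob(w, hidden) for w in PATHS}, total)
```

Each word touches only the nodes on its own path, about log |V| of them in a balanced tree, yet the leaf probabilities still sum to 1 without an explicit normalization over the whole vocabulary.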

Negative Sampling
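
Negative sampling replaces the full softmax with a handful of binary decisions: the true (center, context) pair should score high, and k randomly drawn "negative" words should score low, so each update touches only k + 1 output vectors. The sketch below assumes the skip-gram formulation, with negatives drawn in proportion to word count raised to the power 0.75 as in the word2vec paper; the function names, vectors, and counts are toy values for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(center, positive, negatives):
    """-log sigma(u_pos . v) - sum over negatives of log sigma(-u_neg . v)."""
    loss = -math.log(sigmoid(dot(positive, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss

def sample_negatives(word_counts, k, exclude, rng):
    """Draw k noise words with probability proportional to count ** 0.75."""
    words = [w for w in word_counts if w != exclude]
    weights = [word_counts[w] ** 0.75 for w in words]
    return rng.choices(words, weights=weights, k=k)

center = [1.0, 0.0]
negatives = [[-1.0, 0.5], [0.0, 1.0]]
loss_good = negative_sampling_loss(center, [2.0, 0.0], negatives)   # context vector aligned with center
loss_bad = negative_sampling_loss(center, [-2.0, 0.0], negatives)   # context vector opposed to center
noise = sample_negatives({"the": 50, "cat": 5, "sat": 3, "mat": 2}, 3, "cat", random.Random(0))
print(loss_good, loss_bad, noise)
```

The loss is lower when the true context vector points in the same direction as the center vector, which is what training pushes toward.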

Do word vectors need to be fine-tuned?

word2vec has become standard equipment for deep learning in NLP, and the quality of the pre-trained word vectors directly affects model performance. Thanks to its network structure, word2vec is easy to pre-train on massive data sets. The semi-supervised recipe of pre-training on a massive corpus and then fine-tuning on task data has gained ground, and it shows clear advantages when labeled task data are scarce. The TextCNN paper compared four ways of using word vectors:

CNN-rand: All word vectors are randomly initialized and updated during training;
CNN-static: The word vectors are pre-trained with word2vec and kept fixed during training (feature-based); only the other parameters are learned;
CNN-non-static: The pre-trained word vectors are fine-tuned during training;
CNN-multichannel: Two channels, one fixed (feature-based) and one fine-tuned. Both are initialized with the word2vec vectors and every filter is applied to both channels, but gradients are backpropagated only through the fine-tuned channel during training.

The experimental results show that the non-static method is slightly better than the static method most of the time, and the static method is better than the rand method (in which the word embeddings are randomly initialized). On smaller data sets, the multichannel method, which combines static and non-static, performs better than either alone. This mixed approach is a compromise: the fine-tuned word vectors should not drift too far from the original vectors, but they should still have some room to change.

GloVe (2014)
Because word2vec came first and GloVe is more complex, GloVe clearly has fewer users. Still, experiments show that GloVe works better in some scenarios, and its ideas are well worth studying.

GloVe was proposed in 2014. At that time there were two mainstream ways of generating word vectors, one based on matrix factorization and the other based on a shallow sliding window; GloVe combines the advantages of both. GloVe is a count-based (statistical) model. Its main idea is to introduce the co-occurrence matrix and to use the dot product of word vectors to approximate the logarithm of the global co-occurrence count. In essence, it reduces the dimension of the co-occurrence matrix.
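
The objective behind this idea is usually written as a weighted least-squares fit of w_i · w~_j + b_i + b~_j to log X_ij, where X_ij is the co-occurrence count. The sketch below is a simplified illustration: it uses unweighted window counts (the original weights pairs by distance), the weighting parameters x_max = 100 and alpha = 0.75 are the values commonly cited from the GloVe paper, and all function names are hypothetical.

```python
import math
from collections import Counter

def cooccurrence(tokens, window=2):
    """Symmetric co-occurrence counts within a fixed window (unweighted, a simplification)."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(counts, vec, ctx_vec, bias, ctx_bias):
    """Sum over observed pairs of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    loss = 0.0
    for (wi, wj), x in counts.items():
        pred = sum(a * b for a, b in zip(vec[wi], ctx_vec[wj])) + bias[wi] + ctx_bias[wj]
        loss += weight(x) * (pred - math.log(x)) ** 2
    return loss

tokens = "the cat sat on the cat".split()
X = cooccurrence(tokens, window=1)
vocab = sorted(set(tokens))
zeros = {w: [0.0, 0.0] for w in vocab}      # untrained vectors, all zero
zero_bias = {w: 0.0 for w in vocab}
loss_at_zero = glove_loss(X, zeros, zeros, zero_bias, zero_bias)
print(X[("the", "cat")], loss_at_zero)
```

Training would adjust the vectors and biases by gradient descent to reduce this loss; only pairs that actually co-occur contribute, which is why the method works from global counts rather than from individual windows.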

In practice, GloVe and word2vec each have their advantages, and which word vectors to use for a specific task still needs to be decided by experiment. GloVe also parallelizes better and tends to train faster, but it consumes more memory.

ELMo (2018)

ELMo pre-trains an embedding layer and bidirectional LSTM layers; the vectors they produce for each word are then used in downstream tasks.

For a downstream task, each word is passed through the pre-trained network to obtain its embedding-layer vector and the outputs of the multi-layer bidirectional LSTM; these are combined by a normalized weighted sum into a single vector that is fed to the downstream task. Compared with the static word vectors of word2vec, ELMo obtains word vectors at the embedding layer and then adds context information through a two-layer LSTM network, producing dynamic word vectors.
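
The normalized weighting can be sketched as a softmax over one learnable score per layer, followed by a task-specific scale. The function name `elmo_combine`, the scale `gamma`, and the two-dimensional toy vectors are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def elmo_combine(layer_vectors, layer_scores, gamma=1.0):
    """Weighted sum of a word's per-layer vectors; weights are softmax-normalized scores."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [gamma * sum(w * layer[d] for w, layer in zip(weights, layer_vectors))
            for d in range(dim)]

# Toy vectors for one word: embedding layer, LSTM layer 1, LSTM layer 2.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined = elmo_combine(layers, [0.0, 0.0, 0.0])   # equal scores give a plain average
print(combined)
```

In a downstream task the scores and `gamma` would be learned together with the task model, so each task can decide how much to rely on each layer.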

ELMo learns not only word vectors but also a two-layer bidirectional LSTM network. Experiments show that the lower LSTM layer captures syntactic information about words and the higher LSTM layer captures semantic information. Because the LSTM network dynamically adjusts the underlying word embedding according to context, ELMo can handle polysemous words.

GPT (Generative Pre-Training) (2018)

GPT (Generative Pre-Training), proposed in 2018, is a unidirectional language model based on a multi-layer Transformer. It works in two stages: first it is pre-trained as a language model on a massive corpus, then the model is fine-tuned to solve downstream tasks. GPT can be applied to a rich variety of task types.

At first glance, GPT is very similar to ELMo. The main differences are:

It uses the Transformer instead of the LSTM to extract features. The Transformer is currently the strongest feature extractor in the NLP field and can extract semantic features more fully.

It sticks to a unidirectional language model. A notable feature of ELMo is that it uses context on both sides to represent word vectors, while GPT uses only the preceding text to predict what follows. This choice is closer to the way humans read, but it also has limitations. For example, in reading comprehension we usually need both the preceding and the following context to make decisions, so considering only the preceding text causes pre-training to lose a lot of information. (Note: BERT addresses this.)

It is used differently after pre-training. ELMo is a feature-based pre-training method, while GPT is fine-tuned after pre-training (similar to transfer learning for images).

GPT achieved state-of-the-art results on 9 of 12 NLP tasks. It is an important milestone in the development of language models and strongly influenced later work.

BERT (2018)
ELMo represents context by concatenating word vectors from two directions, and GPT obtains vector representations with a unidirectional language model. BERT, proposed in 2018, improves on GPT in a way that resembles how CBOW improved on NNLM: the task changes from predicting the next word in a sentence to masking out a word and predicting it from the context on both sides, with an additional task of predicting whether one sentence follows another. The model structure follows GPT, stacking Transformer self-attention and feed-forward network (FFN) layers to make full use of the Transformer's feature extraction ability.
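
The "mask out a word" step can be sketched as a data-preparation function. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random word, and the unchanged word follow the BERT paper; the function name, the toy vocabulary, and the use of `None` to mark positions that are not predicted are assumptions of this sketch.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (model inputs, labels); labels are None where no prediction is required."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                     # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")            # 80%: replace with the mask symbol
            elif r < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: replace with a random word
            else:
                inputs.append(tok)                 # 10%: keep the word unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

tokens = "the cat sat on the mat".split()
VOCAB = ["dog", "ran", "under", "a", "tree"]       # toy vocabulary for random replacement
inputs, labels = mask_tokens(tokens, VOCAB, random.Random(0), mask_prob=0.5)
print(inputs, labels)
```

Because the masked position can see words on both sides, the model learns bidirectional context, which a next-word objective cannot provide.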

GPT adds start symbols, end symbols, and separators only during fine-tuning. BERT adds them already during pre-training, so the inputs of pre-training and fine-tuning do not differ, which makes the pre-trained model more usable.

BERT brings together the NLP advances of recent years (feature extractors, language models), makes full use of large amounts of unlabeled data, and implicitly brings linguistic knowledge into specific tasks.

GPT-2 (2019)
GPT-2, proposed in 2019, produces impressive text generation results. This section briefly discusses GPT-2, focusing on its optimizations:

Filter data quality and use higher quality data;

Draw data from a wider range of domains;

Use larger data volumes;

Enlarge the model: 1.5 billion parameters, about five times the parameters and twice the depth of BERT-large (roughly 300 million parameters), which reflects the expressive power of deep neural networks;

Adjust the Transformer network structure slightly: move the position of layer normalization, scale part of the initialization according to the depth of the network, and adjust some hyperparameters.

As the title of the paper, Language Models are Unsupervised Multitask Learners, suggests (the title captures the essence of a language model well), GPT-2 emphasizes that language models are naturally unsupervised and multi-task. These are also the two most notable trends in the current NLP field.

In short, GPT-2 uses more, better, and more diverse data and a larger model. It also raises deeper questions that deserve further study.

Interestingly, GPT-2 still insists on a unidirectional language model. The BERT paper concluded that "the two-way language model has achieved the greatest improvement, and secondly, predicting the next sentence has a greater impact on some tasks." If so, why does GPT-2 still insist on a unidirectional model? This may look like stubbornness, but it may touch on a fundamental question: what is the essence of a language model? Should a language model be unidirectional? Does BERT's masking operation impose certain limitations? Is this why GPT-2's text generation is so impressive? GPT-2 may keep the unidirectional model precisely because it wants to address these questions. These points deserve further discussion, and GPT-3 may offer a fuller explanation.

At the same time, despite its large parameter count, GPT-2 performs worse than a BERT model of comparable size on most tasks, but it is still worth exploring in depth.


We compared the LSTM+CRF and BERT+CRF models on the open-source MSRA dataset, and BERT+CRF showed a clear advantage.

In the slot extraction task of our dialogue system, using the currently limited labeled data, the BERT+CRF model achieved a precision of 86.1%, a recall of 88.9%, and an F1 score of 87.5%.

In the intent classification task, the BERT+FST model achieved a precision of 69.71%, a recall of 73.8%, and an F1 score of 71.7%.
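
The reported F1 scores can be checked from the other two figures, since F1 is the harmonic mean of precision and recall. This short sketch assumes the first figure in each pair is precision (the numbers are consistent with that reading); the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units, here percent)."""
    return 2 * precision * recall / (precision + recall)

slot_f1 = f1_score(86.1, 88.9)      # slot extraction, BERT+CRF
intent_f1 = f1_score(69.71, 73.8)   # intent classification, BERT+FST
print(round(slot_f1, 1), round(intent_f1, 1))
```

Both values round to the F1 scores quoted above.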

Currently, we are validating a multi-task joint model that uses BERT+CRF for both classification and slot extraction.
