An overview of the development of natural language generation

Public interest in artificial intelligence has grown alongside the rise of science-fiction film. Whenever we hear the words "artificial intelligence", we think of movies like "The Terminator," "The Matrix," and "I, Robot."

The ability of robots to think independently is still a long way off, but the fields of machine learning and natural language understanding have made significant strides over the past few years. Applications such as personal assistants (Siri and Alexa), chatbots, and question-answering bots are quietly changing the way people live.

The need to understand and derive meaning from language that is ambiguous and structurally varied has made Natural Language Understanding (NLU) and Natural Language Generation (NLG) among the fastest-growing applications of artificial intelligence. Gartner predicted that "by 2019, natural language generation will be a standard feature of 90% of modern BI and analytics platforms". This article reviews the history of NLG and looks to its future.

What is NLG?
NLG conveys information by predicting the next word in a sentence. A language model does this by learning a probability distribution over word sequences: given the words so far, it assigns a probability to each possible next word. For example, to complete "I need to learn how to __", the language model calculates the probability of candidates such as "write" or "drive". Advanced neural networks such as RNNs and LSTMs can process longer sequences and improve the accuracy of these predictions.
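The idea of a probability distribution over next words can be made concrete with a tiny count-based sketch. The corpus below is hypothetical, invented purely for illustration; a real language model is trained on vastly more text and a longer context.

```python
from collections import Counter

# Hypothetical toy corpus (illustrative only).
corpus = [
    "i need to learn how to write",
    "i need to learn how to drive",
    "i need to learn how to write",
]

# Count which word follows the context "how to" in each sentence.
next_words = Counter()
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        if words[i] == "how" and words[i + 1] == "to":
            next_words[words[i + 2]] += 1

# Normalise the counts into a probability distribution.
total = sum(next_words.values())
probs = {w: c / total for w, c in next_words.items()}
print(probs)  # "write" gets 2/3, "drive" gets 1/3
```

Neural language models replace these raw counts with learned parameters, but the output has the same shape: a distribution over the vocabulary for the next position.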

Markov Chains
Markov chains are among the earliest algorithms used for language generation. A Markov chain predicts the next word in a sentence from the current word alone. For example, suppose the model is trained on the two sentences "I drink coffee in the morning" and "I eat sandwiches with tea". The probability of "coffee" following "drink" is then 100%, while "eat" and "drink" each follow "I" with probability 50%. A Markov chain computes the probability of the next word by considering only its relationship to the current word. Models of this kind were among the first used to generate next-word suggestions on smartphone keyboards.

However, because it attends only to the current word, a Markov model cannot capture the relationship between that word and the rest of the sentence, or the sentence's overall structure. This makes its predictions inaccurate and limits its usefulness in many applications.

Recurrent Neural Network (RNN)
Inspired by how the human brain works, neural networks offer a new approach to computation by modeling the non-linear relationship between input and output data; applied to language, this approach is called neural language modeling.

An RNN is a type of neural network that captures the sequential structure of its input. Each item in the sequence is processed by a feedforward network, and during generation the model's output is fed back in as the next item in the sequence. This process lets the network carry forward information from every previous step. Such "memory" makes RNNs excellent at language generation, because remembering past information helps predict what comes next. Unlike Markov chains, an RNN making a prediction considers not only the current word but also the words it has already processed.

Language Generation with RNNs
At each iteration, the RNN stores the words it has seen so far in its "memory" and computes the probability of each possible next word. For example, given "We need to rent a __", the next word in the sentence must be predicted. The model has learned, for each word in the vocabulary, how likely it is to follow the preceding context; here, "house" or "car" has a higher probability than "river" or "dinner". The model selects among the highest-probability words and proceeds to the next iteration.

But RNNs have a big problem: vanishing gradients. As the sequence length increases, the training signal from early words shrinks toward zero, so the network effectively cannot retain words encountered long ago and makes predictions based mainly on recent words. This makes vanilla RNNs unsuitable for generating coherent long sentences.
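A single step of a vanilla RNN can be sketched with numpy. The weights here are random stand-ins for trained parameters, and the dimensions are illustrative, not taken from any particular model; the point is that the hidden state `h` is the "memory" carried between words.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 8  # illustrative sizes

# Random weights stand in for trained parameters.
W_xh = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden ("memory")
W_hy = rng.normal(0, 0.1, (vocab_size, hidden_size))   # hidden -> output

def rnn_step(x_onehot, h_prev):
    """One RNN step: the new hidden state mixes the current word
    with the state carried over from all previous words."""
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)
    logits = W_hy @ h
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
    return h, probs

# Process a 3-word sequence; h accumulates information at each step.
h = np.zeros(hidden_size)
for word_id in [2, 5, 1]:
    x = np.eye(vocab_size)[word_id]
    h, probs = rnn_step(x, h)
```

The vanishing-gradient problem arises during training of exactly this loop: gradients flow backwards through `W_hh` once per step, and repeated multiplication shrinks them.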

Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) networks are a variant of RNNs that are better suited to processing long sequences than vanilla RNNs. LSTMs are widely used and share the overall recurrent structure of RNNs. The difference is that where a vanilla RNN repeats a single simple layer, an LSTM cell contains four interacting layers. An LSTM consists of four parts: the cell state, an input gate, an output gate, and a forget gate.

Language Generation with LSTM

For example, suppose the input is "I am from Spain. I am fluent in ___". To correctly predict the next word, "Spanish", the LSTM pays attention to "Spain" in the previous sentence and uses the cell state to remember it. The cell state stores information gathered while processing the sequence, which is then used to predict the next word. When a period is encountered, the forget gate recognizes that the context has changed and discards the stale information held in the cell state. In other words, the role of the forget gate is to let the recurrent network "forget" information it no longer needs.

LSTMs and their variants alleviate the vanishing gradient problem and can generate coherent sentences. However, LSTMs have limitations of their own: they are computationally demanding and difficult to train.
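The four parts named above fit together as follows. This is a minimal numpy sketch of the standard LSTM formulation (concatenated `[h_prev, x]` input); the weights are random stand-ins for trained parameters and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3  # illustrative sizes

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One weight matrix and bias per gate: forget, input, output, candidate.
W = {g: rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for g in "fioc"}
b = {g: np.zeros(n_hid) for g in "fioc"}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: how much old cell state to keep
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: how much new information to write
    o = sigmoid(W["o"] @ z + b["o"])        # output gate: how much of the cell to expose
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell contents
    c = f * c_prev + i * c_tilde            # cell state: long-term memory
    h = o * np.tanh(c)                      # hidden state: the step's output
    return h, c

# Run a 5-step input sequence through the cell.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c)
```

A forget gate output near 0 (as after the period in the example above) multiplies the old cell state away; near 1, it passes long-range memory through unchanged, which is what lets gradients survive over long sequences.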


Transformer

The Transformer was first proposed by the Google team in the 2017 paper "Attention Is All You Need", which introduced a new mechanism called self-attention. Transformers are now widely used for NLP problems such as language modeling, machine translation, and text generation. The Transformer model consists of a stack of encoders, which process inputs of arbitrary length, and a stack of decoders, which output the generated sentence.

In a translation setting, the encoder processes the input sentence and generates a representation of it; the decoder uses that representation to generate the output sentence. Each word starts from an initial embedding. The self-attention mechanism then relates each word to every other word in the sentence and produces a new representation of it. This step is repeated for each word, successively generating new representations, and the decoder likewise generates its words sequentially from left to right.

Unlike LSTMs, the Transformer requires fewer sequential steps: self-attention directly captures the relationship between all words in a sentence, regardless of how far apart they are.

Recently, many researchers have improved on the vanilla Transformer to increase speed and accuracy. In 2018, Google proposed BERT, which achieved state-of-the-art results across a range of NLP tasks. In 2019, OpenAI released a Transformer-based language model that can generate long articles from just a few lines of input text.

Language Generation with Transformers
The Transformer model can also be used for language generation, the most famous example being the GPT-2 language model from OpenAI. The model predicts the next word in a sentence by focusing its attention on the words most relevant to that prediction.

Text generation with Transformers follows a structure similar to machine translation. For example, given "Her gown with the dots that are pink, white and ____", self-attention over the earlier colors ("pink" and "white") tells the model that the word to be predicted is also a color, and it outputs "blue". Self-attention lets the model selectively focus on the role each word plays in the sentence, rather than relying on recurrence to remember a few features.
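The core of self-attention is small enough to sketch directly. Real Transformers use learned query, key, and value projections plus multiple heads; as a simplifying assumption, this sketch sets Q = K = V = X to show just the mechanism, with random embeddings standing in for real ones.

```python
import numpy as np

rng = np.random.default_rng(2)

def self_attention(X):
    """Scaled dot-product self-attention, simplified so that Q = K = V = X."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise word-word similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X, weights                    # new per-word representations

# 6 words, each as a 4-dimensional embedding (random stand-ins).
X = rng.normal(size=(6, 4))
out, attn = self_attention(X)
```

Each row of `attn` is a probability distribution telling one word how much to attend to every other word, and distance in the sentence plays no role in the computation, which is why the model can link "pink" and "white" to the blank however far apart they are.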

The future of language generation

This article has traced the evolution of language generation, from Markov chains predicting the next word to self-attention generating coherent articles. However, we are still in the early days of generative language modeling, and autonomous text generation lies ahead. Generative models will also increasingly be applied to images, video, audio, and more.
