Text Generation and Understanding Based on the GPT Model

Overview

GPT models excel at a wide range of text generation tasks, such as text completion, open-ended question answering, cloze, essay writing, abstract writing, novel writing, and poetry generation. ChatGPT, the AI product that has recently become popular across the Internet, is also built on a GPT-style text generation model. Although large GPT models perform well in these applications, their training cost is very high.

Given the large parameter count of the GPT base model and its heavy consumption of training and inference hardware resources, MoE-based sparse training is currently the most competitive way to reduce cost and increase efficiency. MoE stands for Mixture of Experts, where each expert corresponds to an MLP layer of the Transformer model. During training, one MLP is selected from multiple MLPs for activation (as shown in the figure below). This means the model can scale up its parameter count by adding MLP modules without increasing computational intensity (FLOPS/Bytes), thereby improving generalization performance on downstream tasks. Compared with a dense model of the same quality (validation-set loss and zero-shot NLU downstream task performance), the sparse MoE Transformer model achieves nearly 1.2x training throughput and 1.3x inference throughput. In the overall design of the sparse architecture, we chose to combine MoE with GPT's pure Transformer-decoder architecture, because MoE usually combines better with decoders than with encoders. Specifically, an encoder learns its language model through random masking, and these randomly masked tokens make the experts' routing selection unbalanced. In addition, decoder-style GPT models have a wider range of usage scenarios than encoder-style BERT models. We therefore adopt the GPT+MoE technical route to explore the feasibility, in Chinese text generation scenarios, of the most energy-efficient, green and low-carbon technology for integrated software/hardware adaptation of large GPT model training and inference on a single machine.
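To make this concrete, below is a minimal, illustrative PyTorch sketch of an MoE layer with top-1 gating (a toy example, not the actual PAI/EasyNLP implementation): a learned router scores every token, and only the highest-scoring expert MLP is activated for that token, so adding experts grows the parameter count while per-token compute stays roughly constant.

```python
# Toy top-1 MoE layer; illustrative sketch only, not the PAI/EasyNLP implementation.
import torch
import torch.nn.functional as F

class Top1MoE(torch.nn.Module):
    def __init__(self, d_model, num_experts):
        super().__init__()
        self.router = torch.nn.Linear(d_model, num_experts, bias=False)  # W_r
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                                torch.nn.GELU(),
                                torch.nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)])

    def forward(self, x):                      # x: [num_tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)     # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                # tokens routed to expert e
            if mask.any():
                # scale the expert output by its routing probability
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```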

Building on relatively mature distributed MoE expert routing technology, we adopt the top-1 routing mechanism of Switch Transformer [2]. Each expert is assigned a probability by the softmax routing function shown below, and the expert with the highest probability (top-1) is used as the FFN layer at that position. Here W_r is a parameter learned for routing selection.
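The routing function referenced here, following Switch Transformer [2], can be written as:

$$h(x) = W_r \cdot x, \qquad p_i(x) = \frac{e^{h(x)_i}}{\sum_{j=1}^{N} e^{h(x)_j}}$$

where $x$ is the token representation, $N$ is the number of experts, and the token is dispatched to the expert with the largest $p_i(x)$.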
GPT-MoE Training & Inference Energy Efficiency Analysis
Basic Pre-training Model Training & Inference Performance Analysis
For any dense GPT model, there is a corresponding sparse (MoE) GPT model with faster training and inference. Our goal is to find this GPT-MoE configuration under constrained hardware such as a single machine, and then further improve its training energy efficiency by improving the MoE algorithm. We compare the training and inference performance of dense and sparse models to find energy-efficient sparse models equivalent to dense ones.
The parameter counts, model structures, and training hyperparameters of the eight GPT models are shown in the following table:
| GPT model | Parameters | Layers | Heads | Hidden size | LR | Tokens per batch |
|---|---|---|---|---|---|---|
| 1.3B Dense | 1.3B | 24 | 32 | 2048 | 2e-4 | 1M |
| 2.7B Dense | 2.7B | 32 | 32 | 2560 | 1.6e-4 | 1M |
| 3.6B Dense | 3.6B | 30 | 32 | 3072 | 1.6e-4 | 1M |
| 0.35B+MoE-64 | 6.7B | 24 | 16 | 1024 | 3e-4 | 0.5M |
| 1.3B+MoE-32 | 13B | 24 | 32 | 2048 | 2e-4 | 1M |
| 1.3B+MoE-64 | 27B | 24 | 32 | 2048 | 1.6e-4 | 1M |
| 2.7B+MoE-64 | 56B | 32 | 32 | 2560 | 1.6e-4 | 1M |
| 3.6B+MoE-64 | 75B | 30 | 32 | 3072 | 1.6e-4 | 1M |
As shown in the figure below, the 1.3B+MoE-32/64 models reach a lower validation-set loss than the 1.3B dense model at the same step, and the loss of the 1.3B+MoE-64 model is even lower than that of the 2.7B dense model.
Among the five models, 0.35B+MoE-64 has the highest training throughput, about twice that of the other models. Of the remaining four, 1.3B Dense and 1.3B+MoE-32 are faster, while 1.3B+MoE-64 and 2.7B Dense are similar, as shown below:
For inference, 1.3B Dense consumes the least GPU memory, and 0.35B+MoE-64 has the lowest latency.
Inference settings: input_len = 20, output_len = 128, batch_size = 1

| Model | Latency (ms) | Memory (MB) | Num. of GPUs |
|---|---|---|---|
| 1.3B Dense | 399.66 | 9476 | 1 |
| 2.7B Dense | 753.37 | 17340 | 1 |
| 3.6B Dense | 777.54 | 22558 | 1 |
| 0.35B+MoE-64 | 356.22 | 15772 | 1 |
| 1.3B+MoE-32 | 581.34 | 33294 | 1 |
| 1.3B+MoE-64 | 586.18 | 57880 | 1 |
| 2.7B+MoE-64 | 742.43 | 61054 | 2 |
| 3.6B+MoE-64 | 662.57 | 42724 | 4 |
From the charts above, we can roughly conclude that the energy-efficient sparse counterpart of the 2.7B dense model is the 1.3B dense model equipped with 32 or 64 MoE experts. Below we focus on the cost-effectiveness of the 1.3B+MoE-32/64 and 2.7B dense models. After 200 hours of single-machine pre-training, we plot the validation-set loss curves with TensorBoard. We find that at the point where the validation loss reaches 2.16, the 1.3B+MoE-64 model converges 1.17 times as fast as the 2.7B dense model, while the convergence of 1.3B+MoE-32 lags behind the 2.7B dense model by 15%, as shown in the following figure:
The figure below shows single-machine, single-GPU inference performance based on FasterTransformer. The throughput of 1.3B+MoE-32 and 1.3B+MoE-64 is similar and higher than that of the 2.7B dense model, which is in line with expectations, because their base size is only 1.3B.
Chinese ZeroShot-NLU Effect Evaluation
Chinese Text Generation Effect Evaluation
Text Completion
Poetry Generation
Online experience address: https://www.modelscope.cn/models/PAI/nlp_gpt3_text-generation_0.35B_MoE-64/summary
Ad Copy Generation
Online experience address: https://www.modelscope.cn/models/PAI/nlp_gpt3_text-generation_1.3B_MoE-32/summary
Composition Generation
Online experience address: https://www.modelscope.cn/models/PAI/nlp_gpt3_text-generation_1.3B_MoE-64/summary
Self-developed GPT-MoE Algorithm Innovation & Experimental Analysis
Background
Top-1 gating is currently the most mainstream and most effective routing algorithm, but it has obvious shortcomings. In top-1 gating, each token is processed by only one expert, so it often happens that some experts must process many tokens while others receive very few. Experts that see very few tokens do not get enough information and cannot be fully utilized.
Energy Efficient Expert Routing
We therefore developed a new routing algorithm of our own, shown in the figure below. It lets each expert actively select a fixed number of tokens (its capacity), and the same token can be processed by several experts at once, so that every expert can be fully trained (a code sketch follows the numbered steps below).
The final token representation is obtained by a weighted sum: the representations produced by the different experts are weighted, summed, and passed through the Expert Residual module. Because multiple experts contribute jointly, the representation is more robust.
1. Calculate the expert's preference for the token:

2. Train the expert weights with the L-Softmax algorithm:
During training, an L-Softmax loss is used to optimize the expert weights W_e, so that each expert develops a distinguishable preference for tokens.
3. Each expert selects a fixed number of tokens:
We pre-set the maximum number of tokens each expert can handle (its capacity) and record the indices of the tokens that each expert will process.
4. Calculate the final output:
Each expert computes representations for its selected tokens according to the recorded indices; for each token, the representations produced by the different experts are weighted and summed, and the final output is obtained through the Expert Residual module.
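Below is a minimal, illustrative PyTorch sketch of steps 1, 3, and 4. The names W_e, experts, and capacity are assumptions, and the L-Softmax loss and the Expert Residual module are omitted; this is a conceptual sketch, not the actual PAI implementation.

```python
# Expert-choice style routing sketch: each expert picks its top-`capacity` tokens,
# a token may be chosen by several experts, and the outputs are combined by a
# weighted sum. Illustrative only; L-Softmax loss and Expert Residual are omitted.
import torch
import torch.nn.functional as F

def expert_choice_routing(tokens, W_e, experts, capacity):
    # tokens: [num_tokens, d_model]; W_e: [d_model, num_experts]; capacity <= num_tokens
    scores = tokens @ W_e                             # step 1: expert preference for each token
    probs = F.softmax(scores, dim=0)                  # normalize over tokens for each expert
    top_vals, top_idx = probs.topk(capacity, dim=0)   # step 3: each expert picks `capacity` tokens
    output = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        idx = top_idx[:, e]                           # indices of tokens chosen by expert e
        weight = top_vals[:, e].unsqueeze(-1)         # routing weights for those tokens
        expert_out = expert(tokens[idx])              # expert processes its selected tokens
        output.index_add_(0, idx, weight * expert_out)  # step 4: weighted sum over experts
    return output
```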
Experimental Analysis
The figure below plots the validation-set loss of our self-developed algorithm against top-1 gating and s-BASE as training proceeds. The validation loss of our algorithm stays consistently below that of top-1 gating and s-BASE, which demonstrates its effectiveness.
We also observed that, at the point where the validation loss first drops below 2.7, our algorithm is 1.48 times as fast as s-BASE, which greatly reduces the cost of model training.
In addition, we analyzed the training throughput of the self-developed algorithm, top-1 gating, and s-BASE. As shown in the figure below, the training throughput of our algorithm is 1.17 times that of s-BASE.
GPT-MoE Pre-training Based on PAI DLC
Rapidformer provides training acceleration for various Transformer models in EasyNLP by organically integrating Microsoft's DeepSpeed and NVIDIA's Megatron, as shown in the following figure:
For pre-training the large GPT-MoE model, the main training acceleration technologies we use include:
Mixed Precision Training. The benefits of mixed precision training mainly include two points: (1) reduced GPU memory usage: since FP16 takes only half the memory of FP32, it naturally saves roughly half the memory during training; (2) faster training and inference: in addition to saving memory, FP16 also shortens training time. The principle is shown in the figure below. The key point is that an FP32 master copy of the parameters must be maintained to avoid rounding errors when parameters are updated during backpropagation, and loss scaling is used to alleviate numerical underflow/overflow issues.
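As an illustration, here is a minimal sketch using standard PyTorch AMP (not the Rapidformer API); the model, optimizer, and data are placeholders. The forward and backward passes run in FP16 where safe, the optimizer updates FP32 master weights, and dynamic loss scaling guards against FP16 underflow.

```python
# Mixed precision training sketch with PyTorch AMP; placeholder model and data.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # updates FP32 master weights
scaler = torch.cuda.amp.GradScaler()                       # dynamic loss scaling

def train_step(x, y):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                        # forward pass in FP16 where safe
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                          # scale loss to avoid FP16 underflow
    scaler.step(optimizer)                                 # unscale grads, update parameters
    scaler.update()                                        # adjust scale if overflow occurred
    return loss.item()
```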
Selective Activation Recomputation. Several checkpoints are placed inside the neural network, and all intermediate results other than those at the checkpoints are discarded; during backpropagation, any intermediate result needed for computing derivatives is recomputed starting from the nearest checkpoint. This saves GPU memory while avoiding recomputation from scratch. In practice, some layers produce large activations but require little computation, so those activations are selectively dropped and recomputed, while the important (expensive) activations are retained to limit recomputation cost.
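A minimal sketch with torch.utils.checkpoint (a generic illustration of recomputation, not Megatron's selective logic): the activations inside the checkpointed feed-forward block are dropped during the forward pass and recomputed from the block input during backward.

```python
# Activation recomputation sketch: internal activations of the checkpointed
# block are discarded and recomputed in backward; illustrative only.
import torch
from torch.utils.checkpoint import checkpoint

class FFBlock(torch.nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.ff = torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                                      torch.nn.GELU(),
                                      torch.nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        # only the block input is saved; intermediate activations are recomputed
        return x + checkpoint(self.ff, x, use_reentrant=False)
```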
The Zero Redundancy Optimizer (ZeRO) is a memory optimization technique for large-scale distributed deep learning. ZeRO has three optimization stages, corresponding to partitioning the optimizer states, the gradients, and the parameters, respectively. Here we use ZeRO-1, i.e. optimizer state partitioning.
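For illustration, a minimal DeepSpeed configuration that enables ZeRO stage 1 might look like the sketch below (the batch size and precision settings are placeholder assumptions):

```python
# ZeRO-1 sketch: only optimizer states are partitioned across data-parallel ranks.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,     # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},       # 1 = optimizer states, 2 = + gradients, 3 = + parameters
}

# typical usage, given a model built in the training script:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```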
Sequence Parallelism is a technique that splits long sequences to speed up training. On top of tensor parallelism, the inputs of the LayerNorm and Dropout layers in the Transformer are split along the sequence-length dimension, so each device only needs to perform Dropout and LayerNorm on part of the sequence. This has two advantages: (1) the computation of LayerNorm and Dropout is spread across the devices, reducing wasted compute; (2) the activations produced by LayerNorm and Dropout are also distributed across the devices, further reducing memory overhead.
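Conceptually, the split can be sketched as follows (an illustration only; the real Megatron-LM implementation wraps this in scatter/all-gather communication around the tensor-parallel regions):

```python
# Sequence parallelism sketch: LayerNorm/Dropout input is split along the
# sequence dimension, so each rank processes only 1/world_size of the tokens.
import torch

def sequence_parallel_norm_dropout(x, norm, dropout, rank, world_size):
    # x: [seq_len, batch, hidden]; LayerNorm and Dropout act per token,
    # so chunks along the sequence dimension are independent.
    local = x.chunk(world_size, dim=0)[rank]   # this rank's slice of the sequence
    return dropout(norm(local))                # compute and activations shrink by 1/world_size
```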
