Text Generation and Understanding Based on GPT Models
Overview
GPT models handle a wide range of text generation tasks well, such as text completion, open-ended question answering, cloze tests, and writing essays, abstracts, novels, and poems. ChatGPT, the AI product that has recently gone viral across the Internet, is also built on a GPT-style text generation model. Although large GPT models perform well in these application areas, the training cost is very high. Taking the 175-billion-parameter GPT-3 released by OpenAI as an example, training is estimated to take 34 days on 1,024 A100 GPUs, and a one-trillion-parameter GPT-3 would take at least 84 days on 3,072 A100 GPUs; the 530-billion-parameter NLG model required 3 months of training on 2,048 A100 GPUs to converge well.
Given the huge parameter counts of GPT base models and the heavy hardware cost of training and inference, sparse training based on MoE is currently the most competitive way to cut costs and improve efficiency. MoE stands for Mixture of Experts, where each expert corresponds to an MLP layer of the Transformer. During training, one MLP is selected for activation from among multiple MLPs (as shown in the figure below). This means the model can grow its parameter count by adding MLP modules without increasing computational intensity (FLOPS/Bytes), thereby improving generalization on downstream tasks. Compared with a dense model of the same quality (matched validation loss and zero-shot NLU downstream performance), a sparse Transformer using MoE achieves nearly 1.2x training throughput and 1.3x inference throughput. In the overall design of the sparse architecture, we chose to combine MoE with GPT's decoder-only Transformer architecture, because combining MoE with a decoder usually works better than with an encoder. Specifically, an encoder learns its language model through random masking, and the randomly masked tokens make the experts' routing selection unbalanced. Moreover, since decoder-style GPT models cover a wider range of use cases than encoder-style BERT models, we adopt the GPT+MoE architecture to explore the feasibility of the most energy-efficient, green and low-carbon software/hardware co-designed training & inference stack for large GPT models on a single machine in Chinese text generation scenarios.
Building on relatively mature distributed MoE expert routing technology, we adopt the top-1 routing mechanism of Switch Transformer [2]. Each expert is assigned a probability by the softmax function below, and the expert with the highest probability (top-1) serves as the network's FFN layer, where W_r is a parameter learned for routing.
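To make the routing concrete, here is a minimal numpy sketch of top-1 gating; the function name and shapes are our own illustration, not code from Switch Transformer:

```python
import numpy as np

def top1_route(x, W_r):
    """Sketch of Switch Transformer-style top-1 gating (names are ours).

    x:   (num_tokens, d_model) token representations
    W_r: (d_model, num_experts) learned routing weights
    """
    logits = x @ W_r                                      # router logits per token
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)     # softmax over experts
    expert = probs.argmax(axis=-1)                        # top-1 expert index per token
    gate = probs[np.arange(x.shape[0]), expert]           # gate value for the chosen expert
    return expert, gate
```

Each token is then dispatched only to the MLP of `expert[t]`, and that MLP's output is scaled by `gate[t]`.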
GPT-MoE Training & Inference Energy Efficiency Analysis
Training & inference performance analysis of the base pretrained models
For any dense GPT model, there is a corresponding sparse (MoE) GPT model that trains and infers faster. Our goal is to find this GPT-MoE model configuration under constrained hardware such as a single machine, and then further improve its training energy efficiency by improving the MoE algorithm. We compare the training & inference performance of dense and sparse models to find energy-efficient sparse models equivalent to dense ones.
The parameter counts, model structures, and training hyperparameters of the eight GPT models are listed in the following table:
| GPT model | Parameters | Layers | Heads | Hidden size | LR | Batch (tokens) |
|---|---|---|---|---|---|---|
| 1.3B Dense | 1.3B | 24 | 32 | 2048 | 2e-4 | 1M |
| 2.7B Dense | 2.7B | 32 | 32 | 2560 | 1.6e-4 | 1M |
| 3.6B Dense | 3.6B | 30 | 32 | 3072 | 1.6e-4 | 1M |
| 0.35B+MoE64 | 6.7B | 24 | 16 | 1024 | 3e-4 | 0.5M |
| 1.3B+MoE32 | 13B | 24 | 32 | 2048 | 2e-4 | 1M |
| 1.3B+MoE64 | 27B | 24 | 32 | 2048 | 1.6e-4 | 1M |
| 2.7B+MoE64 | 56B | 32 | 32 | 2560 | 1.6e-4 | 1M |
| 3.6B+MoE64 | 75B | 30 | 32 | 3072 | 1.6e-4 | 1M |
As shown in the figure below, the 1.3B+MoE32/64 models reach a lower validation loss than the 1.3B dense model at the same training step, and the loss of the 1.3B+MoE64 model is even lower than that of the 2.7B dense model.
Among the five models, 0.35B+MoE64 has the fastest training throughput, about twice that of the other models. Of the remaining four, 1.3B dense and 1.3B+MoE32 are faster, while 1.3B+MoE64 and 2.7B dense are similar, as shown below:
In terms of inference, 1.3B Dense consumes the least memory and 0.35B+MoE64 has the lowest latency.
Inference settings: input_len = 20, output_len = 128, batch_size = 1

| Model | Latency (ms) | Memory (MB) | Num of GPUs |
|---|---|---|---|
| 1.3B Dense | 399.66 | 9476 | 1 |
| 2.7B Dense | 753.37 | 17340 | 1 |
| 3.6B Dense | 777.54 | 22558 | 1 |
| 0.35B+MoE64 | 356.22 | 15772 | 1 |
| 1.3B+MoE32 | 581.34 | 33294 | 1 |
| 1.3B+MoE64 | 586.18 | 57880 | 1 |
| 2.7B+MoE64 | 742.43 | 61054 | 2 |
| 3.6B+MoE64 | 662.57 | 42724 | 4 |
From the charts above, we can roughly conclude that the energy-efficient sparse counterpart of the 2.7B dense model is a 1.3B base model equipped with 32 or 64 MoE experts. Below we focus on the cost-effectiveness of the 1.3B+MoE32/64 models versus the 2.7B dense model. After 200 hours of pretraining on a single A100 machine, we plotted the validation loss curves with TensorBoard. We found that to reach a validation loss of 2.16, the 1.3B+MoE64 model converges 1.17x faster than the 2.7B dense model, while the 1.3B+MoE32 model lags the 2.7B dense model by 15%, as shown in the following figure:
The figure below shows single-machine, single-GPU inference performance based on FasterTransformer. The throughput of 1.3B+MoE32 and 1.3B+MoE64 is similar and higher than that of the 2.7B dense model, which matches expectations, since their base size is only 1.3B.
Chinese Zero-Shot NLU Evaluation
Chinese Text Generation Evaluation
Text completion
Poetry generation
Online demo: https://www.modelscope.cn/models/PAI/nlp_gpt3_textgeneration_0.35B_MoE64/summary
Ad copy generation
Online demo: https://www.modelscope.cn/models/PAI/nlp_gpt3_textgeneration_1.3B_MoE32/summary
Essay generation
Online demo: https://www.modelscope.cn/models/PAI/nlp_gpt3_textgeneration_1.3B_MoE64/summary
Self-Developed GPT-MoE Algorithm Innovation & Experimental Analysis
Background
Top-1 gating is currently the most mainstream and effective routing algorithm, but it has clear shortcomings. In top-1 gating, each token is handed to exactly one expert, so it often happens that some experts must process many tokens while others receive very few. Experts that see very few tokens do not get enough signal to be trained adequately.
Energy-Efficient Expert Routing
We therefore developed a new routing algorithm, shown in the figure below. Our algorithm lets each expert actively select a fixed number of tokens (its capacity), and the same token may be processed by several experts at once, so that every expert can be trained sufficiently.
The final token representation is a weighted sum: the representations produced by the different experts are weighted, summed, and passed through an Expert Residual module to obtain the final representation. Because multiple experts contribute jointly, this representation is more robust.
1. Compute each expert's preference for each token:
2. Train the expert weights with the L-Softmax loss
During training, the L-Softmax loss is used to optimize the expert weight matrix W_e so that each expert develops a distinguishable preference over tokens.
3. Each expert selects a fixed number of tokens:
Here the maximum number of tokens each expert can process is fixed in advance, and the indices of the tokens assigned to each expert are recorded.
4. Compute the final output:
Each expert computes representations for its selected tokens according to the recorded indices; for each token, the representations produced by the different experts are weighted and summed, and the result is passed through the Expert Residual module to produce the final output.
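The steps above can be sketched as follows. This is a simplified numpy illustration under our own assumptions: each expert's MLP is replaced by the identity, the Expert Residual module and the L-Softmax training loss are omitted, and all names are ours rather than the actual implementation's:

```python
import numpy as np

def expert_choice_route(X, W_e, capacity):
    """Sketch of expert-choice routing: experts pick tokens, not the reverse.

    X:   (num_tokens, d_model) token representations
    W_e: (d_model, num_experts) expert weight matrix
    capacity: number of tokens each expert selects
    """
    scores = X @ W_e                                      # expert preference per token
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)      # per-token softmax over experts
    # Step 3: each expert selects its `capacity` highest-preference tokens
    picked = np.argsort(-probs, axis=0)[:capacity]        # (capacity, num_experts) token indices
    # Step 4: weighted sum of the representations from all experts that picked a token
    out = np.zeros_like(X)
    norm = np.zeros((X.shape[0], 1))
    for e in range(W_e.shape[1]):
        idx = picked[:, e]
        w = probs[idx, e][:, None]
        out[idx] += w * X[idx]        # X[idx] stands in for expert e's MLP output
        norm[idx] += w
    covered = norm[:, 0] > 0          # tokens selected by at least one expert
    out[covered] = out[covered] / norm[covered]
    return out, covered
```

Note that with expert-choice routing a token may be picked by several experts (and a few tokens by none); in the full algorithm the Expert Residual module handles the final combination.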
Experimental analysis
The figure below plots validation loss against training step for our algorithm and for the Top-1 Gating and S-BASE algorithms. Our algorithm's validation loss is consistently lower than that of both Top-1 Gating and S-BASE, which demonstrates its effectiveness.
We also observed that the first time the validation loss dropped below 2.7, our algorithm reached it 1.48x faster than S-BASE, substantially reducing the cost of model training.
In addition, we compared the training throughput of our algorithm, Top-1 Gating, and S-BASE. As shown in the figure below, our algorithm's training throughput is 1.17x that of S-BASE.
GPT-MoE Pretraining Based on PAI DLC
Rapidformer provides training acceleration for the various Transformer models in EasyNLP by organically integrating Microsoft's DeepSpeed and NVIDIA's Megatron, as shown in the following figure:
In pretraining the GPT-MoE large models, the core training acceleration techniques we use include:
Mixed Precision Training. The benefits of mixed precision training are twofold: 1. Reduced GPU memory usage: since FP16 takes half the memory of FP32, it naturally saves about half the memory during training. 2. Faster training and inference: besides saving memory, FP16 also shortens training time. The principle is shown in the figure below. The core idea is to keep an FP32 master copy of the weights to avoid rounding errors when parameters are updated during backpropagation, and to use loss scaling to mitigate underflow and overflow errors.
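The FP32 master copy and loss scaling can be illustrated with a toy numpy update step. This is a sketch of the principle only, not the actual Megatron/DeepSpeed implementation, and the function name and defaults are ours:

```python
import numpy as np

def mixed_precision_step(w_fp32, grad_fn, lr=0.1, loss_scale=1024.0):
    """One optimizer step with an FP32 master copy and static loss scaling.

    The forward/backward pass runs in FP16; the update is applied to the
    FP32 master weights so that tiny gradients are not rounded away.
    """
    w_fp16 = w_fp32.astype(np.float16)                        # low-precision compute copy
    grad_fp16 = grad_fn(w_fp16) * np.float16(loss_scale)      # scale to avoid FP16 underflow
    if not np.all(np.isfinite(grad_fp16)):                    # overflow check
        return w_fp32                                         # skip the step (dynamic scaling would also shrink the scale)
    grad = grad_fp16.astype(np.float32) / loss_scale          # unscale in FP32
    return w_fp32 - lr * grad                                 # update the FP32 master weights
```

Without the FP32 master copy, an update like `1.0 - 1e-7` would be lost entirely in FP16, whose precision near 1.0 is only about 5e-4.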
Selective Activation Recomputation. Several checkpoints are set in the middle of the neural network; all intermediate results other than the checkpoints are discarded, and when backpropagation needs an intermediate result, it is recomputed starting from the nearest checkpoint. This both saves GPU memory and avoids the cost of recomputing everything from scratch. In practice, some layers produce large activations but require little computation; it pays to selectively discard those activations while retaining the important (expensive-to-recompute) ones, saving recomputation work.
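The checkpoint-and-recompute idea can be sketched in a few lines of Python. This is illustrative only; real frameworks hook this mechanism into autograd, and the function names here are ours:

```python
import numpy as np

def checkpointed_forward(x, layers, every=2):
    """Run `layers`, keeping activations only at every `every`-th layer boundary."""
    saved = {0: x}                  # checkpoint 0 is the input itself
    h = x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % every == 0:
            saved[i] = h            # checkpointed activation; all others are discarded
    return h, saved

def recompute(saved, layers, i):
    """Recover the activation after layer `i` from the nearest earlier checkpoint."""
    start = max(k for k in saved if k <= i)
    h = saved[start]
    for j in range(start, i):       # replay only the layers since that checkpoint
        h = layers[j](h)
    return h
```

Memory drops from one activation per layer to one per checkpoint, at the cost of replaying at most `every - 1` layers during the backward pass.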
The Zero Redundancy Optimizer (ZeRO) is a memory optimization technique for large-scale distributed deep learning. ZeRO has three main optimization stages, corresponding to partitioning the optimizer states, the gradients, and the parameters, respectively. Here we use ZeRO-1, optimizer state partitioning.
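In DeepSpeed, ZeRO-1 is enabled through the `zero_optimization` section of the configuration. The fragment below is illustrative, with placeholder values rather than the settings actually used in this work:

```python
# Illustrative DeepSpeed configuration enabling ZeRO stage 1.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder value
    "fp16": {"enabled": True},             # combine with mixed precision training
    "zero_optimization": {
        "stage": 1,  # 1 = partition optimizer states; 2 adds gradients; 3 adds parameters
    },
}
```

With stage 1, each data-parallel rank stores only its 1/N shard of the Adam moment buffers, which dominate optimizer memory for large models.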
Sequence Parallelism is a technique that splits long sequences to speed up training. On top of tensor parallelism, the inputs of the Transformer's LayerNorm and Dropout layers are split along the sequence-length dimension, so each device only needs to run Dropout and LayerNorm on its own shard. This has two advantages: 1. the LayerNorm and Dropout computation is distributed across devices, reducing wasted compute; 2. the activations produced by LayerNorm and Dropout are also distributed across devices, further reducing memory overhead.
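Because LayerNorm and Dropout operate independently on each token, splitting along the sequence dimension does not change the result. The numpy sketch below verifies this for a simplified LayerNorm (no learned affine); names are ours, and the real implementation shards across devices rather than within one process:

```python
import numpy as np

def layernorm(h, eps=1e-5):
    """Per-token LayerNorm over the hidden dimension (no learned scale/bias)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def sequence_parallel_layernorm(x, n_devices):
    """Split (seq_len, hidden) activations along the sequence dimension,
    apply LayerNorm on each shard independently (as each device would),
    and gather the results."""
    shards = np.array_split(x, n_devices, axis=0)   # one shard per device
    return np.concatenate([layernorm(s) for s in shards], axis=0)
```

Since per-token operations commute with the sequence split, each device holds and computes only `seq_len / n_devices` of these activations.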