DeepRec: A Large-Scale Sparse Model Training and Inference Engine
Guide:
This article covers the following three aspects:
• DeepRec background (why we built DeepRec)
• DeepRec functionality (design motivation and implementation)
• DeepRec community (main features of the latest 2206 release)
DeepRec Background
Why do we need a sparse model engine? The community version of TensorFlow supports sparse scenarios, but it has weaknesses in the following three areas:
• Sparse training features that improve model quality;
• Training performance and model iteration efficiency;
• Deployment of sparse models.
Therefore, we built DeepRec, which provides deep optimizations for sparse scenarios.
DeepRec's work focuses on four aspects: sparse features, training performance, serving, and deployment & ODL (online deep learning).
Inside Alibaba, DeepRec is applied mainly in the following core scenarios: recommendation ("Guess What You Like"), search (main search), and advertising (express and targeted ads). We have also provided solutions for sparse scenarios to customers on the cloud, which greatly improved their model quality and iteration efficiency.
DeepRec Function Introduction
DeepRec's features fall into five areas: sparse functionality (Embedding), the training framework (asynchronous and synchronous), the Runtime (Executor, PRMalloc), graph optimization (structured model, SmartStage), and serving and deployment.
1. Embedding
The Embedding section covers the following five sub-features:
1.1 Dynamic elastic features (Embedding Variable, EV)
The left side of the figure above shows the main way TensorFlow supports sparse functionality: the user first defines a fixed-shape tensor, and sparse features are mapped into that tensor by Hash+Mod (sketched in the code below). This has four problems:
• Sparse feature conflicts. Hash+Mod easily introduces feature collisions, which cause effective features to be overwritten and hurt model quality;
• Wasted memory. Part of the allocated storage is never used;
• Fixed shape. Once the variable's shape is fixed, it cannot be changed later;
• Inefficient IO. Variables defined this way must be exported in full. If a variable has large dimensions, both export and load are time-consuming, even though little changes between exports in sparse scenarios.
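For reference, a minimal sketch of that conventional static-variable pattern, in TF 1.x style matching DeepRec's TensorFlow 1.15 base; all sizes here are illustrative:

```python
import tensorflow as tf

NUM_BUCKETS = 1_000_000   # fixed first dimension, chosen up front
EMBED_DIM = 16

# Fixed-shape variable: distinct features can hash to the same row,
# which is exactly the conflict problem discussed above.
embedding = tf.get_variable(
    "embedding",
    shape=[NUM_BUCKETS, EMBED_DIM],
    initializer=tf.truncated_normal_initializer())

# Raw feature -> Hash -> Mod into the fixed-shape table.
ids = tf.strings.to_hash_bucket_fast(["user_123", "item_456"], NUM_BUCKETS)
vectors = tf.nn.embedding_lookup(embedding, ids)
```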
To address this, DeepRec's Embedding Variable turns the static variable into dynamic, HashTable-like storage: each new key creates its own embedding, which naturally eliminates feature conflicts. With this design, however, when there are very many features the Embedding Variable grows without bound and memory consumption becomes very large. DeepRec therefore introduces two companion functions, feature admission and feature eviction, which effectively keep the table from growing too large. In sparse scenarios such as search and recommendation, many long-tail features are rarely trained, so feature admission uses a CounterFilter or BloomFilter to set a threshold a feature must reach before entering the Embedding Variable; feature eviction removes stale features when the model is exported to a checkpoint. These functions raised the AUC of a recommendation service inside Alibaba by 5‰, and raised the AUC of a recommendation service on the cloud by 5‰ and its pvCTR by 4%. A hedged usage sketch follows.
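The sketch below assumes DeepRec's documented get_embedding_variable / EmbeddingVariableOption / CounterFilter API as recalled here; exact parameter names may differ between releases:

```python
import tensorflow as tf  # a DeepRec build of TensorFlow

# Feature admission: only keys seen filter_freq times get a real row.
ev_opt = tf.EmbeddingVariableOption(
    filter_option=tf.CounterFilter(filter_freq=3))

var = tf.get_embedding_variable(
    "sparse_emb",
    embedding_dim=16,
    initializer=tf.truncated_normal_initializer(),
    ev_option=ev_opt)

# Keys are raw int64 IDs; each new key gets its own row, so there is
# no Hash+Mod collision and no fixed first dimension.
ids = tf.constant([10234, 56789], dtype=tf.int64)
emb = tf.nn.embedding_lookup(var, ids)
```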
1.2 Dynamic elastic dimensions based on feature frequency (FAE)
Normally, all embeddings in the same feature column share one dimension. If the dimension is set high, low-frequency features easily overfit and consume a lot of memory; if it is set too low, high-frequency features may lack expressive power, hurting model quality. FAE assigns different dimensions to features in the same column according to their frequency, which both preserves model quality and reduces the resources consumed by training. Currently the user passes in a set of dimensions and a statistics algorithm, and FAE automatically generates the corresponding Embedding Variables; later, DeepRec plans to discover and allocate feature dimensions adaptively inside the system, improving usability. A conceptual sketch of the frequency-to-dimension mapping follows.
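This is a conceptual illustration only (not the DeepRec API): larger dimensions for frequent features, smaller for rare ones, with the bucket thresholds standing in for a user-supplied statistics algorithm:

```python
# (min_count, dim) buckets, ordered from most to least frequent; illustrative.
DIM_BUCKETS = [(10000, 32), (100, 16), (0, 8)]

def pick_dim(count):
    """Return the dimension of the first bucket whose threshold is met."""
    for min_count, dim in DIM_BUCKETS:
        if count >= min_count:
            return dim

feature_counts = {"item_42": 50210, "item_77": 310, "item_901": 4}
dims = {feat: pick_dim(cnt) for feat, cnt in feature_counts.items()}
# -> {'item_42': 32, 'item_77': 16, 'item_901': 8}
```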
1.3 Adaptive Embedding Variable
This function is similar to FAE in that it also starts from the distinction between high-frequency and low-frequency features. As noted above, a very large EV consumes a great deal of memory. An Adaptive Embedding Variable therefore uses two variables, as shown in the figure on the right: one static variable, to which low-frequency features are mapped as far as possible, and one dynamic elastic variable for the high-frequency features. Features can migrate dynamically between the two as their frequency changes, which greatly reduces memory usage. For example, after training, a feature table's first dimension may be close to 1 billion while only 20%-30% of the features are important; this adaptive method cuts memory usage substantially, and in practice we found the impact on model accuracy to be very small. A conceptual routing sketch follows.
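A conceptual sketch (not the DeepRec API) of the routing idea: hot keys get conflict-free per-key rows, cold keys share a fixed Hash+Mod table where collisions are tolerated:

```python
import numpy as np

STATIC_BUCKETS, DIM, HOT_THRESHOLD = 1000, 8, 100  # illustrative sizes

static_table = np.random.randn(STATIC_BUCKETS, DIM)  # low-frequency features
dynamic_table = {}                                   # high-frequency features

def lookup(key, count):
    if count >= HOT_THRESHOLD:
        # Per-key row, created on first access (conflict-free).
        return dynamic_table.setdefault(key, np.random.randn(DIM))
    # Shared fixed table; collisions possible but tolerated for cold keys.
    return static_table[hash(key) % STATIC_BUCKETS]

vec = lookup("item_42", count=512)   # routed to the dynamic table
```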
1.4 Multi-Hash Variable
This function also addresses feature conflicts. Instead of a single Hash+Mod, we use two or more independent Hash+Mod lookups and then reduce the resulting embeddings (see the sketch below). Two features only collide if they collide under every hash, so conflicts are resolved with far less memory.
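A conceptual sketch of the multi-hash idea with two tables and summation as the reduction (other reductions, such as concatenation, also work):

```python
import numpy as np

BUCKETS, DIM = 1000, 16            # two small tables instead of one huge one
table_a = np.random.randn(BUCKETS, DIM)
table_b = np.random.randn(BUCKETS, DIM)

def multi_hash_lookup(key):
    ia = hash(("salt_a", key)) % BUCKETS   # first independent hash
    ib = hash(("salt_b", key)) % BUCKETS   # second independent hash
    # Keys only fully collide if they collide under BOTH hashes.
    return table_a[ia] + table_b[ib]       # reduce by summation

emb = multi_hash_lookup("user_123")
```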
1.5 Embedding multi-level hybrid storage
This function likewise starts from the observation that EV memory costs are very high when there are many features: during training, a worker may occupy tens to hundreds of gigabytes of memory. We observed that feature accesses follow a typical power-law distribution, so we keep hot features in more valuable, faster resources (such as CPU memory) and push the long-tail, low-frequency features to cheaper ones. As shown in the figure on the right, there are three tiers: DRAM, PMEM, and SSD. PMEM, from Intel, sits between DRAM and SSD in speed but offers large capacity. We currently support DRAM-PMEM, DRAM-SSD, and PMEM-SSD combinations, and they have delivered results in production: one cloud service that used distributed training on 200+ CPU workers now trains on a single GPU machine after adopting multi-level storage. A hedged configuration sketch follows.
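The sketch below follows the DeepRec docs as recalled here; the StorageOption name, the enum location, and its fields are assumptions and may differ by release (some versions expose the storage type via a config_pb2 module instead):

```python
import tensorflow as tf  # a DeepRec build of TensorFlow

# Assumed API: hot keys stay in DRAM, cold keys spill to PMEM at the
# given path. Treat names and fields as recollections, not a reference.
storage_opt = tf.StorageOption(
    storage_type=tf.StorageType.DRAM_PMEM,
    storage_path="/mnt/pmem0/deeprec_ev")

ev_opt = tf.EmbeddingVariableOption(storage_option=storage_opt)
var = tf.get_embedding_variable(
    "hybrid_emb", embedding_dim=16, ev_option=ev_opt)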
That concludes the Embedding features. Our motivation was a set of problems in TensorFlow, chiefly feature conflicts, which we solve with dynamic elastic features and multi-hash variables. For the large memory overhead of dynamic elastic features, we developed feature admission and feature eviction; for feature frequency, we developed three groups of functions: dynamic elastic dimensions and adaptive elastic features attack the problem along the dimension axis, while multi-level hybrid storage attacks it along the software-and-hardware axis.
2. Training framework
The second area is the training framework, covering both asynchronous and synchronous training.
2.1 Asynchronous training framework StarServer
For large-scale jobs with thousands of workers, native TensorFlow suffers from inefficient thread scheduling, prominent critical-path overhead, and very frequent small-packet communication, all of which become distributed-communication bottlenecks.
StarServer improves thread scheduling and graph memory optimization, replaces the framework's Send/Recv with Push/Pull semantics, and executes lock-free on the PS side, which greatly improves execution efficiency. It achieves several times the performance of the native framework and scales linearly to 3K workers internally.
2.2 Synchronous training framework HybridBackend
HybridBackend is our solution for synchronous training. It supports hybrid distributed training combining data parallelism and model parallelism: data reading is done data-parallel, model parallelism carries the huge parameter tables, and the dense computation runs data-parallel again. Based on the characteristics of different Embedding Lookups, we merge and group multiple lookups, and we exploit GPU Direct RDMA to design the whole synchronization framework with awareness of the network topology.
3. Runtime
The third major area is the Runtime, mainly PRMalloc and Executor optimization.
3.1 PRMalloc
The first topic is memory allocation, which is ubiquitous in TensorFlow and DeepRec. We found that in sparse training, large allocations cause a large number of minor page faults, and multi-threaded allocation contends heavily. Based on the forward/backward structure of sparse training, DeepRec implements a memory allocation scheme for deep learning called PRMalloc, which improves both memory usage and performance. Its core, shown in the figure, is the MemoryPlanner: during the first k minibatches of training it records how many tensors are allocated each step and at what sizes, captures this behavior in a bin buffer, and applies the resulting optimizations from step k onward. In our use of DeepRec, this greatly reduces minor page faults, lowers memory usage, and speeds training by 1.6x. A conceptual sketch of the planner follows.
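A conceptual sketch of the MemoryPlanner idea (not DeepRec's actual implementation): observe the first K steps, record peak demand per tensor size, then pre-allocate reusable bins:

```python
from collections import Counter

K = 10
peak_demand = {}    # tensor size -> max simultaneous allocations observed

def record_step(alloc_sizes):
    """Statistics phase: track the per-step peak count for each size."""
    for size, n in Counter(alloc_sizes).items():
        peak_demand[size] = max(peak_demand.get(size, 0), n)

for _ in range(K):                        # first K minibatches
    record_step([1024, 1024, 4096])       # sizes a training step requests

# Planning phase: pre-allocated bins that later steps reuse, avoiding
# fresh allocations and the minor page faults they trigger.
bins = {size: [bytearray(size) for _ in range(n)]
        for size, n in peak_demand.items()}
```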
3.2 Executor optimization
TensorFlow's native Executor is straightforward: it topologically sorts the DAG, inserts nodes into an execution queue, and schedules them as tasks. This implementation ignores the workload's structure. The ThreadPool uses the Eigen thread pool by default, and when thread load is uneven, heavy work stealing among threads brings large overhead. In DeepRec we make scheduling more uniform and define the critical path so that ops execute with a priority order. DeepRec also provides multiple scheduling policies, including Task-based and SimpleGraph-based.
4. Graph optimization
4.1 Structured features
This feature was inspired by our business workloads. In search scenarios, for both training and inference, a sample is usually one user paired with multiple items and multiple labels. The original processing treats this as multiple independent samples, so the user's data is stored redundantly. To save this cost, we customized the storage format, and when the samples in a minibatch belong to the same user, the user network and the item networks are computed separately before the final logit computation, saving compute. We thus optimize structurally on both the storage side and the computation side (see the sketch below).
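A conceptual sketch of the compute-side saving: run the shared user tower once per user and reuse it across that user's items, rather than once per sample:

```python
import numpy as np

def user_tower(user):            # stand-ins for the real sub-networks
    return np.tanh(user)

def item_tower(items):
    return np.tanh(items)

user = np.random.randn(8)        # stored once, not once per item
items = np.random.randn(5, 8)    # 5 items paired with the same user

u = user_tower(user)             # user network computed a single time
scores = item_tower(items) @ u   # logits for all 5 (user, item) pairs
```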
4.2 SmartStage
Sparse model training usually consists of sample reading, Embedding Lookup, and MLP network computation. Sample reading and Embedding Lookup are rarely compute-intensive and do not use compute resources efficiently. The prefetch interface of the native framework enables some asynchrony, but the complex subgraphs we build for Embedding Lookup cannot be pipelined by TensorFlow's prefetch. TensorFlow's pipelining also requires the user to explicitly specify stage boundaries, which raises the difficulty of use, and because the boundaries cannot be drawn precisely at the op level, many ops are left unparallelized; high-level API users cannot insert them manually at all. SmartStage, shown in the figure below, automatically classifies ops into stages so the pipeline runs concurrently. On ModelZoo models, the measured speedup reaches 1.1-1.3x. A hedged configuration sketch follows.
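The sketch below follows the DeepRec SmartStage docs as recalled here; the option and hook names are assumptions and may vary by release:

```python
import tensorflow as tf  # a DeepRec build of TensorFlow

# Assumed switch: let the graph optimizer find stage boundaries itself.
config = tf.ConfigProto()
config.graph_options.optimizer_options.do_smart_stage = True

with tf.train.MonitoredTrainingSession(
        hooks=[tf.make_prefetch_hook()],   # drives the staged subgraphs
        config=config) as sess:
    ...  # sess.run(train_op) as usual; stages are pipelined automatically
```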
5. Serving
5.1 Incremental Model Export and Loading
One of the problems raised at the start of the Embedding section was inefficient IO. With the dynamic elastic features described earlier, incremental export becomes natural: by recording which sparse IDs have been accessed, an incremental export can write out exactly the IDs that are needed. We built this function for two reasons. First, our original practice was to export a full checkpoint periodically and restore from it after interruptions; in the worst case, all work between two checkpoints is lost. With incremental export, the dense part is exported in full and the sparse part incrementally, and in real scenarios exporting increments every 10 minutes greatly reduces the loss a restore would cause. Second, incremental export serves online serving: sparse models are very large, so loading the full model every time takes a long time, and incremental export is also what ODL (online deep learning) scenarios rely on. A conceptual sketch of the delta computation follows.
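A conceptual sketch (not the DeepRec API) of the export split: dense variables saved in full, sparse rows saved only if touched since the last checkpoint:

```python
dense_vars = {"mlp/w1": "..."}     # always exported in full
sparse_rows = {}                   # feature id -> (step_last_updated, row)
last_ckpt_step = 0

def incremental_export(current_step):
    """Return the full dense state plus only recently-updated sparse rows."""
    global last_ckpt_step
    delta = {fid: row for fid, (step, row) in sparse_rows.items()
             if step > last_ckpt_step}    # typically far smaller than full
    last_ckpt_step = current_step
    return dense_vars, delta
```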
5.2 ODL
In the figure, the left side is sample processing, the upper and lower parts are offline and online training, and the right side is serving; many PAI components are used to build this pipeline.
DeepRec Community