This topic describes the parallelism design that Whale provides to resolve the issues of large-scale classification: a model of this scale cannot be trained on a single device, and the performance of plain distributed training is poor. Whale combines operator splitting with data parallelism and uses an optimized communication topology to resolve these issues. Whale allows you to implement this fusion mode for a large-scale classification task in three steps: model division, resource grouping, and mapping.
Classification is a fundamental problem in machine learning and data mining. In big data scenarios, the number of classes typically reaches the order of millions, tens of millions, or even hundreds of millions. As the class scale increases, the number of parameters and the size of the model grow superlinearly. Ultra-large-scale classification is a common problem in business. For example, the number of classes in facial recognition reaches the order of hundreds of millions.
The model in this example consists of a ResNet50 backbone that extracts features, followed by a fully connected (FC) layer and a Softmax layer. Sample code:

```
features = ResNet50(inputs)
logits = FC(features)
predictions = Softmax(logits)
```
This example uses 100,000 classes, a scale at which data parallelism is still feasible, so that its performance can be compared with that of operator splitting. With 100,000 classes, the weight of the ResNet50 part is 89.6 MB, while the weight of the FC layer is 781.6 MB, about 8.7 times larger. When data parallelism is used for distributed training, the FC part starts an AllReduce to synchronize its gradients as soon as they are computed during back propagation, while the ResNet50 part continues to compute its own gradients. However, the FC gradients are so large that they are still being synchronized when the ResNet50 part finishes its gradient computation. Therefore, the gradient synchronization of the FC part cannot overlap with the back propagation of the ResNet50 part. As a result, communication accounts for a large proportion of each training iteration and performance is low.
As the number of classes grows further, GPU memory becomes the bottleneck. For example, with a hidden size of 2,048, an NVIDIA V100 GPU with 16 GB of memory can hold the FC weight for at most about 1.9 million classes. After the intermediate outputs, the computed gradients, and the optimizer variables are taken into account, a 16 GB V100 supports only about 200,000 classes. The memory usage of the convolution layers is not even considered here.
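The memory estimate above can be reproduced with a few lines of arithmetic. The following sketch is only illustrative; it uses the 2,048-dimensional feature stated above and assumes float32 parameters, a bias term in the FC layer, and 1 GB = 10^9 bytes for the 16 GB memory budget.

```
HIDDEN = 2048             # feature dimension produced by the ResNet50 backbone
BYTES_PER_PARAM = 4       # float32
V100_MEMORY = 16 * 10**9  # 16 GB of GPU memory

def fc_size_mb(num_classes, hidden=HIDDEN):
    """Size of the FC weight (hidden x num_classes) plus bias, in MB."""
    params = hidden * num_classes + num_classes
    return params * BYTES_PER_PARAM / 1024**2

print(round(fc_size_mb(100_000), 1))  # 781.6 MB, matching the figure above

# Upper bound if the GPU stored nothing but the FC weight:
max_classes = V100_MEMORY // (HIDDEN * BYTES_PER_PARAM)
print(max_classes)                    # about 1.9 million classes

# After the intermediate outputs, gradients, and optimizer variables are
# accounted for, the practical limit drops to roughly 200,000 classes.
```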
To sum up, even if an entire GPU is dedicated to the FC part, data parallelism cannot support a classification task with millions of classes. The existing data parallelism strategy cannot scale to ultra-large-scale classification. For such a task, you must split the single FC operator into multiple shards and assign the shards to different devices for computing.
In this scenario, you must consider both the class scale and the data scale, because an extremely large number of classes is inevitably accompanied by large-scale training data. Operator splitting resolves the issue of a large model, such as the large FC part in this example or a BERT-Large model. BERT is short for Bidirectional Encoder Representations from Transformers. The ResNet50 backbone, implemented with the convolution functions of the TF-Slim library, is itself small, so data parallelism is needed to handle the large data scale. The problem therefore becomes how to combine operator splitting and data parallelism in one distributed training job while keeping communication efficient.
- FC part
The weight is large, so gradient synchronization is expensive. Under data parallelism, synchronizing such large gradients causes a performance issue, and a single GPU cannot even store the weight for an ultra-large number of classes. To resolve these issues, the FC weight is split by column and the shards are stored on multiple GPUs. For the FC and Softmax parts, each GPU is responsible for only its own shard of the computation.
- ResNet50 part
The weight is small but the computing time is long, and the gradient synchronization overhead is small. Therefore, the data parallelism strategy is applied to the ResNet50 part: each GPU reads a different data shard, which accelerates data processing.
- After GPU0, GPU1, GPU2, and GPU3 complete the forward propagation of the ResNet50 part, GPU4 and GPU5 receive features from GPU0, GPU1, GPU2, and GPU3.
- GPU4 and GPU5 concatenate the features by row and perform computing for the FC part. The following figure shows the detailed computing diagram of the FC part.
- GPU4 and GPU5 obtain the output logits and perform computing for the Softmax part. This step requires cross-GPU communication to obtain the global maximum of the logits and the sum of each row. A sketch of this computation follows the list of issues below.
However, this approach has the following issues:
- The data parallelism part must send its features to the operator splitting part in a point-to-point manner, which creates a communication hotspot.
- Some GPUs are idle during each iteration: while the data parallelism part is computing, the GPUs allocated to the operator splitting part sit idle and wait, and while the operator splitting part is computing, the GPUs allocated to the data parallelism part sit idle and wait.
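The column-wise FC split and the cross-GPU Softmax described above can be illustrated with a small NumPy simulation. This is not Whale code: the shards are plain arrays, the dimensions are reduced for readability, and the cross-GPU exchange of the row-wise maximum and row-wise sum is written as ordinary NumPy reductions across the shards.

```
import numpy as np

np.random.seed(0)
batch, hidden, num_classes, num_shards = 8, 128, 10, 2  # reduced sizes for the demo

# Features gathered (concatenated by row) from the data-parallel GPUs.
features = np.random.randn(batch, hidden).astype(np.float32)

# The FC weight is split by column: each "GPU" stores and computes one shard.
fc_weight = (np.random.randn(hidden, num_classes) * 0.01).astype(np.float32)
weight_shards = np.split(fc_weight, num_shards, axis=1)

# Each shard produces logits only for its own subset of classes.
logit_shards = [features @ w for w in weight_shards]

# Distributed Softmax: exchange the row-wise maximum, then the row-wise sum.
global_max = np.max([s.max(axis=1) for s in logit_shards], axis=0)    # cross-shard max
exp_shards = [np.exp(s - global_max[:, None]) for s in logit_shards]
global_sum = np.sum([e.sum(axis=1) for e in exp_shards], axis=0)      # cross-shard sum
prob_shards = [e / global_sum[:, None] for e in exp_shards]

# Sanity check against a single-device Softmax over the full logits.
full_logits = features @ fc_weight
reference = np.exp(full_logits - full_logits.max(axis=1, keepdims=True))
reference /= reference.sum(axis=1, keepdims=True)
assert np.allclose(np.concatenate(prob_shards, axis=1), reference, atol=1e-6)
```

In an actual implementation, the two reductions correspond to collective communication among the GPUs that hold the FC shards.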
Use Whale to implement a multiclass classification task
The process of implementing a large-scale classification task in fusion mode consists of three steps based on the standard Whale distributed programming paradigm: model division, resource grouping, and mapping.
- Divide the model. Specify scopes to divide the model into different parts based on different parallelism requirements. For more information, see Scope-related operations. In the preceding solution, the model is divided into two stages:
- Stage 0: includes the ResNet50 part.
- Stage 1: includes the FC and Softmax parts.
To divide the model, add scopes, such as whale.replica and whale.split, to the original code of the model. Sample code:
```
with whale.replica():    # Stage 0
    features = ResNet50(inputs)

with whale.split():      # Stage 1
    logits = FC(features)
    predictions = Softmax(logits)
```
- Group resources.
Use a cluster to group the hardware resources into one virtual device that includes GPU0, GPU1, GPU2, GPU3, GPU4, and GPU5. For more information, see whale.cluster. The Cluster tool provided by Whale allows you to divide the workers that you apply for. By default, whale.cluster virtualizes all devices into one virtual device. Sample code:
```
cluster = whale.cluster()
```
- Map the model to resources. Map each part of the model to the virtual device:
- Assign the ResNet50 part to the virtual device based on the data parallelism strategy. Each of GPU0, GPU1, GPU2, GPU3, GPU4, and GPU5 holds a model replica of the ResNet50 part.
- Assign the FC and Softmax parts to the virtual device based on the operator splitting strategy. Each of GPU0, GPU1, GPU2, GPU3, GPU4, and GPU5 first computes one shard of the FC matrix multiplication and then participates in the Softmax computation.
Whale allows you to use the with syntax on the generated cluster to map the model to hardware resources with ease. The following core code shows how to use Whale to implement this parallelism. For more information about the executable code of a large-scale classification task that uses operator splitting, visit large_scale_classification.py.
```
cluster = whale.cluster()          # Group resources.
with cluster:                      # Map the model to resources.
    with whale.replica():          # Specify Stage 0.
        features = ResNet50(inputs)
    with whale.split():            # Specify Stage 1.
        logits = FC(features)
        predictions = Softmax(logits)
```
The following table describes the environment and parameters that are used in the performance test.

| Item | Description |
| --- | --- |
| GPU model | ecs.gn6v-c10g1.20xlarge (V100 × 8) |
| NCCL_MAX_NRINGS | The official NVIDIA parameter. The value is set to 4 during the test. |
| NCCL_MIN_NRINGS | The official NVIDIA parameter. The value is set to 4 during the test. |
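If you want to apply the NCCL ring settings from the table in your own test, one option, shown here only as an illustration (the variables can also be exported in the shell), is to set the environment variables before the training process initializes NCCL:

```
import os

# NCCL reads these variables at initialization time, so set them before the
# first collective operation is issued.
os.environ["NCCL_MAX_NRINGS"] = "4"
os.environ["NCCL_MIN_NRINGS"] = "4"
```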
The test compares the following three configurations:
- Data parallelism: Whale Data Parallelism Batch Size = 16
- Operator splitting based on a fixed batch size: Whale Model Parallelism Batch Size = 16
- Operator splitting based on a dynamic batch size: Whale Model Parallelism + Dynamic Batch Size
The test results show the following:
- For both single-machine multi-GPU and multi-machine tasks, the performance of operator splitting is much better than that of data parallelism, because the gradient synchronization traffic of the FC part is reduced.
- Operator splitting with a batch size that is dynamically adjusted based on the number of GPUs outperforms data parallelism at its maximum batch size. When 64 GPUs are used, the batch size can be increased to 160, which is 10 times the batch size for data parallelism.
- The speedup increases superlinearly: because the batch size grows dynamically, both the computation and the communication traffic in the apply phase, in which parameters are updated, are reduced by 90%.