This topic describes the parallelism design of Whale and the solution that it provides for large-scale classification. A large-scale classification model cannot be trained on a single device, and the performance of distributed training is poor. Whale combines operator splitting with data parallelism and uses an optimized communication topology to resolve these issues. Whale allows you to implement the fusion mode for a large-scale classification task by performing model division, resource grouping, and mapping.

Background information

Classification is a fundamental problem in machine learning and data mining. In big data scenarios, the number of classes often reaches the order of millions, tens of millions, or even hundreds of millions. As the number of classes grows, the number of parameters and the size of the model increase superlinearly. Ultra-large-scale classification is a common problem in business. For example, the number of classes in facial recognition can reach the order of hundreds of millions.

Issues

The following figure shows the structure of the ResNet50 classification model.
Figure: ResNet50 classification model structure
The original ResNet50 classification model uses the following code:
features = ResNet50(inputs)      # convolutional backbone that extracts features
logits = FC(features)            # fully connected layer: feature dimension x number of classes
predictions = Softmax(logits)    # normalizes the logits into class probabilities

In this example, 100,000 classes are used, and data parallelism can still be applied, which makes it possible to compare the performance of data parallelism with that of operator splitting. When the number of classes is 100,000, the weight of the ResNet50 backbone is 89.6 MB, whereas the weight of the fully connected (FC) layer is 781.6 MB, about 8.7 times larger. When data parallelism is used for distributed training, the FC part starts an AllReduce operation to synchronize its gradients as soon as they are calculated during back propagation, while the ResNet50 part continues to calculate its own gradients. However, the FC gradients are so large that their synchronization is still in progress when the ResNet50 part finishes its gradient calculation. As a result, the gradient synchronization of the FC part cannot be properly overlapped with the back propagation of the ResNet50 part, the communication proportion of each training iteration is high, and the performance is low.

A larger number of classes also places higher requirements on video memory in distributed training. For example, if the feature dimension (hidden size) is 2,048, an NVIDIA V100 16 GB GPU can store the FC weight for at most about 1.9 million classes. If the intermediate outputs, the calculated gradients, and the state variables of the optimizer are also taken into account, a V100 16 GB GPU supports only about 200,000 classes. This estimate does not even include the video memory used by the convolutional layers.
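The estimates above can be reproduced with the following back-of-the-envelope calculation. This is an illustrative sketch rather than Whale code; the float32 precision and the feature dimension of 2,048 are assumptions taken from the analysis in this section.
# Back-of-the-envelope estimate of the FC layer's video memory footprint.
# Assumptions: float32 parameters and a feature dimension (hidden size) of 2,048.
HIDDEN = 2048
BYTES_PER_PARAM = 4        # float32
V100_MEMORY = 16e9         # 16 GB, in bytes

def fc_weight_mib(num_classes):
    """Size of the hidden x num_classes FC weight matrix in MiB."""
    return HIDDEN * num_classes * BYTES_PER_PARAM / 1024**2

print(fc_weight_mib(100_000))                    # ~781 MiB, as cited above

# Upper bound if the GPU stored nothing but the FC weight:
print(V100_MEMORY / (HIDDEN * BYTES_PER_PARAM))  # ~1.9 million classes

# Keeping the gradient, the optimizer's slot variables, and the logits
# activation in memory as well shrinks the budget to roughly the 200,000
# classes mentioned above.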

To sum up, even if the FC part is placed on a dedicated GPU, data parallelism cannot support a classification task that has millions of classes. The existing data parallelism strategy cannot support ultra-large-scale classification. For an ultra-large-scale classification task, you must split the single FC operator into multiple shards and assign the shards to different devices for computing.

In this scenario, you must consider both the class scale and the data scale, because an extremely large number of classes is inevitably accompanied by large-scale training data. Operator splitting resolves the issue of a large model, such as the large FC part in this example or a BERT-Large model. BERT is short for Bidirectional Encoder Representations from Transformers. The ResNet50 backbone, which is implemented with convolutional layers based on the TF-Slim library, is small in terms of model size, so data parallelism is needed to resolve the issue of large data scale. Therefore, the problem is how to combine operator splitting and data parallelism in distributed training and improve communication performance.

Solution

Different parallelism strategies are applied to different parts of the model to accelerate distributed training and improve distributed performance.
  • FC part

    The large weight results in a large gradient synchronization overhead. If the data parallelism strategy is used, synchronizing such large gradients causes a performance issue. In addition, a single GPU cannot store the weight for an ultra-large number of classes. To resolve these issues, the weight of the FC part is split by column, and the shards are stored on multiple GPUs. For the FC and Softmax parts, each GPU is responsible for only its own shard of the computation.

  • ResNet50 part

    The weight is small, the computing time is long, and the gradient synchronization overhead is small. The data parallelism strategy is applied to the ResNet50 part. Each GPU reads a different data shard. This accelerates data processing.

In summary, the ResNet50 model of the large-scale classification task is divided into two stages. The ResNet50 part in Stage 0 is replicated into N copies that are assigned to different GPUs based on the data parallelism strategy. The FC and Softmax parts in Stage 1 are split and assigned to different GPUs based on the operator splitting strategy to implement model parallelism. Assume that you have six GPUs: GPU0, GPU1, GPU2, GPU3, GPU4, and GPU5. The GPUs are divided into two groups for data parallelism and operator splitting, as shown in the following figure.
Figure: Hybrid parallelism
As shown in the preceding figure, the ResNet50 part in Stage 0 is replicated to GPU0, GPU1, GPU2, and GPU3 to implement data parallelism. Each GPU reads a different data shard for training. The parts in Stage 1 are split into two shards and assigned to GPU4 and GPU5 to implement operator splitting. The following figure shows the parallel computing diagram.
Figure: Parallel computing diagram
In the preceding figure, the computing process consists of the following steps:
  1. After GPU0, GPU1, GPU2, and GPU3 complete the forward propagation of the ResNet50 part, GPU4 and GPU5 receive features from GPU0, GPU1, GPU2, and GPU3.
  2. GPU4 and GPU5 concatenate the received features by row and perform the computing of the FC part. The following figure shows the detailed computing diagram of the FC part.
     Figure: FC computing diagram
  3. GPU4 and GPU5 obtain the output logits and perform the computing of the Softmax part. This step requires cross-GPU communication to obtain the global maximum of the logits and the sum of each row, as illustrated by the sketch after this list.
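The following NumPy sketch simulates steps 2 and 3 in a single process. The tensor shapes are scaled down, and the two collectives, the global row maximum and the global row sum, are emulated with local reductions; these are illustrative assumptions, and Whale performs the equivalent steps with cross-GPU communication.
import numpy as np

batch, hidden, num_classes, num_shards = 8, 2048, 10_000, 2  # scaled-down sizes

# Features that GPU4 and GPU5 have received from GPU0-GPU3 and concatenated by row.
features = np.random.randn(batch, hidden).astype(np.float32)

# The FC weight is split by column: each shard covers a different slice of classes.
w_shards = np.split(
    np.random.randn(hidden, num_classes).astype(np.float32), num_shards, axis=1)

# Each "GPU" computes logits only for its own slice of classes.
logits_shards = [features @ w for w in w_shards]

# Distributed, numerically stable Softmax:
# 1. Each shard computes its local row maximum; the global maximum is all-reduced.
global_max = np.max([s.max(axis=1) for s in logits_shards], axis=0)
# 2. Each shard exponentiates against the global maximum; the row sums are all-reduced.
exp_shards = [np.exp(s - global_max[:, None]) for s in logits_shards]
global_sum = np.sum([e.sum(axis=1) for e in exp_shards], axis=0)
# 3. Each shard normalizes its own slice of the class probabilities.
prob_shards = [e / global_sum[:, None] for e in exp_shards]

# The sharded result matches an unsharded Softmax over the full logits.
assert np.allclose(np.concatenate(prob_shards, axis=1).sum(axis=1), 1.0, atol=1e-3)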
Although the preceding solution can complete a large-scale classification task, the following issues still exist in the process:
  • The data parallelism part must send features to the operator splitting part in a point-to-point manner. This causes the hotspot issue.
  • Some GPUs are idle during the iteration. During the computing of the data parallelism part, the GPUs that are allocated to the operator splitting part are idle and wait. During the computing of the operator splitting part, the GPUs that are allocated to the data parallelism part are idle and wait.
These issues seriously degrade performance. In addition, the hotspot issue prevents the distributed job from being scaled up properly. To resolve these issues, Whale further optimizes the model to perform computing in fusion mode.
Whale maps the ResNet50 part in Stage 0 to all GPUs based on data parallelism, and maps the FC and Softmax parts in Stage 1 to all GPUs based on sharded computing. The following figure shows the mapping between the model and hardware resources.
Figure: Model and hardware mapping
The output of Stage 0 is transmitted in a decentralized AllGather communication mode instead of the original point-to-point mode. This prevents the hotspot issue. In addition, as soon as a GPU completes the computing in Stage 0 and the AllGather communication starts, the GPU immediately performs the computing in Stage 1, so no GPU enters the idle state. The following figure shows the computing diagram.
Figure: Optimized computing diagram
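The fused schedule can be illustrated with the following NumPy sketch, which runs in a single process. The AllGather is emulated with a simple concatenation, and the shapes and shard sizes are scaled-down, illustrative assumptions.
import numpy as np

num_gpus, local_batch, hidden, num_classes = 6, 4, 2048, 10_000  # scaled-down sizes

# Stage 0: every GPU runs its own ResNet50 replica on a different data shard.
local_features = [np.random.randn(local_batch, hidden).astype(np.float32)
                  for _ in range(num_gpus)]

# AllGather: every GPU obtains the features of the full global batch, so no
# single device becomes a point-to-point hotspot.
gathered = np.concatenate(local_features, axis=0)   # (num_gpus * local_batch, hidden)

# Stage 1: the same GPUs immediately compute their own column shard of the FC
# layer, so no GPU sits idle while another group computes.
w_shards = np.array_split(
    np.random.randn(hidden, num_classes).astype(np.float32), num_gpus, axis=1)
logits_shards = [gathered @ w for w in w_shards]    # one logits shard per GPU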

Use Whale to implement a large-scale classification task

The process of implementing a large-scale classification task in fusion mode consists of three steps based on the standard Whale distributed programming paradigm: model division, resource grouping, and mapping.

  1. Divide the model.
    Specify scopes to divide the model into different parts based on different parallelism requirements. For more information, see Scope-related operations. In the preceding solution, the model is divided into two stages:
    • Stage 0: includes the ResNet50 part.
    • Stage 1: includes the FC and Softmax parts.
    You can divide the ResNet50 model in Whale by adding two lines of scope code, which are whale.replica and whale.split, to the original code of the model. Sample code:
    with whale.replica():    # Stage 0
        features = ResNet50(inputs)
        
    with whale.split():      # Stage 1
        logits = FC(features)
        predictions = Softmax(logits)
  2. Group resources.

    Use a cluster to group hardware resources into one virtual device that includes GPU0, GPU1, GPU2, GPU3, GPU4, and GPU5. For more information, see whale.cluster.

    The cluster tool provided by Whale allows you to partition the workers that you have requested. By default, whale.cluster virtualizes all devices into a single virtual device. Sample code:
    cluster = whale.cluster()
  3. Map the model to resources.
    Map each part of the model to the virtual device.
    • Assign the ResNet50 part to the virtual device based on the data parallelism strategy. Each GPU of GPU0, GPU1, GPU2, GPU3, GPU4, and GPU5 has a model replica of ResNet50.
    • Assign the FC and Softmax parts to the virtual device based on the operator splitting strategy. Each GPU of GPU0, GPU1, GPU2, GPU3, GPU4, and GPU5 first performs computing for a shard of matrix multiplication of the FC part, and then performs computing for the Softmax part.
    Whale allows you to use the with syntax on the generated cluster to map the model to hardware resources with ease. The following core code shows how to use Whale to implement parallelism. For more information about the executable code of a large-scale classification task that uses operator splitting, visit large_scale_classification.py.
    cluster = whale.cluster() # Group resources.
    
    with cluster:             # Map the model to resources.
        with whale.replica(): # Specify Stage 0.
            features = ResNet50(inputs)
    
        with whale.split():   # Specify Stage 1.
            logits = FC(features)
            predictions = Softmax(logits)

Performance

The data parallelism performance of Whale is used as the baseline for comparison. The following table describes the test environment.
Item               Description
Instance type      ecs.gn6v-c10g1.20xlarge (V100 × 8)
Network            VPC, 35 Gbit/s
NCCL_MAX_NRINGS    Official NVIDIA NCCL parameter. Set to 4 during the test.
NCCL_MIN_NRINGS    Official NVIDIA NCCL parameter. Set to 4 during the test.
In the performance test for operator splitting, a ResNet50 model with 100,000 classes is used. The FC part is split and assigned to distributed devices for sharded computing based on the operator splitting strategy. In addition, the convolution part and the split FC part reuse the same workers in fusion mode. Because operator splitting divides the video memory required by the FC part among multiple GPUs, the maximum supported batch size is larger than in the data parallelism scenario. In this topic, the performance of operator splitting is compared with that of data parallelism in the following test scenarios:
  • Data parallelism: Whale Data Parallelism Batch Size = 16
  • Operator splitting based on a fixed batch size: Whale Model Parallelism Batch Size = 16
  • Operator splitting based on a dynamic batch size: Whale Model Parallelism + Dynamic Batch Size
The following figure shows the comparison result.
Figure: Performance comparison
Note The X-axis indicates the number of GPUs, whereas the Y-axis indicates the acceleration ratio.
From the preceding figure, you can draw the following conclusions:
  • For both single-machine multi-GPU and multi-machine tasks, the performance of operator splitting is much better than that of data parallelism, because the gradient synchronization traffic of the FC part is greatly reduced.
  • When the batch size is dynamically adjusted based on the number of GPUs, the performance of operator splitting is better than that of data parallelism at its maximum batch size. With 64 GPUs, the batch size can be increased to 160, which is 10 times the batch size used for data parallelism.
  • The speedup increases superlinearly. Because the batch size increases dynamically, both the computation traffic and the communication traffic of the apply phase, in which parameters are updated, are reduced by 90%.