This topic describes the performance data of AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed) when it is used to train models. Compared with training performed by using native PyTorch DistributedDataParallel (DDP), ACSpeed delivers significantly better performance.

Background information

This topic uses the performance data of multi-instance training with AIACC-ACSpeed (ACSpeed) V1.0.2 enabled on eight-GPU ECS instances as an example. The example tests the performance of ACSpeed when models are trained in different scenarios.

Tested versions

  • ACSpeed: V1.0.2
  • CUDA: V11.1
  • Torch: 1.8.1+cu111
  • Instance type: eight-GPU ECS instance
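Under the versions listed above, the PyTorch DDP baseline corresponds to a standard multi-GPU DistributedDataParallel training script launched with one process per GPU. The following is a minimal sketch of such a baseline; the model, batch size, and synthetic data are placeholders for illustration and are not the exact benchmark configuration.

```python
# Minimal PyTorch DDP baseline sketch (placeholder model and synthetic data, not the benchmark script).
# Assumed launch on one eight-GPU instance:
#   python -m torch.distributed.launch --use_env --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher when --use_env is passed
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = models.resnet50().cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(100):                          # synthetic data instead of a real dataset
        images = torch.randn(32, 3, 224, 224, device="cuda")
        labels = torch.randint(0, 1000, (32,), device="cuda")
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                           # gradient all-reduce happens during backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

ACSpeed is evaluated against this kind of DDP baseline; the scenarios below report how much the end-to-end training performance improves when ACSpeed is enabled.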

Test results

ACSpeed delivers significant performance improvements across multiple model training workloads, with gains ranging from 5% to 200%. The results show that the improvement from ACSpeed is more notable in cases where PyTorch DDP scales poorly, and that the benefit of ACSpeed does not degrade as the training scales out. The following figure shows the test results: Performance improvement
The following terms are used in this test:
  • ddp_acc (x-axis): The scalability of PyTorch DDP across multiple multi-GPU instances, expressed as multi-instance linearity. A smaller value indicates poorer scalability. The linearity is calculated based on the following formula: Linearity = multi-instance performance / (single-instance performance × number of instances). A worked example follows this list.
  • acc_ratio (y-axis): The ratio by which ACSpeed improves over PyTorch DDP on the measured performance metrics. For example, a value of 1.25 indicates that the performance of ACSpeed is 1.25 times that of PyTorch DDP, which means a 25% improvement.
  • Dots: Different cluster sizes.
    • Blue dot: the cluster contains 1 instance.
    • Orange dot: the cluster contains 2 instances.
    • Red dot: the cluster contains 4 instances.
    • Green dot: the cluster contains 8 instances.
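For clarity, the following is a small numeric example of how the two metrics are computed. The throughput values are invented for illustration and are not measured results.

```python
# Illustrative calculation of ddp_acc (linearity) and acc_ratio; all numbers are made up.
single_instance_throughput = 1000.0   # samples per second with PyTorch DDP on 1 instance (8 GPUs)
multi_instance_throughput = 3200.0    # samples per second with PyTorch DDP on 4 instances
num_instances = 4

# ddp_acc: multi-instance linearity of PyTorch DDP.
linearity = multi_instance_throughput / (single_instance_throughput * num_instances)
print(linearity)      # 0.8, that is, 80% scaling efficiency

# acc_ratio: ACSpeed performance relative to PyTorch DDP on the same setup.
acspeed_throughput = 4000.0           # invented value for the same 4-instance job with ACSpeed
acc_ratio = acspeed_throughput / multi_instance_throughput
print(acc_ratio)      # 1.25, that is, a 25% improvement
```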

Performance data of example models

This section shows only the performance data of the example models that were tested. The performance improvements vary across scenarios because the proportion of communication to computation differs from model to model. All scenarios are trained with automatic mixed precision (AMP); a short AMP sketch is provided after the scenario list. The following section shows the performance data of the specific test models.

  • Scenario 1: Training an alexnet model
    • Model: alexnet
    • Domain: COMPUTER_VISION
    • Subdomain: CLASSIFICATION
    • Batch size: 128
    • Precision: Automatic mixed precision (AMP)
    The following figure shows the performance data in this training scenario: Model 1
  • Scenario 2: Training a resnet18 model
    • Model: resnet18
    • Domain: COMPUTER_VISION
    • Subdomain: CLASSIFICATION
    • Batch size: 16
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 2
  • Scenario 3: Training a resnet50 model
    • Model: resnet50
    • Domain: COMPUTER_VISION
    • Subdomain: CLASSIFICATION
    • Batch size: 32
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 3
  • Scenario 4: Training a vgg16 model
    • Model: vgg16
    • Domain: COMPUTER_VISION
    • Subdomain: CLASSIFICATION
    • Batch size: 64
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 4
  • Scenario 5: Training a timm_vovnet model
    • Model: timm_vovnet
    • Domain: COMPUTER_VISION
    • Subdomain: CLASSIFICATION
    • Batch size: 32
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 5
  • Scenario 6: Training a timm_vision_transformer model
    • Model: timm_vision_transformer
    • Domain: COMPUTER_VISION
    • Subdomain: CLASSIFICATION
    • Batch size: 8
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 6
  • Scenario 7: Training a pytorch_unet model
    • Model: pytorch_unet
    • Domain: COMPUTER_VISION
    • Subdomain: CLASSIFICATION
    • Batch size: 1
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 7
  • Scenario 8: Training an hf_Bart model
    • Model: hf_Bart
    • Domain: NLP
    • Subdomain: LANGUAGE_MODELING
    • Batch size: 4
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 8
  • Scenario 9: Training an hf_Bert model
    • Model: hf_Bert
    • Domain: NLP
    • Subdomain: LANGUAGE_MODELING
    • Batch size: 4
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 9
  • Scenario 10: Training a speech_transformer model
    • Model: speech_transformer
    • Domain: SPEECH
    • Subdomain: RECOGNITION
    • Batch size: 32
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 10
  • Scenario 11: Training a tts_angular model
    • Model: tts_angular
    • Domain: SPEECH
    • Subdomain: SYNTHESIS
    • Batch size: 64
    • Precision: AMP
    The following figure shows the performance data in this training scenario: Model 11
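All of the scenarios above are trained with automatic mixed precision (AMP). As a minimal sketch of what an AMP training step typically looks like in PyTorch 1.8 (the model and data below are placeholders, not the benchmark models), AMP combines torch.cuda.amp.autocast with a GradScaler:

```python
# Minimal AMP training-step sketch (placeholder model and synthetic data).
import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass and loss run in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()         # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                # unscale gradients and skip the step if they overflowed
    scaler.update()
```

In the DDP scenarios above, the same pattern is wrapped around a DistributedDataParallel model: AMP reduces computation time, while ACSpeed targets the communication overhead of gradient synchronization.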