This topic describes the performance data of AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed) in model training. Compared with training that uses native PyTorch DistributedDataParallel (DDP), ACSpeed delivers significantly better performance.
Background information
This topic uses the performance data of multi-instance training with AIACC-ACSpeed (ACSpeed) V1.0.2 enabled on eight-GPU ECS instances as an example. The example tests the performance of ACSpeed when training models in different scenarios.
Tested versions
- ACSpeed: ACSpeed V1.0.2
- CUDA: CUDA V11.1
- Torch: Torch 1.8.1 + cu111
- Instance type: an eight-GPU instance
Test results
Term | Description
---|---
ddp_acc (x-axis) | The scalability of PyTorch DDP across multiple multi-GPU instances, measured as multi-instance linearity. A smaller linearity value indicates poorer scalability. The linearity is calculated based on the following formula: Linearity = multi-instance performance / (single-instance performance × number of instances).
acc_ratio (y-axis) | The improvement ratio of ACSpeed over PyTorch DDP measured by performance metrics. For example, a value of 1.25 indicates that the performance of ACSpeed is 1.25 times that of PyTorch DDP, which means the performance is improved by 25%.
Dots | Different cluster sizes.
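To illustrate how the two metrics in the table relate, the short Python sketch below computes ddp_acc (linearity) and acc_ratio from throughput numbers. All values in it are hypothetical, chosen only to demonstrate the formulas; they are not ACSpeed or PyTorch DDP measurements.

```python
# Illustrative only: the throughput values below are hypothetical,
# not measured ACSpeed or native PyTorch DDP results.

def linearity(multi_instance_throughput, single_instance_throughput, num_instances):
    """ddp_acc: multi-instance performance / (single-instance performance x instance count)."""
    return multi_instance_throughput / (single_instance_throughput * num_instances)

def acc_ratio(acspeed_throughput, ddp_throughput):
    """acc_ratio: ACSpeed performance divided by native PyTorch DDP performance."""
    return acspeed_throughput / ddp_throughput

# Hypothetical example: 4 instances, throughput in samples per second.
ddp_single = 1000.0      # native DDP on one instance
ddp_multi = 3200.0       # native DDP on four instances
acspeed_multi = 4000.0   # ACSpeed on the same four instances

ddp_acc = linearity(ddp_multi, ddp_single, 4)  # 0.80 -> 80% scaling efficiency
ratio = acc_ratio(acspeed_multi, ddp_multi)    # 1.25 -> 25% improvement

print(f"ddp_acc = {ddp_acc:.2f}, acc_ratio = {ratio:.2f}")
```

A ddp_acc of 0.80 means native DDP achieves 80% of ideal linear scaling on this cluster, and an acc_ratio of 1.25 means ACSpeed improves throughput by 25% over DDP at that point, which matches the axes of the figures below.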
Performance data of example models
This section shows only the performance data of the example models that were tested. Performance improvements vary across scenarios because the ratio of communication to computation differs from model to model. The following section shows the performance data of each test model.
- Scenario 1: Training an alexnet model
- Model: alexnet
- Domain: COMPUTER_VISION
- Subdomain: CLASSIFICATION
- Batch size: 128
- Precision: Automatic mixed precision (AMP)
The following figure shows the performance data in this training scenario.
- Scenario 2: Training a resnet18 model
- Model: resnet18
- Domain: COMPUTER_VISION
- Subdomain: CLASSIFICATION
- Batch size: 16
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 3: Training a resnet50 model
- Model: resnet50
- Domain: COMPUTER_VISION
- Subdomain: CLASSIFICATION
- Batch size: 32
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 4: Training a vgg16 model
- Model: vgg16
- Domain: COMPUTER_VISION
- Subdomain: CLASSIFICATION
- Batch size: 64
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 5: Training a timm_vovnet model
- Model: timm_vovnet
- Domain: COMPUTER_VISION
- Subdomain: CLASSIFICATION
- Batch size: 32
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 6: Training a timm_vision_transformer model
- Model: timm_vision_transformer
- Domain: COMPUTER_VISION
- Subdomain: CLASSIFICATION
- Batch size: 8
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 7: Training a pytorch_unet model
- Model: pytorch_unet
- Domain: COMPUTER_VISION
- Subdomain: CLASSIFICATION
- Batch size: 1
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 8: Training an hf_Bart model
- Model: hf_Bart
- Domain: NLP
- Subdomain: LANGUAGE_MODELING
- Batch size: 4
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 9: Training an hf_Bert model
- Model: hf_Bert
- Domain: NLP
- Subdomain: LANGUAGE_MODELING
- Batch size: 4
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 10: Training a speech_transformer model
- Model: speech_transformer
- Domain: SPEECH
- Subdomain: RECOGNITION
- Batch size: 32
- Precision: AMP
The following figure shows the performance data in this training scenario.
- Scenario 11: Training a tts_angular model
- Model: tts_angular
- Domain: SPEECH
- Subdomain: SYNTHESIS
- Batch size: 64
- Precision: AMP
The following figure shows the performance data in this training scenario.