This article describes the performance data of AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed) in model training. Compared with training on native PyTorch DDP, ACSpeed delivers significantly better performance.
The example in this article uses performance data from multi-instance training on eight-GPU ECS instances with AIACC-ACSpeed (ACSpeed) V1.0.2 enabled. It measures the performance of ACSpeed when training different models in different scenarios.
ACSpeed improves training performance across multiple models, with gains ranging from 5% to 200%. The results show that the gain is most notable where PyTorch DDP scales poorly, whereas the performance of ACSpeed itself is not affected by scaling. The following figure shows the results of the test.
(Figure: Performance improvement)
The following table describes the terms used in this test:
| Term | Description |
| --- | --- |
| ddp_acc (x-axis) | The scalability of PyTorch DDP across multiple multi-GPU instances, expressed as multi-instance linearity. A smaller linearity value means poorer scalability. Linearity = multi-instance performance / single-instance performance / number of instances in the cluster. |
| acc_ratio (y-axis) | The improvement ratio of ACSpeed over PyTorch DDP, measured on the same performance metric. For example, 1.25 means that ACSpeed performs 1.25 times as well as PyTorch DDP, which is a 25% improvement. |
| Dots | The cluster size. Blue dot: the cluster contains 1 instance. Orange dot: 2 instances. Red dot: 4 instances. Green dot: 8 instances. |
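The two metrics in the table can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the ACSpeed tooling: the function names are illustrative, and the throughput numbers are made-up placeholders rather than measured results.

```python
def linearity(multi_perf: float, single_perf: float, num_instances: int) -> float:
    """ddp_acc: multi-instance performance / single-instance performance / number of instances."""
    if single_perf <= 0 or num_instances <= 0:
        raise ValueError("single_perf and num_instances must be positive")
    return multi_perf / single_perf / num_instances


def improvement_percent(acc_ratio: float) -> float:
    """Convert an acc_ratio such as 1.25 into a percentage improvement such as 25."""
    return (acc_ratio - 1.0) * 100.0


# Hypothetical throughput values (for example, samples per second); NOT measured data.
single_instance_ddp = 1000.0   # PyTorch DDP on one eight-GPU instance
four_instance_ddp = 2800.0     # PyTorch DDP on four instances
four_instance_acspeed = 3500.0 # ACSpeed on the same four instances

ddp_acc = linearity(four_instance_ddp, single_instance_ddp, 4)
acc_ratio = four_instance_acspeed / four_instance_ddp

print(f"ddp_acc (linearity): {ddp_acc:.2f}")
print(f"acc_ratio: {acc_ratio:.2f} ({improvement_percent(acc_ratio):.0f}% improvement)")
```

A point plotted at this (ddp_acc, acc_ratio) position would use the red dot, because the hypothetical cluster contains four instances.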
This article shows performance data only for the example models that were tested. The improvement varies across scenarios because the ratio of communication to computation differs from model to model. The following sections show the performance data of the specific test models.
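To see why the communication-to-computation ratio matters, consider a simplified Amdahl-style estimate. This sketch is an assumption for illustration only: it treats each training step as a communication part plus a computation part with no overlap between them (real DDP training overlaps the two), and the function name, the communication fractions, and the speedup factor are all hypothetical rather than taken from ACSpeed.

```python
def estimated_acc_ratio(comm_fraction: float, comm_speedup: float) -> float:
    """Estimate the overall speedup when only communication gets faster.

    comm_fraction: share of baseline step time spent on communication (0 to 1).
    comm_speedup:  factor by which communication time is reduced (> 0).
    Assumes communication and computation do not overlap (a simplification).
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    new_step_time = (1.0 - comm_fraction) + comm_fraction / comm_speedup
    return 1.0 / new_step_time


# Hypothetical models: one compute-bound, one communication-bound.
for name, fraction in [("compute-bound model", 0.1), ("communication-bound model", 0.6)]:
    ratio = estimated_acc_ratio(fraction, comm_speedup=2.0)
    print(f"{name}: estimated acc_ratio = {ratio:.2f}")
```

Under these assumptions, the same communication speedup yields a much larger overall gain for the communication-bound model, which is consistent with the spread of improvements across the scenarios.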
Scenario 1: Training an alexnet model
The following figure shows the performance data in this training scenario:
Scenario 2: Training a resnet18 model
The following figure shows the performance data in this training scenario:
Scenario 3: Training a resnet50 model
The following figure shows the performance data in this training scenario:
Scenario 4: Training a vgg16 model
The following figure shows the performance data in this training scenario:
Scenario 5: Training a timm_vovnet model
The following figure shows the performance data in this training scenario:
Scenario 6: Training a timm_vision_transformer model
The following figure shows the performance data in this training scenario:
Scenario 7: Training a pytorch_unet model
The following figure shows the performance data in this training scenario:
Scenario 8: Training an hf_Bart model
The following figure shows the performance data in this training scenario:
Scenario 9: Training an hf_Bert model
The following figure shows the performance data in this training scenario:
Scenario 10: Training a speech_transformer model
The following figure shows the performance data in this training scenario:
Scenario 11: Training a tts_angular model
The following figure shows the performance data in this training scenario: