Elastic GPU Service is suitable for scenarios such as video transcoding, image rendering, AI training, AI inference, and cloud graphics workstations. DeepGPU provides enhanced GPU computing capabilities and is suitable for AI training and AI inference. This topic describes the scenarios of Elastic GPU Service and DeepGPU.
Scenarios of Elastic GPU Service
Real-time video transcoding
During the Double 11 Global Shopping Festival gala in 2019, Elastic GPU Service was used to transcode videos at resolutions of 1080p, 2K, and 4K in real time. Elastic GPU Service delivered high image quality and definition in real time while consuming minimal bandwidth. The following section provides details:
Elastic GPU Service supported high-concurrency real-time transcoding of more than 5,000 video channels, which gradually increased to a peak of 6,200 channels per minute, and smoothly handled the traffic peak.
Elastic GPU Service also supported tasks such as real-time rendering of home furnishing images. A large number of ECS Bare Metal instances of the ebmgn6v instance type, which deliver powerful computing capacity, were provided for the first time to support Taobao renderers. The instances improved performance by dozens of times, achieved real-time rendering within seconds, and rendered more than 5,000 home furnishing images in total.
AI training
The GPU-accelerated compute optimized instance families gn6v and gn6e provide excellent general-purpose GPU acceleration capabilities and are suitable as acceleration engines for deep learning. The following section provides details:
The gn6v and gn6e instance families use NVIDIA V100 GPUs with 16 GB and 32 GB of memory, respectively, and can provide mixed-precision computing capacity of up to 1,000 TFLOPS per node.
The gn6v and gn6e instances can be integrated into an elastic computing ecosystem to provide solutions that are suitable for online and offline computing scenarios.
You can use the instances together with container services to simplify deployment and O&M and schedule resources.
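The mixed-precision capacity cited above comes from computing in FP16 while keeping FP32 master copies of the weights. The following standard-library-only sketch (illustrative only, no GPU or framework required) shows why the FP32 master copy matters:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# A small gradient update near 1.0 is below half an FP16 ulp (~4.9e-4),
# so it vanishes if the weight itself is stored in FP16 ...
w_fp16 = to_fp16(to_fp16(1.0) + to_fp16(1e-4))
# ... but survives in the FP32 master weight that mixed-precision training keeps.
w_fp32 = 1.0 + 1e-4

print(w_fp16 == 1.0, w_fp32 > 1.0)  # True True: update lost in FP16, kept in FP32
```

This is why mixed-precision training applies updates to FP32 master weights and casts to FP16 only for the forward and backward passes.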
AI inference
The gn6i instance family provides excellent inference capabilities that meet the computing requirements of deep learning scenarios, especially AI inference. The following section provides details:
The gn6i instances use NVIDIA Tesla T4 GPUs to provide single-precision floating-point computing capacity of up to 8.1 TFLOPS and INT8 fixed-point computing capacity of up to 130 TOPS. The instances also support mixed precision.
Additionally, each GPU consumes only 75 W of power while maintaining high performance.
The gn6i instances can be integrated into an elastic computing ecosystem to provide solutions that are suitable for online and offline computing scenarios.
You can use the instances together with container services to simplify deployment and O&M and schedule resources.
Alibaba Cloud Marketplace provides a gn6i instance image that is preinstalled with an NVIDIA GPU driver and a deep learning framework to simplify development.
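The INT8 throughput above is usually exploited through quantization, which maps floating-point weights and activations to 8-bit integers. The following standard-library-only sketch shows symmetric linear quantization (illustrative only; production toolchains also handle calibration and per-channel scales):

```python
def quantize_int8(values):
    """Map floats to int8 with a single symmetric scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)   # q == [50, -127, 2, 100]
restored = dequantize(q, scale)     # close to the original weights
```

Each value now fits in one byte instead of four, and the matrix multiplications can run on the GPU's INT8 units, at the cost of a small, bounded rounding error.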
Cloud graphics workstations
The gn6i instances use NVIDIA Tesla T4 GPU accelerators based on the Turing architecture and provide excellent graphics computing capacity. You can use gn6i instances together with WUYING Workspace to provide cloud graphics workstation services. The services can be used in scenarios such as film and television animation design, industrial design, medical imaging, and high-performance computing result presentation.
Scenarios of DeepGPU
DeepGPU includes the following components: Apsara AI Accelerator (AIACC, which includes AIACC-Training and AIACC-Inference), AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed), AIACC Graph Speeding (AIACC-AGSpeed), FastGPU, and cGPU. You can use DeepGPU in AI training and AI inference scenarios. The following section provides details:
AI training
AIACC is suitable for AI training and AI inference scenarios. AIACC-ACSpeed (ACSpeed) and AIACC-AGSpeed (AGSpeed) are suitable for AI training based on the PyTorch framework and provide optimizations for PyTorch.
The following table describes the AI training scenarios of AIACC.
| Scenario | Applicable model | Storage |
| --- | --- | --- |
| Image classification and image recognition | MXNet models | Cloud Parallel File System (CPFS) |
| CTR prediction | Wide&Deep models of TensorFlow | Hadoop Distributed File System (HDFS) |
| Natural Language Processing (NLP) | Transformer and BERT models of TensorFlow | CPFS |
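Distributed training jobs in these scenarios spend much of their time synchronizing gradients across workers, and that collective communication is what AIACC-Training accelerates. Conceptually, the operation being accelerated is an all-reduce average, sketched here as a single-process simulation (not the AIACC API):

```python
def allreduce_mean(grads_per_worker):
    """Average each gradient element across workers, as an all-reduce would."""
    n = len(grads_per_worker)
    return [sum(column) / n for column in zip(*grads_per_worker)]

# Two workers, each holding local gradients for the same two parameters:
merged = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
print(merged)  # [2.0, 3.0] — every worker applies the same averaged update
```

Because this exchange happens once per training step for every parameter, speeding it up directly shortens the end-to-end training time.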
The following table describes the AI training scenarios of ACSpeed.
| Scenario | Applicable model | Storage |
| --- | --- | --- |
| Image classification and image recognition | Neural network models such as ResNet and VGG-16, and AIGC models such as Stable Diffusion | CPFS |
| CTR prediction | Wide&Deep model | HDFS |
| NLP | Transformer and BERT models | CPFS |
| Pretraining and fine-tuning of large models | Large language models (LLMs) based on frameworks such as Megatron-LM and DeepSpeed | CPFS |
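ACSpeed targets the communication layer of these PyTorch training jobs. A classic bandwidth-optimal pattern at that layer is the ring all-reduce, sketched below as a single-process simulation (illustrative of the general technique, not ACSpeed's implementation):

```python
def ring_allreduce(chunks):
    """Simulate ring all-reduce: chunks[w][c] is chunk c held by worker w
    (scalars for simplicity). Each step, every worker passes one chunk to
    its ring neighbor, so per-worker traffic stays constant as workers scale."""
    n = len(chunks)
    data = [row[:] for row in chunks]
    # Phase 1, reduce-scatter: after n-1 steps, worker w holds the full
    # sum of chunk (w + 1) % n.
    for step in range(n - 1):
        sends = [((w + 1) % n, (w - step) % n, data[w][(w - step) % n])
                 for w in range(n)]
        for dst, c, val in sends:
            data[dst][c] += val
    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((w + 1) % n, (w + 1 - step) % n, data[w][(w + 1 - step) % n])
                 for w in range(n)]
        for dst, c, val in sends:
            data[dst][c] = val
    return data

# Three workers, gradient vectors split into three chunks each:
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result[0])  # [12, 15, 18] — every worker ends with the same sums
```

Splitting the vector into chunks keeps all links in the ring busy at once, which is why this pattern (and refinements of it) dominates large-scale data-parallel training.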
The following table describes the AI training scenarios of AGSpeed.
| Scenario | Applicable model |
| --- | --- |
| Image classification | ResNet and MobileNet models |
| Image segmentation | Unet3D models |
| NLP | BERT, GPT-2, and T5 models |
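AGSpeed works at the compute-graph level, and a core transformation in graph optimizers of this kind is operator fusion: merging adjacent elementwise operators so the data is traversed once instead of twice. A toy illustration of the idea (not AGSpeed's actual rewrite rules):

```python
def scale_then_shift_unfused(xs, a, b):
    """Two separate 'kernels': each makes a full pass over the data
    and the first materializes a temporary intermediate."""
    scaled = [x * a for x in xs]
    return [s + b for s in scaled]

def scale_then_shift_fused(xs, a, b):
    """One fused 'kernel': same result, a single pass, no temporary."""
    return [x * a + b for x in xs]

xs = [1.0, 2.0, 3.0]
print(scale_then_shift_fused(xs, 2.0, 1.0))  # [3.0, 5.0, 7.0]
```

On a GPU the unfused version launches two kernels and writes the intermediate to memory; fusing halves the memory traffic, which is often the real bottleneck for elementwise operators.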
AI inference
AIACC is suitable for AI inference scenarios. The following table describes the AI inference scenarios of AIACC.
| Scenario | Applicable model | Specification | Optimization |
| --- | --- | --- | --- |
| Video Ultra HD inference | Ultra HD models | T4 GPU | Performance improved by 1.7 times: video decoding ported to the GPU; preprocessing and postprocessing ported to the GPU; the data set size automatically obtained from a single operation; deep optimization of convolutions |
| Online inference of image synthesis | GAN models | T4 GPU | Performance improved by 3 times: preprocessing and postprocessing ported to the GPU; the data set size automatically obtained from a single operation; deep optimization of convolutions |
| CTR prediction and inference | Wide&Deep model | M40 GPU | Performance improved by 5.1 times: pipeline optimization; model splitting; separate optimization of the child models |
| NLP inference | BERT models | T4 GPU | Performance improved by 2.3 times: pipeline optimization of preprocessing and postprocessing; the data set size automatically obtained from a single operation; deep kernel optimization |
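Several rows above cite pipeline optimization of preprocessing and postprocessing. The idea is to overlap preprocessing of the next batch with inference on the current one so the GPU is never idle waiting for input. A standard-library-only sketch with stand-in preprocess/infer functions (not AIACC code):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(frame):   # stand-in for decode/resize of the next batch
    return frame * 2

def infer(batch):        # stand-in for the GPU inference call
    return batch + 1

def pipelined(frames):
    """Preprocess batch i+1 in the background while 'inferring' batch i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(preprocess, frames[0])
        for nxt in frames[1:]:
            batch = pending.result()                 # wait for preprocessing
            pending = pool.submit(preprocess, nxt)   # overlap with inference
            results.append(infer(batch))
        results.append(infer(pending.result()))
    return results

print(pipelined([1, 2, 3]))  # [3, 5, 7]
```

When preprocessing and inference take comparable time, this overlap can approach a 2x throughput gain; porting the preprocessing itself to the GPU, as the table describes, removes the CPU stage entirely.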