Elastic GPU Service provides GPU-accelerated computing with on-demand availability and auto scaling of GPU computing resources. As an elastic computing service provided by Alibaba Cloud, Elastic GPU Service combines the computing power of GPUs and CPUs to address the challenges of scenarios such as AI, high-performance computing, and professional graphics and image processing.

Elastic GPU Service (EGS) platform

A GPU is a computing chip that provides real-time, high-speed parallel computing and floating-point computing capabilities. Elastic GPU Service combines the elastic computing services provided by Alibaba Cloud with GPUs, which serve as high-speed parallel heterogeneous accelerators, to deliver both the features of elastic computing and GPU acceleration.

Alibaba Cloud launched GPU-accelerated instances based on the EGS platform. You can operate these instances in the same manner as common ECS instances while benefiting from GPU acceleration. To use GPU-accelerated instances, select an enterprise-level heterogeneous computing instance type. For more information about instance types, see Instance families.

Features

  • High elasticity

    Provides a series of instance families. GPU-accelerated instances can be created within minutes. The instances support horizontal scaling and allow changes to instance types within the same instance family.

  • High performance and high security

    Supports GPUDirect peer-to-peer communication between GPUs. GPUs can communicate with each other directly over NVLink, which provides high bandwidth and low latency without CPU intervention. Elastic GPU Service provides elastic security isolation among tenants, with system authorization and management handled by the hypervisor. You can configure high-speed communication between isolated GPUs in a secure manner.

  • Easy deployment

    Deeply integrated with the Alibaba Cloud ecosystem. You can combine Elastic GPU Service with other Alibaba Cloud services to build applications. For example, you can combine Elastic GPU Service with Object Storage Service (OSS) and Apsara File Storage NAS (NAS) to meet storage requirements and with E-MapReduce (EMR) to preprocess deep learning data. You can also combine Elastic GPU Service with Container Service for Kubernetes (ACK) to simplify application delivery.

  • Easy monitoring

    Provides comprehensive monitoring across dimensions such as GPUs, instances, and groups, which reduces the O&M workload. For more information, see GPU monitoring.
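
The direct GPU-to-GPU links described under "High performance and high security" can be inspected from inside a multi-GPU instance. The following is a minimal Python sketch, assuming the NVIDIA driver is installed so that the nvidia-smi topo -m command is available. It parses the topology matrix and lists the GPU pairs marked with an NV entry, which the nvidia-smi legend uses for NVLink connections. The function names and the SAMPLE matrix are illustrative assumptions, not output captured from an Elastic GPU Service instance.

```python
import re
import subprocess

GPU_NAME = re.compile(r"^GPU\d+$")
NVLINK = re.compile(r"^NV\d+$")  # e.g. NV1, NV2: bonded NVLink connections


def read_topology():
    """Run `nvidia-smi topo -m` and return its text output (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def nvlink_pairs(topo_text):
    """Return the sorted GPU pairs that the topology matrix marks as NVLink-connected."""
    header = None
    pairs = set()
    for line in topo_text.splitlines():
        tokens = line.split()
        if not tokens or not GPU_NAME.match(tokens[0]):
            continue  # skip blank lines, non-GPU rows, and the legend
        if header is None:
            header = tokens  # first GPU line is the column header
            continue
        row_gpu, cells = tokens[0], tokens[1:]
        for col_name, cell in zip(header, cells):
            if GPU_NAME.match(col_name) and NVLINK.match(cell):
                pairs.add(tuple(sorted((row_gpu, col_name))))
    return sorted(pairs)


# Illustrative matrix (an assumption, not real instance output).
SAMPLE = """\
        GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      NV2     SYS     0-23
GPU1    NV2      X      SYS     0-23
GPU2    SYS     SYS      X      24-47
"""

print(nvlink_pairs(SAMPLE))  # [('GPU0', 'GPU1')]
```

On an instance, pass the result of read_topology() to nvlink_pairs in place of SAMPLE. Pairs that are not listed are connected over PCIe or the system interconnect rather than NVLink.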

Related tools

Alibaba Cloud provides the following tools that allow you to use GPU resources more efficiently:
  • AIACC-Training: an AI accelerator developed by Alibaba Cloud to improve training performance. For more information, see Automatically install AIACC-Training.
  • AIACC-Inference: an AI accelerator developed by Alibaba Cloud to improve inference performance. For more information, see Automatically install AIACC-Inference.
  • cGPU: a technology that isolates GPU resources so that multiple containers can share a graphics card. For more information, see What is the cGPU service?
  • FastGPU: a tool provided by Alibaba Cloud to build AI computing tasks. It provides interfaces and command lines for building AI computing tasks on Alibaba Cloud IaaS resources. For more information, see What is FastGPU?