Introduction to ACK clusters for heterogeneous computing
ACK supports unified scheduling and operational management of heterogeneous resources, such as GPUs, Application-Specific Integrated Circuits (ASICs), and remote direct memory access (RDMA), which improves cluster resource utilization. The following table describes the heterogeneous computing clusters and features supported by ACK.
Heterogeneous resource | Description |
GPU | ACK allows you to create clusters that contain NVIDIA T4, P100, V100, and A100 GPUs.
ACK supports resource requests for individual GPUs.
ACK supports auto scaling based on GPU metrics.
ACK supports GPU sharing and computing power isolation: The GPU sharing feature developed by Alibaba Cloud allows multiple model inference applications to run on the same GPU at the same time, which significantly reduces costs. With the cGPU solution provided by Alibaba Cloud, GPU memory and computing power are isolated without the need to modify application containers, which improves application stability.
The following list describes the supported GPU allocation policies.
GPU sharing on a one-pod-one-GPU basis: This policy is commonly used in model inference scenarios.
GPU sharing on a one-pod-multi-GPU basis: This policy is commonly used in distributed model training scenarios.
Binpack allocation policy: If you use the binpack policy, the system preferentially places multiple pods on the same GPU. This policy is suitable for scenarios where high GPU utilization must be guaranteed.
Spread allocation policy: If you use the spread policy, the system attempts to allocate one GPU to each pod. This policy is suitable for scenarios where the high availability of GPUs must be guaranteed.
ACK supports topology-aware GPU scheduling: This feature retrieves the topology of heterogeneous resources from nodes and enables the scheduler to make scheduling decisions based on node topology information, such as NVLink, Peripheral Component Interconnect Express (PCIe) switches, QuickPath Interconnect (QPI), and RDMA NICs. This helps the scheduler choose placements that deliver better performance.
ACK supports GPU resource monitoring: This feature collects the metrics of nodes and applications, detects software and hardware exceptions on devices and sends alerts, and can be used to monitor both dedicated GPUs and shared GPUs.
|
ASIC | ACK allows you to create clusters that contain NETINT ASIC devices and supports resource requests for individual NETINT ASIC cards. |
eRDMA | ACK allows you to create clusters that contain eRDMA devices. You can use Arena to submit jobs that require high bandwidth, such as distributed deep learning training jobs, to eRDMA devices.
|
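The difference between the binpack and spread allocation policies described in the table can be illustrated with a small simulation. This is a minimal sketch, not the ACK scheduler's actual implementation: it assumes pods request GPU memory in GiB and that every GPU has the same capacity, and the function names `allocate` and `place_pods` are hypothetical.

```python
from typing import List, Optional


def allocate(gpu_free_mem: List[int], request: int, policy: str) -> Optional[int]:
    """Pick a GPU index for a pod that requests `request` GiB of GPU memory.

    binpack: choose the fitting GPU with the least free memory (pack pods together).
    spread:  choose the fitting GPU with the most free memory (keep pods apart).
    Returns None when no GPU can fit the request. Ties go to the lowest index.
    """
    if policy not in ("binpack", "spread"):
        raise ValueError(f"unknown policy: {policy}")
    candidates = [i for i, free in enumerate(gpu_free_mem) if free >= request]
    if not candidates:
        return None
    if policy == "binpack":
        chosen = min(candidates, key=lambda i: (gpu_free_mem[i], i))
    else:
        chosen = max(candidates, key=lambda i: (gpu_free_mem[i], -i))
    gpu_free_mem[chosen] -= request  # reserve the memory on the chosen GPU
    return chosen


def place_pods(num_gpus: int, gpu_mem: int, requests: List[int], policy: str) -> List[Optional[int]]:
    """Place pods one by one and return the GPU index chosen for each pod."""
    free = [gpu_mem] * num_gpus
    return [allocate(free, r, policy) for r in requests]


if __name__ == "__main__":
    # Four pods of 4 GiB each on a node with four 16 GiB GPUs (illustrative numbers).
    print("binpack:", place_pods(4, 16, [4, 4, 4, 4], "binpack"))
    print("spread: ", place_pods(4, 16, [4, 4, 4, 4], "spread"))
```

With these illustrative numbers, binpack places all four pods on the first GPU and leaves the other three idle for larger requests, while spread gives each pod its own GPU so that a single GPU failure affects only one pod.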
GPU instance types supported by ACK
ACK supports multiple GPU-accelerated compute-optimized instance families. To add GPU nodes to an ACK cluster, select an instance type from one of the following Elastic Compute Service (ECS) instance families.
gn8v and gn8v-tee, GPU-accelerated compute-optimized instance families
gn8is, GPU-accelerated compute-optimized instance family
gn7e, GPU-accelerated compute-optimized instance family
gn7i, GPU-accelerated compute-optimized instance family
gn7, GPU-accelerated compute-optimized instance family
gn6i, GPU-accelerated compute-optimized instance family
gn6e, GPU-accelerated compute-optimized instance family
gn6v, GPU-accelerated compute-optimized instance family
gn5i, GPU-accelerated compute-optimized instance family
gn5, GPU-accelerated compute-optimized instance family
ebmgn8v, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn8is, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn7e, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn7i, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn7, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn6e, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn6v, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn6i, GPU-accelerated compute-optimized ECS Bare Metal Instance family
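When node pools are created by scripts, it can help to check an instance type against the list above before adding the node. The following is a minimal sketch under stated assumptions: it assumes instance type names follow the pattern `ecs.<family>[-<suffix>].<size>`, the function name `supported_gpu_family` is hypothetical, and the instance type strings in the usage lines are illustrative examples rather than a verified catalog.

```python
from typing import Optional

# Instance families copied from the list above.
SUPPORTED_GPU_FAMILIES = {
    "gn8v", "gn8v-tee", "gn8is", "gn7e", "gn7i", "gn7",
    "gn6i", "gn6e", "gn6v", "gn5i", "gn5",
    "ebmgn8v", "ebmgn8is", "ebmgn7e", "ebmgn7i", "ebmgn7",
    "ebmgn6e", "ebmgn6v", "ebmgn6i",
}


def supported_gpu_family(instance_type: str) -> Optional[str]:
    """Return the supported GPU family of an instance type, or None.

    Assumes names look like "ecs.<family>[-<suffix>].<size>".
    """
    parts = instance_type.split(".")
    if len(parts) != 3 or parts[0] != "ecs":
        return None
    token = parts[1]
    # A family matches if the middle token equals it or continues with "-<suffix>".
    matches = [f for f in SUPPORTED_GPU_FAMILIES
               if token == f or token.startswith(f + "-")]
    # Prefer the longest match so that "gn8v-tee" is not reported as "gn8v".
    return max(matches, key=len) if matches else None


if __name__ == "__main__":
    for name in ("ecs.gn7i-c8g1.2xlarge", "ecs.ebmgn7e.32xlarge", "ecs.g8a.large"):
        print(name, "->", supported_gpu_family(name))
```

Matching on the longest family name avoids misclassifying families whose names contain a hyphen, such as gn8v-tee.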
ASIC instance types supported by ACK
If you want to add ASIC nodes to an ACK cluster, you can select the instance type ecs.video-trans.26xhevc.
eRDMA instance types supported by ACK
ACK supports multiple instance families that support eRDMA. You can select from the ECS instance families listed below. For more information, see Enable eRDMA on an enterprise-level instance and Enable eRDMA on a GPU-accelerated instance.
g8a, general-purpose instance family
c8a, compute-optimized instance family
r8a, memory-optimized instance family
g8i, general-purpose instance family
c8i, compute-optimized instance family
r8i, memory-optimized instance family
g8ae, performance-enhanced general-purpose instance family
c8ae, performance-enhanced compute-optimized instance family
r8ae, performance-enhanced memory-optimized instance family
g8y, general-purpose instance family
c8y, compute-optimized instance family
r8y, memory-optimized instance family
i4, instance family with local SSDs
gn8is, GPU-accelerated compute-optimized instance family
GPU-accelerated compute-optimized instance families (gn, ebm, and scc series)
ebmgn8is, GPU-accelerated compute-optimized ECS Bare Metal Instance family
ebmgn8v, GPU-accelerated compute-optimized ECS Bare Metal Instance family