Overview of heterogeneous computing clusters - Container Service for Kubernetes

Alibaba Cloud Container Service for Kubernetes (ACK) manages and schedules heterogeneous resources such as GPUs, ASICs, and eRDMA to improve cluster utilization.

Supported heterogeneous resources

ACK provides unified scheduling and management of heterogeneous resources, such as GPUs, Application-Specific Integrated Circuits (ASICs), and elastic Remote Direct Memory Access (eRDMA).

Heterogeneous resource	Description
GPU	Create clusters with mainstream GPU cards such as T4, P100, and V100. Request individual GPUs as a resource. Auto scale based on GPU metrics. GPU sharing and computing power isolation: Alibaba Cloud GPU sharing runs multiple inference workloads on a single GPU to reduce costs, and cGPU isolates GPU memory and computing power without container modifications to improve application stability. Supported allocation policies: (1) single-pod-single-GPU sharing, commonly used for model inference; (2) single-pod-multi-GPU sharing, commonly used in distributed training development; (3) binpack, which preferentially schedules multiple pods to the same GPU card to improve utilization; (4) spread, which distributes pods across GPU cards where possible for high availability (HA). Topology-aware GPU scheduling: the scheduler retrieves the resource topology from nodes to optimize placement across NVLink, PCIe Switch, QPI, and RDMA NICs. GPU resource monitoring at the node and application levels, with automatic exception detection and alerting for dedicated and shared GPUs.
ASIC	ACK supports clusters with NETINT ASIC devices and resource requests for individual ASIC cards.
eRDMA	Create clusters with eRDMA devices. Submit high-bandwidth jobs, such as distributed deep learning training, to eRDMA devices through Arena.
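
The difference between the binpack and spread policies described in the GPU row can be illustrated with a small simulation. This is only a sketch of the general idea, not the ACK scheduler: the function names `pick_gpu` and `place`, the use of free GPU memory (in GiB) as the only placement criterion, and the tie-breaking by lowest GPU index are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def pick_gpu(free_mem: Dict[int, int], request: int, policy: str) -> Optional[int]:
    """Return the index of the GPU chosen for one pod, or None if no GPU fits.

    free_mem maps a (hypothetical) GPU index to its free memory in GiB.
    """
    # Keep only GPUs that can still satisfy the request.
    candidates = {gpu: free for gpu, free in free_mem.items() if free >= request}
    if not candidates:
        return None
    if policy == "binpack":
        # Prefer the fullest GPU that still fits, so pods pack onto one card.
        return min(candidates, key=lambda gpu: (candidates[gpu], gpu))
    if policy == "spread":
        # Prefer the emptiest GPU, so pods spread across cards.
        return max(candidates, key=lambda gpu: (candidates[gpu], -gpu))
    raise ValueError(f"unknown policy: {policy}")


def place(requests: List[int], capacity: Dict[int, int], policy: str) -> List[Optional[int]]:
    """Place pods one by one and return the GPU index chosen for each pod."""
    free = dict(capacity)  # copy so the caller's capacity map is not modified
    placement = []
    for request in requests:
        gpu = pick_gpu(free, request, policy)
        placement.append(gpu)
        if gpu is not None:
            free[gpu] -= request
    return placement


if __name__ == "__main__":
    gpus = {0: 16, 1: 16}   # two GPUs with 16 GiB each (illustrative values)
    pods = [4, 4, 4]        # three pods requesting 4 GiB each
    print("binpack:", place(pods, gpus, "binpack"))
    print("spread: ", place(pods, gpus, "spread"))
```

With these inputs, binpack keeps all three pods on one card and leaves the second card free for larger jobs, while spread alternates between the two cards so that losing one card affects fewer pods.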

GPU instance types supported by ACK

Select from the ECS instance families below to add GPU nodes to an ACK cluster.

Confidential computing instances are not supported. The names of these instance types contain -tee, for example ecs.gn8v-tee.4xlarge.
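
When instance types are chosen by a script rather than in the console, the naming rule above can be checked before nodes are added. The following is a minimal sketch that relies only on that rule: the helper name `is_confidential` and the assumption that -tee always appears in the instance family part of the name (the second dot-separated field) are illustrative and not part of any ACK API. The second instance type in the usage lines is just a sample input string.

```python
def is_confidential(instance_type: str) -> bool:
    """Return True if the instance type name looks like a confidential computing type.

    Assumes the pattern ecs.<family>.<size>, with "-tee" appearing in <family>,
    as in ecs.gn8v-tee.4xlarge.
    """
    parts = instance_type.split(".")
    # The family is the second field; names without one cannot match.
    return len(parts) >= 2 and "-tee" in parts[1]


if __name__ == "__main__":
    candidates = ["ecs.gn8v-tee.4xlarge", "ecs.gn6i-c4g1.xlarge"]
    # Keep only the types that pass the naming check.
    usable = [t for t in candidates if not is_confidential(t)]
    print(usable)
```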

Note

You cannot select vGPU-accelerated instances as cluster nodes in the ACK console. For details, see "Does ACK support vGPU-accelerated instances?"

ASIC instance types supported by ACK

To add ASIC nodes to an ACK cluster, select the instance type ecs.video-trans.26xhevc.

eRDMA instance types supported by ACK

Select from the ECS instance families below to add eRDMA nodes to an ACK cluster. For setup steps, see "Enable eRDMA on an enterprise-level instance" and "Enable eRDMA on a GPU-accelerated instance".