Install the ack-ai-installer component in your ACK cluster to enable topology-aware GPU scheduling. This feature selects the GPU combination on a node that provides the optimal training speed based on the physical topology of GPU devices.
Before you begin
Make sure that you have:
An ACK managed cluster with the instance type set to Elastic GPU Service.
A kubeconfig file for the cluster and a kubectl client connected to the cluster.
Cluster nodes that meet the following version and OS requirements.
Version requirements
| Component | Requirement |
|---|---|
| Kubernetes | 1.18.8 or later |
| NVIDIA driver | 418.87.01 or later |
| NVIDIA Collective Communications Library (NCCL) | 2.7 or later |
| GPU | V100 |
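You can check the driver and library requirements from a shell on a GPU node. The following is a minimal sketch that compares versions with GNU `sort -V`; the minimum versions come from the table above, and the placeholder driver value is for illustration only (on a real node you would read it from `nvidia-smi`):

```shell
# Returns success if version $1 is at least version $2 (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# On a GPU node, read the installed driver version, for example:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
driver="450.80.02"   # placeholder value for illustration

if version_ge "$driver" "418.87.01"; then
  echo "NVIDIA driver meets the 418.87.01 minimum"
else
  echo "NVIDIA driver is too old"
fi
```

The same `version_ge` helper works for the Kubernetes and NCCL minimums, for example `version_ge "$(nccl_version)" "2.7"` once you have obtained the installed NCCL version.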
Supported operating systems
CentOS 7.6, CentOS 7.7
Ubuntu 16.04, Ubuntu 18.04
Alibaba Cloud Linux 2, Alibaba Cloud Linux 3
Install the component from Cloud-native AI Suite
1. Log on to the ACK console. In the left-side navigation pane, click Clusters.
2. On the Clusters page, find your cluster and click its name.
3. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.
4. On the Cloud-native AI Suite page, click Deploy.
5. In the Scheduling section, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling), and then click Deploy Cloud-native AI Suite. For more information about the parameters, see Install the cloud-native AI suite.
Verify that ack-ai-installer appears in the Components list on the Cloud-native AI Suite page.
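You can also confirm from the command line that the component's pods are up. The sketch below wraps the check in a small helper so it can be shown with a placeholder listing; the `kube-ai` namespace and the `ack-ai-installer` pod name prefix are assumptions to verify against your cluster:

```shell
# On your cluster you would list the component's pods with, for example:
#   kubectl get pods -n kube-ai | grep ack-ai-installer
#
# Helper that checks a `kubectl get pods` listing for a Running pod
# that belongs to the component:
has_running_pod() {
  printf '%s\n' "$1" | grep -q 'ack-ai-installer.*Running'
}

# Placeholder listing for illustration:
listing="ack-ai-installer-7f9c5   1/1   Running   0   2m"
if has_running_pod "$listing"; then
  echo "ack-ai-installer is running"
fi
```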
Note: If you have already installed a component of the Cloud-native AI Suite, find ack-ai-installer in the Components list and click Deploy in the Actions column.
What to do next
After you install the component, configure topology-aware GPU scheduling policies for your workloads. For more information, see GPU topology-aware scheduling.
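Topology-aware GPU scheduling is typically enabled per node before workloads can request it. The fragment below is a sketch of the node label involved; the label key and value are taken to be `ack.node.gpu.schedule: topology` as an assumption — confirm them in the GPU topology-aware scheduling documentation for your ACK version:

```yaml
# Node label that enables topology-aware GPU scheduling on a node.
# Apply it with, for example:
#   kubectl label node <your-node-name> ack.node.gpu.schedule=topology
# (label key/value assumed; verify against the GPU topology-aware
# scheduling documentation)
metadata:
  labels:
    ack.node.gpu.schedule: "topology"
```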