Install the ack-ai-installer component in your ACK cluster to enable topology-aware GPU scheduling. This feature selects the GPU combination on a node that provides the optimal training speed based on the physical topology of GPU devices.
Before you begin
Make sure that you have:
An ACK managed cluster with the instance type set to Elastic GPU Service.
A kubeconfig file for the cluster and a kubectl client connected to the cluster.
Cluster nodes that meet the following version and OS requirements.
Version requirements
| Component | Requirement |
|---|---|
| Kubernetes | 1.18.8 or later |
| NVIDIA driver | 418.87.01 or later |
| NVIDIA Collective Communications Library (NCCL) | 2.7 or later |
| GPU | V100 |
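You can check the driver and library requirements from a shell on a GPU node. The following is a minimal sketch that compares versions with GNU `sort -V`; the minimum versions come from the table above, and the placeholder driver value is for illustration only (on a real node you would read it from `nvidia-smi`):

```shell
# Returns success if version $1 is at least version $2 (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# On a GPU node, read the installed driver version, for example:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
driver="450.80.02"   # placeholder value for illustration

if version_ge "$driver" "418.87.01"; then
  echo "NVIDIA driver meets the 418.87.01 minimum"
else
  echo "NVIDIA driver is too old"
fi
```

The same `version_ge` helper works for the Kubernetes and NCCL minimums, for example `version_ge "$(nccl_version)" "2.7"` once you have obtained the installed NCCL version.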
Supported operating systems
CentOS 7.6, CentOS 7.7
Ubuntu 16.04, Ubuntu 18.04
Alibaba Cloud Linux 2, Alibaba Cloud Linux 3
Install the component from Cloud-native AI Suite
1. Log on to the ACK console. In the left-side navigation pane, click Clusters.
2. On the Clusters page, find your cluster and click its name.
3. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.
4. On the Cloud-native AI Suite page, click Deploy.
5. In the Scheduling section, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling), and then click Deploy Cloud-native AI Suite. For more information about the parameters, see Install the cloud-native AI suite.
Verify that ack-ai-installer appears in the Components list on the Cloud-native AI Suite page.
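You can also confirm from the command line that the component's pods are up. The sketch below wraps the check in a small helper so it can be shown with a placeholder listing; the `kube-ai` namespace and the `ack-ai-installer` pod name prefix are assumptions to verify against your cluster:

```shell
# On your cluster you would list the component's pods with, for example:
#   kubectl get pods -n kube-ai | grep ack-ai-installer
#
# Helper that checks a `kubectl get pods` listing for a Running pod
# that belongs to the component:
has_running_pod() {
  printf '%s\n' "$1" | grep -q 'ack-ai-installer.*Running'
}

# Placeholder listing for illustration:
listing="ack-ai-installer-7f9c5   1/1   Running   0   2m"
if has_running_pod "$listing"; then
  echo "ack-ai-installer is running"
fi
```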
Note: If you have already installed a component of the Cloud-native AI Suite, find ack-ai-installer in the Components list and click Deploy in the Actions column.
What to do next
After you install the component, configure topology-aware GPU scheduling policies for your workloads. For more information, see GPU topology-aware scheduling.
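Topology-aware GPU scheduling is typically enabled per node before workloads can request it. The fragment below is a sketch of the node label involved; the label key and value are taken to be `ack.node.gpu.schedule: topology` as an assumption — confirm them in the GPU topology-aware scheduling documentation for your ACK version:

```yaml
# Node label that enables topology-aware GPU scheduling on a node.
# Apply it with, for example:
#   kubectl label node <your-node-name> ack.node.gpu.schedule=topology
# (label key/value assumed; verify against the GPU topology-aware
# scheduling documentation)
metadata:
  labels:
    ack.node.gpu.schedule: "topology"
```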