Add GPU Nodes to ACK Edge Clusters for On-Premises AI Inference

Edge node pools in ACK Edge clusters allow you to manage on-premises GPU resources. This topic describes how to add GPU nodes to an edge node pool in an ACK Edge cluster.

Prerequisites

An ACK Edge cluster has been created.
The GPU driver is installed on each node that you want to add. For more information about supported driver versions, see Supported NVIDIA driver versions for ACK.
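One way to confirm the driver prerequisite on a node before registration is to query nvidia-smi, which ships with the NVIDIA driver. The following is a minimal sketch and not part of the ACK tooling. The function names are hypothetical, and the script assumes that nvidia-smi is on the PATH when the driver is installed.

```python
import shutil
import subprocess


def parse_driver_query(output: str) -> list[tuple[str, str]]:
    """Parse 'driver_version, name' CSV lines into (driver_version, gpu_name) pairs."""
    gpus = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first comma only, because GPU names may contain spaces.
        version, name = (field.strip() for field in line.split(",", 1))
        gpus.append((version, name))
    return gpus


def installed_gpus() -> list[tuple[str, str]]:
    """Return detected GPUs, or an empty list if no working driver is found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means the driver is not loaded correctly.
    if result.returncode != 0:
        return []
    return parse_driver_query(result.stdout)


if __name__ == "__main__":
    gpus = installed_gpus()
    if not gpus:
        print("No NVIDIA driver detected; install a supported driver first.")
    for version, name in gpus:
        print(f"{name}: driver {version}")
```

Compare the reported driver version against the versions listed in Supported NVIDIA driver versions for ACK. The script only reports what is installed and does not judge whether a version is supported.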

Limits

Ensure that your cluster has a sufficient node quota. To add more nodes, submit a request in the Quota Center to increase the quota. For more information about the quota limits of ACK Edge clusters, see Quotas and limits.
GPU nodes must be able to access specific domain names during registration. Ensure that the security group of the node allows access to these domain names. For more information, see Domain names and IP CIDR blocks for node registration.
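The domain name requirement can be checked from the node before you run the registration script. The following is a minimal sketch that attempts a TCP connection to each host. The function names are hypothetical, port 443 is an assumption (HTTPS), and the actual list of domain names must be taken from the referenced topic.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def unreachable(hosts, probe=can_connect):
    """Return the hosts for which the probe fails, preserving input order."""
    return [host for host in hosts if not probe(host)]
```

Calling unreachable() with the required domain names returns the ones that the node cannot reach. An empty list suggests that outbound access is open for those names on the assumed port. The probe parameter exists so that the check can be replaced, for example with one that uses a different port.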

Procedure

Clusters of version 1.26 or later

Starting from version 1.26, ACK Edge clusters automatically detect the GPU model and install the required components when an NVIDIA GPU node is registered. You do not need to configure the gpuVersion parameter.

The process of adding GPU nodes is the same as that for adding other edge nodes. For more information, see Add edge nodes.

Note

ACK Edge clusters of version 1.26 and later support the full range of NVIDIA production-grade GPUs, such as the Tesla series, Hopper (H-series), Ampere (A-series), and Ada Lovelace (L-series).

Clusters earlier than version 1.26

When you add GPU nodes to an ACK Edge cluster that runs a version earlier than 1.26, you must select a GPU model from the following list. If you need a different GPU model, submit a ticket.

System architecture	GPU model	Edge Kubernetes cluster version
AMD64/x86_64	Nvidia_Tesla_T4	≥1.16.9-aliyunedge.1
AMD64/x86_64	Nvidia_Tesla_P4	≥1.16.9-aliyunedge.1
AMD64/x86_64	Nvidia_Tesla_P100	≥1.16.9-aliyunedge.1
AMD64/x86_64	Nvidia_Tesla_V100	≥1.18.8-aliyunedge.1
AMD64/x86_64	Nvidia_Tesla_A10	≥1.20.11-aliyunedge.1
AMD64/x86_64	Nvidia_L40	≥1.26.3-aliyun.1
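The table can be expressed as a lookup that checks whether a GPU model is available for a given cluster version. The following is a minimal sketch. The function names are hypothetical, and the comparison is a simplification that uses only the major, minor, and patch numbers and ignores the suffix after the hyphen.

```python
import re

# Minimum cluster versions copied from the table above.
MIN_VERSION = {
    "Nvidia_Tesla_T4": "1.16.9-aliyunedge.1",
    "Nvidia_Tesla_P4": "1.16.9-aliyunedge.1",
    "Nvidia_Tesla_P100": "1.16.9-aliyunedge.1",
    "Nvidia_Tesla_V100": "1.18.8-aliyunedge.1",
    "Nvidia_Tesla_A10": "1.20.11-aliyunedge.1",
    "Nvidia_L40": "1.26.3-aliyun.1",
}


def core_version(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '1.20.11-aliyunedge.1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(gpu_model: str, cluster_version: str) -> bool:
    """Return True if the model is listed and the cluster meets its minimum version."""
    minimum = MIN_VERSION.get(gpu_model)
    if minimum is None:
        # Unlisted models require a ticket, as described above.
        return False
    return core_version(cluster_version) >= core_version(minimum)
```

For example, is_supported("Nvidia_Tesla_V100", "1.16.9-aliyunedge.1") returns False because the V100 entry requires 1.18.8 or later. A model that is not in the table also returns False.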

Log on to the Container Service Management Console. In the navigation pane on the left, click Clusters.
On the Clusters page, click the name of your cluster. In the navigation pane on the left, click Nodes > Node Pools.
On the Node Pools page, find the node pool that you want to manage, and in the Actions column, open the menu and choose Add Existing Node.
On the Add Node page, click Manual to add an existing instance.
Click Next. On the Instance Information page, configure the parameters for node registration. For more information about the parameters, see Parameter list.
Note
- When you generate the node registration script, set the gpuVersion parameter to one of the GPU models listed in the preceding table.
- After this parameter is configured, the registration tool automatically installs nvidia-containerd-runtime. For more information, see nvidia-containerd-runtime.
After you complete the configuration, click Next. On the Complete page, click Copy, and then paste and execute the script on your edge node.

The following figure shows that the node is added successfully.
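Besides the console view, the result can be checked from the Kubernetes API by inspecting the Node object. The following is a minimal sketch that reads the JSON printed by kubectl get node <node-name> -o json. The function names are hypothetical, and the resource name nvidia.com/gpu is an assumption based on the standard NVIDIA device plugin. Your cluster may expose GPUs under a different resource name.

```python
import json


def allocatable_gpus(node: dict, resource: str = "nvidia.com/gpu") -> int:
    """Return the allocatable GPU count reported in a Node object, or 0 if absent."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, 0))


def is_ready(node: dict) -> bool:
    """Return True if the Node object has a Ready condition with status True."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def summarize(node_json: str) -> str:
    """Build a one-line summary from the JSON text of a single Node object."""
    node = json.loads(node_json)
    name = node.get("metadata", {}).get("name", "<unknown>")
    return f"{name}: ready={is_ready(node)}, gpus={allocatable_gpus(node)}"
```

Pass the kubectl output for the new node to summarize(). A node that reports ready=True and a GPU count greater than zero has joined the cluster and advertises its GPUs to the scheduler, under the assumptions stated above.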

References

If you encounter issues while you add edge nodes, see Troubleshoot edge node issues.
To remove unused edge nodes, see Remove edge nodes.
To enable autonomous operation for edge nodes so that workloads can continue to run stably during network disconnections between the cloud and the edge, see Configure edge node autonomy.