ACK nodes require the cGPU module to support GPU sharing and scheduling. This page shows how to upgrade the cGPU module on a node using the ACK console and kubectl.
Prerequisites
Before you begin, ensure that you have:
- An ACK cluster with GPU nodes running cGPU
- Access to the ACK console
- `kubectl` configured to connect to the cluster
- A backup of any data on the node's system disk, if the disk contains data
Step 1: Upgrade the cluster component
The upgrade method depends on your cluster type.
| Cluster type | Component | How to upgrade |
|---|---|---|
| ACK managed cluster Pro, ACK Edge cluster Pro | ack-ai-installer | See Upgrade the shared GPU scheduling component |
| ACK dedicated cluster | ack-cgpu | Follow the steps below |
To upgrade ack-cgpu on an ACK dedicated cluster:
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of the cluster you want to update. In the left navigation pane, choose Applications > Helm.
3. On the Helm page, find the ack-cgpu component. Click Update in the Actions column, select a Version, and then click OK.
Step 2: Upgrade existing nodes
Before upgrading nodes, note the following:
- Stop all GPU applications on the node.
- Upgrade one node first. After verifying that GPU applications run as expected, upgrade the remaining GPU nodes in batches.
- This method resets the node's system disk. Back up any data on the system disk before proceeding.
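Before you remove a node, you can confirm that no GPU applications are still scheduled on it. The sketch below assumes `kubectl` access to the cluster; `list_node_pods` is a hypothetical helper, not an ACK tool:

```shell
# Hypothetical helper: list every Pod currently scheduled on a node,
# so you can confirm GPU applications have been stopped before removal.
list_node_pods() {
  kubectl get po -A --field-selector "spec.nodeName=$1" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
}

# Usage (replace with your node name):
#   list_node_pods cn-beijing.192.168.XXX.XX1
```

Only DaemonSet Pods such as cgpu-installer should remain before you proceed.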
Remove and re-add the node
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of the cluster. In the left navigation pane, choose Nodes > Nodes.
3. On the Nodes page, select the cGPU node to upgrade and click Batch Remove. In the Remove Node dialog box, select Drain Node.
4. Re-add the removed node to the original node pool. For more information, see Add existing nodes to a cluster.
   Important: Select the automatic node addition method. The node is not reset if you add it manually.
Verify the upgrade
After re-adding the node, run the following commands to confirm the cGPU module is updated.
1. Find the cgpu-installer Pod for the newly added node:

   ```shell
   kubectl get po -l name=cgpu-installer -n kube-system -o wide
   ```

   All Pods should show the Running status. Example output:

   ```
   NAME                   READY   STATUS    RESTARTS   AGE    IP                NODE                         NOMINATED NODE   READINESS GATES
   cgpu-installer-*****   1/1     Running   0          4d2h   192.168.XXX.XX1   cn-beijing.192.168.XXX.XX1   <none>           <none>
   cgpu-installer-**2     1/1     Running   0          4d2h   192.168.XXX.XX2   cn-beijing.192.168.XXX.XX2   <none>           <none>
   cgpu-installer-**3     1/1     Running   0          4d2h   192.168.XXX.XX3   cn-beijing.192.168.XXX.XX3   <none>           <none>
   ```

2. Access the cgpu-installer Pod:

   ```shell
   kubectl exec -ti cgpu-installer-xxxxx -n kube-system -- bash
   ```

3. Check the current cGPU version:

   ```shell
   nsenter -t 1 -i -p -n -u -m -- cat /proc/cgpu_km/version
   ```

   Example output:

   ```
   1.5.16
   ```

   For the latest available cGPU version, see ack-ai-installer.
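If you upgraded nodes in batches, the per-Pod version query above can be looped to check every node at once. This is a sketch that assumes cluster access; `check_cgpu_versions` is a hypothetical helper, not an ACK tool:

```shell
# Hypothetical helper: print "<node>: <cGPU version>" for every
# cgpu-installer Pod in the cluster.
check_cgpu_versions() {
  kubectl get po -n kube-system -l name=cgpu-installer \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.metadata.name}{"\n"}{end}' |
  while read -r node pod; do
    # Same version query as above, run non-interactively in each Pod.
    ver=$(kubectl exec -n kube-system "$pod" -- \
      nsenter -t 1 -i -p -n -u -m -- cat /proc/cgpu_km/version)
    echo "$node: $ver"
  done
}

# Usage: check_cgpu_versions
```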
cGPU version compatibility
NVIDIA driver compatibility
| cGPU version | Compatible NVIDIA drivers |
|---|---|
| 1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9, 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3 | 460, 470, 510, 515, 525, 535, 550, 560, 565, 570, 575 series |
| 1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5 | 460 series; 470 series <= 470.161.03; 510 series <= 510.108.03; 515 series <= 515.86.01; 525 series <= 525.89.03. Not supported: 535, 550, 560, 565, 570, 575 series |
| 1.0.3, 0.8.17, 0.8.13 | 460 series; 470 series <= 470.161.03. Not supported: 510, 515, 525, 535, 550, 560, 565, 570, 575 series |
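When matching a node against this table, the relevant value is the driver's major series. A minimal sketch, assuming `nvidia-smi` is installed on the node (`driver_branch` is a hypothetical helper):

```shell
# Hypothetical helper: print the major driver series (e.g. 535 for
# 535.129.03) of the node's first GPU, for comparison with the table.
driver_branch() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader |
    head -n 1 | cut -d. -f1
}

# Usage on a GPU node: driver_branch
```

For the rows that cap support at a specific patch release (for example, 470 series <= 470.161.03), compare the full version string rather than the series alone.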
Instance family compatibility
| cGPU version | Compatible instance families |
|---|---|
| 1.5.20, 1.5.19 | gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v; gn8ia / ebmgn8ia; ebmgn9t |
| 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9 | gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v; gn8ia / ebmgn8ia. Not supported: ebmgn9t |
| 1.5.8, 1.5.7 | gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t; gn8is / gn8v / ebmgn8is / ebmgn8v. Not supported: gn8ia / ebmgn8ia, ebmgn9t |
| 1.5.6, 1.5.5 | gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e; gn8t / ebmgn8t. Not supported: gn8is / gn8v / ebmgn8is / ebmgn8v, gn8ia / ebmgn8ia, ebmgn9t |
| 1.5.3, 1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3 | gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e; gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e. Not supported: gn8t / ebmgn8t, gn8is / gn8v / ebmgn8is / ebmgn8v, gn8ia / ebmgn8ia, ebmgn9t |
| 0.8.17, 0.8.13 | gn6i / gn6e / gn6v / gn6t / ebmgn6i / ebmgn6t / ebmgn6e. Not supported: gn7i / gn7 / gn7e / ebmgn7i / ebmgn7e, gn8t / ebmgn8t, gn8is / gn8v / ebmgn8is / ebmgn8v, gn8ia / ebmgn8ia, ebmgn9t |
nvidia-container-toolkit compatibility
| cGPU version | Compatible nvidia-container-toolkit |
|---|---|
| 1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9, 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3, 1.5.2, 1.0.10 | <= 1.10; 1.11 ~ 1.17 |
| 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3, 0.8.17, 0.8.13 | <= 1.10. Not supported: 1.11 ~ 1.17 |
Kernel version compatibility
| cGPU version | Compatible kernel versions |
|---|---|
| 1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9 | kernel 3.x, 4.x, 5.x <= 5.15 |
| 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3 | kernel 3.x, 4.x, 5.x <= 5.10 |
| 1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3 | kernel 3.x, 4.x, 5.x <= 5.1 |
| 0.8.17 | kernel 3.x, 4.x, 5.x <= 5.0 |
| 0.8.13, 0.8.12, 0.8.10 | kernel 3.x, 4.x only (kernel 5.x not supported) |
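The kernel ranges above can be checked mechanically against the output of `uname -r`. A sketch under the assumption that only the major.minor pair matters, as the table suggests (`kernel_in_range` is a hypothetical helper):

```shell
# Hypothetical helper: succeed if a kernel version string is 3.x, 4.x,
# or 5.x up to the given 5.x minor ceiling from the table.
kernel_in_range() {
  ver=$1
  ceiling=$2
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  case "$major" in
    3|4) return 0 ;;
    5) [ "$minor" -le "$ceiling" ] && return 0 ;;
  esac
  return 1
}

# Example: is kernel 5.10 within the "5.x <= 5.15" range for cGPU 1.5.9+?
kernel_in_range "5.10" 15 && echo "supported"
kernel_in_range "5.16" 15 || echo "not supported"
# Real check on a node: kernel_in_range "$(uname -r)" 15
```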