Nodes in an ACK cluster must have the cGPU module installed to support GPU sharing and scheduling. This topic describes how to upgrade the cGPU module on a node using commands or the console.
Step 1: Upgrade components
Cluster type | Component upgrade method |
| To upgrade the ack-ai-installer component, see Upgrade the shared GPU scheduling component. |
ACK dedicated cluster | To upgrade the ack-cgpu component, perform the following steps:
|
Step 2: Upgrade existing nodes
Stop the GPU applications on the node during the upgrade.
Upgrade one node first. After you verify that the GPU applications run as expected, upgrade other GPU nodes in batches.
Removing and re-adding a node resets the system disk of the node. If the system disk contains data that you need, back it up first.
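The rollout order described above (one node first, then the remaining GPU nodes in batches) can be planned with a small helper. This is a minimal sketch, not part of ACK: `plan_batches` is a hypothetical function name, the batch size is a value you choose, and the node names are passed in by you. It only prints a plan and does not touch the cluster.

```shell
# plan_batches BATCH_SIZE NODE...
# Hypothetical helper: prints the first node as a single "canary" round,
# then groups the remaining nodes into rounds of BATCH_SIZE.
# Written for POSIX sh, so the variables are global to the shell.
plan_batches() {
  batch_size=$1
  shift
  # Reject a missing, non-numeric, or non-positive batch size.
  if ! [ "$batch_size" -ge 1 ] 2>/dev/null; then
    echo "batch size must be a positive integer" >&2
    return 1
  fi
  # Nothing to plan when no nodes are given.
  [ "$#" -gt 0 ] || return 0
  echo "canary: $1"
  shift
  round=1
  line=""
  count=0
  for node in "$@"; do
    line="${line:+$line }$node"
    count=$((count + 1))
    if [ "$count" -eq "$batch_size" ]; then
      echo "batch $round: $line"
      round=$((round + 1))
      line=""
      count=0
    fi
  done
  # Print the last, possibly smaller, batch.
  if [ -n "$line" ]; then
    echo "batch $round: $line"
  fi
}
```

For example, `plan_batches 2 node-a node-b node-c node-d` prints a canary round for `node-a`, then one batch with `node-b node-c` and one with `node-d`. Upgrade the canary node, verify that the GPU applications run as expected, and only then continue with the batches.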
1. Remove and re-add the node
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. In the left navigation pane, go to the Nodes page.
On the Nodes page, select the cGPU node to upgrade and click Batch Remove. In the Remove Node dialog box, select Drain Node.
Re-add the removed node to the original node pool. For more information, see Add existing nodes to a cluster.
Important: Select the automatic node addition method. The node is not reset if you add it manually.
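Before you remove a node as described above, you may want to confirm that no application pods are still scheduled on it. The following is a minimal sketch that uses standard `kubectl` field selectors. The function names are hypothetical, and it assumes that your GPU applications run outside the `kube-system` namespace. Adjust the selector if that does not hold for your cluster.

```shell
# workload_pods_on_node NODE_NAME
# Lists pods outside kube-system that are still scheduled on the node.
# Assumption: GPU applications do not run in kube-system.
workload_pods_on_node() {
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1,metadata.namespace!=kube-system" \
    -o wide
}

# node_is_idle NODE_NAME
# Returns 0 when no such pods remain, 1 when pods are still present,
# and 2 when the kubectl query itself fails.
node_is_idle() {
  pods=$(workload_pods_on_node "$1" 2>/dev/null) || return 2
  [ -z "$pods" ]
}
```

A possible use is `node_is_idle cn-beijing.192.168.XXX.XX1 && echo "safe to remove"`, with the node name replaced by the actual name shown on the Nodes page.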
2. Verify the result
Run the following command to query the cgpu-installer pod that corresponds to the newly added node:
kubectl get po -l name=cgpu-installer -n kube-system -o wide
Expected output:
NAME                   READY   STATUS    RESTARTS   AGE    IP                NODE                         NOMINATED NODE   READINESS GATES
cgpu-installer-*****   1/1     Running   0          4d2h   192.168.XXX.XX1   cn-beijing.192.168.XXX.XX1   <none>           <none>
cgpu-installer-**2     1/1     Running   0          4d2h   192.168.XXX.XX2   cn-beijing.192.168.XXX.XX2   <none>           <none>
cgpu-installer-**3     1/1     Running   0          4d2h   192.168.XXX.XX3   cn-beijing.192.168.XXX.XX3   <none>           <none>
Run the following command to access the pod named cgpu-installer-******. Replace cgpu-installer-xxxxx with the pod name from the preceding output:
kubectl exec -ti cgpu-installer-xxxxx -n kube-system -- bash
Run the following command to query the current cGPU version:
nsenter -t 1 -i -p -n -u -m -- cat /proc/cgpu_km/version
Sample output:
1.5.16
Note: For information about the latest cGPU version, see ack-ai-installer.
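When several nodes are upgraded, the manual check above can be repeated for every cgpu-installer pod. The following is a minimal sketch under stated assumptions: the function names are hypothetical, `sort -V` (version sort, available in GNU coreutils) is assumed to exist on your workstation, and the minimum version is a value that you supply. It reuses the label selector and the `nsenter` command from the steps above, without `-ti` because no interactive shell is needed.

```shell
# version_ge A B
# Succeeds when dotted version A is greater than or equal to B.
# Assumes a sort implementation that supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cgpu_versions
# Prints "<pod> <version>" for each cgpu-installer pod in kube-system.
cgpu_versions() {
  kubectl get po -l name=cgpu-installer -n kube-system -o name |
  while read -r pod; do
    # </dev/null keeps kubectl exec from consuming the pod list on stdin.
    version=$(kubectl exec "$pod" -n kube-system -- \
      nsenter -t 1 -i -p -n -u -m -- cat /proc/cgpu_km/version </dev/null)
    echo "$pod $version"
  done
}

# report_outdated MIN_VERSION
# Reads "<pod> <version>" lines on stdin and prints those below MIN_VERSION.
# A pod with an empty version (for example, module not loaded) is also printed.
report_outdated() {
  while read -r pod version; do
    version_ge "$version" "$1" || echo "$pod $version"
  done
}
```

A possible use is `cgpu_versions | report_outdated <minimum-version>`, where `<minimum-version>` is the cGPU version that you expect after the upgrade. An empty result suggests that every listed pod reports at least that version.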
cGPU version compatibility
NVIDIA driver compatibility
cGPU version | Compatible NVIDIA drivers |
1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9, 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3 | Supported:
|
1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5 | Supported:
Not supported:
|
1.0.3, 0.8.17, 0.8.13 | Supported:
Not supported:
|
Instance family compatibility
cGPU version | Compatible instance families |
1.5.20, 1.5.19 | Supported:
|
1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9 | Supported:
Not supported:
|
1.5.8, 1.5.7 | Supported:
Not supported:
|
1.5.6, 1.5.5 | Supported:
Not supported:
|
1.5.3, 1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3 | Supported:
Not supported:
|
0.8.17, 0.8.13 | Supported:
Not supported:
|
nvidia-container-toolkit compatibility
cGPU version | Compatible nvidia-container-toolkit |
1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9, 1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3, 1.5.2, 1.0.10 | Supported:
|
1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3, 0.8.17, 0.8.13 | Supported:
Not supported:
|
Kernel version compatibility
cGPU version | Compatible kernel versions |
1.5.20, 1.5.19, 1.5.18, 1.5.17, 1.5.16, 1.5.15, 1.5.13, 1.5.12, 1.5.11, 1.5.10, 1.5.9 | Supported:
|
1.5.8, 1.5.7, 1.5.6, 1.5.5, 1.5.3 | Supported:
|
1.5.2, 1.0.10, 1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.3 | Supported:
|
0.8.17 | Supported:
|
0.8.13, 0.8.12, 0.8.10 | Supported:
Not supported:
|