This topic describes how to upgrade the NVIDIA driver on GPU nodes, both when services are deployed on the nodes and when no services are deployed.

Prerequisites

Upgrade the NVIDIA driver on GPU nodes where services are deployed

  1. Run the following command to set the target node to unschedulable:
    kubectl cordon node-name
    Note
    • Currently, you can only upgrade the NVIDIA driver on worker nodes.
    • node-name must be in the format of your-region-name.node-id.
      • your-region-name represents the region where your cluster is deployed.
      • node-id indicates the ID of the ECS instance where the target node resides.
      You can run the following command to query the node-name.
      kubectl get node
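    The NAME column in the kubectl get node output is the node-name to use in these commands. A hypothetical example for a cluster in the cn-beijing region (your region, instance ID, and version will differ):
    NAME                        STATUS    ROLES     AGE       VERSION
    cn-beijing.i-2zexamplegpu   Ready     <none>    12d       v1.11.2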
  2. Run the following command to migrate Pods on the target node to other nodes:
    kubectl drain node-name --grace-period=120 --ignore-daemonsets=true
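    Optionally, confirm that only DaemonSet Pods are left on the node before you continue. A hypothetical check that uses a standard kubectl field selector (requires a reasonably recent kubectl; replace node-name as above):
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node-name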
  3. Run the following command to log on to the target node:
    ssh root@xxx.xxx.x.xx
  4. Run the following command to check the NVIDIA driver version:
    nvidia-smi
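    The driver version is shown in the first line of the nvidia-smi banner. Hypothetical, abridged output for a node that still runs the old driver:
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 384.111                Driver Version: 384.111                   |
    +-----------------------------------------------------------------------------+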
  5. Run the following commands to uninstall the existing driver:
    Note
    • The following commands use driver version 384.111 as an example.
    • If your current driver version is not 384.111, download the matching installer from the official NVIDIA website first and replace the file name in the commands.
    cd /tmp
    curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
    chmod u+x NVIDIA-Linux-x86_64-384.111.run
    ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
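    If the uninstaller reports that the driver is still in use, you can first check for loaded NVIDIA kernel modules and for processes that hold the GPU device files. A minimal sketch that uses standard Linux tools:
    lsmod | grep nvidia                   # loaded NVIDIA kernel modules
    fuser -v /dev/nvidia* 2>/dev/null     # processes using the GPU device files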
  6. Run the following command to restart the target node:
    reboot
  7. Download the driver that you want to use from the official NVIDIA website. This topic uses 410.79 as an example.
  8. In the directory where you saved the driver, run the following command to install it:
    sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
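    Before continuing, you can confirm that the new driver loads. The following query (a standard nvidia-smi option) prints only the driver version string:
    nvidia-smi --query-gpu=driver_version --format=csv,noheader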
  9. Run the following commands to configure the driver. The first command enables persistence mode and the second sets the applications clocks permission; || true keeps the script going if a setting is not supported on the GPU:
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true
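    If you want to check that persistence mode took effect, persistence_mode is one of the properties that nvidia-smi can query (assuming your driver version supports the query):
    nvidia-smi --query-gpu=persistence_mode --format=csv,noheader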
  10. Run the following commands to restart the NVIDIA device plugin. Moving the static Pod manifest out of /etc/kubernetes/manifests and back causes the kubelet to recreate the device plugin Pod:
    mv /etc/kubernetes/manifests/nvidia-device-plugin.yml /
    mv /nvidia-device-plugin.yml /etc/kubernetes/manifests/
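    From a master node, you can then confirm that the device plugin Pod for this node is running again, for example:
    kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin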
  11. Run the following command on a master node to set the target node to schedulable:
    kubectl uncordon node-name
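    You can verify that the node no longer shows SchedulingDisabled in its status:
    kubectl get node node-name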

Result

Run the following command on a master node to check the NVIDIA driver version on the target node. If the upgrade succeeded, the output shows that the driver version is 410.79.
Note Replace node-name in the Pod name with the target node name.
kubectl exec -n kube-system -t nvidia-device-plugin-node-name -- nvidia-smi

Upgrade the NVIDIA driver on GPU nodes where no service is deployed

  1. Run the following command to log on to the target node:
    ssh root@xxx.xxx.x.xx
  2. Run the following command to check the NVIDIA driver version:
    nvidia-smi
  3. Run the following commands to uninstall the existing driver:
    Note
    • The following commands use driver version 384.111 as an example.
    • If your current driver version is not 384.111, download the matching installer from the official NVIDIA website first and replace the file name in the commands.
    cd /tmp
    curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
    chmod u+x NVIDIA-Linux-x86_64-384.111.run
    ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
  4. Run the following command to restart the target node:
    reboot
  5. Download the driver that you want to use from the official NVIDIA website. This topic uses 410.79 as an example.
  6. In the directory where you saved the driver, run the following command to install it:
    sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
  7. Run the following commands to configure the driver. The first command enables persistence mode and the second sets the applications clocks permission; || true keeps the script going if a setting is not supported on the GPU:
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true

Result

Run the following command on a master node to check the NVIDIA driver version on the target node. If the upgrade succeeded, the output shows that the driver version is 410.79.
Note Replace node-name in the Pod name with the target node name.
kubectl exec -n kube-system -t nvidia-device-plugin-node-name -- nvidia-smi