How Do I Upgrade the Kernel for GPU Nodes in an ACK Cluster?

Last Updated: May 10, 2021

Overview

This document describes how to upgrade the kernel for GPU nodes in a Container Service for Kubernetes (ACK) cluster.

Details

  1. Log on to a GPU node. For more information about how to log on to a GPU node, see Use kubectl to connect to an ACK cluster.
  2. Run the following command to set the node to the unschedulable state:
    kubectl cordon [$Node_ID]
    Note: Replace [$Node_ID] with the ID of the node.
    The following command output is returned:
    node/[$Node_ID] already cordoned
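    Optionally, you can verify that the node is now unschedulable by running the following command. The STATUS column of the node should include SchedulingDisabled:
    kubectl get nodes [$Node_ID]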
  3. Run the following command to drain the GPU node whose driver you want to upgrade. This evicts the pods that run on the node:
    kubectl drain [$Node_ID] --grace-period=120 --ignore-daemonsets=true
    The following command output is returned:
    node/[$Node_ID] cordoned
    WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
    pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
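    Optionally, you can confirm that only DaemonSet-managed pods, such as those listed in the preceding warning, remain on the node by running the following command:
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=[$Node_ID]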
  4. Uninstall the existing NVIDIA driver.
    Note: In this example, NVIDIA driver 384.111 is uninstalled. If your driver version is not 384.111, download the installation package for your version from the official NVIDIA website and replace 384.111 with the actual version number in the commands in this step.
    1. Log on to the GPU node and run the following command to view the driver version:
      nvidia-smi -a | grep 'Driver Version'
      The following command output is returned:
      Driver Version : 384.111
    2. Run the following commands to download the installation package of the NVIDIA driver that you want to uninstall:
      cd /tmp/
      curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
    3. Run the following commands to uninstall the existing NVIDIA driver:
      chmod u+x NVIDIA-Linux-x86_64-384.111.run
      ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
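      Optionally, you can check whether the NVIDIA kernel modules are still loaded. After the driver is uninstalled and its modules are unloaded, the following command should return no output. Any modules that remain loaded are removed when the node is restarted in a later step:
      lsmod | grep nvidia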
  5. Run the following commands to upgrade the kernel:
    yum clean all && yum makecache
    yum update kernel -y
  6. Run the reboot command to restart the GPU node.
  7. Log on to the GPU node again and run the following command to install the kernel-devel package:
    yum install -y kernel-devel-$(uname -r)
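    Optionally, you can confirm that the node booted into the new kernel and that the kernel-devel package matches the running kernel by running the following commands:
    uname -r
    rpm -q kernel-devel-$(uname -r)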
  8. Run the following commands to download the installation package of the target driver version from the official NVIDIA website and install it. In this example, NVIDIA driver 410.79 is downloaded and installed.
    cd /tmp/
    curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
    chmod u+x NVIDIA-Linux-x86_64-410.79.run
    ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
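    Optionally, you can run the nvidia-smi command on the node to confirm that the installation succeeded. The output should report Driver Version: 410.79:
    nvidia-smi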
  9. Run the following commands to configure GPU parameters as needed:
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true
    nvidia-smi --auto-boost-default=0 || true
    nvidia-smi --auto-boost-permission=0 || true
    nvidia-modprobe -u -c=0 -m || true
  10. If you want the NVIDIA driver settings to take effect at startup, check whether the /etc/rc.d/rc.local file includes the following configurations. If it does not, manually add them to the file.
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true
    nvidia-smi --auto-boost-default=0 || true
    nvidia-smi --auto-boost-permission=0 || true
    nvidia-modprobe -u -c=0 -m || true
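    Note: On CentOS 7-based nodes, the /etc/rc.d/rc.local file must be executable for these commands to run at startup. If the file is not executable, run the following command:
    chmod +x /etc/rc.d/rc.local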
  11. Run the following commands to restart the kubelet and Docker:
    service kubelet stop
    service docker restart
    service kubelet start
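    Optionally, you can confirm that the kubelet and Docker are running by running the following commands:
    service kubelet status
    service docker status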
  12. Run the following command to set the GPU node to the schedulable state:
    kubectl uncordon [$Node_ID]
    The following command output is returned:
    node/[$Node_ID] already uncordoned
  13. Run the following command in the nvidia-device-plugin pod on the GPU node to check the driver version:
    kubectl exec -n kube-system -t nvidia-device-plugin-[$Node_ID] nvidia-smi
    The following command output is returned:
    Thu Jan 17 00:33:27 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
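    Note: Replace nvidia-device-plugin-[$Node_ID] with the actual name of the nvidia-device-plugin pod that runs on the node. If you do not know the pod name, you can find it by running the following command:
    kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin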
  14. Run the docker ps command to check whether the containers on the GPU node are started. If a container fails to start, see What can I do if I fail to start a container on the GPU node?

Applicable scope

  • ACK

Note: Make sure that the current kernel version of the node that you want to upgrade is earlier than 3.10.0-957.21.3.
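You can check the current kernel version by running the following command on the node:
uname -r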