This topic describes frequently asked questions about node management.

How do I manually upgrade the kernel of the GPU nodes in a cluster?

You can manually upgrade the kernel of the GPU nodes in a cluster by performing the following steps.
Note This procedure applies when the current kernel version of the node is earlier than 3.10.0-957.21.3.
  1. Use kubectl to connect to a cluster.
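    Note Before you proceed, you can run a command such as the following to confirm that kubectl can connect to the cluster. The command only lists the nodes and does not modify the cluster.
    kubectl get nodes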
  2. Set the target GPU node to unschedulable. In this example, node cn-beijing.i-2ze19qyi8votgjz12345 is used as the target node.
    kubectl cordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already cordoned
  3. Migrate the Pods on the target node to other nodes.
    # kubectl drain cn-beijing.i-2ze19qyi8votgjz12345 --grace-period=120 --ignore-daemonsets=true
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 cordoned
    WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
    pod/nginx-ingress-controller-78d847fb96-5fkkw evicted
  4. Uninstall the existing nvidia-driver.
    Note In this step, the driver of version 384.111 is uninstalled. If your driver version is not 384.111, download the installation package for your version from the official NVIDIA website and replace 384.111 in the following commands with your actual version number.
    1. Log on to the GPU node and run the nvidia-smi command to check the driver version.
      # nvidia-smi -a | grep 'Driver Version'
      Driver Version                      : 384.111
    2. Download the driver installation package.
      cd /tmp/
      curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
      Note The installation package is required to uninstall the driver.
    3. Uninstall the existing driver.
      chmod u+x NVIDIA-Linux-x86_64-384.111.run
      ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
  5. Upgrade the kernel.
    Upgrade the kernel to the version that you require, as shown in the following example.
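    Note The following commands are only an example. They assume a CentOS 7 node and upgrade the node to the latest kernel version that is available in the configured yum repositories. Adjust them if you need a specific kernel version.
    yum clean all
    yum makecache
    yum update -y kernel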
  6. Restart the GPU node.
    reboot
  7. Log on to the GPU node to install the corresponding kernel-devel package.
    yum install -y kernel-devel-$(uname -r)
  8. Go to the official NVIDIA website to download the required driver and install it on the target node. In this example, 410.79 is used.
    cd /tmp/
    curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
    chmod u+x NVIDIA-Linux-x86_64-410.79.run
    sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q
    
    # warm up GPU
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true
    nvidia-smi --auto-boost-default=0 || true
    nvidia-smi --auto-boost-permission=0 || true
    nvidia-modprobe -u -c=0 -m || true
  9. Check the /etc/rc.d/rc.local file to see whether it contains the following configurations. If not, add them to the file.
    nvidia-smi -pm 1 || true
    nvidia-smi -acp 0 || true
    nvidia-smi --auto-boost-default=0 || true
    nvidia-smi --auto-boost-permission=0 || true
    nvidia-modprobe -u -c=0 -m || true
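    Note The following snippet is one way to append the settings only when they are missing. It assumes that you run it as the root user. On CentOS 7, /etc/rc.d/rc.local must also be executable for the settings to take effect at startup.
    if ! grep -q 'nvidia-smi -pm 1' /etc/rc.d/rc.local; then
      printf '%s\n' \
        'nvidia-smi -pm 1 || true' \
        'nvidia-smi -acp 0 || true' \
        'nvidia-smi --auto-boost-default=0 || true' \
        'nvidia-smi --auto-boost-permission=0 || true' \
        'nvidia-modprobe -u -c=0 -m || true' >>/etc/rc.d/rc.local
    fi
    chmod +x /etc/rc.d/rc.local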
  10. Restart kubelet and Docker.
    service kubelet stop
    service docker restart
    service kubelet start
  11. Set the GPU node to schedulable.
    # kubectl uncordon cn-beijing.i-2ze19qyi8votgjz12345
    
    node/cn-beijing.i-2ze19qyi8votgjz12345 already uncordoned
  12. Run the following command in the nvidia-device-plugin container to check the version of the driver installed on the GPU node.
    kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz12345 nvidia-smi
    Thu Jan 17 00:33:27 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Note If you run the docker ps command and find that no container is started on the GPU node, see the following section: How do I fix the issue where no container is started on a GPU node?

How do I fix the issue where no container is started on a GPU node?

When you restart kubelet and Docker on GPU nodes that run certain Kubernetes versions, you may find that no container is started on the nodes after the restart.
# service kubelet stop
Redirecting to /bin/systemctl stop kubelet.service
# service docker stop
Redirecting to /bin/systemctl stop docker.service
# service docker start
Redirecting to /bin/systemctl start docker.service
# service kubelet start
Redirecting to /bin/systemctl start kubelet.service

# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
Run the following command to view the cgroup driver.
docker info | grep -i cgroup
Cgroup Driver: cgroupfs
The output shows that the cgroup driver of Docker is set to cgroupfs, which does not match the systemd cgroup driver that kubelet is configured to use. As a result, kubelet cannot start containers on the node.
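If you want to confirm the cgroup driver on the kubelet side, the following check is only a sketch. It assumes that kubelet reads its configuration from /var/lib/kubelet/config.yaml; in some clusters the driver is set through the --cgroup-driver flag of kubelet instead.
grep -i cgroupdriver /var/lib/kubelet/config.yaml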

You can use the following steps to fix the issue.

  1. Create a backup copy of /etc/docker/daemon.json, for example by running cp /etc/docker/daemon.json /etc/docker/daemon.json.bak, and then run the following command to update /etc/docker/daemon.json.
    cat >/etc/docker/daemon.json <<-EOF
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m",
            "max-file": "10"
        },
        "oom-score-adjust": -1000,
        "storage-driver": "overlay2",
        "storage-opts":["overlay2.override_kernel_check=true"],
        "live-restore": true
    }
    EOF
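    Note If Python is available on the node, you can run a command such as the following to verify that the updated file is valid JSON before you restart Docker.
    python -m json.tool /etc/docker/daemon.json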
  2. Run the following commands to restart Docker and kubelet.
    # service kubelet stop
    Redirecting to /bin/systemctl stop kubelet.service
    # service docker restart
    Redirecting to /bin/systemctl restart docker.service
    # service kubelet start
    Redirecting to /bin/systemctl start kubelet.service
  3. Run the following command to check whether the cgroup driver is set to systemd.
    # docker info | grep -i cgroup
    Cgroup Driver: systemd
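    Note After Docker and kubelet are restarted with the systemd cgroup driver, you can run the docker ps command on the node again to check whether containers are started.
    docker ps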