To isolate GPU resources shared by multiple nodes in a Kubernetes cluster, Docker 19.03.5 and its nvidia-container-runtime binary must be used. If the Docker runtime version is earlier than 19.03.5, you must upgrade the Docker runtime to Docker 19.03.5. This topic describes how to upgrade the Docker runtime and its nvidia-container-runtime binary on these nodes to ensure GPU sharing among these nodes.
Background information
The nvidia-container-runtime binary allows you to build and run GPU-accelerated Docker containers. The binary automatically configures the containers. This ensures that the containers can use NVIDIA GPU resources.
Procedure
The steps described in this topic apply to only CentOS and Alibaba Cloud Linux 2.
Before you perform the following steps, you must use the command-line interface (CLI) to connect to your Container Service for Kubernetes (ACK) cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Run the following command on the master node to disconnect a specified node from the cluster.
You must label the node as unschedulable. This prevents pods from being scheduled to the node during the upgrade.
kubectl cordon <NODE_NAME>
Note<NODE_NAME> specifies the name of the node on which the Docker runtime will be upgraded. You can run the kubectl get nodes command to query node names.
Run the following command on the master node to migrate the pods on the node.
After the node is disconnected from the cluster, you must migrate the pods on the node to other available nodes.
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-local-data --force
Note<NODE_NAME> specifies the name of the node on which the Docker runtime will be upgraded.
Run the following commands to stop kubelet and the Docker runtime.
Stop kubelet and the Docker runtime before you upgrade the Docker runtime on the node.
service kubelet stop docker rm -f $(docker ps -aq) service docker stop
Run the following commands to remove the Docker runtime and nvidia-container-runtime.
Before you upgrade the Docker runtime and nvidia-container-runtime on the node, the current versions must be removed.
yum remove -y docker-ce docker-ce-cli containerd yum remove -y nvidia-container-runtime* libnvidia-container*
Run the following commands to back up and remove the daemon.json file:
cat /etc/docker/daemon.json { "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } }, "exec-opts": ["native.cgroupdriver=systemd"], "log-driver": "json-file", "log-opts": { "max-size": "100m", "max-file": "10" }, "bip": "169.254.123.1/24", "oom-score-adjust": -1000, "storage-driver": "overlay2", "storage-opts":["overlay2.override_kernel_check=true"], "live-restore": true } mv /etc/docker/daemon.json /tmp
Run the following command to install the Docker runtime.
Download the Docker installation package to the node on which you want to upgrade the Docker runtime.
VERSION=19.03.5 URL=http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/public/pkg/docker/docker-${VERSION}.tar.gz curl -ssL $URL -o /tmp/docker-${VERSION}.tar.gz cd /tmp tar -xf docker-${VERSION}.tar.gz cd /tmp/pkg/docker/${VERSION}/rpm yum localinstall -y $(ls .)
Run the following command to install nvidia-container-runtime on the node:
sudo cd /tmp sudo yum install -y unzip sudo wget https://aliacs-k8s-cn-hongkong.oss-cn-hongkong.aliyuncs.com/public/pkg/nvidia-container-runtime/nvidia-container-runtime-3.13.0-linux-amd64.tar.gz sudo tar -xvf nvidia-container-runtime-3.13.0-linux-amd64.tar.gz sudo yum -y -q --nogpgcheck localinstall pkg/nvidia-container-runtime/3.13.0/common/*
Run the following command to configure the daemon.json file.
Overwrite the /etc/docker/daemon.json file with the specified daemon.json file to make the original configurations take effect.
mv /tmp/daemon.json /etc/docker/daemon.json cat /etc/docker/daemon.json { "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } }, "exec-opts": ["native.cgroupdriver=systemd"], "log-driver": "json-file", "log-opts": { "max-size": "100m", "max-file": "10" }, "bip": "169.254.123.1/24", "oom-score-adjust": -1000, "storage-driver": "overlay2", "storage-opts":["overlay2.override_kernel_check=true"], "live-restore": true }
Run the following commands to start the Docker runtime and kubelet:
service docker start service kubelet start
Run the following command to connect the node to the cluster.
After the Docker runtime of the node is upgraded, the node is changed to the schedulable state in the cluster.
kubectl uncordon <NODE_NAME>
Note<NODE_NAME> specifies the name of the node on which the Docker runtime has been upgraded.
Run the following command to restart the GPU installer on the node:
docker ps |grep cgpu-installer | awk '{print $1}' | xargs docker rm -f