Issue
Containers fail to start after you restart the kubelet and Docker on a GPU node in a Container Service for Kubernetes (ACK) cluster of specific editions.
Cause
Containers fail to start because Docker on the node uses cgroupfs as its cgroup driver.
Solution
- Log on to the GPU node and run the following command to check whether the cgroupfs driver is used:
docker info | grep -i cgroup
The following command output is returned:
Cgroup Driver: cgroupfs
- Run the following command to update the /etc/docker/daemon.json file and change the cgroup driver to systemd:
Note: Back up the file before you modify it.
cat >/etc/docker/daemon.json <<-EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "oom-score-adjust": -1000,
    "storage-driver": "overlay2",
    "storage-opts": ["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF
- Run the following command to stop the kubelet:
service kubelet stop
The following command output is returned:
Redirecting to /bin/systemctl stop kubelet.service
- Run the following command to restart Docker:
service docker restart
The following command output is returned:
Redirecting to /bin/systemctl restart docker.service
- Run the following command to start the kubelet:
service kubelet start
The following command output is returned:
Redirecting to /bin/systemctl start kubelet.service
- Run the following command to check whether the cgroup driver is changed to systemd:
docker info | grep -i cgroup
The following command output is returned:
Cgroup Driver: systemd
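The backup and the driver checks in the preceding steps can be scripted. The following is a minimal sketch, not an official ACK tool: the function names parse_cgroup_driver and backup_file are hypothetical and introduced only for illustration, and the backup file naming scheme is an assumption. The functions only read text and copy a file; they do not restart any service.

```shell
#!/bin/sh
# Hypothetical helpers for the procedure above. The names are illustrative
# and are not part of Docker, the kubelet, or ACK.

# Read `docker info` output on stdin and print the value of the first
# "Cgroup Driver:" line, for example "cgroupfs" or "systemd".
parse_cgroup_driver() {
    sed -n 's/^[[:space:]]*Cgroup Driver:[[:space:]]*//p' | head -n 1
}

# Copy a file to a timestamped backup next to it and print the backup path.
# Assumption: a ".bak.<timestamp>" suffix is an acceptable naming scheme.
backup_file() {
    src=$1
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

For example, run backup_file /etc/docker/daemon.json before you modify the file, and run docker info | parse_cgroup_driver before and after the change to compare the reported driver.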
Applicable scope
- ACK