What Do I Do If Containers Fail to Start on a GPU Node?

Last Updated: May 10, 2021

Issue

Containers fail to start after you restart the kubelet and Docker on a GPU node in a Container Service for Kubernetes (ACK) cluster of specific editions.

Cause

Containers fail to start because Docker on the GPU node uses the cgroupfs cgroup driver.

Solution

  1. Log on to the GPU node and run the following command to check whether the cgroupfs driver is used:
    docker info | grep -i cgroup
    If the following output is returned, the cgroupfs driver is used:
    Cgroup Driver: cgroupfs
  2. Back up the /etc/docker/daemon.json file, and then run the following command to overwrite it with a configuration that sets the cgroup driver to systemd:
    cat >/etc/docker/daemon.json <<-EOF
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m",
            "max-file": "10"
        },
        "oom-score-adjust": -1000,
        "storage-driver": "overlay2",
        "storage-opts": ["overlay2.override_kernel_check=true"],
        "live-restore": true
    }
    EOF
  3. Run the following command to stop the kubelet:
    service kubelet stop
    The following command output is returned:
    Redirecting to /bin/systemctl stop kubelet.service
  4. Run the following command to restart Docker:
    service docker restart
    The following command output is returned:
    Redirecting to /bin/systemctl restart docker.service
  5. Run the following command to start the kubelet:
    service kubelet start
    The following command output is returned:
    Redirecting to /bin/systemctl start kubelet.service
  6. Run the following command to verify that the cgroup driver has been changed to systemd:
    docker info | grep -i cgroup
    If the following output is returned, the cgroup driver has been changed to systemd:
    Cgroup Driver: systemd
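
The backup before the file is modified and the driver checks in the first and last steps can be scripted. The following is a minimal sketch, not part of the official procedure: the function names cgroup_driver and backup_file are hypothetical helpers, and the sketch assumes a POSIX shell with awk, cp, and date available on the node.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not part of Docker or ACK).

# Read `docker info` output from stdin and print the cgroup driver name.
cgroup_driver() {
    # Split each line on ": " and match the "Cgroup Driver" key
    # case-insensitively, then print its value and stop reading.
    awk -F': *' 'tolower($1) ~ /cgroup driver/ { print $2; exit }'
}

# Copy a file to <file>.bak.<timestamp> so that it can be restored later.
backup_file() {
    src="$1"
    # Nothing to back up if the file does not exist yet.
    [ -f "$src" ] || return 0
    cp -p "$src" "$src.bak.$(date +%Y%m%d%H%M%S)"
}
```

For example, backup_file /etc/docker/daemon.json could be run before the file is overwritten, and docker info | cgroup_driver could be used in a script to confirm that the driver reported after the restart is systemd.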

Applicable scope

  • ACK