Getting Started with Kubernetes | GPU Management and Device Plugin Implementation

By Che Yang (Biran), Senior Technical Expert at Alibaba

In 2016, AlphaGo and TensorFlow brought the technological revolution of artificial intelligence (AI) from academia to industry. AI is boosted by cloud computing and computing power.

Source of Requirements

After years of development, AI has been applied to many scenarios, such as intelligent customer service, machine translation, and image search. Machine learning and AI have a long history. The popularity of cloud computing and the corresponding massive increase in computing power make it possible to apply AI in the industry.

Since 2016, the Kubernetes community has received many requests from different channels asking to run TensorFlow and other machine learning frameworks in Kubernetes clusters. To meet these requests, we have to solve challenges such as the management of jobs and other offline tasks, the heterogeneous devices required for machine learning, and support for NVIDIA graphics processing units (GPUs).

We can use Kubernetes to manage GPUs to drive down costs and improve efficiency. GPUs are more costly than CPUs. In the cloud, a CPU costs less than RMB 1 per hour, whereas a GPU costs RMB 10 to 30 per hour, so improving GPU utilization is essential.

We can use Kubernetes to manage GPUs and other heterogeneous resources to achieve the following:

Accelerate deployment: We can avoid the repeated deployment of the complex environment for machine learning by using containers.
Improve the utilization of cluster resources: We can do this through central scheduling and allocation.
Ensure exclusive access to resources: We can isolate heterogeneous devices through containers to prevent interference.

Deployment can be accelerated by minimizing the time spent on environment preparation. Container images let us capture the deployment process once and reuse it, and many machine learning frameworks already provide container images. Starting from these images reduces the time that GPUs sit idle during environment setup.

GPU utilization can be improved through time-division multiplexing. Kubernetes is required to centrally schedule GPUs in large quantities so that users can apply for resources as needed and release resources immediately after use. This ensures flexible use of the GPU pool.

The device isolation capability provided by Docker is required to prevent interference among the processes of different applications that run on the same device. This ensures high efficiency, cost-effectiveness, and system stability.

GPU Containerization

Kubernetes is a container scheduling platform that uses containers as its scheduling units, which makes it suitable for running GPU applications. Before learning how to use GPUs in Kubernetes, let's learn how to run GPU applications in a container environment.

1. Run GPU Applications in a Container Environment

Running GPU applications in a container environment is not complicated. This is done in two steps:

Build a GPU-supporting container image.
Run the image through Docker and map GPU devices and dependent libraries to containers.

2. Prepare a GPU Container Image

You can prepare a GPU container image through either of the following methods:

Use the official container image for deep learning.

Select an official GPU image from Docker Hub or Alibaba Cloud Container Registry. Standard images are available for popular machine learning frameworks, such as TensorFlow, Caffe, and PyTorch. Official GPU images are easy to use, secure, and reliable.

Build a GPU container image based on NVIDIA CUDA.

When official images do not meet your needs, for example, when you have made custom changes to the TensorFlow framework, you need to compile a custom TensorFlow image. We recommend that you create a custom image based on the official image of NVIDIA.

The following code comes from TensorFlow. It builds a custom GPU image on top of the CUDA image.
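As a rough illustration of the idea (not the TensorFlow code referred to above), the Python sketch below renders a minimal Dockerfile that layers a framework on top of a CUDA base image. The function name, the base image tag, and the package list are assumptions made for this example. The sketch installs a stock pip package instead of a custom-compiled framework, and the CUDA version in the base image must match what the framework build expects.

```python
def render_dockerfile(base_image, pip_packages):
    """Render a minimal Dockerfile that adds Python packages to a CUDA base image."""
    lines = [
        # The CUDA base image supplies the CUDA libraries; the driver stays on the host.
        f"FROM {base_image}",
        # Install Python and pip inside the image.
        "RUN apt-get update && apt-get install -y --no-install-recommends "
        "python3 python3-pip && rm -rf /var/lib/apt/lists/*",
        # Install the machine learning framework on top of CUDA.
        "RUN pip3 install --no-cache-dir " + " ".join(pip_packages),
        'CMD ["python3"]',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The tag below is an assumed example; check the registry for a tag that fits your setup.
    print(render_dockerfile("nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04", ["tensorflow"]))
```

Saving the output as a file named Dockerfile and running "docker build" on it would produce the image.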

3. How a GPU Container Image Works

Before building a GPU container image, we need to learn how to install a GPU application on a host.

As shown in the left part of the following figure, the NVIDIA hardware driver is first installed at the underlying layer. The CUDA tool library is installed at the upper layer. Machine learning frameworks such as PyTorch and TensorFlow are installed at the uppermost layer.

The CUDA tool library is closely coupled with applications. When an application version is changed, the related CUDA version may be updated as well. The NVIDIA driver is relatively stable.

The right part of the preceding figure shows the NVIDIA GPU container solution, in which only the NVIDIA driver is installed on the host, and the CUDA tool library and everything above it are packaged in container images. The driver libraries on the host are mapped into containers through bind mounts.

After you install a new NVIDIA driver, you can run different versions of CUDA images on the same node.

4. Run a GPU Application in a Container

Now, we will see how a GPU container works. The following figure shows an example in which we use Docker to run a GPU container.

You need to map the host device and the NVIDIA driver library to the GPU container at runtime. This is different from the case of a common container.

The right part of the preceding figure shows the GPU configuration of the GPU container after startup. The upper-right part of the preceding figure shows the device mapping result, and the lower-right part shows the changes that occur after the driver library is mapped to the container through a bind mount.

NVIDIA Docker is typically used to run GPU containers. It automates both device mapping and driver library mapping. Device mounting is simple, but the driver libraries on which GPU applications depend are complex.

Different driver libraries are used depending on the specific scenario, such as deep learning or video processing. To map driver libraries by hand, you must have an understanding of NVIDIA software, and especially of NVIDIA containers.
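To make the two mappings concrete, here is a minimal Python sketch that assembles the "docker run" command line you would otherwise write by hand. The --device, --volume, and --env flags are standard Docker options, and the /dev/nvidia* nodes are the ones the NVIDIA driver creates on the host. The image name, the host driver library directory, and the mount point inside the container are hypothetical values chosen for illustration.

```python
import shlex


def build_docker_run(image, gpu_index, driver_lib_dir, command):
    """Return the argv of a `docker run` that hand-maps one GPU into a container."""
    devices = [
        "/dev/nvidiactl",            # control device shared by all GPUs
        "/dev/nvidia-uvm",           # unified memory device used by CUDA
        f"/dev/nvidia{gpu_index}",   # the GPU assigned to this container
    ]
    argv = ["docker", "run", "--rm"]
    for dev in devices:
        argv += ["--device", f"{dev}:{dev}"]
    # Bind-mount the host driver libraries read-only (mount point is an assumption).
    argv += ["--volume", f"{driver_lib_dir}:/usr/local/nvidia/lib64:ro"]
    # Let the dynamic linker inside the container find the mounted libraries.
    argv += ["--env", "LD_LIBRARY_PATH=/usr/local/nvidia/lib64"]
    return argv + [image] + list(command)


if __name__ == "__main__":
    argv = build_docker_run("example/cuda-app:latest", 0, "/usr/lib64/nvidia", ["nvidia-smi"])
    print(shlex.join(argv))
```

The sketch only prints the command. It shows why automation helps: the device list is short and fixed, while the set of driver libraries to mount depends on the driver version and the workload.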

GPU Management Through Kubernetes

1. Deploy GPU Kubernetes

Configure the GPU capability for a Kubernetes node as follows. Here, we use a CentOS node as an example.

As shown in the preceding figure, we must:

1. Install an NVIDIA driver.

Installing an NVIDIA driver requires compiling a kernel module. You must install the GNU Compiler Collection (GCC) and the kernel source code before installing an NVIDIA driver.

2. Install NVIDIA Docker 2 from the yum repository.

Reload Docker after NVIDIA Docker 2 is installed. The default runtime in Docker's daemon.json is set to nvidia. Run the "docker info" command to check whether the NVIDIA runtime is used as the default runtime instead of the standard runC.
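For reference, the Python sketch below produces the daemon.json content that makes nvidia the default runtime. The keys "default-runtime" and "runtimes" are Docker daemon options, and /usr/bin/nvidia-container-runtime is the usual install path of the runtime binary. The helper name is made up for this example, and the path should be checked on your node.

```python
import json


def with_nvidia_default_runtime(existing=None):
    """Return a daemon.json dict with the NVIDIA runtime registered as the default."""
    config = dict(existing or {})
    runtimes = dict(config.get("runtimes", {}))
    runtimes["nvidia"] = {
        "path": "/usr/bin/nvidia-container-runtime",  # assumed install location
        "runtimeArgs": [],
    }
    config["runtimes"] = runtimes
    config["default-runtime"] = "nvidia"
    return config


if __name__ == "__main__":
    # Existing settings are preserved; only the runtime entries are added.
    print(json.dumps(with_nvidia_default_runtime({"log-driver": "json-file"}), indent=4))
```

The printed JSON would normally live in /etc/docker/daemon.json, and Docker must be reloaded for it to take effect.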

3. Install the NVIDIA Device Plugin.

Download the deployment declaration file of the device plugin from NVIDIA's git repo and run the "kubectl create" command to deploy the plugin.

Here the device plugin is deployed as a DaemonSet. When a Kubernetes node fails to schedule GPU applications, you need to check modules such as the device plugin. For example, you can view the device plugin logs, check whether the default runtime of Docker is set to the NVIDIA runtime, and check whether the NVIDIA driver has been installed.

2. Verify the Result of GPU Kubernetes Deployment

After the GPU node is deployed, view GPU information in the node status information, including:

GPU name, which is nvidia.com/gpu in this example
GPU quantity, which is 2 in the following figure, indicating that the node has two GPUs
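The same check can be scripted. The Python sketch below reads the GPU quantity from a node object such as the one returned by "kubectl get node <node-name> -o json". The sample node is made-up data that mirrors the two-GPU example, and the function name is an assumption for this illustration.

```python
def gpu_quantity(node, resource_name="nvidia.com/gpu"):
    """Return how many devices of `resource_name` the node reports as allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Extended resources are reported as integer strings; a missing key means none.
    return int(allocatable.get(resource_name, "0"))


if __name__ == "__main__":
    # Hand-written sample that imitates the relevant part of a node object.
    sample_node = {
        "status": {
            "capacity": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "2"},
            "allocatable": {"cpu": "8", "memory": "31Gi", "nvidia.com/gpu": "2"},
        }
    }
    print(gpu_quantity(sample_node))
```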

3. Use the GPU yaml Sample in Kubernetes

It is easy to use GPU containers in Kubernetes.

Set nvidia.com/gpu to the number of required GPUs under the limits field in the pod resource configuration. It is set to 1 in the following figure. Then, run the "kubectl create" command to deploy the target pod.
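As a sketch of what such a pod definition contains, the Python snippet below builds the manifest as a dictionary and prints it as JSON, which "kubectl create -f" accepts as well as YAML. The pod name and image are placeholders invented for this example. The only GPU-specific part is the nvidia.com/gpu entry under resources.limits.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a pod manifest whose single container requests `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested in whole units under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute your own GPU image.
    print(json.dumps(gpu_pod_manifest("gpu-demo", "example/cuda-app:latest", gpus=1), indent=2))
```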

4. View the Results

After the deployment is complete, log on to the container and run the "nvidia-smi" command to check the result. You can see that the container uses one T4 GPU, which means that one of the node's two GPUs is in use by the container. The other GPU is invisible to the container and is inaccessible due to the GPU isolation feature.

Use GPU Resources in Kubernetes

1. Manage GPU Resources Through Extension

Kubernetes manages GPU resources through plugin extension, which is implemented by two independent internal mechanisms.

Kubernetes provides extended resources to allow you to define custom resource names. Extended resources are counted in integer units, which gives a general way to support different heterogeneous devices, such as remote direct memory access (RDMA) devices, field programmable gate arrays (FPGAs), and AMD GPUs. This feature is not restricted to NVIDIA GPUs.
Kubernetes provides a device plugin framework to allow third-party device providers to manage the entire lifecycle of their devices. The device plugin framework connects Kubernetes to the device plugin module. The device plugin reports device information to Kubernetes and supplies the device details needed when a device is assigned to a container.

2. Report Extended Resources

Extended resources are a node-level API and can be used independently of the device plugin. To report extended resources, use the PATCH API to update the status field of a node object. The PATCH operation can be performed by using a simple curl command. After the patch, the Kubernetes scheduler records that the node has this type of GPU resource, with a quantity of 1 in this example.
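The request that such a curl command sends can be sketched in Python as follows. The body is a JSON Patch that adds an entry under /status/capacity, it is sent with the Content-Type application/json-patch+json, and the slash in the resource name must be escaped as ~1 in the path. The function names are assumptions, and the API address is assumed to be a local "kubectl proxy" on port 8001. The sketch only builds the request and does not send it.

```python
import json


def extended_resource_patch(resource_name, quantity):
    """Build the JSON Patch body that advertises an extended resource on a node."""
    # JSON Pointer escaping: "~" becomes "~0" and "/" becomes "~1".
    escaped = resource_name.replace("~", "~0").replace("/", "~1")
    return [{"op": "add", "path": f"/status/capacity/{escaped}", "value": str(quantity)}]


def node_status_url(api_base, node_name):
    """Return the URL of the node's status subresource."""
    return f"{api_base}/api/v1/nodes/{node_name}/status"


if __name__ == "__main__":
    # "my-node" is a placeholder node name.
    print("PATCH", node_status_url("http://localhost:8001", "my-node"))
    print("Content-Type: application/json-patch+json")
    print(json.dumps(extended_resource_patch("nvidia.com/gpu", 1)))
```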

The PATCH operation is not required if you use a device plugin. You only need to implement the device plugin programming model so that the device plugin performs the PATCH operation when extended resources are reported.

3. How the Device Plugin Works

The workflow of the device plugin is divided into two parts:

Resource reporting upon startup
Scheduling and running during usage

The device plugin is easy to develop. Two event methods are involved.

ListAndWatch is used for resource reporting and provides a health check mechanism. An unhealthy device is reported to the unhealthy device ID list of Kubernetes so that the device plugin framework removes this device from the schedulable device list.
The Allocate method is called by the kubelet on the device plugin during container deployment. The input parameter is the ID of the device used by the container. The returned parameters include the devices, volumes, and environment variables required to start the container.
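The two methods can be pictured with a toy model in Python. This is not the real gRPC interface: the real ListAndWatch is a long-lived stream, while the sketch returns a snapshot, and every class, field, and path below is invented for illustration. NVIDIA_VISIBLE_DEVICES is the environment variable that the NVIDIA container runtime reads, but how the real plugin fills its response is not shown here.

```python
from dataclasses import dataclass


@dataclass
class AllocateResponse:
    device_paths: list   # device nodes to expose in the container
    mounts: list         # (host_path, container_path) pairs
    envs: dict           # environment variables for the container


class ToyGpuPlugin:
    def __init__(self, device_paths, driver_dir):
        self.device_paths = dict(device_paths)  # device ID -> host device node
        self.driver_dir = driver_dir
        self.unhealthy = set()

    def mark_unhealthy(self, device_id):
        """Record a failed health check for one device."""
        self.unhealthy.add(device_id)

    def list_and_watch(self):
        """Report every device ID together with its current health."""
        return {
            dev_id: ("Unhealthy" if dev_id in self.unhealthy else "Healthy")
            for dev_id in self.device_paths
        }

    def allocate(self, device_ids):
        """Return what the container needs in order to use the given devices."""
        return AllocateResponse(
            device_paths=[self.device_paths[d] for d in device_ids],
            mounts=[(self.driver_dir, "/usr/local/nvidia")],
            envs={"NVIDIA_VISIBLE_DEVICES": ",".join(device_ids)},
        )


if __name__ == "__main__":
    plugin = ToyGpuPlugin({"gpu-0": "/dev/nvidia0", "gpu-1": "/dev/nvidia1"}, "/usr/lib64/nvidia")
    plugin.mark_unhealthy("gpu-1")
    print(plugin.list_and_watch())
    print(plugin.allocate(["gpu-0"]))
```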

4. Report and Monitor Resources

Each hardware device is managed by the related device plugin, which connects as a client to the device plugin manager of the kubelet through gRPC. The device plugin reports to the kubelet the UNIX socket on which it listens, its API version, and its device name.

The following figure shows the process by which a device plugin reports resources. The process is divided into four steps. The first three steps occur on the node, and the fourth step is the interaction between the kubelet and the API server.

Step 1: The device plugin is registered to interact with Kubernetes. Multiple devices may exist on a node. The device plugin, as a client, reports the following information to the kubelet: (1) name of the device managed by the device plugin, such as a GPU or RDMA; (2) file path of the UNIX socket to which the device plugin listens so that the kubelet can call the device plugin; (3) protocol for interaction, which is the API version.
Step 2: The device plugin starts a gRPC server. From then on, the device plugin acts as a gRPC server that provides services to the kubelet. The listening address and API version are the ones provided in Step 1.
Step 3: After the gRPC server is started, the kubelet establishes a persistent connection to ListAndWatch of the device plugin to discover the device ID and check the device health. The device plugin notifies the kubelet when a device is unhealthy. If the unhealthy device is idle, the kubelet removes it from the schedulable device list. If the unhealthy device is used by a pod, the kubelet does not do anything because killing the pod is a high-risk action.
Step 4: The kubelet exposes these devices to the status of the node and sends the device quantity to the Kubernetes API server. The scheduler implements scheduling based on this information.

The kubelet reports only the GPU quantity to the Kubernetes API server. The device plugin manager of the kubelet stores the GPU ID list and uses it to assign specific GPUs. The Kubernetes global scheduler does not see the GPU ID list, only the GPU quantity.

As a result, when a device plugin is used, the Kubernetes global scheduler schedules based only on the GPU quantity. For example, two GPUs on the same node exchange data more efficiently over NVLINK than over PCIe, but the device plugin mechanism does not support scheduling based on this kind of GPU affinity.
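The split between the ID list and the reported quantity can be sketched with a toy device manager in Python. It is a simplified model and not kubelet code: the class and method names are invented, the socket path is only an example, and the case of an unhealthy device that is already in use by a pod is left out. "v1beta1" is used as the supported API version for illustration.

```python
class ToyDeviceManager:
    SUPPORTED_VERSION = "v1beta1"

    def __init__(self):
        self.sockets = {}       # resource name -> plugin socket path
        self.healthy_ids = {}   # resource name -> set of healthy device IDs

    def register(self, resource_name, socket_path, api_version):
        """Step 1: a plugin announces its resource name, socket, and API version."""
        if api_version != self.SUPPORTED_VERSION:
            raise ValueError(f"unsupported device plugin API version: {api_version}")
        self.sockets[resource_name] = socket_path
        self.healthy_ids[resource_name] = set()

    def on_list_and_watch(self, resource_name, device_health):
        """Step 3: keep only healthy device IDs in the schedulable list."""
        self.healthy_ids[resource_name] = {
            dev_id for dev_id, health in device_health.items() if health == "Healthy"
        }

    def node_capacity(self):
        """Step 4: only quantities leave the node; the ID list stays in the kubelet."""
        return {name: str(len(ids)) for name, ids in self.healthy_ids.items()}


if __name__ == "__main__":
    manager = ToyDeviceManager()
    manager.register("nvidia.com/gpu", "/var/lib/kubelet/device-plugins/nvidia.sock", "v1beta1")
    manager.on_list_and_watch("nvidia.com/gpu", {"gpu-0": "Healthy", "gpu-1": "Unhealthy"})
    print(manager.node_capacity())
```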

5. Schedule and Run Pods

When a pod wants to use a GPU, it declares the GPU resource and the required quantity in resources.limits, such as nvidia.com/gpu: 1. Kubernetes finds a node that has enough available GPUs, reduces the number of available GPUs on that node by 1, and binds the pod to the node.

After the binding is complete, the kubelet on that node creates a container. When the kubelet finds that the resource specified in the pod's container request is a GPU, it has its internal device plugin manager select an available GPU from the GPU ID list and assign the GPU to the container.
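Quantity-based placement can be sketched in a few lines of Python. This is a deliberately naive model that tries nodes in name order and takes the first one that fits. The real scheduler filters and scores nodes on many more criteria, and the function and node names are invented for this example.

```python
def schedule(requested_gpus, free_gpus):
    """Pick a node with enough free GPUs, reserve them, and return the node name.

    `free_gpus` maps node name to the number of unassigned GPUs and is updated
    in place. Returns None when no node can satisfy the request.
    """
    for node in sorted(free_gpus):
        if free_gpus[node] >= requested_gpus:
            free_gpus[node] -= requested_gpus  # bookkeeping uses counts only, never GPU IDs
            return node
    return None


if __name__ == "__main__":
    free = {"node-a": 0, "node-b": 2}
    print(schedule(1, free), free)
    print(schedule(2, free), free)
```

Because only counts are tracked, the sketch has no way to prefer two GPUs that share a fast interconnect, which is the limitation of count-based scheduling.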

The kubelet sends an Allocate request to the device plugin. The request includes the device ID list that contains the GPU to be assigned to the container.

After receiving the Allocate request, the device plugin finds the device path, driver directory, and environment variables related to the device ID, and returns the information to the kubelet through an Allocate response.

The kubelet assigns a GPU to the container based on the received device path and driver directory. Then, Docker creates a container as instructed by the kubelet. The created container includes a GPU. Finally, the required driver directory is mounted. This completes the process of assigning a GPU to a pod in Kubernetes.

Thinking and Practice

1. Summary

In this article, we learned how to use GPUs in Docker and Kubernetes.

GPU containerization: how to build a GPU image and run a GPU container directly in Docker
Using Kubernetes to manage GPU resources: how to support GPU scheduling and verify GPU configurations in Kubernetes, and how to schedule GPU containers
Device plugin: how to report and monitor resources, and how to schedule and run pods
Thinking: current flaws and common device plugins in the community

2. Flaws of Device Plugins

In this final section, we will evaluate device plugins.

Device plugins are not designed with full consideration of the actual scenarios in academia and industry. The specific GPUs to use are selected only by the kubelet on each node.

The global scheduler is essential for GPU resource scheduling. However, the Kubernetes scheduler schedules based only on the GPU quantity. Device plugins cannot schedule heterogeneous devices based on factors other than quantity. For example, they cannot express that a pod needs two GPUs connected through NVLINK.

Device plugins do not support global scheduling based on the status of devices in a cluster.

The Allocate and ListAndWatch APIs of device plugins do not support adding extensible parameters. In addition, device plugins cannot schedule the resources of complex devices through API extension.

Therefore, device plugins are only applicable to a limited number of scenarios. This explains why NVIDIA and other vendors have implemented fork-based solutions based on Kubernetes upstream code.

3. Heterogeneous Resource Scheduling Solutions in the Community

The most commonly used solution is developed by NVIDIA.
The Alibaba Cloud service team developed a GPU sharing scheduling solution to schedule shared GPU resources. We welcome you to use and help us improve this solution.
Other vendors provide the RDMA and FPGA scheduling solutions.

Learn more about Alibaba Cloud Kubernetes product at https://www.alibabacloud.com/product/kubernetes