By Bi Ran
The Kubernetes services of the major container cluster vendors all support scheduling Nvidia GPU containers, but they generally do so by allocating a whole GPU card to a single container. This gives good isolation and ensures that an application using a GPU is not affected by other applications. It suits deep learning model training, but it is wasteful for model development and model prediction. What users want is to let several prediction services share the same GPU card, which raises Nvidia GPU utilization across the cluster. This requires partitioning GPU resources, where partitioning means dividing both GPU memory and CUDA kernel threads. Generally, cluster-level GPU sharing involves two things: scheduling and isolation.
This article mainly describes scheduling. The isolation solution will be implemented based on Nvidia MPS in the future.
For fine-grained GPU card scheduling, the Kubernetes community does not currently have a good solution. This is because Kubernetes defines extended resources, such as GPUs, so that they can only be added and subtracted at integer granularity, which cannot express the allocation of more complex resources. For example, if Pod A is to occupy half of a GPU card, the current Kubernetes architecture has no way to record and retrieve that allocation. Multi-card GPU sharing involves vector resources (an amount of memory on each individual card), whereas an Extended Resource describes a scalar resource (a single number per node).
Therefore, we have designed an out-of-tree GPU Share Scheduling Solution, which relies on the existing working mechanism of Kubernetes:
Many customers have a clear requirement to schedule multiple AI applications onto the same GPU. They are willing to limit memory usage at the application level, for example by using gpu_options.per_process_gpu_memory_fraction to control how much GPU memory an application uses. The first problem we need to solve is therefore a simplified one: use GPU memory as the unit of scheduling, and pass the allocated memory size to the container as a parameter.
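As a minimal sketch of the application-level control described above, the function below turns a memory allocation into the fraction expected by gpu_options.per_process_gpu_memory_fraction. The function name and the way the two values reach the application are assumptions made for illustration; the article only says that the memory size is passed to the container as a parameter.

```python
def gpu_memory_fraction(requested_mib: int, card_total_mib: int) -> float:
    """Return the share of one card's memory that the application may use.

    Both arguments are in MiB. How the container learns these two values
    is not specified here; they are plain parameters in this sketch.
    """
    if card_total_mib <= 0:
        raise ValueError("card_total_mib must be positive")
    if not 0 < requested_mib <= card_total_mib:
        raise ValueError("the request must fit on a single card")
    return requested_mib / card_total_mib


# A quarter of a 16276 MiB card.
fraction = gpu_memory_fraction(requested_mib=4069, card_total_mib=16276)
print(f"per_process_gpu_memory_fraction = {fraction:.2f}")  # 0.25
```

In TensorFlow 1.x the resulting value could be supplied through tf.GPUOptions(per_process_gpu_memory_fraction=fraction). Note that this only limits the application that sets it; it is not isolation enforced by the cluster.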
This design does not modify the core of the following Kubernetes designs: the Extended Resource, the Scheduler implementation, the Device Plugin mechanism, and the related Kubelet design. It reuses the Extended Resource to describe the API through which applications request shared resources. The advantage is a portable solution that users can run on native Kubernetes.
First, our task is to define two new Extended Resources: the first is gpu-mem, corresponding to the GPU memory, and the second is gpu-count, corresponding to the number of GPU cards. Vector resources are described by these two scalar resources, and the vector resources are combined to provide a mechanism to support GPU Share. The basic architecture diagram is as follows:
1. Resource reporting
The GPU Share Device Plugin uses the NVML library to query the number of GPU cards and the memory of each card, and uses ListAndWatch() to report the total GPU memory on the node (number of cards multiplied by memory per card) to Kubelet as an additional Extended Resource. Kubelet then reports it to the Kubernetes API Server. For example, if a node contains two GPU cards and each card has 16276 MiB, then from the user's perspective the GPU resources of the node are 16276 * 2 = 32552 MiB. The number of GPU cards on the node, which is 2, is also reported as an additional Extended Resource.
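The arithmetic behind the reported values can be sketched as follows. The per-card memory list is hard-coded here; in the plugin it would come from NVML. The function name is hypothetical, and the resource names follow the gpu-mem and gpu-count names used in this article.

```python
def node_extended_resources(card_memory_mib):
    """Collapse per-card memory (MiB) into the two scalar resources."""
    return {
        "gpu-mem": sum(card_memory_mib),    # total memory across all cards
        "gpu-count": len(card_memory_mib),  # number of cards on the node
    }


# Two cards with 16276 MiB each, as in the example above.
resources = node_extended_resources([16276, 16276])
print(resources)  # {'gpu-mem': 32552, 'gpu-count': 2}
```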
2. Extended scheduling
When the GPU Share Scheduler Extender allocates gpu-mem to a Pod, it records the allocation in the Pod Spec in the form of annotations. At filtering time, it uses this information to determine whether each GPU card has enough gpu-mem available.
2.1. After it has performed all of its own filter actions, the default Kubernetes scheduler calls the Filter method of the GPU Share Scheduler Extender over HTTP. This is necessary because, when computing Extended Resources, the default scheduler can only determine whether the node as a whole has enough free resources to meet the demand. It cannot determine whether the demand can be met on a single card. It is therefore up to the GPU Share Scheduler Extender to check whether a single card has enough available resources.
Take the following figure as an example. In a Kubernetes cluster of 3 nodes, each with 2 GPU cards, a user applies for gpu-mem = 8138. The default scheduler scans all nodes and finds that the remaining resources of N1 are 16276 * 2 - 16276 - 12207 = 4069, which does not meet the demand, so the N1 node is filtered out.
The remaining resources of the N2 and N3 nodes are both 8138 MiB, which satisfies the default scheduler from the perspective of the node as a whole. At this point, the default scheduler hands over to the GPU Share Scheduler Extender for secondary filtering, in which the Extender determines whether a single card meets the scheduling requirement. The N2 node has 8138 MiB available in total, but GPU0 and GPU1 each have only 4069 MiB available, which cannot meet a single-card demand of 8138 MiB. The N3 node also has 8138 MiB available in total, but all of it belongs to GPU0, so it meets the single-card demand. In this way, the GPU Share Scheduler Extender makes precise per-card filtering possible.
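The two filtering stages in this example can be sketched as below. The node totals match the text; the per-card split for N1 is an assumption, since only its total of 4069 MiB is given. The function names are hypothetical and stand in for the default scheduler's node-level check and the Extender's per-card check.

```python
REQUEST_MIB = 8138

# Free memory per card in MiB. The N1 split is assumed for illustration.
nodes = {
    "N1": [0, 4069],
    "N2": [4069, 4069],
    "N3": [8138, 0],
}


def default_filter(free_per_card, request):
    """Node-level check: only the sum of free memory is visible."""
    return sum(free_per_card) >= request


def extender_filter(free_per_card, request):
    """Per-card check: at least one card must hold the whole request."""
    return any(free >= request for free in free_per_card)


after_default = [n for n, cards in nodes.items() if default_filter(cards, REQUEST_MIB)]
after_extender = [n for n in after_default if extender_filter(nodes[n], REQUEST_MIB)]

print(after_default)   # ['N2', 'N3']
print(after_extender)  # ['N3']
```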
2.2. When the scheduler finds a node that meets the condition, it entrusts the bind method of the GPU Share Scheduler Extender to bind the node and the Pod. Here, the Extender needs to perform two operations:
The ID of the selected GPU card is saved as ALIYUN_COM_GPU_MEM_IDX in the annotation of the Pod. In addition, the GPU memory requested by the Pod is saved as ALIYUN_COM_GPU_MEM_POD, together with ALIYUN_COM_GPU_MEM_ASSUME_TIME, in the annotation of the Pod, and the Pod is bound to the selected node at this time.
Note: The Pod annotation ALIYUN_COM_GPU_MEM_ASSIGNED is also saved and initialized to "false". It means that the Pod has been assigned to a GPU card during scheduling, but has not yet actually been created on the node. ALIYUN_COM_GPU_MEM_ASSUME_TIME represents the time of that assignment.
If no GPU on the allocated node meets the condition, the Extender does not perform the binding and exits directly without reporting an error. The default scheduler reschedules the Pod after the assume period times out.
As shown in the following figure, when the GPU Share Scheduler Extender binds a Pod requesting gpu-mem 8138 to the selected node N1, it first compares the available resources of the GPUs, which are GPU0 (12207), GPU1 (8138), GPU2 (4069) and GPU3 (16276). GPU2 is discarded because its remaining resources do not meet the requirement. Of the other 3 GPUs, GPU1 has the least free memory while still satisfying the request, so GPU1 is selected.
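The card selection above amounts to picking the card with the least free memory that still fits the request. A minimal sketch follows, using the numbers from the example. The function names, the tie-break on the lower card index, and the string encoding of the annotation values (including the nanosecond timestamp) are assumptions, not the Extender's actual code.

```python
import time


def pick_card_binpack(free_per_card, request):
    """Return the index of the fitting card with the least free memory.

    free_per_card maps card index to free MiB. Returns None if no card fits.
    Ties are broken by the lower card index (an assumption).
    """
    candidates = [(free, idx) for idx, free in free_per_card.items() if free >= request]
    if not candidates:
        return None
    return min(candidates)[1]


def bind_annotations(card_index, request, now_ns):
    """Build the Pod annotations written at bind time (encoding assumed)."""
    return {
        "ALIYUN_COM_GPU_MEM_IDX": str(card_index),
        "ALIYUN_COM_GPU_MEM_POD": str(request),
        "ALIYUN_COM_GPU_MEM_ASSUME_TIME": str(now_ns),
        "ALIYUN_COM_GPU_MEM_ASSIGNED": "false",
    }


free_on_n1 = {0: 12207, 1: 8138, 2: 4069, 3: 16276}
chosen = pick_card_binpack(free_on_n1, 8138)
print(chosen)  # 1
print(bind_annotations(chosen, 8138, time.time_ns()))
```

Packing onto the fullest card that still fits leaves the emptier cards free for larger requests, which is why GPU1 is preferred over GPU0 and GPU3 here.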
3. Run on the node
When Kubelet receives the event that the Pod has been bound to the node, it creates the actual Pod entity on the node. In this process, Kubelet calls the Allocate method of the GPU Share Device Plugin, and the parameter of the Allocate method is the gpu-mem requested by the Pod. In the Allocate method, the corresponding Pod is run according to the scheduling decision of the GPU Share Scheduler Extender.
3.1. All GPU Share Pods on this node that are in Pending status and have ALIYUN_COM_GPU_MEM_ASSIGNED set to false are listed.
3.2. The Pod whose ALIYUN_COM_GPU_MEM_POD value (in the Pod annotation) equals the amount requested in the Allocate call is selected. If multiple Pods meet the condition, the Pod with the earliest ALIYUN_COM_GPU_MEM_ASSUME_TIME is selected.
3.3. ALIYUN_COM_GPU_MEM_ASSIGNED in the Pod annotation is set to true, and the GPU information in the Pod annotation is converted into environment variables and returned to Kubelet, which then actually creates the Pod.
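The matching steps above can be sketched as follows. Pods are modeled as plain dictionaries rather than Kubernetes API objects, and the environment variable names GPU_CARD_INDEX and GPU_MEM_MIB are hypothetical placeholders; the article does not name the variables that the Device Plugin actually returns.

```python
def match_pod_for_allocate(pods, requested_mib):
    """Pick the Pod that an Allocate call for requested_mib belongs to."""
    candidates = [
        p for p in pods
        if p["phase"] == "Pending"
        and p["annotations"].get("ALIYUN_COM_GPU_MEM_ASSIGNED") == "false"
        and int(p["annotations"].get("ALIYUN_COM_GPU_MEM_POD", "-1")) == requested_mib
    ]
    if not candidates:
        return None
    # Several Pods may request the same amount: take the earliest assumed one.
    pod = min(candidates, key=lambda p: int(p["annotations"]["ALIYUN_COM_GPU_MEM_ASSUME_TIME"]))
    pod["annotations"]["ALIYUN_COM_GPU_MEM_ASSIGNED"] = "true"
    return pod


def container_env(pod):
    """Convert the GPU annotations into environment variables (names assumed)."""
    ann = pod["annotations"]
    return {
        "GPU_CARD_INDEX": ann["ALIYUN_COM_GPU_MEM_IDX"],
        "GPU_MEM_MIB": ann["ALIYUN_COM_GPU_MEM_POD"],
    }


def make_pod(name, mem, idx, assume_time, assigned="false", phase="Pending"):
    return {"name": name, "phase": phase, "annotations": {
        "ALIYUN_COM_GPU_MEM_POD": str(mem),
        "ALIYUN_COM_GPU_MEM_IDX": str(idx),
        "ALIYUN_COM_GPU_MEM_ASSUME_TIME": str(assume_time),
        "ALIYUN_COM_GPU_MEM_ASSIGNED": assigned,
    }}


pods = [
    make_pod("a", 8138, 0, 200),
    make_pod("b", 8138, 1, 100),
    make_pod("c", 4069, 0, 50),
    make_pod("d", 8138, 1, 10, assigned="true"),
]
selected = match_pod_for_allocate(pods, 8138)
print(selected["name"])         # b
print(container_env(selected))  # {'GPU_CARD_INDEX': '1', 'GPU_MEM_MIB': '8138'}
```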
Currently, the project has been open sourced on GitHub.
1. First, create an application that uses aliyun.com/gpu-mem:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB
            aliyun.com/gpu-mem: 1024
See Usage Documentation.
See How to Build.