Empower Deep Learning with GPU Sharing for Cluster Scheduling

Kubernetes, a leading container orchestration platform, can schedule Nvidia GPUs in container clusters. It typically assigns one whole GPU to one container, which suits the training of deep learning models. However, many resources are still wasted in model development and prediction scenarios, where a single task often does not need an entire GPU. In these scenarios, we may want several containers to share a GPU in the cluster.

GPU sharing for cluster scheduling lets multiple model development and prediction services share a GPU, thereby improving Nvidia GPU utilization in the cluster. This requires dividing GPU resources along two dimensions: GPU memory and CUDA kernel threads. Generally, cluster-level GPU sharing comes down to two things: scheduling and isolation.

The Kubernetes community does not currently have a good solution for fine-grained GPU card scheduling. This is because Kubernetes extended resources, such as GPUs, are counted only in whole integers and cannot express the allocation of more complex resources. For example, if you want Pod A to occupy half of a GPU card, the current Kubernetes architecture has no way to record that allocation or act on it. In addition, GPU sharing across multiple cards is a vector resource (one amount per card), whereas an extended resource describes a scalar resource (a single total per node).
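To make the scalar-versus-vector distinction concrete, here is a minimal Python sketch. The node layout, the GiB unit, and the function names are illustrative assumptions; none of them come from Kubernetes or from the solution described in this post.

```python
from typing import List


def fits_scalar(free_per_gpu: List[int], request: int) -> bool:
    """Scalar view: only the node-wide total of free GPU memory is known."""
    return sum(free_per_gpu) >= request


def fits_vector(free_per_gpu: List[int], request: int) -> bool:
    """Vector view: the request must fit on one single card."""
    return any(free >= request for free in free_per_gpu)


# Hypothetical node with two cards, each with 6 GiB still free.
free = [6, 6]
request = 8  # GiB requested by one pod

print(fits_scalar(free, request))  # True: 12 GiB free in total
print(fits_vector(free, request))  # False: no single card has 8 GiB free
```

A scheduler that sees only the scalar total would accept this pod even though no card can actually hold it, which is why per-card bookkeeping is needed.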

Therefore, we have designed an out-of-tree GPU sharing scheduling solution built on the Kubernetes extension and plugin mechanisms. It is not invasive to core components such as the API Server, the Scheduler, the Controller Manager, and the Kubelet.
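As a rough illustration of what such an extension has to decide, the Python sketch below filters nodes by per-card free memory and then reserves memory on one card. It is a simplified model under stated assumptions, not the actual implementation: the `Node` class, the function names, the GiB unit, and the binpack policy (prefer the fullest card that still fits) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_mem: List[int]  # free GPU memory per card, in GiB (assumed unit)


def filter_nodes(nodes: List[Node], request: int) -> List[Node]:
    """Keep only nodes where at least one single card can hold the request."""
    return [n for n in nodes if any(f >= request for f in n.free_mem)]


def pick_gpu(node: Node, request: int) -> Optional[int]:
    """Assumed binpack policy: the card with the least free memory that fits."""
    candidates = [(f, i) for i, f in enumerate(node.free_mem) if f >= request]
    if not candidates:
        return None
    _, index = min(candidates)
    return index


def bind(node: Node, request: int) -> Optional[int]:
    """Reserve memory on the chosen card and return its index, or None."""
    index = pick_gpu(node, request)
    if index is not None:
        node.free_mem[index] -= request
    return index


# Hypothetical cluster: node-a is nearly full, node-b has room.
cluster = [Node("node-a", [2, 3]), Node("node-b", [8, 5])]
feasible = filter_nodes(cluster, 4)
print([n.name for n in feasible])  # ['node-b']
print(bind(feasible[0], 4))        # 1
print(cluster[1].free_mem)         # [8, 1]
```

Packing requests onto the fullest card that fits keeps other cards free for larger requests. A real deployment would also need to persist the chosen card somewhere the node can read it, which this sketch leaves out.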

It is suitable for cluster administrators who want to improve the GPU utilization of their clusters, and for application developers who want to run multiple tasks on a Volta GPU at the same time.

For the detailed design and deployment procedure, please go to Advance Deep Learning with Alibaba Open-Source and Pluggable Scheduling Tool for GPU Sharing.

Related Market Product

NVIDIA GPU Cloud Virtual Machine Image

The NVIDIA GPU Cloud Virtual Machine Image is an optimized environment for running GPU-optimized deep learning frameworks and HPC applications available from the NVIDIA GPU Cloud container registry.

Related Products

E-MapReduce

E-MapReduce (EMR) is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.

Machine Learning Platform for AI

Machine Learning Platform for AI provides end-to-end machine learning services, including data processing, feature engineering, model training, model prediction, and model evaluation. Machine Learning Platform for AI combines all of these services to make AI more accessible than ever.

Related Blog Posts

How to Install NVIDIA GPU Cloud Virtual Machine Image on Alibaba Cloud

How to Prepare Your GPU Machine and Get It cuDNN Ready

Related Documentation

Create a compute optimized instance with GPU capabilities

Accelerate machine learning tasks on a GPU instance by using RAPIDS
