Empower Deep Learning with GPU Sharing for Cluster Scheduling

GPU sharing optimizes the usage of GPU resources in a cluster, which improves your experience with deep learning tasks.

Kubernetes, a leading container orchestration platform, can schedule Nvidia GPUs in container clusters. It mainly assigns one whole GPU to one container, which suits the training of deep learning models. However, a lot of GPU resources are still wasted in model development and prediction scenarios. In these scenarios, we may want to share a GPU within a cluster.

GPU sharing for cluster scheduling lets more model development and prediction services share a GPU, thereby improving Nvidia GPU utilization in a cluster. This requires dividing GPU resources, which can be split along two dimensions: GPU video memory and CUDA kernel threads. Generally, cluster-level GPU sharing comes down to two things: scheduling and isolation.

The Kubernetes community does not currently have a good solution for fine-grained GPU card scheduling. This is because Kubernetes defines extended resources, such as GPUs, as integer quantities that can only be added and subtracted, so it cannot express the allocation of more complex resources. For example, if you want Pod A to occupy half of a GPU card, the current Kubernetes architecture cannot record that allocation or act on it. In addition, sharing GPUs on a multi-card node involves vector resources (one value per card), whereas an extended resource describes a scalar resource.
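The scalar-versus-vector distinction can be illustrated with a minimal Python sketch. It is not the scheduling tool's code: the `Node` class, its method names, and the GiB figures are hypothetical and chosen only for illustration. The sketch assumes a node with two cards that each have 4 GiB of free memory and a pod that asks for 6 GiB.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical model of a GPU node; not a Kubernetes API type."""
    name: str
    free_mem_per_gpu: List[int]  # free GPU memory per card, in GiB (illustrative)

    def scalar_fits(self, request: int) -> bool:
        # Scalar view: the node advertises one number, the sum over all cards.
        return request <= sum(self.free_mem_per_gpu)

    def vector_fit(self, request: int) -> Optional[int]:
        # Vector view: the request must fit on a single card.
        # Returns the index of the first card that can hold it, or None.
        for index, free in enumerate(self.free_mem_per_gpu):
            if request <= free:
                return index
        return None


node = Node("gpu-node-1", [4, 4])
request = 6  # GiB requested by the pod

print(node.scalar_fits(request))  # True: 6 <= 4 + 4
print(node.vector_fit(request))   # None: no single card has 6 GiB free
```

The scalar check accepts the pod even though no single card can hold it, which is why a sharing scheduler needs per-card bookkeeping on top of the integer-valued extended resource.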

Therefore, we have designed an out-of-tree GPU sharing scheduling solution based on the Kubernetes extension and plugin mechanisms. It is not invasive to core components such as the API Server, the Scheduler, the Controller Manager, and the Kubelet.

It is suitable for cluster administrators who want to improve the GPU utilization of the cluster and for application developers who want to run multiple logic tasks on a Volta GPU at the same time.

For the detailed design and deployment procedure, please go to Advance Deep Learning with Alibaba Open-Source and Pluggable Scheduling Tool for GPU Sharing.

Related Blog Posts

How to Install NVIDIA GPU Cloud Virtual Machine Image on Alibaba Cloud

NVIDIA makes available on the Alibaba Cloud platform a customized image optimized for NVIDIA Pascal™ and Volta™-based Tesla GPUs. Running NGC containers on this virtual machine (VM) instance provides optimum performance for deep learning jobs.

For those familiar with the Alibaba platform, the process of launching the instance is as simple as logging into Alibaba, selecting the "NVIDIA GPU Cloud Machine Image" and one of the supported NVIDIA GPU instance types, configuring settings as needed, then launching the VM. After launching the VM, you can SSH into it and start running deep learning jobs using framework containers from the NGC container registry.

This article provides step-by-step instructions for accomplishing this.

How to Prepare Your GPU Machine and Get It cuDNN Ready

Video and image processing solutions are some of the hottest topics today. However, there are always questions about how to prepare a GPU system for video or image processing.

This article captures the step-by-step installation process for preparing cuDNN on the Alibaba Cloud GPU compute service. cuDNN is part of the NVIDIA Deep Learning SDK and includes standard routines such as pooling, normalization, and convolution.

To prepare the deep learning platform, we start by setting up the GPU compute service. We can deploy any GN5 series machine; in this demo, we will be using an x86_64 Linux machine. We install both cuDNN and the CUDA drivers (a requirement for the cuDNN library) and then verify the cuDNN library.

Related Documentation

Create a compute optimized instance with GPU capabilities

You must install the GPU driver to use a compute optimized instance with GPU capabilities. You can choose whether to install the GPU driver when you create an instance, or manually install the driver after the instance is created. This topic describes how to create a compute optimized instance with GPU capabilities and install the driver during the creation process.

Alibaba Cloud Marketplace provides images that support deep learning and machine learning

Accelerate machine learning tasks on a GPU instance by using RAPIDS

NVIDIA GPU Cloud (NGC) is a deep learning ecosystem developed by NVIDIA to provide developers with free access to deep learning and machine learning software stacks that allow them to quickly build corresponding environments. The NGC website provides RAPIDS Docker images, which come with pre-installed environments.

This topic describes how to use RAPIDS libraries (based on the NGC environment) that are installed on a GPU instance to accelerate tasks for data science and machine learning and improve the efficiency of computing resources.

Related Market Product

NVIDIA GPU Cloud Virtual Machine Image

The NVIDIA GPU Cloud Virtual Machine Image is an optimized environment for running GPU-optimized deep learning frameworks and HPC applications available from the NVIDIA GPU Cloud container registry.

Related Products


EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.

Machine Learning Platform for AI

Machine Learning Platform for AI provides end-to-end machine learning services, including data processing, feature engineering, model training, model prediction, and model evaluation. Machine Learning Platform for AI combines all of these services to make AI more accessible than ever.


Alibaba Clouder
