Kubernetes, the leading open-source container orchestration platform, can schedule NVIDIA GPUs in container clusters, but it mainly assigns one GPU to one container. This works well for training deep learning models, yet a lot of resources are wasted in model development and prediction scenarios. In these scenarios, we may want several containers to share a GPU in a cluster.
GPU sharing at the cluster scheduling level lets multiple model development and prediction services share a GPU, thereby improving NVIDIA GPU utilization in a cluster. This requires dividing GPU resources, typically along two dimensions: GPU memory and CUDA kernel threads. In general, cluster-level GPU sharing involves two concerns: scheduling and isolation.
For fine-grained GPU card scheduling, the Kubernetes community does not have a good solution at present. This is because the Kubernetes definition of extended resources, such as GPUs, only supports addition and subtraction at integer granularity and cannot express the allocation of fractional or complex resources. For example, if you want Pod A to occupy half of a GPU card, the recording and accounting of such an allocation cannot be implemented in the current Kubernetes architecture. In effect, multi-card GPU sharing involves vector resources, while an Extended Resource can only describe scalar resources.
Therefore, we designed an out-of-tree GPU sharing scheduling solution based on the Kubernetes extension and plugin mechanisms. It is not intrusive to core components such as the API Server, the Scheduler, the Controller Manager, and the Kubelet.
It is suitable for cluster administrators who want to improve the GPU utilization of the cluster, and for application developers who want to run multiple logical tasks on the same Volta GPU at the same time.
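Under this design, GPU memory can be exposed as its own integer-valued extended resource so that standard scalar accounting still works. A minimal pod spec might look like the following sketch, assuming the GPU share device plugin registers a resource named `aliyun.com/gpu-mem` measured in GiB (the exact resource name and unit depend on how the plugin is deployed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        # Request 3 GiB of GPU memory instead of a whole card;
        # the scheduler extender can then pack several such pods
        # onto the same physical GPU.
        aliyun.com/gpu-mem: 3
```

Because each pod requests an integer amount of GPU memory, the existing extended-resource arithmetic in Kubernetes remains valid; only the scheduler extender needs to know which physical card each allocation maps to.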
For the detailed design and deployment procedure, see Advance Deep Learning with Alibaba Open-Source and Pluggable Scheduling Tool for GPU Sharing.
NVIDIA makes available on the Alibaba Cloud platform a customized image optimized for NVIDIA Pascal- and Volta-based Tesla GPUs. Running NGC containers on this virtual machine (VM) instance provides optimal performance for deep learning jobs.
For those familiar with the Alibaba platform, the process of launching the instance is as simple as logging into Alibaba, selecting the "NVIDIA GPU Cloud Machine Image" and one of the supported NVIDIA GPU instance types, configuring settings as needed, then launching the VM. After launching the VM, you can SSH into it and start running deep learning jobs using framework containers from the NGC container registry.
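After SSHing into the VM, the workflow is roughly the following sketch; the registry path is NGC's public registry, but the container tag shown is illustrative and should be replaced with a current release:

```shell
# Sign in to the NGC container registry
# (use the API key generated at ngc.nvidia.com as the password)
docker login nvcr.io

# Pull a deep learning framework container, e.g. TensorFlow
docker pull nvcr.io/nvidia/tensorflow:21.07-tf2-py3

# Run it interactively with GPU access, mounting a local data directory
docker run --gpus all -it --rm \
  -v "$HOME/data:/workspace/data" \
  nvcr.io/nvidia/tensorflow:21.07-tf2-py3
```

From inside the container, the framework is preinstalled and tuned for the underlying GPUs, so training scripts can be launched directly.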
This article provides step-by-step instructions for accomplishing this.
Video and image processing are among the hottest topics today. However, questions often arise about how to prepare a GPU system for video or image processing.
This article captures the step-by-step process of installing cuDNN on an Alibaba Cloud GPU compute instance. cuDNN is part of the NVIDIA Deep Learning SDK and includes standard routines such as pooling, normalization, and convolution.
To prepare the deep learning platform, we start by setting up the GPU compute service. We can deploy any GN5-series machine; in this demo, we use an x86_64 Linux machine. We install both the CUDA driver (a requirement for the cuDNN library) and cuDNN, and then verify the cuDNN installation.
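Assuming the CUDA toolkit is already installed under /usr/local/cuda, the cuDNN installation on such a machine typically boils down to the following sketch; the archive name uses placeholder version numbers and corresponds to whatever cuDNN release you download from the NVIDIA developer site:

```shell
# Extract the cuDNN archive downloaded from developer.nvidia.com
tar -xzvf cudnn-x.x-linux-x64-vx.x.x.x.tgz

# Copy the headers and libraries into the CUDA installation
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h \
               /usr/local/cuda/lib64/libcudnn*

# Verify the installation by printing the cuDNN version macros
grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn*.h
```

If the version macros print as expected, frameworks built against cuDNN (TensorFlow, PyTorch, and others) will be able to locate the library at runtime.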
The Data Science cluster is a new cluster type available in E-MapReduce (EMR) 3.13.0 and later versions for machine learning and deep learning. You can use GPU or CPU instances to perform data training through Data Science clusters. Training data can be stored on HDFS or OSS. EMR supports TensorFlow for distributed training on large amounts of data.
NVIDIA GPU Cloud (NGC) is a deep learning ecosystem developed by NVIDIA to provide developers with free access to deep learning and machine learning software stacks, allowing them to quickly build corresponding environments. The NGC website provides RAPIDS Docker images that come with pre-installed environments.
This topic describes how to use RAPIDS libraries (based on the NGC environment) installed on a GPU instance to accelerate data science and machine learning tasks and improve the efficiency of computing resources.
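For example, the prebuilt RAPIDS environment can be pulled from the NGC registry and started with Jupyter exposed. This is a sketch; the image tag is illustrative and should be replaced with a current RAPIDS release matching your CUDA version:

```shell
# Pull the RAPIDS image from the NGC registry
docker pull nvcr.io/nvidia/rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04

# Start it with GPU access and the Jupyter port mapped to the host
docker run --gpus all -it --rm -p 8888:8888 \
  nvcr.io/nvidia/rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04
```

With the container running, GPU-accelerated libraries such as cuDF and cuML are available immediately, with no local environment setup required.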
The NVIDIA GPU Cloud Virtual Machine Image is an optimized environment for running GPU-optimized deep learning frameworks and HPC applications available from the NVIDIA GPU Cloud container registry.
EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.
Machine Learning Platform for AI provides end-to-end machine learning services, including data processing, feature engineering, model training, model prediction, and model evaluation. Machine Learning Platform for AI combines all of these services to make AI more accessible than ever.