Deep Learning Containers (DLC) of Machine Learning Platform for AI is a deep learning platform built on Alibaba Cloud Container Service for Kubernetes (ACK). It provides a stable, easy-to-use, scalable, and high-performance runtime for training deep learning models.
DLC integrates the deep learning frameworks and network optimization technologies of Machine Learning Platform for AI, and offers distributed computing capabilities that scale near-linearly. The NVIDIA P128 GPU can boost the speedup of parallel computing to more than 100 times. In scenarios that involve datasets of tens of millions of images or hundreds of thousands of categories, DLC delivers more than eight times the training performance of open source deep learning frameworks. For search, recommendation, advertising, and media feed services in the Internet industry, DLC can handle hundreds of billions of samples and tens of billions of features, and lets you train models on thousands of nodes in parallel. In these scenarios, DLC delivers more than five times the training performance of open source deep learning frameworks.
DLC provides the following features:
- Provides a distributed deployment solution for data parallelism, model parallelism, and hybrid parallelism.
- Allows you to use existing ACK clusters to train deep learning models.
- Supports open source Kubernetes interfaces, so you can submit training jobs that use custom images.
- Provides DLC Dashboard, a training job management platform deployed in ACK clusters, where you can submit jobs and monitor the progress of each job visually.
- Allows you to use Arena and kubectl to submit, manage, and view jobs. Arena is a command-line tool for running AI workloads on Kubernetes. kubectl is the command-line tool for Kubernetes clusters.
- Allows you to monitor GPU resource usage in real time to facilitate task scheduling.
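
To make the data parallelism named in the feature list more concrete, the following is a minimal, framework-free sketch of the idea. It is not DLC code and does not use any DLC API. The model (`y = w * x`), the toy dataset, and the helper names `gradient`, `split`, and `data_parallel_step` are all hypothetical and chosen only for illustration. Each simulated worker computes a gradient on its own shard of the data, and the gradients are then averaged before the shared weight is updated, which is the role an all-reduce step plays in a real distributed job.

```python
# Minimal data-parallelism sketch (illustrative only, not DLC code).
# Each simulated worker computes a gradient on its own data shard; the
# gradients are averaged before the shared weight is updated.


def gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(data, n_workers):
    """Split data into n_workers equal, contiguous shards.

    Assumes len(data) is divisible by n_workers.
    """
    size = len(data) // n_workers
    return [data[i * size:(i + 1) * size] for i in range(n_workers)]


def data_parallel_step(w, data, n_workers, lr):
    """Run one synchronous gradient step with gradients averaged across workers."""
    shards = split(data, n_workers)
    # On a real cluster, each of these gradients is computed on a different node.
    grads = [gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / n_workers
    return w - lr * avg_grad


# Hypothetical toy dataset generated from y = 3 * x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, n_workers=4, lr=0.01)
print(round(w, 3))
```

With equally sized shards, the averaged gradient equals the gradient computed over the whole dataset, so the number of workers changes how the work is divided but not the result of each step. Model parallelism instead splits the model itself across devices, and hybrid parallelism combines the two.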