Deep Learning Containers (DLC) is a managed training service within PAI that lets you quickly create single-node or distributed training jobs. DLC uses Kubernetes to launch compute nodes, so you do not need to manually provision machines or configure runtime environments. DLC integrates with existing workflows without disruption, supports multiple deep learning frameworks, and offers flexible resource configuration options, making it ideal for rapid training job deployment.
How DLC works
When you submit a training job, DLC handles the infrastructure so you can focus on your model code:
1. Submit a job -- Define your training job through the console, an SDK, or the command line. Specify the framework, resource type, and runtime environment.
2. Provision resources -- DLC allocates compute nodes from Lingjun AI Computing Service or general computing resources based on your configuration.
3. Launch containers -- Kubernetes launches containers on the provisioned nodes using an official image or a custom runtime environment.
4. Run training -- Your training code executes. For distributed jobs, DLC coordinates work across nodes using the selected framework.
5. Monitor and recover -- Built-in fault tolerance and health checks detect issues and trigger automatic recovery.
6. Output results -- When training completes, logs and results are available for review.
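As a rough sketch, the job definition in step 1 can be thought of as a small specification object. The class and field names below are illustrative assumptions for explanation only, not the actual DLC SDK or API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a DLC-style job specification. Field names are
# illustrative assumptions and do not match the real DLC SDK.
@dataclass
class TrainingJob:
    name: str
    framework: str                 # e.g. "PyTorch", "TensorFlow"
    resource_type: str             # "Lingjun" or "General"
    node_count: int = 1            # more than one node makes the job distributed
    image: str = "official/pytorch:latest"   # official or custom image
    command: str = "python train.py"

    def is_distributed(self) -> bool:
        return self.node_count > 1

job = TrainingJob(name="llm-pretrain", framework="PyTorch",
                  resource_type="Lingjun", node_count=8)
print(job.is_distributed())  # → True
```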
Key capabilities
Run training at any scale
DLC is built on Lingjun AI Computing Service and general computing resources, giving you access to a range of compute instance types:
- Elastic Compute Service (ECS)
- Elastic Container Instance (ECI)
- Shenlong bare metal instances
- Lingjun bare metal instances
This combination enables hybrid scheduling of heterogeneous computing, so you can match the right hardware to each workload.
Use your preferred frameworks
DLC supports more than ten training frameworks without requiring you to build or maintain a cluster, including Megatron, DeepSpeed, PyTorch, TensorFlow, Slurm, Ray, MPI, and XGBoost.
DLC provides various official images and supports custom runtime environments. Submit jobs through the console, an SDK, or the command line.
Train reliably at scale
For LLM training, DLC includes proprietary reliability features that provide rapid detection, precise diagnostics, and fast feedback:
| Feature | Description |
| --- | --- |
| AIMaster | A proprietary fault tolerance engine that detects and recovers from failures automatically. |
| EasyCKPT | A high-performance checkpointing framework that saves training state efficiently. |
| SanityCheck | A health check feature that validates node readiness in the early stages of a training job. |
| Node self-healing | Detects and handles unhealthy nodes to keep training running. |
Together, these features resolve stability issues, reduce computing power loss, and improve training reliability.
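EasyCKPT itself is proprietary, but the general idea behind checkpoint-based recovery can be sketched in plain Python: training state is saved periodically so a recovered job resumes from the last checkpoint instead of step 0. The helpers below are hypothetical, not EasyCKPT's API:

```python
import json, os, tempfile

# Generic checkpoint-and-resume sketch (hypothetical helpers, not EasyCKPT).
def save_checkpoint(path: str, step: int, state: dict) -> None:
    # Write atomically: dump to a temp file, then rename, so a crash
    # mid-write never leaves a corrupt checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path: str):
    if not os.path.exists(path):
        return 0, {}           # fresh start: no checkpoint yet
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

ckpt_path = "ckpt.json"
step, state = load_checkpoint(ckpt_path)   # resume point after a failure
for step in range(step, 10):
    state["loss"] = 1.0 / (step + 1)       # stand-in for one training step
    if step % 5 == 0:                      # periodic checkpointing
        save_checkpoint(ckpt_path, step + 1, state)
```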
Accelerate training performance
A proprietary AI training acceleration framework improves distributed training efficiency through multiple optimization layers:
- Parallel strategies -- Data parallelism, pipeline parallelism, operator splitting, and nested parallelism, with automatic parallel strategy exploration.
- Memory optimization -- Multi-dimensional memory optimization reduces GPU memory pressure.
- Network-aware scheduling -- Topology-aware scheduling over high-speed networks places workloads for optimal communication patterns.
- Communication optimizations -- The distributed communication library includes communication thread pools, gradient grouping, mixed-precision communication, and gradient compression.
These optimizations are especially effective for large model pre-training, continuous training, and alignment.
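Of these techniques, gradient grouping is the simplest to illustrate: instead of issuing one communication call per small gradient, gradients are packed into larger buckets so far fewer all-reduce calls are needed. A minimal pure-Python sketch (real implementations bucket GPU tensors by byte size):

```python
# Gradient grouping (bucketing) sketch: pack many small gradient lists
# into buckets of at most bucket_size elements, so the communication
# layer issues a few large calls instead of many small ones.
def group_gradients(grads: list[list[float]], bucket_size: int) -> list[list[float]]:
    buckets, current = [], []
    for g in grads:
        # Close the current bucket if adding g would overflow it.
        # (A single gradient larger than bucket_size gets its own bucket.)
        if current and len(current) + len(g) > bucket_size:
            buckets.append(current)
            current = []
        current.extend(g)
    if current:
        buckets.append(current)
    return buckets

grads = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6], [0.7]]
print(group_gradients(grads, bucket_size=4))
# → [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6, 0.7]]
```

Four gradients collapse into two communication buckets, which is the effect gradient grouping exploits at scale.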
Resource types
PAI offers two resource types based on use case and computing power requirements.
| | Lingjun AI Computing Service | General computing resources |
| --- | --- | --- |
| Best for | Large model training and deep learning tasks that require massive computing resources | Standard training needs across various scales and types |
| Architecture | Hardware-software co-optimization for ultra-large-scale deep learning and integrated AI computing | Standard cloud compute infrastructure |
| Core strengths | High performance, high efficiency, and high utilization | Flexibility across machine learning tasks of various scales and types |
| Typical use cases | Large model training, autonomous driving, fundamental research, finance | General machine learning and deep learning workloads |
| Key differentiator | High-performance heterogeneous computing foundation with end-to-end AI engineering capabilities | Broad compatibility with standard training workflows |
Purchasing options
Lingjun AI Computing Service and general computing resources are available through the following purchasing options:
| Option | Description | Billing model | Availability |
| --- | --- | --- | --- |
| Resource quota | Purchase Lingjun AI Computing Service or general computing resources in advance for AI development and training. This option enables flexible resource management and efficient use. | Subscription | Lingjun and general |
| Public resources | Use Lingjun AI Computing Service or general computing resources on demand when you submit a training job, without purchasing in advance. | Pay-as-you-go | Lingjun and general |
| Preemptible resources | Acquire AI computing power at a lower cost to reduce overall job expenses. | Pay-as-you-go (discounted) | Lingjun only |
Scenarios
Data preprocessing
Customize the runtime environment to run offline, parallel data preprocessing jobs. This is useful for cleaning, transforming, or augmenting large datasets before training.
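For example, a custom image might run a script like the following, which maps a cleaning function over records in parallel. `clean_record` is a hypothetical stand-in for your own transform; a thread pool keeps the sketch portable, while CPU-bound jobs would use processes or multiple DLC nodes:

```python
from concurrent.futures import ThreadPoolExecutor

# Offline parallel preprocessing sketch. clean_record is a hypothetical
# transform; replace it with real cleaning or augmentation logic.
def clean_record(record: str) -> str:
    return record.strip().lower()

def preprocess(records: list[str], workers: int = 4) -> list[str]:
    # Apply the transform to every record using a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_record, records))

print(preprocess(["  Hello ", "WORLD\n"]))  # → ['hello', 'world']
```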
Large-scale distributed training
Conduct offline, large-scale distributed training using various open-source deep learning frameworks. DLC supports training on thousands of nodes simultaneously, significantly shortening training time.
Offline inference
Use DLC to run offline inference on models. This approach improves GPU utilization during idle periods and reduces resource waste. Offline inference is well suited for batch prediction workloads where real-time latency is not required.
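The batch-prediction loop such a job runs can be sketched as follows; `model` here is a hypothetical stand-in for a real model call:

```python
# Offline batch inference sketch: score a dataset in fixed-size batches.
# With no real-time latency constraint, large batches keep the GPU busy.
def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def model(batch):                      # hypothetical stand-in model:
    return [len(x) for x in batch]     # returns one "score" per input

inputs = ["alpha", "beta", "gamma", "delta", "epsilon"]
predictions = []
for batch in batches(inputs, size=2):
    predictions.extend(model(batch))
print(predictions)  # → [5, 4, 5, 5, 7]
```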
Get started
- Create training tasks -- Learn how to submit training jobs through the console, an SDK, or the command line, and configure key parameters.
- DLC use cases -- Learn how to use DLC through practical examples.