Deep Learning Containers (DLC) is a managed training service within PAI that lets you quickly create single-node or distributed training jobs. DLC uses Kubernetes to launch compute nodes, so you do not need to manually provision machines or configure runtime environments. DLC integrates with existing workflows without disruption, supports multiple deep learning frameworks, and offers flexible resource configuration options, making it ideal for rapid training job deployment.
How DLC works
When you submit a training job, DLC handles the infrastructure so you can focus on your model code:
1. Submit a job -- Define your training job through the console, an SDK, or the command line. Specify the framework, resource type, and runtime environment.
2. Provision resources -- DLC allocates compute nodes from Lingjun AI Computing Service or general computing resources based on your configuration.
3. Launch containers -- Kubernetes launches containers on the provisioned nodes using an official image or a custom runtime environment.
4. Run training -- Your training code executes. For distributed jobs, DLC coordinates work across nodes using the selected framework.
5. Monitor and recover -- Built-in fault tolerance and health checks detect issues and trigger automatic recovery.
6. Output results -- When training completes, logs and results are available for review.
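As a rough sketch, the job definition in step 1 can be thought of as a small specification object. The class and field names below are illustrative assumptions for explanation only, not the actual DLC SDK or API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a DLC-style job specification. Field names are
# illustrative assumptions and do not match the real DLC SDK.
@dataclass
class TrainingJob:
    name: str
    framework: str                 # e.g. "PyTorch", "TensorFlow"
    resource_type: str             # "Lingjun" or "General"
    node_count: int = 1            # more than one node makes the job distributed
    image: str = "official/pytorch:latest"   # official or custom image
    command: str = "python train.py"

    def is_distributed(self) -> bool:
        return self.node_count > 1

job = TrainingJob(name="llm-pretrain", framework="PyTorch",
                  resource_type="Lingjun", node_count=8)
print(job.is_distributed())  # → True
```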
Key capabilities
Run training at any scale
DLC is built on Lingjun AI Computing Service and general computing resources, giving you access to a range of compute instance types:
- Elastic Compute Service (ECS)
- Elastic Container Instance (ECI)
- Shenlong bare metal instances
- Lingjun bare metal instances
This combination enables hybrid scheduling of heterogeneous computing, so you can match the right hardware to each workload.
Use your preferred frameworks
DLC supports more than ten training frameworks without requiring you to build or maintain a cluster, including Megatron, DeepSpeed, PyTorch, TensorFlow, Slurm, Ray, MPI, and XGBoost.
DLC provides various official images and supports custom runtime environments. Submit jobs through the console, an SDK, or the command line.
Train reliably at scale
For LLM training, DLC includes proprietary reliability features that provide rapid detection, precise diagnostics, and fast feedback:
| Feature | Description |
| --- | --- |
| AIMaster | A proprietary fault tolerance engine that detects and recovers from failures automatically. |
| EasyCKPT | A high-performance checkpointing framework that saves training state efficiently. |
| SanityCheck | A health check feature that validates node readiness in the early stages of a training job. |
| Node self-healing | Detects and handles unhealthy nodes to keep training running. |
Together, these features resolve stability issues, reduce computing power loss, and improve training reliability.
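EasyCKPT itself is proprietary, but the general idea behind checkpoint-based recovery can be sketched in plain Python: training state is saved periodically so a recovered job resumes from the last checkpoint instead of step 0. The helpers below are hypothetical, not EasyCKPT's API:

```python
import json, os, tempfile

# Generic checkpoint-and-resume sketch (hypothetical helpers, not EasyCKPT).
def save_checkpoint(path: str, step: int, state: dict) -> None:
    # Write atomically: dump to a temp file, then rename, so a crash
    # mid-write never leaves a corrupt checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path: str):
    if not os.path.exists(path):
        return 0, {}           # fresh start: no checkpoint yet
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

ckpt_path = "ckpt.json"
step, state = load_checkpoint(ckpt_path)   # resume point after a failure
for step in range(step, 10):
    state["loss"] = 1.0 / (step + 1)       # stand-in for one training step
    if step % 5 == 0:                      # periodic checkpointing
        save_checkpoint(ckpt_path, step + 1, state)
```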
Accelerate training performance
A proprietary AI training acceleration framework improves distributed training efficiency through multiple optimization layers:
- Parallel strategies -- Data parallelism, pipeline parallelism, operator splitting, and nested parallelism, with automatic parallel strategy exploration.
- Memory optimization -- Multi-dimensional memory optimization reduces GPU memory pressure.
- Network-aware scheduling -- Topology-aware scheduling over high-speed networks places workloads for optimal communication patterns.
- Communication optimizations -- The distributed communication library includes communication thread pools, gradient grouping, mixed-precision communication, and gradient compression.
These optimizations are especially effective for large model pre-training, continuous training, and alignment.
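Of these techniques, gradient grouping is the simplest to illustrate: instead of issuing one communication call per small gradient, gradients are packed into larger buckets so far fewer all-reduce calls are needed. A minimal pure-Python sketch (real implementations bucket GPU tensors by byte size):

```python
# Gradient grouping (bucketing) sketch: pack many small gradient lists
# into buckets of at most bucket_size elements, so the communication
# layer issues a few large calls instead of many small ones.
def group_gradients(grads: list[list[float]], bucket_size: int) -> list[list[float]]:
    buckets, current = [], []
    for g in grads:
        # Close the current bucket if adding g would overflow it.
        # (A single gradient larger than bucket_size gets its own bucket.)
        if current and len(current) + len(g) > bucket_size:
            buckets.append(current)
            current = []
        current.extend(g)
    if current:
        buckets.append(current)
    return buckets

grads = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6], [0.7]]
print(group_gradients(grads, bucket_size=4))
# → [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6, 0.7]]
```

Four gradients collapse into two communication buckets, which is the effect gradient grouping exploits at scale.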
Resource types
PAI offers two resource types based on use case and computing power requirements.
| | Lingjun AI Computing Service | General computing resources |
| --- | --- | --- |
| Best for | Large model training and deep learning tasks that require massive computing resources | Standard training needs across various scales and types |
| Architecture | Hardware-software co-optimization for ultra-large-scale deep learning and integrated AI computing | Standard cloud compute infrastructure |
| Core strengths | High performance, high efficiency, and high utilization | Flexibility across machine learning tasks of various scales and types |
| Typical use cases | Large model training, autonomous driving, fundamental research, finance | General machine learning and deep learning workloads |
| Key differentiator | High-performance heterogeneous computing foundation with end-to-end AI engineering capabilities | Broad compatibility with standard training workflows |
Purchasing options
Lingjun AI Computing Service and general computing resources are available through the following purchasing options:
| Option | Description | Billing model | Availability |
| --- | --- | --- | --- |
| Resource quota | Purchase Lingjun AI Computing Service or general computing resources in advance for AI development and training. This option enables flexible resource management and efficient use. | Subscription | Lingjun and general |
| Public resources | Use Lingjun AI Computing Service or general computing resources on demand when you submit a training job, without purchasing in advance. | Pay-as-you-go | Lingjun and general |
| Preemptible resources | Acquire AI computing power at a lower cost to reduce overall job expenses. | Pay-as-you-go (discounted) | Lingjun only |
Scenarios
Data preprocessing
Customize the runtime environment to run offline, parallel data preprocessing jobs. This is useful for cleaning, transforming, or augmenting large datasets before training.
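For example, a custom image might run a script like the following, which maps a cleaning function over records in parallel. `clean_record` is a hypothetical stand-in for your own transform; a thread pool keeps the sketch portable, while CPU-bound jobs would use processes or multiple DLC nodes:

```python
from concurrent.futures import ThreadPoolExecutor

# Offline parallel preprocessing sketch. clean_record is a hypothetical
# transform; replace it with real cleaning or augmentation logic.
def clean_record(record: str) -> str:
    return record.strip().lower()

def preprocess(records: list[str], workers: int = 4) -> list[str]:
    # Apply the transform to every record using a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_record, records))

print(preprocess(["  Hello ", "WORLD\n"]))  # → ['hello', 'world']
```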
Large-scale distributed training
Conduct offline, large-scale distributed training using various open-source deep learning frameworks. DLC supports training on thousands of nodes simultaneously, significantly shortening training time.
Offline inference
Use DLC to run offline inference on models. This approach improves GPU utilization during idle periods and reduces resource waste. Offline inference is well suited for batch prediction workloads where real-time latency is not required.
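The batch-prediction loop such a job runs can be sketched as follows; `model` here is a hypothetical stand-in for a real model call:

```python
# Offline batch inference sketch: score a dataset in fixed-size batches.
# With no real-time latency constraint, large batches keep the GPU busy.
def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def model(batch):                      # hypothetical stand-in model:
    return [len(x) for x in batch]     # returns one "score" per input

inputs = ["alpha", "beta", "gamma", "delta", "epsilon"]
predictions = []
for batch in batches(inputs, size=2):
    predictions.extend(model(batch))
print(predictions)  # → [5, 4, 5, 5, 7]
```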
Get started
- Create training tasks -- Learn how to submit training jobs through the console, an SDK, or the command line, and configure key parameters.
- DLC use cases -- Learn how to use DLC through practical examples.