Deep Learning Containers (DLC) provides a ready-to-use Kubernetes-based training environment. Start training jobs without manual cluster setup, with support for multiple frameworks, Lingjun AI Computing Service, and GPUs.
Benefits
-
Diverse computing resources:
Supports ECS, ECI, Shenlong Bare Metal, and Lingjun bare metal instances across Lingjun AI Computing Service and general computing resources, with hybrid scheduling of heterogeneous computing power.
-
Multiple distributed job types:
Supports more than 10 frameworks, including Megatron, DeepSpeed, PyTorch, TensorFlow, Slurm, Ray, MPI, and XGBoost, without requiring a separate cluster for each. Provides pre-built images, custom runtime environments, and job submission from the console, SDK, or CLI.
-
High stability:
For large model training, DLC provides AIMaster (fault tolerance), EasyCKPT (checkpoints), SanityCheck (health checks), and node self-healing. These features detect and recover from faults automatically, reducing the computing power wasted on failed or stalled jobs.
-
High performance:
A built-in acceleration framework improves training efficiency through data parallelism, pipeline parallelism, operator splitting, and nested parallel strategies. Additional features include automatic parallel strategy exploration, multi-dimensional memory optimization, topology-aware scheduling, and an optimized communication library with gradient group fusion and compression. Optimized for foundation model pre-training, continued training, and alignment.
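To make the "gradient group fusion" idea above concrete, here is a minimal sketch of the general technique: many small gradients are packed into a few larger groups so that each group can be sent in one communication call. This is an illustration only, not DLC's communication library. The function name `fuse_gradients`, the `max_group_bytes` parameter, and the layer names and sizes are all hypothetical.

```python
from typing import Dict, List


def fuse_gradients(grad_sizes: Dict[str, int], max_group_bytes: int) -> List[List[str]]:
    """Greedily pack gradients (name -> size in bytes) into groups.

    Each group would be sent in a single communication call, so fewer,
    larger groups mean fewer per-call latency costs.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in grad_sizes.items():
        # Close the current group if adding this gradient would exceed the cap.
        if current and current_bytes + size > max_group_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Hypothetical layer gradients, sizes in bytes.
sizes = {"embed": 400, "attn.q": 100, "attn.k": 100, "attn.v": 100, "mlp": 300}
print(fuse_gradients(sizes, max_group_bytes=512))
```

With these made-up sizes, five gradients collapse into two groups. A gradient larger than the cap still gets a group of its own.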
Resource types
Select a resource type when submitting a training job:
-
Lingjun AI Computing Service: Computing resources designed for large model training and ultra-large-scale deep learning tasks, such as autonomous driving and scientific research.
-
General computing resource: Suitable for standard training needs. Supports machine learning tasks of various scales and types.
Both resource types are available in the following ways:
-
Resource quota: Purchase a subscription to Lingjun AI Computing Service or general computing resources for flexible resource management.
-
Public resource: Use computing resources on demand without advance purchase. Billed on a pay-as-you-go basis.
-
Preemptible resource: Acquire Lingjun AI computing power at a lower cost with preemptible instances.
Use cases
-
Data preprocessing
Customize the runtime environment to preprocess data offline in parallel.
-
Large-scale distributed training
Run large-scale offline distributed training with open-source frameworks. Supports thousands of nodes simultaneously.
-
Offline inference
Run offline inference jobs on otherwise idle GPUs to improve utilization.
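For the parallel preprocessing and offline inference cases above, a common pattern is to have each worker process its own slice of the input. The sketch below assumes the launcher exports `RANK` and `WORLD_SIZE` environment variables, as `torchrun` does. Whether a given DLC job type sets these variables is an assumption to verify for your framework. The file names and the `shard_for_worker` helper are hypothetical.

```python
import os
from typing import List, Sequence


def shard_for_worker(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the round-robin slice of items owned by worker `rank`."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    return list(items[rank::world_size])


# Assumption: the launcher exports RANK and WORLD_SIZE (as torchrun does);
# fall back to a single worker when they are not set.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

files = [f"part-{i:03d}.jsonl" for i in range(10)]  # hypothetical input files
my_files = shard_for_worker(files, rank, world_size)
print(f"worker {rank}/{world_size} processes {len(my_files)} files")
```

Round-robin slicing keeps the shards disjoint and within one item of each other in size, so no coordination between workers is needed.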
Workflow
The typical DLC workflow:
-
Preparations
Prepare computing resources, an image, a dataset, and a code repository. For more information, see Preparations.
-
Create a training job
Submit training jobs from the console, SDK, or CLI. For more information, see Create a training job.
Available advanced features:
-
Automatic fault tolerance: Launches an AIMaster instance to monitor the job and automatically recover from failures.
-
Health check: Runs SanityCheck on resources before training and automatically isolates faulty nodes to reduce job startup failures.
-
EasyCKPT: Saves and recovers large PyTorch models without data loss and supports resumed training from checkpoints.
-
RDMA configuration: Configure an RDMA network for Lingjun AI Computing Service resources to accelerate inter-node communication in distributed training.
-
Storage configuration: Access training data in OSS, NAS, CPFS, or MaxCompute by configuring them in your code or mounting them as volumes.
-
SLS log forwarding: Forward DLC job logs to a specified Log Service (SLS) Logstore for custom analysis and monitoring.
-
Preemptible resource: Use a preemptible resource from Lingjun AI Computing Service to acquire AI computing power at a lower cost.
-
View and manage training jobs
After submitting a job, view the training job details to monitor its status. You can also stop, clone, share, or delete jobs. For more information, see Manage training jobs.
-
Monitor training jobs
Monitor training jobs in the following ways:
-
For a training job with a bound dataset, view the training job analysis report.
-
Use CloudMonitor or ARMS to view resource status or configure alert rules. For more information, see Monitor a training job by using CloudMonitor or ARMS.
-
Create message notification rules in the PAI workspace event center. For more information, see Configure message notifications.
-
Configure scheduled training jobs
For continuous training and model tuning with updated data or hyperparameters, configure offline scheduling to submit DLC jobs periodically.
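Automatic fault tolerance and resumed training, listed among the advanced features in this workflow, both depend on the training script being able to pick up from a saved checkpoint after a restart. The following is a generic sketch of that pattern using only the Python standard library. It does not use the EasyCKPT or AIMaster APIs, and the file location, the `save_every` interval, and the toy "loss" update are hypothetical stand-ins.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str):
    """Return (step, state), or (0, initial state) when no checkpoint exists."""
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path: str, total_steps: int, save_every: int = 2) -> int:
    """Toy training loop that resumes from the last saved step."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real training step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")  # hypothetical location
train(ckpt, total_steps=4)        # first run ends after step 4
print(load_checkpoint(ckpt)[0])   # step recorded in the checkpoint
train(ckpt, total_steps=6)        # restarted run continues from step 4
print(load_checkpoint(ckpt)[0])
```

Writing to a temporary file and renaming it means a job interrupted in the middle of a save still finds the previous complete checkpoint on restart. In a real job the checkpoint path would typically point at mounted persistent storage rather than a temporary directory.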
Explore DLC tutorials for additional use cases.
Related topics
-
Create a training job: Submit training jobs from the console, an SDK, or the CLI, and configure key parameters.
-
DLC use cases: Practical examples for using DLC.