DLC concepts, benefits, use cases, and workflow - Platform For AI

Benefits

Diverse computing resources:

Supports ECS, ECI, Shenlong Bare Metal, and Lingjun bare metal instances across Lingjun AI Computing Service and general computing resources, with hybrid scheduling of heterogeneous computing power.
Multiple distributed job types:

Supports more than 10 frameworks, including Megatron, DeepSpeed, PyTorch, TensorFlow, Slurm, Ray, MPI, and XGBoost, without requiring a separate cluster for each. Provides pre-built images, custom runtime environments, and job submission from the console, SDK, or CLI.
High stability:

For large model training, DLC provides AIMaster (fault tolerance), EasyCKPT (checkpoints), SanityCheck (health checks), and node self-healing. These detect and recover from faults automatically, reducing computing power loss.
High performance:

A built-in acceleration framework improves training efficiency through data parallelism, pipeline parallelism, operator splitting, and nested parallel strategies. Additional features include automatic parallel strategy exploration, multi-dimensional memory optimization, topology-aware scheduling, and an optimized communication library with gradient group fusion and compression. The framework is optimized for foundation model pre-training, continued training, and alignment.
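To make "gradient group fusion" more concrete, the sketch below shows the general idea in plain Python: small gradient tensors are grouped into buckets up to a size cap so that each bucket can be communicated in one collective call instead of one call per tensor. This is a generic illustration, not DLC's implementation. The function name `fuse_gradients`, the `bucket_cap` parameter, and the example sizes are all hypothetical.

```python
def fuse_gradients(grad_sizes, bucket_cap):
    """Group (name, size) gradient entries into buckets of at most bucket_cap.

    A single entry larger than bucket_cap still gets its own bucket.
    """
    buckets, current, current_size = [], [], 0
    for name, size in grad_sizes:
        # Close the current bucket if adding this gradient would exceed the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical gradient sizes (in arbitrary units) in the order they become ready.
grads = [("w1", 4), ("b1", 1), ("w2", 6), ("b2", 1), ("w3", 3)]
print(fuse_gradients(grads, bucket_cap=8))
```

With a cap of 8, the five gradients above are sent as three fused groups rather than five separate messages. Real communication libraries also overlap these transfers with the backward pass, which this sketch does not model.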

Resource types

Select a resource type when submitting a training job:

Lingjun AI Computing Service: Computing resources designed for large model training and ultra-large-scale deep learning tasks, such as autonomous driving and scientific research.
General computing resource: Suitable for standard training needs. Supports machine learning tasks of various scales and types.

Both resource types are available in the following ways:

Resource quota: Purchase a subscription to Lingjun AI Computing Service or general computing resources for flexible resource management.
Public resource: Use computing resources on demand without advance purchase. Billed on a pay-as-you-go basis.
Preemptible resource: Acquire Lingjun AI computing power at a lower cost with preemptible instances.

Use cases

Data preprocessing

Customize runtime environments for parallel data preprocessing offline.
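A common pattern for parallel preprocessing is to give each worker a disjoint shard of the input. The sketch below assumes the launcher sets `RANK` and `WORLD_SIZE` environment variables, as PyTorch-style launchers typically do; check the variables your job type actually exposes. The `preprocess` function and the sample records are placeholders.

```python
import os


def shard_for_worker(items, rank, world_size):
    """Return the slice of items that worker `rank` should process."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Strided slicing gives each worker a disjoint, near-equal share.
    return items[rank::world_size]


def preprocess(record):
    # Placeholder per-record transform; replace with real logic.
    return record.strip().lower()


def run_worker(items):
    # RANK and WORLD_SIZE are assumed to be set by the launcher;
    # fall back to single-worker mode when they are absent.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return [preprocess(r) for r in shard_for_worker(items, rank, world_size)]
```

Each worker runs the same script and writes its own output, so no coordination between workers is needed beyond agreeing on the input order.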
Large-scale distributed training

Run large-scale offline distributed training with open-source frameworks. Supports thousands of nodes simultaneously.
Offline inference

Run offline inference jobs on otherwise idle GPUs to improve GPU utilization.

Workflow

The typical DLC workflow:

Preparations

Prepare computing resources, an image, a dataset, and a code repository. For details, see Preparations.
Create a training job
Submit training jobs from the console, SDK, or CLI. For details, see Create a training job.
Available advanced features:
- Automatic fault tolerance: Launches an AIMaster instance to monitor the job and automatically recover from failures.
- Health check: Runs SanityCheck on resources before training and automatically isolates faulty nodes to reduce job startup failures.
- EasyCKPT: Saves and recovers large PyTorch models without data loss and supports resumed training from checkpoints.
- RDMA configuration: Configure an RDMA network for Lingjun AI Computing Service resources to accelerate inter-node communication in distributed training.
- Storage configuration: Access training data in OSS, NAS, CPFS, or MaxCompute by configuring them in your code or mounting them as volumes.
- SLS log forwarding: Forward DLC job logs to a specified Log Service (SLS) Logstore for custom analysis and monitoring.
- Preemptible resource: Use a preemptible resource from Lingjun AI Computing Service to acquire AI computing power at a lower cost.
- Improve public network access speed: By default, DLC uses a shared gateway with limited bandwidth to access the public internet. You can create a dedicated gateway to increase network upload and download speeds.
- PerfTracker: If your job encounters performance issues, use PerfTracker, an online performance analysis and diagnostic tool. It generates an analysis report and automatically diagnoses the cause of performance degradation.
- ACCL: ACCL is a collective communication library built on NCCL. It delivers higher communication performance for your jobs and includes capabilities for fault diagnosis and self-healing.
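The fault tolerance and checkpoint features in the list above rely on a simple contract: training code saves its state periodically and, on restart, continues from the last saved step. The sketch below illustrates that contract with the Python standard library only. It does not use the AIMaster or EasyCKPT APIs; the file format, the `train` function, and the `fail_at` parameter (which simulates a node failure) are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state when no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every=2, fail_at=None):
    """Run a toy training loop that resumes from the checkpoint at `path`."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["weight"] += 0.5  # stand-in for one optimizer update
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job is restarted after a failure, calling `train` again with the same path picks up from the last saved step; only the work done since that checkpoint is repeated. Saving more often shrinks that repeated work at the cost of more I/O.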
View and manage training jobs

After submitting a job, view the training job details to monitor its status. You can also stop, clone, share, or delete jobs. For details, see Manage training jobs.
Monitor training jobs

Monitor training jobs in the following ways:
- For a training job with a bound dataset, view the training job analysis report.
- Use CloudMonitor or ARMS to view resource status or configure alert rules. For details, see Monitor a training job by using CloudMonitor or ARMS.
- Create message notification rules in the PAI workspace event center. For details, see Configure message notifications.
Configure scheduled training jobs

For continuous training and model tuning with updated data or hyperparameters, configure offline scheduling to submit DLC jobs periodically.

Explore DLC tutorials for additional use cases.
