
Platform for AI: DLC overview

Last Updated: Mar 11, 2026

Deep Learning Containers (DLC) is a managed training service within PAI that lets you quickly create single-node or distributed training jobs. DLC uses Kubernetes to launch compute nodes, so you do not need to manually provision machines or configure runtime environments. DLC integrates with existing workflows without disruption, supports multiple deep learning frameworks, and offers flexible resource configuration options, making it ideal for rapid training job deployment.

How DLC works

When you submit a training job, DLC handles the infrastructure so you can focus on your model code:

  1. Submit a job -- Define your training job through the console, an SDK, or the command line. Specify the framework, resource type, and runtime environment.

  2. Provision resources -- DLC allocates compute nodes from Lingjun AI Computing Service or general computing resources based on your job configuration.

  3. Launch containers -- Kubernetes launches containers on the provisioned nodes using an official image or a custom runtime environment.

  4. Run training -- Training code executes. For distributed jobs, DLC coordinates across nodes using the selected framework.

  5. Monitor and recover -- Built-in fault tolerance and health check capabilities detect issues and trigger automatic recovery.

  6. Output results -- When training completes, logs and results are available for review.
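The job definition in step 1 can be sketched as a small spec object. This is a minimal sketch with illustrative field names (framework, image, command, node_count, resource_type); it is not the actual DLC SDK schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the fields a DLC job submission carries.
# Field names are illustrative, not the real DLC SDK schema.
@dataclass
class TrainingJobSpec:
    name: str
    framework: str                  # e.g. "PyTorch", "TensorFlow"
    image: str                      # official or custom container image
    command: str                    # entry point executed in each container
    node_count: int = 1             # >1 makes the job distributed
    resource_type: str = "general"  # "general" or "lingjun"

    def validate(self) -> None:
        if self.node_count < 1:
            raise ValueError("node_count must be at least 1")
        if self.resource_type not in ("general", "lingjun"):
            raise ValueError(f"unknown resource type: {self.resource_type}")

    @property
    def is_distributed(self) -> bool:
        return self.node_count > 1

job = TrainingJobSpec(
    name="llm-pretrain",
    framework="PyTorch",
    image="registry.example.com/pytorch:2.1",
    command="torchrun train.py",
    node_count=4,
    resource_type="lingjun",
)
job.validate()
print(job.is_distributed)  # True
```

Whether submitted through the console, an SDK, or the command line, a job carries the same core information: what to run, where to run it, and on how many nodes.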

Key capabilities

Run training at any scale

DLC is built on Lingjun AI Computing Service and general computing resources, giving you access to a range of compute instance types:

  • Elastic Compute Service (ECS)

  • Elastic Container Instance (ECI)

  • Shenlong Bare Metal Instances

  • Lingjun bare metal instances

This combination enables hybrid scheduling of heterogeneous computing, so you can match the right hardware to each workload.

Use your preferred frameworks

DLC supports over ten training frameworks, with no cluster to build or maintain, including:

  • Megatron

  • DeepSpeed

  • PyTorch

  • TensorFlow

  • Slurm

  • Ray

  • MPI

  • XGBoost

DLC provides various official images and supports custom runtime environments. Submit jobs through the console, an SDK, or the command line.

Train reliably at scale

For LLM training, DLC includes proprietary reliability features that provide rapid detection, precise diagnostics, and fast feedback:

  • AIMaster -- A proprietary fault tolerance engine that detects and recovers from failures automatically.

  • EasyCKPT -- A high-performance checkpointing framework that saves training state efficiently.

  • SanityCheck -- A health check feature that validates node readiness in the early stages of a training job.

  • Node self-healing -- Detects and handles unhealthy nodes to keep training running.

Together, these features resolve stability issues quickly, reduce wasted compute, and improve overall training reliability.
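EasyCKPT itself is proprietary, but the general idea of periodic checkpointing with resume can be sketched generically. The file layout and function names below are illustrative assumptions, not the EasyCKPT API.

```python
import os
import pickle
import tempfile
from typing import Optional

# Generic periodic-checkpoint sketch (not the EasyCKPT API; names are
# illustrative). State is written to a temp file first, then renamed,
# so a crash mid-write never corrupts the last good checkpoint.
def save_checkpoint(state: dict, path: str) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: old checkpoint survives a crash

def load_checkpoint(path: str) -> Optional[dict]:
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
state = {"step": 0, "loss": None}
for step in range(1, 11):
    state = {"step": step, "loss": 1.0 / step}
    if step % 5 == 0:                  # checkpoint every 5 steps
        save_checkpoint(state, ckpt_path)

resumed = load_checkpoint(ckpt_path)   # what a restarted job would load
print(resumed["step"])  # 10
```

After a node failure, a recovered job reloads the last saved state and continues from that step instead of restarting from scratch.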

Accelerate training performance

A proprietary AI training acceleration framework improves distributed training efficiency through multiple optimization layers:

  • Parallel strategies -- Data parallelism, pipeline parallelism, operator splitting, and nested parallelism with automatic parallel strategy exploration.

  • Memory optimization -- Multi-dimensional memory optimization reduces GPU memory pressure.

  • Network-aware scheduling -- Topology-aware scheduling over high-speed networks places workloads for optimal communication patterns.

  • Communication optimizations -- The distributed communication library includes communication thread pools, gradient grouping, mixed-precision communication, and gradient compression.

These optimizations are especially effective for large model pre-training, continuous training, and alignment.
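Gradient grouping, one of the communication optimizations above, packs many small gradient tensors into fewer, larger communication calls. A minimal sketch in plain Python, with element counts standing in for tensors and a hypothetical bucket-size threshold:

```python
# Sketch of gradient grouping ("bucketing"): many small gradient tensors
# are packed into fixed-size buckets so the communication library issues
# a few large all-reduce calls instead of one call per tensor.
# Plain element counts stand in for tensors; names are illustrative.
def group_gradients(grad_sizes, bucket_bytes, elem_bytes=4):
    buckets, current, current_bytes = [], [], 0
    for idx, n_elems in enumerate(grad_sizes):
        size = n_elems * elem_bytes
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)          # flush the full bucket
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# Six gradients of varying element counts, 4-byte floats, 4 KB buckets.
sizes = [500, 300, 200, 900, 100, 50]
buckets = group_gradients(sizes, bucket_bytes=4096)
print(buckets)  # [[0, 1, 2], [3, 4], [5]]
```

Three communication calls replace six, which matters at scale because each collective call carries fixed latency regardless of payload size.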

Resource types

PAI offers two resource types based on use case and computing power requirements.

Lingjun AI Computing Service

  • Best for -- Large model training and deep learning tasks that require massive computing resources.

  • Architecture -- Hardware-software co-optimization for ultra-large-scale deep learning and integrated AI computing.

  • Core strengths -- High performance, high efficiency, and high utilization.

  • Typical use cases -- Large model training, autonomous driving, fundamental research, and finance.

  • Key differentiator -- High-performance heterogeneous computing foundation with end-to-end AI engineering capabilities.

General computing resources

  • Best for -- Standard training needs across various scales and types.

  • Architecture -- Standard cloud compute infrastructure.

  • Core strengths -- Flexibility across machine learning tasks of various scales and types.

  • Typical use cases -- General machine learning and deep learning workloads.

  • Key differentiator -- Broad compatibility with standard training workflows.

Purchasing options

Lingjun AI Computing Service and general computing resources are available through the following purchasing options:

  • Resource quota -- Purchase Lingjun AI Computing Service or general computing resources in advance for AI development and training. This option enables flexible resource management and efficient use. Billing model: subscription. Available for: Lingjun and general computing resources.

  • Public resources -- Use Lingjun AI Computing Service or general computing resources on demand when you submit a training job, without purchasing in advance. Billing model: pay-as-you-go. Available for: Lingjun and general computing resources.

  • Preemptible resources -- Acquire AI computing power at a lower cost to reduce overall job expenses. Billing model: pay-as-you-go (discounted). Available for: Lingjun only.

Scenarios

Data preprocessing

Customize runtime environments to run offline, parallel preprocessing over your data, which significantly simplifies the preprocessing pipeline. Useful when cleaning, transforming, or augmenting large datasets before training.
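The parallel-preprocessing pattern can be sketched with a worker pool standing in for DLC's parallel nodes. The `clean_record` transform is a hypothetical example, not part of any DLC API.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of offline parallel preprocessing: a worker pool stands in for
# DLC's parallel nodes. clean_record is a hypothetical transform.
def clean_record(record: str) -> str:
    return record.strip().lower()

def preprocess(records, workers=4):
    # Each worker cleans a subset of records concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_record, records))

raw = ["  Hello ", "WORLD", "  PAI DLC  "]
print(preprocess(raw))  # ['hello', 'world', 'pai dlc']
```

In a real job the transform would be CPU- or I/O-heavy (decoding, tokenization, augmentation), and the pool would be replaced by DLC worker nodes each handling a shard of the dataset.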

Large-scale distributed training

Conduct offline, large-scale distributed training using various open-source deep learning frameworks. DLC supports training on thousands of nodes simultaneously, significantly shortening training time.
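The core of data-parallel distributed training is that each node computes a gradient on its own data shard, gradients are averaged across nodes (the role an all-reduce collective plays), and every node applies the same update. A single-process sketch with a toy linear model; all names here are illustrative:

```python
# Single-process sketch of data-parallel training: each simulated node
# computes a gradient on its own data shard, gradients are averaged
# (the role of an all-reduce across real nodes), and every node applies
# the same update. The toy model y = weight * x is illustrative.
def local_gradient(weight, shard):
    # gradient of mean squared error for the toy model y = weight * x
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    return sum(values) / len(values)

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # y = 2x
weight = 0.0
for _ in range(50):
    grads = [local_gradient(weight, s) for s in shards]  # per-node step
    weight -= 0.05 * allreduce_mean(grads)               # synced update
print(round(weight, 3))  # 2.0
```

Because every node sees the same averaged gradient, all replicas stay in lockstep; DLC and the chosen framework handle the real cross-node communication this sketch only simulates.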

Offline inference

Use DLC to run offline inference on models. This approach improves GPU utilization during idle periods and reduces resource waste. Offline inference is well suited for batch prediction workloads where real-time latency is not required.
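Offline inference typically groups inputs into large batches so the accelerator stays busy. A minimal batching sketch; `predict_batch` is a hypothetical stand-in for a model forward pass:

```python
# Minimal offline batch-inference sketch: inputs are grouped into
# batches so the accelerator (here a stand-in function) stays busy.
# predict_batch is a hypothetical model call, not a DLC API.
def predict_batch(batch):
    return [x * 2 for x in batch]  # stand-in for a model forward pass

def batched(items, batch_size):
    # Yield consecutive fixed-size slices of the input list.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

inputs = list(range(10))
outputs = []
for batch in batched(inputs, batch_size=4):
    outputs.extend(predict_batch(batch))
print(outputs)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because no caller is waiting on each result, batch size can be tuned purely for throughput rather than latency.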

Get started

  • Create training tasks -- Learn how to submit training jobs through the console, an SDK, or the command line, and configure key parameters.

  • DLC use cases -- Learn how to use DLC through practical examples.