Deep Learning Containers (DLC) lets you run training jobs on idle quota capacity from your cluster, improving overall resource utilization without disrupting regular business operations.
An idle resource job can be terminated at any time, because the resources it borrows are reclaimed whenever their owning quota needs them. Before enabling idle resources, make sure your training code implements checkpointing. Use EasyCkpt to add automatic checkpoint-and-resume support.
How it works
In large-scale clusters, quota utilization is uneven: some teams exhaust their quotas while others leave resources idle. DLC's idle resource feature lets a training job borrow unused capacity from the current quota or other quotas in the cluster.
The table below summarizes how idle resource jobs differ from regular quota jobs:
| Dimension | Regular quota job | Idle resource job |
|---|---|---|
| Resource source | Allocated quota only | Idle capacity from any quota |
| Quota constraint | Bound by quota limits | Not bound by quota limits |
| Termination risk | None (quota is reserved) | Job is terminated when borrowed resources are reclaimed |
| Fault tolerance | Standard | Requires checkpointing or AIMaster |
Reclamation trigger: When a regular job from the quota group that owns the borrowed resources is dequeued but cannot be scheduled due to insufficient capacity, the system reclaims those resources. The pod status of the idle resource job then changes to Preempted.
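The reclamation rule can be illustrated with a small model. The following is a minimal sketch, not the DLC scheduler: the `Quota` class, the `schedule_regular_job` function, the GPU counts, and the order in which idle jobs are preempted are all assumptions made for illustration. It shows only the behavior described above: borrowed capacity is left alone while the owner's regular job still fits, and is reclaimed once it does not.

```python
from dataclasses import dataclass, field


@dataclass
class Quota:
    """Hypothetical model of a resource quota; not a DLC API type."""
    name: str
    capacity: int                              # total GPUs owned by this quota
    used_by_owner: int = 0                     # GPUs used by the owner's regular jobs
    lent: dict = field(default_factory=dict)   # idle job name -> GPUs borrowed

    def free(self) -> int:
        """GPUs that are neither used by the owner nor lent to idle jobs."""
        return self.capacity - self.used_by_owner - sum(self.lent.values())


def schedule_regular_job(quota: Quota, gpus_needed: int) -> list:
    """Place a regular job, reclaiming borrowed GPUs only when needed.

    Returns the names of the idle resource jobs that were preempted.
    """
    if gpus_needed > quota.capacity - quota.used_by_owner:
        raise ValueError("request exceeds the quota even after reclamation")
    preempted = []
    # Assumed policy: reclaim from idle jobs one at a time until the job fits.
    for job_name in list(quota.lent):
        if quota.free() >= gpus_needed:
            break
        del quota.lent[job_name]
        preempted.append(job_name)
    quota.used_by_owner += gpus_needed
    return preempted


team_a = Quota("team-a", capacity=8, used_by_owner=2, lent={"idle-job-1": 4})
print(schedule_regular_job(team_a, 2))  # [] (fits in the 2 free GPUs)
print(schedule_regular_job(team_a, 4))  # ['idle-job-1'] (borrowed GPUs reclaimed)
```

In this model the first regular job starts without affecting the idle resource job. The second one cannot be scheduled until the four borrowed GPUs are returned, so the idle resource job is preempted.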
Prerequisites
Before you begin, make sure that you have:
A subscription resource quota created and associated with your workspace. The quota can be general computing resources or Lingjun resources. For details, see Overview.
Submit a DLC job with idle resources
On the DLC job submission page, go to the Resource Information section and configure the following parameters. For general submission steps, see Submit training jobs.
| Parameter | Description |
|---|---|
| Resource quota | Select a general computing resource quota or a Lingjun resources quota. To run high-performance AI training and computing, use Lingjun resources. Lingjun resources are available only in the China (Ulanqab) and Singapore regions. |
| Idle resources | Choose how the job uses resources. Acceptable: the job may use idle computing resources or resources from the associated quota. Only idle resources: the job runs exclusively on idle capacity and never uses the associated quota. |
| Automatic fault tolerance | Enable this to let AIMaster automatically reallocate resources when idle capacity is reclaimed, resuming the job without manual intervention. See AIMaster: Elastic fault tolerance engine. |
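The difference between the two Idle resources options can be sketched as follows. This is an illustrative model only: `IdleMode` and `pools_that_fit` are hypothetical names, not DLC API values, and the sketch assumes that a job is placed entirely in one pool rather than split across the quota and idle capacity.

```python
from enum import Enum


class IdleMode(Enum):
    """Hypothetical names for the two console options; not DLC API values."""
    ACCEPTABLE = "Acceptable"
    ONLY_IDLE = "Only idle resources"


def pools_that_fit(mode: IdleMode, gpus_needed: int,
                   quota_free: int, idle_free: int) -> list:
    """List the pools the job is allowed to use and that can hold it."""
    pools = []
    # Only the Acceptable option may draw on the associated quota.
    if mode is IdleMode.ACCEPTABLE and quota_free >= gpus_needed:
        pools.append("quota")
    # Both options may run on idle capacity.
    if idle_free >= gpus_needed:
        pools.append("idle")
    return pools


print(pools_that_fit(IdleMode.ACCEPTABLE, 4, quota_free=8, idle_free=0))  # ['quota']
print(pools_that_fit(IdleMode.ONLY_IDLE, 4, quota_free=8, idle_free=0))   # []
```

The second call returns an empty list: with Only idle resources selected, the job in this model has nowhere to run while no idle capacity is available, even though the associated quota has room.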

Monitor resource usage from the DLC job list or the job details page. Each job shows whether it is running on quota resources or idle resources, using one of the following statuses:
In quota: The job is using resources within the associated quota.
Not in quota: The job is running on borrowed idle resources.
Preempted: The job's pod was evicted because the borrowed idle resources were reclaimed.

Reduce the impact of preemption
Idle resource jobs can be preempted at any point. Use the following tools together to minimize training loss and enable automatic recovery:
EasyCkpt: Saves training checkpoints automatically, so the job can resume from the last checkpoint instead of restarting from scratch. See Use EasyCkpt to save and resume foundation model trainings.
AIMaster (Automatic fault tolerance): When idle resources are reclaimed, AIMaster reallocates resources and resumes the job automatically, so you do not have to restart it manually. See AIMaster: Elastic fault tolerance engine.
These two tools are complementary: EasyCkpt preserves training progress at the checkpoint level, while AIMaster handles resource reallocation and job continuation at the infrastructure level. Enable both for the most resilient idle resource jobs.
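To show what checkpoint-and-resume means in practice, here is a minimal sketch in plain Python. It does not use the EasyCkpt API and is not how EasyCkpt is implemented; the function names, the JSON state format, and the `stop_after` parameter (which stands in for a preemption) are assumptions made for illustration. The checkpoint is written to a temporary file and then renamed, so a preemption in the middle of a save cannot leave a corrupt file behind.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state atomically: temporary file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when both paths are on one filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, save_every: int = 10,
          stop_after: Optional[int] = None) -> dict:
    """Run from the last checkpoint; stop_after simulates a preemption."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_after is not None and step >= stop_after:
            break  # stands in for the pod being preempted
        state["loss_sum"] += 1.0 / (step + 1)  # placeholder for a training step
        state["step"] = step + 1
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    train(ckpt, total_steps=50, stop_after=25)   # interrupted at step 25
    print(load_checkpoint(ckpt)["step"])         # 20 (last saved checkpoint)
    print(train(ckpt, total_steps=50)["step"])   # 50 (resumed from step 20)
```

The interrupted run loses only the five steps completed after the last save. A shorter `save_every` interval reduces that loss at the cost of more frequent writes.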