Platform For AI: Use idle resources

Last Updated: Apr 01, 2026

Deep Learning Containers (DLC) lets you run training jobs on idle quota capacity from your cluster, improving overall resource utilization without disrupting regular business operations.

Important

Idle resource jobs can be terminated at any time when borrowed resources are reclaimed. Before enabling idle resources, make sure your training code implements checkpointing. Use EasyCkpt to add automatic checkpoint-and-resume support.
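The checkpoint-and-resume pattern this note asks for can be sketched in plain Python. This is a minimal illustration, not the EasyCkpt API: the function names, the JSON file format, and the `stop_at` parameter (which simulates termination) are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a termination mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10, stop_at=None):
    """Toy training loop; stop_at (hypothetical) simulates the job being terminated."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated termination: work since the last save is lost
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}  # placeholder for real training
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

A job interrupted partway through resumes from the last saved step when `train` is called again with the same path, so at most `checkpoint_every` steps of work are repeated.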

How it works

In large-scale clusters, quota utilization is uneven: some teams exhaust their quotas while others leave resources idle. DLC's idle resource feature lets a training job borrow unused capacity from the current quota or other quotas in the cluster.

The table below summarizes how idle resource jobs differ from regular quota jobs:

| Dimension | Regular quota job | Idle resource job |
| --- | --- | --- |
| Resource source | Allocated quota only | Idle capacity from any quota |
| Quota constraint | Bound by quota limits | Not bound by quota limits |
| Termination risk | None (quota is reserved) | Job is terminated when borrowed resources are reclaimed |
| Fault tolerance | Standard | Requires checkpointing or AIMaster |

Reclamation trigger: When a regular job from the quota group that owns the borrowed resources is dequeued but cannot be scheduled due to insufficient capacity, the system reclaims those resources. The idle resource job's pod status then changes to Preempted.
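The reclamation trigger can be illustrated with a toy model. This is a sketch under stated assumptions, not the DLC scheduler: the `Job` and `Quota` classes, the GPU-count accounting, and the order in which borrowers are preempted are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    gpus: int
    owner_quota: str          # quota the job was submitted under
    borrowed_from: str = ""   # quota whose idle capacity the job runs on, if any
    status: str = "Running"


@dataclass
class Quota:
    name: str
    capacity: int
    jobs: list = field(default_factory=list)  # jobs occupying this quota's capacity

    def free(self):
        return self.capacity - sum(j.gpus for j in self.jobs if j.status == "Running")


def schedule_regular_job(quota, job):
    """Place a regular job, reclaiming borrowed capacity only if it does not fit."""
    borrowers = [j for j in quota.jobs
                 if j.borrowed_from == quota.name and j.status == "Running"]
    reclaimable = sum(j.gpus for j in borrowers)
    if quota.free() + reclaimable < job.gpus:
        return False  # assumption: do not preempt if reclaiming would not help
    for borrower in borrowers:
        if quota.free() >= job.gpus:
            break  # enough capacity now; stop preempting
        borrower.status = "Preempted"
    quota.jobs.append(job)
    return True
```

In this model an idle resource job keeps running as long as the owning quota's regular jobs fit in the remaining capacity, and is marked Preempted only when a regular job would otherwise be unschedulable.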

Prerequisites

Before you begin, make sure that you have:

  • A subscription resource quota created and associated with your workspace. The quota can be general computing resources or Lingjun resources. For details, see Overview.

Submit a DLC job with idle resources

  1. On the DLC job submission page, go to the Resource Information section and configure the following parameters. For general submission steps, see Submit training jobs.

    | Parameter | Description |
    | --- | --- |
    | Resource quota | Select a general computing resource quota or a Lingjun resources quota. To run high-performance AI training and computing, use Lingjun resources. Lingjun resources are available only in the China (Ulanqab) and Singapore regions. |
    | Idle resources | Choose how the job uses resources. Acceptable: the job may use idle computing resources or resources from the associated quota. Only idle resources: the job runs exclusively on idle capacity and never uses the associated quota. |
    | Automatic fault tolerance | Enable this to let AIMaster automatically reallocate resources when idle capacity is reclaimed, resuming the job without manual intervention. See AIMaster: Elastic fault tolerance engine. |


  2. Monitor resource usage from the DLC job list or the job details page. Each job shows whether it is running on quota resources or idle resources.

    • In quota: The job is using resources within the associated quota.

    • Not in quota: The job is running on borrowed idle resources.

    • Preempted: The job's pod was evicted because the borrowed idle resources were reclaimed.

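The relationship between the Idle resources option and the resource label a job shows can be sketched as a small decision function. This is an illustration only: `IdlePolicy`, `place_job`, and the capacity arguments are hypothetical names, and the rule that Acceptable tries quota capacity before idle capacity is an assumption that the description above does not confirm.

```python
from enum import Enum


class IdlePolicy(Enum):
    ACCEPTABLE = "Acceptable"
    ONLY_IDLE = "Only idle resources"


def place_job(policy, requested, quota_free, idle_free):
    """Return the resource label a job would show, or None if it must wait."""
    if policy is IdlePolicy.ACCEPTABLE and quota_free >= requested:
        return "In quota"  # assumption: quota capacity is tried before idle capacity
    if idle_free >= requested:
        return "Not in quota"  # running on borrowed idle resources
    return None  # neither source can satisfy the request right now


# A job set to "Only idle resources" never lands in the associated quota,
# even when the quota has room.
print(place_job(IdlePolicy.ONLY_IDLE, 4, quota_free=8, idle_free=0))
```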

Reduce the impact of preemption

Idle resource jobs can be preempted at any point. Use the following tools together to minimize training loss and enable automatic recovery:

  • EasyCkpt: preserves training progress at the checkpoint level.

  • AIMaster: handles resource reallocation and job continuation at the infrastructure level.

The two tools are complementary. Enable both for the most resilient idle resource jobs.
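If training code manages its own shutdown, one common pattern is to react to a termination signal and save state before exiting. The sketch below assumes the preempted pod receives SIGTERM with some grace period, as is typical for Kubernetes pods; this page does not confirm that DLC behaves this way. `PreemptionGuard`, `run`, and the `save` callback are hypothetical names, not part of EasyCkpt or AIMaster.

```python
import signal


class PreemptionGuard:
    """Records a stop request so the training loop can exit at a safe point."""

    def __init__(self):
        self.stop_requested = False

    def install(self, signals=(signal.SIGTERM,)):
        # signal.signal must be called from the main thread.
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; do the actual saving in the main loop.
        self.stop_requested = True


def run(total_steps, guard, save):
    """Run steps until done or until the guard reports a stop request."""
    step = 0
    while step < total_steps and not guard.stop_requested:
        step += 1  # placeholder for one training step
    save(step)  # final save, whether finished or interrupted
    return step
```

In a real job, `guard.install()` would be called once at startup and `save` would write a checkpoint. Keeping the handler down to a single flag assignment avoids doing file I/O inside a signal handler.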