Unified inference and training resource management - Platform For AI

Training-serving colocation runs inference services and training jobs on the same GPU cluster. With parent-child quota preemption, inference services automatically reclaim GPUs from training jobs when they need more capacity. Combined with EAS (Elastic Algorithm Service) scheduled auto scaling and DLC (Deep Learning Containers) idle compute resources, the cluster prioritizes inference during the day and runs training at night, so GPUs stay busy around the clock.

Background

Example scenario

Assume you have a 128-GPU cluster shared by three teams:

Team A runs inference services and has the highest resource priority.
Teams B and C run model training with lower priority than inference.
When Team A needs more inference resources, the system automatically reclaims training resources from Teams B and C.
During the day, EAS scales up to handle inference traffic. At night, EAS scales down to release GPUs, and training jobs start automatically.
Teams B and C manage their resources and jobs independently without interfering with each other.

How it works

EAS inference services are deployed on the parent quota, while DLC training jobs are deployed on sub-quotas. When inference services need more resources, the system automatically preempts training compute. EAS scheduled auto scaling (scale up during the day, scale down at night) and DLC idle compute resources (use spare compute for training at night) then shift GPUs between inference and training without any manual steps.

To implement this scenario:

Create Quota 1 with 128 GPUs and enable the child-level compute preemption switch. Under Quota 1, create two sub-quotas: Quota 1.1 (48 GPUs) and Quota 1.2 (80 GPUs).
Create workspace_a for Team A and bind it to Quota 1. Deploy EAS inference services on Quota 1 and configure scheduled auto scaling.
Create workspace_b for Team B and bind it to Quota 1.1. Create DLC training jobs on Quota 1.1 with idle compute resources enabled.
Create workspace_c for Team C and bind it to Quota 1.2. Create DSW (Data Science Workshop) instances on Quota 1.2 for development.
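
The layout above can be sketched as a small data model to sanity-check the numbers before creating anything in the console. This is only an illustration: the `Quota` class, the `validate` helper, and the `bindings` dictionary are hypothetical names, not PAI API objects, and the rule that sub-quotas must fit inside the parent is an assumption based on the 48 + 80 = 128 split used in this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Quota:
    """Hypothetical in-memory model of a resource quota (not a PAI API object)."""

    name: str
    gpus: int
    child_preemption: bool = False
    children: List["Quota"] = field(default_factory=list)

    def allocated_to_children(self) -> int:
        """Total GPUs assigned to the direct sub-quotas."""
        return sum(child.gpus for child in self.children)


def validate(parent: Quota) -> None:
    """Raise ValueError if sub-quotas claim more GPUs than the parent holds.

    Assumption: sub-quotas are carved out of the parent's capacity.
    """
    allocated = parent.allocated_to_children()
    if allocated > parent.gpus:
        raise ValueError(
            f"{parent.name}: sub-quotas use {allocated} GPUs, "
            f"parent has {parent.gpus}"
        )


# The example scenario: one parent quota and two sub-quotas.
quota_1 = Quota(
    "Quota 1",
    gpus=128,
    child_preemption=True,
    children=[Quota("Quota 1.1", gpus=48), Quota("Quota 1.2", gpus=80)],
)
validate(quota_1)

# Workspace-to-quota bindings from the steps above.
bindings = {
    "workspace_a": "Quota 1",
    "workspace_b": "Quota 1.1",
    "workspace_c": "Quota 1.2",
}
```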

Procedure

Prepare AI computing resources (general-purpose compute resources or Lingjun AI computing resources). General-purpose resource pools must be version 2.0 to support EAS, DLC, and DSW simultaneously. For more information, see Resource pool overview.
Add Resource Quota.
1. Create Quota 1 with the following key parameters. For more information, see Create a resource quota or General computing resource quotas.
  - Select the resources (128 GPUs).
  - Enable the Child-level Preemption switch. When enabled, EAS inference services on the parent quota can preempt training resources on sub-quotas.
2. In the Actions column of Quota 1, click New Child-level Resource Quota to create two sub-quotas. For more information, see Create parent-child quotas.
  - Quota 1.1: 48 GPUs.
  - Quota 1.2: 80 GPUs.
Create three workspaces and bind each to its corresponding quota. For more information, see Create and manage a workspace.
- Team A: workspace_a, bound to Quota 1.
- Team B: workspace_b, bound to Quota 1.1.
- Team C: workspace_c, bound to Quota 1.2.
Create an EAS inference service on Quota 1 and configure scheduled auto scaling. For more information, see Service Deployment.
Configure the scheduled auto scaling rules as follows:
- Scale up to the target number of replicas at 8:00 AM to handle daytime inference traffic.
- Scale down to zero or a minimal number of replicas at 10:00 PM to release GPUs for training jobs.
For detailed configuration, see Scheduled auto scaling.
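
As a rough illustration of the two rules above, the following sketch computes the replica count that the schedule would target at a given time of day. It does not call EAS; `target_replicas`, `day_replicas`, and `night_replicas` are hypothetical names, and the replica counts are placeholders you would choose for your own service.

```python
from datetime import time

SCALE_UP_AT = time(8, 0)     # 8:00 AM: scale up for daytime traffic
SCALE_DOWN_AT = time(22, 0)  # 10:00 PM: scale down to free GPUs


def target_replicas(now: time, day_replicas: int, night_replicas: int = 0) -> int:
    """Return the replica count the schedule targets at the given time of day."""
    if SCALE_UP_AT <= now < SCALE_DOWN_AT:
        return day_replicas
    return night_replicas
```

For example, `target_replicas(time(14, 30), day_replicas=16)` falls inside the daytime window and returns the daytime count, while any time from 10:00 PM up to 8:00 AM returns the night count.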
Create DLC training jobs or DSW instances on a sub-quota and enable idle compute resources. For more information, see Create a training job.
After you enable idle compute resources, training jobs can use spare compute beyond the quota limit. GPUs released by EAS at night are automatically assigned to training jobs.
For detailed configuration, see Use idle-time resources.
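
The effect of the idle compute setting can be pictured with a short calculation. This is a simplified model, not the scheduler's actual logic: `schedulable_gpus` and its parameters are hypothetical, and it assumes that every idle GPU in the cluster is available to the job.

```python
def schedulable_gpus(
    quota_gpus: int, quota_used: int, idle_gpus: int, idle_enabled: bool
) -> int:
    """Return how many GPUs a new training job could be given.

    quota_gpus   -- size of the sub-quota the job runs on
    quota_used   -- GPUs of that sub-quota already in use
    idle_gpus    -- GPUs currently idle elsewhere (for example, released by EAS)
    idle_enabled -- whether idle compute resources are enabled for the job
    """
    own_free = max(0, quota_gpus - quota_used)
    borrowed = idle_gpus if idle_enabled else 0
    return own_free + borrowed
```

With Quota 1.1 (48 GPUs) already using 40 GPUs and 32 GPUs released by EAS at night, the job could receive up to 8 GPUs without idle compute resources and up to 40 GPUs with them, under the assumptions above.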
Grant workspace administrator permissions to Teams A, B, and C. For more information about workspace configuration, see Configure a workspace. For role definitions, see Roles and permissions.

Use cases

Preempt training resources for inference

On the Resource Quota page, click Quota 1, and on the Overview tab, enable the Child-level compute preemption switch.

When Team A's inference service needs more resources than are available, the system automatically reclaims training resources from Teams B and C.
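
The following sketch shows one way to reason about how many GPUs would have to be reclaimed for a given inference request. It is not the platform's scheduling algorithm: `reclaim_for_inference` is a hypothetical helper, and the choice to take GPUs from the busiest sub-quota first is an assumption made only to keep the example deterministic.

```python
from typing import Dict


def reclaim_for_inference(
    total_gpus: int,
    inference_used: int,
    training_used: Dict[str, int],
    inference_request: int,
) -> Dict[str, int]:
    """Return {sub_quota: gpus_reclaimed} needed to grant inference_request GPUs."""
    free = total_gpus - inference_used - sum(training_used.values())
    shortfall = max(0, inference_request - free)
    reclaimed: Dict[str, int] = {}
    # Assumption: reclaim from the sub-quota using the most GPUs first.
    by_usage = sorted(training_used.items(), key=lambda kv: kv[1], reverse=True)
    for name, used in by_usage:
        if shortfall == 0:
            break
        take = min(used, shortfall)
        if take > 0:
            reclaimed[name] = take
            shortfall -= take
    if shortfall > 0:
        raise RuntimeError("inference request exceeds cluster capacity")
    return reclaimed


# 128-GPU cluster: inference uses 8 GPUs, training uses 48 + 72, nothing is free.
plan = reclaim_for_inference(
    total_gpus=128,
    inference_used=8,
    training_used={"Quota 1.1": 48, "Quota 1.2": 72},
    inference_request=80,
)
```

Here the request for 80 more GPUs cannot be met from free capacity, so the plan reclaims 72 GPUs from Quota 1.2 and 8 GPUs from Quota 1.1.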

Redistribute resources between Teams B and C

Adjust the resources of Quota 1.1 and Quota 1.2 based on team requirements. On the Resource Quota page, find Quota 1.1 or Quota 1.2, and click Scale in the Actions column. For more information, see Scale quotas.

Scale Quota 1.1 from 48 GPUs to 56 GPUs (add 8 GPUs).
Scale Quota 1.2 from 80 GPUs to 72 GPUs (remove 8 GPUs).
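
Because both sub-quotas share the 128 GPUs of Quota 1, it can help to check that a planned redistribution leaves the total unchanged. The helper below is a hypothetical sketch for that arithmetic only; it does not perform the scaling, which is done on the Resource Quota page.

```python
from typing import Dict


def move_gpus(quotas: Dict[str, int], src: str, dst: str, count: int) -> Dict[str, int]:
    """Return a new quota plan with `count` GPUs moved from src to dst."""
    if count < 0 or count > quotas[src]:
        raise ValueError(f"cannot move {count} GPUs out of {src}")
    plan = dict(quotas)  # leave the input untouched
    plan[src] -= count
    plan[dst] += count
    return plan


current = {"Quota 1.1": 48, "Quota 1.2": 80}
planned = move_gpus(current, src="Quota 1.2", dst="Quota 1.1", count=8)

# The parent quota's 128 GPUs are still fully accounted for.
assert sum(planned.values()) == sum(current.values()) == 128
```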

Isolate permissions between Teams B and C

Quota 1.1 is bound to workspace_b, and Quota 1.2 is bound to workspace_c. Teams B and C manage their resources and jobs independently within their own workspaces. For more information, see Workspace scheduling center.

To configure resource usage roles, go to the Workspace Settings page and click the Scheduling Configuration tab. In the Resource Usage section, select the Allowed Roles for the target quota, click + Add to add a configuration entry, and then click Save.
