ACK provides specialized scheduling capabilities for AI training, batch inference, heterogeneous GPU and FPGA workloads, and large-scale batch jobs. Use the tables below to identify the right feature for your scenario.
Elastic scheduling
Mix ECS instances, Elastic Container Instances (ECI), and preemptible instances in a single application, then define priority-based policies that control which resource type is used first during scale-out and which is released first during scale-in.
| Feature | Scenario | References |
|---|---|---|
| Elastic scheduling | Reduce costs by using preferred resources first during scale-out (for example, exhaust ECS instances before falling back to ECI) and releasing the fallback resources (ECI in this example) first during scale-in. Supports subscription, pay-as-you-go, and preemptible instances. | Use Elastic Container Instance-based scheduling and Configure priority-based resource scheduling |
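
The priority behavior described above can be illustrated with a small simulation. This is a minimal sketch, not ACK's implementation: the `ResourcePool` class, the `scale_out` and `scale_in` helpers, and the capacities are hypothetical, and the sketch assumes that scale-in releases pods in the reverse of the scale-out priority order.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """A hypothetical resource type with a fixed pod capacity."""
    name: str
    capacity: int
    pods: list = field(default_factory=list)


def scale_out(pools, pod_names):
    """Place each pod in the first (highest-priority) pool that has room."""
    for pod in pod_names:
        for pool in pools:
            if len(pool.pods) < pool.capacity:
                pool.pods.append(pod)
                break
        else:
            raise RuntimeError(f"no capacity left for {pod}")


def scale_in(pools, count):
    """Remove pods from the lowest-priority pool first (reverse order)."""
    removed = []
    for pool in reversed(pools):
        while pool.pods and len(removed) < count:
            removed.append(pool.pods.pop())
    return removed


# Pools are listed in priority order: ECS first, ECI as the fallback.
# The capacities are illustrative assumptions.
pools = [ResourcePool("ecs", capacity=3), ResourcePool("eci", capacity=10)]

scale_out(pools, [f"pod-{i}" for i in range(5)])
placement = {pool.name: list(pool.pods) for pool in pools}

removed = scale_in(pools, 3)

print(placement)  # pods per pool after scale-out
print(removed)    # pods released by scale-in, in release order
```

In this run, the first three pods land on ECS and the remaining two overflow to ECI. Scaling in by three removes both ECI pods before touching ECS.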
Task scheduling
ACK provides gang scheduling, Capacity Scheduling, and Kube Queue for batch processing and AI workloads.
| Feature | Scenario | References |
|---|---|---|
| Gang scheduling | Distributed training or batch jobs that require all tasks to start simultaneously. Without gang scheduling, partially started jobs hold cluster resources while waiting for their remaining pods, and several such jobs can deadlock (all jobs stuck in Pending). Gang scheduling schedules the pods of a job on an all-or-nothing basis, so a job either gets every pod it needs or holds no resources. | Work with gang scheduling |
| Capacity Scheduling | Multi-team clusters where different teams use resources at different times. Standard Kubernetes resource quotas allocate fixed amounts per namespace, leading to idle resources when a team's quota goes unused. Capacity Scheduling, modeled on the YARN capacity scheduler and built on the Kubernetes scheduling framework, lets teams share idle resources across quota boundaries. | Use Capacity Scheduling |
| Kube Queue (ack-kube-queue) | Large clusters running AI, machine learning, and batch workloads submitted by multiple users. Pod-level scheduling degrades when job counts are high, and jobs from different users can interfere during scheduling. ack-kube-queue manages job queues with customizable policies and an integrated quota system to maximize resource utilization. | Use ack-kube-queue to manage job queues |
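
The deadlock that gang scheduling avoids can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not the ACK scheduler: `schedule_pod_by_pod`, `schedule_gang`, and `runnable` are hypothetical helpers, each pod is assumed to occupy one slot, and the pod-by-pod baseline is modeled as round-robin binding across jobs.

```python
def schedule_pod_by_pod(jobs, free_slots):
    """Naive baseline: bind pods one at a time, round-robin across jobs."""
    bound = {name: 0 for name in jobs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for name, size in jobs.items():
            if free_slots > 0 and bound[name] < size:
                bound[name] += 1
                free_slots -= 1
                progress = True
    return bound


def schedule_gang(jobs, free_slots):
    """All-or-nothing: admit a job only if every one of its pods fits."""
    bound = {name: 0 for name in jobs}
    for name, size in jobs.items():
        if size <= free_slots:
            bound[name] = size
            free_slots -= size
    return bound


def runnable(jobs, bound):
    """A job can start only when all of its pods are bound."""
    return [name for name, size in jobs.items() if bound[name] == size]


# Two jobs that each need 4 pods, on a cluster with only 6 free slots.
jobs = {"job-a": 4, "job-b": 4}

naive = schedule_pod_by_pod(jobs, free_slots=6)
gang = schedule_gang(jobs, free_slots=6)

print(naive, runnable(jobs, naive))
print(gang, runnable(jobs, gang))
```

With pod-by-pod binding, each job ends up holding three of the six slots and neither can start. With all-or-nothing admission, `job-a` runs and `job-b` waits without holding any resources.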
Scheduling of heterogeneous resources
ACK schedules heterogeneous resources through cGPU, topology-aware CPU scheduling, topology-aware GPU scheduling, and FPGA scheduling. For the node labels that control GPU scheduling, see Labels used by ACK to control GPUs.
GPU sharing with cGPU
cGPU lets multiple pods share a single GPU while isolating each pod's GPU memory. ACK Pro clusters support the following GPU policies based on your workload type:
| Policy | Use when | Description |
|---|---|---|
| One-pod-one-GPU sharing and memory isolation | Model inference | Each pod uses a portion of a single GPU. Multiple pods share that GPU, and memory isolation is enforced between them. |
| One-pod-multi-GPU sharing and memory isolation | Developing code for distributed model training | A single pod uses portions of multiple GPUs, with memory isolation enforced on each GPU. |
| binpack or spread allocation | Improving GPU utilization or ensuring high availability | binpack places pods on the same GPU first to raise utilization; spread distributes pods across GPUs to improve availability. |
See cGPU Professional Edition for setup instructions.
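
The difference between binpack and spread allocation can be shown with a short sketch. It is a generic illustration of the two strategies, not cGPU's actual allocator: `pick_gpu` and `allocate` are hypothetical helpers, and GPU memory is expressed in illustrative GiB values.

```python
def pick_gpu(free_memory, request, policy):
    """Return the index of the GPU chosen for a request, or None if none fits.

    binpack picks the fitting GPU with the least free memory;
    spread picks the fitting GPU with the most free memory.
    Ties go to the lowest GPU index.
    """
    candidates = [i for i, free in enumerate(free_memory) if free >= request]
    if not candidates:
        return None
    if policy == "binpack":
        return min(candidates, key=lambda i: free_memory[i])
    if policy == "spread":
        return max(candidates, key=lambda i: free_memory[i])
    raise ValueError(f"unknown policy: {policy}")


def allocate(total_memory, requests, policy):
    """Place each request in turn and return (placement, remaining free memory)."""
    free = list(total_memory)
    placement = []
    for request in requests:
        gpu = pick_gpu(free, request, policy)
        if gpu is not None:
            free[gpu] -= request
        placement.append(gpu)
    return placement, free


# Two 16 GiB GPUs and three pods that each request 4 GiB (assumed values).
binpack_placement, binpack_free = allocate([16, 16], [4, 4, 4], "binpack")
spread_placement, spread_free = allocate([16, 16], [4, 4, 4], "spread")

print(binpack_placement, binpack_free)
print(spread_placement, spread_free)
```

In this example, binpack puts all three pods on GPU 0 and leaves GPU 1 entirely free for a large request, while spread alternates between the two GPUs so that a single GPU failure affects fewer pods.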
Topology-aware CPU scheduling and topology-aware GPU scheduling
For performance-sensitive workloads, the scheduler selects an optimal placement based on the hardware topology of the node: GPU-to-GPU communication paths (NVLink and PCIe Switches) and the non-uniform memory access (NUMA) topology of CPUs.
| Feature | References |
|---|---|
| Topology-aware CPU scheduling | Topology-aware CPU scheduling |
| Topology-aware GPU scheduling | Overview |
FPGA scheduling
Schedule workloads that require FPGA resources to FPGA-accelerated nodes using labels, and manage all FPGA resources in the cluster in a unified manner.
| Feature | References |
|---|---|
| FPGA scheduling | Use labels to schedule pods to FPGA-accelerated nodes |
Task queue scheduling
ACK lets you customize task queue scheduling for AI workloads, machine learning workloads, and batch jobs.