This topic describes the scheduling of elastic resources, AI tasks, and heterogeneous resources to help maximize resource utilization and improve the efficiency of jobs.
ACK provides hybrid scheduling for different types of elastic resources.
|Elastic scheduling||Alibaba Cloud provides different types of elastic resources, such as Elastic Compute Service (ECS) instances and elastic container instances. The following billing methods are supported: subscription, pay-as-you-go, and preemptible instances.
This feature allows you to schedule various types of resources, such as ECS instances and elastic container instances. You can also configure priority-based resource scheduling policies for an application. When the system deploys or scales out pods for an application, it schedules the pods to nodes in the priority order defined in the scheduling policies that you configured. When the system scales in pods for an application, it removes pods from nodes in ascending order of node priority, which means that pods on the lowest-priority nodes are removed first.
For example, you can configure the system to prioritize ECS instances during scale-out activities. The system starts to use elastic container instances only after ECS instances are exhausted. During scale-in activities, the system removes pods from elastic container instances before it removes pods from ECS instances. This helps you reduce costs and use resources more efficiently.
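The ordering described above can be illustrated with a small simulation. This is only a sketch of the idea, not the ACK scheduler or its policy format: the `Pool` class, the `scale_out` and `scale_in` helpers, the numeric priorities, and the pool capacities are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """A hypothetical resource pool, for example ECS nodes or elastic container instances."""
    name: str
    priority: int   # assumption: a higher value is preferred during scale-out
    capacity: int   # maximum number of pods the pool can host
    pods: list = field(default_factory=list)


def scale_out(pools, pod_names):
    """Place each pod in the highest-priority pool that still has room."""
    for pod in pod_names:
        for pool in sorted(pools, key=lambda p: p.priority, reverse=True):
            if len(pool.pods) < pool.capacity:
                pool.pods.append(pod)
                break
        else:
            raise RuntimeError(f"no capacity left for {pod}")


def scale_in(pools, count):
    """Remove up to `count` pods, starting with the lowest-priority pool."""
    removed = []
    for pool in sorted(pools, key=lambda p: p.priority):
        while pool.pods and len(removed) < count:
            removed.append(pool.pods.pop())
    return removed


ecs = Pool("ecs", priority=2, capacity=3)
eci = Pool("eci", priority=1, capacity=10)
pools = [ecs, eci]

# Five pods: the first three fill the ECS pool, the rest overflow to elastic container instances.
scale_out(pools, [f"web-{i}" for i in range(5)])
placed_on_ecs = list(ecs.pods)
placed_on_eci = list(eci.pods)

# Scaling in by three pods empties the elastic container instance pool before touching ECS.
removed = scale_in(pools, 3)
```

In this sketch, the pods on elastic container instances are the first to be released, so the pay-as-you-go capacity is only held while the ECS pool is full.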
ACK provides the gang scheduling and capacity scheduling features for batch processing tasks.
|Gang scheduling||Some jobs require all-or-nothing scheduling: all tasks of the job must be scheduled at the same time. If only some of the tasks are started, the started tasks must wait until all the remaining tasks are scheduled. If every submitted job contains unscheduled tasks, all submitted jobs remain in the Pending state and the cluster is deadlocked.
To resolve this issue, Alibaba Cloud provides the gang scheduling feature. Gang scheduling aims to start all correlated processes at the same time. This prevents the process group from being blocked when the system fails to start some processes.
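The deadlock and the all-or-nothing remedy can be sketched with a toy model. This is an illustration under simplifying assumptions, not the ACK implementation: each task needs exactly one slot, and the function names `schedule_naive`, `schedule_gang`, and `runnable` are hypothetical.

```python
def schedule_naive(jobs, free_slots):
    """Start tasks one at a time in round-robin order, with no all-or-nothing guarantee."""
    started = {name: 0 for name in jobs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_slots > 0 and started[name] < needed:
                started[name] += 1
                free_slots -= 1
                progress = True
    return started


def schedule_gang(jobs, free_slots):
    """Admit a job only if all of its tasks fit into the free slots at once."""
    started = {}
    for name, needed in jobs.items():
        if needed <= free_slots:
            started[name] = needed
            free_slots -= needed
        else:
            started[name] = 0  # the whole job waits; it holds no slots
    return started


def runnable(jobs, started):
    """Return the jobs whose tasks have all been started."""
    return [name for name in jobs if started[name] == jobs[name]]


# Two jobs that each need four tasks, on a cluster with six free slots.
jobs = {"job-a": 4, "job-b": 4}

naive = schedule_naive(jobs, free_slots=6)
gang = schedule_gang(jobs, free_slots=6)

naive_runnable = runnable(jobs, naive)
gang_runnable = runnable(jobs, gang)
```

In the naive run, each job holds three slots and neither can make progress. With the gang rule, `job-a` starts in full and `job-b` waits without occupying any slots.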
|Capacity Scheduling||Kubernetes uses resource quotas to allocate resources based on fixed amounts. However, a Kubernetes cluster may be managed by multiple users who use cluster resources in different cycles and ways. As a result, resource utilization is low in the Kubernetes cluster.
To improve the resource utilization of a Kubernetes cluster, ACK provides the capacity scheduling feature to optimize resource allocation. This feature is designed based on the YARN capacity scheduler and the Kubernetes scheduling framework. Capacity scheduling meets the resource requests of users in a Kubernetes cluster and improves resource utilization by letting users share idle resources.
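The difference between a fixed quota and sharing idle resources can be sketched as follows. This is a simplified model, not the ACK capacity scheduler: it assumes that each quota group has a guaranteed amount (`min`) and an upper bound (`max`), and it does not model how borrowed resources are reclaimed. The `Quota` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """A hypothetical quota group; units are arbitrary (for example, CPU cores)."""
    name: str
    min: int       # assumption: amount guaranteed to the group
    max: int       # assumption: upper bound, including borrowed idle resources
    used: int = 0


def admit_fixed(quota, request):
    """Fixed allocation: a group can never exceed its own guaranteed share."""
    return quota.used + request <= quota.min


def admit_elastic(quota, request, quotas, cluster_total):
    """Elastic allocation: a group may borrow idle capacity up to its max."""
    if quota.used + request > quota.max:
        return False
    total_used = sum(q.used for q in quotas)
    return total_used + request <= cluster_total


team_a = Quota("team-a", min=5, max=8)
team_b = Quota("team-b", min=5, max=8, used=2)
quotas = [team_a, team_b]
cluster_total = 10

# team-a asks for 7 units while team-b leaves 3 of its guaranteed units idle.
fixed_ok = admit_fixed(team_a, 7)
elastic_ok = admit_elastic(team_a, 7, quotas, cluster_total)

# After the request is admitted, a further request would exceed team-a's max.
team_a.used += 7
over_max_ok = admit_elastic(team_a, 2, quotas, cluster_total)
```

Under the fixed rule, the request for 7 units is rejected even though the cluster has idle capacity. Under the elastic rule, it is admitted because it fits within the group's upper bound and the cluster's free resources.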
Scheduling of heterogeneous resources
Container Service for Kubernetes (ACK) provides the cGPU, topology-aware CPU scheduling, and topology-aware GPU scheduling features to allow you to schedule heterogeneous resources. For more information, see Labels used by ACK to control GPUs.
|cGPU||cGPU provides GPU sharing to help reduce the costs of GPU resources while ensuring the stability of workloads that require GPU resources.
ACK Pro clusters support the following GPU policies:
|cGPU Professional Edition|
|Topology-aware CPU scheduling and topology-aware GPU scheduling||To ensure the high performance of workloads, the scheduler selects an optimal scheduling solution based on topology information about the heterogeneous resources of nodes. This information includes how GPUs communicate with each other through NVLink and PCIe switches, and the non-uniform memory access (NUMA) topology of CPUs.|
|Field Programmable Gate Array (FPGA) scheduling||This feature allows you to manage the FPGA resources of a cluster in a unified manner. You can use this feature to schedule workloads that require FPGA resources to FPGA-accelerated nodes.||Use labels to schedule pods to FPGA-accelerated nodes|
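As an illustration of the topology-aware GPU scheduling described in the table above, the following sketch picks the group of free GPUs with the best interconnect. It is a toy model under stated assumptions, not the algorithm that ACK uses: the link labels, the numeric scores in `LINK_SCORE`, and the `best_gpu_set` function are hypothetical, and the exhaustive search is only practical for the small GPU counts of a single node.

```python
from itertools import combinations

# Hypothetical scores: a higher value stands for a faster interconnect.
LINK_SCORE = {"NVLINK": 3, "PCIE_SWITCH": 2, "CROSS_NUMA": 1}


def best_gpu_set(links, free_gpus, count):
    """Pick `count` free GPUs whose pairwise links have the highest total score.

    `links` maps frozenset({gpu_a, gpu_b}) to a link label from LINK_SCORE.
    Returns a tuple of GPU indexes, or None if fewer than `count` GPUs are free.
    """
    def score(group):
        return sum(LINK_SCORE[links[frozenset(pair)]] for pair in combinations(group, 2))

    candidates = combinations(sorted(free_gpus), count)
    return max(candidates, key=score, default=None)


# A node with four GPUs: 0-1 and 2-3 are NVLink pairs, the other pairs are slower.
links = {
    frozenset({0, 1}): "NVLINK",
    frozenset({2, 3}): "NVLINK",
    frozenset({0, 2}): "PCIE_SWITCH",
    frozenset({1, 3}): "PCIE_SWITCH",
    frozenset({0, 3}): "CROSS_NUMA",
    frozenset({1, 2}): "CROSS_NUMA",
}

# GPU 1 is busy, so a two-GPU workload is best served by the remaining NVLink pair.
chosen = best_gpu_set(links, free_gpus={0, 2, 3}, count=2)

# Not enough free GPUs: no placement is possible.
none_chosen = best_gpu_set(links, free_gpus={0}, count=2)
```

A scheduler that ignored the topology could place the workload on GPUs 0 and 3, which communicate over the slowest link in this example.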