Container Service for Kubernetes: Scheduling overview

Last Updated: Apr 17, 2024

ACK Scheduler is developed by Container Service for Kubernetes (ACK) based on the Kubernetes scheduling framework. It schedules different workloads and elastic resources in a unified manner and supports hybrid scheduling of different elastic resources, fine-grained scheduling of heterogeneous resources, and scheduling of batch computing tasks. This helps improve application performance and the overall resource utilization of clusters. This topic describes the scheduling features provided by ACK, including elastic scheduling, task scheduling, heterogeneous resource scheduling, load-aware scheduling, and fine-grained scheduling.

Elastic scheduling

ACK provides hybrid scheduling for different types of elastic resources.

Elastic scheduling

Alibaba Cloud provides different types of elastic resources, such as Elastic Compute Service (ECS) instances and elastic container instances. The following billing methods are supported: subscription, pay-as-you-go, and preemptible instances.

This feature allows you to schedule various types of resources, such as ECS instances and elastic container instances, and to configure priority-based scheduling policies for an application. When the system deploys or scales out pods for the application, pods are scheduled to nodes in descending order of the node priorities defined in the policy. When the system scales in pods, pods are removed from nodes in ascending order of those priorities.

For example, you can configure the system to prioritize ECS instances during scale-out activities. The system starts to use elastic container instances only after ECS instances are exhausted. The system prioritizes elastic container instances over ECS instances during scale-in activities. This helps you reduce costs and use resources in a more efficient manner.
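
The priority-based scheduling policy described above is typically expressed as a ResourcePolicy custom resource. The following manifest is a minimal sketch that assumes the scheduling.alibabacloud.com/v1alpha1 ResourcePolicy API provided by ACK; the policy name, selector, and node labels are placeholders to adapt to your application.

```yaml
# Minimal sketch of a priority-based scheduling policy
# (assumes the scheduling.alibabacloud.com/v1alpha1 ResourcePolicy API).
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: web-resource-policy          # hypothetical name
  namespace: default
spec:
  selector:
    app: web                         # matches the pods of the target application
  strategy: prefer                   # prefer earlier units; fall back to later ones
  units:
  - resource: ecs                    # 1st choice: subscription ECS nodes
    nodeSelector:
      paidtype: subscription         # hypothetical node label
  - resource: ecs                    # 2nd choice: pay-as-you-go ECS nodes
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci                    # last choice: elastic container instances
```

With a policy like this, scale-in removes pods in the reverse order of the units, so pods on elastic container instances are reclaimed before pods on ECS nodes.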

Task scheduling

ACK provides the gang scheduling, capacity scheduling, and ack-kube-queue features for batch processing tasks.

Gang scheduling

If a job requires all-or-nothing scheduling, all tasks of the job must be scheduled at the same time. If only some of the tasks are started, the started tasks must wait until all the remaining tasks are scheduled. In the worst case, every submitted job has unscheduled tasks, all submitted jobs remain in the Pending state, and the cluster is deadlocked.

To resolve this issue, Alibaba Cloud provides the gang scheduling feature. Gang scheduling aims to start all correlated processes at the same time. This prevents the process group from being blocked when the system fails to start some processes.

References: Work with gang scheduling
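
As an illustration, the pods of a job can opt in to gang scheduling by carrying coscheduling labels. The sketch below assumes the pod-group.scheduling.sigs.k8s.io/name and pod-group.scheduling.sigs.k8s.io/min-available labels from the upstream scheduler-plugins convention that ACK recognizes; the job name, image, and replica counts are placeholders.

```yaml
# Sketch: all 4 worker pods share a pod group and are scheduled only when
# at least min-available of them can be placed at the same time.
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training                         # hypothetical job name
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        pod-group.scheduling.sigs.k8s.io/name: distributed-training
        pod-group.scheduling.sigs.k8s.io/min-available: "4"   # all-or-nothing
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
```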

Capacity scheduling

Kubernetes uses resource quotas to allocate fixed amounts of resources. However, a Kubernetes cluster is often shared by multiple users who consume cluster resources in different cycles and ways. As a result, statically allocated quotas frequently sit idle and resource utilization in the cluster is low.

To improve the resource utilization of a Kubernetes cluster, ACK provides the capacity scheduling feature to optimize resource allocation. This feature is designed based on the YARN Capacity Scheduler and the Kubernetes scheduling framework. Capacity scheduling allows you to meet the resource requests in a Kubernetes cluster and improve resource utilization by sharing idle resources.

References: Work with capacity scheduling
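
Quotas for capacity scheduling are organized as an elastic quota tree. The following manifest is a sketch that assumes the scheduling.sigs.k8s.io/v1beta1 ElasticQuotaTree API used by ACK capacity scheduling; the quota names, namespaces, and amounts are placeholders. min is the amount guaranteed to a leaf quota, and max is the upper bound it can reach by borrowing idle resources from sibling quotas.

```yaml
# Sketch of an elastic quota tree: min is guaranteed, max is the borrowing limit.
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system            # the quota tree is defined in kube-system
spec:
  root:
    name: root
    max:
      cpu: "40"
      memory: 40Gi
    min:
      cpu: "40"
      memory: 40Gi
    children:
    - name: root.team-a
      max:
        cpu: "40"                   # team-a may borrow idle resources up to 40 cores
        memory: 40Gi
      min:
        cpu: "20"                   # 20 cores are guaranteed to team-a
        memory: 20Gi
      namespaces:
      - team-a                      # hypothetical namespace bound to this quota
    - name: root.team-b
      max:
        cpu: "40"
        memory: 40Gi
      min:
        cpu: "20"
        memory: 20Gi
      namespaces:
      - team-b
```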

Kube Queue

Kubernetes schedulers make decisions pod by pod. If a cluster runs a large number of tasks, scheduling efficiency decreases. In addition, jobs submitted by different users may interfere with each other during scheduling.

ack-kube-queue is designed to manage AI, machine learning, and batch workloads in Kubernetes. It allows system administrators to customize job queue management and improve the flexibility of queues. Combined with a quota system, ack-kube-queue can automate and optimize the management of workloads and resource quotas to maximize resource utilization in Kubernetes clusters.

References: Use ack-kube-queue to manage job queues
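
A common queueing pattern for batch workloads is to submit Jobs in a suspended state and let a queue controller such as ack-kube-queue resume them when they are dequeued and quota is available. The sketch below only uses the standard suspend field of the Kubernetes Job API; the exact annotations and queue objects that ack-kube-queue expects are described in the referenced topic, and the names and values here are placeholders.

```yaml
# Sketch: a Job created in the suspended state. A queue controller can hold
# it in a queue and set suspend to false when quota becomes available.
apiVersion: batch/v1
kind: Job
metadata:
  name: queued-job                                  # hypothetical job name
  namespace: team-a
spec:
  suspend: true                                     # standard Job API field
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.example.com/batch:latest    # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```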

Scheduling of heterogeneous resources

ACK provides the cGPU, topology-aware CPU scheduling, topology-aware GPU scheduling, and FPGA scheduling features to allow you to schedule heterogeneous resources.

cGPU

cGPU provides GPU sharing to help reduce the costs of GPU resources while ensuring the stability of workloads that require GPU resources.

ACK Pro clusters support the following GPU policies (a pod request sketch follows this list):

  • GPU sharing and memory isolation on a one-pod-one-GPU basis. This policy is commonly used in model inference scenarios.

  • GPU sharing and memory isolation on a one-pod-multi-GPU basis. This policy is commonly used in distributed model training scenarios.

  • GPU allocation based on the binpack or spread algorithm. This policy is commonly used to improve GPU utilization and ensure the high availability of GPUs.
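
The following manifest is the request sketch mentioned above: a pod that asks for a slice of GPU memory on a shared GPU rather than a whole card. It assumes the aliyun.com/gpu-mem extended resource (in GiB) exposed by ACK shared GPU scheduling; the pod name, image, and requested amount are placeholders.

```yaml
# Sketch: request 4 GiB of GPU memory on a shared GPU (one-pod-one-GPU sharing).
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod                               # hypothetical pod name
spec:
  containers:
  - name: inference
    image: registry.example.com/inference:latest    # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 4                       # GPU memory in GiB (assumed resource name)
```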

Topology-aware CPU scheduling and topology-aware GPU scheduling

To ensure the high performance of workloads, schedulers select an optimal scheduling solution based on the topological information about the heterogeneous resources of nodes. The information includes how GPUs communicate with each other by using NVLink and PCIe Switches, and the non-uniform memory access (NUMA) topology of CPUs.

Field Programmable Gate Array (FPGA) scheduling

This feature allows you to manage the FPGA resources of a cluster in a unified manner. You can use this feature to schedule workloads that require FPGA resources to FPGA-accelerated nodes.

References: Schedule workloads to FPGA-accelerated nodes
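
Scheduling onto FPGA-accelerated nodes follows the standard Kubernetes extended-resource mechanism: the device plugin advertises an FPGA resource on the node, and the pod requests it. In the sketch below, the resource name xilinx.com/fpga and the image are placeholders; the actual resource name depends on the FPGA device plugin described in the referenced topic.

```yaml
# Sketch: request one FPGA device through an extended resource.
# The resource name is a placeholder; use the name advertised by the
# FPGA device plugin in your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: fpga-app                                    # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/fpga-app:latest     # placeholder image
    resources:
      limits:
        xilinx.com/fpga: 1                          # placeholder extended resource name
```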

Load-aware scheduling

ACK provides the load-aware scheduling feature, which places pods based on the actual loads of nodes, and the load-aware hotspot descheduling feature.

Load-aware scheduling

This feature is used to schedule pods to nodes with lower loads based on the historical statistics of node loads. This implements load balancing and prevents application or node exceptions caused by overloaded nodes.

References: Use load-aware pod scheduling

Load-aware hotspot descheduling

ack-koordinator provides the load-aware hotspot descheduling feature, which detects changes in the loads of cluster nodes and automatically evicts pods from nodes whose load exceeds the safety threshold so that they can be rescheduled. This prevents extreme load imbalance.

References: Work with load-aware hotspot descheduling

Fine-grained scheduling

ACK provides features such as recommendation on resource specifications and dynamic resource overcommitment to enable fine-grained scheduling for specific applications.

Recommendation on resource specifications

This feature provides resource profiles and allows you to obtain recommendations on resource specifications for individual containers in pods based on the resource profiles.

References: Resource profiling

Dynamic resource overcommitment

This feature monitors node loads in real time, quantifies the resources that are allocated to pods but not actually in use, and makes these idle resources schedulable for other workloads. This helps improve the resource utilization of the cluster.

References: Dynamic resource overcommitment
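
As a sketch of how a low-priority workload can consume the reclaimed capacity, the pod below requests the batch extended resources and carries the best-effort QoS label used by the Koordinator resource model that ack-koordinator follows. The resource names (kubernetes.io/batch-cpu in millicores, kubernetes.io/batch-memory in bytes) are assumptions to verify against the referenced topic, and the values are placeholders.

```yaml
# Sketch: a best-effort (BE) pod that consumes dynamically overcommitted
# resources instead of regular CPU and memory requests.
apiVersion: v1
kind: Pod
metadata:
  name: be-batch-pod                                # hypothetical pod name
  labels:
    koordinator.sh/qosClass: BE                     # mark the pod as best-effort
spec:
  containers:
  - name: worker
    image: registry.example.com/batch:latest        # placeholder image
    resources:
      requests:
        kubernetes.io/batch-cpu: 1000               # reclaimed CPU, in millicores
        kubernetes.io/batch-memory: 2Gi             # reclaimed memory
      limits:
        kubernetes.io/batch-cpu: 1000
        kubernetes.io/batch-memory: 2Gi
```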

Elastic resource limit

This feature allows you to dynamically adjust the resource usage and resource watermarks of workloads based on priorities to improve the efficiency of your workloads.

References: Elastic resource limit

CPU QoS

This feature allows you to guarantee CPU resources for high-priority workloads.

References: CPU QoS
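
Whether a pod is treated as latency-sensitive or best-effort by QoS features such as CPU QoS is typically declared through the Koordinator QoS class label; enabling the CPU QoS policy itself is done through the ack-koordinator configuration described in the referenced topic. The manifest below is a sketch in which the pod name and image are placeholders.

```yaml
# Sketch: label a pod as latency-sensitive (LS) so that QoS features such
# as CPU QoS prioritize its CPU resources over best-effort (BE) pods.
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-app                       # hypothetical pod name
  labels:
    koordinator.sh/qosClass: LS                     # LS = latency-sensitive, BE = best-effort
spec:
  containers:
  - name: app
    image: registry.example.com/web:latest          # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
```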

Memory QoS

This feature allows you to optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers.

References: Memory QoS for containers

Resource isolation based on the L3 cache and MBA

This feature uses the last level cache (L3 cache) and Memory Bandwidth Allocation (MBA) to isolate the L3 cache and memory bandwidth used by workloads. This helps ensure the quality of service (QoS) of high-priority workloads and improves the overall resource utilization.

References: Resource isolation based on the L3 cache and MBA

Dynamic resource parameter modification for pods

This feature allows you to dynamically modify the CPU and memory resource parameters for pods without restarting the pods. This feature improves resource utilization while ensuring the stability of your workloads.

References: Dynamically modify the resource parameters of a pod

Use the nearby memory access acceleration feature on multi-NUMA instances

The nearby memory access acceleration feature of ack-koordinator securely migrates the memory of a CPU-bound application from remote non-uniform memory access (NUMA) nodes to the local NUMA node. This increases the hit ratio of local memory access and improves the performance of memory-intensive applications.

References: Use the nearby memory access acceleration feature on multi-NUMA instances

Schedule pods based on idle vSwitch IP addresses

The scheduler of open source Kubernetes cannot detect whether the vSwitches used by the nodes in a cluster still have idle IP addresses. When multiple clusters or nodes use the same vSwitch, a pod may fail to launch after it is scheduled to a node whose vSwitch has no idle IP addresses. In this case, the pod controller recreates the pod, and the scheduler schedules the pod to the same node again. This cycle repeats within a short period of time, which blocks service deployment and generates a large number of alerts.

The kube-scheduler provided by ACK can detect whether the vSwitch used by a node has sufficient idle IP addresses. If it does not, kube-scheduler stops scheduling pods to that node.

References: Schedule pods based on idle vSwitch IP addresses