
ACK One Multi-cluster Spark and AI Job Scheduling

This article presents ACK One's multi-cluster AI job scheduling solution, which optimizes resource utilization by distributing Spark jobs across multiple clusters.

By Jing Cai, Kun Wu and Yi Chen

Overview

In the cloud-native era, an increasing number of users run multiple Kubernetes clusters to support their business operations. This shift is driven by factors such as global business expansion (sometimes up to the scale limits of a single cluster), disaster recovery requirements, and security compliance. When deploying applications in Kubernetes clusters, administrators usually reserve considerable resource buffers to absorb load fluctuations and ensure application reliability. However, this creates a persistent gap in which the resources requested by containers far exceed actual usage, resulting in low cluster resource utilization. A common response is to run offline tasks across multiple clusters to improve utilization, yet critical challenges remain: how to determine which cluster has sufficient resources, which cluster an offline task should be scheduled to, and how to ensure that offline tasks do not affect running online services.

Spark is a distributed computing framework widely used in the big data and AI fields. ACK One's cross-cluster AI job scheduling supports scheduling and distributing Spark jobs across multiple clusters. Combined with the colocation capability of ACK Koordinator, it addresses the preceding challenges: without disrupting running online services, it intelligently schedules Spark jobs based on the actual available resources in each cluster, maximizing the utilization of idle resources across clusters.

ACK One Multi-cluster AI Job Scheduling and Distribution

Alibaba Cloud Distributed Cloud Container Platform (ACK One) is an enterprise-class cloud-native container platform built on Container Service for Kubernetes (ACK) to meet container management requirements in hybrid cloud, multi-cluster, distributed computing, and disaster recovery scenarios. It provides unified multi-cluster management capabilities. Through ACK One registered clusters, you can connect Kubernetes clusters from other public cloud providers and from data centers to the Alibaba Cloud Container Service for Kubernetes console. The Fleet instance then provides unified multi-cluster management for these registered clusters as well as ACK and ACK Edge clusters, covering application distribution, traffic management, observable O&M, and security management.

ACK One multi-cluster AI job scheduling and distribution provides the job distribution feature for multi-cluster and hybrid cloud scenarios, enabling centralized scheduling and distribution of AI workloads. When a single Container Service for Kubernetes (ACK) cluster fails to meet the resource requirements for large-scale AI training or inference tasks, or when there are idle resources across multiple ACK clusters, this feature lets you distribute jobs across these clusters to optimize resource utilization. ACK One multi-cluster job distribution offers the following capabilities:

1. Multiple job types: Compatible with PyTorchJob, SparkApplication, and TFJob.

2. Multi-cluster gang scheduling: Distribute jobs across clusters through resource pre-allocation or dynamic resource checks, ensuring successful task deployment to sub-clusters and improving overall scheduling efficiency.

3. Multi-tenant quota management: Enforce per-tenant resource limits using ElasticQuotaTree-based namespace quotas in multi-tenant environments (see the ElasticQuotaTree sketch after this list).

4. Priority-based scheduling: Prioritize important tasks for resource allocation based on PriorityClass defined in PodTemplate for AI jobs.

5. Configurable task queuing policies: Allow flexible queue policies that support both cluster utilization optimization and task priority guarantee modes, with blocking and non-blocking scheduling patterns.

6. Job rescheduling on failure: The Global Scheduler automatically reclaims failed jobs and reschedules them to eligible clusters with sufficient resources.
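For multi-tenant quota management, the following is a minimal ElasticQuotaTree sketch that splits capacity between two tenants by namespace. The tenant names, namespaces, and quantities are placeholders, and the exact schema should be checked against the ACK One documentation.

```yaml
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:                    # total capacity available to all tenants
      cpu: 40
      memory: 160Gi
    min:
      cpu: 40
      memory: 160Gi
    children:
      - name: tenant-a      # placeholder tenant
        namespaces:
          - spark-team-a    # jobs in this namespace consume tenant-a's quota
        max:
          cpu: 24
          memory: 96Gi
        min:                # guaranteed share for tenant-a
          cpu: 16
          memory: 64Gi
      - name: tenant-b      # placeholder tenant
        namespaces:
          - spark-team-b
        max:
          cpu: 24
          memory: 96Gi
        min:
          cpu: 16
          memory: 64Gi
```

Jobs submitted beyond a tenant's quota are queued rather than scheduled, which is where the queuing policies in item 5 apply.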

[Figure 1]

The specific process is as follows:

  1. Submit PyTorchJob, SparkApplication, or TFJob type jobs, together with the PropagationPolicy distribution policy, to the Fleet instance (a sketch follows these steps).
  2. The Fleet instance performs capacity scheduling based on job priorities and tenant quotas.
  3. The Global Scheduler in the Fleet instance applies multi-cluster dynamic resource scheduling and gang scheduling for dequeued jobs, reserving resources or dynamically checking for eligible clusters. If scheduling fails, the job is re-queued.
  4. Successfully scheduled jobs are propagated to designated ACK clusters.
  5. If a job fails in a sub-cluster, the Global Scheduler reclaims and reschedules it to other eligible clusters.
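As a rough sketch of step 1, the following pairs a SparkApplication with a PropagationPolicy submitted to the Fleet instance. All names are placeholders, and the PropagationPolicy API group shown follows ACK One's distribution APIs; verify both against your environment.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi                 # placeholder job name
  namespace: spark               # must exist on the Fleet instance and sub-clusters
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0             # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 4g
  executor:
    instances: 2
    cores: 2
    memory: 8g
---
apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: spark-pi-policy
  namespace: spark
spec:
  resourceSelectors:             # bind the policy to the job above
    - apiVersion: sparkoperator.k8s.io/v1beta2
      kind: SparkApplication
      name: spark-pi
  # placement is omitted so the Global Scheduler picks the target cluster;
  # your environment may require an explicit placement section.
```

Both objects are applied against the Fleet instance's kubeconfig; the Global Scheduler then selects a sub-cluster for the SparkApplication as described in steps 2 to 4.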

Scheduling and Distributing Spark Jobs Based on Idle Resources in Multiple Clusters

The following features are required when you use idle resources to schedule and distribute a Spark job in multiple clusters:

  1. Multi-cluster Spark job scheduling and distribution provided by ACK One Fleet instances, including idle resource-aware scheduling.
  2. Single-cluster colocation of ACK Koordinator.
  3. Support for Koordinator colocation in the ACK Spark Operator.

Scheduling Based on Idle Resources

Scheduling based on idle resources in a single ACK cluster relies on ACK Koordinator's dynamic resource overcommitment. ACK Koordinator introduces the Batch concept to describe idle resources and records them on nodes as Kubernetes extended resources. As shown in the following figure, when scheduling with the default request-based scheduler, area 4 remains idle and only the light gray part at the top is newly schedulable. Scheduling based on Koordinator's Batch resources, by contrast, can make full use of idle resources (areas 5 and 6). To protect performance, Koordinator allows a certain percentage of resources to be reserved rather than reclaimed. For more information, see Enable dynamic resource overcommitment.

[Figure 2]
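Surfacing Batch resources on nodes requires dynamic resource overcommitment to be enabled in each sub-cluster. The sketch below follows the ack-slo-config ConfigMap described in the ACK documentation; treat the key names and threshold values as indicative and confirm them in Enable dynamic resource overcommitment.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-slo-config           # read by ack-koordinator
  namespace: kube-system
data:
  colocation-config: |
    {
      "enable": true,
      "metricAggregateDurationSeconds": 60,
      "cpuReclaimThresholdPercent": 60,
      "memoryReclaimThresholdPercent": 70
    }
```

Here, a cpuReclaimThresholdPercent of 60 means at most 60% of the reclaimable CPU is exposed as Batch resources, with the remainder kept as the reserved buffer mentioned above.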

Leveraging ACK Koordinator's colocation capability, the ACK Spark Operator can convert the resources requested by a SparkApplication's driver and executors into Koordinator's Batch resources when specific annotations are set. The resulting pods are then scheduled against Batch resources by the ACK Scheduler.
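The annotation that triggers this conversion is specific to the ACK Spark Operator and is not reproduced here; what can be illustrated is the approximate shape of a converted driver pod. The label and extended resource names below are Koordinator's standard colocation conventions, and the values are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-driver
  namespace: spark
  labels:
    koordinator.sh/qosClass: BE         # best-effort QoS; yields to online services
spec:
  containers:
    - name: spark-kubernetes-driver
      image: spark:3.5.0                # placeholder image
      resources:
        requests:
          kubernetes.io/batch-cpu: "1000"   # reclaimed CPU, in milli-cores
          kubernetes.io/batch-memory: 4Gi   # reclaimed memory
        limits:
          kubernetes.io/batch-cpu: "1000"
          kubernetes.io/batch-memory: 4Gi
```

Because the pod requests kubernetes.io/batch-cpu and kubernetes.io/batch-memory instead of regular cpu and memory, the ACK Scheduler places it against the reclaimed capacity shown in areas 5 and 6 of the figure above.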

From the multi-cluster perspective, the Global Scheduler of the Fleet instance dynamically monitors the idle resources of sub-clusters when scheduling jobs. Based on the corresponding annotation, it checks whether each sub-cluster's Batch resources are sufficient for the resources required by the driver and executors, and combines this with multi-cluster gang scheduling to determine the optimal target cluster for job distribution. For sub-clusters running Kubernetes 1.28 or later, the Fleet instance supports resource pre-allocation to improve the success rate of Spark job scheduling. The Fleet instance also watches the running status of Spark jobs in sub-clusters: if a driver cannot run due to insufficient resources, the Fleet instance reclaims the SparkApplication after a specific period of time and reschedules it to another sub-cluster with sufficient resources.

Complete Solution Process

[Figure 3]

The following steps are required when you use idle resources to schedule and distribute a Spark job in multiple clusters:

1. Prepare the environment.

  • Create an ACK One Fleet instance and associate multiple ACK clusters with the Fleet instance for unified management.
  • Install ACK Koordinator in each ACK cluster and manage colocation policies.
  • Determine the namespace for the SparkApplications to be distributed. Create the namespace in the Fleet instance and in each ACK cluster, and configure it as a managed namespace when installing the ACK Spark Operator.

2. Create the SparkApplication and PropagationPolicy on the Fleet instance. The multi-cluster scheduling component (Global Scheduler) of the Fleet instance matches the Spark job's resource requests against the remaining resources of each associated sub-cluster. For sub-clusters running Kubernetes 1.28 or later, the Fleet instance supports resource pre-allocation to improve the success rate of Spark job scheduling.

  • In this step, you can use a PriorityClass to assign a low priority to the submitted Spark job, ensuring that it does not preempt resources from online services or affect their normal operation.

3. After scheduling on the Fleet instance succeeds, the SparkApplication is distributed to the selected associated cluster.

4. If a job fails, the Global Scheduler reclaims it and reschedules it to another eligible cluster with sufficient resources.

In addition, by lowering the priority of Spark jobs through PriorityClass and relying on ACK Koordinator's native QoS capabilities, you can run Spark jobs without compromising the normal operation of online services in the target cluster.
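As a minimal sketch of that setup, the PriorityClass below (the name and value are placeholders) can be referenced from the Spark pod templates; a negative value keeps Spark jobs below the default priority of 0 used by online services, and preemptionPolicy: Never prevents them from evicting running pods.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spark-low-priority      # placeholder name
value: -100                     # below the default priority (0) of online services
globalDefault: false
preemptionPolicy: Never         # Spark pods wait instead of preempting
description: Low priority for offline Spark jobs colocated with online services.
```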

Conclusion

In summary, the multi-cluster Spark job scheduling and distribution provided by ACK One helps you break through single-cluster resource limits and schedule Spark jobs based on the idle resources of each cluster, without affecting the online services running in those clusters. This maximizes the utilization of idle resources across your clusters. ACK One also supports multi-cluster scheduling for other AI job types such as PyTorchJob and TFJob. For more information, see Best practices of multi-cluster Spark scheduling based on idle resources and ACK One multi-cluster AI job scheduling and distribution.
