Alibaba Cloud Container Compute Service (ACS) provides the gang scheduling feature, which fulfills all-or-nothing requirements in job scheduling scenarios. This topic describes how to use gang scheduling.
Prerequisites
kube-scheduler is installed and its version meets the following requirements.
| ACS cluster version | Scheduler version |
| ------------------- | ----------------- |
| 1.31 | v1.31.0-aliyun-1.2.0 and later |
| 1.30 | v1.30.3-aliyun-1.1.1 and later |
| 1.28 | v1.28.9-aliyun-1.1.0 and later |
Gang scheduling supports only the high-performance network GPU (gpu-hpn) compute type. For more information, see Definition of computing types.
The Enable Custom Labels And Schedulers For GPU-HPN Nodes setting is disabled. For more information, see Component configuration.
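To quickly confirm the cluster-version requirement listed in the table above, you can check the Kubernetes version that the cluster reports. This is only a sketch; the exact kube-scheduler build (for example, v1.31.0-aliyun-1.2.0) is a managed component and is confirmed through the component configuration referenced above.

```bash
# The server version in the output reflects the cluster's Kubernetes
# version (for example, 1.28, 1.30, or 1.31).
kubectl version
```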
Feature introduction
When a job creates multiple pods, the pods must start and run in a coordinated manner. Resources must therefore be allocated to the group of pods as a batch so that all pods obtain resources at the same time. If the scheduling requirement of any single pod cannot be met, scheduling fails for the entire group. The scheduler provides these all-or-nothing scheduling semantics to help prevent the resource deadlocks caused by resource competition among multiple jobs.
The built-in scheduler of ACS provides the gang scheduling feature to implement all-or-nothing scheduling, which ensures that jobs can run successfully.
The group of pods for which the gang scheduling feature is configured must belong to the same compute class.
Usage
The gang scheduling feature provided by ACS is compatible with the PodGroup custom resource in Kubernetes (CRD `podgroups.scheduling.sigs.k8s.io`, API version `scheduling.sigs.k8s.io/v1alpha1`). Before you submit a job, you must create a PodGroup instance in the job's namespace and specify the minimum number of pods (`minMember`) required for the job to run. Then, when you create the job's pods, associate them with the PodGroup instance by adding the `pod-group.scheduling.sigs.k8s.io` label. During scheduling, ACS allocates resources as a batch to all pods that share the same PodGroup label.
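Because gang scheduling relies on the PodGroup custom resource, you can first verify that the corresponding CRD is available in the cluster. A minimal check, assuming you have kubectl access to the cluster:

```bash
# Verify that the PodGroup CRD referenced above is registered in the cluster.
kubectl get crd podgroups.scheduling.sigs.k8s.io

# Alternatively, list the API resources and look for the PodGroup entry.
kubectl api-resources | grep -i podgroup
```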
Create a PodGroup custom resource.
```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-job-podgroup
  namespace: default
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3 # Set the minimum number of running pods.
```
Create a job and associate it with the PodGroup.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-job
  namespace: default
spec:
  parallelism: 3 # The number of pods must be greater than or equal to minMember in the PodGroup object.
  template:
    metadata:
      labels:
        alibabacloud.com/compute-class: "gpu" # Specify the compute class as gpu or gpu-hpn.
        alibabacloud.com/gpu-model-series: "example-model" # The GPU compute class requires you to specify a GPU model.
        pod-group.scheduling.sigs.k8s.io: demo-job-podgroup # Associate with the PodGroup instance demo-job-podgroup.
    spec:
      containers:
      - name: demo-job
        image: registry.cn-hangzhou.aliyuncs.com/acs/stress:v1.0.4
        args:
        - 'infinity'
        command:
        - sleep
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
      restartPolicy: Never
  backoffLimit: 4
```
Make sure that the number of associated pods is greater than or equal to the `minMember` value configured for the PodGroup instance. Otherwise, the pods cannot be scheduled.
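After the job is submitted, you can list the pods that belong to the gang by filtering on the PodGroup label. This is a sketch that assumes the example above was applied in the default namespace:

```bash
# List all pods that are associated with the demo-job-podgroup PodGroup.
kubectl get pods -n default -l pod-group.scheduling.sigs.k8s.io=demo-job-podgroup

# The number of pods returned should be greater than or equal to
# minMember (3 in this example); otherwise, none of them can be scheduled.
```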
Examples
This example demonstrates both successful and failed scheduling outcomes when you use gang scheduling for a job.
Run the following command to create the `test-gang` namespace.
```bash
kubectl create ns test-gang
```
Run the following command to create a ResourceQuota in the `test-gang` namespace to demonstrate how gang scheduling behaves when resources are insufficient.
```bash
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
  namespace: test-gang
spec:
  hard:
    pods: "2"
EOF
```
Run the following command to create a PodGroup object. In the object, `minMember` is set to 3, which specifies that at least 3 associated pods must be scheduled successfully at the same time. If one of the pods fails to be created or scheduled, all pods in the group remain in the Pending state.
```bash
cat << EOF | kubectl apply -f -
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-job-podgroup
  namespace: test-gang
spec:
  minMember: 3 # Set the minimum number of running pods.
EOF
```
Use the following YAML content to create a gang-job.yaml file. This file defines a Job object that specifies four pod replicas and is associated with the PodGroup object.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-job
  namespace: test-gang
spec:
  parallelism: 4 # The number of pods must be greater than or equal to minMember in the PodGroup object.
  template:
    metadata:
      labels:
        alibabacloud.com/compute-class: "gpu" # Specify the compute class as gpu or gpu-hpn.
        alibabacloud.com/gpu-model-series: "example-model" # The GPU compute class requires you to specify a GPU model.
        pod-group.scheduling.sigs.k8s.io: demo-job-podgroup # Associate with the PodGroup instance demo-job-podgroup.
    spec:
      containers:
      - name: demo-job
        image: registry.cn-hangzhou.aliyuncs.com/acs/stress:v1.0.4
        args:
        - 'infinity'
        command:
        - sleep
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
      restartPolicy: Never
  backoffLimit: 4
```
Run the following command to deploy the gang-job job to the cluster.
```bash
kubectl apply -f gang-job.yaml
```
Run the following command to view the pod status.
```bash
kubectl get pod -n test-gang
```
Expected output:
```text
NAME             READY   STATUS    RESTARTS   AGE
gang-job-hrnc6   0/1     Pending   0          23s
gang-job-wthnq   0/1     Pending   0          23s
```
The ResourceQuota limits the number of running pods to two, so only two pods are created for this job. This number is less than the `minMember` value specified in the PodGroup. Therefore, both pods remain in the Pending state and are not scheduled.
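To see why the pods are held back, you can inspect the scheduling events of one of the Pending pods. A sketch, assuming your pod names differ from the sample output above:

```bash
# Replace the pod name with one of the Pending pods from your own output.
kubectl describe pod -n test-gang gang-job-hrnc6

# Or view recent events in the namespace to see scheduler and quota messages.
kubectl get events -n test-gang --sort-by=.lastTimestamp
```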
Run the following command to delete the ResourceQuota and remove the limit on the number of pods.
```bash
kubectl delete resourcequota -n test-gang object-counts
```
Run the following command to view the pod status.
```bash
kubectl get pod -n test-gang
```
Expected output:
```text
NAME             READY   STATUS    RESTARTS   AGE
gang-job-24cz9   1/1     Running   0          96s
gang-job-mmkxl   1/1     Running   0          96s
gang-job-msr8v   1/1     Running   0          96s
gang-job-qnclz   1/1     Running   0          96s
```
The output indicates that the pods are scheduled successfully.
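If you created the `test-gang` namespace only for this walkthrough, you can remove the example resources afterwards. A minimal cleanup sketch:

```bash
# Deleting the namespace removes the Job, its pods, and the PodGroup object.
kubectl delete ns test-gang
```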