Alibaba Cloud Container Compute Service (ACS) provides the gang scheduling feature, which fulfills all-or-nothing requirements in job scheduling scenarios. This topic describes how to use gang scheduling.
Prerequisites
kube-scheduler is installed and its version meets the following requirements.
| ACS cluster version | Scheduler version |
| ------------------- | ----------------- |
| 1.31 | v1.31.0-aliyun-1.2.0 and later |
| 1.30 | v1.30.3-aliyun-1.1.1 and later |
| 1.28 | v1.28.9-aliyun-1.1.0 and later |
Gang scheduling supports only the high-performance network GPU (gpu-hpn) compute type. For more information, see Definition of computing types.
The Enable Custom Labels And Schedulers For GPU-HPN Nodes setting is disabled. For more information, see Component configuration.
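To quickly confirm the cluster-version requirement listed in the table above, you can check the Kubernetes version that the cluster reports. This is only a sketch; the exact kube-scheduler build (for example, v1.31.0-aliyun-1.2.0) is a managed component and is confirmed through the component configuration referenced above.

```bash
# The server version in the output reflects the cluster's Kubernetes
# version (for example, 1.28, 1.30, or 1.31).
kubectl version
```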
Feature introduction
When a job creates multiple pods, the pods must start and run in a coordinated manner. Resources must therefore be allocated to the group of pods as a batch so that all pods obtain resources at the same time. If the scheduling requirement of any single pod cannot be met, scheduling fails for the entire group. The scheduler provides these all-or-nothing scheduling semantics to help prevent the resource deadlocks caused by resource competition among multiple jobs.
The built-in scheduler of ACS provides the gang scheduling feature to implement all-or-nothing scheduling, which ensures that jobs can run successfully.
The group of pods for which the gang scheduling feature is configured must belong to the same compute class.
Usage
The gang scheduling feature provided by ACS is compatible with the PodGroup custom resource in Kubernetes (CRD `podgroups.scheduling.sigs.k8s.io`, API version `scheduling.sigs.k8s.io/v1alpha1`). Before you submit a job, you must create a PodGroup instance in the job's namespace and specify the minimum number of pods (`minMember`) required for the job to run. Then, when you create the job's pods, associate them with the PodGroup instance by adding the `pod-group.scheduling.sigs.k8s.io` label. During scheduling, ACS allocates resources as a batch to all pods that share the same PodGroup label.
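Because gang scheduling relies on the PodGroup custom resource, you can first verify that the corresponding CRD is available in the cluster. A minimal check, assuming you have kubectl access to the cluster:

```bash
# Verify that the PodGroup CRD referenced above is registered in the cluster.
kubectl get crd podgroups.scheduling.sigs.k8s.io

# Alternatively, list the API resources and look for the PodGroup entry.
kubectl api-resources | grep -i podgroup
```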
Create a PodGroup custom resource.
```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-job-podgroup
  namespace: default
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3 # Set the minimum number of running pods.
```
Create a job and associate it with the PodGroup.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-job
  namespace: default
spec:
  parallelism: 3 # The number of pods must be greater than or equal to minMember in the PodGroup object.
  template:
    metadata:
      labels:
        alibabacloud.com/compute-class: "gpu" # Specify the compute class as gpu or gpu-hpn.
        alibabacloud.com/gpu-model-series: "example-model" # The GPU compute class requires you to specify a GPU model.
        pod-group.scheduling.sigs.k8s.io: demo-job-podgroup # Associate with the PodGroup instance demo-job-podgroup.
    spec:
      containers:
      - name: demo-job
        image: registry.cn-hangzhou.aliyuncs.com/acs/stress:v1.0.4
        args:
        - 'infinity'
        command:
        - sleep
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
      restartPolicy: Never
  backoffLimit: 4
```
Make sure that the number of associated pods is greater than or equal to the `minMember` value configured for the PodGroup instance. Otherwise, the pods cannot be scheduled.
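After the job is submitted, you can list the pods that belong to the gang by filtering on the PodGroup label. This is a sketch that assumes the example above was applied in the default namespace:

```bash
# List all pods that are associated with the demo-job-podgroup PodGroup.
kubectl get pods -n default -l pod-group.scheduling.sigs.k8s.io=demo-job-podgroup

# The number of pods returned should be greater than or equal to
# minMember (3 in this example); otherwise, none of them can be scheduled.
```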
Examples
This example demonstrates both successful and failed scheduling outcomes when you use gang scheduling for a job.
Run the following command to create the `test-gang` namespace.
```bash
kubectl create ns test-gang
```
Run the following command to create a ResourceQuota in the `test-gang` namespace to demonstrate how gang scheduling behaves when resources are insufficient.
```bash
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
  namespace: test-gang
spec:
  hard:
    pods: "2"
EOF
```
Run the following command to create a PodGroup object. In the object, `minMember` is set to 3, which specifies that at least 3 associated pods must be scheduled successfully at the same time. If one of the pods fails to be created or scheduled, all pods in the group remain in the Pending state.
```bash
cat << EOF | kubectl apply -f -
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-job-podgroup
  namespace: test-gang
spec:
  minMember: 3 # Set the minimum number of running pods.
EOF
```
Use the following YAML content to create a gang-job.yaml file. This file defines a Job object that specifies four pod replicas and is associated with the PodGroup object.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-job
  namespace: test-gang
spec:
  parallelism: 4 # The number of pods must be greater than or equal to minMember in the PodGroup object.
  template:
    metadata:
      labels:
        alibabacloud.com/compute-class: "gpu" # Specify the compute class as gpu or gpu-hpn.
        alibabacloud.com/gpu-model-series: "example-model" # The GPU compute class requires you to specify a GPU model.
        pod-group.scheduling.sigs.k8s.io: demo-job-podgroup # Associate with the PodGroup instance demo-job-podgroup.
    spec:
      containers:
      - name: demo-job
        image: registry.cn-hangzhou.aliyuncs.com/acs/stress:v1.0.4
        args:
        - 'infinity'
        command:
        - sleep
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "1"
            memory: "1Gi"
            nvidia.com/gpu: "1"
      restartPolicy: Never
  backoffLimit: 4
```
Run the following command to deploy the gang-job job to the cluster.
```bash
kubectl apply -f gang-job.yaml
```
Run the following command to view the pod status.
```bash
kubectl get pod -n test-gang
```
Expected output:
```text
NAME             READY   STATUS    RESTARTS   AGE
gang-job-hrnc6   0/1     Pending   0          23s
gang-job-wthnq   0/1     Pending   0          23s
```
The ResourceQuota limits the number of running pods to two, so only two pods are created for this job. This number is less than the `minMember` value specified in the PodGroup. Therefore, both pods remain in the Pending state and are not scheduled.
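To see why the pods are held back, you can inspect the scheduling events of one of the Pending pods. A sketch, assuming your pod names differ from the sample output above:

```bash
# Replace the pod name with one of the Pending pods from your own output.
kubectl describe pod -n test-gang gang-job-hrnc6

# Or view recent events in the namespace to see scheduler and quota messages.
kubectl get events -n test-gang --sort-by=.lastTimestamp
```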
Run the following command to delete the ResourceQuota and remove the limit on the number of pods.
```bash
kubectl delete resourcequota -n test-gang object-counts
```
Run the following command to view the pod status.
```bash
kubectl get pod -n test-gang
```
Expected output:
```text
NAME             READY   STATUS    RESTARTS   AGE
gang-job-24cz9   1/1     Running   0          96s
gang-job-mmkxl   1/1     Running   0          96s
gang-job-msr8v   1/1     Running   0          96s
gang-job-qnclz   1/1     Running   0          96s
```
The output indicates that the pods are scheduled successfully.
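If you created the `test-gang` namespace only for this walkthrough, you can remove the example resources afterwards. A minimal cleanup sketch:

```bash
# Deleting the namespace removes the Job, its pods, and the PodGroup object.
kubectl delete ns test-gang
```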