Container Service for Kubernetes (ACK) provides the gang scheduling feature based
on the kube-scheduler scheduling framework. This feature provides a solution to job
scheduling in all-or-nothing scenarios. This topic describes how to enable gang scheduling.
Prerequisites
- A professional managed Kubernetes cluster is created. For more information, see Create a professional managed Kubernetes cluster.
Notice Gang scheduling is available only for professional managed Kubernetes clusters.
To enable gang scheduling for dedicated Kubernetes clusters,
submit a ticket to add your account to the whitelist.
- The following table describes the system component versions that are required for
gang scheduling.
Component        | Required version
Kubernetes       | V1.16 and later
Helm             | V3.0 and later
Docker           | 19.03.5
Operating system | CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, and Alibaba Cloud Linux 2
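One way to confirm these versions is with the standard CLI checks sketched below. These are generic commands, not ACK-specific; run the node-level checks on a cluster node.
kubectl version --short   # Kubernetes client and server versions
helm version --short      # Helm version
docker --version          # Docker version on the node
cat /etc/os-release       # operating system release of the node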
Background information
Gang scheduling is a scheduling algorithm that schedules all correlated processes
in a parallel system to different processors and starts them simultaneously. Starting
all correlated processes at the same time prevents the process group from being blocked
when the system fails to start some of the processes.
For example, if you submit a batch job that contains multiple tasks, either all of
the tasks are scheduled or none of them is scheduled. Scheduling tasks in this all-or-nothing
manner is known as gang scheduling.
Kubernetes is widely used to orchestrate online services. ACK aims to use Kubernetes
as a platform for the unified management of online services and offline jobs. This improves
the resource utilization and performance of clusters. However, the default kube-scheduler
cannot meet the requirements of specific offline workloads. For example, if a job requires
all-or-nothing scheduling, all tasks of the job must be scheduled at the same time.
If only some of the tasks are started, the started tasks must hold their resources and
wait until all the remaining tasks are scheduled. If every submitted job has started tasks
that hold resources while waiting for unscheduled tasks, no job can acquire the resources
it still needs: all submitted jobs remain in the Pending state and the cluster is deadlocked.
To avoid this situation, you must enable gang scheduling for kube-scheduler.
Features
In ACK, a pod group is a group of pods that need to be scheduled at the same time.
When you submit a job that requires all-or-nothing scheduling, you can add labels to
its pods. The labels specify the name of the pod group to which the job belongs and
the minimum number of tasks that must be scheduled for the job to run. kube-scheduler
schedules the tasks only when the cluster resources are sufficient for this minimum
number of tasks. Otherwise, the job remains in the Pending state.
How to enable gang scheduling
To enable gang scheduling, specify name and min-available by adding the following labels to the pods:
labels:
  pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
  pod-group.scheduling.sigs.k8s.io/min-available: "3"
- name: Specifies the name of a pod group.
- min-available: Specifies the minimum number of pods that must be scheduled to run a job. Pods are
scheduled only when the computing resources are sufficient to schedule the required
number of pods.
Note Pods in the same pod group must be assigned the same priority.
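The labels can be attached to the pod template of any workload, not only to the TFJob used later in this topic. The following is a minimal sketch that assumes a hypothetical Deployment named gang-demo and a placeholder nginx image; its three replicas form one pod group, so they are scheduled only when all three pods can be placed at once.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gang-demo                  # hypothetical name for illustration
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gang-demo
  template:
    metadata:
      labels:
        app: gang-demo
        # All replicas join the pod group "gang-demo"; min-available
        # matches the replica count, so scheduling is all-or-nothing.
        pod-group.scheduling.sigs.k8s.io/name: gang-demo
        pod-group.scheduling.sigs.k8s.io/min-available: "3"
    spec:
      containers:
      - name: worker
        image: nginx               # placeholder image
        resources:
          limits:
            cpu: '1'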
Examples
In this example, a distributed TensorFlow job is used to demonstrate how to enable
gang scheduling. The ACK cluster that is used in this example has four GPUs.
- Run the following commands to deploy the environment that is required to run the distributed
TensorFlow job in your ACK cluster by using Arena.
Note Arena is a subproject of
Kubeflow. Kubeflow is an open source project for machine learning based on Kubernetes. Arena
allows you to manage the lifecycle of machine learning jobs by using a command-line
interface or an SDK. Lifecycle management includes environment setup, data preparation,
model development, model training, and model prediction. This improves the working
efficiency of data scientists.
git clone https://github.com/kubeflow/arena.git
kubectl create ns arena-system
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
Run the following command to check whether the environment for running TensorFlow jobs
is deployed. If the pods are in the Running state, the environment is deployed.
kubectl get pods -n arena-system
NAME READY STATUS RESTARTS AGE
tf-job-dashboard-56cf48874f-gwlhv 1/1 Running 0 54s
tf-job-operator-66494d88fd-snm9m 1/1 Running 0 54s
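You can also confirm that the TFJob CustomResourceDefinition was registered by the tf-crd.yaml manifest. The CRD name below is an assumption based on the kubeflow.org API group:
kubectl get crd tfjobs.kubeflow.org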
- Use the following template to submit a distributed TensorFlow job to the ACK cluster.
The job consists of one parameter server (PS) pod and four worker pods, and each worker
pod requires two GPUs. The min-available label is set to 5 because all five pods must
be scheduled for the job to run.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '1'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
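Assuming that the template above is saved to a file named tf-smoke-gpu.yaml (the file name is arbitrary), you can submit the job by running:
kubectl create -f tf-smoke-gpu.yaml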
- Submit the distributed TensorFlow job without gang scheduling enabled.
Run the following command to query the states of the pods. Only two of the worker pods
are running; the other two worker pods are in the Pending state.
kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-smoke-gpu-ps-0 1/1 Running 0 6m43s
tf-smoke-gpu-worker-0 1/1 Running 0 6m43s
tf-smoke-gpu-worker-1 1/1 Running 0 6m43s
tf-smoke-gpu-worker-2 0/1 Pending 0 6m43s
tf-smoke-gpu-worker-3 0/1 Pending 0 6m43s
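To check why the remaining workers are pending, you can describe one of them. The events section typically reports that no node has enough nvidia.com/gpu resources; the exact message depends on your cluster:
kubectl describe pod tf-smoke-gpu-worker-2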
Run the following command to query the logs of the running worker pods. The returned
log data indicates that the running worker pods are waiting for the pending worker
pods to start. In the meantime, the GPUs occupied by the running worker pods remain
idle.
kubectl logs -f tf-smoke-gpu-worker-0
INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
- Submit the distributed TensorFlow job with gang scheduling enabled.
Run the following command to query the states of the pods. Because the computing resources
in the cluster are insufficient to schedule the minimum number of pods, the pod group
cannot be scheduled and all pods are in the Pending state.
kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-smoke-gpu-ps-0 0/1 Pending 0 43s
tf-smoke-gpu-worker-0 0/1 Pending 0 43s
tf-smoke-gpu-worker-1 0/1 Pending 0 43s
tf-smoke-gpu-worker-2 0/1 Pending 0 43s
tf-smoke-gpu-worker-3 0/1 Pending 0 43s
After four more GPUs are added to the cluster, the computing resources become sufficient
to schedule the minimum number of pods. The pod group is then scheduled and the four
worker pods start to run. Run the following command to query the states of
the pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-smoke-gpu-ps-0 1/1 Running 0 3m16s
tf-smoke-gpu-worker-0 1/1 Running 0 3m16s
tf-smoke-gpu-worker-1 1/1 Running 0 3m16s
tf-smoke-gpu-worker-2 1/1 Running 0 3m16s
tf-smoke-gpu-worker-3 1/1 Running 0 3m16s
Run the following command to query the logs of a running worker pod. The following
output indicates that the tasks have started.
kubectl logs -f tf-smoke-gpu-worker-0
INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step Img/sec loss
INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1 images/sec: 31.6 +/- 0.0 (jitter = 0.0) 8.318
INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10 images/sec: 31.1 +/- 0.4 (jitter = 0.7) 8.343
INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20 images/sec: 31.5 +/- 0.3 (jitter = 0.7) 8.142
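When you no longer need the job, you can delete it. The following command assumes the TFJob name tf-smoke-gpu from the template above:
kubectl delete tfjob tf-smoke-gpu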