The ACK co-scheduler extends your self-managed Kubernetes cluster with advanced scheduling capabilities designed for compute-intensive workloads. Install the ack-co-scheduler component in a registered cluster to enable Gang Scheduling, CPU topology-aware scheduling, and ECI elastic scheduling for big data and AI applications.
Prerequisites
Before you begin, make sure you have:
- A registered cluster with your self-managed Kubernetes cluster connected to it. See Create an ACK One registered cluster.
- System components that meet the following version requirements:

| Component | Version |
|---|---|
| Kubernetes | 1.18.8 or later |
| Helm | 3.0 or later |
| Docker | 19.03.5 |
| Operating system | CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, Alibaba Cloud Linux |
Usage notes
When deploying a job, set .template.spec.schedulerName to ack-co-scheduler. This tells Kubernetes to route the job's pods through the ACK co-scheduler instead of the default scheduler.
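For example, in a Deployment the field sits inside the pod template (only the scheduler-related lines are shown in this excerpt):

```yaml
# Excerpt from a workload spec; the full Deployment is omitted for brevity.
spec:
  template:
    spec:
      schedulerName: ack-co-scheduler  # Route these pods through the ACK co-scheduler.
```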
Install the ack-co-scheduler component
Use onectl for scripted or automated environments. Use the console if you prefer a UI-based approach.
Install using onectl
1. Install onectl on your machine. See Use onectl to manage registered clusters.
2. Run the following command:

   ```shell
   onectl addon install ack-co-scheduler
   ```

   Expected output:

   ```
   Addon ack-co-scheduler, version **** installed.
   ```
Install using the console
1. Log on to the Container Service Management Console. In the left navigation pane, click Clusters.
2. Click the name of your cluster. In the left navigation pane, click Add-ons.
3. On the Add-ons page, click the Others tab. Find the ack-co-scheduler component and click Install in the lower-right corner of the card.
4. In the confirmation dialog box, click OK.
Gang scheduling
Gang scheduling is implemented on top of the Kubernetes scheduling framework and addresses the all-or-nothing scheduling problem for distributed jobs: all pods in a group are scheduled together, or none of them are. This prevents resource deadlocks in AI training jobs and multi-process workloads such as MPI, where every worker must run simultaneously: if some pods acquire resources while others cannot start, the entire job stalls.
Submit a TensorFlow distributed job
The following example submits a TensorFlow distributed training job with Gang Scheduling enabled. Both the PS and Worker pods use pod-group.scheduling.sigs.k8s.io labels to form a pod group.
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler  # Route pods through the ACK co-scheduler.
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '10'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '10'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
```
Key fields:
| Field | Description |
|---|---|
| `pod-group.scheduling.sigs.k8s.io/name` | Groups pods into a pod group. All pods sharing the same name are scheduled together. |
| `pod-group.scheduling.sigs.k8s.io/min-available` | Minimum number of pods that must be schedulable before any pod in the group starts. Set this value based on how many pods must run simultaneously for the job to make progress. In this example, at least 2 of the 5 pods (1 PS + 4 Workers) must be schedulable. |
| `schedulerName: ack-co-scheduler` | Routes the pod through the ACK co-scheduler. Set this on every pod template in the job. |
Verify Gang scheduling
After submitting the job, check that pods are entering a pending state together:
```shell
kubectl get pods -l pod-group.scheduling.sigs.k8s.io/name=tf-smoke-gpu
```
Pods remain in Pending until the scheduler can place at least min-available pods simultaneously. This is expected behavior, not an error. If pods stay pending for an extended period, run the following command and check the Events section for scheduling messages:
```shell
kubectl describe pod <pod-name>
```
For more information, see Use Gang scheduling.
CPU topology-aware scheduling
CPU topology-aware scheduling pins container CPU cores to the same Non-Uniform Memory Access (NUMA) node, reducing cross-node memory access latency. This benefits CPU-intensive workloads such as real-time inference and latency-sensitive services where consistent, low-latency CPU access is critical.
Prerequisites
Deploy the resource-controller component before enabling this feature. See Manage add-ons.
Enable CPU topology-aware scheduling
Add the cpuset-scheduler: "true" annotation to your Deployment's pod template and set schedulerName to ack-co-scheduler:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-numa
  labels:
    app: nginx-numa
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-numa
  template:
    metadata:
      annotations:
        cpuset-scheduler: "true"  # Enable CPU topology-aware scheduling.
      labels:
        app: nginx-numa
    spec:
      schedulerName: ack-co-scheduler  # Route pods through the ACK co-scheduler.
      containers:
      - name: nginx-numa
        image: nginx:1.13.3
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 4
          limits:
            cpu: 4
```
Key fields:
| Field | Description |
|---|---|
| `cpuset-scheduler: "true"` | Instructs the scheduler to pin this pod's CPU cores to a single NUMA node. Must be set under `template.metadata.annotations`. |
| `schedulerName: ack-co-scheduler` | Routes the pod through the ACK co-scheduler. |
| `resources.requests.cpu` / `resources.limits.cpu` | CPU resource requests and limits for the container. |
Verify CPU topology-aware scheduling
After the deployment is running, confirm that pods were scheduled with cpuset pinning:
```shell
kubectl get pods -l app=nginx-numa -o wide
```
To confirm NUMA pinning on a specific node, log on to the node and check the cpuset assigned to the container:
```shell
cat /sys/fs/cgroup/cpuset/kubepods/pod<pod-uid>/<container-id>/cpuset.cpus
```
The output shows the CPU cores allocated to the container. If they all belong to the same NUMA node, cpuset pinning is active.
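The cpuset file uses a compact range syntax such as `0-3,8`. A small helper like the following hypothetical sketch can expand it into individual core IDs, which you can then compare against the per-node core lists in `/sys/devices/system/node/node<N>/cpulist` (the function name is illustrative, not part of any ACK tooling):

```shell
# expand_cpuset: expand a cpuset string like "0-3,8" into individual
# core IDs, so the cores can be matched against a NUMA node's cpulist.
expand_cpuset() {
  spec="$1"
  out=""
  IFS=','
  for part in $spec; do          # Split the spec on commas.
    case "$part" in
      *-*)                       # A range such as "0-3".
        start="${part%-*}"
        end="${part#*-}"
        i="$start"
        while [ "$i" -le "$end" ]; do
          out="$out$i "
          i=$((i + 1))
        done
        ;;
      *)                         # A single core ID such as "8".
        out="$out$part "
        ;;
    esac
  done
  unset IFS
  printf '%s\n' "${out% }"       # Trim the trailing space.
}

expand_cpuset "0-3,8"   # → 0 1 2 3 8
```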
For more information, see Enable CPU topology-aware scheduling.
ECI elastic scheduling
ECI elastic scheduling lets you control whether pods run on Elastic Compute Service (ECS) nodes, on Elastic Container Instance (ECI) resources, or on ECI resources only when ECS capacity is insufficient. This is useful for workloads with unpredictable or spiky resource demands, where you want to avoid over-provisioning ECS nodes while still handling traffic bursts.
Prerequisites
Deploy the ack-virtual-node component before enabling this feature. See Use ECI in ACK.
Enable ECI elastic scheduling
Add the alibabacloud.com/burst-resource annotation to your Deployment's pod template:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      annotations:
        alibabacloud.com/burst-resource: eci  # Use ECI when ECS capacity is insufficient.
      labels:
        app: nginx
    spec:
      schedulerName: ack-co-scheduler  # Route pods through the ACK co-scheduler.
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
```
Annotation values for `alibabacloud.com/burst-resource`:
| Value | Behavior |
|---|---|
| Not set | Use only existing ECS nodes in the cluster. |
| `eci` | Use ECS nodes first; automatically burst to ECI resources when ECS capacity is insufficient. |
| `eci_only` | Use only ECI resources. ECS nodes in the cluster are not used. |
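For example, to run every replica exclusively on ECI, change only the annotation value; the rest of the Deployment stays the same (pod template excerpt):

```yaml
# Pod template excerpt; forces all replicas onto ECI, bypassing ECS nodes.
template:
  metadata:
    annotations:
      alibabacloud.com/burst-resource: eci_only
```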
Verify ECI elastic scheduling
After deploying, check which nodes the pods are running on:
```shell
kubectl get pods -l app=nginx -o wide
```
For more information, see Use ElasticResource to implement ECI elastic scheduling (deprecated).
Shared GPU scheduling
Shared GPU scheduling allows multiple pods to share a single GPU, improving GPU utilization for inference and other workloads that do not require a full GPU.
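As a rough illustration, a pod typically requests a slice of GPU memory instead of a whole device. The sketch below assumes the cluster's shared GPU components expose the `aliyun.com/gpu-mem` extended resource; the pod name and image are placeholders, and the exact resource name depends on your shared GPU setup:

```yaml
# Hypothetical pod requesting 3 GiB of GPU memory on a shared GPU.
# Assumes the aliyun.com/gpu-mem extended resource is available.
apiVersion: v1
kind: Pod
metadata:
  name: inference-shared-gpu
spec:
  schedulerName: ack-co-scheduler  # Route the pod through the ACK co-scheduler.
  containers:
  - name: inference
    image: nginx  # Placeholder image for illustration.
    resources:
      limits:
        aliyun.com/gpu-mem: 3
```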
For setup and usage details, see the shared GPU scheduling documentation for registered clusters.