The ACK co-scheduler extends your self-managed Kubernetes cluster with advanced scheduling capabilities designed for compute-intensive workloads. Install the ack-co-scheduler component in a registered cluster to enable Gang Scheduling, CPU topology-aware scheduling, and ECI elastic scheduling for big data and AI applications.
Prerequisites
Before you begin, make sure you have:
- A registered cluster with your self-managed Kubernetes cluster connected to it. See Create an ACK One registered cluster.
- System components that meet the following version requirements:

| Component | Version |
|---|---|
| Kubernetes | 1.18.8 or later |
| Helm | 3.0 or later |
| Docker | 19.03.5 |
| Operating system | CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, Alibaba Cloud Linux |
Usage notes
When deploying a job, set .template.spec.schedulerName to ack-co-scheduler. This tells Kubernetes to route the job's pods through the ACK co-scheduler instead of the default scheduler.
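For example, in a Deployment the field sits inside the pod template (only the scheduler-related lines are shown in this excerpt):

```yaml
# Excerpt from a workload spec; the full Deployment is omitted for brevity.
spec:
  template:
    spec:
      schedulerName: ack-co-scheduler  # Route these pods through the ACK co-scheduler.
```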
Install the ack-co-scheduler component
Use onectl for scripted or automated environments. Use the console if you prefer a UI-based approach.
Install using onectl
1. Install onectl on your machine. See Use onectl to manage registered clusters.
2. Run the following command:

   ```shell
   onectl addon install ack-co-scheduler
   ```

   Expected output:

   ```
   Addon ack-co-scheduler, version **** installed.
   ```
Install using the console
1. Log on to the Container Service Management Console. In the left navigation pane, click Clusters.
2. Click the name of your cluster. In the left navigation pane, click Add-ons.
3. On the Add-ons page, click the Others tab. Find the ack-co-scheduler component and click Install in the lower-right corner of the card.
4. In the confirmation dialog box, click OK.
Gang scheduling
Gang scheduling is implemented on top of the Kubernetes scheduling framework and addresses the all-or-nothing scheduling problem for distributed jobs: all pods in a group are scheduled together, or none of them are. This prevents resource deadlocks in AI training jobs and multi-process workloads such as MPI, where every worker must run simultaneously: if some pods acquire resources while others cannot start, the entire job stalls.
Submit a TensorFlow distributed job
The following example submits a TensorFlow distributed training job with Gang Scheduling enabled. Both the PS and Worker pods use pod-group.scheduling.sigs.k8s.io labels to form a pod group.
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler  # Route pods through the ACK co-scheduler.
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '10'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "2"
        spec:
          schedulerName: ack-co-scheduler
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '10'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
```
Key fields:
| Field | Description |
|---|---|
| `pod-group.scheduling.sigs.k8s.io/name` | Groups pods into a pod group. All pods sharing the same name are scheduled together. |
| `pod-group.scheduling.sigs.k8s.io/min-available` | Minimum number of pods that must be schedulable before any pod in the group starts. Set this value based on how many pods must run simultaneously for the job to make progress. In this example, at least 2 of the 5 pods (1 PS + 4 Workers) must be schedulable. |
| `schedulerName: ack-co-scheduler` | Routes the pod through the ACK co-scheduler. Set this on every pod template in the job. |
Verify Gang scheduling
After submitting the job, check that pods are entering a pending state together:
```shell
kubectl get pods -l pod-group.scheduling.sigs.k8s.io/name=tf-smoke-gpu
```
Pods remain in Pending until the scheduler can place at least min-available pods simultaneously. This is expected behavior, not an error. If pods stay pending for an extended period, run the following command and check the Events section for scheduling messages:
```shell
kubectl describe pod <pod-name>
```
For more information, see Use Gang scheduling.
CPU topology-aware scheduling
CPU topology-aware scheduling pins container CPU cores to the same Non-Uniform Memory Access (NUMA) node, reducing cross-node memory access latency. This benefits CPU-intensive workloads such as real-time inference and latency-sensitive services where consistent, low-latency CPU access is critical.
Prerequisites
Deploy the resource-controller component before enabling this feature. See Manage add-ons.
Enable CPU topology-aware scheduling
Add the cpuset-scheduler: "true" annotation to your Deployment's pod template and set schedulerName to ack-co-scheduler:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-numa
  labels:
    app: nginx-numa
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-numa
  template:
    metadata:
      annotations:
        cpuset-scheduler: "true"  # Enable CPU topology-aware scheduling.
      labels:
        app: nginx-numa
    spec:
      schedulerName: ack-co-scheduler  # Route pods through the ACK co-scheduler.
      containers:
      - name: nginx-numa
        image: nginx:1.13.3
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 4
          limits:
            cpu: 4
```
Key fields:
| Field | Description |
|---|---|
| `cpuset-scheduler: "true"` | Instructs the scheduler to pin this pod's CPU cores to a single NUMA node. Must be set under `template.metadata.annotations`. |
| `schedulerName: ack-co-scheduler` | Routes the pod through the ACK co-scheduler. |
| `resources.requests.cpu` / `resources.limits.cpu` | CPU resource requests and limits for the container. |
Verify CPU topology-aware scheduling
After the deployment is running, confirm that pods were scheduled with cpuset pinning:
```shell
kubectl get pods -l app=nginx-numa -o wide
```
To confirm NUMA pinning on a specific node, log on to the node and check the cpuset assigned to the container:
```shell
cat /sys/fs/cgroup/cpuset/kubepods/pod<pod-uid>/<container-id>/cpuset.cpus
```
The output shows the CPU cores allocated to the container. If they all belong to the same NUMA node, cpuset pinning is active.
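The cpuset file uses a compact range syntax such as `0-3,8`. A small helper like the following hypothetical sketch can expand it into individual core IDs, which you can then compare against the per-node core lists in `/sys/devices/system/node/node<N>/cpulist` (the function name is illustrative, not part of any ACK tooling):

```shell
# expand_cpuset: expand a cpuset string like "0-3,8" into individual
# core IDs, so the cores can be matched against a NUMA node's cpulist.
expand_cpuset() {
  spec="$1"
  out=""
  IFS=','
  for part in $spec; do          # Split the spec on commas.
    case "$part" in
      *-*)                       # A range such as "0-3".
        start="${part%-*}"
        end="${part#*-}"
        i="$start"
        while [ "$i" -le "$end" ]; do
          out="$out$i "
          i=$((i + 1))
        done
        ;;
      *)                         # A single core ID such as "8".
        out="$out$part "
        ;;
    esac
  done
  unset IFS
  printf '%s\n' "${out% }"       # Trim the trailing space.
}

expand_cpuset "0-3,8"   # → 0 1 2 3 8
```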
For more information, see Enable CPU topology-aware scheduling.
ECI elastic scheduling
ECI elastic scheduling lets you control whether pods run on Elastic Compute Service (ECS) nodes, on Elastic Container Instance (ECI) resources, or on ECI resources only when ECS capacity is insufficient. This is useful for workloads with unpredictable or spiky resource demands, where you want to avoid over-provisioning ECS nodes while still handling traffic bursts.
Prerequisites
Deploy the ack-virtual-node component before enabling this feature. See Use ECI in ACK.
Enable ECI elastic scheduling
Add the alibabacloud.com/burst-resource annotation to your Deployment's pod template:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      annotations:
        alibabacloud.com/burst-resource: eci  # Use ECI when ECS capacity is insufficient.
      labels:
        app: nginx
    spec:
      schedulerName: ack-co-scheduler  # Route pods through the ACK co-scheduler.
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
```
Annotation values for `alibabacloud.com/burst-resource`:
| Value | Behavior |
|---|---|
| Not set | Use only existing ECS nodes in the cluster. |
| `eci` | Use ECS nodes first; automatically burst to ECI resources when ECS capacity is insufficient. |
| `eci_only` | Use only ECI resources. ECS nodes in the cluster are not used. |
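For example, to run every replica exclusively on ECI, change only the annotation value; the rest of the Deployment stays the same (pod template excerpt):

```yaml
# Pod template excerpt; forces all replicas onto ECI, bypassing ECS nodes.
template:
  metadata:
    annotations:
      alibabacloud.com/burst-resource: eci_only
```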
Verify ECI elastic scheduling
After deploying, check which nodes the pods are running on:
```shell
kubectl get pods -l app=nginx -o wide
```
For more information, see Use ElasticResource to implement ECI elastic scheduling (deprecated).
Shared GPU scheduling
Shared GPU scheduling allows multiple pods to share a single GPU, improving GPU utilization for inference and other workloads that do not require a full GPU.
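As a rough illustration, a pod typically requests a slice of GPU memory instead of a whole device. The sketch below assumes the cluster's shared GPU components expose the `aliyun.com/gpu-mem` extended resource; the pod name and image are placeholders, and the exact resource name depends on your shared GPU setup:

```yaml
# Hypothetical pod requesting 3 GiB of GPU memory on a shared GPU.
# Assumes the aliyun.com/gpu-mem extended resource is available.
apiVersion: v1
kind: Pod
metadata:
  name: inference-shared-gpu
spec:
  schedulerName: ack-co-scheduler  # Route the pod through the ACK co-scheduler.
  containers:
  - name: inference
    image: nginx  # Placeholder image for illustration.
    resources:
      limits:
        aliyun.com/gpu-mem: 3
```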
For setup and usage details, see the shared GPU scheduling documentation for registered clusters.