
Container Service for Kubernetes:Enable NUMA topology-aware scheduling

Last Updated:Mar 26, 2026

In a Non-Uniform Memory Access (NUMA) architecture, GPU-heavy workloads on multi-GPU nodes suffer when CPUs and GPUs land on different NUMA nodes — cross-node memory access increases latency, limits bandwidth, and degrades throughput. ACK's NUMA topology-aware scheduling places a Pod's CPUs and GPUs on the same NUMA node, eliminating unnecessary cross-node traffic.

How it works

A NUMA node is the basic unit of a NUMA system. A NUMA set combines multiple nodes on one worker node to efficiently allocate resources and reduce processor memory contention. Worker nodes with eight GPUs typically have multiple NUMA nodes. Without CPU-GPU colocation on the same NUMA node, applications experience CPU contention and cross-NUMA communication overhead.

Native Kubernetes uses kubelet's CPU and NUMA policies for single-node resource binding, but this approach has cluster-level gaps:

  • Scheduler unawareness: The scheduler cannot evaluate remaining NUMA resources when making placement decisions, causing Pods to enter AdmissionError states and destabilizing the cluster.

  • Uncontrollable placement: Topology policies are node-level parameters, so you cannot use node affinity to control colocation across the cluster.

  • Policy inflexibility: Each node supports only one topology policy, requiring manual cluster partitioning and labeling that reduces overall resource utilization.

ACK solves these limitations through the Scheduler Framework. The gputopo-device-plugin and ack-koordlet components of ack-koordinator report CPU and GPU topology from each node to the scheduler, enabling you to declare NUMA placement policies at the Pod level.
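Conceptually, the reported topology is stored per node and read by the scheduler when it filters and scores candidate nodes. The fragment below is an illustrative sketch of such a per-node topology report; the API version, object name, zone names, and field layout are assumptions for illustration, not the exact schema that ack-koordinator produces.

```yaml
# Illustrative sketch of a per-node NUMA topology report.
# The schema shown here is an assumption for illustration only.
apiVersion: topology.node.k8s.io/v1alpha1   # assumed API group/version
kind: NodeResourceTopology
metadata:
  name: cn-hangzhou.192.0.2.10              # one object per worker node (example name)
zones:
- name: node-0            # NUMA node 0
  type: Node
  resources:
  - name: cpu
    capacity: "48"
    allocatable: "48"
  - name: aliyun.com/gpu
    capacity: "4"
    allocatable: "4"
- name: node-1            # NUMA node 1
  type: Node
  resources:
  - name: cpu
    capacity: "48"
    allocatable: "48"
  - name: aliyun.com/gpu
    capacity: "4"
    allocatable: "4"
```

With per-NUMA-node accounting like this, the scheduler can reject a worker whose individual NUMA nodes each lack, say, 24 free CPUs and 4 free GPUs, even when the worker's aggregate free resources would appear sufficient.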


Prerequisites

Before you begin, ensure that you have:

Components:

  • kube-scheduler version 6.4.4 or later. To upgrade, go to the ACK console, click your cluster name, and choose Operations Management > Add-ons. For more information, see kube-scheduler.

  • The ack-koordinator add-on (formerly ack-slo-manager) installed with the following configuration:

    • ACK Lingjun clusters: Install ack-koordinator directly with no extra configuration.

    • ACK Pro clusters: Set the NodeTopologyReport field in the agentFeatures Feature Gate to true during installation.

  • The GPU topology reporting add-on (gputopo-device-plugin) installed. This add-on collects GPU-to-CPU NUMA topology information and reports it to the cluster. For installation instructions, see Install the GPU topology-aware scheduling add-on.

Important

If you install the GPU topology reporting add-on before ack-koordinator, restart the GPU topology reporting add-on after ack-koordinator installation completes.

Limitations

Incompatibility:

  • This feature provides unified CPU and GPU NUMA affinity scheduling. It is mutually exclusive with the legacy standalone scheduling policies. Do not enable this feature on workloads that already use topology-aware CPU scheduling or the legacy standalone version of topology-aware GPU scheduling.

  • Only CPU and GPU colocation is supported.

Resource specification requirements:

  • CPU requests for all containers in a Pod must be whole numbers (unit: cores), and requests must equal limits.

  • GPU resources must be requested using aliyun.com/gpu, not nvidia.com/gpu. Only whole GPU cards are supported.
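For example, the following container resources block satisfies both requirements. The specific values are illustrative; the point is that CPU values are whole cores, requests equal limits, and GPUs are requested as whole aliyun.com/gpu cards.

```yaml
# Compliant: whole-core CPUs, requests equal to limits, whole aliyun.com/gpu cards.
resources:
  requests:
    cpu: '16'              # must be an integer number of cores (no '15.5')
    aliyun.com/gpu: '2'    # whole cards only; do not use nvidia.com/gpu
  limits:
    cpu: '16'              # must equal the request
    aliyun.com/gpu: '2'
```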

Billing

This feature requires the Cloud-Native AI Suite, which may incur additional fees. For details, see Billing of the Cloud-Native AI Suite.

Worker node resources: ack-koordinator runs as a self-managed component on worker nodes and consumes CPU and memory. Configure resource requests for each module during installation.

Prometheus monitoring metrics: If you select Enable Prometheus Metrics for ACK-Koordinator during installation and use Alibaba Cloud Prometheus, the metrics count as custom metrics and incur fees based on cluster size and application count. Before enabling this option, review the Prometheus billing documentation for free quota and billing details. Monitor usage through billing and usage queries.

Enable NUMA topology-aware scheduling

Add the following annotations to your Pod spec. Comments in the YAML list all valid values for each field.

apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # Enables CPU binding. Only "required" is supported.
    cpuset-scheduler: required
    # numaTopologyPolicy controls placement scope:
    #   "SingleNUMANode" – all CPUs and GPUs on the same NUMA node (strict)
    #   "Restricted"     – all CPUs and GPUs within the same NUMA set (strict)
    #   "BestEffort"     – attempts same-node placement; falls back if unavailable
    # singleNUMANodeExclusive controls which NUMA node types are eligible:
    #   "Required" (default) – avoids NUMA nodes already used by pods with a different topology type
    #   "Preferred"          – no restriction on NUMA node type
    scheduling.alibabacloud.com/numa-topology-spec: |
      {
        "numaTopologyPolicy": "SingleNUMANode",
        "singleNUMANodeExclusive": "Preferred"
      }
spec:
  containers:
  - name: example
    image: ghcr.io/huggingface/text-generation-inference:1.4
    resources:
      limits:
        aliyun.com/gpu: '4'
        cpu: '24'
      requests:
        aliyun.com/gpu: '4'
        cpu: '24'

Placement policy (numaTopologyPolicy)

numaTopologyPolicy controls the scope of CPU and GPU placement.

  • SingleNUMANode: Places all CPUs and GPUs on the same NUMA node. If the policy cannot be satisfied, the Pod is not scheduled.

  • Restricted: Places all CPUs and GPUs within the same NUMA set (multiple NUMA nodes on one worker node). If the policy cannot be satisfied, the Pod is not scheduled.

  • BestEffort: Attempts to place CPUs and GPUs on the same NUMA node. If that is not possible, the scheduler selects the next best available node.
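For example, on an 8-GPU worker whose GPUs span two NUMA nodes, a Pod requesting all 8 GPUs can never fit on a single NUMA node, so SingleNUMANode would leave it pending; Restricted instead keeps the Pod inside one NUMA set. The annotation fragment below sketches that choice, using the same annotation keys as the example above:

```yaml
metadata:
  annotations:
    cpuset-scheduler: required
    # Keep all CPUs and GPUs inside one NUMA set; the Pod stays
    # pending if no worker node can satisfy this.
    scheduling.alibabacloud.com/numa-topology-spec: |
      {
        "numaTopologyPolicy": "Restricted"
      }
```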

Exclusivity policy (singleNUMANodeExclusive)

singleNUMANodeExclusive controls which NUMA node types the Pod can use.

NUMA nodes are categorized by their current occupancy:

  • idle: No Pods are running on it.

  • single: Only Pods bound to a single NUMA node are running on it.

  • shared: Only Pods spread across multiple NUMA nodes are running on it.

singleNUMANodeExclusive takes the following values:

  • Required (default): Single-NUMA Pods can land only on idle or single nodes; multi-NUMA Pods can land only on idle or shared nodes.

  • Preferred: No restriction on NUMA node type.
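As an example, the combination below prefers same-NUMA-node placement without insisting on it, and allows the Pod to share NUMA nodes with Pods of a different topology type. It uses the same annotation keys as the example earlier in this topic:

```yaml
metadata:
  annotations:
    cpuset-scheduler: required
    # BestEffort: fall back to cross-NUMA placement if same-node placement fails.
    # Preferred: no exclusivity restriction on which NUMA nodes are eligible.
    scheduling.alibabacloud.com/numa-topology-spec: |
      {
        "numaTopologyPolicy": "BestEffort",
        "singleNUMANodeExclusive": "Preferred"
      }
```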

Performance comparison

The following test measures model loading time before and after enabling NUMA topology-aware scheduling. The test uses text-generation-inference to load a model on four GPU cards, with NVIDIA Nsight Systems measuring GPU loading speed.

Test environment: Lingjun nodes, text-generation-inference v1.4, and NVIDIA Nsight Systems.

Important

Test results vary by tool and environment. The data below was collected using NVIDIA Nsight Systems; your results may differ.

Without topology-aware scheduling

The following Deployment uses standard nvidia.com/gpu resource requests, with no NUMA annotations.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tgi
  name: tgi-deployment-basic
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - command:
            - sleep
            - 3600d
          image: ghcr.io/huggingface/text-generation-inference:1.4
          imagePullPolicy: IfNotPresent
          name: tgi
          ports:
            - containerPort: 80
              protocol: TCP
          resources:
            limits:
              cpu: '24'
              nvidia.com/gpu: '4'
            requests:
              cpu: '24'
              nvidia.com/gpu: '4'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /llm
              name: volume-1710932083254
      restartPolicy: Always
      schedulerName: default-scheduler
      volumes:
        - name: volume-1710932083254
          persistentVolumeClaim:
            claimName: model

Model loading time: 15.9s


With topology-aware scheduling

The following Deployment adds NUMA topology annotations and switches the GPU resource request from nvidia.com/gpu to aliyun.com/gpu. This change lets the ACK scheduler identify and manage GPU-CPU NUMA affinity.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tgi-numa
  name: tgi-numa-deployment-basic
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-numa
  template:
    metadata:
      annotations:
        cpuset-scheduler: required
        scheduling.alibabacloud.com/numa-topology-spec: |
          {
            "numaTopologyPolicy": "SingleNUMANode"
          }
      labels:
        app: tgi-numa
    spec:
      containers:
        - command:
            - sleep
            - 3600d
          image: ghcr.io/huggingface/text-generation-inference:1.4
          imagePullPolicy: IfNotPresent
          name: numa
          resources:
            limits:
              aliyun.com/gpu: '4'
              cpu: '24'
            requests:
              aliyun.com/gpu: '4'
              cpu: '24'
          volumeMounts:
            - mountPath: /llm
              name: volume-1710932083254
      restartPolicy: Always
      schedulerName: default-scheduler
      volumes:
        - name: volume-1710932083254
          persistentVolumeClaim:
            claimName: model

Model loading time: 5.4s — a 66% improvement over the baseline.


What's next

Enable nearest memory access acceleration for containers