In a Non-Uniform Memory Access (NUMA) architecture, GPU-heavy workloads on multi-GPU nodes suffer when CPUs and GPUs land on different NUMA nodes — cross-node memory access increases latency, limits bandwidth, and degrades throughput. ACK's NUMA topology-aware scheduling places a Pod's CPUs and GPUs on the same NUMA node, eliminating unnecessary cross-node traffic.
How it works
A NUMA node is the basic unit of a NUMA system. A NUMA set combines multiple nodes on one worker node to efficiently allocate resources and reduce processor memory contention. Worker nodes with eight GPUs typically have multiple NUMA nodes. Without CPU-GPU colocation on the same NUMA node, applications experience CPU contention and cross-NUMA communication overhead.
Native Kubernetes uses kubelet's CPU and NUMA policies for single-node resource binding, but this approach has cluster-level gaps:
Scheduler unawareness: The scheduler cannot evaluate remaining NUMA resources when making placement decisions, causing Pods to enter the AdmissionError state and destabilizing the cluster.
Uncontrollable placement: Topology policies are node-level parameters, so you cannot use node affinity to control colocation across the cluster.
Policy inflexibility: Each node supports only one topology policy, requiring manual cluster partitioning and labeling that reduces overall resource utilization.
ACK solves these limitations through the Scheduler Framework. The gputopo-device-plugin and ack-koordlet components of ack-koordinator report CPU and GPU topology from each node to the scheduler, enabling you to declare NUMA placement policies at the Pod level.
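ack-koordlet publishes the collected topology to the scheduler through a per-node custom resource. The sketch below is a hypothetical illustration assuming the community NodeResourceTopology API (topology.node.k8s.io/v1alpha1) that koordinator-based components commonly use; the node name, zone layout, and resource counts are illustrative assumptions, not the exact schema that ACK reports.

```yaml
# Hypothetical sketch of per-node topology reporting (assumed API group,
# kind, and fields; the exact CR that ack-koordlet writes may differ).
apiVersion: topology.node.k8s.io/v1alpha1
kind: NodeResourceTopology
metadata:
  name: gpu-node-1            # assumed: one CR per worker node, named after it
zones:
- name: node-0                # NUMA node 0
  type: Node
  resources:
  - name: cpu
    capacity: "48"
    allocatable: "46"
  - name: aliyun.com/gpu      # GPUs attached to this NUMA node
    capacity: "4"
    allocatable: "4"
- name: node-1                # NUMA node 1
  type: Node
  resources:
  - name: cpu
    capacity: "48"
    allocatable: "46"
  - name: aliyun.com/gpu
    capacity: "4"
    allocatable: "4"
```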
Prerequisites
Before you begin, ensure that you have:
Cluster:
An ACK Pro cluster running version 1.24 or later. To upgrade, see Upgrade a cluster.
Nodes:
Nodes from the sccgn7ex instance family (GPU-accelerated supercomputing clusters) or Lingjun nodes. For instance family details, see Instance families. For Lingjun nodes, see Manage LINGJUN clusters and Lingjun nodes.
The label ack.node.gpu.schedule=topology added to each node where you want topology-aware GPU scheduling (see the sketch after this list). For instructions, see Enable scheduling features.
Components:
kube-scheduler version 6.4.4 or later. To upgrade, go to the ACK console, click your cluster name, and choose Operations Management > Add-ons. For more information, see kube-scheduler.
The ack-koordinator add-on (formerly ack-slo-manager) installed with the following configuration:
ACK Lingjun clusters: Install ack-koordinator directly with no extra configuration.
ACK Pro clusters: Set the NodeTopologyReport field in the agentFeatures feature gate to true during installation.
The GPU topology reporting add-on (gputopo-device-plugin) installed. This add-on collects GPU-to-CPU NUMA topology information and reports it to the cluster. For installation instructions, see Install the GPU topology-aware scheduling add-on.
If you install the GPU topology reporting add-on before ack-koordinator, restart the GPU topology reporting add-on after ack-koordinator installation completes.
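As a quick check of the node-label prerequisite above, a correctly labeled worker node carries the label in its metadata. The sketch below shows only the relevant fields; the node name is a placeholder.

```yaml
# Sketch of a worker node enabled for topology-aware GPU scheduling.
# Only the relevant metadata is shown; the node name is hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    ack.node.gpu.schedule: "topology"
```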
Limitations
Incompatibility:
This feature provides unified CPU and GPU NUMA affinity scheduling. It is mutually exclusive with the legacy standalone scheduling policies. Do not enable this feature on workloads that already use topology-aware CPU scheduling or the legacy standalone version of topology-aware GPU scheduling.
Only CPU and GPU colocation is supported.
Resource specification requirements:
CPU requests for all containers in a Pod must be whole numbers (unit: cores), and requests must equal limits.
GPU resources must be requested using aliyun.com/gpu, not nvidia.com/gpu. Only whole GPU cards are supported. A sketch of a compliant resources block follows this list.
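The sketch below illustrates a resources block that satisfies all of the rules above: whole-core CPU values, requests equal to limits, and whole GPU cards requested through aliyun.com/gpu. The counts are placeholders.

```yaml
# Compliant resources block for NUMA topology-aware scheduling.
resources:
  limits:
    aliyun.com/gpu: '2'   # whole cards only; use aliyun.com/gpu, not nvidia.com/gpu
    cpu: '12'             # whole cores only; fractional values such as 500m do not qualify
  requests:
    aliyun.com/gpu: '2'   # must equal limits
    cpu: '12'             # must equal limits
```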
Billing
This feature requires the Cloud-Native AI Suite, which may incur additional fees. For details, see Billing of the Cloud-Native AI Suite.
Worker node resources: ack-koordinator runs as a self-managed component on worker nodes and consumes CPU and memory. Configure resource requests for each module during installation.
Prometheus monitoring metrics: If you select Enable Prometheus Metrics for ACK-Koordinator during installation and use Alibaba Cloud Prometheus, the metrics count as custom metrics and incur fees based on cluster size and application count. Before enabling this option, review the Prometheus billing documentation for free quota and billing details. Monitor usage through billing and usage queries.
Enable NUMA topology-aware scheduling
Add the following annotations to your Pod spec. Comments in the YAML list all valid values for each field.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # Enables CPU binding. Only "required" is supported.
    cpuset-scheduler: required
    # numaTopologyPolicy controls placement scope:
    #   "SingleNUMANode" – all CPUs and GPUs on the same NUMA node (strict)
    #   "Restricted" – all CPUs and GPUs within the same NUMA set (strict)
    #   "BestEffort" – attempts same-node placement; falls back if unavailable
    # singleNUMANodeExclusive controls which NUMA node types are eligible:
    #   "Required" (default) – avoids NUMA nodes already used by Pods with a different topology type
    #   "Preferred" – no restriction on NUMA node type
    scheduling.alibabacloud.com/numa-topology-spec: |
      {
        "numaTopologyPolicy": "SingleNUMANode",
        "singleNUMANodeExclusive": "Preferred"
      }
spec:
  containers:
  - name: example
    image: ghcr.io/huggingface/text-generation-inference:1.4
    resources:
      limits:
        aliyun.com/gpu: '4'
        cpu: '24'
      requests:
        aliyun.com/gpu: '4'
        cpu: '24'
```

Placement policy (numaTopologyPolicy)
numaTopologyPolicy controls the scope of CPU and GPU placement.
| Value | Behavior | When the policy cannot be satisfied |
|---|---|---|
| SingleNUMANode | Places all CPUs and GPUs on the same NUMA node | Pod is not scheduled |
| Restricted | Places all CPUs and GPUs within the same NUMA set (multiple NUMA nodes on one worker node) | Pod is not scheduled |
| BestEffort | Attempts to place CPUs and GPUs on the same NUMA node | Selects the next best available node |
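For example, the annotation sketch below requests best-effort placement: the scheduler tries to keep CPUs and GPUs on one NUMA node but still schedules the Pod when that is not possible. Only the relevant Pod metadata is shown.

```yaml
# Sketch: best-effort NUMA placement. The Pod is scheduled even when
# same-node placement is unavailable.
metadata:
  annotations:
    cpuset-scheduler: required
    scheduling.alibabacloud.com/numa-topology-spec: |
      {
        "numaTopologyPolicy": "BestEffort"
      }
```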
Exclusivity policy (singleNUMANodeExclusive)
singleNUMANodeExclusive controls which NUMA node types the Pod can use.
NUMA nodes are categorized by their current occupancy:
| NUMA node type | Description |
|---|---|
| idle | No Pods are running on it |
| single | Only Pods bound to a single NUMA node are running on it |
| shared | Only Pods spread across multiple NUMA nodes are running on it |
| Value | Placement rule |
|---|---|
| Required (default) | Single-NUMA Pods can land only on idle or single nodes; multi-NUMA Pods can land only on idle or shared nodes. |
| Preferred | No restriction on NUMA node type |
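To make the default explicit, the sketch below combines strict single-node placement with Required exclusivity, so the Pod lands only on idle or single NUMA nodes as described in the table above. Only the relevant Pod metadata is shown.

```yaml
# Sketch: strict single-node placement with explicit (default) exclusivity.
metadata:
  annotations:
    cpuset-scheduler: required
    scheduling.alibabacloud.com/numa-topology-spec: |
      {
        "numaTopologyPolicy": "SingleNUMANode",
        "singleNUMANodeExclusive": "Required"
      }
```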
Performance comparison
The following test measures model loading time before and after enabling NUMA topology-aware scheduling. The test uses text-generation-inference to load a model on four GPU cards, with NVIDIA Nsight Systems measuring GPU loading speed.
Test environment: Lingjun nodes, text-generation-inference v1.4, NVIDIA Nsight Systems
Test results vary by tool and environment. The data below was collected using NVIDIA Nsight Systems; your results may differ.
With topology-aware scheduling
The following Deployment adds NUMA topology annotations and switches the GPU resource request from nvidia.com/gpu to aliyun.com/gpu. This change lets the ACK scheduler identify and manage GPU-CPU NUMA affinity.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tgi-numa
  name: tgi-numa-deployment-basic
  namespace: yueming-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-numa
  template:
    metadata:
      annotations:
        cpuset-scheduler: required
        scheduling.alibabacloud.com/numa-topology-spec: |
          {
            "numaTopologyPolicy": "SingleNUMANode"
          }
      labels:
        app: tgi-numa
    spec:
      containers:
      - command:
        - sleep
        - 3600d
        image: ghcr.io/huggingface/text-generation-inference:1.4
        imagePullPolicy: IfNotPresent
        name: numa
        resources:
          limits:
            aliyun.com/gpu: '4'
            cpu: '24'
          requests:
            aliyun.com/gpu: '4'
            cpu: '24'
        volumeMounts:
        - mountPath: /llm
          name: volume-1710932083254
      restartPolicy: Always
      schedulerName: default-scheduler
      volumes:
      - name: volume-1710932083254
        persistentVolumeClaim:
          claimName: model
```

Model loading time: 5.4 seconds, a 66% improvement over the baseline.