When multiple applications run on the same node, CPU resource contention and context switching degrade latency-sensitive workloads. CPU topology-aware scheduling pins application processes to specific CPU cores, eliminating performance jitter from core migration and cross-Non-Uniform Memory Access (NUMA) memory access.
How it works
By default, Kubernetes uses the kernel's Completely Fair Scheduler (CFS) to distribute CPU time slices across all cores. CFS ignores physical CPU topology and can cause unpredictable latency in performance-sensitive applications.
The Kubernetes CPU Manager (with the static policy) can bind pods to exclusive CPU cores, but has three limitations:
- No cluster-level topology awareness: The native kube-scheduler operates at the node level only and cannot see the CPU topology of the entire cluster.
- No NUMA awareness: The `static` policy does not account for NUMA architecture when allocating cores, which can force memory access across NUMA nodes and introduce extra latency.
- Guaranteed QoS only: This policy applies only to pods with a `Guaranteed` QoS class. It cannot be used with `Burstable` or `BestEffort` pods.
ACK addresses these limitations through collaboration between the ACK kube-scheduler and ack-koordinator, built on the Kubernetes Scheduling Framework:
- Node topology reporting: ack-koordinator continuously detects the local physical CPU topology, including sockets, NUMA nodes, and caches, and reports this data to the scheduling center.
- Global topology-aware scheduling: The kube-scheduler uses cluster-wide topology data to select the optimal node for each pod and plan the core allocation scheme. By default, the scheduler selects the cores with the fewest bound applications. The resulting allocation scheme is written to the pod's annotation.
- Local core pinning: After the pod lands on the target node, ack-koordinator reads the pod's annotation and modifies the `cpuset.cpus` file in the pod's cgroup to bind the pod to the assigned physical cores.
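The `cpuset.cpus` file uses the standard Linux cpuset list format, such as `0-3` or `0-1,8-9`. As a small illustrative sketch (not part of ack-koordinator), this Python snippet expands such a range string into the individual core IDs it denotes:

```python
def parse_cpuset(spec: str) -> list[int]:
    """Expand a cpuset list string (e.g. "0-3,8,10-11") into core IDs."""
    cores: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cores.append(int(part))
    return cores

print(parse_cpuset("0-3"))      # [0, 1, 2, 3]
print(parse_cpuset("0-1,8-9"))  # [0, 1, 8, 9]
```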
Use cases
CPU topology-aware scheduling works best for the following workloads:
- Latency-sensitive applications: High-frequency trading systems and real-time data processing pipelines cannot tolerate CPU context switching delays. Without CPU pinning, the CFS scheduler may migrate your process between cores mid-execution: harmless in a test environment, but measurable in production under load.
- NUMA-sensitive applications: Workloads on multi-socket servers (such as AMD or Intel Elastic Bare Metal Instances) suffer measurable latency when memory accesses cross NUMA boundaries. Pinning cores within a single NUMA node eliminates this overhead.
- Deterministic computing: Scientific computing and big data analytics jobs require stable, predictable CPU throughput. Pinning eliminates the variance introduced by sharing cores with other workloads.
- Legacy applications unaware of container CPU limits: Applications that spawn threads based on the total physical core count of the host rather than the container's CPU limit benefit from pinning, which binds the process to a specific set of cores.
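The legacy-application case can be made concrete with a hypothetical sketch: a process that sizes its thread pool from the host's total core count oversubscribes a pinned cpuset, whereas sizing from the process's actual CPU affinity respects the pinned cores. Names here (`pool_size`) are illustrative only:

```python
import os

host_cores = os.cpu_count() or 1  # total logical cores visible on the host
try:
    # Cores this process is actually allowed to run on (Linux only);
    # inside a pinned pod this reflects the cpuset, not the whole host.
    pinned_cores = len(os.sched_getaffinity(0))
except AttributeError:
    pinned_cores = host_cores  # fallback on platforms without sched_getaffinity

# Sizing the pool from the affinity mask avoids spawning more busy threads
# than the pod has cores to run them on.
pool_size = pinned_cores
print(host_cores, pool_size)
```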
Do not enable CPU topology-aware scheduling in the following scenarios:
- CPU overcommitment environments: Core pinning reserves cores exclusively, which is incompatible with the resource-sharing model of overcommitment and causes resource waste.
- I/O-intensive or general-purpose applications: Web services, middleware, and most I/O-bound workloads are not sensitive to CPU core switching and gain nothing from pinning.
Choose a pinning policy
ACK provides two pinning policies:
| Policy | Annotation | Pinning behavior | Recommended for |
|---|---|---|---|
| General | `cpuset-scheduler: "true"` | 1:1 pinning: binds exactly the number of cores in `resources.limits.cpu`, preferring cores within the same NUMA node | Most latency-sensitive workloads |
| Automatic | `cpuset-scheduler: "true"` and `cpu-policy: "static-burst"` | Analyzes real-time topology and resource usage; may pin more cores than requested, prioritizing a complete physical core cluster (CCX/CCD on AMD CPUs) | Large-scale AMD machines with 32 or more cores |
Prerequisites
Before you begin, ensure that you have:
- An ACK managed cluster of the Pro edition.
- A node pool with `cpuManagerPolicy` set to `none`. For more information, see Customize kubelet configurations for a node pool.
- ack-koordinator 0.2.0 or later installed. For installation instructions, see ack-koordinator.
Enable CPU topology-aware scheduling
Enable CPU topology-aware scheduling by adding annotations to the pod spec.
Do not specify `nodeName` directly on a pod when using this feature. The kube-scheduler does not participate in scheduling for pods with `nodeName` set, so topology-aware core selection does not apply. Use `nodeSelector` or node affinity instead.
General pinning policy
The general pinning policy pins exactly the number of cores specified in `resources.limits.cpu` to the pod, preferring cores within the same NUMA node.
Configuration:
- For a pod: add `cpuset-scheduler: "true"` to `metadata.annotations`.
- For a workload such as a Deployment: add `cpuset-scheduler: "true"` to `spec.template.metadata.annotations`.
- `resources.limits.cpu` must be an integer.
Example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      annotations:
        cpuset-scheduler: "true" # Enables CPU topology-aware scheduling
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/nginx_optimized:20240221-1.20.1-2.3.0
        ports:
        - containerPort: 80
        command:
        - "sleep"
        - "infinity"
        resources:
          requests:
            cpu: 4
            memory: 8Gi
          limits:
            cpu: 4 # Must be an integer
            memory: 8Gi
```
Verify pinning:

After the pod is running, verify that core pinning is active using one of the following methods.

Check the cgroup file on the node:

- cgroup v1:

  ```shell
  # Replace <POD_UID> and <CONTAINER_ID> with the actual values
  cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus
  ```

- cgroup v2:

  ```shell
  # Replace <POD_UID> and <CONTAINER_ID> with the actual values
  cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus.effective
  ```

The output shows a set of core IDs matching `limits.cpu: 4`, confirming that pinning is active:

```
0-3
```

Check the pod annotation:

```shell
# Replace <your-pod-name> with the actual pod name
kubectl get pod <your-pod-name> -n default -o yaml | grep "cpuset:"
```

Expected output:

```
cpuset: '{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{}}}}}'
```

The output fields mean:

- `"nginx"`: the container named nginx.
- `"0"`: the NUMA node ID. All pinned cores are on NUMA node 0, which prevents performance loss from cross-NUMA memory access.
- `"elems": {"0":{},"1":{},"2":{},"3":{}}`: the physical CPU core IDs the container is pinned to (cores 0–3), matching `limits.cpu: 4`.
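The `cpuset` annotation value is plain JSON, so it can also be decoded programmatically. A minimal Python sketch, using the annotation value from the expected output above:

```python
import json

# Annotation value as reported by the scheduler (container -> NUMA node -> cores)
annotation = '{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{}}}}}'

for container, numa_nodes in json.loads(annotation).items():
    for numa_id, payload in numa_nodes.items():
        cores = sorted(int(c) for c in payload["elems"])
        print(container, numa_id, cores)  # nginx 0 [0, 1, 2, 3]
```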
Automatic pinning policy
The automatic pinning policy is optimized for specific hardware. It analyzes the node's CPU topology and resource usage in real time and prioritizes binding a complete physical core cluster (such as a CCX or CCD on AMD CPUs) to the pod. The number of pinned cores may exceed the number you requested, to maximize CPU locality and concurrency.
Use this policy for large-scale AMD machine types with 32 or more cores.
Configuration:
- Add two annotations: `cpuset-scheduler: "true"` and `cpu-policy: "static-burst"`.
  - For a pod: add them to `metadata.annotations`.
  - For a workload such as a Deployment: add them to `spec.template.metadata.annotations`.
- `resources.limits.cpu` must be an integer.
Example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      annotations:
        cpuset-scheduler: "true" # Enables CPU topology-aware scheduling
        cpu-policy: "static-burst" # Enables automatic pinning and NUMA affinity
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/nginx_optimized:20240221-1.20.1-2.3.0
        ports:
        - containerPort: 80
        command:
        - "sleep"
        - "infinity"
        resources:
          requests:
            cpu: 4
            memory: 8Gi
          limits:
            cpu: 4 # Must be an integer
            memory: 8Gi
```
Verify pinning:

The automatic pinning policy analyzes the node's CPU topology and resource usage in real time, so the number of pinned cores may exceed the number you requested. Verify the result using one of the following methods.

Check the cgroup file on the node:

- cgroup v1:

  ```shell
  # Replace <POD_UID> and <CONTAINER_ID> with the actual values
  cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus
  ```

- cgroup v2:

  ```shell
  # Replace <POD_UID> and <CONTAINER_ID> with the actual values
  cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus.effective
  ```

The output shows the specific pinned cores (more than the 4 requested in this example), confirming that pinning is active:

```
0-7
```

Check the pod annotation:

```shell
# Replace <your-pod-name> with the actual pod name
kubectl get pod <your-pod-name> -n default -o yaml | grep "cpuset:"
```

Expected output:

```
cpuset: '{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{},"4":{},"5":{},"6":{},"7":{}}}}}'
```

The output fields mean:

- `"nginx"`: the container named nginx.
- `"0"`: the NUMA node ID. All pinned cores are on NUMA node 0, avoiding cross-NUMA memory access.
- `"elems"`: the physical core IDs the container is pinned to (cores 0–7). The automatic policy may pin more cores than requested to align with the physical core cluster boundary.
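To quantify how many extra cores the automatic policy granted, the annotation can be compared against the requested limit. A hypothetical Python sketch, using the annotation value from the expected output above and the `limits.cpu: 4` from the example Deployment:

```python
import json

requested_cpu = 4  # mirrors resources.limits.cpu in the example above
annotation = '{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{},"4":{},"5":{},"6":{},"7":{}}}}}'

# Flatten the per-NUMA-node "elems" maps into a sorted list of core IDs
pinned = sorted(
    int(core)
    for numa_node in json.loads(annotation)["nginx"].values()
    for core in numa_node["elems"]
)
extra = len(pinned) - requested_cpu
print(pinned, extra)  # cores 0-7 pinned, 4 more than requested
```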
Apply in production
Observability
Before and after enabling core pinning, integrate with Alibaba Cloud Prometheus monitoring. Monitor the following metrics to observe the effect of pinning on your workloads:
- Application metrics: response time (RT) and QPS.
- Node metrics: CPU usage and CPU throttling.
Phased rollout
For Deployments with multiple replicas, use canary releases or phased updates to gradually enable or disable the pinning policy. This reduces the risk of a widespread performance regression.
Disable CPU topology-aware scheduling
1. Edit the application YAML file. Remove the `cpuset-scheduler: "true"` annotation, and the `cpu-policy: "static-burst"` annotation if present, from `spec.template.metadata.annotations`.
2. Apply the modified YAML file during off-peak hours. Changes take effect after the pod restarts.
After disabling CPU pinning, the pod's processes are no longer bound to specific physical cores and may run on any available core on the node. Be aware of the following potential impacts:
- CPU usage may increase slightly due to cross-core context switching.
- For compute-intensive applications, performance jitter from CPU resource contention may return.
- When multiple high-load pods share the same core, load spikes can trigger CPU throttling.
Billing
ack-koordinator is free to install and use. Additional charges may apply in the following cases:
- ack-koordinator is a non-managed component that consumes worker node resources after installation. Configure resource requests for each module at installation time.
- If you select the Enable Prometheus monitoring for ACK-Koordinator option and use Alibaba Cloud Prometheus, the monitoring metrics are counted as custom metrics and incur fees. The cost depends on your cluster size and number of applications. Before enabling this option, review the Prometheus billing information to understand the free quota and pricing. Track your usage by querying your usage data.
What's next
- GPU topology-aware scheduling: select the optimal GPU combination on a node to maximize training speed.
- Dynamic resource overcommitment: reclaim allocated but unused cluster resources and provide them to low-priority jobs.