
Container Service for Kubernetes:CPU topology-aware scheduling

Last Updated: Dec 27, 2025

When multiple applications run on the same node, CPU resource contention and frequent context switching can cause performance jitter in sensitive applications. CPU topology-aware scheduling binds application processes exclusively to specific CPU cores (CPU pinning). This reduces performance uncertainty that is caused by core switching and memory access across NUMA nodes.

How it works

By default, Kubernetes relies on the kernel's Completely Fair Scheduler (CFS) to balance the CPU load. This scheduler distributes the load across all cores by fairly allocating CPU time slices. However, this can cause performance jitter in sensitive applications because the scheduler ignores the physical CPU topology.

The Kubernetes CPU Manager (with the static policy) can bind pods to exclusive CPU cores, but it has the following limitations:

  • Lack of scheduler awareness: The native kube-scheduler makes decisions at the node level only. It cannot perceive the CPU topology of the entire cluster. As a result, it cannot find the optimal physical core layout for a pod on a global scale.

  • Lack of topology awareness: When allocating cores on a node, the static policy is unaware of the NUMA architecture. This can lead to memory access across NUMA nodes, which introduces extra latency.

  • Lack of flexibility: This policy requires that the pod's QoS class is Guaranteed. It cannot be applied to Burstable or BestEffort pods.

To address these limitations, ACK provides enhanced CPU topology-aware scheduling that is based on the new Scheduling Framework. This feature is achieved through the collaboration between the ACK kube-scheduler and ack-koordinator.

  1. Node topology reporting: The ack-koordinator component detects the local physical CPU topology, such as sockets, NUMA nodes, and caches, in real time and reports the topology to the scheduling center.

  2. Global topology-aware scheduling: The kube-scheduler uses the global topology information to select the optimal node for a pod from the entire cluster and plans the core allocation scheme. For example, when selecting the optimal node, the scheduler finds the core with the fewest bound applications by default. This allocation scheme is then written to the pod's annotation as part of the scheduling result.

  3. Local core pinning: After the pod is scheduled to the target node, ack-koordinator reads the pod's annotation and modifies the cpuset.cpus file in the corresponding Cgroup of the pod to bind the pod to the physical core.
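For reference, the allocation scheme written in step 2 takes the form of a cpuset pod annotation. The following shows a sample value for a container pinned to cores 0 through 3 on NUMA node 0; the same format appears in the verification steps later in this topic:

```yaml
metadata:
  annotations:
    # Written by the kube-scheduler: container name -> NUMA node ID -> bound core IDs
    cpuset: '{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{}}}}}'
```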

Scenarios

  • Performance-sensitive applications: Applications that are extremely sensitive to CPU context switching latency, such as high-frequency trading and real-time data processing.

  • NUMA-sensitive applications: Applications that are deployed on multi-core and multi-socket servers, such as Elastic Bare Metal Instances (Intel and AMD). Their performance is heavily affected by memory access latency and they need to avoid memory access across NUMA nodes.

  • Deterministic computing power requirements: Applications that need stable and predictable computing power, such as scientific computing and big data analytics tasks.

  • Legacy application adaptation: Applications that are not yet adapted for cloud-native environments. For example, an application might set its thread count based on the number of physical cores on the entire machine instead of the container's specifications, which leads to poor performance.

Important

Do not enable this feature in the following scenarios:

  • CPU overcommitment environments: The exclusive resource nature of core pinning is incompatible with the resource sharing model of overcommitment. This can cause resource waste and interfere with overcommitment scheduling logic.

  • General-purpose or I/O-intensive applications: Most applications, such as web services and middleware, are not sensitive to CPU core switching and do not require this feature.

Preparations

Step 1: Deploy a sample application

This topic uses an Nginx application as an example to demonstrate how to enable CPU topology-aware scheduling and achieve CPU pinning.

  1. Create an nginx-app.yaml file.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
      namespace: default
      labels:
        app: nginx
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/nginx_optimized:20240221-1.20.1-2.3.0
            ports:
            - containerPort: 80
            command:
            - "sleep"
            - "infinity"
            resources:
              requests:
                cpu: 4
                memory: 8Gi
              limits:
                # Set the CPU value. It must be an integer.
                cpu: 4 
                memory: 8Gi
  2. Deploy the application.

    kubectl apply -f nginx-app.yaml
  3. Obtain the pod UID and container ID. These values are required to locate the pod's cgroup directory on the node in later steps.

    • Obtain the pod name.
      Run the kubectl get pods -n <your-namespace> command to view the pod name, such as nginx-deployment-6f5899*****.

    • Obtain the pod UID.

      # Replace <your-pod-name> with the actual pod name
      kubectl get pod <your-pod-name> -n default -o jsonpath='{.metadata.uid}{"\n"}'

      Expected output:

        a78a02b5-c87f-4e74-9ddd-254c163*****
    • Obtain the container ID.

      # Replace <your-pod-name> with the actual pod name
      kubectl describe pod <your-pod-name> -n default

      In the output, find the Containers field, locate the Container ID, and remove the prefix, such as containerd://.

      Expected output:

      Containers:
        nginx:
          Container ID:   containerd://b8b88a70096aabb0aea197dd2aba78d15bcbe9145198ef46a0474b31*****
  4. Log on to the node where the pod is located and check its cgroup version.

    stat -fc %T /sys/fs/cgroup/
    • If the output is tmpfs or cgroup_root, the system uses cgroup v1.

    • If the output is cgroup2fs, the system uses cgroup v2.

  5. Run the corresponding verification command based on the cgroup version to check the CPU pinning status.

    In the cgroup path, replace the hyphens (-) in the pod UID with underscores (_). For example, if the original pod UID is a78a02b5-c87f-4e74-9ddd-254c163*****, the format used in the path is a78a02b5_c87f_4e74_9ddd_254c163*****.
    • cgroup v1:

      # Replace <POD_UID> and <CONTAINER_ID> with the actual values
      cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus
    • cgroup v2:

      # Replace <POD_UID> and <CONTAINER_ID> with the actual values
      cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus.effective

    The expected output is a range that covers all cores on the node (for example, 0-31 on a 32-vCPU node), which means the container is not pinned.

    0-31
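The lookups in steps 3 through 5 can be combined into a single helper script. This is an illustrative sketch, not part of the product: it assumes a containerd runtime with the cri-containerd-*.scope naming shown above, and the default IDs are placeholders.

```shell
#!/bin/sh
# Illustrative helper: print the cpuset cgroup file path for a pod's container.
# Usage: ./cpuset-path.sh <pod-uid> <container-id>
# The default values below are placeholders, not real IDs.
POD_UID="${1:-11111111-2222-3333-4444-555555555555}"
CONTAINER_ID="${2:-0123456789abcdef}"

# The cgroup path uses underscores where the pod UID contains hyphens.
CGROUP_UID=$(echo "$POD_UID" | tr '-' '_')

# Choose the base directory and file name based on the cgroup version.
if [ "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)" = "cgroup2fs" ]; then
  BASE="/sys/fs/cgroup"                  # cgroup v2
  FILE="cpuset.cpus.effective"
else
  BASE="/sys/fs/cgroup/cpuset"           # cgroup v1
  FILE="cpuset.cpus"
fi

echo "$BASE/kubepods.slice/kubepods-pod${CGROUP_UID}.slice/cri-containerd-${CONTAINER_ID}.scope/$FILE"
```

On the node, pass the script's output to cat to view the bound cores.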

Step 2: Enable CPU topology-aware scheduling

You can enable CPU topology-aware scheduling by adding an annotation to the pod.

  • General pinning policy: A general-purpose policy that follows a 1:1 pinning principle. It binds the pod to exactly the number of cores specified by resources.limits.cpu, and it prioritizes cores within the same NUMA node to optimize memory access performance.

  • Automatic pinning policy: A policy that is optimized for specific hardware. It prioritizes binding a complete physical core cluster (such as a CCX/CCD on an AMD CPU) to maximize cache locality and improve concurrency. This policy is recommended for large AMD instance types (32 cores or more).

Important

When you enable CPU topology-aware scheduling, do not directly specify nodeName on the pod. The kube-scheduler does not participate in the scheduling process for such pods. Use fields such as nodeSelector to configure affinity policies to specify nodes for scheduling.
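For example, instead of nodeName, a nodeSelector can steer the pod to labeled nodes while leaving the final placement to the kube-scheduler. The label key and value below are hypothetical placeholders:

```yaml
spec:
  template:
    spec:
      # Hypothetical label; replace with a label that exists on your nodes.
      nodeSelector:
        node-group: cpu-pinning
```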

General pinning policy

  • Configuration:

    • Add the cpuset-scheduler: "true" annotation in the YAML file.

      • For a pod: Add the annotation to the metadata.annotations field.

      • For a workload (such as a deployment): Add the annotation to the spec.template.metadata.annotations field.

    • In the Containers field, configure the value of resources.limits.cpu (must be an integer) to define the CPU pinning scope.

  • Configuration example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
      namespace: default
      labels:
        app: nginx
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          annotations:
            # Set to true to enable CPU topology-aware scheduling
            cpuset-scheduler: "true" 
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/nginx_optimized:20240221-1.20.1-2.3.0
            ports:
            - containerPort: 80
            command:
            - "sleep"
            - "infinity"
            resources:
              requests:
                cpu: 4
                memory: 8Gi
              limits:
                # Set the CPU value. It must be an integer
                cpu: 4 
                memory: 8Gi
  • Verification:

    You can verify the pinning status in either of the following ways.

    Check the node's Cgroup file

    After the pod is running, log on to its node again and check the CPU pinning status.

    Replace the hyphens (-) in the pod UID with underscores (_).
    • cgroup v1:

      # Replace <POD_UID> and <CONTAINER_ID> with the actual values
      cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus
    • cgroup v2:

      # Replace <POD_UID> and <CONTAINER_ID> with the actual values
      cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus.effective

    The expected output is a set of core IDs that matches the limits.cpu: 4 setting. This indicates that the pinning was successful.

    0-3

    Check the pod annotation

    The kube-scheduler writes the planned core allocation to the pod's cpuset annotation as part of the scheduling result, so you can also inspect the annotation directly.

    Check the pod's cpuset information.

    # Replace <your-pod-name> with the actual pod name
    kubectl get pod <your-pod-name> -n default -o yaml | grep "cpuset:"

    Expected output:

            cpuset: '{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{}}}}}'

    Output description:

    • "nginx":{...}: Contains the configuration for the container named nginx.

    • "0":{...}: The key "0" under the container name represents the Non-Uniform Memory Access (NUMA) node ID. In this example, all bound cores are on NUMA node 0, which prevents performance degradation caused by cross-NUMA memory access.

    • "elems":{"0":{},"1":{},"2":{},"3":{}}: The keys represent the physical CPU core IDs to which the container is pinned. In this example, the container is pinned to cores 0, 1, 2, and 3, which corresponds to the limits.cpu: 4 setting.
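Because the annotation value is JSON, the bound core IDs can also be pulled out on the command line. The following is a rough sketch that matches the empty-object core keys in the sample value above; it is a text match tailored to this exact shape, not a general JSON parser:

```shell
# Sample annotation value from the output above.
CPUSET='{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{}}}}}'

# Match the core entries ("0":{}, "1":{}, ...), strip the punctuation,
# and join the IDs with commas.
echo "$CPUSET" | grep -o '"[0-9]*":{}' | tr -d '":{}' | paste -sd, -
# → 0,1,2,3
```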

Automatic pinning policy

  • Configuration:

    • Add two annotations: cpuset-scheduler: "true" and cpu-policy: "static-burst".

      • For a pod: Add the annotations to the metadata.annotations field.

      • For a workload (such as a deployment): Add the annotations to the spec.template.metadata.annotations field.

    • In the Containers field, configure the value of resources.limits.cpu (must be an integer) to define the CPU pinning scope.

  • Configuration example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
      namespace: default
      labels:
        app: nginx
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          annotations:
            # Set to true to enable CPU topology-aware scheduling
            cpuset-scheduler: "true" 
            # Set to static-burst to enable automatic pinning and NUMA affinity policy
            cpu-policy: "static-burst" 
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/nginx_optimized:20240221-1.20.1-2.3.0
            ports:
            - containerPort: 80
            command:
            - "sleep"
            - "infinity"
            resources:
              requests:
                cpu: 4
                memory: 8Gi
              limits:
                # Set the CPU value. It must be an integer
                cpu: 4
                memory: 8Gi
  • Verification:

    The automatic pinning policy analyzes the node's CPU topology and resource usage in real time. The number of pinned cores may be greater than the number of cores that you explicitly request for the pod. You can verify the pinning status in the following two ways.

    Check the node's Cgroup file

    After the pod is running, log on to its node again and check the CPU pinning status.

    Replace the hyphens (-) in the pod UID with underscores (_).
    • cgroup v1:

      # Replace <POD_UID> and <CONTAINER_ID> with the actual values
      cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus
    • cgroup v2:

      # Replace <POD_UID> and <CONTAINER_ID> with the actual values
      cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice/cri-containerd-<CONTAINER_ID>.scope/cpuset.cpus.effective

    The expected output is a set of core IDs. Note that the number of pinned cores can exceed limits.cpu because the automatic policy expands the binding based on the node's topology. In this example, the 4-core request is bound to cores 0 through 7:

    0-7

    Check the pod annotation

    The kube-scheduler writes the planned core allocation to the pod's cpuset annotation as part of the scheduling result, so you can also inspect the annotation directly.

    Check the pod's cpuset information.

    # Replace <your-pod-name> with the actual pod name
    kubectl get pod <your-pod-name> -n default -o yaml | grep "cpuset:"

    Expected output:

        cpuset: '{"nginx":{"0":{"elems":{"0":{},"1":{},"2":{},"3":{},"4":{},"5":{},"6":{},"7":{}}}}}'

    Output description:

    • "nginx":{...}: The configuration for the container named nginx.

    • "0":{...}: The key "0" under the container name represents the NUMA node ID. In this example, all bound cores are on NUMA node 0, which avoids the performance loss caused by memory access across NUMA nodes.

    • "elems":{"0":{},"1":{},"2":{},"3":{},"4":{},"5":{},"6":{},"7":{}}: The keys represent the physical CPU core IDs that are pinned. In this example, the container is pinned to cores 0 through 7.

Related operations

Disable CPU topology-aware scheduling (unpin CPU cores)

  1. Edit the application YAML file. Remove the cpuset-scheduler: "true" and cpu-policy: "static-burst" (if present) annotations from the spec.template.metadata.annotations field.

  2. Apply the modified YAML file during off-peak hours. The change triggers a rolling update and takes effect after the pods are recreated.

Important

After you unpin the CPU cores, the pod process is no longer bound to specific physical cores and may switch between all available CPU cores on the node. This may have the following impacts:

  • CPU usage may increase slightly due to context switching across cores.

  • For compute-intensive applications, performance jitter caused by CPU resource contention may reappear because cores are no longer exclusive.

  • When the processes of multiple high-load pods are scheduled to the same core, it can cause a sudden spike in the core's load, which may trigger container CPU throttling.

Recommendations for production environments

  • Observability: Before and after you enable core pinning, integrate with Alibaba Cloud Prometheus monitoring. Closely monitor key application performance metrics (such as response time (RT) and QPS) and node metrics such as CPU usage and CPU throttling to observe the performance changes that are caused by core pinning.

  • Phased rollout: For applications with multiple replicas, use canary releases or phased updates to gradually enable or disable the pinning policy. This helps control the risks that are associated with the change.

Billing

The ack-koordinator component is free to install and use, but additional fees may apply in the following scenarios:

  • ack-koordinator is a non-managed component and consumes worker node resources after installation. You can configure the resource requests for each module during installation.

  • By default, ack-koordinator exposes monitoring metrics for features such as resource profiling and fine-grained scheduling in Prometheus format. If you select the Enable Prometheus monitoring for ACK-Koordinator option and use Alibaba Cloud Prometheus, these metrics are considered custom metrics and will incur fees. The cost depends on your cluster size and the number of applications. Before you enable this feature, review the Prometheus billing information to understand the free quota and pricing for custom metrics. You can monitor and manage your resource usage by querying your usage data.
