This page answers common questions about scheduling in ACK clusters, covering IP-aware scheduling, load-aware scheduling, Quality of Service (QoS), descheduling, and general troubleshooting.
General FAQ
How do I prevent Pod startup failures due to insufficient IP addresses in a virtual switch?
The native Kubernetes scheduler has no visibility into the number of available IP addresses on a Node. Pods can be scheduled to Nodes that have already exhausted their IP resources, causing those Pods to fail at startup—often producing a large number of abnormal Pods in a short period.
ACK Scheduler resolves this with IP-aware scheduling. It reads the k8s.aliyun.com/max-available-ip annotation on each Node, which marks the maximum number of available IP addresses, and uses this value to cap the number of Pods that require an independent IP address on that Node. When a Node runs out of IP resources, ACK Scheduler sets a SufficientIP condition on the Node's status, which blocks any new Pods that require an independent IP from being scheduled there.
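These moving parts are visible on the Node object itself. The sketch below is illustrative only: the annotation key comes from this page, but the condition layout and values are assumptions, so verify them with `kubectl describe node` in your own cluster.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: cn-hangzhou.192.168.0.1             # illustrative Node name
  annotations:
    k8s.aliyun.com/max-available-ip: "64"   # maximum available IP addresses on this Node
status:
  conditions:
  - type: SufficientIP
    status: "False"   # "False" once IP resources are exhausted; blocks new IP-consuming Pods
```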
This feature is automatically enabled by the kube-scheduler add-on. Your cluster must meet these requirements:
The cluster is an ACK managed cluster Pro Edition, and the network plug-in is Terway v1.5.7 or later. See Create an ACK managed cluster.
The kube-scheduler version meets the following requirements:
| Cluster version | kube-scheduler version |
|---|---|
| 1.30 and later | All versions |
| 1.28 | v1.28.3-aliyun-6.3 and later |
| 1.26 | v1.26.3-aliyun-6.3 and later |
| 1.24 | v1.24.6-aliyun-6.3 and later |
| 1.22 | v1.22.15-aliyun-6.3 and later |
What is the default scheduling policy of ACK Scheduler?
ACK Scheduler follows the same default policy as the community Kubernetes scheduler. Scheduling a Pod involves two key steps:
[Filter](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#filter): Eliminates Nodes where the Pod cannot run. If no Nodes pass filtering, the Pod remains unschedulable.
[Score](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#scoring): Ranks the remaining Nodes and selects the best fit for the Pod.
For the full list of Filter and Score plug-ins enabled in the latest ACK Scheduler version, see kube-scheduler.
How do I avoid scheduling Pods to Nodes with high resource utilization?
The native Kubernetes scheduler makes decisions based on resource *requests*, not actual utilization. As a result, some Nodes can become overloaded while others sit idle. ACK provides three approaches to address this:
Set accurate resource requests and limits. Use the resource profiling feature to get container specification recommendations based on historical usage data. See Resource profiling.
Enable load-aware scheduling. ACK Scheduler can factor in actual Node load when making scheduling decisions. It analyzes historical load data, estimates the resource requirements for incoming Pods, and preferentially places them on Nodes with lower utilization. See Load-aware scheduling.
Enable load-aware hotspot descheduling. Node utilization shifts over time as traffic and workloads change. ACK's descheduling capability detects hot spots and evicts Pods from overloaded Nodes to restore balance. See Using load-aware hotspot descheduling.
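Accurate requests and limits underpin all three approaches. A minimal example of the first one follows; the values are illustrative, not recommendations, so derive real numbers from resource profiling.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: nginx:1.25
    resources:
      requests:          # what the scheduler reserves on the Node
        cpu: 500m
        memory: 512Mi
      limits:            # hard runtime cap enforced by the kubelet
        cpu: "1"
        memory: 1Gi
```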
A new Node is added to the cluster. Why aren't Pods scheduled to it?
Check the following in order:
Node status: If the Node is in the `NotReady` state, the scheduler will not place Pods on it. Wait for the Node to become `Ready`.
Scheduling constraints on the Pod: Check whether the Pod has a `nodeSelector`, `nodeAffinity`, or `podAffinity` rule, or whether the Node has taints that the Pod doesn't tolerate. These constraints can prevent the Pod from landing on the new Node even when it appears healthy.
Resource request imbalance: The native Kubernetes scheduler places Pods based on resource requests, not actual utilization. If existing Nodes already have sufficient free *requested* capacity, the scheduler may not use the new Node. See How do I avoid scheduling Pods to Nodes with high resource utilization? for solutions.
Why does scheduling fail due to insufficient CPU or memory when overall cluster utilization is not high?
The scheduler evaluates requested resources, not actual consumption. A Node may have low actual CPU usage but still appear "full" from the scheduler's perspective if its Pods have high resource requests. Even with low cluster-wide utilization, scheduling can fail if no single Node has enough *requested* capacity to fit the new Pod.
See How do I avoid scheduling Pods to Nodes with high resource utilization? for solutions.
What do I need to know before using descheduling in ACK? Does it restart Pods?
ACK provides descheduling through Koordinator Descheduler. Two things to keep in mind:
Descheduling only evicts Pods. Koordinator Descheduler evicts running Pods but does not recreate them. Recreation is handled by the workload controller (Deployment, StatefulSet, and so on), and scheduling of the new Pod is handled by the scheduler.
Maintain enough replicas. During descheduling, the old Pod is evicted before the new one is created. Make sure your workload has enough replicas (`replicas`) to remain available during the eviction window.
See Descheduling.
How do I schedule an application to a specific Node?
Add a label to the target Node and add a matching nodeSelector to your application's Pod spec. See Schedule an application to a specific node.
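A minimal sketch of the pattern follows; the label key `workload-type` and its value are hypothetical, so substitute your own.

```yaml
# First label the Node, for example:
#   kubectl label nodes <node-name> workload-type=frontend
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  nodeSelector:
    workload-type: frontend   # must match the Node label exactly
  containers:
  - name: app
    image: nginx:1.25
```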
In a Deployment, how can I schedule a specific number of Pods to ECS and a specific number to ECI?
Use UnitedDeployment to define separate subsets for ECS and ECI resources. For example, set replicas: 10 in the ECS subset and replicas: 10 in the ECI subset within the UnitedDeployment spec. See Scale workloads based on UnitedDeployment.
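A hedged sketch of the subset layout is shown below. The field names follow the OpenKruise UnitedDeployment API, and the `type: virtual-kubelet` label is commonly used to identify ECI virtual nodes; verify both against your cluster and add-on version.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: demo
spec:
  replicas: 20
  selector:
    matchLabels:
      app: demo
  template:
    deploymentTemplate:
      metadata:
        labels:
          app: demo
      spec:
        selector:
          matchLabels:
            app: demo
        template:
          metadata:
            labels:
              app: demo
          spec:
            containers:
            - name: app
              image: nginx:1.25
  topology:
    subsets:
    - name: ecs
      replicas: 10                 # Pods pinned to ECS Nodes
      nodeSelectorTerm:
        matchExpressions:
        - key: type
          operator: NotIn
          values: [virtual-kubelet]
    - name: eci
      replicas: 10                 # Pods pinned to ECI virtual nodes
      nodeSelectorTerm:
        matchExpressions:
        - key: type
          operator: In
          values: [virtual-kubelet]
```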
How can I ensure high availability for a workload's Pods during scheduling?
Use podAntiAffinity to spread Pods across different zones or Nodes. For example, the following configuration tries to place a Pod in a different zone from any Pods labeled security=S2. Because this is a preferred (soft) rule, the scheduler still places the Pod even when the constraint cannot be satisfied.
```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
```
For additional options, see Pod affinity and anti-affinity and Pod topology spread constraints in the Kubernetes documentation.
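Topology spread constraints express the same intent more directly. A minimal sketch: `maxSkew: 1` with `whenUnsatisfiable: ScheduleAnyway` keeps the spread best-effort, mirroring the preferred anti-affinity rule above.

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1                            # allow at most 1 Pod of imbalance across zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway     # prefer, don't require, the spread
    labelSelector:
      matchLabels:
        security: S2
```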
How do I migrate ack-descheduler to Koordinator Descheduler?
The ack-descheduler add-on available in the ACK marketplace is based on open source Kubernetes Descheduler (versions 0.20.0 and 0.27.1 are available and work the same as their corresponding open source versions). This add-on is no longer maintained. Migrate to Koordinator Descheduler, which is actively maintained. The migration process follows the same steps as migrating Kubernetes Descheduler to Koordinator Descheduler. See Migrate Kubernetes Descheduler to Koordinator Descheduler and the migration notice at \[Notice\] ack-descheduler migration.
Load-aware scheduling FAQ
Why aren't all Pods in a batch scheduled to the Node with the lowest load?
Placing all new Pods on the currently lowest-load Node would create a new hot spot as those Pods start consuming resources. To prevent this, the load-aware scheduling plug-in adjusts a Node's score when it already has newly scheduled Pods whose utilization hasn't been reported yet. This prevents over-concentration and balances the load more evenly across Nodes.
Besides Node load, what other factors affect scheduling results?
The Kubernetes scheduler uses multiple plug-ins that all contribute to the final Node ranking. Affinity rules, topology spread constraints, and other plug-ins each add or subtract from a Node's score. The final placement reflects the combined weights of all active plug-ins. Adjust the scoring weights of each plug-in based on your requirements.
After upgrading the scheduler, is the load-aware scheduling feature using the old protocol still supported?
The old protocol requires the alibabacloud.com/loadAwareScheduleEnabled: "true" annotation on Pods. ACK Scheduler is backward-compatible with this protocol, so upgrading the scheduler does not break existing Pods. After upgrading, switch to the global load-aware scheduling policy to avoid per-Pod annotation management.
ACK Scheduler 1.22 remains compatible with the old protocol. For version 1.24, backward compatibility ended on August 30, 2023. Upgrade your cluster and use the current configuration method. See Manually upgrade a cluster.
The following tables summarize protocol support by cluster version.
1.26 and later
| ACK Scheduler version | ack-koordinator version | Pod annotation protocol | Console switch |
|---|---|---|---|
| All versions | 1.1.1-ack.1 or later | Not supported | Supported |
1.24
| ACK Scheduler version | ack-koordinator version | Pod annotation protocol | Console switch |
|---|---|---|---|
| v1.24.6-ack-4.0 or later | 1.1.1-ack.1 or later | Supported | Supported |
| v1.24.6-ack-3.1 or later, earlier than v1.24.6-ack-4.0 | 0.8.0 or later | Supported | Not supported |
1.22 and earlier
| ACK Scheduler version | ack-koordinator version | Pod annotation protocol | Console switch |
|---|---|---|---|
| 1.22.15-ack-4.0 or later | 1.1.1-ack.1 or later | Supported | Supported |
| 1.22.15-ack-2.0 or later, earlier than 1.22.15-ack-4.0 | 0.8.0 or later | Supported | Not supported |
| v1.20.4-ack-4.0 to v1.20.4-ack-8.0; v1.18-ack-4.0 | 0.3.0 or later, earlier than 0.8.0 | Supported | Not supported |
QoS FAQ
After enabling the CPU Burst configuration, why are my Pods still throttled?
Several conditions can cause CPU throttling to persist:
Incorrect configuration format. If the CPU Burst policy is not formatted correctly, it won't take effect. Verify the configuration format. See Advanced parameter configuration.
CPU utilization at the cap. If actual CPU usage hits the `cfsQuotaBurstPercent` limit, throttling still occurs because there aren't enough physical CPU resources. Adjust the request and limit values to match your application's actual needs.
Latency in `cpu.cfs_quota_us` adjustment. CPU Burst modifies two cgroup parameters: `cpu.cfs_quota_us` and `cpu.cfs_burst_us`. The `cpu.cfs_quota_us` parameter is only updated after ack-koordinator detects throttling, which introduces a small delay. The `cpu.cfs_burst_us` parameter is set immediately based on configuration. For best results, use this feature with Alibaba Cloud Linux.
Node safety threshold triggered. CPU Burst includes a protection mechanism that resets `cpu.cfs_quota_us` to its baseline if overall Node utilization exceeds the `sharePoolThresholdPercent` threshold. Set the Node's safety threshold based on your application's requirements to avoid this.
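When tuning these parameters per Pod, the Koordinator protocol exposes them through a Pod annotation. The sketch below is an assumption about the exact JSON keys; confirm them against the Advanced parameter configuration page for your ack-koordinator version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burst-demo
  annotations:
    # Hypothetical per-Pod CPU Burst policy; verify key names for your version.
    koordinator.sh/cpuBurst: '{"policy": "auto", "cfsQuotaBurstPercent": 300}'
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "2"
```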
Is Alibaba Cloud Linux required for the CPU Burst policy?
No. The ack-koordinator CPU Burst policy works on all Alibaba Cloud Linux and CentOS open source kernels. However, Alibaba Cloud Linux is recommended because ack-koordinator can use kernel-level features to provide finer-grained CPU elasticity management. See Enable the CPU Burst feature on the cgroup v1 interface.
After an application uses Batch resources, why does its memory usage suddenly increase?
For containers configured with a Batch memory limit (kubernetes.io/batch-memory), ack-koordinator sets the cgroup memory limit after the container starts. If the application reads and acts on the cgroup limit at startup—before ack-koordinator has applied it—it may allocate more memory than the intended Batch limit. The operating system does not immediately reclaim that memory, so the cgroup limit can't be enforced until actual usage drops below the limit on its own.
To resolve this, either tune the application to keep memory usage below the Batch limit, or check the application's startup script to make sure the memory limit parameter is set before the application initializes.
Run the following command inside the container to view the current memory limit (in bytes):
```shell
# The unit is bytes.
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# Expected output:
1048576000
```
After increasing the number of CPU cores, why does performance decrease and CPU throttling increase?
Symptoms
An application running with 8 CPU cores shows 33 QPS (queries per second) and a 29 ms average response time. After increasing to 12 cores, QPS drops to 9.6 and average response time rises to 104 ms. CPU throttling for the 12-core Pod is nearly 100%, while the 8-core Pod showed only 0.15%. CPU topology-aware scheduling and core binding did not resolve the issue. The application performs normally when deployed directly on an ECS instance.
Cause
This is a known bug in the Linux kernel's Completely Fair Scheduler (CFS) affecting kernel versions earlier than 4.19 (such as the 3.10 kernel in CentOS 7). In each CFS scheduling period, each CPU core reserves 1 ms of quota. If unused within the period, this quota is not immediately reclaimed. The more cores allocated to a Pod, the larger the cumulative quota loss. For a Pod with *n* cores, up to *n–1* ms of CPU time per 100 ms scheduling period is unavailable to the workload. This causes apparent CPU throttling even when the Pod is not at its CPU limit, increasing latency and degrading throughput. The effect shows up in monitoring as a high ratio of container_cpu_cfs_throttled_periods_total to container_cpu_cfs_periods_total.
Solutions
The fundamental fix is upgrading the kernel. The optimization methods below reduce the impact but do not eliminate the root cause.
Method 1 (recommended): Upgrade the OS kernel
Upgrade to a kernel version of 4.19 or later, such as Alibaba Cloud Linux 3 Container-Optimized Edition, Alibaba Cloud Linux 3, or ContainerOS. See also the upstream fix: Linux kernel fix commit.
Method 2: Use the CPU Burst feature
Use ack-koordinator's CPU Burst feature to reserve idle CPU quota for burst use, partially offsetting the performance impact of throttling.
Method 3: Optimize the CPU scheduling policy
Use ack-koordinator's CPU topology-aware scheduling to pin CPU cores and improve scheduling stability. Alternatively, reduce the number of cores allocated to the Pod to lower the per-period quota loss.
ack-koordinator migration FAQ
Will the dynamic resource overcommitment feature of the legacy ack-slo-manager protocol be supported after upgrading to ack-koordinator?
Yes. ack-koordinator is compatible with the old protocol. The old protocol uses two components in the Pod spec:
The `alibabacloud.com/qosClass` annotation
The `alibabacloud.com/reclaimed` resource in requests and limits
ack-koordinator recognizes both the old and new protocols and calculates resource requests and availability uniformly across both. Upgrade your add-on without needing to update existing Pod configurations immediately.
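For reference, a Pod that has been migrated to the koordinator.sh protocol for dynamic resource overcommitment might look like the sketch below. The BE QoS label and batch resource names follow common Koordinator usage (this page itself references `kubernetes.io/batch-memory`), but treat the exact values as assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-demo
  labels:
    koordinator.sh/qosClass: BE          # best-effort colocation QoS class
spec:
  containers:
  - name: worker
    image: nginx:1.25
    resources:
      requests:
        kubernetes.io/batch-cpu: "1000"  # milli-cores of reclaimed CPU
        kubernetes.io/batch-memory: 1Gi
      limits:
        kubernetes.io/batch-cpu: "1000"
        kubernetes.io/batch-memory: 1Gi
```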
Support for the old protocol ended on July 30, 2023. Update the resource parameters to the latest version as soon as possible.
The following table shows protocol support by scheduler and ack-koordinator version:
| Scheduler version | ack-koordinator version | alibabacloud.com protocol | koordinator.sh protocol |
|---|---|---|---|
| 1.18 or later, earlier than 1.22.15-ack-2.0 | 0.3.0 or later | Supported | Not supported |
| 1.22.15-ack-2.0 or later | 0.8.0 or later | Supported | Supported |
Will the CPU Burst feature of the legacy ack-slo-manager protocol be supported after upgrading to ack-koordinator?
Yes. The old protocol requires the alibabacloud.com/cpuBurst annotation on Pods. ack-koordinator is fully compatible with this annotation and handles the upgrade seamlessly.
Support for the old protocol ended on July 30, 2023. Update to the current protocol as soon as possible.
| ack-koordinator version | alibabacloud.com protocol | koordinator.sh protocol |
|---|---|---|
| 0.2.0 or later | Supported | Not supported |
| 0.8.0 or later | Supported | Supported |
Will the CPU QoS feature of the legacy ack-slo-manager protocol be supported after upgrading to ack-koordinator?
Yes. The old protocol (version 0.8.0 and earlier) enables CPU QoS by adding the alibabacloud.com/qosClass annotation to Pods. ack-koordinator remains compatible and supports gradual migration to the koordinator.sh protocol.
Backward compatibility ended on July 30, 2023. Migrate your Pods to the new protocol promptly.
| ack-koordinator version | alibabacloud.com protocol | koordinator.sh protocol |
|---|---|---|
| 0.5.2 or later, earlier than 0.8.0 | Supported | Not supported |
| 0.8.0 or later | Supported | Supported |
Will the container memory QoS feature be supported after upgrading from the legacy ack-slo-manager protocol to ack-koordinator?
Yes. The old protocol (version 0.8.0 and earlier) uses the alibabacloud.com/qosClass and alibabacloud.com/memoryQOS annotations. ack-koordinator is backward-compatible with both.
Backward compatibility ended on July 30, 2023. Migrate to the current protocol as soon as possible.
| ack-koordinator version | alibabacloud.com protocol | koordinator.sh protocol |
|---|---|---|
| 0.3.0 or later, earlier than 0.8.0 | Supported | Not supported |
| 0.8.0 or later | Supported | Supported |
Descheduling FAQ
Node utilization has reached the threshold, but Pods are not being evicted. What should I do?
| Cause | Solution |
|---|---|
| Descheduling scope not configured | The descheduler applies only to namespaces and Nodes within its configured scope. Check whether descheduling is enabled for the relevant namespaces and Nodes. |
| Descheduler not restarted after configuration change | Changes to the descheduler configuration take effect only after a restart. See Step 2: Enable the descheduling plug-in. |
| Average utilization below threshold | The descheduler measures average utilization over a window, not the instantaneous value. Descheduling triggers only if the average stays above the threshold for the configured duration (default: 10 minutes). The value from kubectl top node reflects only the last minute. Monitor utilization over a longer window and adjust hotspot detection settings if needed. |
| Insufficient capacity on other Nodes | Before evicting a Pod, the descheduler checks whether another Node can accommodate it. If no Node has enough free capacity (for example, no Node has 8 cores and 16 GiB free for an 8-core/16-GiB Pod), the Pod is not evicted. Add Nodes to create capacity. |
| Single-replica workload | Single-replica Pods are not evicted by default to protect availability. To allow eviction, add the descheduler.alpha.kubernetes.io/evict: "true" annotation to the Pod's TemplateSpec. This annotation is not supported in versions v1.3.0-ack1.6, v1.3.0-ack1.7, or v1.3.0-ack1.8. Upgrade the add-on to the latest version. See Install and manage add-ons. |
| Pod uses HostPath or EmptyDir | Pods using HostPath or EmptyDir are not evicted by default. To allow migration, enable evictLocalStoragePods. See Eviction and migration control configurations. |
| Too many replicas already unavailable or migrating | If the number of unavailable or migrating replicas for a workload exceeds maxUnavailablePerWorkload or maxMigratingPerWorkload, no further evictions occur. Wait for in-progress evictions to complete, or increase these limits. |
| Replica count at or below the migration limit | If a workload's total replica count is less than or equal to maxMigratingPerWorkload or maxUnavailablePerWorkload, the descheduler skips it. Decrease these values or switch them to a percentage. |
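For the single-replica case in the table above, the opt-in annotation goes on the Pod template, not on the workload object itself. A minimal sketch:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: single-replica-app
  template:
    metadata:
      labels:
        app: single-replica-app
      annotations:
        descheduler.alpha.kubernetes.io/evict: "true"   # allow eviction of this single replica
    spec:
      containers:
      - name: app
        image: nginx:1.25
```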
Why does the descheduler restart frequently?
This usually means the descheduler's ConfigMap is missing or has a formatting error. Check the ConfigMap content and correct any issues. See Advanced configuration parameters. After fixing the ConfigMap, restart the descheduler.
How do I use load-aware scheduling and hotspot descheduling together?
Enable both features and align their thresholds. When a Node's load exceeds the descheduler's highThresholds value, Pods are evicted from it. Set the loadAwareThreshold parameter in load-aware scheduling to the same value as highThresholds. This prevents evicted Pods from being immediately rescheduled back to the same overloaded Node—especially important in clusters with a small number of Nodes at similar utilization levels.
What utilization algorithm does the descheduler use?
The descheduler averages resource utilization over a rolling window and triggers descheduling only when the average remains above the threshold for a sustained period (default: approximately 10 minutes). For memory, the calculation excludes page cache, since the operating system can reclaim it. By contrast, kubectl top node includes page cache in its memory figures. To view the descheduler's actual memory baseline, use Alibaba Cloud Prometheus.
Others
During a stress test with wrk, the result shows "Socket errors: connect 54,". What should I do?
This error means the wrk client has run out of available TCP connections. Fix it by enabling TCP connection reuse on the stress testing machine.
Check whether TCP connection reuse is enabled:
```shell
sudo sysctl -n net.ipv4.tcp_tw_reuse
```

An output of `0` or `2` means TCP connection reuse is not fully enabled.

Enable TCP connection reuse:

```shell
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
```

Rerun the wrk test. The `Socket errors: connect 54, ...` message should no longer appear.

These commands run only on the stress testing machine; no changes are needed on the target machine. After the test, restore the original setting with `sysctl -w net.ipv4.tcp_tw_reuse=<original_value>`.
Why is no data displayed in the cluster colocation benefits section on the k8s-reclaimed-resource tab?
Verify that ack-koordinator is installed:
Log on to the ACK console. In the left navigation pane, click Clusters.
Click the cluster name. In the left navigation pane, choose Applications > Helm.
On the Helm page, check whether ack-koordinator is listed. If not, install it first. See Install and manage add-ons.
Check whether the colocation dashboard displays data. If not, verify that the `kube_node_labels` metric is being collected:
Log on to the ARMS console.
In the left navigation pane, choose Managed Service for Prometheus > Instances.
Select the region, click the Prometheus instance name, and then click Metric Management.
Click the Metrics button, search for `kube_node_labels`, and verify that the metric has data.
Can I use Arm-based spot instances?
Yes. See Use spot instances.
What are the limitations of using Arm-based Nodes in an ACK cluster?
On the Add-ons page, only the following add-on categories support the Arm architecture:
Core components
Logging and monitoring
Storage
Network
Add-ons from the marketplace do not support the Arm architecture.