Online workloads typically reserve CPU and memory based on peak estimates, but actual usage is often much lower. This leaves a large pool of allocated-but-idle resources that standard BestEffort pods can share — but without scheduling guarantees or fairness controls. Dynamic resource overcommitment solves both problems: the ack-koordinator component monitors node load in real time, calculates reclaimable capacity, and exposes it as Batch extended resources (kubernetes.io/batch-cpu and kubernetes.io/batch-memory) that BestEffort pods can explicitly request.
To get the most out of this feature, read Pod Quality of Service Classes and Assign Memory Resources to Containers and Pods in the Kubernetes documentation.
How it works
ack-koordinator tracks per-node load continuously and publishes reclaimable capacity as extended resources on each node. BestEffort pods declare explicit requests and limits against these Batch resources, so the ACK scheduler can make informed placement decisions and enforce resource limits through the node's cgroup hierarchy.
The following diagram illustrates why standard resource overcommitment falls short:
Without dynamic overcommitment, the scheduler has no visibility into real node load, so it may place BestEffort pods on already-overloaded nodes. There is also no way to express different resource amounts per pod, so resources cannot be distributed fairly among BestEffort pods.
ack-koordinator introduces three terms to describe reclaimed resource capacity:
| Term | Description |
|---|---|
| Reclaimed | Resources that can be dynamically overcommitted at this moment |
| Buffered | Reserved resources held back from reclamation |
| Usage | Actual resource consumption |
QoS classes and Batch resources
Kubernetes assigns each pod a quality of service (QoS) class based on its resource configuration. Batch resources are designed specifically for the BestEffort class:
| QoS class | Resource configuration | Use case |
|---|---|---|
| Guaranteed | requests == limits for all containers | Latency-sensitive production services |
| Burstable | requests < limits for at least one container | General online workloads |
| BestEffort | No requests or limits; use Batch resources instead | Batch jobs and offline tasks |
To use dynamic resource overcommitment, set koordinator.sh/qosClass: "BE" on the pod and replace standard resource fields with kubernetes.io/batch-cpu and kubernetes.io/batch-memory.
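For intuition, the QoS rules in the table above can be expressed as a small Python sketch. This is a hypothetical helper, not part of any Kubernetes library; note that extended resources such as kubernetes.io/batch-cpu are excluded from QoS classification, which is why a Batch-only pod remains BestEffort:

```python
def qos_class(containers):
    """Derive a pod's QoS class from its containers' resource configuration.

    Each container is a dict such as:
    {"requests": {"cpu": "500m"}, "limits": {"cpu": "500m"}}
    Extended resources like kubernetes.io/batch-cpu are ignored here,
    because Kubernetes excludes them from QoS classification.
    """
    STANDARD = {"cpu", "memory"}

    def std(res):
        return {k: v for k, v in res.items() if k in STANDARD}

    reqs = [std(c.get("requests", {})) for c in containers]
    lims = [std(c.get("limits", {})) for c in containers]

    # BestEffort: no standard requests or limits on any container.
    if not any(reqs) and not any(lims):
        return "BestEffort"

    # Guaranteed: every container sets cpu and memory limits, and its
    # requests (if set) equal its limits. Otherwise the pod is Burstable.
    guaranteed = all(
        set(l) == STANDARD and (not r or r == l)
        for r, l in zip(reqs, lims)
    )
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod that requests only Batch resources therefore classifies as BestEffort, which is exactly what the koordinator.sh/qosClass: "BE" label expects.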
Billing
No fee is charged to install or use the ack-koordinator component. Note the following:
- ack-koordinator is a non-managed component. After installation, it occupies worker node resources. Specify per-module resource requests at install time.
- ack-koordinator can expose Prometheus metrics for features such as resource profiling and fine-grained scheduling. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, those metrics count as custom metrics and are billed accordingly. Before enabling, review the Billing topic for Managed Service for Prometheus and read Query the amount of observable data and bills to understand how costs are calculated.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster. For more information, see Create an ACK Pro cluster.
- The ack-koordinator component installed at version 0.8.0 or later. For more information, see ack-koordinator.
Enable dynamic resource overcommitment
Enable and configure the feature by creating or updating a ConfigMap in the kube-system namespace.
Step 1: Create the ConfigMap
Create a file named configmap.yaml with the following content:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-slo-config
  namespace: kube-system
data:
  # colocation-config controls dynamic Batch resource calculation and updates.
  # Related features: dynamic resource overcommitment, load-aware scheduling.
  # Note: the value of colocation-config must be valid JSON, which does not
  # allow inline comments; each field is described in the parameter table below.
  colocation-config: |
    {
      "enable": true,
      "metricAggregateDurationSeconds": 60,
      "cpuReclaimThresholdPercent": 60,
      "memoryReclaimThresholdPercent": 70,
      "memoryCalculatePolicy": "usage"
    }
```
The cpuReclaimThresholdPercent and memoryReclaimThresholdPercent values in this example (60 and 70) are sample values. The actual default is 65 for both parameters.
The following table describes each parameter in detail:
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable | Boolean | false | Enables dynamic Batch resource updates. Setting this to false resets reclaimable resources to 0. |
| metricAggregateDurationSeconds | Int | 60 | How often (in seconds) the system aggregates node metrics to recalculate Batch resource capacity. Use the default value. |
| cpuReclaimThresholdPercent | Int | 65 | Reclaim threshold for batch-cpu resources, as a percentage of allocatable CPU. See Calculate Batch resource capacity. |
| memoryReclaimThresholdPercent | Int | 65 | Reclaim threshold for batch-memory resources, as a percentage of allocatable memory. See Calculate Batch resource capacity. |
| memoryCalculatePolicy | String | "usage" | How batch-memory capacity is calculated. "usage": includes unallocated resources and allocated-but-idle resources (based on actual usage of Guaranteed and Burstable pods). "request": includes only unallocated resources (based on memory requests of Guaranteed and Burstable pods). |
Calculate Batch resource capacity
ack-koordinator applies the following formulas to calculate the amount of Batch resources available on each node.
Usage-based calculation (default, memoryCalculatePolicy: "usage"):
nodeBatchAllocatable = nodeAllocatable × thresholdPercent − podUsage(non-BE) − systemUsage
Request-based calculation (memoryCalculatePolicy: "request", applies to batch-memory only):
nodeBatchAllocatable = nodeAllocatable × thresholdPercent − podRequest(non-BE) − systemUsage
Where:
| Variable | Description |
|---|---|
| nodeAllocatable | Total allocatable CPU or memory on the node |
| thresholdPercent | The configured reclaim threshold percentage |
| podUsage(non-BE) | Actual resource usage of Guaranteed and Burstable pods |
| podRequest(non-BE) | Sum of resource requests for Guaranteed and Burstable pods |
| systemUsage | System-level resource consumption on the node |
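To make the usage-based formula concrete, the following Python sketch reproduces the calculation with illustrative numbers (ack-koordinator performs this internally; the sketch is only for checking the arithmetic):

```python
def batch_allocatable(node_allocatable, threshold_percent, non_be_usage, system_usage):
    """nodeBatchAllocatable = nodeAllocatable * thresholdPercent - podUsage(non-BE) - systemUsage.

    Works for CPU in millicores or memory in bytes; never reports
    a negative capacity.
    """
    reclaimable = node_allocatable * threshold_percent / 100 - non_be_usage - system_usage
    return max(0, int(reclaimable))

# Example: a 100-core node (100000 millicores) with the default 65% CPU
# reclaim threshold, 30 cores used by Guaranteed/Burstable pods, and
# 5 cores of system usage:
# 100000 * 0.65 - 30000 - 5000 = 30000 millicores (30 cores) of batch-cpu.
print(batch_allocatable(100_000, 65, 30_000, 5_000))  # 30000
```

The request-based variant is identical except that podRequest(non-BE) replaces podUsage(non-BE), which is why it exposes only unallocated capacity.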
Step 2: Apply the ConfigMap
Check whether the ack-slo-config ConfigMap already exists in the kube-system namespace:
- If it exists, use kubectl patch to merge your changes without overwriting other settings:

  ```shell
  kubectl patch cm -n kube-system ack-slo-config --patch "$(cat configmap.yaml)"
  ```

- If it does not exist, create it:

  ```shell
  kubectl apply -f configmap.yaml
  ```
Apply for Batch resources
After enabling dynamic resource overcommitment, configure pods to request Batch resources.
- A pod cannot request both Batch resources and standard resources at the same time.
- For Deployments and other workload objects, set the label on template.metadata, not on the workload object itself.
- ack-koordinator dynamically adjusts available Batch capacity based on real-time node load. In rare cases, kubelet may lag in reporting node status, causing pods to fail scheduling due to insufficient resources. If this happens, delete and recreate the affected pods.
- Batch resource amounts must be integers. batch-cpu uses the millicore unit (1 core = 1000 millicores).
Step 1: Check available Batch resources on the node
```shell
# Replace $nodeName with the actual node name.
kubectl get node $nodeName -o yaml
```
Look for the status.allocatable section in the output:
```yaml
status:
  allocatable:
    # Unit: millicore. The following example shows 50 cores available.
    kubernetes.io/batch-cpu: 50000
    # Unit: bytes. The following example shows 50 GB available.
    kubernetes.io/batch-memory: 53687091200
```
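To read these raw values in familiar units, the conversion is straightforward; this short Python sketch matches the example numbers above:

```python
def batch_cpu_cores(millicores):
    """Convert kubernetes.io/batch-cpu (millicores) to cores."""
    return millicores / 1000

def batch_memory_gib(bytes_value):
    """Convert kubernetes.io/batch-memory (bytes) to GiB."""
    return bytes_value / (1024 ** 3)

print(batch_cpu_cores(50000))          # 50.0 cores
print(batch_memory_gib(53687091200))   # 50.0 GiB
```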
Step 2: Configure the pod to use Batch resources
Add the koordinator.sh/qosClass: "BE" label to the pod metadata and set kubernetes.io/batch-cpu and kubernetes.io/batch-memory in the container's resources field:
```yaml
metadata:
  labels:
    # Required: sets the pod's QoS class to BestEffort.
    koordinator.sh/qosClass: "BE"
spec:
  containers:
  - resources:
      requests:
        # Unit: millicore. "1k" = 1000 millicores = 1 core.
        kubernetes.io/batch-cpu: "1k"
        # Unit: bytes.
        kubernetes.io/batch-memory: "1Gi"
      limits:
        kubernetes.io/batch-cpu: "1k"
        kubernetes.io/batch-memory: "1Gi"
```
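Quantities like "1k" and "1Gi" follow the usual Kubernetes suffix conventions: decimal suffixes (k, M, G) scale by powers of 1000, binary suffixes (Ki, Mi, Gi) by powers of 1024. A minimal Python sketch of this subset (the real Kubernetes resource.Quantity grammar is broader):

```python
# Suffixes for the subset of Kubernetes quantity syntax used with Batch resources.
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9}

def parse_quantity(q):
    """Parse quantities like "1k" (batch-cpu millicores) or "1Gi" (bytes).

    Binary suffixes are checked first so "Gi" is not misread as "G".
    """
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # plain integer, no suffix

print(parse_quantity("1k"))   # 1000 millicores = 1 core of batch-cpu
print(parse_quantity("1Gi"))  # 1073741824 bytes of batch-memory
```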
Example
This example deploys a BestEffort test pod that uses Batch resources and verifies that the resource limits are enforced in the node's cgroup.
1. Check available Batch resources on the node:

   ```shell
   kubectl get node $nodeName -o yaml
   ```

   Expected output:

   ```yaml
   status:
     allocatable:
       kubernetes.io/batch-cpu: 50000
       kubernetes.io/batch-memory: 53687091200
   ```

2. Create a file named be-pod-demo.yaml:

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     labels:
       koordinator.sh/qosClass: "BE"
     name: be-demo
   spec:
     containers:
     - command:
       - "sleep"
       - "100h"
       image: registry-cn-beijing.ack.aliyuncs.com/acs/stress:v1.0.4
       imagePullPolicy: Always
       name: be-demo
       resources:
         limits:
           kubernetes.io/batch-cpu: "50k"
           kubernetes.io/batch-memory: "10Gi"
         requests:
           kubernetes.io/batch-cpu: "50k"
           kubernetes.io/batch-memory: "10Gi"
     schedulerName: default-scheduler
   ```

3. Deploy the pod:

   ```shell
   kubectl apply -f be-pod-demo.yaml
   ```

4. Verify that the resource limits are reflected in the node's cgroup. Check the CPU limit:

   ```shell
   cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/cpu.cfs_quota_us
   ```

   Expected output (50 cores):

   ```
   5000000
   ```

   Check the memory limit:

   ```shell
   cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/memory.limit_in_bytes
   ```

   Expected output (10 GB):

   ```
   10737418240
   ```
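The expected cgroup values follow directly from the pod's Batch limits. This Python sketch reproduces the arithmetic, assuming the default CFS period of 100 ms (which is what the 5000000 value implies):

```python
CFS_PERIOD_US = 100_000  # default cpu.cfs_period_us (100 ms)

def expected_cfs_quota_us(batch_cpu_millicores):
    """cpu.cfs_quota_us for a batch-cpu limit: cores * CFS period."""
    return batch_cpu_millicores * CFS_PERIOD_US // 1000

def expected_memory_limit_bytes(batch_memory_gib):
    """memory.limit_in_bytes for a batch-memory limit given in GiB."""
    return batch_memory_gib * 1024 ** 3

print(expected_cfs_quota_us(50_000))     # 5000000 for "50k" batch-cpu
print(expected_memory_limit_bytes(10))   # 10737418240 for "10Gi" batch-memory
```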
Monitor Batch resource usage
ACK clusters integrate with Managed Service for Prometheus. To view Batch resource usage:
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of the target cluster. In the left-side pane, choose Operations > Prometheus Monitoring.
3. Click the Others tab, then click the k8s-reclaimed-resource tab. This dashboard shows the colocation benefits of the cluster and Batch resource capacity at the cluster, node, and pod levels. For more information, see Enable the colocation monitoring feature.
If you have built a custom Prometheus dashboard, use the following metrics to query Batch resource data:
```
# Allocatable batch-cpu on the node
koordlet_node_resource_allocatable{resource="kubernetes.io/batch-cpu",node="$node"}
# batch-cpu already allocated on the node
koordlet_container_resource_requests{resource="kubernetes.io/batch-cpu",node="$node"}
# Allocatable batch-memory on the node
kube_node_status_allocatable{resource="kubernetes.io/batch-memory",node="$node"}
# batch-memory already allocated on the node
koordlet_container_resource_requests{resource="kubernetes.io/batch-memory",node="$node"}
```
FAQ
After upgrading from ack-slo-manager to ack-koordinator, does the old overcommitment configuration still work?
Yes. ack-koordinator is backward compatible with the earlier ack-slo-manager protocol. The ACK Pro cluster scheduler can calculate requested and available resources using both the old and new protocol formats simultaneously, so you can upgrade without reconfiguring existing workloads.
The earlier protocol uses:
- The alibabacloud.com/qosClass pod annotation
- The alibabacloud.com/reclaimed field for resource requests and limits
ack-koordinator supports these through protocol versions dated no later than July 30, 2023. Migrate existing workloads to the koordinator.sh protocol when convenient.
The following table shows compatibility across component versions:
| Scheduler version | ack-koordinator | alibabacloud.com protocol | koordinator.sh protocol |
|---|---|---|---|
| ≥1.18 and <1.22.15-ack-2.0 | ≥0.3.0 | Supported | Not supported |
| ≥1.22.15-ack-2.0 | ≥0.8.0 | Supported | Supported |
Why does memory usage spike right after the pod starts?
Symptom: Memory usage jumps immediately after a container starts, exceeding the expected kubernetes.io/batch-memory limit.
Cause: When a container is created, ack-koordinator sets the cgroup memory limit based on kubernetes.io/batch-memory. Some applications read the cgroup limit at startup to determine how much memory to allocate internally. If the application reads the cgroup before ack-koordinator has written the limit, it may allocate more memory than intended. The operating system does not immediately reclaim that memory, so usage stays elevated until it naturally drops below the configured limit.
Check: Run the following command inside the container to confirm the memory limit is set correctly:
```shell
# Unit: bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
```
Expected output example:
```
1048576000
```
Fix: Configure the application's memory limit in its startup script before the main process begins. This ensures the limit is in place before the application reads the cgroup.
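As an illustration of the race, an application that sizes its internal memory from the cgroup limit might do something like the following hypothetical Python sketch (real applications often do the equivalent via JVM or runtime flags; the path is the cgroup v1 path used in this topic, and cgroup v2 exposes memory.max instead):

```python
# Hypothetical startup helper: sizes an application's internal heap from
# the container's cgroup memory limit.
CGROUP_MEM_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"

def startup_heap_bytes(fallback=512 * 1024 ** 2, fraction=0.8):
    """Return a heap size derived from the cgroup memory limit.

    If this runs before ack-koordinator has written the batch-memory
    limit, the file still holds the much larger node-level value and the
    application over-allocates, producing the spike described above.
    """
    try:
        with open(CGROUP_MEM_LIMIT) as f:
            limit = int(f.read().strip())
    except (OSError, ValueError):
        return fallback
    # Very large values mean "effectively unlimited"; fall back.
    if limit > 1 << 60:
        return fallback
    return int(limit * fraction)
```

Setting the limit explicitly in the startup script, as recommended above, sidesteps this read-before-write ordering entirely.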
Why does a BestEffort pod stay in Pending state?
Symptom: A pod configured with Batch resources remains in Pending state and cannot be scheduled.
Check: Run kubectl describe pod <pod-name> and look for scheduling failure events.
Common causes and fixes:
| Cause | Fix |
|---|---|
| Insufficient Batch resources on all nodes | Run kubectl get node <node> -o yaml and check status.allocatable for batch-cpu and batch-memory. Reduce pod requests or wait for resources to be reclaimed. |
| kubelet has not yet synchronized node status | Delete and recreate the pod. ack-koordinator dynamically adjusts Batch capacity, and kubelet may lag in reporting the updated allocatable resources. |
| Pod is requesting both Batch and standard resources | A pod cannot request Batch resources and standard resources at the same time. Remove one set of resource fields. |
What's next
ack-koordinator provides additional controls to protect online workloads from interference caused by BestEffort pods. See the following topics: