Large-scale clusters of the ACK managed cluster Pro type—typically those with more than 500 nodes or 10,000 pods—place unique demands on the control plane. Cluster performance and availability depend on resource count, access frequency, and access patterns. The true signal of scale is the API Server's request success rate and latency: if those degrade, the control plane is under pressure regardless of node count.
This topic provides recommendations for planning and operating large-scale clusters. Adjust them based on your specific environment and business requirements.
Under the shared responsibility model, ACK manages the security of the cluster control plane components—including Kubernetes control plane components and etcd—and the underlying Alibaba Cloud infrastructure. You are responsible for securing your business applications and configuring your cloud resources. For details, see Shared responsibility model.
Single cluster vs. multiple clusters
A single large-scale cluster reduces management overhead and improves resource utilization. In some business scenarios, however, splitting services across multiple clusters makes more sense.
Consider multiple clusters when:
| Consideration | When to split |
|---|---|
| Isolation | Prevent issues in one environment (for example, testing) from affecting production. Splitting clusters reduces the blast radius of failures. |
| Geographic distribution | Deploy clusters in specific regions to meet availability and latency requirements for end users. |
| Single-cluster size limits | The Kubernetes architecture has inherent performance limits. Before planning a large-scale cluster, review the community's capacity limits and SLOs, then check your quota at the Quota Center. If your requirements exceed community or ACK limits, split into multiple clusters. |
To manage multiple clusters for tasks such as application deployment, traffic management, job distribution, and monitoring, enable fleet management.
Keep clusters on the latest version
Newer Kubernetes versions include stability, performance, and scalability improvements that directly benefit large-scale clusters. Notable examples:
- v1.31: kube-apiserver serves consistent reads for List requests from its cache, reducing direct etcd access and lowering etcd load. See Consistent reads from cache.
- v1.33: kube-apiserver uses streaming encoding (StreamingCollectionEncodingToJSON and StreamingCollectionEncodingToProtobuf) for List operations, reducing kube-apiserver memory usage at scale. See Streaming List responses.
ACK releases supported Kubernetes versions in sync with the community and discontinues support for expired versions—including stopping new feature releases, bug fixes, and security patches—while providing only limited technical support for expired versions. Monitor version release announcements through the console and upgrade promptly.
- Supported versions: Version guide
- Upgrade considerations and procedures: Upgrade clusters
- Manual upgrade: Manually upgrade an ACK cluster
- Automatic upgrade: Automatically upgrade a cluster
Monitor cluster resource limits
Stay within the following limits to maintain availability and performance in large-scale clusters.
| Resource | Limit | Action |
|---|---|---|
| etcd database size (DB Size) | Keep below 8 GB | **Important**: If etcd exceeds its size limit, it raises a NOSPACE alarm and stops accepting writes until the alarm is cleared. Keep object counts and sizes under control and clean up unused resources promptly. |
| Total data per resource type in etcd | Keep below 800 MB per type | When defining a new CustomResourceDefinition (CRD), estimate the final number of custom resources (CRs) in advance. When using Helm, consider switching to Helm's SQL storage backend if secrets-based release storage approaches the Kubernetes secrets size limit. |
| API Server CLB connections and bandwidth | Maximum bandwidth: 5,120 Mbps; see CLB instances for connection limits | For clusters with 1,000 or more nodes, use pay-by-usage Classic Load Balancer (CLB) instances. Clusters created after February 2023 with Kubernetes 1.20 or later use ENI (Elastic Network Interface) direct connection by default for improved bandwidth. See Access the API server using an internal endpoint. |
| Services per namespace | Keep below 5,000 | The kubelet injects service information as environment variables into pods. Too many services per namespace causes slow pod startup or startup failures. Set enableServiceLinks: false in the podSpec to opt out. See Accessing the service. |
| Total services in the cluster | Keep below 10,000 total; 500 for LoadBalancer-type services | Excess services increase the network rules kube-proxy processes, degrading its performance. For LoadBalancer-type services, the sync delay to the CLB can reach minutes when counts are high. |
| Backend pods per service endpoint | Keep below 3,000 | Use EndpointSlices instead of Endpoints in large-scale clusters—EndpointSlices split endpoints into smaller chunks, reducing data transmitted per change. If you use a custom controller that reads Endpoints directly, keep the count below 1,000 per Endpoints object; above this, the object is automatically truncated. See Over-capacity endpoints. |
| Total endpoints across all services | Keep below 64,000 | Excess endpoints overload the API Server and degrade network performance. |
| Pending pods | Keep below 10,000 | High pending pod counts cause the scheduler to generate repeated events, which can trigger event storms. |
| Secrets in clusters with KMS V1 encryption | Keep below 2,000 | With KMS V1, each encryption generates a new data encryption key (DEK). On cluster startup or upgrade, all secrets are decrypted sequentially. Too many secrets significantly slow startup. See Encryption at rest for secrets using KMS. |
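The `enableServiceLinks` opt-out from the services-per-namespace row above is a single pod spec field; the pod and image names below are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-demo                  # placeholder name
spec:
  # Skip injecting every Service in the namespace as environment variables,
  # which slows or blocks pod startup when the namespace has thousands of Services.
  enableServiceLinks: false
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # placeholder image
```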
Configure control plane component parameters
ACK managed cluster Pro lets you customize control plane component parameters for kube-apiserver, kube-controller-manager, and kube-scheduler. In large-scale clusters, tuning throttling-related parameters is critical.
kube-apiserver
kube-apiserver limits concurrent request handling to protect the control plane. When the limit is exceeded, it returns HTTP 429 (Too Many Requests) and instructs clients to retry. Without server-side throttling, excessive requests can crash the control plane.
Throttling mechanisms
Two throttling mechanisms exist depending on Kubernetes version:
- Earlier than v1.18: Maximum concurrency throttling only. The `--max-requests-inflight` and `--max-mutating-requests-inflight` startup parameters cap read and write request concurrency respectively. There is no priority differentiation, so slow, low-priority requests can block urgent ones. ACK managed cluster Pro supports customizing these parameters. See Customize the parameters of control plane components.
- v1.18 and later: API Priority and Fairness (APF) provides fine-grained traffic management. APF classifies and isolates requests by priority, ensuring high-priority requests are processed first while maintaining fairness. APF entered Beta in v1.20 and is enabled by default. In clusters running v1.20 or later, the total concurrent request capacity equals the sum of `--max-requests-inflight` and `--max-mutating-requests-inflight`. APF uses two API types to allocate that capacity:
  - PriorityLevelConfiguration: Defines priority levels and the proportion of total concurrency each level receives.
  - FlowSchema: Maps incoming requests to a PriorityLevelConfiguration.

  kube-apiserver automatically maintains these objects. ACK adds `ack-system-leader-election` and `ack-default` to the FlowSchema list for ACK core components; the remaining entries are consistent with the Kubernetes community defaults.

  View the current PriorityLevelConfiguration objects:

  ```shell
  kubectl get prioritylevelconfigurations
  ```

  Expected output:

  ```
  NAME              TYPE      ASSUREDCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
  catch-all         Limited   5                          <none>   <none>     <none>             4m20s
  exempt            Exempt    <none>                     <none>   <none>     <none>             4m20s
  global-default    Limited   20                         128      6          50                 4m20s
  leader-election   Limited   10                         16       4          50                 4m20s
  node-high         Limited   40                         64       6          50                 4m20s
  system            Limited   30                         64       6          50                 4m20s
  workload-high     Limited   40                         128      6          50                 4m20s
  workload-low      Limited   100                        128      6          50                 4m20s
  ```

  View the current FlowSchema objects:

  ```shell
  kubectl get flowschemas
  ```

  Expected output:

  ```
  NAME                           PRIORITYLEVEL     MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
  exempt                         exempt            1                    <none>                4d18h   False
  probes                         exempt            2                    <none>                4d18h   False
  system-leader-election         leader-election   100                  ByUser                4d18h   False
  endpoint-controller            workload-high     150                  ByUser                4d18h   False
  workload-leader-election       leader-election   200                  ByUser                4d18h   False
  system-node-high               node-high         400                  ByUser                4d18h   False
  system-nodes                   system            500                  ByUser                4d18h   False
  ack-system-leader-election     leader-election   700                  ByNamespace           4d18h   False
  ack-default                    workload-high     800                  ByNamespace           4d18h   False
  kube-controller-manager        workload-high     800                  ByNamespace           4d18h   False
  kube-scheduler                 workload-high     800                  ByNamespace           4d18h   False
  kube-system-service-accounts   workload-high     900                  ByNamespace           4d18h   False
  service-accounts               workload-low      9000                 ByUser                4d18h   False
  global-default                 global-default    9900                 ByUser                4d18h   False
  catch-all                      catch-all         10000                ByUser                4d18h   False
  ```
Responding to throttling
Detect throttling by checking for HTTP 429 responses or monitoring the apiserver_flowcontrol_rejected_requests_total metric. When throttling occurs:
- Increase the concurrency limit: Monitor API Server resource usage. If utilization is low, increase the sum of `max-requests-inflight` and `max-mutating-requests-inflight`:
  - 500–3,000 nodes: set the sum to 2,000–3,000.
  - 3,000+ nodes: set the sum to 3,000–5,000.
- Adjust PriorityLevelConfiguration:
  - For requests that must not be throttled, create a new FlowSchema and map it to a high-priority level such as `workload-high`. Use `exempt` with caution—exempt requests bypass all APF throttling.
  - For slow clients that cause high API Server load, create a FlowSchema that maps those requests to a low-concurrency PriorityLevelConfiguration.

- ACK manages kube-apiserver as a highly available multi-zone deployment with at least 2 replicas, scaling to a maximum of 6 as control plane load increases. Total actual concurrent requests = number of replicas × concurrency limit per replica.
- Modifying kube-apiserver parameters triggers a rolling update. In a large-scale cluster, this causes clients to re-perform List-Watch operations, which can temporarily spike API Server load and cause brief unavailability.
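The low-priority mapping described above can be sketched as a FlowSchema manifest. This is a minimal illustration, not an ACK-provided object: the ServiceAccount, namespace, and object names are hypothetical, and the apiVersion depends on your cluster version (`flowcontrol.apiserver.k8s.io/v1` since Kubernetes v1.29; use `v1beta3` on v1.26–v1.28):

```yaml
# Route List/Watch traffic from a known slow client to the low-concurrency
# workload-low priority level so it cannot starve higher-priority requests.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: slow-client-demo          # hypothetical name
spec:
  priorityLevelConfiguration:
    name: workload-low            # existing low-concurrency priority level
  matchingPrecedence: 1000        # lower values are matched first
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: slow-controller     # hypothetical ServiceAccount
        namespace: demo           # hypothetical namespace
    resourceRules:
    - verbs: ["list", "watch"]
      apiGroups: [""]
      resources: ["pods"]
      namespaces: ["*"]
```

After applying, `kubectl get flowschemas` should list the new schema with its assigned priority level, and the `MISSINGPL` column should read `False`.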
kube-controller-manager and kube-scheduler
Both components control their QPS to the API Server through configuration parameters. See Customize the parameters of control plane components and Customize scheduler parameters.
- kube-controller-manager: For clusters with 1,000 or more nodes, set `kubeAPIQPS`/`kubeAPIBurst` to 300/500 or higher.
- kube-scheduler: No adjustment is typically needed. When the pod scheduling rate exceeds 300 pods/s, set `connectionQPS`/`connectionBurst` to 800/1000.
kubelet
The default kube-api-qps/kube-api-burst value is 5/10, which is sufficient for most clusters. If you observe slow pod status updates, scheduling delays, or slow persistent volume mounting, increase these values. See Customize kubelet configurations for a node pool.
- Increasing kubelet QPS raises the rate at which each node communicates with the API Server. Increase the value gradually and monitor API Server performance to avoid overloading the control plane.
- ACK limits parallel kubelet updates to no more than 10 nodes per batch per node pool to protect control plane stability during rollouts.
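On ACK these values are changed through the node pool's custom kubelet configuration rather than by editing files on nodes. For reference, the equivalent upstream KubeletConfiguration fields look like the fragment below; the values are illustrative, not recommendations:

```yaml
# Upstream KubeletConfiguration fragment (fields kubeAPIQPS/kubeAPIBurst).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeAPIQPS: 20     # illustrative: sustained requests/s from this node to the API Server
kubeAPIBurst: 40   # illustrative: short-term burst allowance
```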
Plan scaling rates
The control plane is typically under low pressure during stable operation, even in large clusters. The risk comes from rapid, large-scale changes—creating or deleting many resources at once, or scaling many nodes simultaneously.
For example, a 5,000-node cluster running stable workloads may show little control plane pressure. But a 1,000-node cluster that creates 10,000 short-lived jobs within a minute, or concurrently scales out 2,000 nodes, can push the control plane to its limits.
The following numbers are reference guidelines, not hard limits. Many factors affect control plane capacity. Always scale gradually: increase the rate only after confirming the control plane is responding normally.
Node scaling:
For clusters with more than 2,000 nodes, when manually scaling through node pools:
- Single node pool, single operation: no more than 100 nodes
- Across multiple node pools simultaneously: no more than 300 nodes total
Pod scaling:
When a pod is associated with a service, each scaling event updates the Endpoints or EndpointSlice and pushes that update to all nodes—creating a cluster-wide data propagation event. In large clusters, this effect is amplified.
For clusters with more than 5,000 nodes:
- Pods not associated with a service endpoint: update QPS ≤ 300/s
- Pods associated with a service endpoint: update QPS ≤ 10/s
For deployments using a Rolling Update strategy, set smaller values for maxUnavailable and maxSurge to reduce the pod replacement rate.
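For illustration, a slow rollout might be configured as below; the deployment name, image, and percentages are placeholders to adapt to your workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-rollout-demo            # placeholder name
spec:
  replicas: 1000
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2%   # at most ~20 pods taken down at a time
      maxSurge: 2%         # at most ~20 extra pods created at a time
  selector:
    matchLabels:
      app: slow-rollout-demo
  template:
    metadata:
      labels:
        app: slow-rollout-demo
    spec:
      containers:
      - name: app
        image: registry.example.com/app:latest   # placeholder image
```

Smaller percentages lengthen the rollout but keep the per-second Endpoints/EndpointSlice churn—and the resulting cluster-wide propagation—within the rates above.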
Optimize client access patterns
As cluster resource counts grow, frequent API Server requests amplify control plane load and can cause cascading failures. Follow these guidelines when building controllers or tools that access the API Server.
Use informers for cached data access:
- Use client-go informers to read resources from a local cache instead of issuing direct List requests to the API Server.
- Informers maintain a single Watch connection and serve read requests locally, significantly reducing API Server load.
Optimize direct API Server requests:
- Set `resourceVersion=0` in List requests to read from the API Server's cache rather than going through to etcd. This reduces API Server–etcd round-trips and speeds up responses:

  ```go
  k8sClient.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{ResourceVersion: "0"})
  ```

- Use label selectors and field selectors to narrow List request scope and reduce response payload size. Note that etcd is a key-value store and cannot filter by label or field—the API Server performs that filtering from its cache. Always combine selectors with `resourceVersion=0` to avoid hitting etcd directly.
- Use protobuf for non-CRD resources. protobuf uses less memory and bandwidth than JSON. Specify multiple content types in the `Accept` header to fall back to JSON when protobuf is unavailable:

  ```
  Accept: application/vnd.kubernetes.protobuf, application/json
  ```

- Avoid repeated or broad List calls. List less often, and chunk large lists when informers are unavailable.
- Add exponential backoff and retry policies to prevent a burst of retries from overwhelming the API Server after a transient error.
Use a centralized controller design:
Avoid deploying independent controllers on every node that each watch the full cluster state. On startup, all such controllers issue simultaneous List requests to sync state, which can crash the control plane.
Instead, run one or a small group of centrally managed controller instances for the entire cluster. A centralized controller issues a single List request on startup and maintains the minimum number of Watch connections, dramatically reducing API Server pressure.
Plan large-scale workloads
Disable automatic ServiceAccount token mounting for pods that don't need API access
The kubelet establishes a persistent Watch connection for each secret mounted into a pod. A large number of Watch connections degrades control plane performance.
- Before Kubernetes v1.22: When no ServiceAccount is specified, Kubernetes automatically mounts a secret for the default ServiceAccount. For batch jobs and application pods that don't access the API Server, set `automountServiceAccountToken: false` to skip this mounting. This avoids creating unnecessary secrets and Watch connections. See Opt out of API credential automounting.
- Kubernetes v1.22 and later: Use the TokenRequest API to obtain short-lived, automatically rotated tokens, mounted as a projected volume. This improves security and reduces the number of Watch connections the kubelet maintains. See Use ServiceAccount token volume projection.
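The opt-out is a one-line pod spec field, sketched below with placeholder pod and image names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                 # placeholder name
spec:
  # This pod never calls the API Server, so skip mounting the
  # ServiceAccount token: no extra secret, no kubelet Watch on it.
  automountServiceAccountToken: false
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # placeholder image
```

The same field can also be set on the ServiceAccount object itself to apply the opt-out to every pod that uses it.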
Keep Kubernetes object counts and sizes under control
Clean up unused resources—ConfigMaps, Secrets, PVCs—promptly to reduce system overhead and keep etcd lean.
- Limit deployment history: Set `revisionHistoryLimit` to a lower value to control how many old ReplicaSets Kubernetes retains. The default is 10. In clusters with many frequently updated deployments, high history retention increases kube-controller-manager's management overhead. See revisionHistoryLimit.
- Clean up completed jobs automatically: Use `ttlSecondsAfterFinished` to delete completed jobs and their pods after a specified period. This prevents accumulation of job objects in clusters that run many CronJobs. See TTL controller for finished resources.
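Both settings are single fields on the workload spec. A minimal sketch, with placeholder names and images:

```yaml
# Job: deleted (with its pods) one hour after it finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report               # placeholder name
spec:
  ttlSecondsAfterFinished: 3600      # clean up 1 hour after completion
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: report
        image: registry.example.com/report:latest   # placeholder image
---
# Deployment: keep only 3 old ReplicaSets instead of the default 10.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                          # placeholder name
spec:
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:latest      # placeholder image
```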
Configure appropriate resource limits for informer-based components
Informer-based components (such as controllers and kube-scheduler) maintain a local cache of the resources they watch. Their memory usage scales with the number and size of those resources.
In large-scale clusters, you should pay close attention to the memory consumption of these components to prevent out-of-memory (OOM) errors. When a component runs out of memory, it is killed and restarts. Each restart triggers a new List-Watch cycle, which puts additional pressure on the API Server. Frequent restarts create a loop that degrades the control plane.
Increase the memory limits of informer-based components to match the actual resource scale they are managing.
Monitor control plane metrics
Use the control plane component monitoring dashboard to track core metrics and detect issues early. See Control plane component monitoring.
Control plane resource usage
| Metric | PromQL | Description |
|---|---|---|
| Memory usage | `memory_utilization_byte{container="kube-apiserver"}` | API Server memory usage, in bytes |
| CPU usage | `cpu_utilization_core{container="kube-apiserver"}*1000` | API Server CPU usage, in millicores |
kube-apiserver
For the full metric list and viewing instructions, see kube-apiserver component monitoring metrics.
Resource object count:
| Metric | PromQL | Notes |
|---|---|---|
| Resource object count | `max by(resource)(apiserver_storage_objects)` | Kubernetes 1.22 and later |
| Resource object count | `max by(resource)(etcd_object_counts)` | Kubernetes 1.22 and earlier; both metrics coexist in v1.22 for compatibility |
Request latency:
| Metric | PromQL | Description |
|---|---|---|
| GET request latency | `histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb="GET",resource!="",subresource!~"log\|proxy"}[$interval])) by (pod, verb, resource, subresource, scope, le))` | GET response time by API Server pod, resource, and scope |
| LIST request latency | `histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb="LIST"}[$interval])) by (pod_name, verb, resource, scope, le))` | LIST response time by API Server pod, resource, and scope |
| Write request latency | `histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb!~"GET\|WATCH\|LIST\|CONNECT"}[$interval])) by (cluster, pod_name, verb, resource, scope, le))` | Mutating request response time by verb, resource, and scope |
Request throttling:
| Metric | PromQL | Description |
|---|---|---|
| Request throttle rate (read) | `sum(irate(apiserver_dropped_requests_total{request_kind="readOnly"}[$interval])) by (name)` | Throttling rate for read requests; no data or 0 means no throttling |
| Request throttle rate (mutating) | `sum(irate(apiserver_dropped_requests_total{request_kind="mutating"}[$interval])) by (name)` | Throttling rate for mutating requests |
kube-scheduler
For the full metric list and viewing instructions, see kube-scheduler component monitoring metrics.
Pending pods:
| Metric | PromQL | Description |
|---|---|---|
| Scheduler pending pods | `scheduler_pending_pods{job="ack-scheduler"}` | Breakdown by type: unschedulable (cannot be scheduled), backoff (failed and waiting to retry), active (ready to schedule) |
Request latency:
| Metric | PromQL | Description |
|---|---|---|
| kube-apiserver request latency | `histogram_quantile($quantile, sum(rate(rest_client_request_duration_seconds_bucket{job="ack-scheduler"}[$interval])) by (verb,url,le))` | Time between kube-scheduler sending a request and kube-apiserver returning a response, by verb and URL |
kube-controller-manager
For the full metric list and viewing instructions, see kube-controller-manager component monitoring metrics.
Workqueue:
| Metric | PromQL | Description |
|---|---|---|
| Workqueue depth | `sum(rate(workqueue_depth{job="ack-kube-controller-manager"}[$interval])) by (name)` | Rate of change in workqueue length over the specified interval |
| Workqueue processing delay | `histogram_quantile($quantile, sum(rate(workqueue_queue_duration_seconds_bucket{job="ack-kube-controller-manager"}[5m])) by (name, le))` | Time events spend waiting in the workqueue |
etcd
For the full metric list and viewing instructions, see etcd component monitoring metrics.
Key-value count:
| Metric | PromQL | Description |
|---|---|---|
| Total KV count | `etcd_debugging_mvcc_keys_total` | Total number of key-value pairs in the etcd cluster |
Database size:
| Metric | PromQL | Description |
|---|---|---|
| Disk size | `etcd_mvcc_db_total_size_in_bytes` | Total size of the etcd backend database |
| Database usage | `etcd_mvcc_db_total_size_in_use_in_bytes` | Actual in-use size of the etcd backend database |
What's next
- Cluster quotas and limits: Quotas and limits
- Network planning: Plan CIDR blocks for an ACK managed cluster
- High-reliability workload configurations: Recommended workload configurations
- Cluster troubleshooting: Troubleshooting and FAQ about cluster management