Container Service for Kubernetes: Recommendations for using large-scale clusters

Last Updated: Mar 26, 2026

Large-scale ACK managed cluster Pro clusters—typically those with more than 500 nodes or 10,000 pods—place unique demands on the control plane. Cluster performance and availability depend on resource count, access frequency, and access patterns. The true signal of whether your cluster is at large scale is the API Server's request success rate and latency: if those degrade, the control plane is under pressure regardless of node count.

This topic provides recommendations for planning and operating large-scale clusters. Adjust them based on your specific environment and business requirements.

Under the shared responsibility model, ACK manages the security of the cluster control plane components—including Kubernetes control plane components and etcd—and the underlying Alibaba Cloud infrastructure. You are responsible for securing your business applications and configuring your cloud resources. For details, see Shared responsibility model.

Single cluster vs. multiple clusters

A single large-scale cluster reduces management overhead and improves resource utilization. In some business scenarios, however, splitting services across multiple clusters makes more sense.

Consider multiple clusters when:

Consideration When to split
Isolation Prevent issues in one environment (for example, testing) from affecting production. Splitting clusters reduces the blast radius of failures.
Geographic distribution Deploy clusters in specific regions to meet availability and latency requirements for end users.
Single-cluster size limits The Kubernetes architecture has inherent performance limits. Before planning a large-scale cluster, review the community's capacity limits and SLOs, then check your quota at the Quota Center. If your requirements exceed community or ACK limits, split into multiple clusters.

To manage multiple clusters for tasks such as application deployment, traffic management, job distribution, and monitoring, enable fleet management.

Keep clusters on the latest version

Newer Kubernetes versions include stability, performance, and scalability improvements that directly benefit large-scale clusters.

ACK releases supported Kubernetes versions in sync with the community and discontinues support for versions that have reached end of life: new feature releases, bug fixes, and security patches stop, and only limited technical support remains for those versions. Monitor version release announcements in the console and upgrade promptly.

Monitor cluster resource limits

Stay within the following limits to maintain availability and performance in large-scale clusters.

Resource Limit Action
etcd database size (DB Size) Keep below 8 GB Important: If etcd exceeds its size limit, it raises a no space alarm and stops accepting write requests; the cluster becomes read-only and all mutations (creating pods, scaling deployments, and so on) are rejected. To stay within the limit, clean up unused resources promptly, and keep individual object size below 100 KB for frequently modified resources (each etcd update creates a new historical version, so large, frequently updated objects consume disproportionate space).
Total data per resource type in etcd Keep below 800 MB per type When defining a new CustomResourceDefinition (CRD), estimate the final number of custom resources (CRs) in advance. When using Helm, consider switching to Helm's SQL storage backend if secrets-based release storage approaches the Kubernetes secrets size limit.
API Server CLB connections and bandwidth Maximum bandwidth: 5,120 Mbps; see CLB instances for connection limits For clusters with 1,000 or more nodes, use pay-by-usage Classic Load Balancer (CLB) instances. Clusters created after February 2023 with Kubernetes 1.20 or later use ENI (Elastic Network Interface) direct connection by default for improved bandwidth. See Access the API server using an internal endpoint.
Services per namespace Keep below 5,000 The kubelet injects service information as environment variables into pods. A large number of services in a namespace causes slow pod startup or startup failures. Set enableServiceLinks: false in the podSpec to opt out (a sketch follows this table). See Accessing the service.
Total services in the cluster Keep below 10,000 total; 500 for LoadBalancer-type services Excess services increase the network rules kube-proxy processes, degrading its performance. For LoadBalancer-type services, the sync delay to the CLB can reach minutes when counts are high.
Backend pods per service endpoint Keep below 3,000 Use EndpointSlices instead of Endpoints in large-scale clusters—EndpointSlices split endpoints into smaller chunks, reducing data transmitted per change. If you use a custom controller that reads Endpoints directly, keep the count below 1,000 per Endpoints object; above this, the object is automatically truncated. See Over-capacity endpoints.
Total endpoints across all services Keep below 64,000 Excess endpoints overload the API Server and degrade network performance.
Pending pods Keep below 10,000 High pending pod counts cause the scheduler to generate repeated events, which can trigger event storms.
Secrets in clusters with KMS V1 encryption Keep below 2,000 With KMS V1, each encryption generates a new data encryption key (DEK). On cluster startup or upgrade, all secrets are decrypted sequentially. Too many secrets significantly slow startup. See Encryption at rest for secrets using KMS.
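
For the services-per-namespace row above, the opt-out is a single pod-level field. The following is a minimal client-go sketch; the pod name, namespace, and image are placeholders, not values from this topic:

    package workload

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // newBatchPod returns a pod that opts out of service environment-variable
    // injection, so startup does not slow down in namespaces with many services.
    func newBatchPod() *corev1.Pod {
        enableServiceLinks := false
        return &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "batch-worker", Namespace: "default"},
            Spec: corev1.PodSpec{
                EnableServiceLinks: &enableServiceLinks, // skip service env var injection
                Containers: []corev1.Container{
                    {Name: "worker", Image: "registry.example.com/worker:latest"},
                },
            },
        }
    }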

Configure control plane component parameters

ACK managed cluster Pro lets you customize control plane component parameters for kube-apiserver, kube-controller-manager, and kube-scheduler. In large-scale clusters, tuning throttling-related parameters is critical.

kube-apiserver

kube-apiserver limits concurrent request handling to protect the control plane. When the limit is exceeded, it returns HTTP 429 (Too Many Requests) and instructs clients to retry. Without server-side throttling, excessive requests can crash the control plane.

Throttling mechanisms

Two throttling mechanisms exist depending on Kubernetes version:

  • Earlier than v1.18: Maximum concurrency throttling only. The --max-requests-inflight and --max-mutating-requests-inflight startup parameters cap read and write request concurrency respectively. No priority differentiation—slow, low-priority requests can block urgent ones. ACK managed cluster Pro supports customizing these parameters. See Customize the parameters of control plane components.

  • v1.18 and later: API Priority and Fairness (APF) provides fine-grained traffic management. APF classifies and isolates requests by priority, ensuring high-priority requests are processed first while maintaining fairness. APF entered Beta in v1.20 and is enabled by default. In clusters running v1.20 or later, the total concurrent request capacity equals the sum of --max-requests-inflight and --max-mutating-requests-inflight. APF allocates that capacity across two API object types, which kube-apiserver maintains automatically:

    • PriorityLevelConfiguration: Defines priority levels and the proportion of total concurrency each level receives.

    • FlowSchema: Maps incoming requests to a PriorityLevelConfiguration.

    ACK adds ack-system-leader-election and ack-default FlowSchema objects for ACK core components; the remaining entries are consistent with the Kubernetes community defaults. View the current PriorityLevelConfiguration objects:
    kubectl get PriorityLevelConfiguration
    # Expected output
    NAME              TYPE      ASSUREDCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
    catch-all         Limited   5                          <none>   <none>     <none>             4m20s
    exempt            Exempt    <none>                     <none>   <none>     <none>             4m20s
    global-default    Limited   20                         128      6          50                 4m20s
    leader-election   Limited   10                         16       4          50                 4m20s
    node-high         Limited   40                         64       6          50                 4m20s
    system            Limited   30                         64       6          50                 4m20s
    workload-high     Limited   40                         128      6          50                 4m20s
    workload-low      Limited   100                        128      6          50                 4m20s

    View FlowSchema:

    kubectl get flowschemas
    # Expected output
    NAME                           PRIORITYLEVEL     MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
    exempt                         exempt            1                    <none>                4d18h   False
    probes                         exempt            2                    <none>                4d18h   False
    system-leader-election         leader-election   100                  ByUser                4d18h   False
    endpoint-controller            workload-high     150                  ByUser                4d18h   False
    workload-leader-election       leader-election   200                  ByUser                4d18h   False
    system-node-high               node-high         400                  ByUser                4d18h   False
    system-nodes                   system            500                  ByUser                4d18h   False
    ack-system-leader-election     leader-election   700                  ByNamespace           4d18h   False
    ack-default                    workload-high     800                  ByNamespace           4d18h   False
    kube-controller-manager        workload-high     800                  ByNamespace           4d18h   False
    kube-scheduler                 workload-high     800                  ByNamespace           4d18h   False
    kube-system-service-accounts   workload-high     900                  ByNamespace           4d18h   False
    service-accounts               workload-low      9000                 ByUser                4d18h   False
    global-default                 global-default    9900                 ByUser                4d18h   False
    catch-all                      catch-all         10000                ByUser                4d18h   False

Responding to throttling

Detect throttling by checking for HTTP 429 responses or monitoring the apiserver_flowcontrol_rejected_requests_total metric. When throttling occurs:

  • Increase the concurrency limit: Monitor API Server resource usage. If utilization is low, increase the sum of max-requests-inflight and max-mutating-requests-inflight:

    • 500–3,000 nodes: set the sum to 2,000–3,000

    • 3,000+ nodes: set the sum to 3,000–5,000

  • Adjust PriorityLevelConfiguration:

    • For requests that must not be throttled, create a new FlowSchema and map it to a high-priority level such as workload-high. Use exempt with caution—exempt requests bypass all APF throttling.

    • For slow clients that cause high API Server load, create a FlowSchema that maps those requests to a low-concurrency PriorityLevelConfiguration (a sketch follows this list).
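
As a sketch of the last bullet, the following creates a FlowSchema that maps requests from a hypothetical slow client (service account batch-reporter in namespace ops, both placeholder names) to the existing low-concurrency workload-low priority level shown in the output above. It assumes Kubernetes v1.29 or later, where the flowcontrol v1 API is GA (use the corresponding beta API group on earlier versions), and an already-constructed clientset:

    package apf

    import (
        "context"

        flowcontrolv1 "k8s.io/api/flowcontrol/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // throttleSlowClient maps all resource requests from the ops/batch-reporter
    // service account to the low-concurrency workload-low priority level.
    func throttleSlowClient(ctx context.Context, client kubernetes.Interface) error {
        fs := &flowcontrolv1.FlowSchema{
            ObjectMeta: metav1.ObjectMeta{Name: "slow-batch-reporter"},
            Spec: flowcontrolv1.FlowSchemaSpec{
                PriorityLevelConfiguration: flowcontrolv1.PriorityLevelConfigurationReference{
                    Name: "workload-low", // existing low-priority level
                },
                MatchingPrecedence: 1000, // matched before the default service-accounts schema (9000)
                DistinguisherMethod: &flowcontrolv1.FlowDistinguisherMethod{
                    Type: flowcontrolv1.FlowDistinguisherMethodByUserType,
                },
                Rules: []flowcontrolv1.PolicyRulesWithSubjects{{
                    Subjects: []flowcontrolv1.Subject{{
                        Kind: flowcontrolv1.SubjectKindServiceAccount,
                        ServiceAccount: &flowcontrolv1.ServiceAccountSubject{
                            Namespace: "ops",
                            Name:      "batch-reporter",
                        },
                    }},
                    ResourceRules: []flowcontrolv1.ResourcePolicyRule{{
                        Verbs:        []string{"*"},
                        APIGroups:    []string{"*"},
                        Resources:    []string{"*"},
                        ClusterScope: true,
                        Namespaces:   []string{"*"},
                    }},
                }},
            },
        }
        _, err := client.FlowcontrolV1().FlowSchemas().Create(ctx, fs, metav1.CreateOptions{})
        return err
    }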

Important
  • ACK manages kube-apiserver as a highly available multi-zone deployment with at least 2 replicas, scaling to a maximum of 6 as control plane load increases. Total actual concurrent requests = number of replicas x concurrency limit per replica.

  • Modifying kube-apiserver parameters triggers a rolling update. In a large-scale cluster, this causes clients to re-perform List-Watch operations, which can temporarily spike API Server load and cause brief unavailability.

kube-controller-manager and kube-scheduler

Both components control their QPS to the API Server through configuration parameters. See Customize the parameters of control plane components and Customize scheduler parameters.

  • kube-controller-manager: For clusters with 1,000 or more nodes, set kubeAPIQPS/kubeAPIBurst to 300/500 or higher.

  • kube-scheduler: No adjustment is typically needed. When the pod scheduling rate exceeds 300/s, set connectionQPS/connectionBurst to 800/1000.

kubelet

The default kube-api-qps/kube-api-burst value is 5/10, which is sufficient for most clusters. If you observe slow pod status updates, scheduling delays, or slow persistent volume mounting, increase these values. See Customize kubelet configurations for a node pool.

Important
  • Increasing kubelet QPS raises the rate at which each node communicates with the API Server. Increase the value gradually and monitor API Server performance to avoid overloading the control plane.

  • ACK limits parallel kubelet updates to no more than 10 nodes per batch per node pool to protect control plane stability during rollouts.

Plan scaling rates

The control plane is typically under low pressure during stable operation, even in large clusters. The risk comes from rapid, large-scale changes—creating or deleting many resources at once, or scaling many nodes simultaneously.

For example, a 5,000-node cluster running stable workloads may show little control plane pressure. But a 1,000-node cluster that creates 10,000 short-lived jobs within a minute, or concurrently scales out 2,000 nodes, can push the control plane to its limits.

Important

The following numbers are reference guidelines, not hard limits. Many factors affect control plane capacity. Always scale gradually: increase the rate only after confirming the control plane is responding normally.

Node scaling:

For clusters with more than 2,000 nodes, when manually scaling through node pools:

  • Single node pool, single operation: no more than 100 nodes

  • Across multiple node pools simultaneously: no more than 300 nodes total

Pod scaling:

When a pod is associated with a service, each scaling event updates the Endpoints or EndpointSlice and pushes that update to all nodes—creating a cluster-wide data propagation event. In large clusters, this effect is amplified.

For clusters with more than 5,000 nodes:

  • Pods not associated with a service endpoint: update QPS ≤ 300/s

  • Pods associated with a service endpoint: update QPS ≤ 10/s

For deployments using a Rolling Update strategy, set smaller values for maxUnavailable and maxSurge to reduce the pod replacement rate.
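
For example, a rolling update strategy that replaces pods slowly might look like the following client-go sketch; the 5% values are illustrative, not a recommendation from this topic:

    package workload

    import (
        appsv1 "k8s.io/api/apps/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    // slowRollingUpdate returns a deployment strategy that limits how many pods
    // are replaced at a time, reducing Endpoints/EndpointSlice churn per update.
    func slowRollingUpdate() appsv1.DeploymentStrategy {
        maxUnavailable := intstr.FromString("5%")
        maxSurge := intstr.FromString("5%")
        return appsv1.DeploymentStrategy{
            Type: appsv1.RollingUpdateDeploymentStrategyType,
            RollingUpdate: &appsv1.RollingUpdateDeployment{
                MaxUnavailable: &maxUnavailable,
                MaxSurge:       &maxSurge,
            },
        }
    }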

Optimize client access patterns

As cluster resource counts grow, frequent API Server requests amplify control plane load and can cause cascading failures. Follow these guidelines when building controllers or tools that access the API Server.

Use informers for cached data access:

  • Use client-go informers to read resources from a local cache instead of issuing direct List requests to the API Server.

  • Informers maintain a single Watch connection and serve read requests locally, reducing API Server load significantly (a sketch follows this list).
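
A minimal client-go informer sketch, assuming an already-constructed clientset; it lists pods from the local cache instead of querying the API Server on every read:

    package watchcache

    import (
        "context"
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/labels"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
    )

    // runPodInformer starts a shared informer for pods and reads from its cache.
    func runPodInformer(ctx context.Context, client kubernetes.Interface) error {
        // One List plus one long-lived Watch per resource type, shared by all consumers.
        factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
        podLister := factory.Core().V1().Pods().Lister()

        factory.Start(ctx.Done())
        // Wait until the local cache has been populated by the initial List.
        factory.WaitForCacheSync(ctx.Done())

        // Served from the in-memory cache; no request hits the API Server here.
        pods, err := podLister.Pods("default").List(labels.Everything())
        if err != nil {
            return err
        }
        fmt.Printf("cached pods in default namespace: %d\n", len(pods))
        return nil
    }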

Optimize direct API Server requests:

  • Set `resourceVersion=0` in List requests to read from the API Server's cache rather than going through to etcd. This reduces API Server–etcd round-trips and speeds up responses:

    // List pods in all namespaces from the API Server cache; resourceVersion=0 avoids a quorum read from etcd.
    k8sClient.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{ResourceVersion: "0"})
  • Use label selectors and field selectors to narrow List request scope and reduce response payload size. Note: etcd is a key-value store and cannot filter by label or field—the API Server handles that filtering from its cache. Always combine selectors with resourceVersion=0 to avoid hitting etcd directly (a combined sketch follows this list).

  • Use protobuf for non-CRD resources. protobuf uses less memory and bandwidth than JSON. Specify multiple content types in the Accept header to fall back to JSON when protobuf is unavailable:

    Accept: application/vnd.kubernetes.protobuf, application/json

    See Alternate representations of resources.

  • Avoid repeated or broad List calls. List less often, and chunk large lists when informers are unavailable.

  • Add exponential backoff and retry policies to prevent a burst of retries from overwhelming the API Server after a transient error.
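
The following is a combined sketch of the List guidelines above: it requests protobuf with a JSON fallback, narrows the List with a label selector, reads from the API Server cache with resourceVersion=0, and retries with exponential backoff when the request is throttled. The label app=web and the retry predicate are illustrative assumptions:

    package main

    import (
        "context"
        "fmt"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
        "k8s.io/client-go/util/retry"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        // Prefer protobuf for built-in (non-CRD) resources, fall back to JSON.
        cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"
        cfg.ContentType = "application/vnd.kubernetes.protobuf"
        client := kubernetes.NewForConfigOrDie(cfg)

        var podCount int
        // Retry with exponential backoff only when the API Server throttles (HTTP 429).
        err = retry.OnError(retry.DefaultBackoff, apierrors.IsTooManyRequests, func() error {
            pods, listErr := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{
                LabelSelector:   "app=web", // filtered by the API Server, not etcd
                ResourceVersion: "0",       // serve from the API Server cache, skip etcd
            })
            if listErr != nil {
                return listErr
            }
            podCount = len(pods.Items)
            return nil
        })
        if err != nil {
            panic(err)
        }
        fmt.Printf("pods matching app=web: %d\n", podCount)
    }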

Use a centralized controller design:

Avoid deploying independent controllers on every node that each watch the full cluster state. On startup, all such controllers issue simultaneous List requests to sync state, which can crash the control plane.

Instead, run one or a small group of centrally managed controller instances for the entire cluster. A centralized controller issues a single List request on startup and maintains the minimum number of Watch connections, dramatically reducing API Server pressure.
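
One common way to keep a single active controller instance (with standby replicas) is client-go leader election over a Lease. A minimal sketch, assuming an already-constructed clientset; the lease name, namespace, and timings are illustrative:

    package controller

    import (
        "context"
        "os"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    // runWithLeaderElection ensures only one replica runs the controller loop at a
    // time, so only one instance Lists and Watches cluster state.
    func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, run func(context.Context)) {
        id, _ := os.Hostname()
        lock := &resourcelock.LeaseLock{
            LeaseMeta:  metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
            Client:     client.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{Identity: id},
        }
        leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
            Lock:            lock,
            ReleaseOnCancel: true,
            LeaseDuration:   15 * time.Second,
            RenewDeadline:   10 * time.Second,
            RetryPeriod:     2 * time.Second,
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: run,                      // start informers and reconcile loops here
                OnStoppedLeading: func() { os.Exit(0) },    // exit cleanly when leadership is lost
            },
        })
    }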

Plan large-scale workloads

Disable automatic ServiceAccount token mounting for pods that don't need API access

The kubelet establishes a persistent Watch connection for each secret mounted into a pod. A large number of Watch connections degrades control plane performance.

  • Before Kubernetes v1.22: When no ServiceAccount is specified, Kubernetes automatically mounts a secret for the default ServiceAccount. For batch jobs and application pods that don't access the API Server, set automountServiceAccountToken: false to skip this mounting (a sketch follows this list). This avoids creating unnecessary secrets and Watch connections. See Opt out of API credential automounting.

  • Kubernetes v1.22 and later: Use the TokenRequest API to obtain short-lived, automatically rotated tokens, mounted as a projected volume. This improves security and reduces the number of Watch connections the kubelet maintains. See Use ServiceAccount token volume projection.
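
For the first bullet, the opt-out is a single pod-level field. A minimal sketch; the pod name and image are placeholders:

    package workload

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // newJobPod returns a pod that does not mount a ServiceAccount token, avoiding
    // an extra secret and the kubelet Watch connection that goes with it.
    func newJobPod() *corev1.Pod {
        automount := false
        return &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "report-job", Namespace: "default"},
            Spec: corev1.PodSpec{
                AutomountServiceAccountToken: &automount, // pod never talks to the API Server
                Containers: []corev1.Container{
                    {Name: "report", Image: "registry.example.com/report:latest"},
                },
            },
        }
    }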

Keep Kubernetes object counts and sizes under control

Clean up unused resources—ConfigMaps, Secrets, PVCs—promptly to reduce system overhead and keep etcd lean.

  • Limit deployment history: Set revisionHistoryLimit to a lower value to control how many old ReplicaSets Kubernetes retains. The default is 10. In clusters with many frequently updated deployments, high history retention increases kube-controller-manager's management overhead. See revisionHistoryLimit.

  • Clean up completed jobs automatically: Use ttlSecondsAfterFinished to delete completed jobs and their pods after a specified period. This prevents accumulation of job objects in clusters that run many CronJobs (a sketch follows this list). See TTL controller for finished resources.
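
A sketch of both settings in client-go form; the values shown (3 retained ReplicaSets, 1-hour job retention) are illustrative, not recommendations from this topic:

    package workload

    import (
        appsv1 "k8s.io/api/apps/v1"
        batchv1 "k8s.io/api/batch/v1"
    )

    // trimHistory keeps only the three most recent ReplicaSets for a deployment.
    func trimHistory(d *appsv1.Deployment) {
        limit := int32(3)
        d.Spec.RevisionHistoryLimit = &limit
    }

    // expireWhenFinished deletes a job and its pods one hour after it completes.
    func expireWhenFinished(j *batchv1.Job) {
        ttl := int32(3600)
        j.Spec.TTLSecondsAfterFinished = &ttl
    }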

Configure appropriate resource limits for informer-based components

Informer-based components (such as controllers and kube-scheduler) maintain a local cache of the resources they watch. Their memory usage scales with the number and size of those resources.

In large-scale clusters, you should pay close attention to the memory consumption of these components to prevent out-of-memory (OOM) errors. When a component runs out of memory, it is killed and restarts. Each restart triggers a new List-Watch cycle, which puts additional pressure on the API Server. Frequent restarts create a loop that degrades the control plane.

Increase the memory limits of informer-based components to match the actual resource scale they are managing.
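
A sketch of a memory request and limit on a controller container; the 2 Gi and 4 Gi values are placeholders to be sized from the component's observed working set at your resource scale:

    package controller

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // controllerResources sizes memory headroom for an informer-based controller container.
    func controllerResources() corev1.ResourceRequirements {
        return corev1.ResourceRequirements{
            Requests: corev1.ResourceList{
                corev1.ResourceMemory: resource.MustParse("2Gi"),
            },
            Limits: corev1.ResourceList{
                corev1.ResourceMemory: resource.MustParse("4Gi"),
            },
        }
    }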

Monitor control plane metrics

Use the control plane component monitoring dashboard to track core metrics and detect issues early. See Control plane component monitoring.

Control plane resource usage

Metric PromQL Description
Memory usage memory_utilization_byte{container="kube-apiserver"} API Server memory usage, in bytes
CPU usage cpu_utilization_core{container="kube-apiserver"}*1000 API Server CPU usage, in millicores

kube-apiserver

For the full metric list and viewing instructions, see kube-apiserver component monitoring metrics.

Resource object count:

Metric PromQL Notes
Resource object count max by(resource)(apiserver_storage_objects) Kubernetes 1.22 and later
Resource object count max by(resource)(etcd_object_counts) Kubernetes 1.22 and earlier; both metrics coexist in v1.22 for compatibility

Request latency:

Metric PromQL Description
GET request latency histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb="GET",resource!="",subresource!~"log|proxy"}[$interval])) by (pod, verb, resource, subresource, scope, le)) GET response time by API Server pod, resource, and scope
LIST request latency histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb="LIST"}[$interval])) by (pod_name, verb, resource, scope, le)) LIST response time by API Server pod, resource, and scope
Write request latency histogram_quantile($quantile, sum(irate(apiserver_request_duration_seconds_bucket{verb!~"GET|WATCH|LIST|CONNECT"}[$interval])) by (cluster, pod_name, verb, resource, scope, le)) Mutating request response time by verb, resource, and scope

Request throttling:

Metric PromQL Description
Request throttle rate (read) sum(irate(apiserver_dropped_requests_total{request_kind="readOnly"}[$interval])) by (name) Throttling rate for read requests; No data or 0 means no throttling
Request throttle rate (mutating) sum(irate(apiserver_dropped_requests_total{request_kind="mutating"}[$interval])) by (name) Throttling rate for mutating requests

kube-scheduler

For the full metric list and viewing instructions, see kube-scheduler component monitoring metrics.

Pending pods:

Metric PromQL Description
Scheduler pending pods scheduler_pending_pods{job="ack-scheduler"} Breakdown by type: unschedulable (cannot be scheduled), backoff (failed and waiting to retry), active (ready to schedule)

Request latency:

Metric PromQL Description
kube-apiserver request latency histogram_quantile($quantile, sum(rate(rest_client_request_duration_seconds_bucket{job="ack-scheduler"}[$interval])) by (verb,url,le)) Time between kube-scheduler sending a request and kube-apiserver returning a response, by verb and URL

kube-controller-manager

For the full metric list and viewing instructions, see kube-controller-manager component monitoring metrics.

Workqueue:

Metric PromQL Description
Workqueue depth sum(rate(workqueue_depth{job="ack-kube-controller-manager"}[$interval])) by (name) Rate of change in workqueue length over the specified interval
Workqueue processing delay histogram_quantile($quantile, sum(rate(workqueue_queue_duration_seconds_bucket{job="ack-kube-controller-manager"}[5m])) by (name, le)) Time events spend waiting in the workqueue

etcd

For the full metric list and viewing instructions, see etcd component monitoring metrics.

Key-value count:

Metric PromQL Description
Total KV count etcd_debugging_mvcc_keys_total Total number of key-value pairs in the etcd cluster

Database size:

Metric PromQL Description
Disk size etcd_mvcc_db_total_size_in_bytes Total size of the etcd backend database
Database usage etcd_mvcc_db_total_size_in_use_in_bytes Actual in-use size of the etcd backend database

What's next