Cluster inspection alerts and solutions FAQ

Alibaba Cloud Container Compute Service (ACS) integrates with Container Intelligence Service (CIS) to run periodic cluster inspections. When an inspection detects a potential risk, it generates an alert. This topic lists each alert, explains what triggers it and what breaks, and provides steps to resolve it.

Note: Check items may vary based on your cluster configuration. The items in your inspection report take precedence over this topic. For instructions on running cluster inspections, see Work with the cluster inspection feature.

Check items and alerts

Category	Check item	Alert
Resource quotas	Quota on SLB instances	Insufficient quota on SLB instances in a VPC
Resource quotas	Quota on SLB backend servers	Insufficient quota on SLB backend servers
Resource quotas	Quota on SLB listeners	Insufficient quota on SLB listeners
Resource watermarks	SLB bandwidth usage	Excessive SLB bandwidth usage
Resource watermarks	Number of SLB connections	Excessive number of SLB connections
Resource watermarks	Rate of new SLB connections	Excessively high rate of new SLB connections
Resource watermarks	SLB QPS	Excessively high SLB QPS
Versions and certificates	Kubernetes version of a cluster	Outdated Kubernetes version of a cluster
Cluster risks	Whether an SLB instance is associated with the API server	No SLB instance associated with the API server
Cluster risks	Status of the SLB instance associated with the API server	Abnormal status of the SLB instance associated with the API server
Cluster risks	Configuration of the listener on port 6443 for the SLB instance associated with the API server	Errors in the listener configuration on port 6443 for the SLB instance associated with the API server
Cluster risks	Access control configuration of the SLB instance associated with the API server	Errors in the access control configuration of the SLB instance associated with the API server
Cluster risks	Cluster IP address of the DNS service	Abnormal cluster IP address of the DNS service
Cluster risks	Endpoints of the DNS service	No endpoints available for the DNS service
Cluster risks	Whether one SLB port is shared by multiple Services	One SLB port shared by multiple Services

Resource quotas

Insufficient quota on SLB instances in a VPC

Condition: Fewer than five Server Load Balancer (SLB) instances can still be created in the cluster VPC.

Impact: Each LoadBalancer Service consumes one SLB instance. When the quota is exhausted, new LoadBalancer Services fail to work.

Solution: Request a quota increase in the Quota Center console. The default limit is 60 SLB instances per Alibaba Cloud account. For quota details, see Quotas.
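
The condition above is simple arithmetic on the quota and the current usage. The following minimal sketch illustrates it, assuming you already know how many SLB instances count against the quota. The function names are hypothetical and are not part of any ACS or SLB API.

```python
# Sketch of the alert condition: fewer than five SLB instances can still be created.
DEFAULT_SLB_QUOTA = 60  # default limit stated in this topic
MIN_HEADROOM = 5        # the alert fires when fewer than this many remain


def remaining_slb_quota(used: int, quota: int = DEFAULT_SLB_QUOTA) -> int:
    """Return how many more SLB instances can be created (never negative)."""
    return max(quota - used, 0)


def quota_alert(used: int, quota: int = DEFAULT_SLB_QUOTA) -> bool:
    """Return True when the remaining headroom is below the alert threshold."""
    return remaining_slb_quota(used, quota) < MIN_HEADROOM


print(quota_alert(55))  # 5 remaining -> False
print(quota_alert(56))  # 4 remaining -> True
```

If you have requested a higher quota, pass it as the second argument instead of relying on the default.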

Insufficient quota on SLB backend servers

Condition: The number of ECS instances that can still be associated with an SLB instance is running low.

Impact: Backend pods are spread across multiple ECS instances. When the backend server quota is exhausted, no additional ECS instances can be associated with the SLB instance, causing traffic routing to fail.

Solution: Request a quota increase in the Quota Center console. The default limit is 200 backend servers per SLB instance. For quota details, see Quotas.

Insufficient quota on SLB listeners

Condition: The quota on the number of listeners per SLB instance is running low.

Impact: Each port on a LoadBalancer Service maps to one SLB listener. When the listener quota is exhausted, ports without a listener stop receiving traffic.

Solution: Request a quota increase in the Quota Center console. The default limit is 50 listeners per SLB instance. For quota details, see Quotas.

Insufficient quota on SLB instances

Condition: Fewer than five SLB instances can still be created in your account.

Impact: An SLB instance is created for each LoadBalancer Service. When the SLB instance quota is exhausted, newly created LoadBalancer Services cannot work as expected.

Solution: Request a quota increase in the Quota Center console. The default limit is 60 SLB instances per Alibaba Cloud account.

Resource watermarks

Excessive SLB bandwidth usage

Condition: Peak outbound bandwidth over the previous three days exceeded 80% of the bandwidth limit.

Impact: When the bandwidth limit is reached, the SLB instance drops packets, causing network jitter or increased response latency.

Solution: Upgrade the SLB instance to a higher bandwidth tier. For instructions, see Use an existing SLB instance.

Excessive number of SLB connections

Condition: Peak concurrent connections over the previous three days exceeded 80% of the connection limit.

Impact: When the connection limit is reached, clients cannot establish new connections to the SLB instance.

Solution: Upgrade the SLB instance before connections reach the limit to avoid service interruptions. For instructions, see Use an existing SLB instance.

Excessively high rate of new SLB connections

Condition: The peak rate of new connections over the previous three days exceeded 80% of the upper limit.

Impact: When the rate limit is reached, clients cannot establish new connections within a short period.

Solution: Upgrade the SLB instance before the rate reaches the limit to avoid service interruptions. For instructions, see Use an existing SLB instance.

Excessively high SLB QPS

Condition: Peak queries per second (QPS) over the previous three days exceeded 80% of the upper limit.

Impact: When the QPS limit is reached, clients cannot connect to the SLB instance.

Solution: Upgrade the SLB instance before QPS reaches the limit to avoid service interruptions. For instructions, see Use an existing SLB instance.
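
The four watermark conditions above share the same shape: the peak value observed over the previous three days is compared against 80% of the instance limit. The following sketch shows that comparison, assuming you have exported the metric samples yourself (for example, from your monitoring system). The function names and sample values are illustrative only.

```python
from typing import Iterable

WATERMARK = 0.80  # 80% threshold used by the watermark checks in this topic


def peak_utilization(samples: Iterable[float], limit: float) -> float:
    """Return the peak sample as a fraction of the limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(samples, default=0.0) / limit


def watermark_alert(samples: Iterable[float], limit: float) -> bool:
    """Return True when the peak exceeds the watermark."""
    return peak_utilization(samples, limit) > WATERMARK


# Hypothetical bandwidth samples in Mbit/s against a 500 Mbit/s limit.
print(watermark_alert([120.0, 410.0, 395.0], 500.0))  # peak is 82% -> True
```

The same function applies to bandwidth, concurrent connections, new-connection rate, and QPS, as long as the samples and the limit use the same unit.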

Versions and certificates

Outdated Kubernetes version of a cluster

Condition: The cluster is running an outdated Kubernetes major version, or the current version is nearing end of support.

Impact: Outdated versions no longer receive security patches or feature updates and may lose compatibility with newer workloads and tooling.

Solution: Update the cluster to a supported Kubernetes version as soon as possible.
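
When you compare versions in a script, compare the numeric components rather than the raw strings, because "1.9" sorts after "1.28" as text. The following sketch assumes a version string such as v1.28.3-aliyun.1. Both the example string and the oldest supported version are placeholders, so take the real values from your inspection report or the release notes.

```python
import re
from typing import Tuple


def parse_k8s_version(version: str) -> Tuple[int, int]:
    """Extract the first two numeric components from a version string."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_outdated(current: str, oldest_supported: str) -> bool:
    """Return True when current is older than oldest_supported."""
    return parse_k8s_version(current) < parse_k8s_version(oldest_supported)


# Placeholder values for illustration only.
print(is_outdated("v1.26.3-aliyun.1", "1.28"))  # True
```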

Cluster risks

No SLB instance associated with the API server

Condition: No SLB instance is fronting the API server.

Impact: With only a single API server and no load balancer, the API server becomes a single point of failure (SPOF). If that API server fails, the cluster becomes unresponsive.

Solution: Associate an SLB instance with the API server.

Abnormal status of the SLB instance associated with the API server

Condition: The SLB instance in front of the API server is in an abnormal state.

Impact: All cluster operations, such as pod scheduling, service deployment, and scale-out, are interrupted or delayed. Service discovery also fails because it depends on the API server.

Solution: Check the SLB instance configurations, including backend servers, listening ports, and health checks.

Errors in the listener configuration on port 6443 for the SLB instance associated with the API server

Condition: The HTTPS listener on port 6443 of the API server's SLB instance is misconfigured or missing.

Impact: All requests routed through the SLB instance to the API server fail. This includes kubectl operations, dashboard access, and API calls from other services. Service name resolution also fails because DNS queries depend on the API server.

Solution: Check the SLB instance configurations, including backend servers, listening ports, and health checks. Verify that an HTTPS listener on port 6443 is configured.
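
As a quick first check from a machine that is allowed to reach the API server, you can test whether port 6443 accepts TCP connections. The following is a minimal sketch, and the function name is hypothetical. A successful connection only shows that something is listening on the port. It does not validate the TLS configuration, the backend servers, or the health checks.

```python
import socket


def port_reachable(host: str, port: int = 6443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Call it with the address of the SLB instance in front of your API server, for example port_reachable("<slb-address>"), replacing the placeholder with your own value.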

Errors in the access control configuration of the SLB instance associated with the API server

Condition: The access control configuration on the API server's SLB instance contains errors.

Impact: Cluster management operations, such as node management, pod scheduling, and service deployment, are blocked or limited. Any workload that depends on the API server for communication or service discovery is also affected.

Solution:

Review security groups and access control lists (ACLs) on the SLB instance. Verify that the required IP addresses and port 6443 are allowed.
Check the TLS/SSL configuration on both the SLB instance and the API server. Verify that certificates are valid.

Abnormal cluster IP address of the DNS service

Condition: The cluster IP address of the DNS service has not been assigned or is invalid.

Impact: DNS resolution fails across the cluster, causing cascading failures in workloads that rely on service name resolution.

Solution:

Check network plugin configurations for conflicts or errors.
Redeploy CoreDNS to restore a valid cluster IP address assignment.
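
To confirm whether a cluster IP address is usable, you can check that it parses as an IP address and falls inside the Service CIDR block of the cluster. The following sketch assumes you have already read both values (for example, from the DNS Service object and your cluster settings). The function name and the sample addresses are illustrative.

```python
import ipaddress
from typing import Optional


def cluster_ip_ok(cluster_ip: Optional[str], service_cidr: str) -> bool:
    """Return True if cluster_ip is a valid address inside service_cidr."""
    if not cluster_ip or cluster_ip == "None":  # unassigned, or a headless Service
        return False
    try:
        address = ipaddress.ip_address(cluster_ip)
    except ValueError:
        return False
    return address in ipaddress.ip_network(service_cidr, strict=False)


# Sample values for illustration only.
print(cluster_ip_ok("172.16.0.10", "172.16.0.0/16"))  # True
print(cluster_ip_ok("", "172.16.0.0/16"))             # False
```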

No endpoints available for the DNS service

Condition: The DNS service has zero backend endpoints.

Impact: The DNS service is completely unavailable. All workloads that depend on service name resolution fail.

Solution: Check the Corefile configuration. Verify that the forward or proxy directive points to a valid set of backend DNS servers.
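
When you review the Corefile, it helps to list the upstream targets that the forward or proxy directives name. The following rough sketch extracts them with a regular expression. It is not a full Corefile parser, and the sample Corefile is illustrative rather than the one shipped with your cluster.

```python
import re
from typing import List

# Matches lines such as "forward . /etc/resolv.conf {" or "proxy . 10.0.0.2 10.0.0.3".
_UPSTREAM_LINE = re.compile(
    r"^[ \t]*(?:forward|proxy)[ \t]+(\S+)[ \t]+(.+?)[ \t]*\{?[ \t]*$",
    re.MULTILINE,
)

SAMPLE_COREFILE = """\
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
    }
    forward . /etc/resolv.conf
    cache 30
}
"""


def corefile_upstreams(corefile: str) -> List[str]:
    """Return every upstream target named by forward or proxy directives."""
    upstreams: List[str] = []
    for match in _UPSTREAM_LINE.finditer(corefile):
        upstreams.extend(match.group(2).split())
    return upstreams


print(corefile_upstreams(SAMPLE_COREFILE))  # ['/etc/resolv.conf']
```

An empty result means that no forward or proxy directive was found, which is worth investigating before you look at the individual upstream servers.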

One SLB port shared by multiple Services

Condition: Multiple Kubernetes Services are sharing the same port on a single SLB instance.

Impact: Port conflicts cause one or more of those Services to become unavailable.

Solution: Delete or update the conflicting Services so that each Service uses a distinct port on the shared SLB instance.
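
To find the conflicting Services, group them by SLB instance and port, then keep the groups that contain more than one Service. The following sketch assumes you have already collected (Service name, SLB instance ID, port) tuples, for example from your Service manifests. The names and IDs shown are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Binding = Tuple[str, str, int]  # (Service name, SLB instance ID, port)


def find_port_conflicts(bindings: Iterable[Binding]) -> Dict[Tuple[str, int], List[str]]:
    """Map each (SLB instance ID, port) used by several Services to their names."""
    by_port = defaultdict(set)
    for service, slb_id, port in bindings:
        by_port[(slb_id, port)].add(service)
    return {key: sorted(names) for key, names in by_port.items() if len(names) > 1}


# Hypothetical bindings for illustration only.
bindings = [
    ("web", "lb-example-1", 80),
    ("api", "lb-example-1", 80),
    ("api", "lb-example-1", 443),
    ("metrics", "lb-example-2", 80),
]
print(find_port_conflicts(bindings))  # {('lb-example-1', 80): ['api', 'web']}
```

Each key in the result identifies a port that needs attention. Move all but one of the listed Services to a different port or a different SLB instance.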