Container Service for Kubernetes (ACK) provides the cluster inspection feature, which allows you to periodically inspect your cluster and identify potential risks. This topic describes the alerts that are generated by the cluster inspection feature for common issues and the solutions to these issues.

Background information

For more information about cluster inspection, see Use the cluster inspection feature to identify potential risks.

The following list describes the check items that are supported by cluster inspection and the alerts that each check item can generate.

ResourceQuotas

Insufficient quota on VPC route entries
Insufficient quota on SLB instances in the VPC
Insufficient quota on SLB instances that can be associated with an ECS instance
Insufficient quota on SLB backend servers
Insufficient quota on SLB listeners

ResourceLevel

Excessive SLB bandwidth usage
Excessive SLB connections
Excessively high rate of new SLB connections per second
Excessively high SLB QPS
Insufficient number of available pod CIDR blocks
Excessive number of NAT gateway connections
Excessively high CPU usage on nodes
Excessively high memory usage on nodes
Insufficient number of idle vSwitch IP addresses

Versions&Certificates

Outdated Kubernetes version of the cluster
Cluster certificate about to expire
Outdated CoreDNS version
Outdated Ingress version
Outdated systemd version on nodes
Outdated operating system version on nodes

ClusterRisk

Abnormal CoreDNS deployment
CoreDNS pods deployed on master nodes
Docker hang error on nodes

Insufficient quota on VPC route entries

Alert description: The number of route entries that you can add to the route table of the virtual private cloud (VPC) is less than five. In a cluster that has Flannel installed, each node occupies a VPC route entry. If the quota on VPC route entries is exhausted, you cannot add nodes to the cluster. Clusters that have Terway installed do not use VPC route entries.

Solution: By default, you can add at most 200 route entries to the route table of a VPC. If you want to increase the quota, go to the Quota Center page and submit a ticket. For more information, see Quotas.
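
In a cluster that has Flannel installed, each node consumes one VPC route entry, so the number of nodes is a rough estimate of how many route entries the cluster uses. As a quick check, you can count the nodes with the following command:

kubectl get nodes --no-headers | wc -l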

Insufficient quota on SLB instances in the VPC

Alert description: The number of Server Load Balancer (SLB) instances that you can create in the VPC is less than five. Each LoadBalancer Service in an ACK cluster occupies an SLB instance. If the quota is exhausted, the new LoadBalancer Services that you create cannot work as expected.

Solution: By default, you can create at most 60 SLB instances. If you want to increase the quota, submit a ticket. For more information, see Quotas.
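
Each Service of type LoadBalancer consumes one SLB instance. To estimate how many SLB instances your cluster currently consumes, you can list these Services across all namespaces, for example with a simple grep-based check:

kubectl get svc --all-namespaces | grep LoadBalancer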

Insufficient quota on SLB instances that can be associated with an ECS instance

Alert description: The quota on the number of SLB instances that can be associated with an Elastic Compute Service (ECS) instance is insufficient. When pods are exposed by LoadBalancer Services, the ECS instances on which the pods are deployed are associated with the SLB instances of these Services. If the quota is exhausted, new pods that you deploy and expose through LoadBalancer Services cannot process requests as expected.

Solution: By default, you can add an ECS instance to at most 50 SLB server groups. To increase the quota, submit a ticket. For more information, see Quotas.

Insufficient quota on SLB backend servers

Alert description: The quota on the number of ECS instances that can be associated with an SLB instance is insufficient. If you create a large number of LoadBalancer Services, the backend pods are distributed across multiple ECS instances. If the quota on ECS instances that can be associated with an SLB instance is exhausted, you cannot associate ECS instances with the SLB instance.

Solution: By default, you can associate at most 200 backend servers with an SLB instance. To increase the quota, submit a ticket. For more information, see Quotas.

Insufficient quota on SLB listeners

Alert description: The quota on the number of listeners that you can add to an SLB instance is insufficient. A LoadBalancer Service listens on specific ports. Each port corresponds to an SLB listener. If the number of ports on which a LoadBalancer Service listens exceeds the quota, the ports that are not monitored by listeners cannot provide services as expected.

Solution: By default, you can add at most 50 listeners to an SLB instance. To increase the quota, submit a ticket. For more information, see Quotas.
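
Each port in the ports section of a LoadBalancer Service maps to one SLB listener. The following manifest is a minimal sketch with hypothetical names and ports; it would consume two listeners on the SLB instance:

apiVersion: v1
kind: Service
metadata:
  name: example-service    # hypothetical name
spec:
  type: LoadBalancer
  selector:
    app: example           # hypothetical label
  ports:
  - name: http
    port: 80               # listener 1
    targetPort: 8080
  - name: https
    port: 443              # listener 2
    targetPort: 8443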

Excessive SLB bandwidth usage

Alert description: The peak value of outbound bandwidth usage within the previous three days is higher than 80% of the bandwidth limit. If the bandwidth resources of the SLB instance are exhausted, the SLB instance may drop packets. This causes network jitter or increases response latency.

Solution: If the bandwidth usage of the SLB instance is excessively high, upgrade the SLB instance. For more information, see Use an existing SLB instance.

Excessive SLB connections

Alert description: The peak value of SLB connections within the previous three days is higher than 80% of the upper limit. If the number of SLB connections reaches the upper limit, clients cannot establish connections to the SLB instance.

Solution: If an excessive number of connections are established to the SLB instance within the previous three days, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Excessively high rate of new SLB connections per second

Alert description: The highest rate of new SLB connections per second within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.

Solution: If the rate of new SLB connections per second is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Excessively high SLB QPS

Alert description: The highest queries per second (QPS) value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.

Solution: If the QPS value of the SLB instance is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Insufficient number of available pod CIDR blocks

Alert description: The number of available pod CIDR blocks in an ACK cluster that has Flannel installed is less than five. Each node in the cluster is assigned a pod CIDR block, so you can add fewer than five more nodes to the cluster. If all of the pod CIDR blocks are used, new nodes that you add to the cluster cannot work as expected.

Solution: Submit a ticket.
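
To check which pod CIDR block is assigned to each node in a Flannel cluster, you can print the podCIDR field from the node spec:

kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR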

Excessive number of NAT gateway connections

Alert description: The peak value of NAT gateway connections within the previous seven days reaches 85% of the upper limit. If the number of NAT gateway connections reaches the upper limit, your applications cannot access the Internet through the NAT gateway. As a result, service interruptions may occur.

Solution: Upgrade the NAT gateway. For more information, see FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways.

Excessively high CPU usage on nodes

Alert description: The CPU usage on nodes within the previous seven days is excessively high. A large number of pods are scheduled to the nodes, and the pods compete for resources. This increases the CPU usage and may result in service interruptions.

Solution: If the peak CPU usage on nodes within the previous seven days reaches the upper limit, set proper resource requests and limits for your pods to avoid service interruptions. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.

Excessively high memory usage on nodes

Alert description: The memory usage on nodes within the previous seven days is excessively high. A large number of pods are scheduled to the nodes, and the pods compete for resources. This increases the memory usage, leads to out-of-memory (OOM) errors, and may result in service interruptions.

Solution: If the peak memory usage on nodes within the previous seven days reaches 90% of the upper limit, set proper resource requests and limits for your pods to avoid service interruptions. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
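
For both of the preceding alerts, explicit resource requests and limits help the scheduler spread pods across nodes and prevent a single pod from exhausting the CPU or memory of a node. The following pod spec is a minimal sketch; the name, image, and values are placeholders that you must adjust to your workloads:

apiVersion: v1
kind: Pod
metadata:
  name: example-app                           # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest    # placeholder image
    resources:
      requests:
        cpu: 500m                              # guaranteed share used for scheduling
        memory: 512Mi
      limits:
        cpu: "1"                               # hard ceiling for the container
        memory: 1Gi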

Insufficient number of idle vSwitch IP addresses

Alert description: The number of idle vSwitch IP addresses in a cluster that has Terway installed is less than 10. Each pod occupies a vSwitch IP address. If the vSwitch IP addresses are exhausted, new pods cannot be assigned IP addresses and cannot start as expected.

Solution: Create a vSwitch for the cluster or change the vSwitch that is specified for the cluster. For more information, see What do I do if an ACK cluster in which Terway is installed has insufficient idle vSwitch IP addresses?.

Outdated Kubernetes version of the cluster

Alert description: The Kubernetes version of the cluster is outdated or will be outdated soon. ACK clusters can stably run the latest three major versions of Kubernetes. ACK allows you to update ACK clusters from the two previous major versions to the latest major version. For example, you can update ACK clusters from Kubernetes V1.16 or Kubernetes V1.18 to Kubernetes V1.20. Stability issues or update failures may arise in ACK clusters that run an outdated Kubernetes major version.

Solution: If your cluster runs an outdated Kubernetes major version, update the cluster at the earliest opportunity. For more information, see Update the Kubernetes version of an ACK cluster.

Cluster certificate about to expire

Alert description: The certificate of the cluster is about to expire. If the certificate expires, the cluster cannot work as expected.

Solution: Renew the certificate at the earliest opportunity. For more information, see Renew expiring Kubernetes cluster certificates.
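
To verify when the API server certificate of your cluster expires, you can query the API server endpoint with openssl. The endpoint below is a placeholder that you must replace with the endpoint of your cluster:

echo | openssl s_client -connect <api-server-endpoint>:6443 2>/dev/null | openssl x509 -noout -enddate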

Outdated CoreDNS version

Alert description: The version of the CoreDNS component that is installed in the cluster is outdated. This may cause DNS resolution errors. The latest CoreDNS version provides higher stability and new features.

Solution: To avoid DNS resolution errors, update the CoreDNS component at the earliest opportunity. For more information, see Manually update CoreDNS.
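
To check which CoreDNS version is currently running, you can print the image of the CoreDNS Deployment in the kube-system namespace. The Deployment is typically named coredns; adjust the name if your cluster uses a different one:

kubectl -n kube-system get deployment coredns -o jsonpath='{.spec.template.spec.containers[0].image}'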

Outdated Ingress version

Alert description: The version of the Ingress component that is installed in the cluster is outdated. This may cause Ingress errors and interrupt traffic forwarding. The latest Ingress version provides higher stability and new features.

Solution: To avoid service interruptions that are caused by Ingress errors, update the Ingress component at the earliest opportunity. For more information, see Nginx Ingress FAQ.

Outdated systemd version on nodes

Alert description: The systemd version is outdated and has stability issues that can cause the Docker and containerd components to malfunction.

Solution: For more information about how to fix this issue, see What do I do if the error message "Reason:KubeletNotReady Message:PLEG is not healthy:" appears in the logs of the kubelets in a Kubernetes cluster that runs CentOS 7.6?.

Outdated operating system version on nodes

Alert description: The operating system version is outdated and has stability issues that can cause the Docker and containerd components to malfunction.

Solution: Create a new node pool, temporarily migrate the workloads to the new node pool, and then update the operating system of the nodes in the current node pool. For more information, see Create a node pool.
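
After the new node pool is ready, you can migrate workloads off the old nodes with the standard cordon and drain workflow before you update the operating system. The node name is a placeholder; for kubectl versions earlier than 1.20, use --delete-local-data instead of --delete-emptydir-data:

kubectl cordon <old-node-name>
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data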

Abnormal CoreDNS deployment

Alert description: The CoreDNS pods are deployed on the same node. If the node fails or restarts, CoreDNS cannot provide services as expected.

Solution: Update CoreDNS to the latest version. In the latest CoreDNS version, two CoreDNS pods cannot be deployed on the same node. For more information, see Configure ACK to automatically update CoreDNS.
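
The spreading behavior in newer CoreDNS versions relies on pod anti-affinity. The following fragment is a generic sketch of the kind of rule that prevents two replicas with the k8s-app: kube-dns label from being scheduled to the same node; it is not the exact ACK CoreDNS manifest:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          k8s-app: kube-dns
      topologyKey: kubernetes.io/hostname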

CoreDNS pods deployed on master nodes

Alert description: If the CoreDNS pods are deployed on master nodes, the master nodes may be overloaded. This may affect the control plane.

Solution: Delete the CoreDNS pods on the master nodes one at a time and wait for the system to recreate them. Then, check whether the new CoreDNS pods are scheduled to worker nodes.
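
You can check where the CoreDNS pods run and delete them one at a time with commands similar to the following. The pod name is a placeholder, and the example assumes that the CoreDNS pods carry the standard k8s-app=kube-dns label:

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system delete pod <coredns-pod-name>
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide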

Docker hang error on nodes

Alert description: Docker hangs on nodes.

Solution: Run the systemctl restart docker command to restart Docker on the nodes.
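
A typical sequence is to check the status and recent logs of the Docker daemon on the affected node before restarting it:

systemctl status docker
journalctl -u docker --since "1 hour ago"
systemctl restart docker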