Container Service for Kubernetes: Check items and solutions

Last Updated: Aug 31, 2023

Container Service for Kubernetes provides the cluster inspection feature. You can configure inspection rules to periodically inspect your cluster and identify potential risks. This topic describes the alerts that are generated by the cluster inspection feature for common issues and the solutions to these issues.

Check items

Note: For more information about cluster inspection, see Work with the cluster inspection feature.

Check item: ResourceQuotas
- Insufficient quota on VPC route entries
- Insufficient quota on SLB instances in the VPC
- Insufficient quota on SLB instances that can be associated with an ECS instance
- Insufficient quota on SLB backend servers
- Insufficient quota on SLB listeners

Check item: ResourceLevel
- Excessive SLB bandwidth usage
- Excessive SLB connections
- Excessively high rate of new SLB connections per second
- Excessively high SLB QPS
- Insufficient number of available pod CIDR blocks
- Excessive number of NAT gateway connections
- Excessively high CPU usage on nodes
- Excessively high memory usage on nodes
- Insufficient number of idle vSwitch IP addresses
- Excessively high rate of new SLB connections per second of the Ingress controller
- Excessively high SLB QPS of the Ingress controller

Check item: Versions&Certificates
- Outdated Kubernetes version of the cluster
- Cluster certificate about to expire
- Outdated CoreDNS version
- Outdated Ingress version
- Outdated systemd version on nodes
- Outdated operating system version on nodes

Check item: ClusterRisk
- Abnormal CoreDNS deployment
- CoreDNS pods deployed on control planes
- Docker hang error on nodes
- Incorrect maximum number of pods supported by the node
- SLB health check failures of the Ingress controller
- Low percentage of ready Ingress pods
- Error logs in the Ingress controller pod
- Use of rewrite-target annotation without specifying capture groups
- Improper canary release rules of the NGINX Ingress
- Incorrect NGINX Ingress annotations
- Deprecated components
- Connectivity errors to the Kubernetes API server
- Inaccessibility to the Internet

Insufficient quota on VPC route entries

Alert description: The number of route entries that you can add to the route table of the virtual private cloud (VPC) is less than five. In a cluster that has Flannel installed, each node occupies one VPC route entry. If the quota on VPC route entries is exhausted, you cannot add nodes to the cluster. Clusters that have Terway installed do not use VPC route entries.

Solution: By default, you can add at most 200 route entries to the route table of a VPC. To increase the quota, submit an application in the Quota Center. For more information, see Quotas.

Insufficient quota on SLB instances in the VPC

Alert description: The number of Server Load Balancer (SLB) instances that you can create in the VPC is less than five. Each LoadBalancer Service in an ACK cluster occupies one SLB instance. If the quota is exhausted, the new LoadBalancer Services that you create cannot work as expected.

Solution: By default, you can have a maximum of 60 SLB instances within each Alibaba Cloud account. To increase the quota, submit an application in the Quota Center. For more information, see Quotas.
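
For context, each Service of type LoadBalancer causes one SLB instance to be created. A minimal sketch, with hypothetical names and ports:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app-lb              # hypothetical name
    spec:
      type: LoadBalancer           # each LoadBalancer Service occupies one SLB instance
      selector:
        app: my-app
      ports:
      - port: 80
        targetPort: 8080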

Insufficient quota on SLB instances that can be associated with an ECS instance

Alert description: The quota on the number of SLB instances that can be associated with an Elastic Compute Service (ECS) instance is insufficient. For pods that are connected to LoadBalancer Services, the ECS instances on which the pods are deployed are associated with SLB instances. If the quota is exhausted, the new pods that you deploy and associate with the LoadBalancer Services cannot process requests as expected.

Solution: By default, you can add an ECS instance to at most 50 SLB server groups. To increase the quota, submit an application in the Quota Center. For more information, see Quotas.

Insufficient quota on SLB backend servers

Alert description: The quota on the number of ECS instances that can be associated with an SLB instance is insufficient. If you create a large number of LoadBalancer Services, the backend pods are distributed across multiple ECS instances. If the quota on ECS instances that can be associated with an SLB instance is exhausted, you cannot associate ECS instances with the SLB instance.

Solution: By default, you can associate at most 200 backend servers with an SLB instance. To increase the quota, submit an application in the Quota Center. For more information, see Quotas.

Insufficient quota on SLB listeners

Alert description: The quota on the number of listeners that you can add to an SLB instance is insufficient. A LoadBalancer Service listens on specific ports, and each port corresponds to an SLB listener. If the number of ports on which a LoadBalancer Service listens exceeds the quota, the ports that do not have listeners cannot provide services as expected.

Solution: By default, you can add at most 50 listeners to an SLB instance. To increase the quota, submit an application in the Quota Center. For more information, see Quotas.
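
As an illustration, each port of a LoadBalancer Service maps to one SLB listener, so a Service with many ports consumes the listener quota quickly. A hypothetical two-port Service:

    apiVersion: v1
    kind: Service
    metadata:
      name: multi-port-svc         # hypothetical name
    spec:
      type: LoadBalancer
      selector:
        app: my-app
      ports:
      - name: http
        port: 80                   # one SLB listener
        targetPort: 8080
      - name: https
        port: 443                  # a second SLB listener
        targetPort: 8443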

Excessive SLB bandwidth usage

Alert description: The peak value of outbound bandwidth usage within the previous three days is higher than 80% of the bandwidth limit. If the bandwidth resources of the SLB instance are exhausted, the SLB instance may drop packets. This causes network jitter or increased response latency.

Solution: If the bandwidth usage of the SLB instance is excessively high, upgrade the SLB instance. For more information, see Use an existing SLB instance.

Excessive SLB connections

Alert description: The peak value of SLB connections within the previous three days is higher than 80% of the upper limit. If the number of SLB connections reaches the upper limit, clients cannot establish connections to the SLB instance.

Solution: If an excessive number of connections are established to the SLB instance within the previous three days, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Excessively high rate of new SLB connections per second

Alert description: The highest rate of new SLB connections per second within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.

Solution: If the rate of new SLB connections per second is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Excessively high SLB QPS

Alert description: The highest queries per second (QPS) value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.

Solution: If the QPS value of the SLB instance is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Insufficient number of available pod CIDR blocks

Alert description: The number of available pod CIDR blocks in an ACK cluster that has Flannel installed is less than five. Each node in the cluster is assigned one pod CIDR block, so fewer than five more nodes can be added to the cluster. If all of the pod CIDR blocks are used, new nodes that you add to the cluster cannot work as expected.

Solution: Submit a ticket.

Excessive number of NAT gateway connections

Alert description: The peak value of NAT gateway connections within the previous seven days reaches 85% of the upper limit. If the number of NAT gateway connections reaches the upper limit, your applications cannot access the Internet through the NAT gateway. As a result, service interruptions may occur.

Solution: Upgrade the NAT gateway. For more information, see FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways.

Excessively high CPU usage on nodes

Alert description: The CPU usage on nodes within the previous seven days is excessively high. A large number of pods are scheduled to the nodes, and the pods compete for resources. This increases the CPU usage and may result in service interruptions.

Solution: If the peak value of the CPU usage on nodes within the previous seven days reaches the upper limit (100%), set proper resource requests and limits for your pods to avoid service interruptions. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.

Excessively high memory usage on nodes

Alert description: The memory usage on nodes within the previous seven days is excessively high. A large number of pods are scheduled to the nodes, and the pods compete for resources. This increases the memory usage, leads to out of memory (OOM) errors, and may result in service interruptions.

Solution: If the peak value of the memory usage on nodes within the previous seven days reaches 90% of the upper limit, set proper resource requests and limits for your pods to avoid service interruptions. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
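
For both the CPU and memory checks above, the remedy is to declare resource requests and limits on your workloads. A minimal sketch; the name, image, and values are illustrative and should be tuned to your application's actual usage:

    apiVersion: v1
    kind: Pod
    metadata:
      name: resource-demo          # hypothetical name
    spec:
      containers:
      - name: app
        image: nginx:1.25          # illustrative image
        resources:
          requests:
            cpu: 250m              # amount reserved for scheduling decisions
            memory: 256Mi
          limits:
            cpu: 500m              # hard caps that keep one pod from starving the node
            memory: 512Mi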

Insufficient number of idle vSwitch IP addresses

Alert description: The number of idle vSwitch IP addresses in a cluster that has Terway installed is less than 10. Each pod occupies one vSwitch IP address. If the vSwitch IP addresses are exhausted, new pods cannot be assigned IP addresses and cannot start as expected.

Solution: Create a vSwitch for the cluster or change the vSwitch that is specified for the cluster. For more information, see What do I do if an ACK cluster in which Terway is installed has insufficient idle vSwitch IP addresses?.

Excessively high rate of new SLB connections per second of the Ingress controller

Alert description: The highest rate of new SLB connections per second within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.

Solution: If the rate of new SLB connections per second is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Excessively high SLB QPS of the Ingress controller

Alert description: The highest QPS value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.

Solution: If the QPS value of the SLB instance is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.

Outdated Kubernetes version of the cluster

Alert description: The Kubernetes version of the cluster is outdated or will be outdated soon. ACK clusters can stably run the latest three major versions of Kubernetes. ACK allows you to update ACK clusters from the two previous major versions to the latest major version. For example, you can update ACK clusters from Kubernetes V1.16 or Kubernetes V1.18 to Kubernetes V1.20. Stability issues or update failures may arise in ACK clusters that run an outdated Kubernetes major version. For more information about the Kubernetes versions supported by ACK and their release notes, see Support for Kubernetes versions and Overview of Kubernetes versions supported by ACK.

Solution: If your cluster runs an outdated Kubernetes major version, update the cluster at the earliest opportunity. For more information, see Update the Kubernetes version of an ACK cluster.

Cluster certificate about to expire

Alert description: The certificate of the cluster is about to expire. If the certificate expires, the cluster cannot work as expected.

Solution: Renew the certificate at the earliest opportunity. For more information, see Renew expiring Kubernetes cluster certificates.

Outdated CoreDNS version

Alert description: The version of the CoreDNS component that is installed in the cluster is outdated. This may cause DNS resolution errors. The latest CoreDNS version provides higher stability and new features.

Solution: To avoid DNS resolution errors, update the CoreDNS component at the earliest opportunity. For more information, see Manually update CoreDNS.
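
To find out which CoreDNS version is currently running, you can read the image tag of the CoreDNS Deployment. A sketch, assuming the Deployment is named coredns in the kube-system namespace (the common default):

    kubectl -n kube-system get deployment coredns \
      -o jsonpath='{.spec.template.spec.containers[0].image}'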

Outdated Ingress version

Alert description: The version of the Ingress component that is installed in the cluster is outdated. This may cause Ingress errors and interrupt traffic forwarding. The latest Ingress version provides higher stability and new features.

Solution: To avoid service interruptions that are caused by Ingress errors, update the Ingress component at the earliest opportunity. For more information, see Nginx Ingress FAQ.

Outdated systemd version on nodes

Alert description: The systemd version is outdated and has stability issues that can cause the Docker and containerd components to malfunction.

Solution: For more information about how to fix this issue, see What do I do if the error message "Reason:KubeletNotReady Message:PLEG is not healthy:" appears in the logs of the kubelets in a Kubernetes cluster that runs CentOS 7.6?.

Outdated operating system version on nodes

Alert description: The operating system version is outdated and has stability issues that can cause the Docker and containerd components to malfunction.

Solution: Create a new node pool, temporarily migrate the workloads to the new node pool, and then update the operating system of the nodes in the current node pool. For more information, see Create a node pool.

Abnormal CoreDNS deployment

Alert description: The CoreDNS pods are deployed on the same node. If the node fails or restarts, CoreDNS cannot provide services as expected.

Solution: Update CoreDNS to the latest version. In the latest CoreDNS version, two CoreDNS pods cannot be deployed on the same node. For more information, see Configure ACK to automatically update CoreDNS.
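
To verify how the CoreDNS pods are distributed, list them together with the nodes they run on. A sketch, assuming the pods carry the default k8s-app=kube-dns label:

    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide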

CoreDNS pods deployed on control planes

Alert description: The CoreDNS pods are deployed on control planes, which may overload the control planes.

Solution: Delete the CoreDNS pods from the control planes one after another and wait for the system to recreate the CoreDNS pods. Then, check whether the CoreDNS pods are scheduled to worker nodes.
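
A command sketch of that procedure; <coredns-pod-name> is a placeholder, and you should wait for each recreated pod to become Ready before deleting the next one:

    # Delete one CoreDNS pod; the Deployment recreates it automatically
    kubectl -n kube-system delete pod <coredns-pod-name>
    # Confirm that the recreated pod was scheduled to a worker node
    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide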

Docker hang error on nodes

Alert description: Docker hangs on nodes.

Solution: Run the systemctl restart docker command to restart Docker on the nodes.
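
A sketch of the restart procedure with checks before and after, to be run on the affected node:

    systemctl status docker      # confirm that the daemon is hung or inactive
    systemctl restart docker
    systemctl status docker      # verify that the daemon is active (running) again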

Incorrect maximum number of pods supported by the node

Alert description: The maximum number of pods supported by the node is different from the theoretical value.

Solution: If the maximum number of pods supported by a node is different from the theoretical value and you have never modified this limit, submit a ticket.
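
To view the limit that is currently in effect on a node, you can query its allocatable pod count; <node-name> is a placeholder:

    kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'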

SLB health check failures of the Ingress controller

Alert description: The SLB instance failed health checks within the previous three days. The failures may be caused by high component loads or incorrect component configurations.

Solution: To avoid service interruptions, check whether abnormal events are generated for the Ingress controller Service and whether the component loads are excessively high. For more information about how to troubleshoot Ingress controller SLB issues, see NGINX Ingress controller troubleshooting.
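
To check for abnormal events, describe the Ingress controller Service and read its Events section. A sketch, assuming the default ACK Service name nginx-ingress-lb in the kube-system namespace; adjust the name to your setup:

    kubectl -n kube-system describe service nginx-ingress-lb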

Low percentage of ready Ingress pods

Alert description: The percentage of ready pods among the pods created for the Ingress Deployment is lower than 100%. This indicates that some Ingress pods failed to start or are failing health checks.

Solution: Use the pod diagnostics feature or refer to the Ingress troubleshooting documentation to identify the pods that are not ready. For more information, see NGINX Ingress controller troubleshooting.
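
To identify pods that are not ready from the command line, list the Ingress controller pods and inspect the READY column. A sketch, assuming the pods carry the app=ingress-nginx label; adjust the label and namespace to your deployment:

    kubectl -n kube-system get pods -l app=ingress-nginx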

Error logs in the Ingress controller pod

Alert description: The Ingress controller pod generates error logs. This indicates that the Ingress controller does not work as expected.

Solution: Troubleshoot the issues based on the error logs. For more information, see NGINX Ingress controller troubleshooting.
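
To read the logs, fetch them from the controller pod; <ingress-controller-pod-name> is a placeholder:

    kubectl -n kube-system logs <ingress-controller-pod-name> --tail=100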

Use of rewrite-target annotation without specifying capture groups

Alert description: The rewrite-target annotation is specified in the rules of the NGINX Ingress but capture groups are not specified. In Ingress controller 0.22.0 or later, you must specify capture groups if the rewrite-target annotation is configured. Otherwise, traffic forwarding is interrupted.

Solution: Reconfigure the rules of the NGINX Ingress and specify capture groups. For more information, see NGINX Ingress controller troubleshooting.
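
A minimal sketch of a rewrite rule with a capture group, in the form required by Ingress controller 0.22.0 and later; the host, path, and backend Service are illustrative:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: rewrite-demo                  # hypothetical name
      annotations:
        # $2 refers to the second capture group in the path below
        nginx.ingress.kubernetes.io/rewrite-target: /$2
    spec:
      rules:
      - host: demo.example.com
        http:
          paths:
          - path: /app(/|$)(.*)           # capture groups required by rewrite-target
            pathType: ImplementationSpecific
            backend:
              service:
                name: app-svc             # hypothetical Service
                port:
                  number: 80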

Improper canary release rules of the NGINX Ingress

Alert description: The service-match or service-weight annotation is configured for more than two Services. The service-match or service-weight annotation supports at most two Services for traffic distribution. If the service-match or service-weight annotation is configured for more than two Services, the additional Services are ignored and traffic is not forwarded as expected.

Solution: Reduce the number of Services to two.
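
A sketch of a weight-based canary rule that distributes traffic between exactly two Services; the names, host, and weights are illustrative, and the annotation format follows the ACK documentation:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: canary-demo                   # hypothetical name
      annotations:
        # At most two Services may share traffic; weights are illustrative
        nginx.ingress.kubernetes.io/service-weight: "new-nginx: 20, old-nginx: 80"
    spec:
      rules:
      - host: demo.example.com
        http:
          paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: new-nginx
                port:
                  number: 80
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: old-nginx
                port:
                  number: 80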

Incorrect NGINX Ingress annotations

Alert description: Annotations that start with nginx.com or nginx.org are used instead of annotations that start with nginx.ingress.kubernetes.io. The open source NGINX Ingress controller cannot recognize annotations that start with nginx.com or nginx.org. If such annotations are used, the relevant configurations are not applied to the NGINX Ingress controller.

Solution: Use the annotations supported by the NGINX Ingress controller. For more information about NGINX Ingress annotations, see Alibaba Cloud documentation or Community documentation.
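
For example, an annotation must use the supported prefix to take effect; the following fragment is illustrative (ssl-redirect is a community NGINX Ingress controller annotation):

    metadata:
      annotations:
        # Not recognized by the open source NGINX Ingress controller:
        #   nginx.org/ssl-services: "my-svc"
        # Supported form:
        nginx.ingress.kubernetes.io/ssl-redirect: "true"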

Deprecated components

Alert description: Deprecated components are installed in the cluster.

Solution: The alicloud-application-controller component is discontinued. If the component is installed in the cluster, you may fail to update or use the cluster as expected. If deprecated components are installed in the cluster, uninstall the components. For more information, see Manage system components.

Connectivity errors to the Kubernetes API server

Alert description: The node cannot connect to the Kubernetes API server of the cluster.

Solution: Check cluster configurations. For more information, see Troubleshoot ACK clusters.
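
A basic connectivity check that you can run from the affected node; <api-server-endpoint> is a placeholder for your cluster's API server address, and /healthz is the standard Kubernetes health endpoint:

    # An HTTP 200 response with "ok" indicates that the API server is reachable
    curl -k https://<api-server-endpoint>:6443/healthz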

Inaccessibility to the Internet

Alert description: The node cannot access the Internet.

Solution: Check whether SNAT is enabled for the cluster. For more information about how to enable SNAT, see Enable an existing ACK cluster to access the Internet.
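
A quick check that you can run from the node to confirm whether Internet access works; the URL is illustrative:

    curl -I --connect-timeout 5 https://www.aliyun.com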