As clusters grow and nodes scale frequently, manual troubleshooting is no longer fast enough to locate the cause of a scaling failure or to track historical issues. In addition, some anomalies can be detected only through long-term data collection and analysis, which is difficult to do by hand. This topic describes how to quickly troubleshoot issues based on the charts displayed in the node scaling dashboard. These charts display detailed information about pods, nodes, and their changes.

Node scaling dashboard details

The node scaling dashboard consists of four areas. The following sections describe the data that you can view in each area.

Overview area

The five charts in the overview area display node scaling data that is important to O&M engineers.
  • Total number of nodes: the total number of nodes in the cluster. The number indicates the capacity of the cluster.
  • Number of available nodes: the number of nodes in the KubeletReady state. If this number is less than the total number of nodes, some nodes are in the KubeletNotReady state. A node in the KubeletNotReady state is either still being added to the cluster or has failed. Pay close attention to these nodes.
  • Cluster scalability: indicates whether the cluster is scalable. If NO is displayed, the number of nodes that are not in the Ready state exceeds the specified upper limit, and the cluster cannot perform scale-out activities.
  • Most recent scale-out activities: the number of node scale-out activities that are performed within a specified time range.
  • Most recent scale-in activities: the number of node scale-in activities that are performed within a specified time range.
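
The relationships among these overview metrics can be sketched in Python. The sample node list below is illustrative only; in practice the node conditions come from the Kubernetes API (for example, `kubectl get nodes -o json`), and the `max_not_ready` threshold is a hypothetical stand-in for the configured upper limit on not-Ready nodes.

```python
# Illustrative sketch: derive the overview metrics from node readiness.
# The node list mimics the data behind `kubectl get nodes`; values are made up.
nodes = [
    {"name": "node-1", "ready": True},
    {"name": "node-2", "ready": True},
    {"name": "node-3", "ready": False},  # e.g. joining the cluster or failed
]

total_nodes = len(nodes)                               # "Total number of nodes"
available_nodes = sum(1 for n in nodes if n["ready"])  # "Number of available nodes"
not_ready = total_nodes - available_nodes

max_not_ready = 3  # hypothetical upper limit on not-Ready nodes
scalable = "YES" if not_ready <= max_not_ready else "NO"  # "Cluster scalability"

print(total_nodes, available_nodes, scalable)
```

When `available_nodes` falls below `total_nodes`, the difference is exactly the set of KubeletNotReady nodes that deserve attention.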

Pod details

The following charts are displayed in the pod details area:
  • Unschedulable pod trend: displays the trend of pods that are in the Pending state over time. A rising number of unschedulable pods usually indicates that the cluster needs to scale out nodes.
  • Evicted pod trend: displays the trend of pods that are being evicted over time. If pods on a node are evicted, resource consumption on the node has reached the eviction threshold.
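
The counts behind these two charts can be sketched as follows. The sample pod records are illustrative and mimic the fields returned by `kubectl get pods -A -o json`; real dashboards aggregate these counts over time.

```python
# Illustrative sketch: count the pods that feed the two pod trend charts.
# Pod records mimic `kubectl get pods` output; values are made up.
pods = [
    {"name": "web-1", "phase": "Running", "reason": None},
    {"name": "web-2", "phase": "Pending", "reason": None},       # unschedulable
    {"name": "job-1", "phase": "Failed",  "reason": "Evicted"},  # evicted
]

unschedulable = sum(1 for p in pods if p["phase"] == "Pending")
evicted = sum(1 for p in pods if p["reason"] == "Evicted")

print(unschedulable, evicted)
```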

Node details

  • Node status trend: displays the total number of nodes, the number of nodes that are in the KubeletReady state, and the number of nodes that are in the KubeletNotReady state. KubeletNotReady nodes do not include nodes that are added to the cluster within the last 10 minutes.
  • Node scale-out trend and node scale-in trend: display the trends of node scale-out and scale-in activities over time. The number of scale-out activities equals the number of ScaledUpGroup events, because cluster-autoscaler generates a ScaledUpGroup event each time it performs a scale-out activity. Likewise, the number of scale-in activities equals the number of ScaleDown events, which cluster-autoscaler generates each time it performs a scale-in activity.
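
The event-counting logic behind the two trend charts can be sketched as follows. The event records are illustrative and mimic the shape of `kubectl get events -o json`; the messages are made up for this example.

```python
# Illustrative sketch: the scale-out/scale-in trends count cluster-autoscaler
# events by reason. Event records mimic `kubectl get events` output.
events = [
    {"reason": "ScaledUpGroup", "message": "Scale-up: group size set to 4"},
    {"reason": "ScaledUpGroup", "message": "Scale-up: group size set to 5"},
    {"reason": "ScaleDown",     "message": "Scale-down: removing one node"},
]

scale_outs = sum(1 for e in events if e["reason"] == "ScaledUpGroup")
scale_ins = sum(1 for e in events if e["reason"] == "ScaleDown")

print(scale_outs, scale_ins)
```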

List of scaling activities

The scaling activity list displays all events related to scaling activities to help you quickly find scaling activities and troubleshoot issues.

Work with the node scaling dashboard

Identify issues

  • Check whether abnormal nodes exist: Check whether the total number of nodes equals the number of available nodes. If they are not equal, some nodes in the cluster are abnormal.
  • Check whether the cluster is properly sized: Most online workloads have peak hours and off-peak hours. The auto scaling feature is designed to enable Kubernetes clusters to automatically scale with the fluctuation of workloads. You can refer to the Node details section, analyze the statistics that are collected within a specific time range, and then compare the statistics with the workload fluctuation history. If the cluster cannot scale during peak hours and off-peak hours as expected, optimize the scaling configurations accordingly.

Troubleshoot issues

  • Pending pods exist in the cluster but no nodes are scaled out: View the cluster scalability chart and check whether the cluster is scalable.
    • If the cluster is not scalable, cluster-autoscaler cannot perform scale-out activities. In this case, restore the abnormal nodes to the Ready state first.
    • If the cluster is scalable, search the preceding scaling activity list for the name of the pod that triggers the scale-out activity or for the NotTriggerScaleUp event. Then, check the reason field to find out why the scale-out activity was not triggered.
  • Check the time when a pod triggers a scale-out activity: Search for the name of the pod that triggers the scale-out activity or the NotTriggerScaleUp event in the preceding scaling activity list, and then check the time when the scale-out activity is triggered.
  • Check the cause of scale-out failure: Search for the FailedToScaleUpGroup event in the preceding scaling activity list and check the reason why cluster-autoscaler failed to perform the scale-out activity in the reason field.
  • Check the time when a node triggers a scale-in activity: Search for the name of the node that triggers the scale-in activity or the ScaleDown event in the preceding scaling activity list, and then check the time when the scale-in activity is triggered.
  • Check the cause of scale-in failure: Search for the name of the node that triggers the scale-in activity or the ScaleDownFailed event in the preceding scaling activity list, and then check the cause of scale-in failure.
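
The triage steps above all reduce to grouping scaling-related events by their reason field. The sketch below illustrates this with made-up event records that mimic `kubectl get events -o json`; the reasons are the ones named in this topic, while the object names and messages are hypothetical.

```python
# Illustrative sketch: triage scaling events by reason, as you would when
# scanning the scaling activity list. Records and messages are made up.
events = [
    {"reason": "NotTriggerScaleUp", "object": "pod/web-2",
     "message": "pod didn't trigger scale-up"},
    {"reason": "FailedToScaleUpGroup", "object": "nodegroup/ng-1",
     "message": "scale-up failed"},
    {"reason": "ScaleDownFailed", "object": "node/node-3",
     "message": "scale-down failed"},
]

# Group events by reason so each troubleshooting step maps to one bucket.
by_reason = {}
for e in events:
    by_reason.setdefault(e["reason"], []).append(e)

for reason, evs in sorted(by_reason.items()):
    print(reason, len(evs))
```

Each bucket corresponds to one of the checks above: NotTriggerScaleUp points to pods that did not trigger a scale-out, FailedToScaleUpGroup to failed scale-out activities, and ScaleDownFailed to failed scale-in activities.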