Container Service cannot connect to a node if the node status is Exception.

Reason analysis

A node exception is usually caused by heavy load on the node: high CPU usage, memory usage, network traffic, or I/O.
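
As an illustration of the load dimensions named above, the following is a minimal Python sketch that reports which dimensions of a node exceed a threshold. The metric names and threshold values are hypothetical, chosen only for this example; they are not values used by Container Service or CloudMonitor.

```python
from typing import Dict, List

# Hypothetical thresholds in percent; tune them for your own nodes.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 85.0,
    "memory_percent": 85.0,
    "network_percent": 80.0,
    "io_percent": 80.0,
}


def overloaded_dimensions(
    metrics: Dict[str, float],
    thresholds: Dict[str, float] = DEFAULT_THRESHOLDS,
) -> List[str]:
    """Return the names of the load dimensions whose value exceeds its threshold.

    A dimension that is missing from `metrics` is treated as 0 (not overloaded).
    """
    return [
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    ]


# Example readings copied by hand from a monitoring chart (made-up numbers).
sample = {
    "cpu_percent": 92.5,
    "memory_percent": 60.0,
    "network_percent": 30.0,
    "io_percent": 88.0,
}
print(overloaded_dimensions(sample))  # ['cpu_percent', 'io_percent']
```

Knowing which dimension is overloaded helps you choose among the solutions at the end of this topic, for example restricting container memory versus expanding the node.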

Swarm clusters

You can view the monitoring data of your node in either the Container Service console or the Alibaba Cloud CloudMonitor console.

  • View node monitoring data in the Container Service console
    1. Log on to the Container Service console.
    2. Click Swarm > Clusters in the left-side navigation pane.
    3. Click the cluster name.
    4. Click Monitor at the right of the node that you want to view.
  • View node monitoring data in the Alibaba Cloud CloudMonitor console
    1. Log on to the CloudMonitor console.
    2. Click Cloud Service Monitoring > Container Service in the left-side navigation pane.
    3. Click Node Monitoring at the right of the cluster that contains the node you want to view.
    4. Click Monitoring Charts at the right of the node to view the monitoring data of this node.
    Note
    To monitor the node load in real time, you can create alarm rules for the node. Click Create Alarm Rule in the upper-right corner of the page.

Kubernetes clusters

You can view the monitoring data of your node in either the Container Service console or the Kubernetes application group of the CloudMonitor console.

  • View node monitoring data in the Container Service console
    1. Log on to the Container Service console.
    2. Click Kubernetes > Clusters > Nodes in the left-side navigation pane.
    3. Select the cluster from the Cluster drop-down list. Click Monitor at the right of the node that you want to view.
  • View node monitoring data in the Kubernetes application group of the CloudMonitor console
    1. Log on to the Container Service console.
    2. Click Kubernetes > Clusters in the left-side navigation pane.
    3. Click More > at the right of the cluster, and then select Upgrade monitoring service. Click OK in the displayed dialog box.
    4. Log on to the CloudMonitor console.
    5. Click Application Groups in the left-side navigation pane.

Solutions

You can solve the node exception by:

  • Reducing the number of containers deployed on the node.
  • Restricting the resources used by the containers. See Restrict container resources for swarm clusters.
  • Reducing the load to get the node back to normal.
  • Expanding the node or cluster.
  • Adding monitoring charts and creating alarm rules for the group resources in the cluster, which helps prevent the node from becoming overloaded.
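
To make the idea of an alarm rule concrete, the following is a minimal Python sketch of one common way to evaluate such a rule: raise the alarm only when a metric stays above a threshold for several consecutive samples, so that a short spike does not trigger it. The function name, the threshold, and the consecutive-sample count are assumptions made for this example; this is not the CloudMonitor API and does not describe how CloudMonitor evaluates its rules internally.

```python
from typing import Sequence


def should_alarm(
    samples: Sequence[float],
    threshold: float,
    consecutive: int = 3,
) -> bool:
    """Return True when the most recent `consecutive` samples all exceed `threshold`.

    `samples` is ordered oldest to newest. With fewer than `consecutive`
    samples there is not enough data, so no alarm is raised.
    """
    if consecutive <= 0 or len(samples) < consecutive:
        return False
    recent = samples[-consecutive:]
    return all(value > threshold for value in recent)


# CPU usage in percent, one sample per monitoring period (made-up numbers).
cpu_usage = [50.0, 90.0, 91.0, 95.0]
print(should_alarm(cpu_usage, threshold=85.0))  # True
```

Requiring several consecutive samples trades a slightly later alarm for fewer false alarms; a lower count reacts faster but is noisier.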