Container Service for Kubernetes: FAQ about cluster management

Last Updated: Mar 06, 2024

This topic provides answers to some frequently asked questions about cluster management.

Are ACK clusters that run Alibaba Cloud Linux compatible with container images that are based on CentOS?

Yes, Container Service for Kubernetes (ACK) clusters that run Alibaba Cloud Linux are compatible with container images that are based on CentOS. For more information, see Use Alibaba Cloud Linux 3.

Can I change the container runtime of a cluster from containerd to Docker?

After a cluster is created, you cannot change the container runtime used by the cluster. However, you can create node pools that use different container runtimes in the same cluster. For more information, see Node pool overview. You can also change the container runtime of a node from Docker to containerd. For more information, see Change the container runtime from Docker to containerd.

What are the differences between containerd, Docker, and Sandboxed-Container?

Container Service for Kubernetes (ACK) supports the following container runtimes: containerd, Docker, and Sandboxed-Container. We recommend that you use containerd as the container runtime. You can use Docker as the container runtime in clusters whose Kubernetes versions are 1.22 and earlier. You can use Sandboxed-Container in clusters whose Kubernetes versions are 1.24 and earlier. For more information about containerd, Docker, and Sandboxed-Container, see Comparison of Docker, containerd, and Sandboxed-Container. If your cluster uses Docker as the container runtime, you must change the container runtime to containerd before you can update the Kubernetes version of your cluster to 1.24 or later. For more information, see Change the container runtime from Docker to containerd.
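Before you update a cluster to Kubernetes 1.24 or later, it can help to see which nodes still use Docker. The following is a minimal sketch, assuming that kubectl is already configured for the cluster. The function names are hypothetical. The runtime is read from the standard .status.nodeInfo.containerRuntimeVersion field of each Node object, which has the form name://version.

```shell
# Hypothetical helpers; the names are illustrative and not part of ACK.

# Print the runtime name from a containerRuntimeVersion value,
# for example "containerd" from "containerd://1.6.20" (example value).
runtime_name() {
  printf '%s\n' "${1%%://*}"
}

# Print "NAME RUNTIME" for every node. Requires kubectl access to the cluster.
list_node_runtimes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion'
}

# Read "NAME RUNTIME" lines from stdin and print the nodes that use Docker.
docker_nodes() {
  while read -r name runtime; do
    if [ "$(runtime_name "$runtime")" = "docker" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

To list the nodes that still use Docker, run list_node_runtimes | docker_nodes. An empty result suggests that no node reports Docker as its runtime.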

Is ACK certified for Level 3 Cybersecurity?

ACK and other Alibaba Cloud services have been certified for Level 3 Cybersecurity. The following lists describe the items that Alibaba Cloud and you are each responsible for in order to secure Alibaba Cloud services and your business:

  • Items for which Alibaba Cloud bears responsibility:

    • The security of infrastructure resources for Alibaba Cloud services.

    • The security of etcd and control plane nodes in the cluster.

    • The security compliance of control plane components in the cluster. Alibaba Cloud also accepts security inspections from third parties.

  • Items for which you bear responsibility:

    • Security configurations of the data plane, including the configurations of security groups of virtual private clouds (VPCs).

    • Configurations of nodes and pods.

    • Operating systems of nodes, including upgrades and security patches.

    • Other related software.

    • Access control on devices and networks, such as firewall rules.

    • Platform-level identity verification and access control by using Resource Access Management (RAM) or other services.

    • Security of sensitive data.

Can I update an ACK dedicated cluster after I accidentally delete a master node of the cluster?

No. After a master node of an ACK dedicated cluster is deleted, you cannot add another master node or update the Kubernetes version of the cluster. You can create another ACK dedicated cluster. For more information, see Create an ACK dedicated cluster.

How do I connect to master nodes?

How do I collect the diagnostic data of an ACK cluster?

ACK provides the cluster diagnostics feature that you can use to diagnose clusters with a few clicks. This feature helps you troubleshoot cluster issues and node anomalies. For more information, see Work with cluster diagnostics.

You can also collect diagnostic data from master nodes and worker nodes for further analysis. The following section describes how to collect diagnostic data from Linux nodes and Windows nodes.

Collect diagnostic data from Linux nodes

Worker nodes support Linux and Windows, whereas master nodes support only Linux. The following steps apply to master nodes and worker nodes that run Linux. The following example describes how to collect diagnostic data from a master node:

  1. Log on to the master node and run the following command to download a diagnostic script:

    curl -o /usr/local/bin/diagnose_k8s.sh http://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_k8s.sh

    Note

    You can download the diagnostic script for Linux nodes only from the China (Hangzhou) region.

  2. Run the following command to make the diagnostic script executable:

    chmod u+x /usr/local/bin/diagnose_k8s.sh

  3. Run the following command to go to the directory that stores the diagnostic script:

    cd /usr/local/bin

  4. Run the following command to run the diagnostic script:

    diagnose_k8s.sh

    The following output is returned. Each time you run the diagnostic script, a log file with a different name is generated. In this example, the log file is named diagnose_1514939155.tar.gz. The actual file name varies.

    ......
    + echo 'please get diagnose_1514939155.tar.gz for diagnostics'
    please get diagnose_1514939155.tar.gz for diagnostics
    + echo 'Upload diagnose_1514939155.tar.gz'
    Upload diagnose_1514939155.tar.gz

  5. Run the following command to query the log file that stores the diagnostic data:

    ls -ltr | grep diagnose_1514939155.tar.gz

    Note

    Replace diagnose_1514939155.tar.gz with the actual name of the generated log file.
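The manual steps above can be wrapped in a small script so that the archive name does not need to be copied by hand. The following is a sketch under stated assumptions: the function names are hypothetical, the script is assumed to run without prompts, and the archive name is assumed to appear in the output in the form shown in the sample (diagnose_<digits>.tar.gz).

```shell
# Sketch only: wraps the manual steps in two hypothetical functions.

# Read the script output from stdin and print the archive name it mentions.
archive_name_from_output() {
  grep -o 'diagnose_[0-9]*\.tar\.gz' | tail -n 1
}

collect_linux_diagnostics() {
  script=/usr/local/bin/diagnose_k8s.sh
  url=http://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_k8s.sh

  # Steps 1 to 3: download the script, make it executable, change directory.
  curl -o "$script" "$url" || return 1
  chmod u+x "$script" || return 1
  cd /usr/local/bin || return 1

  # Step 4: run the script. The sample output contains "+ echo" trace lines,
  # which may be written to stderr, so both streams are captured.
  output=$("$script" 2>&1)
  printf '%s\n' "$output"

  # Step 5: locate the generated archive.
  archive=$(printf '%s\n' "$output" | archive_name_from_output)
  [ -n "$archive" ] || { echo "no archive name found in output" >&2; return 1; }
  ls -l "$archive"
}
```

Run collect_linux_diagnostics on the node as a user that can write to /usr/local/bin. The function stops at the first failed step instead of continuing with a partial download.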

Collect diagnostic data from Windows nodes

To collect diagnostic data from a Windows worker node, perform the following steps to download and run a diagnostic script:

Note

Windows can run only on worker nodes.

  1. Log on to an abnormal node. Open the Run command window, enter cmd, and then click OK to open Command Prompt.

  2. Run the following command to switch to PowerShell:

    powershell

  3. Run the following command to download and run a diagnostic script:

    The diagnostic script for a Windows node can be downloaded only from the region where the node resides. Replace [$Region_ID] in the command with the actual region ID of the node.

    Invoke-WebRequest -UseBasicParsing -Uri http://aliacs-k8s-[$Region_ID].oss-[$Region_ID].aliyuncs.com/public/pkg/windows/diagnose/diagnose.ps1 | Invoke-Expression

    If the following output is returned, the diagnostic data of the node is collected.

    INFO: Compressing diagnosis clues ...
    INFO: ...done
    INFO: Please get diagnoses_1514939155.zip for diagnostics

    Note

    The diagnoses_1514939155.zip file is stored in the directory where the diagnostic script is run.
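To illustrate how the [$Region_ID] placeholder is substituted, the following sketch prints the download URL for a given region ID. The function name is hypothetical, and cn-hangzhou is used only as an example region ID. The printed URL can then be passed to Invoke-WebRequest on the Windows node.

```shell
# Hypothetical helper: print the Windows diagnostic script URL for a region ID.
# The region ID replaces both [$Region_ID] placeholders in the URL.
windows_diagnose_url() {
  region_id=$1
  printf 'http://aliacs-k8s-%s.oss-%s.aliyuncs.com/public/pkg/windows/diagnose/diagnose.ps1\n' \
    "$region_id" "$region_id"
}

# Example: a node that resides in the China (Hangzhou) region.
windows_diagnose_url cn-hangzhou
```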

How do I troubleshoot ACK cluster issues?

Step 1: Check cluster nodes

  1. Run the following command to check whether all cluster nodes are in the Ready state:

    kubectl get nodes

    The following figure shows the expected output.

    • If all cluster nodes are in the Ready state, the nodes run as expected.

    • If any node is not in the Ready state, proceed to the next step to query the details of the node.

  2. Run the following command to query the details and events of a node:

    Replace [$NODE_NAME] with the actual node name.

    kubectl describe node [$NODE_NAME]

    Note

    For more information about the kubectl output, see Node status.
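In a cluster with many nodes, the two checks above can be combined. The following is a minimal sketch, assuming that kubectl is configured for the cluster. The function names are hypothetical. The filter relies on the default column order of kubectl get nodes (NAME, STATUS, ...), and treats a status such as Ready,SchedulingDisabled as Ready.

```shell
# Read `kubectl get nodes --no-headers` output from stdin and print the
# names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Describe every node that is not Ready. Requires kubectl access to the cluster.
describe_not_ready_nodes() {
  kubectl get nodes --no-headers | not_ready_nodes | while read -r node; do
    kubectl describe node "$node"
  done
}
```

If describe_not_ready_nodes prints nothing, every node reported a status that starts with Ready.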

Step 2: Check cluster components

If all cluster nodes run as expected, check the logs of cluster components.

  1. Run the following command to view all components in the kube-system namespace:

    kubectl get pods -n kube-system

    The following figure shows the expected output.

    Components whose names start with kube- are system components. Components whose names start with coredns- are DNS components. In this example, the output shows that all cluster components run as expected. If any component does not run as expected, perform the following step.

  2. Run the following command to query the log of a component:

    Replace [$Component_Name] with the actual component name.

    kubectl logs -f [$Component_Name] -n kube-system
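The component check can also be scripted. The following sketch assumes that kubectl is configured for the cluster and uses hypothetical function names. It relies on the default column order of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE) and prints the most recent log lines instead of following the log with -f.

```shell
# Read `kubectl get pods --no-headers` output from stdin and print the names
# of pods whose STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Print the last 50 log lines of each such pod in the kube-system namespace.
show_unhealthy_component_logs() {
  kubectl get pods -n kube-system --no-headers | unhealthy_pods | while read -r pod; do
    echo "== $pod =="
    kubectl logs --tail=50 "$pod" -n kube-system
  done
}
```

This filter looks only at the STATUS column. A pod in the Running state can still have containers that are not ready, so compare the READY column as well if a component misbehaves.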

Step 3: Check the kubelet

  1. Run the following command to view the status of the kubelet:

    systemctl status kubelet

  2. If the kubelet is not in the Active state, run the following command to view the kubelet log. Identify and resolve issues based on the log.

    journalctl -u kubelet
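The two kubelet checks can be combined so that the log is printed only when the service is not active. The following is a minimal sketch with a hypothetical function name. It assumes a systemd-based node and limits the output to the last 100 log lines, which is an arbitrary choice.

```shell
# Print recent kubelet log lines only when the kubelet unit is not active.
# Returns 0 if the kubelet is active and 1 otherwise.
kubelet_log_if_inactive() {
  if systemctl is-active --quiet kubelet; then
    echo "kubelet is active" >&2
    return 0
  fi
  # --no-pager writes the log directly to stdout; -n limits the line count.
  journalctl -u kubelet --no-pager -n 100
  return 1
}
```

Run kubelet_log_if_inactive on the node itself. The return code makes it possible to use the function in a larger health check script.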

Common cluster issues

The following entries describe common issues and solutions for ACK clusters.

Issue: The API server or a control plane component stops running. Results:

  • You cannot create, stop, or update pods, Services, or Deployments.

  • All existing pods and Services run as expected unless the pods and Services need to call the ACK API to perform operations such as managing Kubernetes dashboards.

Solution: The components of ACK support high availability. We recommend that you check whether the components are abnormal. For example, the API server of an ACK cluster uses a Server Load Balancer (SLB) instance. You can check why your SLB instance stops running.

Issue: The backend data of the API server is lost. Results:

  • The API server cannot be started.

  • All existing pods and Services run as expected unless the pods and Services need to call the ACK API to perform operations such as managing Kubernetes dashboards.

  • The API server can be started only after the backend data of the API server is restored or recreated.

Solution: If you have created a snapshot before the issue occurs, you can restore data from the snapshot to resolve the issue. If no snapshot is created in advance, join the DingTalk group 8000019579 for technical support. You can use the following methods to prevent this issue:

Issue: A node fails and all pods on the node stop running.

Solution: Create pods by using workloads such as Deployments, StatefulSets, and DaemonSets. Do not directly create pods. Otherwise, the system may not be able to schedule the pods to healthy nodes.

Issue: The kubelet fails. Results:

  • You cannot create pods on a node where the kubelet fails.

  • The kubelet may accidentally delete specific pods.

  • Specific nodes are marked as unhealthy.

  • Deployments or ReplicationControllers create pods on other nodes.

Solution:

  • If you have created a snapshot before the issue occurs, you can restore data from the snapshot to resolve the issue. If no snapshot is created, join the DingTalk group 8000019579 for technical support. Create snapshots for the volumes managed by the kubelet on a regular basis. For more information, see Use volume snapshots created from disks.

  • Do not directly create pods. Instead, create pods by using workloads such as Deployments, StatefulSets, and DaemonSets. The system attempts to schedule the pods to healthy nodes.

Issue: Other causes such as invalid configurations.

Solution: If you have created a snapshot before the issue occurs, you can restore data from the snapshot to resolve the issue. If no snapshot is created, join the DingTalk group 8000019579 for technical support. Create snapshots for the volumes managed by the kubelet on a regular basis. For more information, see Use volume snapshots created from disks.
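Several of the solutions above recommend that pods be created by workloads instead of directly. The following sketch shows one way to find pods that were created directly, assuming that kubectl is configured for the cluster. The function names are hypothetical. It relies on the standard .metadata.ownerReferences field: a pod that is managed by a workload has an owner, and kubectl prints <none> in a custom column when the field is absent.

```shell
# Print "NAMESPACE NAME OWNER_KIND" for every pod in the cluster.
# Requires kubectl access to the cluster.
list_pod_owners() {
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
}

# Read the lines printed by list_pod_owners from stdin and print
# "namespace/name" for every pod that has no owner.
bare_pods() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

Run list_pod_owners | bare_pods to list the pods that have no owner. Such pods are not recreated on another node if their node fails.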