
Container Service for Kubernetes:Usage notes and instructions on high-risk operations

Last Updated:Nov 15, 2023

Container Service for Kubernetes (ACK) is a managed service that you can use to run Kubernetes without the need to manage the technical architectures and key components of Kubernetes. This means that you no longer have to worry about misconfigured control plane components that cause service downtime or interruptions. We recommend that you read this topic to fully understand the risks that may arise when you use ACK.

Table of contents

  • Usage notes

  • High-risk operations

Usage notes

Data plane components

Data plane components are system components that run on your Elastic Compute Service (ECS) instances. Data plane components include CoreDNS, Ingress, kube-proxy, Terway, and kubelet. Because these components run on instances in your account, you and ACK share responsibility for ensuring their stability.

ACK provides the following features for data plane components:

  • Management and maintenance capabilities including custom component configurations, periodic component optimization, bug fixes, Common Vulnerabilities and Exposures (CVE) patches, and the relevant documentation.

  • Observability into components by providing monitoring and alerting capabilities and generating log files for key components, which are obtainable through Simple Log Service (SLS).

  • Best practices and suggestions for component configurations based on the size of the cluster in which the components are deployed.

  • Periodic component inspection and alerting. The inspection items include but are not limited to component versions, component configurations, component loads, component topology, and the number of component pods.

We recommend that you follow these suggestions when you use data plane components:

  • Use the latest component version. New releases may contain bug fixes and new features. When a new component version is released, choose an appropriate time to update your components based on the instructions provided in the user guide. This helps prevent issues that may be caused by outdated components. For more information, see Component overview.

  • Specify the email addresses and mobile phone numbers of alert contacts in the alert center of ACK. Then, specify the notification methods. Alibaba Cloud can then use the specified notification methods to send alerts and notifications. For more information, see Alert management.

  • When you receive alerts that notify you of component stability risks, follow the instructions to mitigate the risks at the earliest opportunity.

  • If you want to configure custom component parameters, we recommend that you call the ACK API or go to the Operations > Add-ons page of your cluster in the ACK console. Custom component parameters that are modified by using other methods may cause the component to malfunction. For more information, see Manage components.

  • Do not use the APIs of Infrastructure as a Service (IaaS) services to modify the environment of data plane components. For example, do not use the ECS API to change the status of the ECS instances on which data plane components run, or modify the security groups or network settings of the worker nodes. Do not use the Server Load Balancer (SLB) API to modify the configurations of the SLB instances that are used by your cluster. Improper changes to IaaS resources may cause the components to malfunction.

  • Some data plane components may inherit bugs or vulnerabilities from their open source versions. We recommend that you update your components when ACK provides updated versions to ensure the stability of your business.

Cluster update

Use the cluster update feature of ACK to update the Kubernetes versions of your ACK clusters. Other methods may cause stability or compatibility issues. For more information, see Update an ACK cluster.

ACK provides the following features to support cluster updates:

  • Version updates for ACK clusters.

  • Pre-update checks to ensure that an ACK cluster meets the conditions for version updates.

  • Release notes that describe new Kubernetes versions and compare new versions with earlier versions.

  • Pre-update notifications that inform you of the risks that may arise due to resource changes caused by version updates.

We recommend that you follow these suggestions when you use the cluster update feature:

  • Before you perform an update, we recommend that you perform a precheck and fix the identified issues.

  • Read and understand the release notes of new Kubernetes versions. Check the status of your cluster and workloads based on the update risks that are reported by ACK. Then, evaluate the impacts of updating the cluster. For more information, see Overview of Kubernetes versions supported by ACK.

  • You cannot roll back cluster updates. Before you update a cluster, prepare for the update and make sure that you have a backup plan.

  • Update your cluster to the latest Kubernetes version before the Kubernetes version that is used by your cluster is deprecated by ACK. For more information, see Support for Kubernetes versions.

Kubernetes configurations

  • Do not change key Kubernetes configurations. For example, do not change the following directories or modify the paths, links, and content of the files in the directories:

    • /var/lib/kubelet

    • /var/lib/docker

    • /etc/kubernetes

    • /etc/kubeadm

    • /var/lib/containerd

  • Do not use the annotations that are reserved by Kubernetes in YAML templates. Annotation and label keys prefixed with kubernetes.io/ or k8s.io/ are reserved for key components. Example: pv.kubernetes.io/bind-completed: "yes". If you set reserved keys, your application may fail to locate resources or send requests, and may behave abnormally.
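For example, the following PersistentVolume snippet shows the difference between a reserved key, which Kubernetes controllers manage, and a custom annotation under your own prefix (the example.com/ prefix and its value are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
  annotations:
    # Reserved keys such as pv.kubernetes.io/bind-completed are written
    # by Kubernetes controllers; do not set or edit them in templates.
    # Custom annotations belong under your own prefix instead:
    example.com/owner: "storage-team"
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  hostPath:
    path: /mnt/demo
```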

ACK Serverless clusters

In the following scenario, ACK Serverless clusters are not covered by compensation clauses:

  • To simplify cluster O&M, ACK Serverless clusters provide fully managed system components, which are deployed and maintained by ACK after you enable them. ACK Serverless is not liable for business losses caused by user errors, such as the accidental deletion of Kubernetes resources that are used by the fully managed system components. No compensation is provided in these cases.

Cluster registration

  • When you register an external Kubernetes cluster with ACK in the ACK console, make sure that the network connectivity between the cluster and Alibaba Cloud is stable.

  • ACK allows you to register external Kubernetes clusters but does not ensure the stability of the external clusters and cannot prevent accidental operations on these clusters. Proceed with caution when you configure labels, annotations, and tags for the nodes in an external cluster by using the cluster registration proxy. Improper configurations may cause applications to malfunction.

App catalogs

The application marketplace of ACK provides the app catalog feature to help you install applications that are developed based on open source versions. ACK cannot prevent the defects in open source applications. Proceed with caution when you install these applications. For more information, see App Marketplace.

High-risk operations

The following operations are considered high-risk operations in ACK. Improper usage may cause stability issues and, in severe cases, may cause your cluster to fail. Read and understand the impacts of the following high-risk operations before you perform them:

High-risk operations on clusters

Category

High-risk operation

Impact

How to recover

API Server

Delete the SLB instance that is used to expose the API server.

You cannot manage the cluster.

Unrecoverable. You must create a new cluster. For more information about how to create a cluster, see Create an ACK managed cluster.

Worker nodes

Modify the security group of nodes.

The nodes may become unavailable.

Add the nodes to the original security group again. The security group is created when you create the cluster. For more information, see Manage ECS instances in security groups.

The subscriptions of nodes expire or nodes are removed.

The nodes become unavailable.

Unrecoverable.

Reinstall the node OS.

Components are uninstalled from nodes.

Remove the nodes and then add the nodes to the cluster again. For more information, see Remove a node and Add existing ECS instances to an ACK cluster.

Update component versions.

The nodes may become unavailable.

Roll back to the original component versions.

Change the IP addresses of nodes.

The nodes become unavailable.

Change the IP addresses of the nodes to the original IP addresses.

Modify the parameters of key components, such as kubelet, docker, and containerd.

The nodes may become unavailable.

Refer to the ACK official documentation and configure the component parameters.

Modify node OS configurations.

The nodes may become unavailable.

Restore the configurations, or remove the worker nodes and then purchase new nodes.

Modify the system time of nodes.

The components on the nodes do not work as expected.

Reset the system time of the nodes.

Master nodes in ACK dedicated clusters

Modify the security group of master nodes.

The master nodes may become unavailable.

Add the master nodes to the original security group again. The security group is created when you create the cluster. For more information, see Manage ECS instances in security groups.

The subscriptions of master nodes expire or master nodes are removed.

The master nodes become unavailable.

Unrecoverable.

Reinstall the node OS.

Components are uninstalled from master nodes.

Unrecoverable.

Update master nodes or the etcd component.

The cluster may become unavailable.

Roll back to the original component versions.

Delete or format the directories that store business-critical data on nodes, for example, /etc/kubernetes.

The master nodes become unavailable.

Unrecoverable.

Change the IP addresses of master nodes.

The master nodes become unavailable.

Change the IP addresses of the master nodes to the original IP addresses.

Modify the parameters of key components, such as etcd, kube-apiserver, and docker.

The master nodes may become unavailable.

Refer to the ACK official documentation and configure the component parameters.

Replace the certificates of master nodes or the etcd component.

The cluster may become unavailable.

Unrecoverable.

Increase or decrease the number of master nodes.

The cluster may become unavailable.

Unrecoverable.

Modify the system time of nodes.

The components on the nodes do not work as expected.

Reset the system time of the nodes.

Other services

Use Resource Access Management (RAM) to modify permissions.

Resources such as SLB instances may fail to be created.

Restore the permissions.

High-risk operations on node pools

High-risk operation

Impact

How to recover

Delete scaling groups.

Node pool exceptions occur.

Unrecoverable. You must create new node pools. For more information about how to create a node pool, see Procedure.

Use kubectl to remove nodes from a node pool.

The number of nodes in the node pool that is displayed in the ACK console is different from the actual number.

Remove nodes in the ACK console, by calling the ACK API, or by configuring the Expected Nodes parameter of the node pool. For more information, see Remove a node and Create a node pool.

Manually release ECS instances.

Incorrect information may be displayed on the node pool details page. In addition, because the Expected Nodes parameter is configured when the node pool is created, ACK automatically scales out the node pool to the value of this parameter after you release the ECS instances.

Unrecoverable. To release ECS instances in a node pool, configure the Expected Nodes parameter of the node pool in the ACK console or by calling the ACK API. You can also remove the nodes that are deployed on the ECS instances. For more information, see Create a node pool and Remove a node.

Manually scale in or scale out a node pool that has auto scaling enabled.

The auto scaling component automatically adjusts the number of nodes in the node pool after you manually scale in or scale out the node pool.

Unrecoverable. You do not need to manually scale a node pool that has auto scaling enabled.

Change the upper or lower limit of instances that a scaling group can contain.

Scaling errors may occur.

  • For a node pool that has auto scaling disabled, the default upper limit of instances for the scaling group is 2,000, and the default lower limit of instances is 0.

  • For a node pool that has auto scaling enabled, make sure that the upper and lower limits of instances for the scaling group are the same as the upper and lower limits of instances for the node pool.

Add existing nodes to a cluster without backing up the data on the nodes.

The data on the nodes is lost after the nodes are added to the cluster.

Unrecoverable.

  • If you want to add existing nodes in Manual mode, you must first back up the data on the nodes.

  • If you want to add existing nodes in Auto mode, you must first back up the data on the system disks because the system disks will be replaced when the nodes are added to the cluster.

Store business-critical data on the system disk.

If you enable auto repair for a node pool, the system may handle node exceptions by resetting node configurations. As a result, data on the system disk is lost.

Unrecoverable. Store business-critical data to data disks, cloud disks, Apsara File Storage NAS (NAS) file systems, or Object Storage Service (OSS) buckets.
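As an example of keeping business-critical data off the system disk, you can claim a dedicated cloud disk with a PersistentVolumeClaim and mount it into your pods. This is a minimal sketch; the alicloud-disk-ssd storage class name is an assumption, so use a storage class that exists in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  # Assumed storage class name; list the storage classes in your
  # cluster and pick one that provisions cloud disks.
  storageClassName: alicloud-disk-ssd
  resources:
    requests:
      storage: 20Gi
```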

High-risk operations on networks and load balancing

High-risk operation

Impact

How to recover

Specify the following kernel parameter setting: net.ipv4.ip_forward=0.

Network connectivity issues occur.

Replace the setting with the following content: net.ipv4.ip_forward=1.

Specify the following kernel parameter settings:

  • net.ipv4.conf.all.rp_filter = 1|2

  • net.ipv4.conf.[ethX].rp_filter = 1|2

    Note

    ethX specifies the network interface controllers whose names start with eth.

Network connectivity issues occur.

Replace the settings with the following content:

  • net.ipv4.conf.all.rp_filter = 0

  • net.ipv4.conf.[ethX].rp_filter = 0

Specify the following kernel parameter setting: net.ipv4.tcp_tw_reuse = 1.

Pods fail to pass health checks.

Replace the setting with the following content: net.ipv4.tcp_tw_reuse = 0.

Specify the following kernel parameter setting: net.ipv4.tcp_tw_recycle = 1.

Network address translation errors occur.

Replace the setting with the following content: net.ipv4.tcp_tw_recycle = 0.

Set the net.ipv4.ip_local_port_range kernel parameter to an improper port range.

Network connectivity issues occasionally occur.

Replace the setting with the following content: net.ipv4.ip_local_port_range="32768 60999".

Install firewall software, such as Firewalld or ufw.

The container network becomes inaccessible.

Uninstall the firewall software and restart the nodes.

The security group of a node does not open UDP port 53 for the pod CIDR block.

DNS cannot work as expected in the cluster.

Refer to the ECS official documentation and modify the security group configuration to open UDP port 53 for the pod CIDR block.

Modify or delete the tags that ACK added to SLB instances.

The SLB instances do not work as expected.

Restore the tags.

Modify the configurations of the SLB instances that are managed by ACK, including the configurations of the instances, listeners, and vServer groups.

The SLB instances do not work as expected.

Restore the SLB configurations.

Remove the service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: ${YOUR_LB_ID} annotation, which specifies an existing SLB instance, from the Service configuration.

The SLB instances do not work as expected.

Add the annotation to the Service configuration.

Note

If a Service is configured to use an existing SLB instance, you cannot modify the configuration to create a new SLB instance for the Service. To use a new SLB instance, you must create a new Service.

Delete the SLB instances that are created by ACK in the SLB console.

Errors may occur in the cluster network.

Delete SLB instances by deleting the Services that are associated with the SLB instances. For more information about how to delete a Service, see Delete a Service.

Manually delete the nginx-ingress-lb Service in the kube-system namespace of a cluster that has the NGINX Ingress controller installed.

The NGINX Ingress controller does not run as expected or may stop running.

Use the following YAML template to create a Service that has the same name:

apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: nginx-ingress-lb
  name: nginx-ingress-lb
  namespace: kube-system
spec:
  externalTrafficPolicy: Local
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    app: ingress-nginx
  type: LoadBalancer

Configure the nameserver parameter in the /etc/resolv.conf DNS configuration file of an ECS node.

If the DNS server is not configured properly, DNS resolution may fail. As a result, the cluster cannot run as expected.

If you want to use a self-managed DNS server, we recommend that you configure the DNS server in CoreDNS. For more information, see Configure CoreDNS.
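If you rely on a self-managed DNS server, configuring it in CoreDNS leaves the node-level /etc/resolv.conf untouched. The following is a minimal sketch of a Corefile server block that forwards an internal zone to a self-managed server; internal.example and 10.0.0.53 are placeholders:

```
# Add to the Corefile in the coredns ConfigMap in the kube-system namespace.
# internal.example and 10.0.0.53 are placeholders for illustration.
internal.example:53 {
    errors
    cache 30
    forward . 10.0.0.53
}
```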

High-risk operations on storage

High-risk operation

Impact

How to recover

Unmount cloud disks that are mounted to pods in the ECS console.

I/O errors occur when you write data to the pods.

Restart the pods and clear residual data on the nodes.

Unmount disks from their mount paths on nodes.

Pod data is written to local disks.

Restart the pods.

Manage cloud disks on the nodes.

Pod data is written to local disks.

Unrecoverable.

Mount a cloud disk to multiple pods.

Pod data is written to local disks or I/O errors occur when you write data to the pods.

Mount the cloud disk to only one pod.

Important

Alibaba Cloud disks cannot be shared. Each disk can be mounted to only one pod.

Manually delete the NAS directories that are mounted to pods.

I/O errors occur when you write data to the pods.

Restart the pods.

Delete the NAS file systems that are mounted to pods or delete the mount targets that are used to mount NAS file systems.

I/O hangs occur when you write data to the pods.

Restart the ECS instances. For more information about how to restart an ECS instance, see Restart an instance.

High-risk operations on logs

High-risk operation

Impact

How to recover

Delete the /tmp/ccs-log-collector/pos directory on a node.

Duplicate logs are collected.

Unrecoverable. The /tmp/ccs-log-collector/pos directory records the positions up to which log files have been collected. If the directory is deleted, the log files are collected again from the beginning, which results in duplicate logs.

Delete the /tmp/ccs-log-collector/buffer directory on a node.

Logs are lost.

Unrecoverable. The /tmp/ccs-log-collector/buffer directory stores cached log files that need to be consumed.

Delete the aliyunlogconfig CustomResourceDefinition (CRD) objects.

Logs cannot be collected.

Recreate the aliyunlogconfig CRD objects that are deleted and the related resources. Logs that are generated within the period of time during which the aliyunlogconfig CRD objects do not exist cannot be collected.

If you delete the aliyunlogconfig CRD objects, the related log collection tasks are also deleted. After you recreate the aliyunlogconfig CRD objects, you must also relaunch the log collection tasks.

Uninstall logging components.

Logs cannot be collected.

Reinstall the logging component and manually recreate the aliyunlogconfig CRD objects. When you uninstall the logging component, the aliyunlogconfig CRD objects and Logtail are also deleted. Logs that are generated during the period in which the logging component does not exist cannot be collected.
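If you need to recreate aliyunlogconfig CRD objects manually, they follow the AliyunLogConfig schema of the open source Logtail component. The sketch below collects container stdout; the names are placeholders and the field layout is based on the open source schema, so verify it against the current Simple Log Service documentation before use:

```yaml
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: demo-stdout          # placeholder name
spec:
  logstore: demo-logstore    # placeholder Logstore name
  logtailConfig:
    inputType: plugin
    configName: demo-stdout
    inputDetail:
      plugin:
        inputs:
          - type: service_docker_stdout
            detail:
              Stdout: true
              Stderr: true
```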