Container Service for Kubernetes (ACK) is a fully managed service that manages the technical architecture and key components of Kubernetes for you. However, improper operations on components that are not managed by ACK, or on applications that are deployed in ACK clusters, may cause business interruptions. To help you assess and avoid these risks, make sure that you read and understand the recommendations and usage notes in this topic before you get started with ACK.

Usage notes

Data plane components

Data plane components are system components that run on your Elastic Compute Service (ECS) instances. Data plane components include CoreDNS, Ingress, kube-proxy, Terway, and kubelet. You and the ACK team must collaborate to ensure the stability of the data plane components that run on your ECS instances.

ACK provides the following features for data plane components:
  • Allows you to customize component parameters, periodically optimizes the components, fixes bugs, patches Common Vulnerabilities and Exposures (CVE) vulnerabilities, and provides relevant documentation.
  • Provides observability capabilities such as component monitoring and alerting. ACK also collects the logs of key components and allows you to view and analyze the logs in Log Service.
  • Provides best practices and suggestions for configurations. ACK provides suggestions for component configurations based on the size of the cluster in which the components are deployed.
  • Supports periodic component inspection and alerting. The check items include but are not limited to the component version, component configurations, component loads, component topology, and number of component pods.
We recommend that you follow these suggestions when you use data plane components:
  • Use the latest component version. New releases may contain bug fixes and new features. Select an appropriate time and follow the user guide to update your components so that the stability of your business is not affected. For more information, see Component overview.
  • Specify the email addresses and mobile phone numbers of alert contacts in the Alert Center provided by ACK, and then select the notification methods. This way, Alibaba Cloud can send alerts and notices to you through the specified methods. For more information, see Alert management.
  • If you receive alerts that indicate risks to component stability, follow the instructions to mitigate the risks at the earliest opportunity.
  • Configure the custom parameters of the components by calling the API or on the Operations > Add-ons page of your cluster in the ACK console. If you use other methods to configure the components, the components may not function as normal. For more information, see Manage system components.
  • Do not use the APIs of Infrastructure as a Service (IaaS) services to modify the environment of the components. For example, do not use the ECS API to change the status of the ECS instances on which the components run, modify the security groups of the worker nodes, or modify the network settings of the worker nodes. Do not use the Server Load Balancer (SLB) API to modify the configurations of the SLB instances that are used in your cluster. Improper changes to the IaaS resources may cause the components to malfunction.
  • Several data plane components may have the same bugs or vulnerabilities as their open source counterparts. Update your components at the earliest opportunity so that these bugs or vulnerabilities do not affect your business.

Cluster updates

Use the cluster update feature that is provided by ACK to update the Kubernetes version of ACK clusters. If you use other methods to update the Kubernetes version of your ACK clusters, stability or compatibility issues may occur in the ACK clusters. For more information, see Update the Kubernetes version of an ACK cluster.

ACK provides the following features to support cluster updates:
  • Provides the Kubernetes update feature for ACK clusters.
  • Provides the precheck feature to check whether an ACK cluster is ready for an update.
  • Provides release notes to describe the new Kubernetes versions and compare the new versions with earlier versions.
  • Displays the potential risks due to resource updates after you update the Kubernetes version of an ACK cluster.
We recommend that you follow these suggestions when you use the cluster update feature:
  • Perform a precheck before you update the cluster and fix the issues that are reported in the precheck result.
  • Read the release notes of the new Kubernetes versions, confirm the status of the cluster and workloads based on the update risks that are reported by ACK, and evaluate the impacts of the risks. For more information, see Overview of Kubernetes versions supported by ACK.
  • You cannot roll back cluster updates. Before you update a cluster, prepare for the update and make sure that you have a backup plan.
  • Update your cluster to the latest Kubernetes version before the Kubernetes version that your cluster uses is deprecated by ACK. For more information, see ACK releases of Kubernetes.

Kubernetes configurations

  • Do not change key Kubernetes configurations. For example, do not change the following directories or modify the paths, links, and content of the files in the directories:
    • /var/lib/kubelet
    • /var/lib/docker
    • /etc/kubernetes
    • /etc/kubeadm
    • /var/lib/containerd
  • Do not use labels or annotations that are reserved by Kubernetes in YAML templates. Otherwise, resource unavailability, application failures, and other exceptions may occur. Label and annotation keys that start with kubernetes.io/ or k8s.io/ are reserved for key components. Example: pv.kubernetes.io/bind-completed: "yes". For an illustration of a safe custom annotation key, see the sketch after this list.
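
The following minimal sketch illustrates the rule above. The PersistentVolumeClaim name and the example.com annotation prefix are placeholders chosen for this example; the point is that custom metadata uses your own domain prefix, while reserved kubernetes.io/ and k8s.io/ keys are left to Kubernetes itself:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc                              # placeholder name
  annotations:
    example.com/backup-policy: "daily"        # custom domain prefix: safe to use
    # pv.kubernetes.io/bind-completed: "yes"  # reserved key: do not set manually
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi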

Cluster registration

  • When you register an external Kubernetes cluster in the ACK console, make sure that the network connection between the cluster and Alibaba Cloud is stable.
  • ACK allows you to register external Kubernetes clusters but does not ensure the stability of the external clusters or prevent improper operations. Proceed with caution when you configure labels, annotations, and tags for the nodes in an external cluster on the cluster registration proxy. Improper configurations may cause your applications to malfunction.

App catalogs

The application marketplace of ACK provides the app catalog feature to help you install applications that are optimized based on their open source versions. ACK cannot eliminate the defects in the open source applications themselves. Proceed with caution when you install these applications. For more information, see App Marketplace.

High-risk operations

The following high-risk operations may adversely affect your business when you work with ACK. Read and understand the impacts of the following high-risk operations before you perform these operations:

High-risk operations on clusters

Category High-risk operation Impact How to recover
API Server Delete the SLB instance that is used to expose the API server. You cannot manage the cluster. Unrecoverable. You must create a new cluster.
Worker nodes Modify the security groups of worker nodes. The worker nodes may become unavailable. Add the worker nodes to the original security groups again. The security groups are created when you create the cluster. For more information, see Add an ECS instance to a security group.
Allow worker nodes to expire without renewal, or remove worker nodes. The worker nodes become unavailable. Unrecoverable.
Reinstall the node OS. Components are uninstalled from worker nodes. Remove the worker nodes and then add the nodes to the cluster again.
Update component versions. The worker nodes may become unavailable. Roll back to the original component versions.
Change the IP addresses of worker nodes. The worker nodes become unavailable. Change the IP addresses of the worker nodes to the original IP addresses.
Modify the parameters of key components, such as kubelet, docker, and containerd. The worker nodes may become unavailable. Refer to the ACK official documentation and configure the component parameters.
Modify node OS configurations. The worker nodes may become unavailable. Restore the configurations, or remove the worker nodes and then purchase new nodes.
Master nodes in ACK dedicated clusters Modify the security groups of master nodes. The master nodes may become unavailable. Add the master nodes to the original security groups again. The security groups are created when you create the cluster. For more information, see Add an ECS instance to a security group.
Allow master nodes to expire without renewal, or remove master nodes. The master nodes become unavailable. Unrecoverable.
Reinstall the node OS. Components are uninstalled from master nodes. Unrecoverable.
Update master nodes or etcd. The cluster may become unavailable. Roll back to the original version.
Delete or format the directories that store business-critical data on nodes, for example, /etc/kubernetes. The master nodes become unavailable. Unrecoverable.
Change the IP addresses of master nodes. The master nodes become unavailable. Change the IP addresses of the master nodes to the original IP addresses.
Modify the parameters of key components, such as etcd, kube-apiserver, and docker. The master nodes may become unavailable. Refer to the ACK official documentation and configure the component parameters.
Replace the certificates of master nodes or etcd. The cluster may become unavailable. Unrecoverable.
Increase or decrease the number of master nodes. The cluster may become unavailable. Unrecoverable.
Others Use Resource Access Management (RAM) to modify permissions. Resources such as SLB instances may fail to be created. Restore the permissions.

High-risk operations on node pools

High-risk operation Impact How to recover
Delete scaling groups. Node pool exceptions occur. Unrecoverable. You must create new node pools.
Use kubectl to remove nodes from a node pool. The number of nodes in the node pool that is displayed in the ACK console is different from the actual number. Remove the nodes by using the ACK console, by calling the ACK API, or by configuring the Expected Nodes parameter of the node pool. For more information, see Remove a node and Modify the expected number of nodes in a node pool.
Manually release ECS instances. The node pool details page may be improperly displayed. If the node pool is configured with the Expected Nodes parameter, the node pool automatically scales out to maintain the expected number of nodes after you release the ECS instances. Unrecoverable. To release ECS instances in a node pool, configure the Expected Nodes parameter of the node pool in the ACK console or by calling the ACK API. You can also remove the nodes that are deployed on the ECS instances. For more information, see Modify the expected number of nodes in a node pool and Remove a node.
Manually scale in or scale out a node pool that has auto scaling enabled. The auto scaling component automatically adjusts the number of nodes in the node pool after you manually scale in or scale out the node pool. Unrecoverable. You do not need to manually scale in or scale out a node pool that has auto scaling enabled.
Change the upper limit or lower limit of instances that a scaling group can contain. Scaling errors may occur.
  • For a node pool that has auto scaling disabled, the default upper limit of instances for the scaling group is 2,000, and the default lower limit of instances is 0.
  • For a node pool that has auto scaling enabled, make sure that the upper and lower limits of instances for the scaling group are the same as the upper and lower limits of instances for the node pool.
Add existing nodes without backing up the data on the nodes. The data on the nodes is lost after the nodes are added. Unrecoverable.
  • If you add existing nodes in Manual mode, you must first back up the data on the nodes.
  • If you add existing nodes in Auto mode, you must first back up the data on the system disks of the existing nodes because the system disks are replaced after the nodes are added to the cluster.
Store business-critical data on the system disk. If you enable auto repair for a node pool, the system may handle node exceptions by resetting node configurations. As a result, data on the system disk is lost. Unrecoverable. Store business-critical data on data disks, Alibaba Cloud disks, Apsara File Storage NAS (NAS) file systems, or Object Storage Service (OSS) buckets. For a minimal example of mounting a dedicated volume for business data, see the sketch after this table.
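
The following is a minimal sketch of the recommendation above: business data is written to a volume backed by a PersistentVolumeClaim instead of the node system disk. The resource names, the container image, and the alicloud-disk-ssd storage class are assumptions for illustration; replace them with the names and storage classes that are available in your cluster:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                          # placeholder name
spec:
  accessModes:
    - ReadWriteOnce                       # a cloud disk is attached to a single node and pod
  storageClassName: alicloud-disk-ssd     # assumption: adjust to your cluster
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app                               # placeholder name
spec:
  containers:
  - name: app
    image: nginx:1.25                     # placeholder image
    volumeMounts:
    - name: data
      mountPath: /var/lib/app             # business data is written to the data volume, not the system disk
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data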

High-risk operations on networks and load balancing

High-risk operation Impact How to recover
Specify the following kernel parameter setting: net.ipv4.ip_forward=0. Network connectivity issues occur. Replace the setting with the following content: net.ipv4.ip_forward=1.
Specify the following kernel parameter settings:
  • net.ipv4.conf.all.rp_filter = 1|2
  • net.ipv4.conf.[ethX].rp_filter = 1|2
    Note ethX specifies the network interface controllers whose names start with eth.
Network connectivity issues occur. Replace the settings with the following content:
  • net.ipv4.conf.all.rp_filter = 0
  • net.ipv4.conf.[ethX].rp_filter = 0
Specify the following kernel parameter setting: net.ipv4.tcp_tw_reuse = 1. Pods fail to pass health checks. Replace the setting with the following content: net.ipv4.tcp_tw_reuse = 0.
Specify the following kernel parameter setting: net.ipv4.tcp_tw_recycle = 1. Network address translation errors occur. Replace the setting with the following content: net.ipv4.tcp_tw_recycle = 0.
Specify an improper value for the net.ipv4.ip_local_port_range kernel parameter. Network connectivity issues occasionally occur. Replace the setting with the following content: net.ipv4.ip_local_port_range="32768 60999".
Install firewall software, for example, Firewalld or ufw. The container network becomes inaccessible. Uninstall the firewall software and restart the nodes.
The security group of a node does not open UDP port 53 for the pod CIDR block. DNS resolution cannot be performed as normal in the cluster. Refer to the ECS official documentation and modify the security group configuration to open UDP port 53 for the pod CIDR block.
Modify or delete the tags that ACK added to SLB instances. The related SLB instances cannot work as normal. Restore the tags.
Modify the configurations of the SLB instances that are managed by ACK, including the configurations of the instances, listeners, and vServer groups. The SLB instances cannot work as normal. Restore the SLB configurations.
Remove the service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: ${YOUR_LB_ID} annotation, which specifies an existing SLB instance, from the Service configuration. The existing SLB instance does not distribute traffic for the Service. Add the annotation back to the Service configuration. For an example of a Service that uses this annotation, see the sketch after this table.
Note If a Service uses an existing SLB instance, you cannot modify the configuration to create an SLB instance for the Service. To use a new SLB instance, you must create a new Service.
Delete SLB instances that are created by ACK in the SLB console. Errors may occur in the cluster network. Delete SLB instances by deleting the Services that are associated with the SLB instances.
Manually delete the nginx-ingress-lb Service from the kube-system namespace of a cluster that has the NGINX Ingress controller installed. The NGINX Ingress controller does not run as normal or may stop running. Use the following YAML template to create a Service that has the same name:
apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: nginx-ingress-lb
  name: nginx-ingress-lb
  namespace: kube-system
spec:
  externalTrafficPolicy: Local
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    app: ingress-nginx
  type: LoadBalancer
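
For reference, the following minimal sketch shows a Service that reuses an existing SLB instance through the annotation described earlier in this table. The Service name, selector, ports, and the lb-xxxxxxxx instance ID are placeholders; replace them with your own values:
apiVersion: v1
kind: Service
metadata:
  name: my-app                    # placeholder name
  annotations:
    # Binds the Service to an existing SLB instance; replace lb-xxxxxxxx with your SLB instance ID.
    # Do not remove this annotation after the Service is created.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: "lb-xxxxxxxx"
spec:
  type: LoadBalancer
  selector:
    app: my-app                   # placeholder selector
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080              # placeholder target port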

High-risk operations on storage

High-risk operation Impact How to recover
In the ECS console, unmount cloud disks that are mounted to pods. I/O errors occur when you write data to the pods. Restart the pods and clear residual data on the node.
Unmount disks from their mount paths on nodes. Pod data is written to local disks. Restart the pod.
Manage cloud disks on the nodes. Pod data is written to local disks. Unrecoverable.
Mount a cloud disk to multiple pods. Pod data is written to local disks or I/O errors occur when you write data to the pods. Mount the cloud disk to only one pod. For a sketch of the related access modes, see the example after this table.
Manually delete the NAS directories that are mounted to pods. I/O errors occur when you write data to the pods. Restart the pod.
Delete the NAS file systems that are mounted to pods or delete the mount targets that are used to mount NAS file systems. I/O hangs occur when you write data to the pods. Restart the ECS instances.
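
As a minimal sketch of the "mount the cloud disk to only one pod" rule: a cloud disk claim uses the ReadWriteOnce access mode and is mounted by a single pod (see the disk-backed claim sketched after the node pool table above), while data that several pods must share belongs on a NAS-backed claim with ReadWriteMany. The claim name and storage class below are assumptions; use a NAS storage class that exists in your cluster:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data                         # placeholder name
spec:
  accessModes:
    - ReadWriteMany                         # NAS: can be mounted by multiple pods
  storageClassName: alibabacloud-cnfs-nas   # assumption: replace with a NAS storage class in your cluster
  resources:
    requests:
      storage: 100Gi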

High-risk operations on logs

High-risk operation Impact How to recover
Delete the /tmp/ccs-log-collector/pos directory on ECS instances. Duplicate logs are collected. Unrecoverable. The /tmp/ccs-log-collector/pos directory records the positions from which logs have already been collected.
Delete the /tmp/ccs-log-collector/buffer directory on ECS instances. Logs are lost. Unrecoverable. The /tmp/ccs-log-collector/buffer directory stores cached log files that need to be consumed.
Delete the aliyunlogconfig CustomResourceDefinitions (CRDs). Logs cannot be collected. Recreate the aliyunlogconfig CRDs that are deleted and the related resources. Logs that are generated within the period of time during which the aliyunlogconfig CRDs do not exist cannot be collected.

If you delete the aliyunlogconfig CRDs, the related log collection tasks are also deleted. After you recreate the aliyunlogconfig CRDs, you must also relaunch the log collection tasks.

Uninstall the logging component. Logs cannot be collected. Reinstall the logging component and manually create the aliyunlogconfig CRDs. Logs that are generated within the period of time during which the logging component and the aliyunlogconfig CRDs do not exist cannot be collected.

If you delete the logging component, the aliyunlogconfig CRDs and Logtail are also deleted. Logs that are generated within the period of time during which the logging component does not exist cannot be collected.
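
For reference when you recreate the aliyunlogconfig CRDs, the following is a minimal sketch of an AliyunLogConfig custom resource that collects container stdout. The apiVersion, field names, resource name, and Logstore name shown here are assumptions based on commonly published Log Service examples; verify them against the Log Service documentation for your logging component version before you apply the resource:
apiVersion: log.alibabacloud.com/v1alpha1   # assumption: verify for your logging component version
kind: AliyunLogConfig
metadata:
  name: stdout-example                      # placeholder name
spec:
  logstore: k8s-stdout                      # placeholder Logstore name
  logtailConfig:
    inputType: plugin
    configName: stdout-example              # typically kept identical to metadata.name
    inputDetail:
      plugin:
        inputs:
        - type: service_docker_stdout       # collect container stdout and stderr
          detail:
            Stdout: true
            Stderr: true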