Notes and list of risky operations for Container Service for Kubernetes (ACK)

Alibaba Cloud Container Service for Kubernetes (ACK) provides managed services for its technical architecture and core components. However, improper operations on unmanaged components or applications in your ACK cluster can cause service failures. To assess and prevent these risks, read the recommendations and notes in this topic before you use ACK.

Index

Item	References
Usage notes	Data plane components, Cluster upgrades, Native Kubernetes configurations, ACK serverless clusters, Registered clusters, App Catalog
Risky operations	Risky operations related to clusters, Risky operations related to node pools, Risky operations related to networking and load balancing, Risky operations related to storage, Risky operations related to logs

Usage notes

Data plane components

Data plane components are system components that run on your ECS instances, such as CoreDNS, Ingress, kube-proxy, Terway, and kubelet. Because these components run on instances that you own, you and ACK share responsibility for maintaining their stability.

ACK provides the following support for data plane components:

Manages component parameter settings, provides regular feature optimizations, bug fixes, and CVE patches, and offers related guidance documents.
Provides observability features such as monitoring and alerts for components. For some core components, logs are provided and exposed to you through Simple Log Service (SLS).
Provides configuration best practices, including component configuration recommendations based on cluster size.
Provides regular component inspections and some alert notification capabilities. Inspection items include but are not limited to component versions, configurations, payloads, deployment topologies, and the number of instances.

Follow these recommendations when you use data plane components:

Use the latest component versions. New versions are frequently released to fix bugs or provide new features. After a new version is released, upgrade the component at an appropriate time while ensuring service stability. Follow the instructions in the component upgrade guide. For more information, see Components.
In the ACK alert center, configure email addresses and mobile phone numbers for your contacts, and set your preferred alert notification methods. Alibaba Cloud sends alert information, service notices, and other messages through these channels. For more information, see Manage alerts for Container Service for Kubernetes.
When you receive a component stability risk report, follow the instructions to resolve the issue and eliminate security risks.
Configure custom parameters for data plane components only on the Operations Management > Components page of your cluster in the Container Service for Kubernetes console, or by using OpenAPI. Modifying component configurations through other channels may cause component exceptions. For more information, see Manage components.
Do not use the OpenAPI of IaaS layer products to change the runtime environment of components. This includes using ECS OpenAPI to change the running status of an ECS instance, modifying the security group configurations of worker nodes, changing the network configurations of worker nodes, or using Server Load Balancer OpenAPI to modify SLB configurations. Unauthorized changes to IaaS layer resources may cause data plane component exceptions.
Some data plane components are affected by their upstream open source versions, which may have bugs or vulnerabilities. Upgrade components promptly to prevent your services from being affected by bugs or vulnerabilities in open source components.

Cluster upgrades

Use the cluster update feature of ACK to update the Kubernetes versions of your clusters. Other methods may cause stability or compatibility issues. For more information, see Update clusters and separately update control planes and node pools.

ACK provides the following support for cluster upgrades:

Provides the feature to upgrade a cluster to a new Kubernetes version.
Provides a pre-check feature for new Kubernetes version upgrades to ensure that the cluster is ready for the upgrade.
Provides release notes for new Kubernetes versions, including the changes from previous versions.
Warns about potential risks that may occur due to resource changes when you upgrade to a new Kubernetes version.

Follow these recommendations when you use the cluster upgrade feature:

Run a pre-check before you upgrade the cluster. Fix any issues that block the upgrade based on the pre-check results.
Read the release notes for the new Kubernetes version carefully. Check the status of your cluster and services against the upgrade risks that ACK reports, and assess the risks for your own workloads. For more information, see [Discontinued] Overview of Kubernetes version releases.
The cluster upgrade feature does not support rollbacks. Create a thorough upgrade plan and back up your data before you start.
Upgrade the Kubernetes version of your cluster promptly within the support lifecycle of the current version, according to the ACK version support policy. For more information, see Version guide.

Native Kubernetes configurations

Do not modify key Kubernetes configurations, such as the paths, links, and content of the following directories:
- /var/lib/kubelet
- /var/lib/docker
- /etc/kubernetes
- /etc/kubeadm
- /var/lib/containerd
Do not use annotations reserved for Kubernetes clusters in YAML templates. Otherwise, resources may become unavailable, fail to be requested, or encounter exceptions. Annotations whose key prefix is kubernetes.io/ or k8s.io/, including subdomains such as pv.kubernetes.io/, are reserved for core components. Invalid example: pv.kubernetes.io/bind-completed: "yes".
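The annotation rule above can be checked before a template is applied. The following is a minimal sketch, not an ACK tool: the function name `find_reserved_annotations` is hypothetical, and it assumes that a key counts as reserved when the part before the `/` is `kubernetes.io`, `k8s.io`, or a subdomain of either, which is what makes `pv.kubernetes.io/bind-completed` an invalid example.

```python
# Domains whose annotation prefixes are treated as reserved (assumption of this sketch).
RESERVED_DOMAINS = ("kubernetes.io", "k8s.io")


def find_reserved_annotations(annotations):
    """Return the annotation keys whose prefix falls under a reserved domain."""
    reserved = []
    for key in annotations:
        # An annotation key has the form "<prefix>/<name>"; keys without "/" have no prefix.
        prefix, sep, _name = key.partition("/")
        if not sep:
            continue
        if any(prefix == d or prefix.endswith("." + d) for d in RESERVED_DOMAINS):
            reserved.append(key)
    return reserved


if __name__ == "__main__":
    template_annotations = {
        "pv.kubernetes.io/bind-completed": "yes",  # invalid example from this topic
        "example.com/owner": "team-a",             # hypothetical custom annotation
    }
    print(find_reserved_annotations(template_annotations))
```

A flagged key is a prompt for review rather than proof of an error: some annotations under these prefixes are documented for direct use, such as the `service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id` annotation mentioned later in this topic.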

ACK serverless clusters

In the following scenarios, ACK serverless clusters do not provide compensation:

To simplify cluster O&M, ACK serverless clusters provide managed capabilities for some system components. After you enable the managed feature for a component in a cluster, ACK is responsible for its deployment and maintenance. If your services are affected because you accidentally delete Kubernetes objects that the managed components depend on, ACK Serverless does not provide compensation.

Registered clusters

When you connect an external Kubernetes cluster using the registered cluster feature in the Management Console, ensure network stability between the external cluster and Alibaba Cloud.
ACK lets you register and connect to external Kubernetes clusters, but it cannot control the stability of the external clusters or prevent improper operations. Therefore, be careful when you configure information such as labels, annotations, and tags for the nodes of a registered cluster. Such operations may cause application exceptions.

App Catalog

To enrich Kubernetes applications, the ACK App Catalog provides applications that are adapted and customized based on open source software. ACK cannot control the defects inherent in the open source software itself. Be aware of this risk. For more information, see App Catalog.

Risky operations

Some functional modules in ACK involve risky operations that can significantly affect service stability. Before you use these features, carefully read about the following risky operations and their impacts.

Risky operations related to clusters

Classification	Risky operation	Impact	Recovery plan
API Server	Reuse the SLB instance used by the API Server for other scenarios, such as reusing the SLB instance for a LoadBalancer service.	The cluster becomes unavailable, which affects service traffic.	Revert to the original configuration, or request after-sales support.
	Modify the listener, vServer group, access control list (ACL), or tag configurations of the SLB instance used by the API Server. These configurations control SLB forwarding.	The cluster becomes abnormal.	Revert to the original configuration.
	Delete the SLB instance used by the API Server.	The cluster becomes inoperable.	Unrecoverable. Re-create the cluster. For more information, see Create an ACK managed cluster.
Worker nodes	Modify the security group of a node in the cluster.	The node may become unavailable.	Add the node back to the security group that was automatically created for the cluster. For more information, see Associate a security group with an instance (primary ENI).
	The node expires or is destroyed.	The node becomes unavailable.	Unrecoverable.
	Reinstall the operating system.	Components on the node are deleted.	Remove the node from the cluster and then add it back. For more information, see Remove a node and Add existing nodes to a cluster.
	Upgrade the version of a node component on your own.	The node may become unusable.	Roll back to the original version.
	Change the IP address of the node.	The node becomes unavailable.	Change the IP address back to the original one.
	Modify the parameters of core components, such as kubelet, docker, and containerd, on your own.	The node may become unavailable.	Configure the parameters as recommended on the official website.
	Modify the operating system configuration.	The node may become unavailable.	Try to revert the configuration item, or delete the node and purchase a new one.
	Modify the node time.	Components on the node may work abnormally.	Revert the node time.
	Add computing power resources to the cluster in a way that is not supported by ACK.	ACK provides multiple ways to add computing power resources to a cluster, such as using the console, OpenAPI, and command-line interface (CLI). For more information, see Add existing nodes to a cluster. If you add a node to a cluster through other means, ACK cannot identify the source of the node. As a result, ACK cannot provide product capabilities such as node lifecycle management, automated O&M, and technical support. For more information about the risks, see Why does the console show that the source of the node pool to which a node belongs is "Other Nodes"?.	We recommend that you manage computing power resources using node pools. If you want to continue using the node, ensure its compatibility with all cluster components, such as Kubernetes components, networking, storage, and security components.
Master nodes (ACK dedicated clusters)	Modify the security group of a node in the cluster.	The master node may become unavailable.	Add the node back to the security group that was automatically created for the cluster. For more information, see Associate a security group with an instance (primary ENI).
	The node expires or is destroyed.	The master node becomes unavailable.	Unrecoverable.
	Reinstall the operating system.	Components on the master node are deleted.	Unrecoverable.
	Upgrade the version of the Master or etcd component on your own.	The cluster may become unusable.	Roll back to the original version.
	Delete or format data in core directories, such as /etc/kubernetes, on the node.	The master node becomes unavailable.	Unrecoverable.
	Change the IP address of the node.	The master node becomes unavailable.	Change the IP address back to the original one.
	Modify the parameters of core components, such as etcd, kube-apiserver, and docker, on your own.	The master node may become unavailable.	Configure the parameters as recommended on the official website.
	Replace the Master or etcd certificate on your own.	The cluster may become unusable.	Unrecoverable.
	Add or remove Master nodes on your own.	The cluster may become unusable.	Unrecoverable.
	Modify the node time.	Components on the node may work abnormally.	Revert the node time.
Others	Change or modify permissions using RAM.	Some cluster resources, such as Server Load Balancers, may fail to be created.	Revert to the original permissions.
Others	Modify or delete the preset PodSecurityPolicy-related resources in the cluster. This includes the PodSecurityPolicy resource named `ack.privileged`, and the ClusterRole, ClusterRoleBinding, Role, and RoleBinding resources whose names start with `ack:podsecuritypolicy:`. Note: This applies only to clusters of a version earlier than 1.26.	Core cluster components may become abnormal. You may be unable to create or update pod resources in the cluster.	Recover the related resources. For more information, see Configure or recover the default Pod security policies of ACK.
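The naming rule for the preset PodSecurityPolicy-related resources in the last row can be expressed as a small guard, for example in a cleanup script that must skip these resources. This is only a sketch: the function name `is_preset_psp_resource` is hypothetical, and the names it matches are taken from the table above.

```python
PRESET_PSP_NAME = "ack.privileged"
PRESET_RBAC_PREFIX = "ack:podsecuritypolicy:"
RBAC_KINDS = {"ClusterRole", "ClusterRoleBinding", "Role", "RoleBinding"}


def is_preset_psp_resource(kind, name):
    """Return True if (kind, name) matches a preset PodSecurityPolicy-related resource."""
    if kind == "PodSecurityPolicy":
        return name == PRESET_PSP_NAME
    if kind in RBAC_KINDS:
        return name.startswith(PRESET_RBAC_PREFIX)
    return False


if __name__ == "__main__":
    # The second resource name is a hypothetical example that matches the documented prefix.
    print(is_preset_psp_resource("PodSecurityPolicy", "ack.privileged"))
    print(is_preset_psp_resource("ClusterRole", "ack:podsecuritypolicy:example"))
```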

Risky operations related to node pools

Risky operation	Impact	Recovery plan
Delete the scaling group.	The node pool becomes abnormal.	Unrecoverable. You can only re-create the node pool. For more information, see Create a node pool.
Remove a node using kubectl.	The number of nodes displayed for the node pool does not match the actual number.	Remove the specified node using the Management Console or the node pool-related API (see Remove a node), or scale in the node pool by modifying its expected number of nodes (see Create and manage a node pool).
Directly release an ECS instance.	The node pool page in the console may display exceptions. If an expected number of nodes is configured for the node pool, the node pool automatically scales out to maintain that number.	Unrecoverable. The correct way is to scale in the node pool by modifying its expected number of nodes (see Create and manage a node pool) or remove a specified node (see Remove a node) using the Management Console or the node pool-related API.
Manually scale out or scale in a node pool for which auto scaling is enabled.	The auto scaling component automatically adjusts the number of nodes based on the policy, which leads to unexpected results.	Unrecoverable. Do not manually intervene with an auto scaling node pool.
Modify the maximum or minimum number of instances in the ESS scaling group.	Scaling may become abnormal.	For a node pool without auto scaling enabled, change the maximum and minimum number of instances in the ESS scaling group to the default values of 2000 and 0. For a node pool with auto scaling enabled, change the maximum and minimum number of instances in the ESS scaling group to be consistent with the maximum and minimum number of nodes in the node pool.
Add an existing node without backing up its data first.	Data that was on the instance before it was added is lost.	Unrecoverable. Before you manually add an existing node, back up all data that you want to keep. When a node is added automatically, its system disk is replaced, so back up any data that you need from the system disk in advance.
Save important data on the system disk of a node.	The self-healing operation of a node pool may repair a node by resetting its configuration, which can cause data loss on the system disk.	Unrecoverable. The correct way is to store important data on an extra data disk or on a cloud disk, NAS, or OSS.
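The recovery rule for the ESS scaling group limits in the table above depends on whether auto scaling is enabled for the node pool. The following sketch makes that decision explicit. The function name and its parameters are hypothetical; only the default values of 0 and 2000 come from this topic.

```python
DEFAULT_MIN_INSTANCES = 0
DEFAULT_MAX_INSTANCES = 2000


def expected_scaling_group_bounds(auto_scaling_enabled, pool_min_nodes=None, pool_max_nodes=None):
    """Return the (min, max) instance counts that the ESS scaling group should be reverted to."""
    if not auto_scaling_enabled:
        # Without auto scaling, the scaling group uses the default limits.
        return (DEFAULT_MIN_INSTANCES, DEFAULT_MAX_INSTANCES)
    if pool_min_nodes is None or pool_max_nodes is None:
        raise ValueError("node pool minimum and maximum are required when auto scaling is enabled")
    if pool_min_nodes > pool_max_nodes:
        raise ValueError("node pool minimum must not exceed its maximum")
    # With auto scaling, the scaling group must mirror the node pool limits.
    return (pool_min_nodes, pool_max_nodes)


if __name__ == "__main__":
    print(expected_scaling_group_bounds(False))
    print(expected_scaling_group_bounds(True, pool_min_nodes=1, pool_max_nodes=10))
```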

Risky operations related to networking and load balancing

Risky operation	Impact	Recovery plan
Set the kernel parameter `net.ipv4.ip_forward` to 0.	Network connection fails.	Change the kernel parameter to `net.ipv4.ip_forward=1`.
Set the kernel parameter `net.ipv4.conf.all.rp_filter` or `net.ipv4.conf.[ethX].rp_filter` to 1 or 2. Note: `ethX` represents all network interface cards whose names start with `eth`.	Network connection fails.	Change the kernel parameters to `net.ipv4.conf.all.rp_filter = 0` and `net.ipv4.conf.[ethX].rp_filter = 0`.
Set the kernel parameter `net.ipv4.tcp_tw_reuse` to 1.	Pod health checks become abnormal.	Change the kernel parameter to `net.ipv4.tcp_tw_reuse = 0`.
Set the kernel parameter `net.ipv4.tcp_tw_recycle` to 1.	NAT becomes abnormal.	Change the kernel parameter to `net.ipv4.tcp_tw_recycle = 0`.
Modify the kernel parameter `net.ipv4.ip_local_port_range`.	The network connection intermittently fails.	Change the kernel parameter to the default value `net.ipv4.ip_local_port_range="32768 60999"`.
Install firewall software, such as Firewalld or ufw.	The container network connection fails.	Uninstall the firewall software and restart the node.
The node security group configuration does not allow UDP traffic on port 53 for the container CIDR block.	DNS in the cluster does not work correctly.	Configure the security group to allow traffic as recommended on the official website.
Modify or delete the tags of an SLB instance added by ACK.	The SLB instance becomes abnormal.	Revert the tags of the SLB instance.
Modify the configurations of an ACK-managed SLB instance, including the SLB instance, listener, and vServer group, in the Server Load Balancer console.	The SLB instance becomes abnormal.	Revert the configurations of the SLB instance.
Remove the annotation for reusing an existing SLB instance from the service: `service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: ${YOUR_LB_ID}`.	The SLB instance becomes abnormal.	Add the annotation for reusing an existing SLB instance to the service. Note: A service that reuses an existing SLB instance cannot be directly changed to a service that uses an automatically created SLB instance. You must re-create the service.
Delete an SLB instance created by ACK in the Server Load Balancer console.	The cluster network may become abnormal.	Delete the SLB instance by deleting the service. For more information, see Delete a Service.
Manually delete the `nginx-ingress-lb` service in the kube-system namespace when the Nginx Ingress Controller component is installed.	The Ingress controller does not work correctly and may even crash.	Create a new service with the same name using the following YAML template:
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
      labels:
        app: nginx-ingress-lb
      name: nginx-ingress-lb
      namespace: kube-system
    spec:
      externalTrafficPolicy: Local
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 80
      - name: https
        port: 443
        protocol: TCP
        targetPort: 443
      selector:
        app: ingress-nginx
      type: LoadBalancer
Add or modify the `nameserver` option in the DNS configuration file /etc/resolv.conf on an ECS node.	If the specified DNS server is misconfigured, DNS resolution may fail, which affects the normal operation of the cluster.	If you want to use a self-managed DNS server as an upstream server, we recommend that you configure it on the CoreDNS side. For more information, see Instructions for configuring unmanaged CoreDNS.
Modify or delete ENIs or Lingjun ENIs created by ACK.	The pod network is interrupted.	Unrecoverable.
Modify or delete network-related CRDs: `podnetworkings.network.alibabacloud.com`, `podenis.network.alibabacloud.com`, `networkinterfaces.network.alibabacloud.com`, `nodes.network.alibabacloud.com`, `noderuntimes.network.alibabacloud.com`, and CRDs whose names end with `.cilium.io` or `.crd.projectcalico.org`.	The Terway component will not work. In severe cases, this may cause network interruptions and pod exceptions.	Unrecoverable.
Create, modify, or delete network-related system CRs of the following types: `podenis.network.alibabacloud.com`, `networkinterfaces.network.alibabacloud.com`, `nodes.network.alibabacloud.com`, `noderuntimes.network.alibabacloud.com`, and types whose names end with `.cilium.io` or `.crd.projectcalico.org`.	The Terway component will not work. In severe cases, this may cause network interruptions and pod exceptions.	Delete the custom CR definition and re-create the associated pods.
Modify or delete fields in the Terway network configuration that are not allowed to be modified. For more information about the configuration parameters, see Customize Terway configuration parameters.	The Terway component will not work. In severe cases, this may cause network interruptions and pod exceptions.	Revert to the original configuration and restart the node.
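The kernel parameter rows in the table above lend themselves to an automated check on a node. The following is a rough sketch under stated assumptions: the function name `find_risky_sysctls` is hypothetical, the input is text in the `name = value` form printed by `sysctl -a`, and the check flags any deviation from the recovery values listed in this topic, which is stricter than the specific risky values the table names.

```python
import re

# Recovery values listed in this topic. Keys are regular expressions over parameter names.
SAFE_VALUES = {
    r"net\.ipv4\.ip_forward": "1",
    r"net\.ipv4\.conf\.all\.rp_filter": "0",
    r"net\.ipv4\.conf\.eth[^.]*\.rp_filter": "0",  # any interface whose name starts with "eth"
    r"net\.ipv4\.tcp_tw_reuse": "0",
    r"net\.ipv4\.tcp_tw_recycle": "0",
    r"net\.ipv4\.ip_local_port_range": "32768 60999",
}


def find_risky_sysctls(sysctl_text):
    """Return (name, actual, expected) for each parameter that deviates from SAFE_VALUES."""
    findings = []
    for line in sysctl_text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not "name = value"
        name = name.strip()
        value = " ".join(value.split())  # collapse tabs and repeated spaces
        for pattern, expected in SAFE_VALUES.items():
            if re.fullmatch(pattern, name) and value != expected:
                findings.append((name, value, expected))
    return findings


if __name__ == "__main__":
    sample = "net.ipv4.ip_forward = 0\nnet.ipv4.conf.eth0.rp_filter = 2\nvm.swappiness = 60\n"
    for name, actual, expected in find_risky_sysctls(sample):
        print(f"{name}: found {actual!r}, topic recommends {expected!r}")
```

On a real node, the captured output of `sysctl -a` could be passed to the function instead of the sample string. Parameters that do not exist on a given kernel are simply absent from that output and are not reported.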

Risky operations related to storage

Risky operation	Impact	Recovery plan
Manually detach a disk in the console.	The pod reports an I/O error during write operations.	Restart the pod and manually clear the residual mount information on the node.
Run the umount command for the disk mount path on the node.	The pod writes data to the local disk.	Restart the pod.
Directly operate on a disk on the node.	The pod writes data to the local disk.	Unrecoverable.
Mount the same disk to multiple pods.	The pod writes data to the local disk or reports an I/O error.	Ensure that one disk is used by only one pod. Important: Cloud disks are non-shared storage provided by the Alibaba Cloud storage team and can be mounted to only one pod at a time.
Manually delete the NAS mount directory.	The pod reports an I/O error during write operations.	Restart the pod.
Delete an in-use NAS disk or mount target.	The pod experiences an I/O hang.	Restart the ECS node. For more information, see Restart an ECS instance.

Risky operations related to logs

Risky operation	Impact	Recovery plan
Delete the /tmp/ccs-log-collector/pos directory on the host.	Logs are collected repeatedly.	Unrecoverable. The files in this directory record the log collection position.
Delete the /tmp/ccs-log-collector/buffer directory on the host.	Logs are lost.	Unrecoverable. This directory contains cache files for logs that are waiting to be consumed.
Delete the aliyunlogconfig CRD resource.	Log collection fails.	Re-create the deleted CRD and its corresponding resources. However, logs from the failure period cannot be recovered. Deleting a CRD also deletes all its associated instances. Even after you recover the CRD, you must manually create the deleted instances.
Uninstall the log component.	Log collection fails.	Reinstall the log component and manually recover the aliyunlogconfig CRD instances. Logs from the uninstallation period cannot be recovered. Uninstalling the log component is equivalent to deleting the aliyunlogconfig CRD and the Logtail log collector. All log collection capabilities are lost during this period.