Container Service for Kubernetes: Usage notes and risky operations

Last Updated: Dec 12, 2025

Alibaba Cloud Container Service for Kubernetes (ACK) manages the technical architecture and core components of the container service. However, for unmanaged components and applications that run in ACK clusters, improper operations can cause business failures. To better estimate and avoid these operational risks, carefully read the suggestions and notes in this topic before you use ACK.

Index

Item

References

Usage notes

Risky operations

Usage notes

Data plane components

Data plane components are system components, such as CoreDNS, Ingress, kube-proxy, Terway, and kubelet, that run on your ECS instances. Because they run on instances in your account rather than on the managed control plane, you and Alibaba Cloud are jointly responsible for maintaining their stability.

ACK provides the following support for data plane components:

  • ACK provides features such as parameterized configuration management, regular feature optimization, bug fixes, and CVE patches for components, and provides corresponding guidance documents.

  • ACK provides observability features such as monitoring and alerts for components. For some core components, component logs are made available to you through Simple Log Service (SLS).

  • ACK provides configuration best practices and suggestions. ACK also provides component configuration suggestions based on cluster scale.

  • ACK provides regular inspections and alert notification capabilities for components. Inspection items include component versions, component configurations, component payloads, component deployment topologies, and the number of component instances.

When you use data plane components, follow these suggestions:

  • Use the latest component versions. New component versions are released frequently to fix bugs and provide new features. After a new version is released, upgrade the component at a time that does not compromise business stability, and follow the instructions in the component upgrade guide. For more information, see Components.

  • In the alert center of the ACK console, you can set the email addresses and phone numbers for your contacts and specify how you want to receive alert notifications. Alibaba Cloud sends ACK alert information and service notices through these channels. For more information, see Manage alerts for ACK.

  • After you receive a component stability risk report, follow the instructions in the report to resolve the reported risks promptly.

  • When you use data plane components, you can configure custom parameters for the components in the ACK console on the O&M > Components page or by calling the OpenAPI. Modifying component configurations through other channels may cause component features to become abnormal. For more information, see Manage components.

  • Do not directly use the OpenAPI of IaaS products to change the runtime environment of components. For example, do not use the OpenAPI of ECS to change the running state of an ECS instance, modify the security group configuration of a worker node, or change the network configuration of a worker node. Do not use the OpenAPI of SLB to modify SLB configurations. Unauthorized changes to IaaS layer resources may cause data plane components to become abnormal.

  • Some data plane components may inherit bugs or vulnerabilities from their upstream open source versions. Upgrade the components promptly to prevent your business from being affected by these bugs or vulnerabilities.

Cluster upgrades

Upgrade the Kubernetes version of your cluster only by using the cluster upgrade feature of ACK. Upgrading the Kubernetes version by other means may cause stability and compatibility issues in the ACK cluster. For more information, see Upgrade a cluster and independently upgrade the control plane and node pools of a cluster.

ACK provides the following support for cluster upgrades:

  • ACK provides the feature to upgrade the cluster to a new Kubernetes version.

  • ACK provides a pre-check feature for new Kubernetes version upgrades to ensure that the current state of the cluster supports the upgrade.

  • ACK provides version guide documents for new Kubernetes versions, including changes compared to previous versions.

  • ACK prompts you about potential risks due to resource changes when you upgrade to a new Kubernetes version.

When you use the cluster upgrade feature, follow these suggestions:

  • Run a pre-check before the cluster upgrade and fix the blocking issues based on the pre-check results.

  • Carefully read the version guide for the new Kubernetes version. Confirm the status of the cluster and your business against the upgrade risks that ACK prompts, and assess the upgrade risks yourself. For more information, see [Offline] Kubernetes release overview.

  • The cluster upgrade feature does not support rollbacks. Create a thorough upgrade plan and back up data in advance.

  • Upgrade the Kubernetes version of the cluster within the support lifecycle of the current version, based on the ACK version support mechanism. For more information, see Version guide.

Native Kubernetes configurations

  • Do not modify key Kubernetes configurations, such as the paths, links, and content of the following directories:

    • /var/lib/kubelet

    • /var/lib/docker

    • /etc/kubernetes

    • /etc/kubeadm

    • /var/lib/containerd

  • Do not use annotations or labels that are reserved by Kubernetes in YAML templates. Otherwise, resources may become unavailable, requests may fail, or other issues may occur. Annotations and labels whose prefix is kubernetes.io/ or k8s.io/ are reserved for Kubernetes core components. Invalid example: pv.kubernetes.io/bind-completed: "yes".
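The reserved-prefix rule above can be checked before a template is applied. The following is a minimal sketch, assuming the annotations or labels have already been loaded into a Python dict; the function name reserved_keys and the sample keys are hypothetical and are not part of ACK or Kubernetes tooling.

```python
# Domains whose prefixes are reserved for Kubernetes core components.
RESERVED_DOMAINS = ("kubernetes.io", "k8s.io")


def reserved_keys(metadata_map):
    """Return the keys whose prefix is a reserved domain or a subdomain of one."""
    flagged = []
    for key in metadata_map:
        prefix, sep, _name = key.partition("/")
        if not sep:
            continue  # A key without a prefix is private to the user.
        if any(prefix == d or prefix.endswith("." + d) for d in RESERVED_DOMAINS):
            flagged.append(key)
    return sorted(flagged)


# Hypothetical annotations taken from a manifest.
annotations = {
    "pv.kubernetes.io/bind-completed": "yes",
    "example.com/owner": "team-a",
    "description": "demo",
}
print(reserved_keys(annotations))
```

A flagged key is a prompt for review rather than proof of an error. Some keys under these prefixes, such as the recommended label app.kubernetes.io/name, are intended to be set by users.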

ACK serverless clusters

Compensation is not provided for ACK serverless clusters in the following scenarios:

  • To simplify cluster O&M, ACK serverless clusters provide managed capabilities for some system components. After you enable the managed feature for a component, ACK is responsible for its deployment and maintenance. However, ACK Serverless does not provide compensation if your business is affected for reasons such as the accidental deletion of Kubernetes objects that managed components depend on.

Registered clusters

  • When you connect an external Kubernetes cluster using the registered cluster feature in the Container Service for Kubernetes console, you must ensure network stability between your external cluster and Alibaba Cloud.

  • ACK lets you register and connect to external Kubernetes clusters but cannot control the stability of the external clusters or prevent improper operations. Therefore, when you configure information such as labels, annotations, and tags for external cluster nodes through a registered cluster, your applications may become abnormal. You must perform these operations with caution.

App Catalog

To enrich Kubernetes applications, the ACK App Marketplace provides an App Catalog. These applications are adapted and customized based on open source software. ACK cannot control the defects inherent in the open source software. You must be aware of this risk. For more information, see App Marketplace.

Risky operations

Some operations on ACK features are risky and may have a significant impact on business stability. Before you use these features, make sure that you understand the following risky operations and their impacts.

Risky operations related to clusters

Category

Risky operation

Impact

Recovery plan

API Server

Delete the SLB instance that is used by the API Server.

The cluster becomes inoperable.

Unrecoverable. You must recreate the cluster. For more information about how to recreate a cluster, see Create an ACK managed cluster.

Worker node

Modify the security group of a node in the cluster.

The node may become unavailable.

You can add the node back to the security group that was automatically created for the cluster. For more information, see Associate a security group with an instance (primary ENI).

The node expires or is destroyed.

The node becomes unavailable.

Unrecoverable.

Reinstall the operating system.

Components on the node are deleted.

You can remove the node from the cluster and then add it back. For more information, see Remove a node and Add existing nodes.

Upgrade node components by yourself.

The node may become unavailable.

You can roll back to the original version.

Change the IP address of the node.

The node becomes unavailable.

You can change the IP address back to the original one.

Modify the parameters of core components such as kubelet, docker, and containerd by yourself.

The node may become unavailable.

You can configure the parameters as recommended in the official documentation.

Modify the operating system configuration.

The node may become unavailable.

You can try to restore the configuration items or delete the node and purchase a new one.

Modify the node time.

Components on the node may become abnormal.

You can restore the node time.

Add computing resources to the cluster using a method that is not supported by ACK.

ACK provides multiple methods to add computing resources to a cluster, such as using the console, OpenAPI, and command-line interface (CLI). For more information, see Add existing nodes. If you add a node to a cluster using other methods, ACK cannot identify the source of the node. As a result, ACK cannot provide product capabilities such as node lifecycle management, automated O&M, and technical support. For more information about the risks, see Why does the console display that the source of the node pool to which a node belongs is "Other Nodes"?.

You can use node pools to manage computing resources. If you want to continue using the node, you must ensure the compatibility between the node and cluster components, such as Kubernetes, networking, storage, and security components.

Master node (for ACK dedicated clusters)

Modify the security group of a node in the cluster.

The master node may become unavailable.

You can add the node back to the security group that was automatically created for the cluster. For more information, see Associate a security group with an instance (primary ENI).

The node expires or is destroyed.

The master node becomes unavailable.

Unrecoverable.

Reinstall the operating system.

Components on the master node are deleted.

Unrecoverable.

Upgrade the Master or etcd component version by yourself.

The cluster may become unusable.

You can roll back to the original version.

Delete or format the data in core directories such as /etc/kubernetes on the node.

The master node becomes unavailable.

Unrecoverable.

Change the IP address of the node.

The master node becomes unavailable.

You can change the IP address back to the original one.

Modify the parameters of core components such as etcd, kube-apiserver, and docker by yourself.

The master node may become unavailable.

You can configure the parameters as recommended in the official documentation.

Replace the Master or etcd certificate by yourself.

The cluster may become unusable.

Unrecoverable.

Add or remove Master nodes by yourself.

The cluster may become unusable.

Unrecoverable.

Modify the node time.

Components on the node may become abnormal.

You can restore the node time.

Other

Change or modify permissions using RAM.

Some cluster resources, such as SLB instances, may fail to be created.

You can restore the original permissions.

Modify or delete the preset PodSecurityPolicy-related resources in the cluster. These resources include the PodSecurityPolicy resource named ack.privileged and the ClusterRole, ClusterRoleBinding, Role, and RoleBinding resources whose names start with ack:podsecuritypolicy:. This risk applies only to clusters of a version earlier than 1.26.

Core components of the cluster may become abnormal. You may be unable to create or update pods in the cluster.

You can restore the related resources. For more information, see Configure or restore the default pod security policies of ACK.
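Cleanup scripts that delete RBAC or policy objects in bulk can skip the preset PodSecurityPolicy-related resources described above. The following is a minimal sketch of such a guard, built only from the names given in this topic; the function name is_preset_psp_resource and the sample resource name ack:podsecuritypolicy:example are hypothetical.

```python
PSP_NAME = "ack.privileged"
RBAC_KINDS = {"ClusterRole", "ClusterRoleBinding", "Role", "RoleBinding"}
RBAC_PREFIX = "ack:podsecuritypolicy:"


def is_preset_psp_resource(kind, name):
    """Return True if (kind, name) matches a preset PSP-related resource."""
    if kind == "PodSecurityPolicy":
        return name == PSP_NAME
    return kind in RBAC_KINDS and name.startswith(RBAC_PREFIX)


# Hypothetical deletion candidates collected by a cleanup script.
candidates = [
    ("PodSecurityPolicy", "ack.privileged"),
    ("ClusterRole", "ack:podsecuritypolicy:example"),
    ("ClusterRole", "my-app-reader"),
]
safe_to_delete = [c for c in candidates if not is_preset_psp_resource(*c)]
print(safe_to_delete)
```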

Risky operations related to node pools

Risky operation

Impact

Recovery plan

Delete the scaling group.

The node pool becomes abnormal.

Unrecoverable. You can only recreate the node pool. For more information about how to recreate a node pool, see Create a node pool.

Remove a node using kubectl.

The number of nodes displayed for the node pool is inconsistent with the actual number.

You can remove the specified node using the ACK console or by calling the API operations related to node pools (see Remove a node) or scale in the node pool by changing the number of expected nodes (see Create and manage a node pool).

Directly release the ECS instance.

The node pool product page may be displayed abnormally. A node pool for which you specified the number of expected nodes automatically scales out to the expected number of nodes based on the node pool configuration.

Unrecoverable. The correct practice is to scale in the node pool by changing the number of expected nodes in the ACK console or by calling the API operations related to node pools (see Create and manage a node pool) or remove the specified node (see Remove a node).

Manually scale out or scale in a node pool for which auto scaling is enabled.

The auto scaling component automatically adjusts the number of nodes based on the policy. The result may not be what you expect.

Unrecoverable. Do not manually scale a node pool for which auto scaling is enabled.

Modify the maximum or minimum number of instances in the ESS scaling group.

Scaling may become abnormal.

  • For a node pool for which auto scaling is disabled, you can set the maximum and minimum number of instances in the ESS scaling group to the default values of 2000 and 0.

  • For a node pool for which auto scaling is enabled, you can set the maximum and minimum number of instances in the ESS scaling group to be the same as the maximum and minimum number of nodes in the node pool.

Add an existing node without backing up its data first.

Data on the instance is lost after the instance is added.

Unrecoverable.

  • Before you manually add an existing node, you must back up all data that you want to retain.

  • When a node is automatically added, its system disk is replaced. You must back up useful data from the system disk in advance.

Save important data on the system disk of a node.

The self-healing feature of a node pool may reset the node configuration to repair the node. This may cause data loss on the system disk.

Unrecoverable. The correct practice is to store important data on a data disk, cloud disk, NAS volume, or OSS bucket.
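The recovery rule for the ESS scaling group limits described above can be written as a small function. This is a minimal sketch that encodes only the two cases stated in this topic; the function name expected_ess_bounds and its parameters are hypothetical, and the function does not call any Alibaba Cloud API.

```python
def expected_ess_bounds(auto_scaling_enabled, pool_min=None, pool_max=None):
    """Return the (minimum, maximum) instance counts the ESS scaling group should have."""
    if not auto_scaling_enabled:
        # Auto scaling disabled: restore the default values of 0 and 2000.
        return (0, 2000)
    if pool_min is None or pool_max is None:
        raise ValueError("node pool limits are required when auto scaling is enabled")
    if pool_min > pool_max:
        raise ValueError("pool_min must not exceed pool_max")
    # Auto scaling enabled: mirror the node pool's own minimum and maximum.
    return (pool_min, pool_max)


print(expected_ess_bounds(False))
print(expected_ess_bounds(True, pool_min=1, pool_max=10))
```

Comparing the returned pair with the values currently shown for the scaling group indicates whether the limits were changed outside the node pool.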

Risky operations related to networking and load balancing

Risky operation

Impact

Recovery plan

Set the kernel parameter net.ipv4.ip_forward to 0.

Network connection fails.

You can set the kernel parameter back to net.ipv4.ip_forward=1.

Set any of the following kernel parameters to 1 or 2:

  • net.ipv4.conf.all.rp_filter = 1|2

  • net.ipv4.conf.[ethX].rp_filter = 1|2

    Note

    ethX represents all network interface cards whose names start with eth.

Network connection fails.

You can modify the kernel parameter to:

  • net.ipv4.conf.all.rp_filter = 0

  • net.ipv4.conf.[ethX].rp_filter = 0

Set the kernel parameter net.ipv4.tcp_tw_reuse to 1.

The health check of pods becomes abnormal.

You can set the kernel parameter back to net.ipv4.tcp_tw_reuse = 0.

Set the kernel parameter net.ipv4.tcp_tw_recycle to 1.

NAT becomes abnormal.

You can set the kernel parameter back to net.ipv4.tcp_tw_recycle = 0.

Modify the kernel parameter net.ipv4.ip_local_port_range.

The network connection intermittently fails.

You can modify the kernel parameter to the default value net.ipv4.ip_local_port_range="32768 60999".

Install firewall software, such as Firewalld or ufw.

The container network connection fails.

You can uninstall the firewall software and restart the node.

The security group of the node does not allow access to UDP port 53 of the container CIDR block.

The DNS service in the cluster cannot work as expected.

You can configure the security group to allow access as recommended in the official documentation.

Modify or delete the tags of an SLB instance added by ACK.

The SLB instance becomes abnormal.

You can restore the tags of the SLB instance.

Modify the configurations of an SLB instance managed by ACK in the SLB console. The configurations include the SLB instance, listeners, and vServer groups.

The SLB instance becomes abnormal.

You can restore the configurations of the SLB instance.

Remove the annotation for reusing an existing SLB instance from the Service. The annotation is service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: ${YOUR_LB_ID}.

The SLB instance becomes abnormal.

You can add the annotation for reusing an existing SLB instance to the Service.

Note

A Service that reuses an existing SLB instance cannot be directly modified to use an automatically created SLB instance. You must recreate the Service.

Delete an SLB instance created by ACK in the SLB console.

The cluster network may become abnormal.

You can delete the SLB instance by deleting the Service. For more information about how to delete a Service, see Delete a Service.

Manually delete the nginx-ingress-lb Service from the kube-system namespace when the Nginx Ingress Controller component is installed.

The Ingress controller does not work as expected and may even crash.

You can create a Service with the same name using the following YAML template.

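apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: nginx-ingress-lb
  name: nginx-ingress-lb    # Same name as the deleted Service
  namespace: kube-system
spec:
  externalTrafficPolicy: Local    # Route only to nodes that run controller pods; preserves the client source IP
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    app: ingress-nginx    # Must match the labels on the Ingress controller pods
  type: LoadBalancer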

Add or modify the nameserver option in the /etc/resolv.conf DNS configuration file on an ECS node.

If the DNS server is not properly configured, DNS resolution may fail. This affects the normal operation of the cluster.

If you want to use a self-managed DNS server as the upstream server, you can configure it on the CoreDNS side. For more information, see Instructions on how to configure unmanaged CoreDNS components.

Modify or delete ENIs or Lingjun ENIs created by ACK.

The pod network is interrupted.

Unrecoverable.

Modify or delete network-related CRDs.

podnetworkings.network.alibabacloud.com
podenis.network.alibabacloud.com
networkinterfaces.network.alibabacloud.com
nodes.network.alibabacloud.com
noderuntimes.network.alibabacloud.com
*.cilium.io
*.crd.projectcalico.org

The Terway component will stop working. In severe cases, this may cause network interruptions and abnormal pods.

Unrecoverable.

Create, modify, or delete network-related system CRs.

podenis.network.alibabacloud.com
networkinterfaces.network.alibabacloud.com
nodes.network.alibabacloud.com
noderuntimes.network.alibabacloud.com
*.cilium.io
*.crd.projectcalico.org

The Terway component will stop working. In severe cases, this may cause network interruptions and abnormal pods.

You can delete the custom CR definitions and recreate the associated pods.

Modify or delete fields that are not allowed to be modified in the Terway network configuration. For more information about the configuration parameters, see Customize Terway parameters.

The Terway component will stop working. In severe cases, this may cause network interruptions and abnormal pods.

You can restore the original configuration and restart the node.
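The kernel parameter entries listed above for networking and load balancing can be audited on a node by comparing its current settings with the recovery values from this topic. The following is a minimal sketch, assuming the input is text in the "key = value" form printed by sysctl -a; the function names parse_sysctl and find_deviations and the sample input are hypothetical.

```python
import re

# Recovery values taken from the kernel parameter entries in this topic.
EXPECTED = {
    "net.ipv4.ip_forward": "1",
    "net.ipv4.conf.all.rp_filter": "0",
    "net.ipv4.tcp_tw_reuse": "0",
    "net.ipv4.tcp_tw_recycle": "0",
    "net.ipv4.ip_local_port_range": "32768 60999",
}
# rp_filter must also be 0 on every interface whose name starts with eth.
ETH_RP_FILTER = re.compile(r"^net\.ipv4\.conf\.eth[^.]*\.rp_filter$")


def parse_sysctl(text):
    """Parse 'key = value' lines into a dict, collapsing whitespace in values."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = " ".join(value.split())
    return settings


def find_deviations(settings):
    """Return {key: (current, expected)} for parameters that differ."""
    problems = {}
    for key, value in settings.items():
        want = EXPECTED.get(key)
        if want is None and ETH_RP_FILTER.match(key):
            want = "0"
        if want is not None and value != want:
            problems[key] = (value, want)
    return problems


# Hypothetical excerpt of sysctl output from a worker node.
sample = """\
net.ipv4.ip_forward = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 2
net.ipv4.ip_local_port_range = 32768\t60999
"""
print(find_deviations(parse_sysctl(sample)))
```

Parameters that are absent from the input are ignored, so the check also works on kernels that do not expose every parameter.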

Risky operations related to storage

Risky operation

Impact

Recovery plan

Manually detach a cloud disk in the console.

An I/O error is reported when a pod writes data.

You can restart the pod and manually clear the residual mount information on the node.

Run the umount command on the disk mount path on the node.

The pod writes data to the local disk.

You can restart the pod.

Directly operate on the cloud disk on the node.

The pod writes data to the local disk.

Unrecoverable.

Mount the same cloud disk to multiple pods.

The pod writes data to the local disk or an I/O error is reported.

You must make sure that one cloud disk is used by only one pod.

Important

Cloud disks are non-shared storage provided by the Alibaba Cloud storage team. A cloud disk can be mounted to only one pod at a time.

Manually delete the NAS mount directory.

An I/O error is reported when a pod writes data.

You can restart the pod.

Delete a NAS volume or mount target that is in use.

The pod experiences an I/O hang.

You can restart the ECS node. For more information about how to restart an ECS instance, see Restart an ECS instance.
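The rule that one cloud disk is used by only one pod can be verified against an inventory of mounts. The following is a minimal sketch, assuming the inventory has already been collected as pairs of pod name and disk ID; the function name disks_shared_by_pods and the sample pod names and disk IDs are hypothetical.

```python
from collections import defaultdict


def disks_shared_by_pods(mounts):
    """Return {disk_id: [pod names]} for disks that more than one pod uses."""
    pods_by_disk = defaultdict(set)
    for pod, disk in mounts:
        pods_by_disk[disk].add(pod)
    return {
        disk: sorted(pods)
        for disk, pods in pods_by_disk.items()
        if len(pods) > 1
    }


# Hypothetical inventory of (pod name, disk ID) pairs.
mounts = [
    ("web-0", "d-example-a"),
    ("web-1", "d-example-b"),
    ("batch-0", "d-example-a"),
]
print(disks_shared_by_pods(mounts))
```

An empty result means that no disk in the inventory is shared. Any disk that is returned should be reassigned so that only one pod uses it.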

Risky operations related to logs

Risky operation

Impact

Recovery plan

Delete the /tmp/ccs-log-collector/pos directory on the host.

Logs are repeatedly collected.

Unrecoverable. The files in this directory record the log collection position.

Delete the /tmp/ccs-log-collector/buffer directory on the host.

Logs are lost.

Unrecoverable. This directory contains cache files of logs to be consumed.

Delete the aliyunlogconfig CRD resource.

Log collection fails.

You can recreate the deleted CRD and its corresponding resources. However, logs collected during the failure period cannot be recovered.

Deleting a CRD also deletes all its instances. Even if you restore the CRD, you still need to manually create the deleted instances.

Delete the log component.

Log collection fails.

You can reinstall the log component and manually restore the aliyunlogconfig CRD instances. Logs collected during the deletion period cannot be recovered.

Deleting the log component is equivalent to deleting the aliyunlogconfig CRD and the Logtail agent. All log collection capabilities are lost during this period.