Alibaba Cloud Container Service for Kubernetes (ACK) provides managed services for the technical architecture and core components of its container services. Improper operations on unmanaged components or applications running in ACK clusters can lead to service failures. To effectively assess and mitigate operational risks, review the recommendations and precautions in this topic before using ACK.
Important notes
Data plane components
Data plane components are system components that run on your ECS instances, such as CoreDNS, Ingress, kube-proxy, Terway, and kubelet. Because these components run on your ECS instances, their stability requires joint maintenance from both Alibaba Cloud and you.
ACK provides the following support for data plane components:
Parameterized configuration management, regular feature optimization, bug fixes, CVE patches, and associated guidance documentation.
Observability features, such as monitoring and alerts. Logs for some core components are provided and delivered to you through SLS.
Configuration best practices and recommendations tailored to your cluster size.
Regular inspection and alerting features that check component versions, configurations, loads, deployment topologies, instance counts, and other relevant metrics.
Follow these recommendations when using data plane components:
Use the latest component versions. New versions often include bug fixes and new features. After a new version is released, upgrade at an appropriate time, ensuring service stability and following the upgrade instructions in the relevant documentation. For more information, see Components.
Configure contact email addresses and mobile phone numbers in the ACK alert center and configure alert notification methods. Alibaba Cloud uses these channels to send alerts and service notifications. For more information, see ACK alert management.
If you receive a stability risk report for a component, address it promptly according to the provided instructions to eliminate security risks.
Configure custom component parameters only through the ACK console or OpenAPI. Modifying component configurations through other channels can cause component malfunctions. For more information, see Manage components.
Do not use IaaS-layer OpenAPI operations to modify the runtime environment of components. This includes using ECS OpenAPI to change ECS instance states, modifying security group settings or network configurations for worker nodes, or using SLB OpenAPI to modify SLB configurations. Unauthorized changes to IaaS resources can cause data plane components to malfunction.
Some data plane components are based on upstream open-source versions and can contain bugs or vulnerabilities. Upgrade components promptly to avoid service disruptions caused by these issues.
Cluster upgrades
Always use the ACK cluster upgrade feature to upgrade your Kubernetes version. Manually upgrading Kubernetes can cause stability and compatibility issues with your ACK cluster. For detailed steps, see Upgrade clusters and independently upgrade control planes and node pools.
ACK provides the following support for cluster upgrades:
Kubernetes version upgrade features.
Pre-upgrade checks to ensure your cluster is ready for an upgrade.
Release notes for new Kubernetes versions, including changes from previous versions.
Risk notifications about potential issues due to resource changes during upgrades.
Follow these recommendations when using the cluster upgrade feature:
Run pre-upgrade checks and resolve all blocking issues before proceeding.
Review the Kubernetes release notes and evaluate upgrade risks based on your cluster and workload status. For more information, see [Deprecated] Kubernetes version release overview.
Because cluster upgrades cannot be rolled back, create a thorough upgrade plan and perform backups beforehand.
Upgrade your cluster within the current version's support period according to ACK's version support policy. For more information, see Version Guide.
Kubernetes native configurations
Do not modify critical Kubernetes configurations, including the paths, links, or contents of the following directories:
/var/lib/kubelet
/var/lib/docker
/etc/kubernetes
/etc/kubeadm
/var/lib/containerd
Do not use Kubernetes-reserved annotations in your YAML templates. Doing so can cause resource unavailability, creation failures, or abnormal behavior. Annotations starting with `kubernetes.io/` or `k8s.io/` are reserved for core components. For example: `pv.kubernetes.io/bind-completed: "yes"`.
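As an illustration, custom annotations are safe as long as they live under your own prefix; the `example.com/` prefix below is a hypothetical placeholder for your own domain, while the commented-out line shows a reserved annotation that must never be set manually.

```yaml
# A PersistentVolumeClaim with a custom annotation under a hypothetical
# user-owned prefix (example.com/). Reserved prefixes such as
# kubernetes.io/ and k8s.io/ belong to core components.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
  annotations:
    example.com/owner: "team-a"                  # custom annotation: fine
    # pv.kubernetes.io/bind-completed: "yes"     # reserved: never set manually
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```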
ACK serverless clusters
ACK Serverless clusters do not provide compensation in the following scenario:
To simplify cluster operations, ACK Serverless clusters manage some system components when management of cluster components is enabled. If your business is affected by the unintentional deletion of Kubernetes resources that these managed components depend on, no compensation is provided.
Registered clusters
When registering external Kubernetes clusters through the ACK console, ensure stable network connectivity between the external cluster and Alibaba Cloud.
ACK enables registration of external Kubernetes clusters but cannot control their stability or prevent improper operations. Exercise caution when configuring labels, annotations, or tags on external cluster nodes through registered clusters, as this can cause application failures.
App Catalog
To enrich Kubernetes applications, ACK Marketplace provides an App Catalog featuring applications adapted and customized from open-source software. ACK cannot control defects originating from the open-source software itself. Be aware of this risk. For more information, see Marketplace.
High-risk operations
Certain operations in ACK can significantly impact service stability. Understand the following high-risk operations and their effects before using these features.
Cluster-related high-risk operations
| Category | High-risk operation | Impact | Recovery solution |
| --- | --- | --- | --- |
| API Server | Reuse the CLB used by the API Server for other purposes, such as using a LoadBalancer-type Service with the same CLB. | Cluster becomes unavailable, affecting service traffic. | Restore the original configuration or contact customer support. |
| API Server | Modify CLB configurations that control forwarding, such as listeners, server groups, ACLs, or CLB tags used by the API Server. | Cluster malfunctions. | Restore the original configuration. |
| API Server | Delete the CLB used by the API Server. | Cluster becomes inoperable. | Irreversible. Recreate the cluster. For steps, see Create an ACK managed cluster. |
| Worker nodes | Modify the security group of cluster nodes. | Nodes can become unavailable. | Add the nodes back to the automatically created node security group. For more information, see Associate a security group with an instance (primary NIC). |
| Worker nodes | Node expiration or deletion. | The node becomes unavailable. | Irreversible. |
| Worker nodes | Reinstall the operating system. | Components on the node are deleted. | Remove and re-add the node to the cluster. For steps, see Remove a node and Add existing nodes. |
| Worker nodes | Manually upgrade node component versions. | Nodes can become unusable. | Roll back to the original version. |
| Worker nodes | Change the node IP address. | The node becomes unavailable. | Restore the original IP address. |
| Worker nodes | Manually modify parameters of core components (such as kubelet, Docker, or containerd). | Nodes can become unavailable. | Use the recommended configuration parameters from the official documentation. |
| Worker nodes | Modify operating system configurations. | Nodes can become unavailable. | Attempt to restore the configuration, or delete and recreate the node. |
| Worker nodes | Modify node time. | Components on the node can malfunction. | Restore the original node time. |
| Worker nodes | Add node computing resources to the cluster using unsupported methods. | ACK supports adding node computing resources through the console, OpenAPI, or CLI (see Add existing nodes). Nodes added through other methods cannot be recognized by ACK, so lifecycle management, automated O&M, and technical support are unavailable. For details, see Why does the console show the node pool source as "Other nodes"?. | Manage computing resources through node pools. If you continue using unsupported methods, ensure compatibility between the node and cluster components (such as Kubernetes components, networking, storage, and security). |
| Master nodes (ACK dedicated clusters) | Modify the security group of cluster nodes. | Master nodes can become unavailable. | Add the nodes back to the automatically created node security group. For more information, see Associate a security group with an instance (primary NIC). |
| Master nodes (ACK dedicated clusters) | Node expiration or deletion. | The master node becomes unavailable. | Irreversible. |
| Master nodes (ACK dedicated clusters) | Reinstall the operating system. | Components on the master node are deleted. | Irreversible. |
| Master nodes (ACK dedicated clusters) | Manually upgrade master or etcd component versions. | The cluster can become unusable. | Roll back to the original version. |
| Master nodes (ACK dedicated clusters) | Delete or format core directories such as /etc/kubernetes on the node. | The master node becomes unavailable. | Irreversible. |
| Master nodes (ACK dedicated clusters) | Change the node IP address. | The master node becomes unavailable. | Restore the original IP address. |
| Master nodes (ACK dedicated clusters) | Manually modify parameters of core components (such as etcd, kube-apiserver, or Docker). | Master nodes can become unavailable. | Use the recommended configuration parameters from the official documentation. |
| Master nodes (ACK dedicated clusters) | Manually replace master or etcd certificates. | The cluster can become unusable. | Irreversible. |
| Master nodes (ACK dedicated clusters) | Manually add or remove master nodes. | The cluster can become unusable. | Irreversible. |
| Master nodes (ACK dedicated clusters) | Modify node time. | Components on the node can malfunction. | Restore the original node time. |
| Other | Modify permissions or configurations through RAM. | Cluster resources, such as SLB instances, can fail to be created. | Restore the original permissions. |
| Other | Modify or delete preset PodSecurityPolicy resources in the cluster, including the preset PodSecurityPolicy. Note: Applies only to clusters earlier than version 1.26. | Core cluster components can malfunction. Pod creation and updates can fail. | Restore the related resources. For steps, see Configure or restore the default ACK Pod security policy. |
Node pool-related high-risk operations
| High-risk operation | Impact | Recovery solution |
| --- | --- | --- |
| Delete a scaling group. | Node pool malfunctions. | Irreversible. Recreate the node pool. For steps, see Create a node pool. |
| Remove a node using kubectl. | Node count displayed in the node pool does not match the actual count. | Remove the node through the ACK console or node pool APIs (see Remove a node), or adjust the desired node count to scale in (see Create and manage node pools). |
| Directly release an ECS instance. | The node pool details page can display incorrectly. For node pools with a desired node count enabled, the system automatically scales out to maintain the desired count. | Irreversible. Scale in correctly by adjusting the desired node count through the ACK console or node pool APIs (see Create and manage node pools), or remove specific nodes (see Remove a node). |
| Manually scale out or in a node pool with auto scaling enabled. | The auto scaling component adjusts the node count based on policies, leading to unexpected results. | Irreversible. Do not manually intervene in auto scaling node pools. |
| Modify the maximum or minimum instance count of an ESS scaling group. | Scaling can malfunction. | |
| Add existing nodes without backing up data. | Data on the instance is lost. | Irreversible. |
| Store important data on the node's system disk. | Node self-healing operations can reset node configurations, causing data loss on the system disk. | Irreversible. Store important data on additional data disks, cloud disks, NAS, or OSS. |
Virtual node-related high-risk operations
| High-risk operation | Impact | Recovery solution |
| --- | --- | --- |
| Uninstall the virtual node component. | Serverless Pod management fails: existing ECI and ACS pods cannot be deleted, and new ones cannot be created. | |
Network and Server Load Balancer-related high-risk operations
| High-risk operation | Impact | Recovery solution |
| --- | --- | --- |
| Set the kernel parameter to an unsupported value. | Network connectivity fails. | Set the kernel parameter back to its original value. |
| Modify kernel parameters. | Network connectivity fails. | Restore the recommended kernel parameter values. |
| Set the kernel parameter to an unsupported value. | Pod health checks fail. | Set the kernel parameter back to its original value. |
| Set the kernel parameter to an unsupported value. | NAT malfunctions. | Set the kernel parameter back to its original value. |
| Modify the kernel parameter. | Intermittent network connectivity issues occur. | Restore the kernel parameter to its default value. |
| Install firewall software such as Firewalld or ufw. | Container networking fails. | Uninstall the firewall software and restart the node. |
| Block UDP port 53 for the container CIDR in the node security group. | Cluster DNS fails. | Configure the security group according to official recommendations. |
| Modify or delete tags added by ACK to an SLB instance. | SLB malfunctions. | Restore the SLB tags. |
| Modify configurations of ACK-managed SLB instances through the SLB console, including SLB instances, listeners, or vServer groups. | SLB malfunctions. | Restore the SLB configuration. |
| Remove the annotation for reusing an existing SLB from a Service. | SLB malfunctions. | Add the annotation for reusing an existing SLB back to the Service. Note: A Service that reuses an existing SLB cannot be directly changed to use an automatically created SLB. You must recreate the Service. |
| Delete an ACK-created SLB through the SLB console. | Cluster networking can fail. | Delete the SLB by deleting the associated Service. For steps, see Delete a Service. |
| Manually delete the Service used by the Ingress controller. | Ingress Controller malfunctions or crashes. | Create a new Service with the same name and the original configuration. |
| Add or modify the DNS server configuration. | If the configured DNS server is misconfigured, DNS resolution can fail, affecting cluster operations. | To use a self-managed DNS server as an upstream server, configure it in CoreDNS instead. For steps, see Unmanaged CoreDNS configuration. |
| Modify or delete elastic network interfaces (ENIs) or Lingjun ENIs created by ACK. | Pod networking fails. | Irreversible. |
| Modify or delete network-related CRDs. | Terway fails. Severe cases can cause network or Pod failures. | Irreversible. |
| Create, modify, or delete network-related system CRs. | Terway fails. Severe cases can cause network or Pod failures. | Delete custom CR definitions and recreate associated Pods. |
| Modify fields in Terway network configurations that are not allowed to be changed. For configurable parameters, see Custom Terway configuration parameters. | Terway fails. Severe cases can cause network or Pod failures. | Restore the original configuration and restart the node. |
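For reference, reusing an existing SLB for a Service is controlled by annotations on the Service itself, which is why removing them is listed as high-risk above. The sketch below uses annotation names from the public cloud-controller-manager documentation; verify them against your cluster version, and note that the CLB instance ID is a placeholder.

```yaml
# Sketch of a Service that reuses an existing CLB instead of letting ACK
# create one. Annotation names per the cloud-controller-manager docs;
# verify against your cluster version. lb-xxxxxxxxxxxx is a placeholder ID.
apiVersion: v1
kind: Service
metadata:
  name: demo-svc
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: "lb-xxxxxxxxxxxx"
    # Keep "false" unless you intend ACK to overwrite existing listeners:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-force-override-listeners: "false"
spec:
  type: LoadBalancer
  selector:
    app: demo
  ports:
    - port: 80
      targetPort: 8080
```

Removing the `loadbalancer-id` annotation from a live Service is exactly the malfunction case described in the table; if it happens, add the annotation back rather than letting ACK provision a new SLB.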
Storage-related high-risk operations
| High-risk operation | Impact | Recovery solution |
| --- | --- | --- |
| Manually detach a cloud disk through the console. | Pods report IO errors on writes. | Restart the Pod and manually clean up residual mounts on the node. |
| Run umount on the disk mount path on the node. | Pods write to the local disk. | Restart the Pod. |
| Directly operate cloud disks on the node. | Pods write to the local disk. | Irreversible. |
| Mount the same cloud disk to multiple Pods. | Pods write to the local disk or report IO errors. | Ensure each cloud disk is used by only one Pod. Important: Cloud disks are non-shared storage provided by Alibaba Cloud Storage and can be mounted to only one Pod at a time. |
| Manually delete the NAS mount directory. | Pods report IO errors on writes. | Restart the Pod. |
| Delete a NAS file system or mount target that is in use. | Pods experience IO hangs. | Restart the ECS node. For steps, see Restart an ECS instance. |
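The one-disk-one-Pod rule above follows from cloud disks being non-shared block storage: a disk attaches to a single node and should back a single Pod. A minimal sketch of a claim that respects this; the StorageClass name `alicloud-disk-essd` is illustrative, so check the classes actually present in your cluster with `kubectl get storageclass`.

```yaml
# PVC for a cloud disk. ReadWriteOnce reflects that the disk can be
# attached to one node and used by one Pod at a time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: alicloud-disk-essd   # illustrative; verify with `kubectl get sc`
  resources:
    requests:
      storage: 20Gi
```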
Log-related high-risk operations
| High-risk operation | Impact | Recovery solution |
| --- | --- | --- |
| Delete the /tmp/ccs-log-collector/pos directory on the host. | Logs are collected repeatedly. | Irreversible. This directory records log collection positions. |
| Delete the /tmp/ccs-log-collector/buffer directory on the host. | Logs are lost. | Irreversible. This directory stores cached logs awaiting processing. |
| Delete aliyunlogconfig CRD resources. | Log collection stops. | Recreate the deleted CRD and its resources; logs generated during the outage cannot be recovered. Deleting the CRD also deletes all associated instances, so even after restoring the CRD you must manually recreate the deleted instances. |
| Delete log components. | Log collection stops. | Reinstall the log components and manually restore aliyunlogconfig CRD instances. Logs generated during the outage cannot be recovered. Deleting log components removes both the aliyunlogconfig CRD and the Logtail collector, disabling all log collection during that period. |
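Because deleting the aliyunlogconfig CRD also deletes every instance, it helps to keep your collection configurations in version control so they can be re-applied after recovery. A sketch of one such CR follows; the field layout is based on the public Logtail documentation and should be verified against the CRD installed in your cluster, and the metadata and Logstore names are placeholders.

```yaml
# Sketch of an AliyunLogConfig CR that collects container stdout/stderr.
# Field names per the public Logtail docs; verify against your installed CRD.
# demo-stdout and demo-logstore are placeholders.
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: demo-stdout
  namespace: kube-system
spec:
  logstore: demo-logstore
  logtailConfig:
    inputType: plugin
    configName: demo-stdout
    inputDetail:
      plugin:
        inputs:
          - type: service_docker_stdout
            detail:
              Stdout: true
              Stderr: true
```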