This topic covers common problems and solutions for edge nodes in ACK Edge clusters.
How does ACK Edge distinguish between cloud nodes and edge nodes?
ACK Edge uses the alibabacloud.com/is-edge-worker label to identify node type. When a node joins a cloud node pool or an edge node pool, the is-edge-worker label is added automatically:
- is-edge-worker=true: edge node
- is-edge-worker=false: cloud node
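The label can be inspected with standard kubectl label options. The following is a minimal sketch: the helper name node_type_from_label is hypothetical, and the kubectl commands in the comments assume you have kubectl access to the cluster.

```shell
# Map the value of the alibabacloud.com/is-edge-worker label to a node type.
# node_type_from_label is a hypothetical helper name, not part of ACK Edge.
node_type_from_label() {
  case "$1" in
    true)  echo "edge" ;;
    false) echo "cloud" ;;
    *)     echo "unknown" ;;   # label missing or set to an unexpected value
  esac
}

# Show every node with the label value as an extra column:
#   kubectl get nodes -L alibabacloud.com/is-edge-worker
# List only edge nodes:
#   kubectl get nodes -l alibabacloud.com/is-edge-worker=true
```

For example, node_type_from_label true prints edge.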
Add an edge node over Express Connect
When adding edge nodes in an Express Connect environment, note the following requirements. For details, see Special configuration instructions for ACK Edge clusters in Express Connect scenarios.
- Select Dedicated as the edge node pool type, then generate a node connection script by following Add an edge node. For details about dedicated edge node pools, see Create and manage edge node pools.
  Note: For ACK Edge clusters of version 1.22 or later, you cannot connect over Express Connect by configuring the inDedicatedNetwork parameter in the node connection script. Upgrade clusters earlier than version 1.22.
- The edge node uses private addresses to communicate with cloud services. Make sure the node can access Object Storage Service (OSS), Container Registry (ACR), and Server Load Balancer (SLB).
Add a GPU node
Before adding the node, install the GPU driver. For supported driver versions, see List of NVIDIA driver versions supported by ACK.
When generating the node connection script, configure the gpuVersion parameter. The following GPU models are supported:
| System architecture | GPU model | Minimum cluster version |
|---|---|---|
| AMD64/x86_64 | Nvidia_Tesla_T4 | ≥1.16.9-aliyunedge.1 |
| AMD64/x86_64 | Nvidia_Tesla_P4 | ≥1.16.9-aliyunedge.1 |
| AMD64/x86_64 | Nvidia_Tesla_P100 | ≥1.16.9-aliyunedge.1 |
| AMD64/x86_64 | Nvidia_Tesla_V100 | ≥1.18.8-aliyunedge.1 |
| AMD64/x86_64 | Nvidia_Tesla_A10 | ≥1.20.11-aliyunedge.1 |
| AMD64/x86_64 | Nvidia_L40 | ≥1.26.3-aliyun.1 |
After you configure gpuVersion, the tool automatically installs nvidia-containerd-runtime. For details, see NVIDIA Container Runtime.
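The table above can be turned into a pre-check before you generate the script. This is a sketch under stated assumptions: the function names are hypothetical, only the numeric x.y.z part of the cluster version is compared (the -aliyunedge.N suffix is ignored), and sort -V must be available on the machine where you run it.

```shell
# Minimum cluster version (numeric part only) for each gpuVersion value in the table.
min_version_for_gpu() {
  case "$1" in
    Nvidia_Tesla_T4|Nvidia_Tesla_P4|Nvidia_Tesla_P100) echo "1.16.9" ;;
    Nvidia_Tesla_V100) echo "1.18.8" ;;
    Nvidia_Tesla_A10)  echo "1.20.11" ;;
    Nvidia_L40)        echo "1.26.3" ;;
    *) return 1 ;;     # model not listed in the table
  esac
}

# version_ge A B: succeeds when version A is greater than or equal to version B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# gpu_supported MODEL CLUSTER_VERSION
# Returns 0 if supported, 1 if the cluster is too old, 2 if the model is unknown.
gpu_supported() {
  model=$1
  cluster=${2%%-*}     # drop the suffix after the first hyphen, e.g. -aliyunedge.1
  min=$(min_version_for_gpu "$model") || return 2
  version_ge "$cluster" "$min"
}
```

For example, gpu_supported Nvidia_Tesla_A10 1.22.15-aliyunedge.1 succeeds, while gpu_supported Nvidia_L40 1.24.6-aliyunedge.1 fails because the table requires at least 1.26.3.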
Troubleshoot node connection script failures
If an error occurs while running the node connection script, use the following table to identify the cause and solution. If your issue is not listed, collect node diagnostics information and submit a ticket. For how to collect diagnostics, see Collect node diagnostics.
| Error message | Cause | Solution |
|---|---|---|
| The os XXX unsupport | The edge node's operating system is not supported. | Check the list of supported operating systems in Add an edge node. |
| invalid nodeName | The node name is invalid. | The node name must use only lowercase letters, hyphens (-), and periods (.); must not exceed 253 characters; and must not start with localhost. |
| Node route overlaps with service cidr | The node's routing table conflicts with the Pod CIDR block or Service CIDR block configured at cluster creation. | Recreate the cluster. Make sure the Pod CIDR and Service CIDR blocks do not conflict with the NameServer address or routing table of the edge node. |
| response error msg: TOKEN_EXPIRED | The access token has expired. | Regenerate the node connection script. Also check that the node's system time is correct. |
| A node named XXX is already exist in the cluster | A node with the same name already exists in the cluster. | Remove the existing node with the same name from the cluster. |
| dial tcp xx.xxx.xx.xx:6443: i/o timeout | edgeadm cannot reach the API server to get cluster information. | Check whether the access control list (ACL) rules of the Server Load Balancer (SLB) for the API server restrict access from the edge node's address. |
| error run phase join-node: Install edge-hub failed...text file busy | Installation of the edge-hub binary failed because the file already exists on the node. | Run edgeadm reset to clean up the node, then connect it again. |
| error run phase post-check: timed out waiting for the condition | System components failed to start. | 1. Download the latest edgeadm, run edgeadm reset, then reconnect. 2. Check whether the node can access the required public addresses (see Network management). 3. If the problem persists, collect node diagnostics and submit a ticket. |
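The naming rules behind the invalid nodeName error can be checked locally before you run the connection script. This is a sketch, not the check that edgeadm itself performs. The function name valid_node_name is hypothetical, and digits are accepted here as an assumption based on Kubernetes DNS subdomain naming, because the table lists only lowercase letters, hyphens, and periods.

```shell
# valid_node_name NAME: succeeds if NAME follows the rules in the table.
# Assumption: digits are also allowed (Kubernetes DNS subdomain naming).
valid_node_name() {
  name=$1
  [ -n "$name" ] || return 1                 # empty names are rejected
  [ "${#name}" -le 253 ] || return 1         # at most 253 characters
  case "$name" in
    localhost*) return 1 ;;                  # must not start with localhost
    *[!abcdefghijklmnopqrstuvwxyz0123456789.-]*) return 1 ;;  # disallowed character
  esac
  return 0
}
```

For example, valid_node_name edge-node.a succeeds, while valid_node_name Edge-Node and valid_node_name localhost-a both fail.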
Troubleshoot edge node upgrade failures
When upgrading an edge node pool, if you do not receive the message This node has been upgraded successfully, use the following table to diagnose the issue.
| Error message | Cause | Solution |
|---|---|---|
| edgeadm version xxxx does not match cluster version | The upgrade tool version does not match the cluster version. | Check whether the cluster control plane has been upgraded. Verify that TARGET_CLUSTER_VERSION is set correctly. |
| node has already been upgraded to xxx | The node is already at the target version. | If some components were not upgraded, save the logs and submit a ticket. |
| kubelet target version xxxx does not match cluster version xxxx | The specified kubelet upgrade version does not match the control plane version. | If you specified kubelet-version, verify that its value matches the control plane version. Otherwise, submit a ticket. |
| Parameter currentVersion cann't null | An outdated version of edgeadm is used. | Use the latest version of edgeadm. Supported upgrade paths are 1.18 → 1.20 and 1.20 → 1.22. |
| upgrade kubelet failed at phase install, recover to previous state. error run phase upgrade: xxxx | The upgrade failed and was automatically rolled back. The node is unaffected. | Save the logs and submit a ticket. |
| upgrade kubelet failed at phase install, recover to previous state; recover kubelet failed, err: xxx; error run phase upgrade: xxxx | The upgrade failed and the automatic rollback also failed. The node status is affected. | Save the logs and submit a ticket. |
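Before starting an upgrade, the current and target versions can be compared against the upgrade paths named in the table. This is a rough sketch with hypothetical function names. It encodes only the two paths listed above (1.18 → 1.20 and 1.20 → 1.22) and assumes version strings without a leading v.

```shell
# Reduce a version such as 1.20.11-aliyunedge.1 to its major.minor part (1.20).
minor_of() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# supported_upgrade_path CURRENT TARGET
# Succeeds only for the two upgrade paths listed in the table above.
supported_upgrade_path() {
  case "$(minor_of "$1")->$(minor_of "$2")" in
    "1.18->1.20"|"1.20->1.22") return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, supported_upgrade_path 1.18.8-aliyunedge.1 1.20.11-aliyunedge.1 succeeds, while a direct jump from 1.18 to 1.22 fails.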
Collect node diagnostics
If a node in an ACK Edge cluster is abnormal, follow these steps to collect its diagnostics information.
- Log on to the abnormal node.
- Download the diagnostics script:
    curl -o /usr/local/bin/diagnose_edge_node.sh https://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_k8s.sh
- Grant execute permission:
    chmod u+x /usr/local/bin/diagnose_edge_node.sh
- Change to the script directory:
    cd /usr/local/bin/
- Run the diagnostics script:
    ./diagnose_edge_node.sh
  The script generates a diagnostics archive with a unique name. The output is similar to:
    ......
    + echo 'please get diagnose_1578310147.tar.gz for diagnostics'
    please get diagnose_1578310147.tar.gz for diagnostics
    + echo 'Submit diagnose_1578310147.tar.gz to technical support'
    Submit diagnose_1578310147.tar.gz to technical support
- Run ll to verify that the diagnostics file (for example, diagnose_1578310147.tar.gz) was created.
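Because the archive name changes on every run, a small helper can locate the newest one instead of scanning the listing by eye. This is a sketch: the function name latest_diagnose_archive is hypothetical, and it assumes the archive is written to the directory you pass in (the current directory by default) with the diagnose_*.tar.gz naming shown in the sample output.

```shell
# latest_diagnose_archive [DIR]
# Prints the most recently modified diagnose_*.tar.gz in DIR (default: current directory).
# Returns non-zero if no archive is found.
latest_diagnose_archive() {
  dir=${1:-.}
  # ls -t sorts newest first; errors are silenced when nothing matches the pattern.
  archive=$(ls -t "$dir"/diagnose_*.tar.gz 2>/dev/null | head -n 1)
  [ -n "$archive" ] && printf '%s\n' "$archive"
}
```

For example, after the script finishes, latest_diagnose_archive /usr/local/bin prints the path of the archive to attach to the ticket, provided the script wrote it there.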