By Mingquan Zheng, Kai Yu
A pod had been running normally for some time after it was created. However, one day we observed that no new connections could be established. In the business scenario, pod A accessed service B by its service name. When we ran the wget command manually inside the pod, it failed with the error "Address not available". Why did this error occur?
The rough example is depicted below:
I did extensive research into what causes the "Address not available" error and which address, exactly, is unavailable. Although the POSIX (Portable Operating System Interface for UNIX) error standard defines the error code, the definition alone does not explain it.
Reference of error code: errno(3)
EADDRNOTAVAIL Address not available (POSIX.1-2001).
Running netstat -an, I checked the connections to the service address: the number of connections in the ESTABLISHED state had reached the upper bound of available random (ephemeral) ports, so no new connection could be created.
The problem was finally resolved by widening the random port range through the kernel parameter net.ipv4.ip_local_port_range.
We know that the default range of random ports defined by the Linux kernel is 32768 to 60999. This range is often overlooked in business design. As we all know, each TCP connection is identified by a four-tuple: source IP, source port, destination IP, and destination port. If any element of the tuple changes, a new TCP connection can be created. When a pod accesses a fixed destination IP and port, the only variable in each TCP connection is the source port, which is randomly assigned. Therefore, when a large number of long-lived connections must be established, it is necessary to either widen the kernel's random port range or adjust the service design.
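The point above can be made concrete with a small sketch (Linux only) that reads the current ephemeral port range of the network namespace; its size is the hard ceiling on concurrent connections from one source IP to a single fixed (destination IP, destination port) pair, because the source port is the only element of the four-tuple left to vary:

```python
# Read this namespace's ephemeral port range from procfs (Linux only).
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    lo, hi = map(int, f.read().split())

# One connection per source port when the other three tuple elements are fixed.
print(f"ephemeral range {lo}-{hi}: at most {hi - lo + 1} "
      "concurrent connections to one fixed destination")
```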
Related kernel reference: ip-sysctl.txt
ip_local_port_range - 2 INTEGERS
    Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first, the second the last local port number. If possible, it is better these numbers have different parity (one even and one odd value). Must be greater than or equal to ip_unprivileged_port_start. The default values are 32768 and 60999 respectively.
Let’s manually narrow the port range of net.ipv4.ip_local_port_range, and then reproduce it.
For the same problem, I tried the curl, nc, and wget commands respectively, and they reported different errors, which confused me.
curl: (7) Couldn't connect to server
nc: bind: Address in use
wget: can't connect to remote host (126.96.36.199): Address not available
Let's analyze the process with the strace command to trace the system calls each tool makes.
The wget, curl, and nc commands all create a socket with socket(). But I found that wget and curl then call the connect() function directly. In contrast, nc first calls the bind() function, and if bind() reports an error, it never calls connect().
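The two syscall patterns seen under strace can be sketched as follows (this is an illustration of the patterns, not the tools' actual source code): with the connect-only style, the kernel picks the source port inside connect(), so port exhaustion surfaces there as EADDRNOTAVAIL; with the bind-first style, exhaustion surfaces earlier, in bind(), as EADDRINUSE.

```python
import socket

def connect_style(dst):
    """wget/curl pattern: socket() -> connect(); implicit bind inside connect()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(dst)                      # kernel picks the source port here
    return s

def bind_then_connect_style(dst):
    """nc pattern: socket() -> bind() -> connect(); explicit bind first."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("0.0.0.0", 0))              # source port is chosen at bind() time
    s.connect(dst)
    return s

# Demo against a throwaway local listener.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(2)
dst = srv.getsockname()

for style in (connect_style, bind_then_connect_style):
    c = style(dst)
    print(style.__name__, "source port:", c.getsockname()[1])
    c.close()
srv.close()
```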
As shown in the figure, in the client/server model, the client calls connect() after it creates a socket.
Why do wget and curl report different errors when they call the same connect() function?
Why do the connect() function and the bind() function report different errors?
EADDRINUSE Address already in use (POSIX.1-2001).
EADDRNOTAVAIL Address not available (POSIX.1-2001).
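Both errno values from the POSIX definitions above can be reproduced directly at the socket level (a minimal sketch; 192.0.2.1 is a TEST-NET documentation address, assumed not to be configured on the host):

```python
import errno
import socket

# EADDRINUSE: bind() to a local port another socket already holds.
a = socket.socket()
a.bind(("127.0.0.1", 0))                 # kernel picks a free port
busy_port = a.getsockname()[1]
b = socket.socket()
try:
    b.bind(("127.0.0.1", busy_port))
except OSError as e:
    print(e.errno == errno.EADDRINUSE)   # True

# EADDRNOTAVAIL: bind() to an IP address not configured on this host
# (assumes net.ipv4.ip_nonlocal_bind is 0, the kernel default).
c = socket.socket()
try:
    c.bind(("192.0.2.1", 0))
except OSError as e:
    print(e.errno == errno.EADDRNOTAVAIL)
```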
If we take a machine running CentOS 7.9, install curl, wget, nc, and other tools, and narrow the port range, we encounter the error message "Cannot assign requested address". From this we can see that in some images (alpine and busybox), the same command reports different errors under the same conditions. This is because these images bundle basic commands from the busybox toolbox to reduce the size of the whole image, which can make problems harder to locate (wget and nc are busybox applets; for more information, see the busybox document: Busybox Command Help).
Definitions related to error codes in the Linux system: /usr/include/asm-generic/errno.h
#define EADDRNOTAVAIL 99 /* Cannot assign requested address */
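The mapping in errno.h can be checked from any language; for example, Python exposes the same constant and message text:

```python
import errno
import os

# On Linux, the POSIX name EADDRNOTAVAIL maps to errno 99,
# and strerror() returns the text seen in the tools' error messages.
print(errno.EADDRNOTAVAIL)                # 99 on Linux
print(os.strerror(errno.EADDRNOTAVAIL))   # e.g. "Cannot assign requested address"
```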
Theoretically, ports 0 to 65535 can be used. Ports 0 to 1023 are privileged and available only to privileged users; they are reserved for standard services such as HTTP (80) and SSH (22). To prevent unauthorized users from attacking based on traffic characteristics, it is suggested that, if the port range needs to be larger, the random port range be limited to 1024 to 65535.
According to the information provided by the Kubernetes community, we can modify the source port range through securityContext sysctls, or by running initContainers in privileged mode and remounting /proc/sys read-write (mount -o remount,rw /proc/sys). Either modification takes effect only in the network namespace of the pod.
...
securityContext:
  sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "1024 65535"
initContainers:
  - command:
      - /bin/sh
      - '-c'
      - |
        sysctl -w net.core.somaxconn=65535
        sysctl -w net.ipv4.ip_local_port_range="1024 65535"
    securityContext:
      privileged: true
...
It is not recommended to modify net.ipv4.ip_local_port_range this way in clusters running version 1.22 or later, because it can conflict with ServiceNodePortRange.
The default ServiceNodePortRange of Kubernetes is 30000 to 32767. Kubernetes 1.22 and later versions removed the logic by which kube-proxy holds a listening socket on each NodePort. When such a listener exists, applications avoid the listened-on ports when selecting a random port. If net.ipv4.ip_local_port_range and ServiceNodePortRange overlap, then, with the listeners gone, an application may pick a source port in the overlapping part, such as the range 30000 to 32767. When a random source port collides with a NodePort via the kernel parameter net.ipv4.ip_local_port_range, it may lead to occasional TCP connection failures, health check failures, abnormal service access, and so on. For more information, see the Kubernetes Community PR.
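The collision condition described above is a simple range overlap, which can be checked with plain arithmetic (this is an illustration, not a Kubernetes API call):

```python
# Kubernetes default ServiceNodePortRange.
NODEPORT_LO, NODEPORT_HI = 30000, 32767

def overlaps(lo, hi):
    """True if the ephemeral range [lo, hi] intersects the NodePort range."""
    return max(lo, NODEPORT_LO) <= min(hi, NODEPORT_HI)

print(overlaps(32768, 60999))   # False: the kernel default range is safe
print(overlaps(1024, 65535))    # True: the widened range can collide
```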
After this change, when a large number of Services are created, kube-proxy only submits ipvs/iptables rules instead of opening a listener for every NodePort. This optimization improves connection performance and also addresses issues such as a large number of CLOSE_WAIT states occupying TCP connections in certain scenarios. It's worth noting that the PortOpener logic was removed after version 1.22.
Line 1304 in f98f27b:
    proxier.openPort(lp, replacementPortsMap)
How exactly do these conflicts occur? The testing environment is Kubernetes 1.22.10 with kube-proxy in IPVS mode. Take the kubelet health check as an example. The node's kernel parameter net.ipv4.ip_local_port_range is adjusted to 1024 to 65535.
To analyze the issue, we run tcpdump to capture packets and stop capturing when a health check failure event is detected. Here is what we observe:
The kubelet uses the node IP address (192.168.66.27) and the random source port 32582 to initiate a TCP handshake with the pod IP address (192.168.66.65) on port 80. When the pod returns a SYN-ACK to the kubelet, the destination port is 32582, and the pod keeps retransmitting it. Because this random port happens to be the NodePort of a Service, the SYN-ACK is intercepted by IPVS first and forwarded to the backend of that rule. That backend (192.168.66.9) never initiated a TCP connection with the pod IP (192.168.66.65), so it simply discards the packet. As a result, the kubelet never receives the SYN-ACK response, the TCP connection cannot be established, and the health check fails.
From this capture, we can conclude that the kubelet initiates the TCP handshake, but the pod's SYN-ACK response is retransmitted repeatedly. In fact, the SYN-ACK is delivered to the backend pod of the Service whose NodePort is 32582, which directly discards it.
To solve this problem, a check on hostNetwork can be added. When initContainers are used to modify kernel parameters, net.ipv4.ip_local_port_range is modified only if the podIP and hostIP are not equal, so as to avoid accidentally modifying the kernel parameters of the node.
initContainers:
  - command:
      - /bin/sh
      - '-c'
      - |
        if [ "$POD_IP" != "$HOST_IP" ]; then
          mount -o remount,rw /proc/sys
          sysctl -w net.ipv4.ip_local_port_range="1024 65535"
        fi
    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: status.podIP
      - name: HOST_IP
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: status.hostIP
    securityContext:
      privileged: true
...
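After the pod starts, the effect of the init container can be verified from inside the pod's network namespace with a quick sanity check (this is a check to run inside the pod, not part of the manifest; on an unmodified host it will show the kernel default instead):

```python
# Read the effective ephemeral port range of the current network namespace.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    lo, hi = f.read().split()

print(lo, hi)   # "1024 65535" is expected after the remount and sysctl -w
```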
In Kubernetes, the APIServer provides the ServiceNodePortRange parameter (command line flag --service-node-port-range), which restricts the port range from which NodePorts are allocated for Services of type NodePort or LoadBalancer. By default, this parameter is set to 30000 to 32767. In an ACK Pro cluster, you can customize the control plane component parameters to modify the port range. For more information, see Customize the Parameters of Control Plane Components in ACK Pro Clusters.
References:
- Busybox Command Help
- Kubernetes Community PR
- Customize the Parameters of Control Plane Components in ACK Pro Clusters