×
Community Blog Windows Networking Troubleshooting 3: Network Bugs triggered

Windows Networking Troubleshooting 3: Network Bugs triggered

In this article, we will be troubleshooting network bugs caused by using Alibaba Cloud's CloudMonitor.

Cause

Recently, an interesting bug has emerged. Some users report that their servers have been connecting to Alibaba Cloud services, occupying all ports and causing network interruptions, service connection failures, and making the local Telnet port unreachable.

From the perspective of product design, this situation is generally not possible, but the Netstat results of the users do show that most of the connections are established with Alibaba Cloud servers.

1

The servers are all non-Alibaba Cloud instances, so it is not possible to log on the servers directly for troubleshooting. Therefore, a troubleshooting solution is prepared.

Troubleshooting + Analysis

1.  The bug causing the local Telnet port to be unreachable is generally caused by the exhaustion of dynamic ports. By default, the range of ports used by systems after Windows Server 2008 is 49152 - 65536, a total of 16384 ports. In general, they can not be used up. If this situation occurs, it is often accompanied by a port leak.

netsh int ipv4 show dyn tcp
Protocol tcp Dynamic Port Range
Start Port: 49152
Number of Ports: 16384

One way to verify this is to open more dynamic ports through this command: netsh int ipv4 set dynamicport tcp start=10000 num=50000.

The netsh command directly calls the TCPIP method for operating Windows Kernel through the NSI interface to modify the system configuration, so the command takes effect immediately.

2.  To confirm whether the port is leaking, it is generally necessary to capture the dump to analyze the cause of the port leak. However, we have actually dealt with a lot of these types of bug, most of which are caused by third-party drivers. Therefore, it is recommended to check whether any third-party software is installed first. If so, try uninstalling the software, and then see if the problem is solved.

In this case, the user reported that no third-party software was installed.

3.  Considering that most of the dynamic ports in the system are in the TIME_WAIT status, you can try to shorten the TIME_WAIT release time through the TcpTimedWaitDelay registry.

https://docs.microsoft.com/en-us/biztalk/technical-guides/settings-that-can-be-modified-to-improve-network-performance

Key: HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters
Value: TcpTimedWaitDelay
Data Type: REG_DWORD
Range: 30-300 (decimal)
Default value: 0x78 (120 decimal)
Recommended value: 30
The registry configuration takes effect only after a restart.

4.  Users have always had doubts about the IP addresses of Alibaba Cloud servers accessed by their servers. However, most of them are accessed through port 443, so it is hard to prove exactly which application generated the connection by capturing packets. Therefore, the ETL Trace method provided by Windows is required to obtain the calling information of the application to Network APIs, such as TCPIP.

netsh trace start provider={2F07E2EE-15DB-40F1-90EF-9D7BA282188A} keywords=0xffffffffffffffff level=0xff provider={E53C6823-7BB8-44BB-90DC-3F86090D48A6} keywords=0xffffffffffffffff level=0xff provider={7D44233D-3055-4B9C-BA64-0D47CA40A232} keywords=0xffffffffffffffff level=0xff provider={50B3E73C-9370-461D-BB9F-26F32D68887D} keywords=0xffffffffffffffff level=0xff provider={43D1A55C-76D6-4F7E-995C-64C711E5CAFE} keywords=0xffffffffffffffff level=0xff maxSize=500MB fileMode=circular persistent=no overwrite=yes report=yes correlation=yes traceFile=c:NetworkTrace.etl capture=yes packettruncatebytes=128

From the collected logs, the Process ID corresponding to the request is 0x0600 = 1536.

 16767 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.978 [Microsoft-Windows-TCPIP]TCP: endpoint 0xFFFFFA80C4CF0CF0 (Family=IPV6 , PID=1536) created with status = STATUS_SUCCESS. 
 16768 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]socket: 1: Process 0xFFFFFA80C3FB0860 (0x600), Endpoint 0xFFFFFA80C259FE50, Family 0, Type 0, Protocol 0, Seq 1013, Status STATUS_SUCCESS  
 16769 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x1, Seq 11002, Status STATUS_SUCCESS 
 16770 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address ::, Seq 7010, Status STATUS_SUCCESS 
 16771 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 acquired port number 10010. 
 16772 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) bound. 
 16773 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::]:10010, Seq 7022, Status STATUS_SUCCESS 
 16774 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]connect: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::ffff:120.55.35.39]:443, Seq 5023, Status STATUS_SUCCESS 
 16775 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosedState  to SynSentState , SndNxt = 0. 
 16776 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 attempted to acquire weak reference on port number 10010 inherited from endpoint 0xFFFFFA80C4CF0CF0. Successful = TRUE . 
 16777 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) requested to connect. 
 16779 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 is going to output SYN with ISN = 2438595996, RcvWnd = 8192, RcvWndScale = 8. 
 16780 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect proceeding. 
 16789 [7]0000.0000::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from SynSentState  to EstablishedState , SndNxt = 2438595997. 
 16790 [7]0000.0000::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1. ThSeq = 2246241665. 
 16791 [7]0000.0000::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect completed. PID = 1536. 
 16793 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]connect: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 5024, Status STATUS_SUCCESS 
 16794 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x0, Seq 11002, Status STATUS_SUCCESS 
 16795 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.009 [Microsoft-Windows-TCPIP]TCP: Option TCP_OFFLOAD_NOT_PREFERRED  is going to be set for connection 0xFFFFFA80C3956B20. 
 16796 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3047, Status STATUS_SUCCESS 
 16797 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3056, Status STATUS_SUCCESS 
 16798 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 send posted posted 279 bytes at 2438595997. 
 16807 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3051, Status STATUS_SUCCESS 
 16815 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3024, Status STATUS_SUCCESS 
 16818 [7]3ECC. 2FC8::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2C56830, Length 1452, Seq 9000 
 16819 [7]3ECC. 2FC8::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246241665.  
 16820 [7]3ECC. 2FC8::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246241665. 
 16823 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4115, Status STATUS_SUCCESS 
 16824 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2D36F50, Length 1452, Seq 9000 
 16825 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4116, Status STATUS_SUCCESS 
 16826 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246243117. 
 16827 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246243117. 
 ...
 17644 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from EstablishedState  to FinWait1State , SndNxt = 2438604647. 
 17650 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2002, Status STATUS_SUCCESS 
 17651 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) closed. 
 17652 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 released port number 10010. WeakReference = FALSE . 
 17653 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2003, Status STATUS_SUCCESS 
 17654 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2000, Status STATUS_SUCCESS 
 17655 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2001, Status STATUS_SUCCESS 
 17664 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from FinWait1State  to ClosingState , SndNxt = 2438604648. 
 17665 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 delivering FIN. RcvNxt = 2246247426. 
 17666 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-Winsock-AFD]disconnect indicated: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 12001 
 17671 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosingState  to TimeWaitState , SndNxt = 2438604648. 
 17673 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) disconnect completed. 
 17675 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) close issued. 
 17676 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) shutdown initiated (The transport connection was aborted by the local system.). PID = 1536. 
 17677 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from TimeWaitState  to ClosedState , SndNxt = 2438604648. 
 17678 [2]0004.0058::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C3956B20 released port number 10010. WeakReference = TRUE . 

When PID is 1536, the corresponding Java.exe can be found through Process Explorer.

2

5.  In step 4, we have found that the connection request to Alibaba Cloud server 120.55.35.x is triggered by CloudMonitor.

Considering that the operating system without TcpTimedWaitDelay configured uses 2MSL (120s) TIME_WAIT timeout by default, if nearly 15000 ports of a system are in the TIME_WAIT status, theoretically, CloudMonitor needs to generate at least 125 connections per second. However, this type of behavior wasn't found in the CloudMonitor log or the previous ETL Trace log. The CloudMonitor program regularly establishes connections every few seconds to submit "metrics".

41931919752 INFO 2018-12-24 17:57:17.748 [ricGatherServiceHttp] metric submission completed, time consumed: 78 ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]
41931935181 INFO 2018-12-24 17:57:33.177 [ricGatherServiceHttp] metric submission completed, time consumed: 47 ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]
41931950625 INFO 2018-12-24 17:57:48.621 [ricGatherServiceHttp] metric submission completed, time consumed: 47 ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

6.  The bug is quite strange. After all other possibilities are excluded, the only possibility left is that the windows system does not release the TIME_WAIT port after 2MSL, and subsequent troubleshooting confirms this. Unfortunately, we could not find the specific cause from the ETL Trace log, but the user reminds us with an unintentional sentence,

"This server has been running for a long time without any bugs."

How long has it been running specifically? Through the Eventlog 6013, we confirmed that the server had been running for 43225197s, nearly 500 days. This caught our attention.

All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days from system startup in Windows Vista, in Windows 7, in Windows Server 2008 and in Windows Server 2008 R2

https://support.microsoft.com/en-us/help/2553549/all-the-tcp-ip-ports-that-are-in-a-time-wait-status-are-not-closed-aft

So far, the bug has been thoroughly analyzed and solved.

Cause Analysis

Let me briefly explain the cause of this bug. In a Windows system, the time to release TCP endpoints in the TIME_WAIT status is determined by the current system running time plus the TIMEOUT. In a system without patches installed, the maximum value that can be stored in the 32-bit register is 0xFFFFFFFF, and the value stored in the register is "time * 0n100", that is to say, bugs may occur after the system runs for more than "0XFFFFFFFF/0n100" seconds.

0xFFFFFFFF/(0n86400 * 0n100)= 497.1026961805...
0 0 0
Share on

Tim Chen

8 posts | 1 followers

You may also like

Comments