Recently, we have been experiencing the problem with customer's Windows Server 2008 R2 servers being unable to connect to the network. After some research, we finally found the potential bug in the Windows DHCP Client service. Here is the debugging process.
Note: The Windows product group on November 9, 2018 has fixed this issue in Windows 1903 (19H1).
The Windows Server 2008 R2 server cannot not be connected to the network. After troubleshooting, it was found that the IP of the NIC changed to an APIPA address, 169.254.x.x. You can use ipconfig/renew to solve the problem quickly.
The problem was with the Windows computers that obtained the address through the DHCP service, so the first thing that came to our mind was that the IP address was released. For the releasing of the address, the following are several situations:
1.The user runs ipconfig/release to release the address. If the DHCP Client log is enabled, we can see the following events:
Unfortunately, the operating system does not enable the Microsoft-Windows-DHCP Client Events/Operational log by default.
2.The lease time of the DHCP IP address expires, and the system actively releases the IP address. We can see the following events from the Microsoft-Windows-DHCP Client Events/Admin:
Note: In the DHCP protocol specification, we have many opportunities to renew DHCP lease info to extend the lease term.
Since these two event logs are not enabled in the system, we confirmed that the disappearance of the IP addresses of almost all our computers was caused by the one year DHCP lease expiration, by comparing the current LeaseTime of DHCP with the running time of the system in a roundabout way. Using the event log Eventlog 6013, we can easily find the system running time. As shown in the figure below, the host has run for 3152517 seconds, which is more than 364 days.
The lease period for the IP address of our DHCP service is usually one year, so we believe that it was indeed the expiration of the computer IP address lease period that caused the address to be released. At the same time, by using the DHCP Server service log, we can basically confirm that the DHCP Server did not receive the DHCP Request message from this host to renew the address during the year.
The problem is related to the operating system behavior, while the Windows operating system is a relatively closed environment, and the log is not enabled by default, so we cannot fully confirm whether the DHCP Request was sent at that time.
Fortunately, the log on the DHCP Server clearly records that the DHCP Client and the DHCP Server have been interacting with each other for DHCP Information requests, so we determine that the DHCP Client did not send the DHCP Request request at that time. To verify this idea, we have collected the DHCP Client service dump for analysis.
First, we check whether the DHCP Client service is abnormal from the dump. Through reviewing call stacks, we find that the DHCP Client service is mainly waiting for the message to process DHCP requests. According to experience, we only need to pay attention to whether any long call stacks exist.
0:005> kL Child-SP RetAddr Call Site 00000000`012af518 000007fe`fcf31430 ntdll!NtWaitForMultipleObjects+0xa 00000000`012af520 00000000`76f016e3 KERNELBASE!WaitForMultipleObjectsEx+0xe8 00000000`012af620 000007fe`fab5a849 kernel32!WaitForMultipleObjectsExImplementation+0xb3 00000000`012af6b0 000007fe`fab61bc0 dhcpcore!ProcessDhcpRequestForever+0x2d5 00000000`012af7a0 00000000`ffc11318 dhcpcore!ServiceMain+0x398 00000000`012af840 000007fe`fe06a831 svchost!ServiceStarter+0x1eb 00000000`012af8c0 00000000`76ef59cd sechost!ScSvcctrlThreadW+0x25 00000000`012af8f0 00000000`7712a561 kernel32!BaseThreadInitThunk+0xd 00000000`012af920 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
The DHCP main service calls ProcessDhcpRequestForever (this name indicates that it is a function that mainly processes DHCP messages). Therefore, we can obtain some details by disassembling ProcessDhcpRequestForever. For example, we find the symbol information dhcpcore! DhcpGlobalNICList:
Then, we find the corresponding Lease information, and confirm whether the data structure is accurate:
We can clearly see that all the time computing is correct, but that a RunTime time in the structure is the same as T1 time. The RunTime records the time when the DHCP Client needs to operate on this IP Lease next time. In fact, it is similar to the time of renewing the IP, but it can tell us whether T1 has renewed the address. If the renewal is successful, the Runtime becomes the next T1, while if the renewal fails, the Runtime should be T2 or even the Expire time. But in this dump, we found that this value did not change. This shows that the DHCP Client did not perform the renewal operation for some reason at that time. At this stage, we can actually ask Microsoft to check whether this is the problem with the DHCP Client.
And, by disassembling ProcessDhcpRequestForever, we also confirm that the DHCP Client service is waiting for the event to process:
For those who are familiar with Windows programming, it is easier to understand here. In _imp_WaitForMultipleObjectsEx, the above Event information is passed into the system as a parameter. Only when the above event occurs will the system return the corresponding signal information. Then, the DHCP Client will process the response event (either in the current thread or create a new thread) according to the event, and continue to wait for the message after completion.
WaitForMultipleObjectsEx function https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitformultipleobjectsex
We can now notice that one of the returns isWAIT_TIMEOUT = 0x00000102L:
dhcpcore!ProcessDhcpRequestForever+0x2fc: 000007fe`fab635db 3d02010000 cmp eax,102h 000007fe`fab635e0 0f858a500000 jne dhcpcore!ProcessDhcpRequestForever+0x307 (000007fe`fab68670)
The ProcessDhcpRequestForever code contains the processing for 0x00000102L. Those who are interested can follow it further.
As for this problem, according to the DHCP RFC standard, we believe that the DHCP Client service needs an Event that is triggered at T1 or T2, but this Event is not seen in the parameters previously passed in, so we think that the processing process is in the logic of WAIT_TIMEOUT processing. Therefore, it is necessary to set a correct T1 or T2 waiting time for WaitForMultipleObjectsEx. But in fact, when we review the problems of WaitForMultipleObjectsEx, we find that the dwMilliseconds parameter passed to WaitForMultipleObjectsEx is 0xFFFFFFFF = INFINITE.
Therefore, we suspect that the following processing logic has an overflow problem:
Let's do some simple arithmetic: The Lease Duration for a year is 365 24 60 * 60 = 0n31536000 = 0x01e13380. So, T1 = 0n31536000 / 2 = 0x01e13380 / 2 = 0xF099C0 . Therefore, according to the logic in the ASM code, we find that an overflow risk exists in the 32-bit register edi, causing edi to be reset to 0xFFFFFFFF, which may also be the reason for INFINITE waiting.
Tim Chen - May 22, 2019
Tim Chen - May 22, 2019
Alibaba Clouder - August 24, 2020
Tim Chen - May 22, 2019
Alibaba Clouder - August 20, 2020
Tim Chen - May 22, 2019
An online computing service that offers elastic and secure virtual cloud servers to cater all your cloud hosting needs.Learn More
Apsara Stack Agility Elastic Compute Service (Alibaba Cloud ZStack) is a light-weight hybrid cloud solution.Learn More
Alibaba Cloud offers an accelerated global networking solution that makes distance learning just the same as in-class teaching.Learn More
Connect your business globally with our stable network anytime anywhere.Learn More
More Posts by Tim Chen