Troubleshoot IP address faults in CentOS 7 instances and Windows instances

Last Updated: Dec 15, 2020

Problem description

An ECS instance that has been running continuously for a long period without being restarted suddenly loses its network connection, and neither its public nor its private IP address can be pinged.

 

Possible cause

When an ECS instance starts for the first time, the system uses the Dynamic Host Configuration Protocol (DHCP) to automatically assign an IP address to the elastic network interface and obtains the lease expiration time of that IP address. Normally, the dhclient process on Linux and the DHCP Client service on Windows periodically renew the lease with the DHCP server to keep the instance IP address available.

 

The dhclient process is removed from instances created from certain CentOS 7 images, and known issues exist in the DHCP Client service of the Windows Server operating system (see the Scenarios section for the affected scope). As a result, these instances cannot automatically renew the lease of their IP addresses. When the lease expires, the private IP address of the instance is released and the instance is disconnected from the network.
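To make the renewal mechanism concrete, the following is a minimal sketch of the timers a standard DHCP client follows. It assumes the default values from RFC 2131 (renewal at 50% of the lease, rebinding at 87.5%). The function name and the one-day lease are illustrative assumptions, not values taken from Alibaba Cloud.

```python
from datetime import datetime, timedelta


def renewal_schedule(lease_obtained: datetime, lease_seconds: int) -> dict:
    """Return the times at which a DHCP client acts on a lease (RFC 2131 defaults)."""
    lease = timedelta(seconds=lease_seconds)
    return {
        # T1: the client starts asking the original server to renew the lease.
        "renew_at": lease_obtained + lease * 0.5,
        # T2: the client starts asking any server to rebind the lease.
        "rebind_at": lease_obtained + lease * 0.875,
        # If no renewal ever succeeds, the address must be given up at this time.
        "expires_at": lease_obtained + lease,
    }


# Illustrative one-day lease obtained at midnight on January 1, 2018.
schedule = renewal_schedule(datetime(2018, 1, 1), lease_seconds=86400)
for name, when in schedule.items():
    print(name, when.isoformat())
```

If no client process is running to act at the renewal time, nothing extends the lease, which is the failure this topic describes.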

 

Scenarios

If an ECS instance meets the following conditions and uses DHCP to automatically assign IP addresses to its elastic network interfaces, you must fix the problem described in this topic. If the IP addresses are statically configured, no action is required.

  • Instances of any type created based on the following CentOS 7 public images (ECS instances created before May 31, 2018 and not restarted after November 15, 2018).
    • centos_7_04_64_20G_alibase_20180419.vhd
    • centos_7_04_64_20G_alibase_20180326.vhd
    • centos_7_04_64_20G_alibase_201701015.vhd
    • centos_7_03_64_20G_alibase_20170818.vhd
    • centos_7_02_64_20G_alibase_20170818.vhd
    • centos_7_03_64_40G_alibase_20170710.vhd
    • centos_7_03_64_40G_alibase_20170625.vhd
    • centos_7_03_64_40G_alibase_20170523.vhd
    • centos_7_03_64_40G_alibase_20170503.vhd
  • Instances running the following Windows Server operating systems (ECS instances that were created before November 15, 2018 and have not been restarted since).
    • Windows Server 2008 R2
    • Windows Server 2012 R2
    • Windows Server 2016
    • Windows Server Version 1709
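The conditions above can be sketched as a small check. This is an illustration only: the function and constant names are hypothetical, and it reads "have not been restarted since" for Windows as "never restarted since creation", which is an assumption.

```python
from datetime import date
from typing import Optional

# Image names and OS versions copied from the lists above.
AFFECTED_CENTOS_IMAGES = {
    "centos_7_04_64_20G_alibase_20180419.vhd",
    "centos_7_04_64_20G_alibase_20180326.vhd",
    "centos_7_04_64_20G_alibase_201701015.vhd",
    "centos_7_03_64_20G_alibase_20170818.vhd",
    "centos_7_02_64_20G_alibase_20170818.vhd",
    "centos_7_03_64_40G_alibase_20170710.vhd",
    "centos_7_03_64_40G_alibase_20170625.vhd",
    "centos_7_03_64_40G_alibase_20170523.vhd",
    "centos_7_03_64_40G_alibase_20170503.vhd",
}
AFFECTED_WINDOWS_VERSIONS = {
    "Windows Server 2008 R2",
    "Windows Server 2012 R2",
    "Windows Server 2016",
    "Windows Server Version 1709",
}
CENTOS_CREATED_BEFORE = date(2018, 5, 31)
CENTOS_RESTART_CUTOFF = date(2018, 11, 15)
WINDOWS_CREATED_BEFORE = date(2018, 11, 15)


def in_affected_scope(image_or_os: str, created: date,
                      last_restart: Optional[date] = None,
                      uses_dhcp: bool = True) -> bool:
    """Return True if an instance matches the scope described above."""
    if not uses_dhcp:
        return False  # statically configured addresses are not affected
    if image_or_os in AFFECTED_CENTOS_IMAGES:
        restarted_after_cutoff = (last_restart is not None
                                  and last_restart > CENTOS_RESTART_CUTOFF)
        return created < CENTOS_CREATED_BEFORE and not restarted_after_cutoff
    if image_or_os in AFFECTED_WINDOWS_VERSIONS:
        # Assumption: any restart after creation takes the instance out of scope.
        return created < WINDOWS_CREATED_BEFORE and last_restart is None
    return False


print(in_affected_scope("Windows Server 2016", date(2018, 6, 1)))
```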

 

Solution

Alibaba Cloud reminds you that:

  • Before you perform operations that may cause risks, such as modifying instance configurations or data, we recommend that you check the disaster recovery and fault tolerance capabilities of the instances to ensure data security.
  • If you modify the configurations and data of instances including but not limited to ECS and RDS instances, we recommend that you create snapshots or enable RDS log backup.
  • If you have authorized or submitted security information such as the logon account and password in the Alibaba Cloud Management console, we recommend that you modify such information in a timely manner.

This topic describes four methods that you can use to fix the issue.

  • Method 1: Use Cloud Assistant to repair multiple instances at a time. This method applies to scenarios with multiple instances. You can complete the operations in the ECS console.
  • Method 2: Write a Python SDK script based on the Cloud Assistant API. The script checks the status of your instances and automatically fixes them by region. This method is suitable for users who are familiar with scripted O&M.
  • Method 3: Use Shell or PowerShell scripts. You need to log on to each ECS instance to fix the issue. This method applies to checking and testing a small number of instances. The script content is the same as that in Method 1.
  • Method 4: Check the network interface controllers one by one. This method applies to scenarios with a small number of instances.

 

Method 1: repair multiple instances at a time by using Cloud Assistant

In this example, Cloud Assistant checks and automatically fixes the ECS instances. Make sure that the Cloud Assistant client is installed on the instances. The Cloud Assistant client is pre-installed by default on ECS instances created after December 01, 2017. For more information, see Cloud Assistant client.

  1. Log on to the ECS console and go to the Cloud Assistant page.
  2. Select a region.
  3. Click Create to create a command. For more information, see create Cloud Assistant commands.
  4. Download the Shell or PowerShell script and paste it into the command content of Cloud Assistant.
  5. Locate the command that you created. In the Actions column, click Run to run the command on all affected instances at a time. For more information about the procedure, see run command.
  6. After the command is executed, click View Result in the Actions column. For more information about the operation, see query execution results and status.

 

Method 2: perform batch repair by using a Python SDK script

This example uses a Python script based on Cloud Assistant APIs to automatically fix all affected instances in a region. For more information about how to install the ECS SDK, see the installation documentation in the Alibaba Cloud GitHub repository.
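As a rough illustration of what a script built on the Cloud Assistant API has to prepare, the sketch below builds the parameters for creating a command. It assumes the CreateCommand operation takes a Name, a Type of RunShellScript or RunPowerShellScript, and Base64-encoded CommandContent. The helper name and the placeholder script text are hypothetical, and autofix_dhclient.py may be implemented differently.

```python
import base64


def build_command_params(name: str, script_text: str, is_windows: bool) -> dict:
    """Build a parameter dictionary for creating a Cloud Assistant command."""
    encoded = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    return {
        "Name": name,
        # Shell for Linux instances, PowerShell for Windows instances.
        "Type": "RunPowerShellScript" if is_windows else "RunShellScript",
        # The script body is sent Base64-encoded rather than as plain text.
        "CommandContent": encoded,
    }


# Placeholder script body; the real fix script is not reproduced here.
params = build_command_params("fix-dhclient", "echo placeholder", is_windows=False)
print(params["Type"], params["CommandContent"])
```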

 

Preparations

Run the following commands to install the Python SDK dependencies on your local computer or ECS instance.

pip install aliyun-python-sdk-core
pip install aliyun-python-sdk-ecs

 

Procedure

  1. Download autofix_dhclient.py to the local computer or ECS instance that you prepared.
  2. Run the following command to check the script instructions.
    Note: This is an optional step.
    python autofix_dhclient.py
    The following command output is returned.
    Usage: autofix_dhclient.py <AccessKeyID> <AccessKeySecret> <region-id>
    Note: The parameters are described as follows.
    • AccessKeyID: your AccessKey ID. For more information, see create AccessKey.
    • AccessKeySecret: your AccessKey secret.
    • region-id: the region ID of the instances. For more information, see region and zone.
  3. Specify the AccessKeyID, AccessKeySecret, and region-id parameters as required, and run the script as root or an administrator. Example:
    python autofix_dhclient.py LTAIn*******Py6J kXXIOEoPXXvsYRUd**********TRyU cn-hangzhou

 

Execution result

The following figure shows the script results.

[Figure: PythonSDKResult]

The fields in the result are described as follows.

  • Cloud Assistant: indicates whether the Cloud Assistant client is installed on the instance.
    • Installed: The Cloud Assistant client is installed on the instance.
    • Not Installed: The Cloud Assistant client is not installed. Install the Cloud Assistant client and then continue.
  • NeedFix: indicates whether the dhclient process or the DHCP Client service on the instance needs to be fixed.
    • Yes: A fix is required. The script automatically fixes the issue.
    • No: No fix is required.
    • Unknown: The script cannot be run on the instance. You must run the fix script manually.
  • FixResult: indicates the result of the fix.
    • Success: The dhclient process or the DHCP Client service is recovered.
    • Failed: The fix failed.
    • NoChange: No fix was required.
    • Unknown: The script cannot be run on the instance. You must run the fix script manually.
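The status values above can be turned into a follow-up action per instance. The following is a minimal sketch: the function name and the wording of the returned actions are hypothetical, and the action for Failed is an assumption because the list does not say what to do in that case.

```python
def next_action(cloud_assistant: str, need_fix: str, fix_result: str) -> str:
    """Map the three status fields of one instance to a follow-up action."""
    if cloud_assistant == "Not Installed":
        return "install the Cloud Assistant client and run the script again"
    if need_fix == "Unknown" or fix_result == "Unknown":
        return "run the fix script manually on the instance"
    if fix_result == "Failed":
        # Assumption: a failed fix is followed up manually.
        return "check the instance manually"
    # Success or NoChange: the instance needs no further work.
    return "no action required"


# Hypothetical rows: (instance ID, Cloud Assistant, NeedFix, FixResult).
rows = [
    ("i-example-1", "Installed", "Yes", "Success"),
    ("i-example-2", "Installed", "Unknown", "Unknown"),
    ("i-example-3", "Not Installed", "Unknown", "Unknown"),
]
for instance_id, assistant, need_fix, fix_result in rows:
    print(instance_id, "->", next_action(assistant, need_fix, fix_result))
```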

 

Method 3: fix the issue by using Shell or PowerShell scripts

This method applies to scenarios with a small number of affected instances.

 

For CentOS instances

  1. Connect to the ECS instance. For more information, see connection navigation.
  2. Copy the linux_fix_dhclient.sh script to a directory.
  3. Switch to the working directory where the script is located and run the script as root.
    sudo bash linux_fix_dhclient.sh
    Note:
    • If 0 is returned, the script has finished checking and repairing.
    • If any other result is returned, the fix has failed.
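If you call the script from your own tooling, the result check can be sketched as follows. This assumes that "the result" refers to the exit status of the script process, which the text does not state explicitly. The function name is hypothetical, and a stand-in command is used so the sketch runs without the real script.

```python
import subprocess
import sys


def run_fix_script(command: list) -> tuple:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.stdout


# Stand-in command that exits with status 0. On an affected instance the
# command would be ["bash", "linux_fix_dhclient.sh"], run as root.
ok, output = run_fix_script([sys.executable, "-c", "raise SystemExit(0)"])
print("check and repair finished" if ok else "fix failed")
```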

 

For Windows instances

  1. Connect to the ECS instance. For more information, see connection navigation.
  2. Copy the win_fix_dhclient.ps1 repair script to any directory.
  3. Open PowerShell as an administrator and run the following command:
    powershell -executionpolicy bypass -file C:\win_fix_dhclient.ps1
    Note:
    • Replace C:\win_fix_dhclient.ps1 with the actual file path.
    • If "No ip will expire in recent 500 days. Then no need fix." is returned, the DHCP Client service of the instance is normal and does not need to be repaired.
    • If "Found one ip will expire in 500 days. We need fixing it!!! Fix it now... Fix success." is returned, the DHCP Client service of the instance was abnormal and the script has fixed it.
    • If any other result is returned, the fix has failed.
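The three outcomes above can be distinguished by matching the quoted messages in the script output. The sketch below is a hedged illustration: the function name and the returned labels are hypothetical, and it assumes the messages appear in the output exactly as quoted in the note.

```python
# Message fragments quoted from the note above.
NO_FIX_MARKER = "No ip will expire in recent 500 days"
FIX_NEEDED_MARKER = "Found one ip will expire in 500 days"
FIX_SUCCESS_MARKER = "Fix success"


def classify_output(output: str) -> str:
    """Classify the PowerShell script output as 'healthy', 'fixed', or 'failed'."""
    if NO_FIX_MARKER in output:
        return "healthy"  # DHCP Client service is normal, nothing to repair
    if FIX_NEEDED_MARKER in output and FIX_SUCCESS_MARKER in output:
        return "fixed"    # an expiring lease was found and repaired
    return "failed"       # any other output counts as a failed fix


print(classify_output("No ip will expire in recent 500 days. Then no need fix."))
```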

 

Method 4: check the network interface controllers one by one

In this method, you check each network interface controller one by one: the corresponding dhclient process for CentOS instances, or the lease expiration time of the IP address for Windows instances.

 

For CentOS instances

  1. Connect to the ECS instance.
  2. Run the ls -al /sys/class/net/ command to view all network interface controllers of the instance.
  3. Check the BOOTPROTO setting in the configuration of the eth0 network interface controller to determine whether it uses DHCP to assign an IP address.
    • BOOTPROTO=dhcp indicates that the network interface controller uses DHCP to assign IP addresses.
      [Figure: eth0]
    • If DHCP is not used to assign IP addresses, skip the following steps and go to step 7.
  4. Run the ps aux | grep dhclient | grep eth0 command to check whether the dhclient process corresponding to the eth0 network interface controller is running.
    • An empty result indicates that the dhclient process is abnormal.
    • If the following result is returned, the dhclient process is normal. Skip the following steps and go to step 7.
      root 15340 0.0 0.3 113372 12788 ? Ss 14:16 0:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid -H izuf695ygwh32u2i******z eth0
  5. Run the ifup eth0 command to restart dhclient.
    Note: Replace eth0 in the command with the actual network interface controller identifier.
  6. Check the running status of the dhclient process corresponding to the network interface controller again.
  7. Repeat steps 3 to 6 to check and fix the running status of the dhclient process for all network interface controllers.
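The two checks in the steps above can be sketched as text-parsing helpers. This is an illustration under stated assumptions: the configuration is assumed to be KEY=value text containing BOOTPROTO, the process listing is assumed to look like the ps output shown in step 4, and both function names are hypothetical. The path of the configuration file is not given in this topic, so the sketch takes the file content as a string.

```python
def uses_dhcp(config_text: str) -> bool:
    """Return True if the configuration text contains BOOTPROTO=dhcp."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not KEY=value
        key, _, value = line.partition("=")
        if key.strip() == "BOOTPROTO":
            return value.strip().strip("\"'").lower() == "dhcp"
    return False


def dhclient_running(ps_output: str, nic: str) -> bool:
    """Return True if a dhclient process for the given NIC appears in ps output."""
    for line in ps_output.splitlines():
        fields = line.split()
        # dhclient is started with the NIC name as its last argument.
        if fields and fields[-1] == nic and any(f.endswith("dhclient") for f in fields):
            return True
    return False


sample_ps = ("root 15340 0.0 0.3 113372 12788 ? Ss 14:16 0:00 /sbin/dhclient -1 -q "
             "-lf /var/lib/dhclient/dhclient--eth0.lease "
             "-pf /var/run/dhclient-eth0.pid -H example-host eth0")
print(uses_dhcp("DEVICE=eth0\nBOOTPROTO=dhcp\nONBOOT=yes\n"))
print(dhclient_running(sample_ps, "eth0"))
```

A network interface controller needs the restart in step 5 only when uses_dhcp is true and dhclient_running is false.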

 

For Windows instances

  1. Connect to the ECS instance.
  2. Open Command Prompt (CMD) as an administrator.
  3. Run the following command to check, for each network interface controller described as Red Hat VirtIO Ethernet Adapter, whether DHCP is enabled and whether the lease is about to expire.
    ipconfig /all
    The following command output is returned.
    [Figure: WindowsEachNIC]
    Note: You need to check only the Red Hat VirtIO Ethernet Adapter network interface controllers, which are the primary network interface controller and the secondary elastic network interfaces of the ECS instance. Custom VPN or loopback network interface controllers are not affected. Network interface controllers on which DHCP is not enabled are not affected either.
  4. If the lease expires within one year, run the following command to renew the lease.
    ipconfig /renew
  5. Run the ipconfig /all command again. If the lease expiration time is updated to ten years later, the repair is complete.
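The one-year check in step 4 can be sketched by parsing the "Lease Expires" lines of the ipconfig /all output. This is a hedged illustration: it assumes English-language output with dates such as "Friday, January 1, 2021 10:00:00 AM", which depends on the Windows display language and regional settings. The function name and the sample output are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "Lease Expires . . . . . : Friday, January 1, 2021 10:00:00 AM".
LEASE_LINE = re.compile(r"^\s*Lease Expires[ .]*:\s*(.+?)\s*$", re.MULTILINE)
DATE_FORMAT = "%A, %B %d, %Y %I:%M:%S %p"  # assumption: en-US date format


def leases_needing_renewal(ipconfig_output: str, now: datetime) -> list:
    """Return the lease expiration times that fall within one year of 'now'."""
    threshold = now + timedelta(days=365)
    expiring = []
    for match in LEASE_LINE.finditer(ipconfig_output):
        expires = datetime.strptime(match.group(1), DATE_FORMAT)
        if expires <= threshold:
            expiring.append(expires)
    return expiring


# Hypothetical excerpt with one lease close to expiry and one healthy lease.
sample_output = (
    "   Lease Obtained. . . . . . . . . . : Saturday, January 1, 2011 10:00:00 AM\n"
    "   Lease Expires . . . . . . . . . . : Friday, January 1, 2021 10:00:00 AM\n"
    "   Lease Expires . . . . . . . . . . : Tuesday, January 1, 2030 10:00:00 AM\n"
)
expiring = leases_needing_renewal(sample_output, now=datetime(2020, 12, 15))
print(expiring)
```

An empty list means that no lease in the output expires within a year, so ipconfig /renew is not needed.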

 

Application scope

  • ECS