Problem description
After an Elastic Compute Service (ECS) instance runs for an extended period of time without a restart, it may lose network connectivity: connections to the instance drop, and neither the public nor the private IP address of the instance responds to ping.
Cause
The first time you start an ECS instance, the system uses Dynamic Host Configuration Protocol (DHCP) to automatically assign IP addresses to the elastic network interfaces (ENIs) of the instance and obtains the lease expiration time of each IP address. The dhclient process of the Linux operating system and the DHCP Client service of the Windows operating system periodically renew the lease with the DHCP server to keep the IP addresses available on the instance. However, instances created from specific CentOS 7 images may terminate the dhclient process, and the DHCP Client service of the Windows Server operating system has known issues. As a result, the ECS instance cannot automatically renew the leases of its IP addresses. When a lease expires after the first renewal, the private IP addresses of the instance are released and the instance loses network connectivity.
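The timing behind this failure can be sketched in a few lines. This example is an illustration only, not Alibaba Cloud code: it assumes the default timer ratios from RFC 2131 (renew at 50% of the lease, rebind at 87.5%), and the function name `lease_timeline` and the one-hour lease are hypothetical.

```python
from datetime import datetime, timedelta

def lease_timeline(obtained_at, lease_seconds):
    """Return the renew (T1), rebind (T2), and expiry times of a DHCP lease.

    Assumes the RFC 2131 default ratios: T1 = 50% and T2 = 87.5% of the lease.
    A DHCP server may override these, so treat the result as an approximation.
    """
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew": obtained_at + lease * 0.5,
        "rebind": obtained_at + lease * 0.875,
        "expires": obtained_at + lease,
    }

# Hypothetical one-hour lease obtained at midnight.
timeline = lease_timeline(datetime(2018, 11, 15, 0, 0, 0), 3600)
```

If no DHCP client is running when the renew time arrives, nothing extends the lease, and the IP address is released at the expiry time.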
Affected operating systems
The preceding issue may occur on ECS instances that meet the following requirements and automatically assign IP addresses to ENIs by using DHCP. You can resolve the issue as described in this topic. If your ECS instance is configured with a static IP address, you do not need to handle the issue.
ECS instances created from the following CentOS 7 public images before May 31, 2018 and not restarted after November 15, 2018:
centos_7_04_64_20G_alibase_20180419.vhd
centos_7_04_64_20G_alibase_20180326.vhd
centos_7_04_64_20G_alibase_201701015.vhd
centos_7_03_64_20G_alibase_20170818.vhd
centos_7_02_64_20G_alibase_20170818.vhd
centos_7_03_64_40G_alibase_20170710.vhd
centos_7_03_64_40G_alibase_20170625.vhd
centos_7_03_64_40G_alibase_20170523.vhd
centos_7_03_64_40G_alibase_20170503.vhd
ECS instances created from the following Windows Server operating systems before November 15, 2018 and not restarted afterward:
Windows Server 2008 R2
Windows Server 2012 R2
Windows Server 2016
Windows Server Version 1709
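The criteria above can be expressed as a small check. The following is a minimal sketch: the function name `may_be_affected` and its parameters are hypothetical, the image names and dates are copied from the lists above, and "not restarted after November 15, 2018" is assumed to mean that the last restart happened on or before that date.

```python
from datetime import date

# Image names copied from the CentOS 7 list above.
AFFECTED_CENTOS_IMAGES = {
    "centos_7_04_64_20G_alibase_20180419.vhd",
    "centos_7_04_64_20G_alibase_20180326.vhd",
    "centos_7_04_64_20G_alibase_201701015.vhd",
    "centos_7_03_64_20G_alibase_20170818.vhd",
    "centos_7_02_64_20G_alibase_20170818.vhd",
    "centos_7_03_64_40G_alibase_20170710.vhd",
    "centos_7_03_64_40G_alibase_20170625.vhd",
    "centos_7_03_64_40G_alibase_20170523.vhd",
    "centos_7_03_64_40G_alibase_20170503.vhd",
}
AFFECTED_WINDOWS_VERSIONS = {
    "Windows Server 2008 R2",
    "Windows Server 2012 R2",
    "Windows Server 2016",
    "Windows Server Version 1709",
}
CENTOS_CREATION_CUTOFF = date(2018, 5, 31)
RESTART_CUTOFF = date(2018, 11, 15)

def may_be_affected(image, created_on, last_restart, uses_dhcp=True):
    """Return True if an instance matches the affected criteria in this topic."""
    if not uses_dhcp:
        return False  # Static IP addresses are not affected.
    if last_restart > RESTART_CUTOFF:
        return False  # Assumption: a restart after the cutoff clears the issue.
    if image in AFFECTED_CENTOS_IMAGES:
        return created_on < CENTOS_CREATION_CUTOFF
    if image in AFFECTED_WINDOWS_VERSIONS:
        return created_on < RESTART_CUTOFF
    return False
```

A positive result only means that the instance is worth checking with one of the methods below; it does not confirm that the DHCP client is actually broken.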
Solutions
Take note of the following items:
Before you perform high-risk operations such as modifying the specifications or data of an Alibaba Cloud instance, we recommend that you check the disaster recovery and fault tolerance capabilities of the instance to ensure data security.
Before you modify the configurations or data of an Alibaba Cloud instance, such as an ECS instance or an ApsaraDB RDS instance, we recommend that you create snapshots or enable backups for the instance. For example, you can enable log backups for an ApsaraDB RDS instance.
If you granted specific users the permissions on sensitive information, such as usernames and passwords, or submitted sensitive information in the Alibaba Cloud Management Console, we recommend that you modify the sensitive information at the earliest opportunity.
You can choose one of the following solutions based on your business scenario:
Method 1: Cloud Assistant-based batch repair: This is an easier method suitable for scenarios where operations are concurrently performed on multiple instances in the ECS console.
Method 2: Python SDK script-based batch repair: A Python SDK script is written based on the Cloud Assistant API. Use the region as the repair unit to batch check the status of your instances and complete automatic repair. This method is suitable for users who are familiar with scripted O&M.
Method 3: Shell or PowerShell script-based repair: A shell or PowerShell script is provided. You must log on to each ECS instance and run the script manually to resolve the issue. This method is suitable for spot checks or testing on a small number of instances. The script content is the same as that in Method 1.
Method 4: Troubleshoot ENIs one by one: This method is suitable for scenarios with a small number of instances.
Method 1: Cloud Assistant-based batch repair
In this example, Cloud Assistant is used to check and automatically repair affected ECS instances. Make sure that Cloud Assistant Agent is installed on the ECS instances you want to repair. ECS instances created after December 01, 2017 are pre-installed with Cloud Assistant Agent by default. For more information, see Install Cloud Assistant Agent.
Perform the following steps:
Download the following shell or PowerShell script and paste it into the command content of the Cloud Assistant command:
CentOS instances: linux_fix_dhclient.sh
Windows instances: win_fix_dhclient.ps1
Select ECS instances and run the Cloud Assistant command on the instances. For more information, see Run a command.
Confirm that the execution is successful. For more information, see Check execution results and troubleshoot common issues. The following figure shows the command execution results returned for CentOS and Windows instances.
Method 2: Python SDK script-based batch repair
In this example, a Python script is written based on the Cloud Assistant API, which can check and automatically repair all affected instances in an Alibaba Cloud region. For more information about how to install the ECS SDK, see Alibaba Cloud GitHub repository installation documentation.
Before you begin
Run the following commands to download the relevant Python SDK dependencies to your on-premises computer or ECS instances:
pip install aliyun-python-sdk-core
pip install aliyun-python-sdk-ecs
Procedure
Download the autofix_dhclient.py file to the ECS instances you want to repair.
Use the Cloud Assistant API to run the downloaded script with the following command:
sudo python autofix_dhclient.py <AccessKeyID> <AccessKeySecret> <region-id>
Note: Replace <AccessKeyID>, <AccessKeySecret>, and <region-id> with the actual values.
AccessKeyID: the AccessKey ID of your Alibaba Cloud account or Resource Access Management (RAM) user.
AccessKeySecret: the AccessKey secret of your Alibaba Cloud account or RAM user.
region-id: the region ID of the ECS instances.
Result
The following figure shows a sample script running result.
The following section describes the status check of an ECS instance:
Cloud Assistant: This check item checks whether the instance is installed with Cloud Assistant Agent.
Installed: indicates that Cloud Assistant Agent is installed on the instance.
Not Installed: indicates that Cloud Assistant Agent is not installed on the instance. In this case, install Cloud Assistant Agent on the instance.
NeedFix: This check item checks whether the instance needs to repair the dhclient process or the DHCP Client service.
Yes: indicates that a repair is required. The script automatically completes the subsequent operations.
No: indicates that no repair is required.
Unknown: indicates that the script cannot determine whether a repair is required. You must manually perform operations.
FixResult: This check item reports the script repair result.
Success: indicates that the dhclient process or DHCP Client service is repaired.
Failed: indicates that the repair failed.
NoChange: indicates that no repair is required.
Unknown: indicates that the script cannot determine whether a repair is required. You must manually perform operations.
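The status values above can be turned into a follow-up decision for each instance. The following is a minimal sketch, assuming the three values of a result row are available as plain strings; the function name `next_action` and the returned action strings are hypothetical and are not produced by autofix_dhclient.py.

```python
def next_action(cloud_assistant, need_fix, fix_result):
    """Map one row of the status report to a follow-up action."""
    if cloud_assistant == "Not Installed":
        return "install Cloud Assistant Agent"
    if need_fix == "Unknown" or fix_result == "Unknown":
        return "check manually"
    if fix_result == "Failed":
        return "investigate failed repair"
    # Remaining cases: NeedFix=No with FixResult=NoChange,
    # or NeedFix=Yes with FixResult=Success.
    return "nothing to do"
```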
Method 3: Shell or PowerShell script-based repair
This method requires you to log on to the affected instances and troubleshoot the issue one by one. This method is suitable for scenarios with a small number of instances.
Procedure for CentOS instances
Download the linux_fix_dhclient.sh script to any directory.
Switch to the working directory where the script is stored and run the script as the root user.
sudo bash linux_fix_dhclient.sh
Note: A return value of 0 indicates that the script completed the check and repair operations. Other return values indicate that the repair failed.
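If you drive the script from Python rather than from an interactive shell, the return value can be checked as follows. This is a minimal sketch: the function names are hypothetical, the script is assumed to be in the current working directory, and the wrapper must itself be run as the root user.

```python
import subprocess

def interpret_exit_code(code):
    """Translate the script's return value as described in the note above."""
    return "check and repair completed" if code == 0 else "repair failed"

def run_fix(script_path="linux_fix_dhclient.sh"):
    """Run the repair script with bash and report the outcome."""
    completed = subprocess.run(["bash", script_path])
    return interpret_exit_code(completed.returncode)
```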
Procedure for Windows instances
Download the win_fix_dhclient.ps1 script to any directory.
Open PowerShell as an administrator and run the following command:
powershell -executionpolicy bypass -file C:\win_fix_dhclient.ps1
Note: Replace C:\win_fix_dhclient.ps1 with the actual script file path.
If "No ip will expire in recent 500 days. Then no need fix." is returned, the DHCP Client service of the instance has no exception and does not require a repair.
If "Found one ip will expire in 500 days. We need fixing it!!! Fix it now... Fix success." is returned, the DHCP Client service of the instance is abnormal and the script has completed the repair operation.
Other return values indicate that the repair failed.
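When you collect the script output from several Windows instances, the messages above can be classified automatically. The following is a minimal sketch that matches on fragments of the two messages quoted above; the function name `classify_output` and the returned labels are hypothetical.

```python
def classify_output(output):
    """Classify the text printed by win_fix_dhclient.ps1."""
    if "no need fix" in output:
        return "healthy"   # The DHCP Client service has no exception.
    if "Fix success" in output:
        return "repaired"  # The service was abnormal and has been repaired.
    return "failed"        # Any other output indicates that the repair failed.
```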
Method 4: Troubleshoot ENIs one by one
This method requires you to check and fix the dhclient process (CentOS instances) or IP address lease expiration time (Windows instances) of each ENI.
Procedure for CentOS instances
Run the following command to check all ENIs of the instance:
ls -al /sys/class/net/
Run the following command to check whether the eth0 ENI uses DHCP to assign an IP address:
cat /etc/sysconfig/network-scripts/ifcfg-eth0
In the command output, BOOTPROTO=dhcp indicates that the ENI uses DHCP to assign an IP address. If the ENI does not use DHCP to assign an IP address, go to Step 7.
Run the following command to check the status of the dhclient process of the eth0 ENI:
ps aux | grep dhclient | grep eth0
An empty command output indicates that the dhclient process is abnormal.
The following command output indicates that the dhclient process is running as expected. In this case, go to Step 7.
root 15340 0.0 0.3 113372 12788 ? Ss 14:16 0:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid -H izuf****************** eth0
Run the following command to restart the dhclient process:
ifup eth0
Note: In this example, the eth0 ENI is used. Replace eth0 with the actual ENI name.
Check the status of the dhclient process of the ENI again.
Repeat steps 3 to 6 to check and fix the status of the dhclient process of all other ENIs.
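The two checks in this procedure can also be applied to captured command output. The following is a minimal sketch: the function names are hypothetical, and `dhclient_running` assumes the `ps aux` column layout and the argument order shown in the sample output above, where the ENI name is the last argument of the dhclient command.

```python
def uses_dhcp(ifcfg_text):
    """Return True if the ifcfg file content sets BOOTPROTO=dhcp."""
    for line in ifcfg_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "BOOTPROTO":
            return value.strip().strip("\"'").lower() == "dhcp"
    return False

def dhclient_running(ps_output, interface):
    """Return True if the ps aux output lists a dhclient process for the ENI."""
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # The 11th column is the full command.
        if len(fields) < 11:
            continue
        command = fields[10].split()
        if command[0].endswith("dhclient") and command[-1] == interface:
            return True
    return False
```

An ENI for which `uses_dhcp` returns True and `dhclient_running` returns False corresponds to the abnormal state that `ifup` repairs in the procedure above.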
Procedure for Windows instances
Open the Command Prompt window as an administrator.
Run the following command to check whether the DHCP enabled value of each ENI described as Red Hat VirtIO Ethernet Adaptor is Yes and the time when its lease expires:
ipconfig /all
Note: The primary and secondary ENIs of an ECS instance are described as Red Hat VirtIO Ethernet Adaptor. The issue in this topic does not affect custom VPN or loopback network interface controllers (NICs) or NICs not enabled with DHCP.
If the lease expires within one year, run the following command to update the lease expiration time:
ipconfig /renew
Run the ipconfig /all command to confirm that the returned lease expires within 10 years, indicating that the repair is completed.
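The one-year check in this procedure can be sketched as follows. The example makes several assumptions: `ipconfig /all` prints the "Lease Expires" line in the English (United States) format shown in `LEASE_FORMAT`, the Python process also runs in an English locale, and one year is treated as 365 days. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed format of the lease time in English-locale ipconfig output.
LEASE_FORMAT = "%A, %B %d, %Y %I:%M:%S %p"

def lease_expiry(ipconfig_line):
    """Parse the timestamp from a 'Lease Expires . . . :' line."""
    _, _, value = ipconfig_line.partition(": ")
    return datetime.strptime(value.strip(), LEASE_FORMAT)

def needs_renewal(expires_at, now):
    """Return True if the lease expires within one year (365 days) of now."""
    return expires_at - now < timedelta(days=365)
```

For example, `needs_renewal(lease_expiry(line), datetime.now())` returning True corresponds to the case in which you run `ipconfig /renew`.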
Applicable scope
ECS