Remote connection failure or "Too many open files" error after adjusting the nofile parameter on a Linux instance

Problem description

Remote connection failure:
- When you connect to the Elastic Compute Service (ECS) instance using Secure Shell (SSH) or Workbench, the connection is refused or times out.
- When you log on via Virtual Network Computing (VNC), a "System error" message appears after you enter the correct username and password, and the logon fails.
Application errors:
- Application logs or command-line output show a "Too many open files" error.

Cause

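This issue is caused by an unsuitable nofile resource limit. The nofile parameter in the /etc/security/limits.conf file defines the maximum number of files that a process can open. If this value is too low, any process that needs to open more files than the limit allows fails with a "Too many open files" error, which can also prevent you from logging on. If the hard nofile value is higher than the kernel parameter /proc/sys/fs/nr_open, the limit cannot be applied when a session starts, which can likewise prevent you from logging on.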
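To see which limits are currently in effect, you can compare the limits of the current shell with the kernel ceiling. The following is a minimal sketch that assumes a shell session on the Linux instance is still available. The variable names are illustrative only.

```shell
#!/bin/sh
# Soft and hard open-file limits of the current shell session.
soft_limit=$(ulimit -S -n)
hard_limit=$(ulimit -H -n)

# Kernel ceiling for any per-process nofile limit (Linux only).
nr_open=$(cat /proc/sys/fs/nr_open 2>/dev/null || echo "unknown")

# Number of file descriptors the current shell has open right now.
open_fds=$(ls "/proc/$$/fd" 2>/dev/null | wc -l)

echo "soft nofile: $soft_limit"
echo "hard nofile: $hard_limit"
echo "nr_open:     $nr_open"
echo "open fds:    $open_fds"
```

A soft limit that is close to the number of open descriptors of a busy process suggests the limit is too low. A hard limit above nr_open suggests the value is too high.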

Solution

If you can still log on to the instance, you can modify the configuration file directly. If you cannot log on, you must attach the system disk to another instance to repair it.

If you can log on to the instance

Log on to the ECS instance as the root user.
1. Go to the ECS console - Instances page. In the upper-left corner of the page, select the region and resource group of the target instance.
2. Go to the details page of the target instance, click Connect, and then select Workbench. Follow the on-screen prompts to log on as the root user and open a terminal.
Modify the configuration file.
Edit the /etc/security/limits.conf file. Change the hard nofile and soft nofile parameter values to the default of 65535, then save the file and exit.
```
* soft nofile 65535
* hard nofile 65535
root soft nofile 65535
root hard nofile 65535
```
- * applies to all standard users, and root applies to the root user.
- hard nofile: Sets the hard limit on the number of open files. This value cannot exceed the limit set by the kernel parameter /proc/sys/fs/nr_open. Exceeding this limit may prevent you from logging on to the instance.
- soft nofile: This is the default limit applied when a user starts a new session. This value must not exceed the hard nofile limit. If it does, the hard nofile value is used as the effective limit.
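Before you log out, it can help to check the edited file against the two constraints described above. The following is a rough sketch, not an official tool: check_nofile is a hypothetical helper that only looks at numeric nofile entries in a single file and ignores anything under /etc/security/limits.d/.

```shell
#!/bin/sh
# check_nofile FILE NR_OPEN  (hypothetical helper, not a standard tool)
# Prints one line per problem; returns 0 if none were found, 1 otherwise.
check_nofile() {
    awk -v max="$2" '
        $1 ~ /^#/ || NF < 4 || $3 != "nofile" { next }   # comments, other items
        $4 !~ /^[0-9]+$/                      { next }   # skip non-numeric values
        $2 == "hard" || $2 == "-" {
            hard[$1] = $4 + 0
            if (hard[$1] > max + 0) {
                printf "%s: hard nofile %d exceeds nr_open %d\n", $1, $4, max
                bad = 1
            }
        }
        $2 == "soft" || $2 == "-" { soft[$1] = $4 + 0 }
        END {
            for (d in soft)
                if ((d in hard) && soft[d] > hard[d]) {
                    printf "%s: soft nofile %d exceeds hard nofile %d\n", d, soft[d], hard[d]
                    bad = 1
                }
            exit bad
        }
    ' "$1"
}

# Example:
# check_nofile /etc/security/limits.conf "$(cat /proc/sys/fs/nr_open)"
```

If the helper prints nothing and returns 0, the entries it inspected satisfy both constraints.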
Apply the new configuration.
1. Log out and log back on to the ECS instance with the target user account to apply the changes.
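2. Run ulimit -n. An output of 65535 confirms that the nofile limit was updated. Do not prefix the command with sudo: ulimit is a shell built-in, not a standalone program, so sudo cannot run it.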
Restart the relevant applications and confirm that they function correctly.
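A process that was started before the change keeps the limit it was started with, which is why the restart matters. As a minimal sketch, assuming the Linux /proc file system is available, the limit of a running process can be read from its /proc/<pid>/limits file. The helper name nofile_from_limits is hypothetical.

```shell
#!/bin/sh
# nofile_from_limits FILE  (hypothetical helper)
# FILE is a /proc/<pid>/limits file; prints "<soft> <hard>" for open files.
nofile_from_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Example: limits of the current shell process.
# nofile_from_limits "/proc/$$/limits"
```

Replace `$$` with the process ID of the application to check whether it picked up the new value after the restart.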

If you cannot log on to the instance

Important

If a historical snapshot of the system disk exists, first create a new snapshot to back up the current data. Then, roll back the system disk by using the historical snapshot and verify that the instance is restored.

If no historical snapshot is available, you need a healthy Linux instance in the same zone as the faulty instance. You will attach the faulty system disk to the healthy instance as a data disk to modify the nofile parameter.

Detach the system disk.
Make sure the faulty instance is in the Stopped state, then follow these steps:
1. To prevent data loss from accidental operations, we recommend that you create a snapshot of the system disk to back up the current data.
2. Go to the ECS console - Instances page. In the upper-left corner of the page, select the region and resource group of the target instance.
3. Click the ID of the faulty instance to go to the Instance Details page, and then click the Block Storage tab.
4. In the System Disk section, find the Actions column and choose Detach.
5. In the Detach Cloud Disk dialog box, confirm the information and click OK. The disk is successfully detached when the instance status changes to No System Disk.

Attach the disk as a data disk to a healthy instance.

Make sure the healthy instance is in the Running state, then follow these steps:

Attach the faulty system disk to the healthy instance.
1. Click the ID of the healthy instance to go to its details page.
2. Click the Block Storage tab, and then click Attach Cloud Disk.
3. On the Attach to Instance page, select the detached system disk in the Disk section, and click Next.
4. On the Partition Disk and Create and Mount File Systems page, select Configure Later to complete the attachment.
Click Connect and select Workbench. Follow the on-screen prompts to log on as the root user and open a terminal.

Mount the file system.

Identify the partition name of the faulty disk.

```
lsblk -f
```

```
vda
├─vda1
├─vda2 vfat         7938-FA03                            /boot/efi
└─vda3 ext4   root  33b46ac5-7482-4aa5-8de0-60ab4c3a4c78 /
vdb
├─vdb1
├─vdb2 vfat         7938-FA03
└─vdb3 ext4   root  33b46ac5-7482-4aa5-8de0-60ab4c3a4c78
```

In this example, the faulty disk is vdb, and its root partition is vdb3. This is the partition you need to mount. The partitions are described as follows:

vdb1/vdb2: Contain system boot files and can be ignored.
vdb3: Contains the operating system files and data. This partition must be mounted.
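The same identification can be sketched as a filter over lsblk output. This is only an illustration: unmounted_fs_partitions is a hypothetical helper, and the assumption that the root partition uses ext4 or xfs may not hold for every image.

```shell
#!/bin/sh
# unmounted_fs_partitions  (hypothetical helper)
# Reads `lsblk -ln -o NAME,FSTYPE,MOUNTPOINT` output on stdin and prints
# partitions that carry an ext4 or xfs file system but have no mount point.
unmounted_fs_partitions() {
    awk 'NF == 2 && ($2 == "ext4" || $2 == "xfs") { print $1 }'
}

# Example:
# lsblk -ln -o NAME,FSTYPE,MOUNTPOINT | unmounted_fs_partitions
```

With the disk layout from the example above, the filter would print vdb3, because vda3 is already mounted at / and vdb2 is a vfat boot partition.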

Create a mount point and mount the partition.

```
mkdir <mount_directory> && sudo mount /dev/<partition_name> <mount_directory>
```

| Parameter | Description |
| --- | --- |
| `<partition_name>` | The root partition name of the faulty disk that you identified in the previous step. |
| `<mount_directory>` | A custom mount directory. It must be an empty path that starts with `/`. You can choose any name, but it must be unique. Important: If you mount the partition on a non-empty directory, the files already in that directory are hidden and cannot be read while the partition is mounted. Proceed with caution. |

For example, to mount the target partition vdb3 to a new directory named /test, run mkdir /test && sudo mount /dev/vdb3 /test.
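Because mounting on a non-empty directory hides its contents, a small guard before the mount can be useful. The following sketch assumes a POSIX shell; is_empty_dir is a hypothetical helper, not a system command.

```shell
#!/bin/sh
# is_empty_dir DIR  (hypothetical helper)
# Succeeds only if DIR is an existing directory with no entries.
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Example, using the /test directory and vdb3 partition from this article:
# mkdir -p /test && is_empty_dir /test && mount /dev/vdb3 /test
```

In the example, the mount command runs only if /test exists and contains nothing.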

Check that the file system is mounted.
Run the lsblk command. If the target partition has a mount directory listed (MOUNTPOINT), the file system was mounted successfully.

Modify the configuration file.
Open the <mount_directory>/etc/security/limits.conf file for editing. Change the hard nofile and soft nofile parameter values to the default of 65535, then save the file and exit.
```
* soft nofile 65535
* hard nofile 65535
root soft nofile 65535
root hard nofile 65535
```
- * applies to all standard users, and root applies to the root user.
- hard nofile: This is the hard limit on the number of open files. This value cannot exceed the limit set by the kernel parameter /proc/sys/fs/nr_open. If it does, you may be unable to log on to the instance after it is restarted.
- soft nofile: This is the current limit on the number of open files. This value must not exceed the hard nofile limit. If it does, the hard nofile value is used as the effective limit.
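If you prefer to script the change instead of using an editor, the edit can be sketched as follows. This is an illustration under assumptions: set_nofile is a hypothetical helper, it rewrites only the nofile entries for * and root, and /test is the example mount directory used earlier.

```shell
#!/bin/sh
# set_nofile FILE VALUE  (hypothetical helper)
# Removes existing nofile entries for "*" and "root" from FILE, then appends
# soft and hard entries with VALUE. Other lines are left untouched.
set_nofile() {
    file=$1
    value=$2
    tmp=$(mktemp) || return 1
    awk '!(($1 == "*" || $1 == "root") && $3 == "nofile")' "$file" > "$tmp" || return 1
    printf '%s\n' \
        "* soft nofile $value" \
        "* hard nofile $value" \
        "root soft nofile $value" \
        "root hard nofile $value" >> "$tmp"
    # Write back with cat so the original file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example, assuming the faulty disk is mounted at /test:
# cp -p /test/etc/security/limits.conf /test/etc/security/limits.conf.bak
# set_nofile /test/etc/security/limits.conf 65535
```

Keeping a backup copy next to the file, as in the commented example, makes it easy to compare the old and new entries before you unmount the disk.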
Restore the system disk to the original instance.
1. Unmount the file system.
  Replace <mount_directory> with the actual mount path.
```
umount <mount_directory>
```
  For our example, you would run umount /test.
2. Detach the repaired system disk.
  1. Return to the ECS console and go to the Block Storage tab on the healthy instance's details page.
  2. In the Actions column for the repaired system disk, click Detach.
  3. In the Detach Cloud Disk dialog box, click OK.
3. Attach the repaired system disk back to the original instance.
  1. Go to the Block Storage tab of the faulty instance's details page and click Attach Cloud Disk.
  2. On the Attach to Instance page, select the repaired system disk in the Disk section, configure the Logon Credentials, and click Next.
  3. On the Partition Disk and Create and Mount File Systems page, select Configure Later to complete the attachment.
4. Start the ECS instance.
Log on to the original ECS instance and verify that the issue is resolved.

Recommendations

Be cautious when modifying core system files: Before modifying any core system files, always create a snapshot to back up your data. Confirm that the changes are necessary and understand the potential risks. Do not modify system parameters that you are not familiar with.
Monitoring and alerting: To maintain the stability and security of your critical instances, set up a mechanism to monitor the ulimit -n configuration on all core instances. By regularly checking the runtime value of ulimit -n against its expected configuration, you can ensure that system resource limits meet your standards and receive timely alerts about any unauthorized changes.
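One way to sketch such a check is a small script that compares the runtime value with the expected one. The helper name check_expected_nofile and the expected value of 65535 are assumptions for illustration. How the result is scheduled and forwarded to an alerting system depends on your environment.

```shell
#!/bin/sh
# check_expected_nofile EXPECTED [ACTUAL]  (hypothetical helper)
# Compares the expected nofile value with the actual one. ACTUAL defaults
# to the soft limit of the shell that runs the check.
check_expected_nofile() {
    expected=$1
    actual=${2:-$(ulimit -n)}
    if [ "$actual" = "$expected" ]; then
        echo "OK: nofile is $actual"
    else
        echo "ALERT: nofile is $actual, expected $expected"
        return 1
    fi
}

# Example: write a syslog entry when the value drifts.
# check_expected_nofile 65535 || logger -t nofile-check "nofile limit drifted"
```

Note that the value seen by a scheduled job may differ from the value seen by an interactive session, so run the check in the same context as the workload you want to protect.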