This topic describes how to troubleshoot remote logon failures to a Linux instance.
Emergency logon to a Linux instance: If you have an emergency and need to log on to a Linux instance for O&M as soon as possible, you can use a VNC connection. For more information, see Connect to an instance using VNC.
Causes
SSH remote logon failures can be caused by factors such as the Pluggable Authentication Modules (PAM) security framework, security groups, and SSH configurations. Troubleshoot and resolve the connection failure based on your specific situation.
No specific error message is returned
Use the self-service troubleshooting tool
The Alibaba Cloud self-service troubleshooting tool helps you quickly check security group configurations, the internal firewall of the instance, and the listener status of common application ports. The tool provides a detailed diagnostic report.
Go to the self-service troubleshooting page and switch to the target region.
If the self-service troubleshooting tool cannot identify the issue, proceed with the following steps to manually troubleshoot the issue.
Manually troubleshoot the issue
If no error message is returned when the remote connection fails, follow these steps to manually troubleshoot the issue:
Step 1: Use Workbench to test the remote logon
You can use the Workbench tool provided by Alibaba Cloud to remotely log on. If a remote logon error occurs, Workbench returns a specific error message and a solution. The test steps are as follows:
Go to ECS console - Instances.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
Click the ID of the target instance. On the instance details page, click Connect.
In the Remote connection dialog box, find Workbench and click Sign in now.
Test the remote logon.
Workbench automatically fills in the basic information required to log on to the target instance. Confirm that the information is correct and enter your username and authentication information. Proceed as follows based on the result. For more information about how to use Workbench to remotely log on to a Linux instance, see Remotely log on to a Linux instance using Workbench.
If the logon still fails, Workbench returns an error message and a solution. Follow the prompts to resolve the issue. After you resolve the issue, use Workbench to test the remote logon again. For information about common errors that can occur when you use Workbench, see Issues with VNC connections to an instance.
If you can log on using Workbench, the SSH service on the target instance is running as expected. This rules out the possibility of an SSH server-side error. You can proceed to Step 2: Check the network.
Step 2: Check the network
If you cannot remotely connect to the Linux instance, first check whether the network is working correctly.
From a computer in a different network environment, such as a different network segment or a different carrier's network, run a test to determine whether the issue is caused by your on-premises network or the server.
If the issue is with your on-premises network or carrier, contact your local IT staff or the carrier to resolve it.
If the network interface card driver is not working correctly, reinstall it.
On your local client, use the ping command to test network connectivity to the instance.
If the network is abnormal, capture packets for analysis. For more information, see Use a packet capture tool to capture network packets.
If packet loss occurs or the ping fails, use a tool such as tracert or mtr to run a link test and identify the root cause. For more information, see Use MTR for network link analysis.
If the system kernel does not prohibit ping requests but the ping command fails to connect to the ECS instance, the internal firewall of the instance's operating system may have a policy that drops packets from the client.
For more information, see What do I do if I cannot ping the public IP address of an ECS instance?.
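The connectivity test above can be sketched as follows. This is a minimal illustration, not an official tool: packet_loss is a hypothetical helper name, 203.0.113.10 is a placeholder address, and the ping options shown are the Linux and macOS forms (Windows uses ping -n instead of ping -c).

```shell
# Hypothetical helper: extract the packet-loss percentage from the
# summary line that ping prints at the end of a run (read from stdin).
packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (run manually from your local client; replace the placeholder IP):
#   ping -c 4 203.0.113.10 | packet_loss
#   mtr --report --report-cycles 10 203.0.113.10   # per-hop loss, if mtr is installed
```

A non-zero value suggests that packets are being dropped somewhere on the path, which is the case where a link test with mtr or tracert is worth running.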
Step 3: Check the port and security group
Check whether the security group configuration allows connections on the remote connection port.
Go to ECS console - Instances.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
In the instance list, click the corresponding instance ID.
On the Security Groups tab, locate the security group and click Manage Rules in the Operation column.
On the Security Group Details page, in the Rules area, on the Inbound tab, click Add Rule and configure the rule with the following parameters.
Action: Allow
Priority: 1 (A smaller value indicates a higher priority. 1 is the highest priority.)
Protocol: Custom TCP
Source: Enter your IP address. You can find your IP address by visiting https://cip.cc/.
Destination (Current Instance): Select SSH (22).
Run the following command to test the port and check if it is working correctly.
telnet [$IP] [$Port]
Note: [$IP] is the IP address of the Linux instance. [$Port] is the SSH port number of the Linux instance. The default SSH port is 22.
For example, if you run the telnet 192.168.0.1 22 command, a normal response looks like this:
Trying 192.168.0.1 ...
Connected to 192.168.0.1.
Escape character is '^]'.
If the port test fails, see Troubleshoot port failures when an ECS instance can be pinged for troubleshooting.
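If telnet is not installed on your client, a similar port test can be approximated with bash alone. The sketch below assumes that bash (built with /dev/tcp support) and the coreutils timeout command are available. check_port is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: report whether a TCP connection to host:port
# succeeds within 3 seconds. Relies on bash's /dev/tcp pseudo-device.
check_port() {
    host="$1"
    port="${2:-22}"   # 22 is the default SSH port
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "port $port on $host: open"
    else
        echo "port $port on $host: closed, filtered, or timed out"
        return 1
    fi
}

# Usage (replace with the IP address and SSH port of your instance):
#   check_port 192.168.0.1 22
```

A failure here does not distinguish between a security group rule, an internal firewall, and a stopped SSH service. It only confirms that the port cannot be reached from the client.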
Step 4: Check the CPU load, bandwidth, and memory usage
Remote connection failures can be caused by high CPU load, insufficient bandwidth, or out-of-memory errors.
Check for high CPU load and take the appropriate action.
If the CPU load is high:
If your application has high disk access, network access, or computing requirements, a high CPU load is expected. You can upgrade the instance type to resolve the resource bottleneck. For more information, see Overview of instance type upgrades and downgrades.
Note: For solutions to high CPU load, see Query and analyze CPU load on Linux systems.
If the CPU load is not high, proceed to the next step.
Check for insufficient public bandwidth.
Remote connection failures can be caused by insufficient public bandwidth. To troubleshoot this issue, perform the following steps.
Go to ECS console - Instances.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
In the instance list, click the instance ID. In the Configuration Information section, view the Internet Bandwidth.
If the instance bandwidth is 0 Mbps, the instance has no public bandwidth. To resolve this issue, you can upgrade the bandwidth. For more information, see Change bandwidth configurations (network resources).
Check for insufficient memory.
After you remotely connect to a Linux instance, the desktop may not display correctly and the connection may close without an error message. This issue might be caused by insufficient instance memory. Check the instance's memory usage by performing the following steps.
Log on to the Linux instance using a VNC connection.
For more information, see Log on to a Linux instance using a password.
View the memory usage. If the memory is insufficient, you can upgrade the instance type to resolve the resource bottleneck. For more information, see Overview of instance type upgrades and downgrades.
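Once you are logged on through VNC, the CPU load and memory checks in this step can be sketched as follows. The helper names are hypothetical, and treating a 1-minute load average above the number of CPU cores as "high" is only a common rule of thumb, not a threshold defined by this topic. The memory helper assumes a kernel that reports MemAvailable in /proc/meminfo.

```shell
# Hypothetical helper: succeed if the 1-minute load average exceeds the
# number of CPU cores. Both values can be passed in explicitly.
load_is_high() {
    load="${1:-$(cut -d ' ' -f 1 /proc/loadavg)}"
    cores="${2:-$(nproc)}"
    awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'
}

# Hypothetical helper: print MemAvailable as a whole-number percentage
# of MemTotal, read from /proc/meminfo or from a file in the same format.
mem_available_pct() {
    awk '/^MemTotal:/ { t = $2 }
         /^MemAvailable:/ { a = $2 }
         END { if (t > 0) printf "%d\n", a * 100 / t }' "${1:-/proc/meminfo}"
}

# Usage on the instance:
#   load_is_high && echo "load is above the core count"
#   echo "available memory: $(mem_available_pct)%"
#   free -m    # human-readable memory overview
```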
A specific error message is returned
When a remote logon fails, the system usually returns an error message. You can use the error message to quickly identify the cause and find a solution.
PAM security framework
The Pluggable Authentication Modules (PAM) security framework in Linux can load security modules to control access to account policies and logon policies of the ECS instance. If the related configurations are incorrect or a policy is triggered, SSH logon may fail. Common cases include the following:
Linux instance system environment configuration
Issues in the Linux system environment, such as a virus, incorrect account configuration, or incorrect environment variable configuration, can cause SSH logon to fail. Common cases include the following:
"fatal: mm_request_send: write: Broken pipe" error due to an SSH service exception caused by a virus
"main process exited, code=exited" error when the SSH service starts
System exception after SSH logon to a Linux instance due to ulimit restrictions
Error when you use the SSH command to log on to a Linux ECS instance
SSH remote connection exception on a Linux instance because the SELinux service is enabled
SSH service and parameter settings
The default configuration file for the SSH service is /etc/ssh/sshd_config. Incorrect parameter settings in the configuration file, or certain enabled attributes or policies, can cause SSH logon to fail. Common cases include the following:
"Too many authentication failures for root" error when you use SSH to log on to an instance
"error while loading shared libraries" error when the SSH service starts
"fatal: Cannot bind any address" error when the SSH service starts on a Linux ECS instance
"Bad configuration options" error when the SSH service starts
SSH logon or data transmission slows down because UseDNS is enabled for SSH
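Before working through the individual cases, it can help to validate the configuration file and look up the directive in question. The sketch below is illustrative: sshd_option is a hypothetical helper that reads the file as written and ignores Match blocks and included files, so the sshd -T output shown in the usage comments is the more reliable view of the effective settings.

```shell
# Hypothetical helper: print the value of a directive as written in an
# sshd_config-style file. Keywords are case-insensitive and the first
# occurrence is reported; commented-out lines are skipped.
sshd_option() {
    awk -v key="$1" 'tolower($1) == tolower(key) { print $2; exit }' \
        "${2:-/etc/ssh/sshd_config}"
}

# Usage on the instance (sshd usually needs root privileges):
#   sudo sshd -t                      # syntax check; prints nothing if the file is valid
#   sudo sshd -T | grep -i usedns     # effective value after defaults are applied
#   sshd_option UseDNS                # value as written in /etc/ssh/sshd_config
```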
Configuration of directories or files associated with the SSH service
For security reasons, the SSH service checks the permission settings, owner, and group of related directories and files during runtime. Permissions that are too permissive or too restrictive can cause service errors, which in turn can cause client logon to fail. Common cases include the following:
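As a rough illustration of the check described above, the following sketch prints the mode, owner, and group of the paths you pass to it. show_perms is a hypothetical helper name and relies on GNU stat. The values mentioned in the comments (700 for ~/.ssh and 600 for authorized_keys) are a widely used OpenSSH convention and an assumption here, not values stated in this topic.

```shell
# Hypothetical helper: print "mode owner:group path" for each argument
# so the values can be compared with what the SSH service expects.
show_perms() {
    stat -c '%a %U:%G %n' "$@"
}

# Usage on the instance:
#   show_perms ~/.ssh ~/.ssh/authorized_keys /etc/ssh/sshd_config
# Commonly expected (assumption): 700 for ~/.ssh, 600 for authorized_keys,
# both owned by the user who logs on.
```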
SSH service key configuration
The SSH service uses asymmetric key encryption to encrypt transmitted data. The client and server exchange and verify the validity of the related key information. Common cases include the following:
"Host key verification failed" error when you use SSH to log on to an ECS instance