Common Causes and Troubleshooting Methods for Connection Reset

By Sunrainchy, from Alibaba Cloud Storage Team

1. An Introduction to RST

Like SYN and FIN, RST is a TCP control flag, set in the Flags field of the TCP header, that can change the TCP state or answer an unexpected packet. Unlike the others, RST exists to handle abnormal conditions and is normally generated by the protocol stack itself. Applications should use RST to forcibly terminate a connection only when there is no alternative.

2. The Role and Appearance Scenarios of the RST Message

2.1 Terminate a Connection in the ESTABLISHED State

Under normal circumstances, after a connection is closed, the party that initiated the close enters the TIME_WAIT state for TW_TIMEOUT (usually 2 MSL, commonly cited as 120s, although the exact value depends on the operating system), and the other party enters the CLOSED state. If RST is used instead, the connection skips every intermediate state and goes directly from its current state to CLOSED, without the state transitions of a normal close.

2.2 Terminate a Connection in the TIME_WAIT State

While a connection is in the TIME_WAIT state, the protocol stack still holds its state, and it may still hold data that the application has not read. If an RST arrives at this point, the connection moves immediately from TIME_WAIT to CLOSED. Any data left in the protocol stack is discarded, and the upper-layer application can see that the connection was reset.

2.3 Respond to the Unexpected Message

After a host receives a TCP message:

First, the host checks whether the packet is a SYN that opens a new connection. If it is, and the host is not listening on the destination port, the host returns an RST. In the capture below, the server answers the SYN with an RST, and curl reports Connection refused.

19:33:56.810709 IP a11348.cloud.xx21.58630 > a11348.cloud.xx21.ddi-tcp-1: Flags [S], seq 201987671, win 43690, options [mss 65495,sackOK,TS val 1621745206 ecr 0,nop,wscale 9], length 1
19:33:56.810722 IP a11348.cloud.xx21.ddi-tcp-1 > a11348.cloud.xx21.58630: Flags [R.], seq 0, ack 201987672, win 0, length 0

If it is not a SYN packet, the host looks up the matching local connection by the packet's quadruple (source IP, source port, destination IP, destination port). If a connection is found, processing continues. If not, an RST is returned.
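The first case can be reproduced on a single machine. The following is a minimal sketch, assuming a Linux-like stack on the loopback interface; the function name is hypothetical. It connects to a port that is bound but never put into the listening state, so the SYN is answered with an RST and the socket API reports it as "connection refused".

```python
import socket

def probe_closed_port():
    """Connect to a loopback port with no listener and report the outcome."""
    # Bind to port 0 so the OS picks a free port, but never call listen().
    # Keeping the socket bound stops the client from reusing the same port.
    placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    placeholder.bind(("127.0.0.1", 0))
    port = placeholder.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2)
    try:
        client.connect(("127.0.0.1", port))
        return "connected"
    except ConnectionRefusedError:
        # The SYN was answered with an RST; the stack maps it to ECONNREFUSED.
        return "refused"
    finally:
        client.close()
        placeholder.close()

print(probe_closed_port())
```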

Both situations involve typical unexpected packets, so why does the host answer with an RST instead of silently dropping the packet or returning some other type of message? This is rarely explained, but it can be understood by asking why such a packet arrives at all. Leaving forged packets aside, either it is a delayed packet from a previous connection that has been wandering in the network, or the opposite end still holds connection state that the local end has already discarded. In the second case, the opposite end's connection no longer serves any purpose once the local side is gone, so the host is obliged to send an RST to reset it. In addition, answering every such packet helps with network troubleshooting and application error handling: the failure is reported immediately instead of after a timeout.

2.4 Terminate Connection Forcibly in Special Business Scenarios

TCP is a full-duplex protocol. Under normal circumstances, each side can close only its own sending direction: it sends a FIN (EOF) to tell the other party that it has no more data to send. If the other party refuses to close and keeps sending data (ignoring even a full buffer and a zero window), a FIN cannot stop it. The only option left is to forcibly reset the connection. In practice this rarely happens, because application-layer protocols cooperate well, but attack scenarios are a different matter.
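The article does not say how an application forces a reset. One common way, assumed here, is to enable SO_LINGER with a zero timeout before calling close(). The sketch below does this on a loopback connection and reports what the peer observes; the function name is hypothetical.

```python
import socket
import struct
import time

def abortive_close_demo():
    """Close a loopback connection with an RST and report what the peer sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(2)

    # l_onoff=1, l_linger=0: close() drops unsent data and sends RST, not FIN.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()
    time.sleep(0.1)  # give the RST time to cross the loopback interface

    try:
        data = server_side.recv(1024)
        return "fin" if data == b"" else "data"
    except ConnectionResetError:
        return "reset"
    finally:
        server_side.close()
        listener.close()

print(abortive_close_demo())
```

Without the SO_LINGER call, the same close() would send a FIN and the peer's recv() would return an empty byte string instead of raising an error.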

3. Conditions for Correctly Resetting the Connection

A connection cannot be reset simply by sending an RST packet with the correct quadruple. If it could, any third party could maliciously reset connections with ease. In addition to the correct quadruple, the sequence number must also be correct; otherwise, the RST is silently discarded.

It follows that if a connection is reset for no apparent reason, the RST came either from the opposite end or from a device on the path the packets traverse (including the local host). Only a party that can see your packets can learn the current sequence number and construct an RST that will be accepted. This is one of the reasons why TCP sequence numbers start from a randomly generated value rather than from 1.
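The acceptance rule can be sketched in a few lines. The version below follows the rules in RFC 5961, which are somewhat stricter than the summary above: only an exact match with the next expected sequence number resets the connection, an in-window mismatch is answered with a challenge ACK, and anything outside the window is dropped silently. The function and return labels are illustrative, not taken from any real stack.

```python
MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap around

def classify_rst(seg_seq, rcv_nxt, rcv_wnd):
    """Decide what a receiver does with an RST, following the RFC 5961 rules."""
    offset = (seg_seq - rcv_nxt) % MOD  # distance ahead of RCV.NXT, wrap-safe
    if offset == 0:
        return "reset"          # exact match: the connection is reset
    if offset < rcv_wnd:
        return "challenge_ack"  # in window but not exact: reply with an ACK
    return "discard"            # outside the window: dropped silently

print(classify_rst(1000, 1000, 500))  # reset
print(classify_rst(1200, 1000, 500))  # challenge_ack
print(classify_rst(5000, 1000, 500))  # discard
```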

Note: Wireshark sometimes shows sequence numbers starting from 1 because relative sequence numbers are enabled. To simplify analysis, Wireshark subtracts the initial sequence number seen in the SYN packet; the actual initial sequence number is still random. This feature can be turned off.
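The conversion between raw and relative sequence numbers is a modular subtraction. A minimal sketch, using the raw values from the tcpdump capture in 2.3 (the function name is hypothetical):

```python
MOD = 1 << 32  # sequence numbers wrap at 2**32

def relative_seq(raw_seq, initial_seq):
    """Offset of raw_seq from the initial sequence number, wrap-safe."""
    return (raw_seq - initial_seq) % MOD

# The SYN in 2.3 carries seq 201987671; the RST acknowledges 201987672.
print(relative_seq(201987672, 201987671))  # 1
```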

4. How to Analyze

The sections above briefly introduced how RST arises and where it is used. With that background, the cause of an RST in most basic scenarios can be roughly inferred without capturing packets. For more complex RST scenarios, you still need tools for in-depth analysis.

4.1 Scenarios in Which the Intermediate Network Forges the RST Message

This section walks through a real case in which the intermediate network forged the RST.

4.1.1 Confirm the Source of RST Based on TTL

The first step in analyzing an RST is to identify its source: was it sent by the opposite end or by an intermediate device? The TTL is usually enough to tell. TTL stands for time to live. To keep packets from circulating on the Internet forever because of configuration errors, every router that forwards a packet decrements its TTL by 1. When the TTL reaches 0, the packet is discarded and an ICMP Time Exceeded message is returned to the sender. The initial TTL is usually a value such as 64 or 128, but other values are also configured. For example, the initial TTL of OSS response packets is 64, while the initial TTL of ALB outgoing packets is 102.

Compare the TTL of normal packets on the connection with the TTL of the RST packet. If the RST was sent by the opposite end, the two TTLs should be the same (or at least close).

The TTL of a normal SYN ACK message is 46.

The TTL of the RST packet is 85.

The TTL values are 46 and 85, respectively. Since OSS packets start at 64, an arriving TTL of 85 means the RST cannot have been sent by OSS, but it is not yet clear whether it was forged by the intermediate network or sent by ALB. The SYN ACK traveled 64 - 46 = 18 hops, and an RST from ALB would have traveled 102 - 85 = 17 hops. The two numbers are close, so it is still possible that ALB sent it.
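The hop arithmetic above can be written as a small helper that lists which initial TTLs are even possible for an observed value. This is only a sketch: the function name is hypothetical, and the default candidate list (64, 128, 255) is an assumption about common defaults, with 102 added for ALB as stated in the article.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)  # assumed common defaults

def candidate_hops(observed_ttl, initial_ttls=COMMON_INITIAL_TTLS):
    """Map each plausible initial TTL to the hop count it would imply."""
    # An initial TTL below the observed value is impossible and is skipped.
    return {init: init - observed_ttl for init in initial_ttls if init >= observed_ttl}

print(candidate_hops(46, (64,)))           # {64: 18}
print(candidate_hops(85, (64, 102, 128)))  # {102: 17, 128: 43}
```

The second call shows why OSS is ruled out: 64 is missing from the result because a packet that started at 64 cannot arrive with a TTL of 85.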

4.1.2 Confirm the TTL to the ALB Device with a Raw Ping

When the client pings the VIP directly, the reply arrives with a TTL of 87, so the client is 102 - 87 = 15 hops from ALB. Moreover, no packet with a TTL of 87 appears in the capture from the client, so the RST did not come from ALB either.

If the client cooperates, you can obtain the ping output. If not, it doesn't matter: the capture still contains other clues.

4.1.3 Determine the Target Message of RST

As mentioned above, a connection can be reset only if the RST carries the correct sequence number. The sequence number therefore reveals which packet the RST is aimed at, that is, which packet triggered the firewall rule.

An intermediate firewall typically decides to send an RST based on the SYN packet, the SNI in the TLS ClientHello, the HTTP Host header, or specific strings in the payload. Finding the packet that the RST targets may therefore reveal why the connection was reset.

There are three sequence numbers here:

The sequence number of the No.47 SYN ACK packet is 3788899386.

The sequence number of the No.48 ACK packet is 3788899387.

The sequence number of the No.49 RST packet is 3788899387, which equals the No.47 packet plus 1.

The RST's sequence number is consistent with both packets, so the sequence number alone cannot tell whether the RST was triggered by the No.47 SYN ACK packet or the No.48 ACK packet. However, a pure ACK carries nothing of interest beyond the acknowledgment itself, so it is reasonable to conclude that the RST targets the SYN flag in the No.47 packet.
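The matching step can be sketched as follows. A SYN or FIN consumes one sequence number, so the sequence number that follows a packet is its own sequence number plus its payload length plus one for each of those flags. The packet tuples reuse the numbers listed above exactly as the article gives them; the flag letters ("S", "A", "F") and function names are notation invented for this example.

```python
MOD = 1 << 32  # sequence numbers wrap at 2**32

def seq_after(seq, flags, payload_len=0):
    """Sequence number that follows a packet; SYN and FIN each consume one."""
    consumed = payload_len + ("S" in flags) + ("F" in flags)
    return (seq + consumed) % MOD

def rst_targets(rst_seq, packets):
    """Numbers of the packets whose following sequence number equals rst_seq."""
    return [no for no, flags, seq in packets if seq_after(seq, flags) == rst_seq]

packets = [(47, "SA", 3788899386), (48, "A", 3788899387)]
print(rst_targets(3788899387, packets))  # [47, 48]
```

Both packets match, which is the ambiguity described above: the sequence number narrows the candidates but does not pick one.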

4.1.4 Analyze the RST Position by Determining the RST Time

Find the RST packet and the packet it targets, and calculate the time difference between them. This difference reflects the distance between the device that sent the RST and the host where the capture was taken. Comparing it with the round-trip time (RT) of the link gives the approximate position of the RST sender.

First, look at the latency of the whole link. From the timing of the SYN ACK, or from the ping in 4.1.2, the RT of the link is about 26ms.

Next, look at the interval between the RST and the packet that triggered it. Whether the trigger was the No.47 SYN ACK packet or the No.48 ACK packet, the interval is 0.04ms. From this timing, the RST was sent either by a device close to the user-side host or by the client host itself.
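The comparison can be expressed as a ratio between the trigger-to-RST gap and the link RT. The sketch below is only a rough classifier: the 0.1 and 0.9 thresholds are arbitrary assumptions chosen for illustration, and the function name is hypothetical.

```python
def locate_rst_sender(gap_ms, link_rt_ms, near=0.1, far=0.9):
    """Place the RST sender relative to the capture host from timing alone."""
    ratio = gap_ms / link_rt_ms
    if ratio <= near:
        where = "near the capture host"
    elif ratio >= far:
        where = "near the opposite end"
    else:
        where = "somewhere mid-path"
    return where, ratio

# Values from this case: a 0.04ms gap against a 26ms link RT.
print(locate_rst_sender(0.04, 26)[0])  # near the capture host
```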

Let's continue the analysis and find the final RST position.

4.1.5 Determine the Location of the Device That Sends RST Based on TTL

If we know the initial TTL of the RST packet, we can compute how many hops away from the client the sending device is, and then use traceroute to roughly locate it. Even if the initial TTL is unknown, it is usually 64 or 128, which is enough to estimate the approximate position.

Unfortunately, in this case different RST packets arrive with different TTL values, probably because the sender randomizes the TTL within a certain range. This method is therefore hard to apply here. A more targeted analysis follows once the server-side capture is available.

4.1.6 IP Packet Identification

The Identification field identifies an IP packet (values can collide), and combined with the sending time it can roughly pin down a specific packet. The field should not be modified when a packet is forwarded or passes through NAT equipment. If the server sets this field to a special value in particular packets, the field can be used during an investigation to determine which device sent the RST. For example, when an ALB triggers an RST, this field is set to a special value.

In Wireshark, you can select the field, right-click, and choose Apply as Column to display it in its own column of the packet list. The same works for other fields.
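Outside Wireshark, both the TTL and the Identification field can be read directly from the fixed 20-byte IPv4 header. The following is a minimal sketch with a hypothetical function name; the sample header is built by hand with illustrative values and documentation addresses, not taken from the capture.

```python
import struct

def ipv4_ttl_and_id(header):
    """Extract the TTL and Identification fields from a raw IPv4 header."""
    if len(header) < 20 or header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    # Fields: version/IHL, TOS, total length, identification, flags/fragment,
    # TTL, protocol, checksum, source address, destination address.
    fields = struct.unpack("!BBHHHBBH4s4s", header[:20])
    ident, ttl = fields[3], fields[5]
    return ttl, ident

# Hand-built sample: TTL 85, Identification 1536, protocol 6 (TCP).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1536, 0, 85, 6, 0,
                     bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
print(ipv4_ttl_and_id(sample))  # (85, 1536)
```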

4.1.7 Another Interesting Phenomenon

There is more than one RST here: there are two. According to its TTL, the other RST was sent by the server, and it arrives exactly 26ms after its target packet. Put another way, the intervals between the two RSTs and their target packets add up to 26ms, which is one RT of the link.

This matches firewall behavior. When an intermediate firewall terminates a connection, it constructs two separate RST packets from the packets it captured and sends one to the client and one to the server, killing the connection on both sides at once. (This analysis is not complete. The two RSTs are constructed from the SYN and the SYN ACK. See 4.1.8 for details.) The first RST reaches the server before the third packet of the three-way handshake, the client's ACK, and resets the server-side connection. By the time the client's ACK arrives, the server no longer has any state for the connection. Since the ACK matches no TCP connection, the server answers with an RST of its own, which is the second RST packet.

So far, the analysis has relied only on the client's capture. To check whether it is correct, let's verify it against the server's capture.

4.1.8 Analysis of Server Packet Capture

The server-side capture serves mainly to verify the conclusions reached above:

1. The server received an RST. The IP identification is 1536, just as the client sees it.

2. In the server capture, the RST is triggered by the SYN packet, not by the SYN ACK packet. In other words, each RST is triggered by the SYN-flagged packet traveling in its own direction (the SYN in one direction, the SYN ACK in the other).

Note: An RST needs a correct sequence number, so a pure SYN only allows constructing an RST in the same direction, while a SYN ACK allows constructing RSTs in both directions from its seq and ack values.

3. The analysis of the extra RST in 4.1.7 is correct. The forged RST arrives before the third packet of the three-way handshake, which causes the server to return an RST.

TTL analysis in 4.1.5:

The forged RST arrives with a remaining TTL of 82 at OSS and 85 at the user host. The hop counts to the two sides look similar, but judging by the timing, the user side is physically closer. The path from the user host to OSS is 21 hops, so the initial TTL is (82 + 85 + 21) / 2 = 94, which puts the device 94 - 85 = 9 hops from the user host. This holds if sniffing and RST injection happen on the same device. However, most networks sniff traffic out of band and construct and send the RST from a different device, so the result above is reasonable but should be used only as supporting evidence.
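The formula follows from three relations: the initial TTL minus the hops to each side equals the TTL seen on that side, and the two hop counts add up to the full path length. Adding the first two and substituting the third gives twice the initial TTL. A minimal sketch, assuming a single on-path sender that uses the same initial TTL in both directions (the function name is hypothetical):

```python
def locate_injector(ttl_at_a, ttl_at_b, hops_a_to_b):
    """Return (initial_ttl, hops_to_a, hops_to_b) for a sender on the a-b path."""
    total = ttl_at_a + ttl_at_b + hops_a_to_b
    initial_ttl, remainder = divmod(total, 2)
    if remainder:
        raise ValueError("inputs are inconsistent with a single on-path sender")
    return initial_ttl, initial_ttl - ttl_at_a, initial_ttl - ttl_at_b

# a = OSS (TTL 82), b = user host (TTL 85), 21 hops between them.
print(locate_injector(82, 85, 21))  # (94, 12, 9)
```

The result places the device 12 hops from OSS and 9 hops from the user host, and the two hop counts sum to the 21-hop path as expected.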

5. Typical Connection Anomalies at Layer 4 and Layer 7

5.1 Physical Network Exceptions

In practice, firewall blocking shows up on the physical network as one of three phenomena:

1. Black Hole

The network is unreachable: every packet passing through a certain link is silently discarded.

2. Packet Loss

Access slows down. Under mainstream congestion control algorithms in particular, TCP performance becomes extremely poor. (In some cases a black hole is preferable, but not all paths can be controlled.)

3. Connection Reset

This is the phenomenon described in Chapter 4. It usually targets SYN packets, which makes it impossible to establish a connection. These are only the most common phenomena; others exist.

5.2 Business Exceptions

From the content and business perspective, the main types of blocking are:

1. Blocked by the IP Address

A specific IP address or network segment behaves abnormally: it is usually unreachable, or connections to it are reset.

2. Blocked by Content

A specific plaintext HTTP URL is abnormal. For example, requests for one resource are reset, while other resources under the same domain name remain accessible.

3. Blocked by Host

A specific plaintext HTTP Host is abnormal.

For example, if www.example.com is blocked in the current network environment, then:

curl www.example.com is reset, while curl www.example.com -H "Host: anotherhost" works normally.
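The same comparison can be scripted. The sketch below sends a minimal plaintext HTTP request with a caller-chosen Host header and classifies the outcome. The function name and return labels are hypothetical, and it only distinguishes the coarse outcomes discussed here.

```python
import socket

def probe_http_host(address, port, host_header, timeout=5.0):
    """Send a minimal HTTP/1.1 request with the given Host and classify the result."""
    request = (
        f"GET / HTTP/1.1\r\nHost: {host_header}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((address, port), timeout=timeout) as sock:
            sock.sendall(request)
            first = sock.recv(4096)
    except ConnectionRefusedError:
        return "refused"
    except ConnectionResetError:
        return "reset"
    except socket.timeout:
        return "timeout"
    # An empty read means the peer closed cleanly without answering.
    return "response" if first else "closed"
```

Calling it twice against the same address, once with the suspect name and once with an unrelated one (for example probe_http_host("www.example.com", 80, "www.example.com") and then with "anotherhost"), reproduces the curl comparison: a "reset" for the first and a "response" for the second points to Host-based blocking.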

4. Blocked by HTTPS SNI (Server Name Indication)

The HTTPS handshake fails when the ClientHello carries a specific SNI.

For example, if the SNI of www.example.com is blocked under the current network environment, then:

openssl s_client -servername www.example.com -connect www.example.com:443 is reset, while openssl s_client -servername anothersni -connect www.example.com:443 works normally.
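A scripted version of this check, as a sketch using Python's standard ssl module: it connects to a fixed address, offers a caller-chosen SNI, and classifies how the handshake ends. Certificate verification is deliberately disabled because only reachability is being probed, which makes this unsuitable for anything but diagnostics. The function name and return labels are hypothetical.

```python
import socket
import ssl

def probe_sni(address, port, sni, timeout=5.0):
    """Start a TLS handshake with the given SNI and classify how it ends."""
    context = ssl.create_default_context()
    # Only reachability is probed, so certificate checks are turned off.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((address, port), timeout=timeout) as raw:
            # wrap_socket performs the handshake; server_hostname sets the SNI.
            with context.wrap_socket(raw, server_hostname=sni):
                return "handshake_ok"
    except ConnectionRefusedError:
        return "refused"
    except ConnectionResetError:
        return "reset"
    except socket.timeout:
        return "timeout"
    except ssl.SSLError as exc:
        return f"tls_error: {exc.reason}"
```

If probe_sni(address, 443, "www.example.com") returns "reset" while the same address with a different SNI completes or at least gets a TLS-level answer, the reset is tied to the SNI value rather than to the IP address.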

5. Preemptive Response/Data Tampering

This works only against plaintext HTTP: the intermediary answers with incorrect data before the real server does, or inserts unexpected data (such as pop-up advertisement code).

6. Cross-Carrier Access

6. Summary

As a basic Internet object storage service, OSS often encounters all kinds of network problems. The network is a black box to most developers, but the protocols are open: with a grasp of the protocol details, you can reason your way to the real root cause. When you encounter a connection reset, you can troubleshoot it yourself by following the steps above. If the cause is still not found, you can contact OSS for help with troubleshooting.
