Troubleshooting High Packet Loss and Latency in Website Access on Linux Instances

Last Updated: Apr 21, 2022

Overview

If website access is slow or fails, first rule out obvious problems. If the ping command then shows significant packet loss, we recommend that you perform a link test. On Linux, we recommend that you use the mtr or traceroute command-line tool to perform the link test and locate the source of the problem. A link test typically involves the following steps.

  1. Use link testing tools to detect network conditions and server status.
  2. Analyze the link test results and act on them.

Description

Take note of the following items:

  • Before you perform high-risk operations such as modifying the specifications or data of an Alibaba Cloud instance, we recommend that you check the disaster recovery and fault tolerance capabilities of the instance to ensure data security.
  • Before you modify the specifications or data of an Alibaba Cloud instance, such as an Elastic Compute Service (ECS) instance or an ApsaraDB RDS instance, we recommend that you create snapshots or enable backups for the instance. For example, you can enable log backups for an ApsaraDB RDS instance.
  • If you have granted specific users the permissions on sensitive information, such as usernames and passwords, or submitted sensitive information in the Alibaba Cloud Management Console, we recommend that you modify the sensitive information at the earliest opportunity.

The following sections describe how to use the mtr and traceroute command-line tools and how to analyze the link test results.

mtr command-line tool

mtr (My traceroute) is a network testing tool that is available on almost all Linux distributions and is often preinstalled. It combines the functionality of the traceroute and ping commands in a single tool. ping and traceroute are commonly used to check network conditions and server status, as described below.

Command      Description
ping         Sends packets to the specified server. If the server responds, it returns the packets, and ping reports the round-trip time.
traceroute   Returns all nodes (routes) that packets pass through between the user's computer and the specified server, and the response time of each node.

By default, mtr sends ICMP packets for link probing. Use the -u option to probe with UDP packets instead. Compared with traceroute, which runs a trace only once, mtr probes the nodes on the link continuously and provides statistics for each node. This reduces the impact of transient node fluctuations on the test results, so the results are more reliable. We recommend that you use mtr first.
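
To illustrate why repeated probing is more robust than a single trace, the following Python sketch summarizes a series of probe latencies for one node into the kind of statistics that mtr displays. It is a minimal illustration, not mtr's implementation: the function name summarize_probes is hypothetical, a lost probe is represented as None, and the standard deviation uses the population formula, which may differ from the formula mtr uses.

```python
import math
from typing import Dict, Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> Dict[str, float]:
    """Summarize repeated probes to one node; None marks a lost probe."""
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    summary: Dict[str, float] = {"sent": sent, "loss_pct": loss_pct}
    if not replies:
        return summary  # every probe was lost, so there are no latency figures
    avg = sum(replies) / len(replies)
    # Population standard deviation (assumption; mtr may compute it differently).
    variance = sum((r - avg) ** 2 for r in replies) / len(replies)
    summary.update(
        last=replies[-1],      # latency of the most recent answered probe
        avg=avg,
        best=min(replies),
        worst=max(replies),
        stdev=math.sqrt(variance),
    )
    return summary
```

A single outlier changes the worst value and the standard deviation, but has a limited effect on the average once enough probes have been collected.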

Usage notes

mtr [-hvrctglspni46] [--help] [--version] [--report]
[--report-cycles=COUNT] [--curses] [--gtk]
[--raw] [--split] [--no-dns] [--address interface]
[--psize=bytes/-s bytes]
[--interval=SECONDS] HOSTNAME [PACKETSIZE]

Sample output

[root@centos ~]# mtr 223.5.5.5
My traceroute [v0.75]
mycentos6.6 (0.0.0.0) Wed Jun 15 23:16:27 2016
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. ???
2. 192.X.X.20 0.0% 7 13.1 5.6 2.1 14.7 5.7
3. 111.X.X.41 0.0% 7 3.0 99.2 2.7 632.1 235.4
4. 111.X.X.197 0.0% 7 1.8 2.0 1.2 2.9 0.6
5. 211.X.X.25 0.0% 6 0.9 4.7 0.9 13.9 5.8
6. 211.X.X.70 0.0% 6 1.8 22.8 1.8 50.8 23.6
211.X.X.134
211.X.X.2
211.X.X.66
7. 42.X.X.186 0.0% 6 1.4 1.6 1.3 1.8 0.2
42.X.X.198
8. 42.X.X.246 0.0% 6 2.8 2.9 2.6 3.2 0.2
42.X.X.242
9. ???
10. 223.5.5.5 0.0% 6 2.7 2.7 2.5 3.2 0.3

Description of common optional parameters

  • -r or --report: displays the output in report mode.
  • -p or --split: lists the result of each trace separately, instead of the aggregated statistics that --report produces.
  • -s or --psize: specifies the size of the probe packets.
  • -n or --no-dns: does not perform reverse DNS resolution for IP addresses.
  • -a or --address: specifies the source IP address for outgoing packets. Used when the host has multiple IP addresses.
  • -4: uses only IPv4.
  • -6: uses only IPv6.
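
As an illustration of how these options combine, the following Python sketch assembles an mtr argument list. The helper name build_mtr_command and its parameters are hypothetical. The flags themselves are the ones listed in the synopsis and option descriptions above.

```python
from typing import List, Optional


def build_mtr_command(
    host: str,
    report: bool = True,
    cycles: Optional[int] = None,
    no_dns: bool = False,
    udp: bool = False,
    packet_size: Optional[int] = None,
    source_address: Optional[str] = None,
    ip_version: Optional[int] = None,
) -> List[str]:
    """Build an argument list for mtr from the options described above."""
    if ip_version not in (None, 4, 6):
        raise ValueError("ip_version must be 4, 6, or None")
    cmd = ["mtr"]
    if report:
        cmd.append("--report")                    # same as -r
    if cycles is not None:
        cmd.append(f"--report-cycles={cycles}")   # same as -c COUNT
    if no_dns:
        cmd.append("--no-dns")                    # same as -n
    if udp:
        cmd.append("-u")                          # probe with UDP instead of ICMP
    if packet_size is not None:
        cmd.append(f"--psize={packet_size}")      # same as -s bytes
    if source_address is not None:
        cmd += ["--address", source_address]      # same as -a
    if ip_version is not None:
        cmd.append(f"-{ip_version}")              # -4 or -6
    cmd.append(host)
    return cmd
```

The resulting list can be passed to subprocess.run on a host where mtr is installed, for example subprocess.run(build_mtr_command("223.5.5.5", cycles=20, no_dns=True), capture_output=True, text=True).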

While mtr is running, you can press the following keys to switch modes.

  • ? or h: displays the help menu.
  • d: switches the display mode.
  • n: enables or disables DNS domain name resolution.
  • u: switches between ICMP and UDP packets for probing.

Response description

By default, the data columns in the response are described as follows.

  • The first column (Host): the IP address or domain name of the node. Press the n key to switch between them.
  • The second column (Loss%): the packet loss rate of the node.
  • The third column (Snt): the number of packets sent to the node. In report mode, 10 packets are sent by default. Use the -c option to change the count.
  • The fourth column (Last): the latency of the most recent probe.
  • The fifth, sixth, and seventh columns (Avg, Best, and Wrst): the average, minimum, and maximum probe latency.
  • The eighth column (StDev): the standard deviation of the probe latency. The larger the value, the less stable the node.
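
The column layout above can be turned into named fields with a short parser. The following Python sketch assumes rows formatted like the sample output in this article (hop number, host, then the seven statistics columns). The names HopRow and parse_mtr_row are hypothetical, and other mtr versions or modes may format rows differently.

```python
import re
from dataclasses import dataclass
from typing import Optional

_NUM = r"\d+(?:\.\d+)?"
_ROW = re.compile(
    rf"^\s*(?P<hop>\d+)\.\s+(?P<host>\S+)"
    rf"(?:\s+(?P<loss>{_NUM})%\s+(?P<snt>\d+)"
    rf"\s+(?P<last>{_NUM})\s+(?P<avg>{_NUM})\s+(?P<best>{_NUM})"
    rf"\s+(?P<wrst>{_NUM})\s+(?P<stdev>{_NUM}))?\s*$"
)


@dataclass
class HopRow:
    hop: int
    host: str
    loss_pct: Optional[float] = None
    snt: Optional[int] = None
    last_ms: Optional[float] = None
    avg_ms: Optional[float] = None
    best_ms: Optional[float] = None
    wrst_ms: Optional[float] = None
    stdev_ms: Optional[float] = None


def parse_mtr_row(line: str) -> Optional[HopRow]:
    """Parse one hop row; return None for headers and continuation lines."""
    m = _ROW.match(line)
    if m is None:
        return None
    row = HopRow(hop=int(m["hop"]), host=m["host"])
    if m["loss"] is not None:  # rows such as "1. ???" carry no statistics
        row.loss_pct = float(m["loss"])
        row.snt = int(m["snt"])
        row.last_ms = float(m["last"])
        row.avg_ms = float(m["avg"])
        row.best_ms = float(m["best"])
        row.wrst_ms = float(m["wrst"])
        row.stdev_ms = float(m["stdev"])
    return row
```

Lines that only list an additional address for a hop, such as 211.X.X.134 in the sample output, do not start with a hop number and are skipped.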

traceroute command-line tool

traceroute is a network testing tool that is available on almost all Linux distributions and is often preinstalled. It traces the path that Internet Protocol (IP) packets take to the destination address. traceroute sends UDP probe packets with a small time to live (TTL) value and listens for ICMP TIME_EXCEEDED responses from the gateways along the link. The TTL starts at 1 and increases by 1 with each round of probes until traceroute receives an ICMP PORT_UNREACHABLE message, which indicates that the destination host has been reached, or until the maximum TTL (Max_TTL) is reached. By default, traceroute sends UDP packets for link detection. Use the -I option to send ICMP packets instead.
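
The TTL mechanism can be sketched as a small simulation. The following Python code does not send any packets. It models a path as a list of router addresses that ends with the destination, and shows which response each TTL value would trigger. The function name simulate_traceroute, the NO_REPLY label, and the use of None for a router that does not answer are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple


def simulate_traceroute(
    path: List[Optional[str]], max_ttl: int = 30
) -> List[Tuple[int, str, str]]:
    """Simulate TTL-based path discovery over a known path.

    path lists the routers in order and ends with the destination address.
    A router given as None is treated as one that sends no ICMP response.
    """
    if not path or path[-1] is None:
        raise ValueError("path must end with the destination address")
    hops: List[Tuple[int, str, str]] = []
    for ttl in range(1, max_ttl + 1):
        if ttl >= len(path):
            # The probe survives every router and reaches the destination,
            # which answers the UDP probe with ICMP PORT_UNREACHABLE.
            hops.append((ttl, path[-1], "PORT_UNREACHABLE"))
            break
        router = path[ttl - 1]  # the TTL expires at this router
        if router is None:
            hops.append((ttl, "*", "NO_REPLY"))
        else:
            hops.append((ttl, router, "TIME_EXCEEDED"))
    return hops
```

If max_ttl is smaller than the path length, the loop ends without a PORT_UNREACHABLE entry, which corresponds to the case where the maximum TTL is reached before the destination.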

Usage notes

traceroute [ -I ] [ -m Max_ttl ] [ -n ] [ -p Port ] [ -q Nqueries ] [ -r ] [ -s SRC_Addr ] [ -t TypeOfService ] [ -f First_ttl ] [ -v ] [ -w WaitTime ] Host [ PacketSize ]

Sample output

[root@centos ~]# traceroute -I 223.5.5.5
traceroute to 223.5.5.5 (223.5.5.5), 30 hops max, 60 byte packets
1 * * *
2 192.X.X.20 (192.X.X.20) 3.965 ms 4.252 ms 4.531 ms
3 111.X.X.41 (111.X.X.41) 6.109 ms 6.574 ms 6.996 ms
4 111.X.X.197 (111.X.X.197) 2.407 ms 2.451 ms 2.533 ms
5 211.X.X.25 (211.X.X.25) 1.321 ms 1.285 ms 1.304 ms
6 211.X.X.70 (211.X.X.70) 2.417 ms 211.X.X.66 (211.X.X.66) 1.857 ms 211.X.X.70 (211.X.X.70) 2.002 ms
7 42.X.X.194 (42.X.X.194) 2.570 ms 2.536 ms 42.X.X.186 (42.X.X.186) 1.585 ms
8 42.X.X.246 (42.X.X.246) 2.706 ms 2.666 ms 2.437 ms
9 * * *
10 public1.alidns.com (223.5.5.5) 2.817 ms 2.676 ms 2.401 ms

Description of common optional parameters

  • -d: enables socket-level debugging.
  • -f: sets the TTL value for the first probe packet.
  • -F: disables fragmentation of probe packets.
  • -g: specifies the source routing gateways. A maximum of eight routing gateways can be specified.
  • -i: sends packets through the specified network interface. Used when the host has multiple network interfaces.
  • -I: uses ICMP packets instead of UDP packets for probing.
  • -m: specifies the maximum TTL of the probe packets.
  • -n: displays IP addresses instead of host names, which skips reverse DNS lookup.
  • -p: sets the UDP communication port.
  • -r: bypasses the normal routing table and sends packets directly to the remote host.
  • -s: specifies the source IP address of the outgoing packets.
  • -t: sets the TOS value for the probe packets.
  • -v: displays verbose output of the command execution.
  • -w: sets the time to wait for a response from the remote host.
  • -x: enables or disables packet verification.

Analyze link test results

The following analysis refers to an example diagram of link test results.

  1. Determine whether an exception exists in each area, and handle it according to the area in which it occurs.
    • Area A: the client's local network, that is, the local LAN and the network of the local network provider. For exceptions in this area, troubleshoot the client's local network. For problems on nodes in the local network provider's network, report them to the local provider.

    • Area B: the carrier network. For exceptions in this area, look up the owner of the abnormal node by its IP address, and then report the problem to that carrier directly or through Alibaba Cloud after-sales technical support.

    • Area C: the local network of the target server, that is, the network of the target host's network provider. For exceptions in this area, report the problem to the network provider of the target host.

  2. Combine Avg (average) and StDev (standard deviation) to determine whether each node has an exception.

    • If StDev is high, also check the Best and Wrst values of the node to determine whether it has an exception.
    • If StDev is not high, use Avg to determine whether the node has an exception.

      Note: Whether StDev is high is not judged against a fixed time range. Judge it relative to the latency values in the other columns of the same node. For example, if Avg is 30 ms and StDev is 25 ms, StDev is considered high. If Avg is 325 ms and StDev is still 25 ms, the deviation is considered low.

  3. Check the packet loss rate of each node. If Loss% is not zero, the network at that hop may have a problem. Node packet loss usually has one of two causes.

    • The rate at which the node sends ICMP responses is deliberately limited, which results in packet loss.
    • The node has an exception.
  4. Determine the cause of packet loss of the current abnormal node.

    • If no packet loss occurs on the subsequent nodes, the packet loss on the current node is caused by the carrier's rate-limiting policy and can be ignored. An example is the 2nd hop in the example diagram of the link test results.

    • If the subsequent nodes also lose packets, the current node has a network exception that causes the packet loss. An example is the 5th hop in the example diagram of the link test results.

      Note: Both situations may occur at the same time, that is, a node may have both rate limiting and a network exception. In this case, if the current node and its subsequent nodes lose packets continuously but at different rates, the packet loss rate of the last few hops prevails. In the example diagram, packet loss occurs at the 5th, 6th, and 7th hops, so the final packet loss rate is taken from the 7th hop, which is 40%.

  5. Check for a significant increase in latency to determine whether a node has an exception. Analyze the latency from the following two aspects.

    • If the latency increases significantly at a certain hop, the node at that hop usually has a network exception. In the example diagram of the link test results, the latency of the nodes after the 5th hop increases significantly, so it is inferred that the node at the 5th hop has a network exception.

      Note: High latency does not necessarily mean that the node has an exception. It may also be caused by the return path of the data packets. We recommend that you analyze it together with a reverse link test.

    • ICMP rate limiting may also cause a sharp increase in latency on a node, but in that case the latencies of the subsequent nodes return to normal. In the example diagram, the 3rd hop has a 100% packet loss rate and a significant increase in latency, but the latencies of the subsequent nodes immediately return to normal. The sharp increase in latency and the packet loss on that node are therefore attributed to rate limiting.
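
The rules in steps 2 to 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a definitive diagnostic: the names Hop, stdev_is_high, and classify_loss are hypothetical, the 0.5 ratio used to call StDev high is an assumed threshold chosen to match the 30 ms and 325 ms examples above, and the loss rule looks only at the next hop that reported statistics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hop:
    index: int
    loss_pct: float
    avg_ms: float
    stdev_ms: float


def stdev_is_high(avg_ms: float, stdev_ms: float, ratio: float = 0.5) -> bool:
    """Judge StDev relative to Avg of the same node (ratio is an assumption)."""
    if avg_ms <= 0:
        return stdev_ms > 0
    return stdev_ms / avg_ms >= ratio


def classify_loss(hops: List[Hop]) -> Dict[int, str]:
    """Label each hop as 'ok', 'rate_limited', or 'network_loss'."""
    verdicts: Dict[int, str] = {}
    for i, hop in enumerate(hops):
        if hop.loss_pct == 0:
            verdicts[hop.index] = "ok"
        elif i + 1 < len(hops) and hops[i + 1].loss_pct == 0:
            # Loss that disappears at the next hop points to ICMP rate limiting.
            verdicts[hop.index] = "rate_limited"
        else:
            # Loss that continues downstream, or loss at the final hop,
            # is treated as real packet loss.
            verdicts[hop.index] = "network_loss"
    return verdicts
```

For a trace in which the 2nd hop loses packets but the 3rd does not, and the 5th, 6th, and 7th hops all lose packets, classify_loss labels the 2nd hop as rate_limited and the 5th to 7th hops as network_loss. The loss rate to report is then the one at the last hop.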

Recommended operations

  • If 100% packet loss occurs at the destination address, we recommend that you check the security policy configuration of the destination server.
  • If packets loop between nodes and cannot reach the target server, we recommend that you contact the operator of the corresponding node.
  • If no response is received after a certain hop, we recommend that you confirm the problem with a reverse link test and contact the operator of the corresponding node.
  • Leased lines are available for network communication between Alibaba Cloud data centers in the Chinese mainland and data centers in other countries or regions. To reduce the packet loss rate during such communication, we recommend that you use Express Connect.
  • If packet loss and latency to the host are very high, we recommend that you run mtr tests in both directions, that is, from the local machine to the server and from the server to the local machine. If you cannot remotely log on to the ECS instance, log on to it through the Management Terminal.

Applicable scope

  • Elastic Compute Service (ECS)