How the Alibaba Cloud OS Console Tackles Network Packet Loss in One Move

By Yongde Zhang

Background

As cloud computing grows and businesses move to the cloud at scale, high-quality network communication is essential to keeping business operations efficient. Packet loss is a core challenge in modern network architectures and carries business risks on several fronts. During deployment or operation, even minor packet loss can cause communication interruptions, abnormal data transmission, and deviations in business logic execution. In severe cases, it leads to failures such as failed health checks, unresponsive pings, and denial of service in the operation and maintenance (O&M) system.

Recently, a client encountered severe network packet loss while deploying a business cluster in a new region. The client's production deployment stalled while costs kept accruing. Does a problem this critical really take a lot of time and effort to troubleshoot? No! With the Alibaba Cloud operating system (OS) console (hereinafter referred to as the "OS console"), customers can quickly locate the problem, complete the deployment, reach stable operation, and stop the ongoing cost consumption.

Next, we draw on two practical cases to show how to use the OS console's packet loss diagnosis to resolve packet loss issues.

Locate the root cause of the problems through packet loss diagnosis and analysis

Scenario 1: Quick Problem Delimitation

A message service client encountered a systematic health check exception when deploying a cluster on Alibaba Cloud Container Service for Kubernetes (ACK) in a new region. The deployment was completely blocked, so we needed to find out why the health checks were failing.

In our O&M experience, once the iptables rules have been checked and confirmed to be working properly, suspicion usually turns to kernel packet loss. Troubleshooting kernel packet loss is cumbersome and difficult. The troubleshooter must know how the kernel code processes packets and must trace the packet flow by attaching hook points to the function entry points that the data passes through. The process demands both considerable expertise and considerable time.

So how did we quickly locate the problem in this case with the help of the OS console?

First, we examined the machines in this region. As shown in the following figure, an ECS instance hosted the ACK pods, with an SLB instance in front.

We ran tcpdump directly on the eth0 network interface of the ECS instance. In the packet capture below, the source addresses belonged to the health check network segment of the SLB, and the SLB kept sending SYN packets to this machine to request a connection. However, the machine never replied with a SYN-ACK packet, so the health check failed. Why didn't the machine reply?
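The pattern described here, a SYN that never receives a SYN-ACK reply, can also be spotted mechanically in a saved capture. The following is a minimal sketch, assuming the capture was written as text in tcpdump's default one-line format with numeric addresses (for example with `tcpdump -nn`). The function name `unanswered_syns` is hypothetical and is not part of the OS console.

```python
import re

# Matches tcpdump's default one-line text format for IPv4 TCP packets, e.g.
# "10:00:00.000001 IP 100.64.0.10.40000 > 10.0.0.5.8080: Flags [S], seq 1, ..."
LINE_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]+)\]"
)


def unanswered_syns(lines):
    """Return sorted (client, server) endpoint pairs whose SYN got no SYN-ACK."""
    pending = set()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an IPv4 TCP line in the expected format
        src = (m["src"], int(m["sport"]))
        dst = (m["dst"], int(m["dport"]))
        flags = m["flags"]
        if flags == "S":
            # Bare SYN: the client asks to open a connection.
            pending.add((src, dst))
        elif flags == "S.":
            # SYN-ACK: the server answers, so the direction is reversed.
            pending.discard((dst, src))
    return sorted(pending)
```

Any pair that remains in the result corresponds to a connection request that the local machine never answered, which is the symptom seen in this capture.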

Caused by iptables rules?

Following the usual troubleshooting approach, we first suspected that an iptables rule was blocking some of the requests. However, a comparison of the normal and abnormal machines showed that their iptables rules were the same, so we could rule out the iptables rules as the cause.
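A comparison like this can be scripted when rule dumps from both machines are available. The sketch below assumes each machine's rules were saved as `iptables-save` text; it strips comment lines and packet/byte counters, which differ between hosts even when the rules are identical. The helper names are hypothetical, and the comparison ignores rule order, so a reordering of otherwise identical rules would not be reported.

```python
import re

# iptables-save prints counters as [packets:bytes]; they vary between hosts.
COUNTER_RE = re.compile(r"\[\d+:\d+\]")


def normalize_rules(dump):
    """Drop comments, blank lines, and counters from an iptables-save dump."""
    rules = []
    for line in dump.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# Generated by ..." headers carry only timestamps
        rules.append(COUNTER_RE.sub("", line).strip())
    return rules


def diff_rules(normal_dump, abnormal_dump):
    """Return (only_on_normal, only_on_abnormal) lists of rule lines."""
    normal = normalize_rules(normal_dump)
    abnormal = normalize_rules(abnormal_dump)
    only_normal = [r for r in normal if r not in abnormal]
    only_abnormal = [r for r in abnormal if r not in normal]
    return only_normal, only_abnormal
```

Two empty lists mean that the rule sets match, which is the outcome described in this case.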

Kernel packet loss?

In the past, troubleshooting kernel packet loss could require expertise in the kernel's network modules for tracing and analysis. Now, with only a few clicks in the Alibaba Cloud OS console, this "expert" can accomplish in a short time what previously required specialized personnel.

Use the OS console to diagnose the problem instance:

As shown in the figure, we selected Network Diagnosis and Packet Loss Diagnosis under system diagnosis, selected the Instance ID to diagnose (step 4 in the figure), and clicked Execute Diagnosis. After the diagnosis was completed, we clicked View Report to view the packet loss on the machine.

The result is shown in the figure above. The OS console report showed no known packet loss exception, so we could largely rule out kernel packet loss.

Neither iptables rules nor kernel packet loss: what else could it be?

The OS console diagnostic report largely confirmed that the kernel was not dropping packets, which eliminated the underlying protocol stack as a risk. In addition, eth0 had received the SYN packets, so nothing was being lost on the network link, and the iptables rules showed no anomalies, so the rule configuration was not the cause. At this point we realized that one dimension had been overlooked: a network driver or middleware module might be misbehaving. Based on this assumption, we printed out the hooks in the system for inspection:

As shown in the preceding figure, the abnormal machine had many more hooks of the sched_cls type than the normal machines. The ACK R&D team confirmed that these hooks belonged to a network component. We strongly suspected that the hooks added by this component were filtering out the SYN packets. We uninstalled the component, and the health checks immediately returned to normal.
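The article does not say which command printed the hooks. As one possibility, sched_cls is a BPF program type, and `bpftool -j prog show` lists loaded BPF programs as a JSON array of objects with a `type` field. Under that assumption, the following sketch compares the per-type counts of a normal and an abnormal machine; the function names are hypothetical.

```python
import json
from collections import Counter


def count_prog_types(bpftool_json):
    """Count loaded BPF programs per type from bpftool-style JSON output."""
    programs = json.loads(bpftool_json)
    return Counter(prog.get("type", "unknown") for prog in programs)


def extra_prog_types(normal_json, abnormal_json):
    """Return {type: surplus} for types more numerous on the abnormal host."""
    normal = count_prog_types(normal_json)
    abnormal = count_prog_types(abnormal_json)
    # Counter returns 0 for types that are absent on the normal host.
    return {t: n - normal[t] for t, n in abnormal.items() if n > normal[t]}
```

A result such as a surplus of sched_cls programs points to a component that attaches its own packet-processing hooks, which is where the investigation in this case led.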

With the help of the OS console, we quickly narrowed down the problem: ruling out kernel packet loss early let us focus on other directions and saved time in reaching a solution.

Scenario 2: Pinpoint the Issue

After creating a new instance, a customer found that they could not connect to port 1678 using telnet, which had a significant impact on their business. With the port blocked, their business processes could not communicate normally with external systems.

Like the preceding case, this was a network failure. With a network failure, the first step is to capture packets on the port or network interface and check the packet flow there.

The customer ran telnet from their machine and found that port 22 on the target machine accepted connections, but port 1678 and other ports did not. Checks also showed that the processes listening on these ports were all running normally.
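The telnet check can be repeated across several ports with a few lines of code. The sketch below uses a plain TCP connect and distinguishes a refused connection (usually nothing is listening) from a timeout (the SYN is often being dropped silently on the way). The function name `probe_ports` and the two-second timeout are illustrative choices, not values from the case.

```python
import socket


def probe_ports(host, ports, timeout=2.0):
    """Try a TCP connect to each port and map port -> outcome string."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = "open"
        except ConnectionRefusedError:
            # The peer answered with a reset: the port is reachable but closed.
            results[port] = "refused"
        except socket.timeout:
            # No answer at all: packets may be dropped before reaching a listener.
            results[port] = "timeout"
        except OSError as exc:
            results[port] = f"error: {exc}"
    return results
```

For example, `probe_ports("203.0.113.10", [22, 1678])` would probe both ports on a placeholder address. A port that has a running listener and still reports "timeout" is a hint that something in front of the listener is discarding the packets.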

Following the usual troubleshooting approach, could the iptables rules be the problem? We checked the iptables settings first. With the customer's cooperation, we confirmed that the iptables configuration on this machine had no obvious problems and no special rules.

Since iptables was not the issue, and drawing on the previous case, could driver hooks be the cause? We checked the security components and found that no additional security software was installed and no abnormal function hooks were present. The issue was therefore not caused by hooks either.

Since neither hooks nor iptables were the problem, could the kernel be dropping packets? With this question in mind, we used the OS console to diagnose the abnormal instance:

The diagnosis finished quickly, and we checked the diagnostic report.

The report recommended deleting the iptables rules that drop packets or the related Netfilter drivers. The conclusion was clear: the packet loss was caused by Netfilter. Because the iptables rules had already been checked, we turned to nftables, the other rule front end for Netfilter, and checked its rule settings:

The nftables rule settings showed that nft did indeed have a drop rule for port 1678.
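Searching a large rule set for such a rule by eye is error-prone, so the check can be automated. The sketch below assumes the rule set was exported as JSON with `nft -j list ruleset` and looks for rules that match a TCP destination port and carry a drop verdict. It handles only a plain port, a set, or a range on the right-hand side of the match, and it ignores drops reached through jumps or chain policies. The function names are hypothetical.

```python
import json


def _matches_port(right, port):
    """True if the match's right-hand side covers port (int, set, or range)."""
    if isinstance(right, int):
        return right == port
    if isinstance(right, dict) and "set" in right:
        return any(_matches_port(item, port) for item in right["set"])
    if isinstance(right, dict) and "range" in right:
        low, high = right["range"]
        return low <= port <= high
    return False


def _is_tcp_dport_match(expr, port):
    """True if expr is a positive match on tcp dport covering port."""
    match = expr.get("match") if isinstance(expr, dict) else None
    if not isinstance(match, dict) or match.get("op", "==") not in ("==", "in"):
        return False
    left = match.get("left")
    payload = left.get("payload") if isinstance(left, dict) else None
    if not isinstance(payload, dict):
        return False
    return (
        payload.get("protocol") == "tcp"
        and payload.get("field") == "dport"
        and _matches_port(match.get("right"), port)
    )


def find_drop_rules(ruleset_json, port):
    """Return (table, chain, handle) for rules that drop tcp dport `port`."""
    hits = []
    for item in json.loads(ruleset_json).get("nftables", []):
        rule = item.get("rule")
        if not rule:
            continue  # metainfo, table, and chain entries are skipped
        exprs = rule.get("expr", [])
        drops = any(isinstance(e, dict) and "drop" in e for e in exprs)
        if drops and any(_is_tcp_dport_match(e, port) for e in exprs):
            hits.append((rule.get("table"), rule.get("chain"), rule.get("handle")))
    return hits
```

The returned handle identifies the rule within its table and chain, which is the information needed to delete it.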

After we deleted the corresponding rule and refreshed the listener on port 1678 on this machine, the port became accessible and connections could be established. The issue was resolved.

Conclusion

During routine system O&M, packet loss can interrupt business communications, cause business operations to fail, and even make services impossible to deploy. However, packet loss problems are not insurmountable. The Alibaba Cloud OS console provides a simple, user-friendly, and professional diagnostic tool. If you suspect that a system is losing packets, you can troubleshoot with the OS console in the following steps:

1. Run the packet loss diagnosis function of the OS console and see whether the report points to a clear problem.
2. If the diagnostic report does not show kernel packet loss, check whether any extra security software is installed in the system, or compare with a normal environment to see whether there are extra hooks.
3. If there is no unexpected driver or hook, check whether the iptables rules (and nftables rules) are correct.
4. If the point of loss still cannot be identified, use tools such as funcgraph or BPF to place monitoring points and capture packets along the suspected path to locate the loss.

In most cases, the OS console and these four steps are enough to locate and resolve a packet loss issue.

Contact Us

If you have any questions or suggestions while using the OS console, search for group number 94405014449 to join the DingTalk group.
