Problem description
Application-level symptoms: The application experiences network timeouts, connection interruptions, or data transfer failures.
System-level packet loss: When you continuously monitor the /proc/net/softnet_stat file with the cat command, the count in the second column (dropped) or the third column (squeezed) increases rapidly.

```
# Each row corresponds to a CPU core.
# Column 1: Total number of network frames received.
# Column 2: Number of packets dropped because the backlog queue was full (dropped).
# Column 3: Number of packets deferred because the processing time budget was exceeded (squeezed).
$ cat /proc/net/softnet_stat
000bb344 00000471 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000bc76f 00000305 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001
```

High CPU usage for software interrupts: When you use tools such as top or mpstat, you observe abnormally high CPU usage for si (softirq).
Causes
When the network interface controller (NIC) driver receives a data packet, it uses a software interrupt (NET_RX_SOFTIRQ) to notify the kernel protocol stack to process the packet. To buffer traffic bursts, the kernel maintains a softnet backlog queue for each CPU core. Packet loss is primarily caused by two factors:
The softnet backlog queue is too small: In high network throughput scenarios, if the rate at which packets enter the queue is consistently higher than the rate at which the CPU processes them, the queue fills up quickly. New incoming packets are then dropped directly. This causes the dropped count in the second column of the /proc/net/softnet_stat file to increase.

CPU processing power is insufficient: Even if the queue size is adequate, packet processing is deferred if a CPU core cannot process the packets in the softnet backlog queue within its allocated time budget (net.core.netdev_budget). This situation is called a time_squeeze and causes the squeezed count in the third column of the /proc/net/softnet_stat file to increase. It indicates that the bottleneck is CPU computing power, not the queue size.
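The time budget mentioned above can be inspected directly on a running system. The following sketch reads the standard kernel parameters; the values in the comments are common defaults, not guarantees, and the second file only exists on newer kernels:

```shell
# Packets the kernel may process per NET_RX_SOFTIRQ cycle (often 300).
cat /proc/sys/net/core/netdev_budget

# Time limit in microseconds for one cycle (available on kernels >= 4.12).
cat /proc/sys/net/core/netdev_budget_usecs 2>/dev/null \
  || echo "netdev_budget_usecs not available on this kernel"
```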
Solutions
First, monitor the dynamic changes in /proc/net/softnet_stat to determine if packet loss is due to an insufficient queue size (dropped increases) or a CPU processing bottleneck (squeezed increases). Then, apply targeted optimizations.
Step 1: Diagnose the root cause of packet loss
Monitor the softnet status in real time to distinguish between a dropped issue and a squeezed issue.
Log on to the ECS Linux instance.
Run the following command to refresh the softnet statistics every second, and focus on the incremental changes in the dropped and squeezed columns. Note: The values in /proc/net/softnet_stat are cumulative since the system started. A problem is indicated only when the values increase continuously and rapidly.

```
watch -d 'awk "{print \"CPU\"(NR-1)\": dropped=\"\$2\", squeezed=\"\$3}" /proc/net/softnet_stat'
```

Based on the command output, determine the problem type:

The dropped column increases continuously while the squeezed column remains mostly unchanged: the softnet backlog queue is too small. Proceed to Step 2: Adjust the backlog queue size.

The squeezed column increases continuously (regardless of whether dropped also increases): CPU processing power is the bottleneck. Proceed to Step 3: Optimize CPU processing capability.
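As an alternative to eyeballing the watch output, the one-second delta of both counters can be computed directly. This is a sketch assuming a POSIX shell and awk; the /tmp/softnet.before path is an arbitrary scratch file:

```shell
#!/bin/sh
# Snapshot /proc/net/softnet_stat twice, one second apart, and print the
# per-CPU increase of the dropped (column 2) and squeezed (column 3)
# counters. The counters are hexadecimal and cumulative since boot, so
# only the delta matters.
f=/proc/net/softnet_stat
cp "$f" /tmp/softnet.before
sleep 1
report=$(awk '
  # Portable hex-to-decimal conversion for the counter fields.
  function hex(s,  n, i) {
    n = 0
    for (i = 1; i <= length(s); i++)
      n = n * 16 + index("0123456789abcdef", substr(tolower(s), i, 1)) - 1
    return n
  }
  NR == FNR { d[FNR] = hex($2); q[FNR] = hex($3); next }
  { printf "CPU%d: dropped +%d, squeezed +%d\n", FNR - 1,
           hex($2) - d[FNR], hex($3) - q[FNR] }
' /tmp/softnet.before "$f")
echo "$report"
```

A row with a steadily positive dropped delta points to Step 2; a steadily positive squeezed delta points to Step 3.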
Step 2: Adjust the backlog queue size
Increase the value of the net.core.netdev_max_backlog parameter to expand the queue capacity. This helps mitigate packet loss caused by traffic bursts.
Run the following command to view the current netdev_max_backlog value. The default value is typically 1000.

```
sysctl net.core.netdev_max_backlog
```

Based on the instance's network bandwidth, refer to the following table to set a reasonable netdev_max_backlog value.

Important: An unreasonably large value increases memory consumption and may introduce network latency. Set this value with caution. Memory usage can be estimated as: Memory usage (bytes) ≈ netdev_max_backlog × Average packet size × Number of CPU cores. The value you set depends mainly on your network bandwidth and business scenario.
| Business scenario | Bandwidth | Recommended value | Description |
| --- | --- | --- | --- |
| Default/Low configuration | ≤ 1 Gbps | 1000 (default) | The default value of 1000 is sufficient for normal traffic. |
| Medium load | 1 Gbps to 10 Gbps | 5000 to 10000 | Suitable for most standard web servers and application servers. |
| High concurrency/High throughput | 10 Gbps to 40 Gbps | 30000 | Suitable for Nginx gateways, Redis, and high-frequency API services. |
| Extremely high performance | 40 Gbps+ | 60000 to 100000 | Suitable for core switch nodes, DDoS traffic scrubbing, and ultra-high-frequency trading systems. |
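To apply the memory estimation formula from the note above before committing to a value, the arithmetic can be done in the shell. The backlog, packet size, and core count below are illustrative assumptions, not recommendations:

```shell
backlog=30000   # candidate net.core.netdev_max_backlog value (assumed)
avg_pkt=1500    # assumed average packet size in bytes
cores=8         # assumed number of vCPU cores
bytes=$((backlog * avg_pkt * cores))
echo "Worst-case backlog memory: $bytes bytes (~$((bytes / 1024 / 1024)) MiB)"
# For these inputs: 360000000 bytes, about 343 MiB
```

If the worst-case estimate is a significant share of instance memory, choose a smaller backlog value.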
Temporarily modify the parameter value to apply it immediately. Replace NETDEV_MAX_BACKLOG_NUMBER with the desired value.

```
sysctl -w net.core.netdev_max_backlog=NETDEV_MAX_BACKLOG_NUMBER
```

To ensure the configuration persists after a server restart, make the setting permanent.

Method 1 (recommended): Create or modify the /etc/sysctl.d/99-network-tuning.conf file and add the following content. Then, run sysctl --system to apply the configuration.

```
# Increase the kernel softnet backlog queue size
net.core.netdev_max_backlog = NETDEV_MAX_BACKLOG_NUMBER
```

Method 2: Add the following content to the end of the /etc/sysctl.conf file. Then, run sysctl -p to apply the configuration.

```
net.core.netdev_max_backlog = NETDEV_MAX_BACKLOG_NUMBER
```

Return to Step 1 and monitor the dropped count again. If it has stopped increasing, the optimization was effective.
Step 3: Optimize CPU processing capability
Use multiple CPU cores to process network software interrupts in parallel by enabling Receive Packet Steering (RPS).
Confirm the number of vCPUs for the instance. RPS has no effect on single-core instances.
Find the name of the NIC to optimize, which is usually eth0 or eth1.

Enable RPS to distribute network software interrupts across CPU cores for processing. rps_cpus is a CPU bitmask. For example, the mask for an 8-core CPU is ff (binary 11111111), and the mask for a 16-core CPU is ffff. You can calculate the mask based on the number of CPU cores or use a sufficiently large value (such as ffffffff) to cover all cores.

```
# Replace <interface> with the NIC name, for example, eth0
# Replace <cpu_mask> with the calculated CPU mask, for example, ff (8-core)
echo <cpu_mask> > /sys/class/net/<interface>/queues/rx-0/rps_cpus

# Example: Enable RPS for the eth0 NIC on an 8-core CPU
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

To make the configuration persist across restarts, add the preceding command to a startup script, such as /etc/rc.local.

Return to Step 1 and monitor the squeezed count again. If it has stopped increasing, the optimization was effective.
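Rather than looking up the bitmask in a table, it can be computed for any core count with shell arithmetic. The core count of 8 is an example, and the approach assumes fewer than 63 cores because it relies on 64-bit integer arithmetic:

```shell
# Build an rps_cpus mask that enables the first N CPU cores:
# set the low N bits, then print the value in hexadecimal.
n=8                                   # example: 8 vCPU cores
mask=$(printf '%x' $(( (1 << n) - 1 )))
echo "$mask"                          # prints: ff

# Sanity check for 16 cores:
printf '%x\n' $(( (1 << 16) - 1 ))    # prints: ffff
```

The resulting string is what you write to /sys/class/net/&lt;interface&gt;/queues/rx-0/rps_cpus.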
Recommendations
Set up monitoring and alerts: In Alibaba Cloud CloudMonitor, configure custom monitoring for your ECS instance. Monitor the growth rate of dropped and squeezed in /proc/net/softnet_stat and set alert rules so that you are promptly notified when a problem occurs.

Select a suitable instance type: For network-intensive applications, select a network-enhanced instance family (such as c7ne or g7ne). These instance families provide higher network packets per second (PPS) and bandwidth performance.
droppedandsqueezedin/proc/net/softnet_statand set alert rules. This ensures you are promptly notified if a problem occurs.Select a suitable instance type: For network-intensive applications, select a network-enhanced instance family (such as c7ne or g7ne). These instance families provide higher network packets per second (PPS) and bandwidth performance.