This topic describes the cause of, and solutions for, hangs or abnormal reboots on instances running CentOS 7 with kernel version 3.10.0-514 when a large number of instances share the same VPC CIDR block.
Problem description
Instances that use CentOS public images or custom images may hang or reboot abnormally if both of the following conditions are met:
The kernel version is 3.10.0-514.
Note: Use the command uname -a to verify the CentOS system's kernel version.
The number of instances in the same VPC CIDR block is over 128.
Note: The likelihood of this issue increases with the number of instances in the same VPC CIDR block.
Users transitioning from the classic network to a virtual private cloud (VPC) should be particularly aware of this issue. For more information, see Migrate from classic network to VPC.
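The two conditions above can be checked with a short script. The following is a minimal sketch, not an official diagnostic tool: the function names is_affected_kernel and exceeds_threshold are hypothetical, and treating every 3.10.0-514.x release as affected is an assumption based on the version named in this topic.

```shell
#!/bin/sh
# Returns 0 if the given kernel release string belongs to the 3.10.0-514
# series (assumption: all 3.10.0-514.x sub-releases are treated as affected).
is_affected_kernel() {
    case "$1" in
        3.10.0-514|3.10.0-514.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Returns 0 if the entry count ($1) is greater than the threshold ($2).
exceeds_threshold() {
    [ "$1" -gt "$2" ]
}

# Example use on a live instance (requires iproute2 and procfs):
#   is_affected_kernel "$(uname -r)" && echo "kernel is in the affected series"
#   entries=$(ip -4 neigh show | wc -l)
#   thresh1=$(cat /proc/sys/net/ipv4/neigh/default/gc_thresh1)
#   exceeds_threshold "$entries" "$thresh1" && echo "ARP entries exceed gc_thresh1"
```

The functions take their inputs as arguments so that they can be tried with sample values before being run against a production instance.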
Cause
When network communication between instances causes the number of kernel ARP (Address Resolution Protocol) cache entries to exceed the default value of 128 set by the kernel parameter net.ipv4.neigh.default.gc_thresh1, the ARP cache overflows.
In CentOS kernel version 3.10.0-514, this triggers an ARP entry reclamation process that contends with other ARP management functions, potentially leading to kernel crashes. Symptoms of kernel crashes include abnormal instance restarts and downtime. Typical kernel stack traces from such crashes are as follows:
PID: 35 TASK: ffff88023fe13ec0 CPU: 0 COMMAND: "kworker/0:1"
[exception RIP: __write_lock_failed+9]
RIP: ffffffff813275c9 RSP: ffff88023f7e3dc8 RFLAGS: 00000297
RAX: ffff88019c338000 RBX: ffff880035c89800 RCX: 000000000000000a
RDX: 0000000000000372 RSI: 000000012eeea6c0 RDI: ffff880035c8982c
RBP: ffff88023f7e3dc8 R8: ffffffff81aa7858 R9: 0001f955a06a7850
R10: 0001f955a06a7850 R11: 0000000000000000 R12: 0000000000000372
R13: ffffffff81aa7850 R14: ffff880035c89828 R15: ffff88019c339b90
CS: 0010 SS: 0018
#0 [ffff88023f7e3dd0] _raw_write_lock at ffffffff8168e7d7
#1 [ffff88023f7e3de0] neigh_periodic_work at ffffffff8157f3ac
#2 [ffff88023f7e3e20] process_one_work at ffffffff810a845b
#3 [ffff88023f7e3e68] worker_thread at ffffffff810a9296
#4 [ffff88023f7e3ec8] kthread at ffffffff810b0a4f
#5 [ffff88023f7e3f50] ret_from_fork at ffffffff81697758
PID: 0 TASK: ffff880173afce70 CPU: 20 COMMAND: "swapper/20"
[exception RIP: native_halt+5]
RIP: ffffffff81060ff5 RSP: ffff880173b1b878 RFLAGS: 00000046
RAX: 000000000000912c RBX: ffff881fbf30f380 RCX: 000000000000912e
RDX: 000000000000912c RSI: 000000000000912e RDI: ffff8801736a0000
RBP: ffff880173b1b878 R8: 0000000000000086 R9: 0000000000000000
R10: 0000000000000000 R11: ffff880173b1b95e R12: 0000000000000082
R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000e20
CS: 0010 SS: 0018
#0 [ffff880173b1b880] kvm_lock_spinning at ffffffff81060b5a
#1 [ffff880173b1b8b0] __raw_callee_save_kvm_lock_spinning at ffffffff8105ff05
#2 [ffff880173b1b900] _raw_spin_lock_irqsave at ffffffff8168dcd3
#3 [ffff880173b1b940] mod_timer at ffffffff81098e24
#4 [ffff880173b1b988] add_timer at ffffffff81098fe8
#5 [ffff880173b1b998] fbcon_add_cursor_timer at ffffffff81381069
#6 [ffff880173b1b9c0] fbcon_cursor at ffffffff8138422a
#7 [ffff880173b1ba10] hide_cursor at ffffffff813f6628
#8 [ffff880173b1ba28] vt_console_print at ffffffff813f8058
#9 [ffff880173b1ba90] call_console_drivers.constprop.15 at ffffffff81086ca1
#10 [ffff880173b1bab8] console_unlock at ffffffff810884be
#11 [ffff880173b1baf0] vprintk_emit at ffffffff810889d4
#12 [ffff880173b1bb60] vprintk_default at ffffffff81088d49
#13 [ffff880173b1bb70] printk at ffffffff8167f854
#14 [ffff880173b1bbd0] no_context at ffffffff8167ecbb
#15 [ffff880173b1bc20] __bad_area_nosemaphore at ffffffff8167ee29
#16 [ffff880173b1bc68] bad_area_nosemaphore at ffffffff8167ef93
#17 [ffff880173b1bc78] __do_page_fault at ffffffff81691f1e
#18 [ffff880173b1bcd8] trace_do_page_fault at ffffffff81692176
#19 [ffff880173b1bd18] do_async_page_fault at ffffffff8169181b
#20 [ffff880173b1bd30] async_page_fault at ffffffff8168e3b8
[exception RIP: get_next_timer_interrupt+440]
RIP: ffffffff810991a8 RSP: ffff880173b1bde0 RFLAGS: 00010017
RAX: 0000000000000000 RBX: 0098950e05e51640 RCX: 0000ffbc0000ffbc
RDX: 0000000b3fe32cf2 RSI: ffff8801736a1318 RDI: 000000000affe32d
RBP: ffff880173b1be30 R8: 0000000000000001 R9: 000000000000002f
R10: 000000000000002d R11: ffff8801736a1028 R12: 0000000affe32cf2
R13: ffff8801736a0000 R14: ffff880173b1bde8 R15: ffff880173b1be00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#21 [ffff880173b1be38] tick_nohz_stop_sched_tick at ffffffff810f3418
#22 [ffff880173b1be90] __tick_nohz_idle_enter at ffffffff810f35be
#23 [ffff880173b1bec0] tick_nohz_idle_enter at ffffffff810f3aed
#24 [ffff880173b1bed0] cpu_startup_entry at ffffffff810e7c13
#25 [ffff880173b1bf28] start_secondary at ffffffff8104f11a
Solutions
Permanent solution
Run the command sudo yum update kernel to update the kernel to version 3.10.0-693.21.1.el7.x86_64 or newer.
After the kernel upgrade, the instance must be restarted. For detailed instructions, see Restart an instance.
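After the restart, it may be useful to confirm that the running kernel is at least the fixed version. The following is a sketch under stated assumptions: the function name kernel_at_least is hypothetical, and the comparison relies on the -V (version sort) option of GNU sort, which is assumed to be available as it is on CentOS 7.

```shell
#!/bin/sh
# Returns 0 if kernel release $1 is at least version $2.
# Assumes GNU sort, which provides the -V version-sort option.
kernel_at_least() {
    # Drop the distribution suffix, for example ".el7.x86_64", before comparing.
    current=${1%%.el*}
    required=${2%%.el*}
    # The lower of the two versions sorts first.
    lowest=$(printf '%s\n%s\n' "$current" "$required" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Example use after the restart:
#   kernel_at_least "$(uname -r)" 3.10.0-693.21.1 && echo "kernel includes the fix"
```

Stripping the distribution suffix first keeps a release such as 3.10.0-693.el7.x86_64 from being compared on its el7 label rather than on its numeric fields.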
Temporary solutions
If a kernel upgrade is not possible, consider the following two methods to temporarily mitigate the issue.
Method 1
Adjust the kernel parameter values by running the commands below. Set gc_thresh1 higher than the number of instances in the same VPC CIDR block, ensuring gc_thresh3 ≥ gc_thresh2 ≥ gc_thresh1. For example:
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
To retain these settings after a restart, add the kernel parameter configurations to the /etc/sysctl.conf file.
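As an illustration of making the settings persistent, the sketch below checks the ordering rule described in Method 1 and prints the corresponding lines for /etc/sysctl.conf. The function name emit_neigh_sysctl and the instance count of 2000 in the example are hypothetical; adjust the values to your own environment.

```shell
#!/bin/sh
# Print sysctl.conf lines for the three thresholds after checking that
# gc_thresh3 >= gc_thresh2 >= gc_thresh1 > number of instances.
# Usage: emit_neigh_sysctl INSTANCES THRESH1 THRESH2 THRESH3
emit_neigh_sysctl() {
    instances=$1
    t1=$2
    t2=$3
    t3=$4
    if [ "$t1" -le "$instances" ] || [ "$t2" -lt "$t1" ] || [ "$t3" -lt "$t2" ]; then
        echo "invalid thresholds" >&2
        return 1
    fi
    printf 'net.ipv4.neigh.default.gc_thresh1 = %s\n' "$t1"
    printf 'net.ipv4.neigh.default.gc_thresh2 = %s\n' "$t2"
    printf 'net.ipv4.neigh.default.gc_thresh3 = %s\n' "$t3"
}

# Example (run as root on the instance):
#   emit_neigh_sysctl 2000 4096 8192 8192 >> /etc/sysctl.conf
#   sysctl -p    # reload the settings from /etc/sysctl.conf
```

Validating the values before writing them avoids persisting a configuration in which gc_thresh1 is still lower than the number of instances in the CIDR block.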
Method 2
When planning your VPC, avoid placing too many instances in a single CIDR block.