Problem description
A system failure occurs on Elastic Compute Service (ECS) instances that run Alibaba Cloud Linux 2 and have the following properties:
- Image: Alibaba Cloud Linux 2.1903 LTS 64-bit
- Kernel: kernel-4.19.91-23.al7 or earlier
The following call stack information is shown during the system failure.
[ 332.057218] watchdog: BUG: soft lockup - CPU#7 stuck for 11s! [split_v2:28356]
[ 332.057219] mousedev isst_if_common hid_generic usbhid
[ 332.057223] CPU: 3 PID: 28336 Comm: split_v2 Kdump: loaded Not tainted 4.19.91-19.1.al7.x86_64 #1
[ 332.057507] Kernel panic - not syncing: softlockup: hung tasks
[ 332.057508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 332.057510] CPU: 6 PID: 28355 Comm: split_v2 Kdump: loaded Tainted: G L 4.19.91-19.1.al7.x86_64 #1
[ 332.057513] cp_new_stat+0x13d/0x160
[ 332.057514] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000019
[ 332.057515] Call Trace:
[ 332.057516] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[ 332.057518] __se_sys_newfstat+0x2e/0x40
[ 332.057518] Call Trace:
[ 332.057519] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057521] RBP: 00007eff1201bf10 R08: 00007eff1201c700 R09: 00007eff1201c700
[ 332.057523] do_syscall_64+0x5b/0x1b0
[ 332.057524] <IRQ>
[ 332.057525] RSP: 0018:ffffa389886efde8 EFLAGS: 00050206
[ 332.057529] dump_stack+0x66/0x8b
[ 332.057531] R10: 00007eff1201c9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057534] panic+0xd8/0x24c
[ 332.057535] RAX: 000000c000100090 RBX: ffffa389886efea8 RCX: 0000000000000090
[ 332.057536] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1201c700
[ 332.057539] __do_page_fault+0x11d/0x470
[ 332.057540] ? 0xffffffffc0477000
[ 332.057541] RDX: 0000000000000090 RSI: ffffa389886efdf8 RDI: 000000c000100000
[ 332.057552] watchdog_timer_fn+0x253/0x260
[ 332.057555] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057556] ? softlockup_fn+0x40/0x40
[ 332.057557] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057559] __hrtimer_run_queues+0xeb/0x250
[ 332.057560] R10: ffff8bfb1690a310 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04df00
[ 332.057562] hrtimer_interrupt+0x122/0x270
[ 332.057563] RIP: 0033:0x7eff1b11e3a4
[ 332.057564] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057566] smp_apic_timer_interrupt+0x6a/0x140
[ 332.057568] do_page_fault+0x32/0x140
[ 332.057570] apic_timer_interrupt+0xf/0x20
[ 332.057572] _copy_to_user+0x22/0x30
[ 332.057573] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057574] </IRQ>
[ 332.057575] RSP: 002b:00007eff1181aed8 EFLAGS: 00000246
[ 332.057578] RIP: 0010:__do_page_fault+0x227/0x470
[ 332.057579] ORIG_RAX: 0000000000000005
[ 332.057580] Code: 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 f6 85 91 00 00 00 02 41 bf 14 00 00 00 0f 84 c5 fe ff ff fb 66 0f 1f 44 00 00 <e9> b9 fe ff ff f6 85 88 00 00 00 03 75 0d f6 85 92 00 00 00 04 0f
[ 332.057582] cp_new_stat+0x13d/0x160
[ 332.057583] RSP: 0018:ffffa389886f7ca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 332.057585] __se_sys_newfstat+0x2e/0x40
[ 332.057586] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffffffff93a00ae0
[ 332.057587] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057588] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffa389886f7d38
[ 332.057589] do_syscall_64+0x5b/0x1b0
[ 332.057590] RBP: ffffa389886f7d38 R08: 0000000000000000 R09: 0000000000000000
[ 332.057591] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000009
[ 332.057592] R10: 0000000000000000 R11: 0000000000000000 R12: 000000c000100000
[ 332.057594] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057595] R13: ffff8bfb168bd940 R14: ffff8bfaee04af80 R15: 0000000000000014
[ 332.057597] RIP: 0033:0x7eff1b11e3a4
[ 332.057599] async_page_fault+0x1e/0x30
[ 332.057601] ? restore_regs_and_return_to_kernel+0x25/0x25
[ 332.057602] RBP: 00007eff1181af10 R08: 00007eff1181b700 R09: 00007eff1181b700
[ 332.057602] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057604] do_page_fault+0x32/0x140
[ 332.057606] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057607] R10: 00007eff1181b9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057608] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057609] async_page_fault+0x1e/0x30
[ 332.057610] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1181b700
[ 332.057612] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057613] RSP: 002b:00007eff08808ed8 EFLAGS: 00000246
[ 332.057614] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057615] ORIG_RAX: 0000000000000005
[ 332.057616] RSP: 0018:ffffa389886f7de8 EFLAGS: 00050206
[ 332.057617] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057618] RAX: 000000c000100090 RBX: ffffa389886f7ea8 RCX: 0000000000000090
[ 332.057619] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000024
[ 332.057620] RDX: 0000000000000090 RSI: ffffa389886f7df8 RDI: 000000c000100000
[ 332.057621] RSP: 0018:ffffa389886ffde8 EFLAGS: 00050206
[ 332.057623] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057624] RBP: 00007eff08808f10 R08: 00007eff08809700 R09: 00007eff08809700
[ 332.057625] R10: ffff8bfb1690b810 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04af80
[ 332.057626] R10: 00007eff088099d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057627] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057628] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08809700
[ 332.057630] _copy_to_user+0x22/0x30
[ 332.057631] RAX: 000000c000100090 RBX: ffffa389886ffea8 RCX: 0000000000000090
[ 332.057632] cp_new_stat+0x13d/0x160
[ 332.057633] RDX: 0000000000000090 RSI: ffffa389886ffdf8 RDI: 000000c000100000
[ 332.057634] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057635] __se_sys_newfstat+0x2e/0x40
[ 332.057636] R10: ffff8bfb1690ad10 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee048000
[ 332.057637] do_syscall_64+0x5b/0x1b0
[ 332.057638] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057640] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057642] _copy_to_user+0x22/0x30
[ 332.057643] RIP: 0033:0x7eff1b11e3a4
[ 332.057645] cp_new_stat+0x13d/0x160
[ 332.057646] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057647] __se_sys_newfstat+0x2e/0x40
[ 332.057648] RSP: 002b:00007eff08007ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057651] do_syscall_64+0x5b/0x1b0
[ 332.057652] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057654] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057655] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000002e
[ 332.057656] RIP: 0033:0x7eff1b11e3a4
[ 332.057657] RBP: 00007eff08007f10 R08: 00007eff08008700 R09: 00007eff08008700
[ 332.057658] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057659] R10: 00007eff080089d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057660] RSP: 002b:00007eff07806ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057662] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08008700
[ 332.057663] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057663] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000001e
[ 332.057664] RBP: 00007eff07806f10 R08: 00007eff07807700 R09: 00007eff07807700
[ 332.057665] R10: 00007eff078079d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057665] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff07807700
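As a rough first check, a saved kernel log can be searched for the two most recognizable parts of the trace above: the soft lockup report and the copy_user_enhanced_fast_string frame. The sketch below is illustrative only. The function name check_softlockup_log is hypothetical, and a match is a hint that the log resembles this problem, not a diagnosis.

```shell
# Count soft lockup reports in a kernel log file and report whether the
# copy_user_enhanced_fast_string frame seen in the trace above is present.
# $1: path to a text file that contains kernel log output.
check_softlockup_log() {
    log="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    lockups=$(grep -c 'BUG: soft lockup' "$log")
    if grep -q 'copy_user_enhanced_fast_string' "$log"; then
        frame="yes"
    else
        frame="no"
    fi
    echo "soft_lockups=$lockups copy_user_frame=$frame"
}
```

For example, save the output of dmesg to a file and pass the file path to check_softlockup_log.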
Cause
By default, the transparent huge pages (THP) feature is enabled on Alibaba Cloud Linux ECS instances. During memory garbage collection (GC), the system calls madvise with MADV_NOHUGEPAGE to disable the THP feature and with MADV_FREE to release some 4 KB pages, which causes the operating system to split transparent huge pages. If page faults occur in other processes at the same time, the page fault handling occupies CPU resources, and the process that splits the transparent huge pages cannot be scheduled, so it is suspended and never completes. The page fault handling in turn keeps waiting for the split to finish, which results in a soft lockup. If /proc/sys/kernel/softlockup_panic is enabled on your Alibaba Cloud Linux instance, the soft lockup triggers a kernel panic.
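The two settings mentioned above can be inspected directly. The following is a minimal sketch that assumes the standard Linux locations /sys/kernel/mm/transparent_hugepage/enabled and /proc/sys/kernel/softlockup_panic. The function names and the optional path argument are hypothetical and exist only so that the functions can be tried on sample files.

```shell
# Print the active THP mode, which is the bracketed word in the sysfs file,
# for example "always" from the line "[always] madvise never".
thp_mode() {
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    [ -r "$file" ] || { echo "unknown"; return 0; }
    sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$file"
}

# Print the softlockup_panic setting: 1 means that a soft lockup
# triggers a kernel panic, 0 means that it does not.
softlockup_panic_enabled() {
    file="${1:-/proc/sys/kernel/softlockup_panic}"
    [ -r "$file" ] || { echo "unknown"; return 0; }
    cat "$file"
}

echo "THP mode: $(thp_mode)"
echo "softlockup_panic: $(softlockup_panic_enabled)"
```

If the THP mode is always or madvise and softlockup_panic is 1, the instance matches the conditions described in this section.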
Solution
Take note of the following items:
- Before you perform high-risk operations such as modifying instance configurations or data, we recommend that you check the disaster recovery and fault tolerance capabilities of the instances to ensure data security.
- If you modify the configurations or data of instances such as ECS and ApsaraDB RDS instances, we recommend that you create snapshots or enable RDS log backup before you make the changes.
- If you have granted permissions to users or submitted sensitive information such as logon accounts and passwords in the Alibaba Cloud Management Console, we recommend that you change this information promptly.
You can perform the following steps to troubleshoot the problem:
- Log on to the ECS instance. For more information, see Overview.
- Run the following command to check whether one of the following solutions is applicable to your system kernel version:
uname -r
If the returned version is 4.19.91-23.al7.x86_64 or earlier, one of the following solutions is applicable to your system kernel version. Example output: 4.19.91-19.1.al7.x86_64
- Select one of the following solutions based on your system kernel version:
- For kernel versions earlier than 4.19.91-19.1.al7.x86_64, you can perform the following steps:
- Run the following command to update the kernel of the operating system to the latest version:
yum update kernel
- Run the following command to restart the server for the new kernel version to take effect:
reboot
- If the problem persists, run the following command to install a hot patch for the kernel.
- For kernel versions from 4.19.91-19.1.al7.x86_64 to 4.19.91-23.al7.x86_64, you can run the following command to install a hot patch for the kernel:
yum install -y kernel-hotfix-5902278-`uname -r | awk -F"-" '{print $NF}'`
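The version ranges in the steps above can be sketched as a small helper that maps a kernel release string to the matching solution. This is an illustration, not an official tool. It assumes that release strings have the form 4.19.91-19.1.al7.x86_64 and that GNU sort -V is available. The function names and the labels they print are hypothetical.

```shell
# Strip the distribution suffix: 4.19.91-19.1.al7.x86_64 -> 4.19.91-19.1
kernel_base() {
    printf '%s\n' "${1%%.al*}"
}

# Succeed if version $1 sorts before or equal to version $2 (GNU sort -V).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print which solution from the steps above matches a kernel release string.
suggest_fix() {
    base="$(kernel_base "$1")"
    if ! version_le "4.19.91-19.1" "$base"; then
        echo "update-kernel"            # earlier than 4.19.91-19.1: yum update kernel, then reboot
    elif version_le "$base" "4.19.91-23"; then
        echo "install-hotfix"           # 4.19.91-19.1 to 4.19.91-23: install the hot patch
    else
        echo "outside-affected-range"   # later than 4.19.91-23
    fi
}

# Build the hot patch package name the same way as the yum command above:
# the part of the release string after the last "-" is appended.
hotfix_package() {
    printf 'kernel-hotfix-5902278-%s\n' "${1##*-}"
}

suggest_fix "$(uname -r)"
```

For the release string 4.19.91-19.1.al7.x86_64, hotfix_package prints kernel-hotfix-5902278-19.1.al7.x86_64, which is the package name that the yum install command above resolves to on that kernel.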
Applicable scope
- ECS