Problem description
A system failure occurs on an Elastic Compute Service (ECS) instance that runs Alibaba Cloud Linux 2 and has the following properties:
Image: Alibaba Cloud Linux 2.1903 LTS 64-bit
Kernel: kernel-4.19.91-23.al7 or earlier
The following call stack information is shown during the system failure.
[598398.653602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[598398.655740] PGD 0 P4D 0
[598398.656113] Oops: 0000 [#1] SMP PTI
[598398.656601] CPU: 10 PID: 5519 Comm: kworker/u192:5 Kdump: loaded Not tainted 4.19.91-21.2.al7.x86_64 #1
[598398.657851] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[598398.658885] Workqueue: writeback wb_workfn (flush-254:48)
[598398.659644] RIP: 0010:locked_inode_to_wb_and_lock_list+0x1a/0x110
[598398.660484] Code: 6b 50 eb ce 48 01 6b 48 eb db 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 4c 8d a7 88 00 00 00 55 48 89 fd 53 48 8b 9d f8 00 00 00 <48> 8b 03 48 83 c0 58 48 39 c3 74 13 48 8b 83 18 02 00 00 a8 03 0f
[598398.662967] RSP: 0018:ffffbae05dc07c00 EFLAGS: 00010246
[598398.663662] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9d096bd0ed80
[598398.664611] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9d096bd0ed80
[598398.665555] RBP: ffff9d096bd0ed80 R08: 0000000000000001 R09: ffff9d344af394b2
[598398.666530] R10: 0000000000000006 R11: 0000000000000018 R12: ffff9d096bd0ee08
[598398.667470] R13: ffff9d096bd0ee68 R14: ffffbae05dc07e08 R15: ffff9d4e4914b858
[598398.668448] FS: 0000000000000000(0000) GS:ffff9d4e5e680000(0000) knlGS:0000000000000000
[598398.669543] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[598398.670322] CR2: 0000000000000000 CR3: 000000396720a003 CR4: 00000000003606e0
[598398.671267] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[598398.672238] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[598398.673196] Call Trace:
[598398.673563] writeback_sb_inodes+0x238/0x470
[598398.674170] __writeback_inodes_wb+0x87/0xb0
[598398.674754] wb_writeback+0x248/0x2e0
[598398.675260] ? 0xffffffffb7000000
[598398.675728] ? cpumask_next+0x16/0x20
[598398.676268] wb_workfn+0x350/0x420
[598398.676764] ? __switch_to+0xab/0x460
[598398.677297] process_one_work+0x15b/0x370
[598398.677859] worker_thread+0x49/0x3e0
[598398.678391] kthread+0xf8/0x130
[598398.678834] ? process_one_work+0x370/0x370
[598398.679419] ? kthread_park+0xb0/0xb0
[598398.679954] ret_from_fork+0x35/0x40
[598398.680451] Modules linked in: 8021q garp stp llc tcp_diag inet_diag sunrpc intel_rapl_msr intel_rapl_common iosf_mbi mousedev hid_generic isst_if_common psmouse pvpanic nfit button usbhid uhci_hcd ehci_pci ehci_hcd cirrus
[598398.683106] CR2: 0000000000000000
[598398.683580] ---[ end trace c60ddcb70b40a540 ]---
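To check whether a saved kernel log shows the same failure, you can search it for the two distinguishing lines of the call stack above: the NULL pointer dereference message and the faulting function in the RIP line. The following is a minimal sketch. The function name `has_writeback_oops` is hypothetical and is not part of any Alibaba Cloud tool.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and succeeds only if it
# contains both markers of the oops shown in the call stack above.
has_writeback_oops() {
    log=$(cat)
    # Marker 1: the NULL pointer dereference message.
    printf '%s\n' "$log" | grep -q 'NULL pointer dereference' &&
    # Marker 2: the faulting function reported in the RIP line.
    printf '%s\n' "$log" | grep -q 'RIP: .*locked_inode_to_wb_and_lock_list'
}

# Example usage (reading the kernel ring buffer may require root):
#   dmesg | has_writeback_oops && echo "signature found"
```

A match only indicates that the log resembles this problem. Confirm the kernel version before you apply a solution.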
Cause
When the EXT4 file system removes an inode from the journal, the operation can race with the journaling block device 2 (JBD2) transaction commit process. This race condition can cause a use-after-free (UAF) error, which leads to the system failure.
Solution
Take note of the following items:
Before you perform high-risk operations such as modifying instance configurations or data, we recommend that you check the disaster recovery and fault tolerance capabilities of the instances to ensure data security.
You can modify the configurations and data of Alibaba Cloud instances such as ECS and ApsaraDB RDS instances. We recommend that you create snapshots or enable RDS log backup before you modify instance configurations or data.
If you have granted permissions to users or submitted sensitive information such as logon accounts and passwords in Alibaba Cloud Management Console, we recommend that you modify the information in a timely manner.
You can perform the following steps to troubleshoot the problem:
Log on to the ECS instance. For more information, see Overview.
Run the following command to check whether one of the following solutions is applicable to your system kernel version:
uname -r
If an output similar to the following one is returned, one of the solutions is applicable to your system kernel version:
4.19.91-19.1.al7.x86_64
Select one of the following solutions based on your system kernel version:
For kernel versions earlier than 4.19.91-19.1.al7.x86_64, you can perform the following steps:
Run the following command to update the kernel of the operating system to the latest version:
yum update kernel
Run the following command to restart the server for the new kernel version to take effect:
reboot
If the problem persists after the kernel is updated, run the hot patch installation command provided below to install a hot patch for the kernel.
For kernel versions from 4.19.91-19.1.al7.x86_64 to 4.19.91-23.al7.x86_64, you can run the following command to install a hot patch for the kernel:
yum install -y kernel-hotfix-5260815-`uname -r | awk -F"-" '{print $NF}'`
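The version check in the steps above can be sketched as a small script. It assumes that the release string has the form `4.19.91-<major>[.<minor>].al7[.x86_64]`, as in the examples in this topic. The function names `suggest_fix` and `hotfix_package` and the labels they print are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: maps a kernel release string to one of the two
# solutions described in the steps above.
suggest_fix() {
    case $1 in
        4.19.91-*.al7|4.19.91-*.al7.*) ;;
        *) echo "not-applicable"; return 0 ;;
    esac
    build=${1#4.19.91-}        # for example: 21.2.al7.x86_64
    build=${build%%.al7*}      # for example: 21.2
    major=${build%%.*}
    case $build in
        *.*) minor=${build#*.} ;;
        *)   minor=0 ;;
    esac
    # Assumption: major and minor are plain non-negative integers.
    case $major in ''|*[!0-9]*) echo "not-applicable"; return 0 ;; esac
    case $minor in ''|*[!0-9]*) echo "not-applicable"; return 0 ;; esac

    if [ "$major" -lt 19 ] || { [ "$major" -eq 19 ] && [ "$minor" -lt 1 ]; }; then
        echo "update-kernel"              # earlier than 4.19.91-19.1.al7
    elif [ "$major" -lt 23 ] || { [ "$major" -eq 23 ] && [ "$minor" -eq 0 ]; }; then
        echo "install-hotfix"             # 4.19.91-19.1.al7 to 4.19.91-23.al7
    else
        echo "outside-documented-range"   # later than 4.19.91-23.al7
    fi
}

# Prints the package name that the yum command above resolves to: the part of
# the release string after the last hyphen, the same value as awk's $NF.
hotfix_package() {
    printf 'kernel-hotfix-5260815-%s\n' "${1##*-}"
}

# Example usage:
#   suggest_fix "$(uname -r)"
#   hotfix_package "$(uname -r)"
```

The script parses the release number instead of comparing strings because a plain string comparison would order `4.19.91-9.al7` after `4.19.91-19.1.al7`.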
Applicable scope
ECS