Resolve an ECS instance crash that reports an "Objects remaining in kmalloc" alert, caused by the memory cgroup kmem feature.
Problem description
When the memory cgroup kmem feature is enabled, the instance may crash, and the kernel log contains an alert similar to the following:
[80569.393775] BUG kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a0) (Tainted: P B W OE ------------ T): Objects remaining in kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a0)
[80569.397756] -----------------------------------------------------------------------------
[80569.397756]
[80569.400724] INFO: Slab 0xffffea0001e94a00 objects=32 used=1 fp=0xffff88007a528000 flags=0x1fffff00004080
[80569.402702] CPU: 21 PID: 26626 Comm: dockerd Tainted: P B W OE ------------ T 3.10.0-693.2.2.el7.x86_64 #1
[80569.404898] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
[80569.406747] ffffea0001e94a00 000000004eb9a19f ffff883afee53aa0 ffffffff816a3db1
[80569.408833] ffff883afee53b78 ffffffff811dbf54 ffffffff00000020 ffff883afee53b88
[80569.410731] ffff883afee53b38 656a624f8190fff8 616d657220737463 6e6920676e696e69
[80569.412630] Call Trace:
[80569.414005] [<ffffffff816a3db1>] dump_stack+0x19/0x1b
[80569.415627] [<ffffffff811dbf54>] slab_err+0xb4/0xe0
[80569.417204] [<ffffffff811e0623>] ? __kmalloc+0x1e3/0x230
[80569.420419] [<ffffffff811e1939>] kmem_cache_close+0x149/0x2e0
[80569.422006] [<ffffffff811e1ae4>] __kmem_cache_shutdown+0x14/0x80
[80569.423606] [<ffffffff811a6874>] kmem_cache_destroy+0x44/0xf0
[80569.425149] [<ffffffff811f6019>] kmem_cache_destroy_memcg_children+0x89/0xb0
[80569.426800] [<ffffffff811a6849>] kmem_cache_destroy+0x19/0xf0
[80569.428309] [<ffffffff8123b18e>] bioset_free+0xce/0x110
[80569.431306] [<ffffffffc06d0b43>] dm_destroy+0x13/0x20 [dm_mod]
[80569.432803] [<ffffffffc06d69be>] dev_remove+0x11e/0x180 [dm_mod]
[80569.435851] [<ffffffffc06d7015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
[80569.437363] [<ffffffffc06d7343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[80569.438882] [<ffffffff8121524d>] do_vfs_ioctl+0x33d/0x540
[80569.443291] [<ffffffff812154f1>] SyS_ioctl+0xa1/0xc0
[80569.446228] [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
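To check whether an instance has logged this alert, you can search the kernel log for its text. The following is a minimal sketch rather than an official diagnostic: count_kmem_alerts is a hypothetical helper name, and the sketch assumes that dmesg is available and that you have permission to read the kernel ring buffer.

```shell
#!/bin/sh
# count_kmem_alerts is a hypothetical helper (not from the original article).
# It reads kernel log text on stdin and prints the number of lines that
# contain the "Objects remaining in kmalloc" alert.
count_kmem_alerts() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function usable in scripts that run with "set -e".
    grep -c -- 'Objects remaining in kmalloc' || true
}

# Usage: count matching lines in the current kernel ring buffer.
dmesg 2>/dev/null | count_kmem_alerts
```

A count greater than 0 only shows that the alert text is present. Compare the call trace with the one above before concluding that the cause is the same.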
Cause
With the memory cgroup kmem feature enabled, kmem_cache_destroy deletes the memcg caches and checks whether the refcount is 0 before it destroys the kmem_cache. If the refcount is not 0, tasks may still allocate slab memory through a memcg cache of that kmem_cache. This triggers a race condition that crashes the instance.
Solution
Before you proceed, create snapshots to back up instance data to prevent data loss. See Create snapshot manually.
Disable the memory cgroup kmem feature in the instance:
- Open /etc/default/grub:
  vim /etc/default/grub
- Press I to enter Insert mode. Add the following to the GRUB_CMDLINE_LINUX line:
  cgroup.memory=nokmem
  After the change, the file looks similar to the following:
  GRUB_TIMEOUT=1
  GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
  GRUB_DEFAULT=saved
  GRUB_DISABLE_SUBMENU=true
  GRUB_TERMINAL_OUTPUT="console"
  GRUB_CMDLINE_LINUX="crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M cryptomgr.notests cgroup.memory=nokmem rcupdate.rcu_cpu_stall_timeout=300 fnames=0 console=tty0 console=ttyS0,115200n8 noibrs nvme_core.io_timeout=4294967295"
  GRUB_DISABLE_RECOVERY="true"
- Press Esc, enter :wq, and press Enter to save and close the file.
- Update GRUB:
  grub2-mkconfig -o /boot/grub2/grub.cfg
- Restart the instance:
  reboot
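After the instance restarts, it can be useful to confirm that the parameter reached the running kernel. The following is a minimal sketch, not part of the original procedure: has_nokmem is a hypothetical helper name, and the sketch assumes the kernel command line is readable at /proc/cmdline.

```shell
#!/bin/sh
# has_nokmem is a hypothetical helper (not from the original article).
# It succeeds when the given kernel command line contains
# cgroup.memory=nokmem as a separate, space-delimited parameter.
has_nokmem() {
    case " $1 " in
        *" cgroup.memory=nokmem "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the command line of the running kernel.
if has_nokmem "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "cgroup.memory=nokmem is on the kernel command line"
else
    echo "cgroup.memory=nokmem is NOT on the kernel command line"
fi
```

If the parameter is missing, check that grub2-mkconfig wrote to the configuration file that the instance actually boots from.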
If you cannot disable the kmem feature with the preceding steps, make sure that no program on the instance sets memory.kmem.limit_in_bytes. This keeps the kmem feature disabled.
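To look for programs that have set a kmem limit, you can scan the memory cgroup hierarchy for finite values of memory.kmem.limit_in_bytes. The following is a rough sketch under several assumptions: the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory, a value of 19 or more digits is the kernel's "unlimited" default, and kmem_limit_is_set is a hypothetical helper name. A limit that was set and later reset to unlimited will not be detected.

```shell
#!/bin/sh
# kmem_limit_is_set is a hypothetical helper (not from the original article).
# It succeeds when the value read from memory.kmem.limit_in_bytes looks like
# a finite limit. Assumption: values with 19 or more digits are the kernel's
# "unlimited" default, so only shorter numeric values count as a limit.
kmem_limit_is_set() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: not a limit
    esac
    [ "${#1}" -le 18 ]
}

# Assumption: cgroup v1 memory controller mounted at /sys/fs/cgroup/memory.
find /sys/fs/cgroup/memory -name memory.kmem.limit_in_bytes 2>/dev/null |
while read -r f; do
    v=$(cat "$f" 2>/dev/null) || continue
    if kmem_limit_is_set "$v"; then
        echo "$f = $v"
    fi
done
```

Each printed path names a cgroup whose kmem limit appears to have been set. The cgroup name usually indicates which service or container runtime created it.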