Fix an ECS instance crash caused by the "Out of memory and no killable processes" kernel panic by checking for memory leaks and reviewing OOM settings.
Symptom
An instance crashes at runtime with a call stack similar to the following:
[28663.625353] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[28663.625363] [ 1799] 0 1799 26512 245 56 3 0 -1000 sshd
[28663.625367] [29219] 0 29219 10832 126 26 3 0 -1000 systemd-udevd
[28663.625375] Kernel panic - not syncing: Out of memory and no killable processes...
[28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G OE 3.10.0-1062.9.1.el7.x86_64 #1
[28663.676873] Call Trace:
[28663.679312] [<ffffffff8139f342>] dump_stack+0x63/0x81
[28663.684421] [<ffffffff811b2245>] panic+0xf8/0x244
[28663.689184] [<ffffffff811b98db>] out_of_memory+0x2eb/0x550
[28663.694726] [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0
[28663.700959] [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40
[28663.707279] [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260
[28663.713599] [<ffffffff81216535>] alloc_pages_current+0x95/0x140
[28663.719573] [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40
[28663.725113] [<ffffffff81075dae>] pgd_alloc+0x1e/0x160
[28663.730225] [<ffffffff810875e4>] mm_init+0x184/0x240
[28663.735249] [<ffffffff81088102>] mm_alloc+0x52/0x60
[28663.740186] [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780
[28663.759839] [<ffffffff81257b9c>] do_execve+0x2c/0x30
[28663.764864] [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150
[28663.777246] [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
Cause
The kernel fails to allocate memory and cannot kill any process to free memory, so the instance crashes. Possible causes:
- A kernel memory leak exhausts available memory.
- Processes with oom_score_adj set to -1000 consume excessive memory and cannot be killed.

Note: oom_score_adj is an integer that controls OOM kill priority. A lower value makes a process less likely to be killed; a higher value makes it more likely.
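For a rough first indication of whether the first cause applies on a running instance, you can compare unreclaimable slab memory with total memory. The following is a minimal sketch, not an official diagnostic: slab_unreclaim_pct is a hypothetical helper name, and it assumes the standard MemTotal and SUnreclaim fields of /proc/meminfo. It prints a percentage only. What counts as too high depends on the workload.

```shell
#!/bin/sh
# Sketch: report unreclaimable slab memory as a share of total memory.
# slab_unreclaim_pct is a hypothetical helper name. It reads a meminfo-style
# file (default: /proc/meminfo) and prints SUnreclaim / MemTotal in percent.
slab_unreclaim_pct() {
    meminfo_file="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"   { total = $2 }      # value is in kB
        $1 == "SUnreclaim:" { unreclaim = $2 }  # value is in kB
        END {
            if (total > 0)
                printf "%.1f\n", unreclaim * 100 / total
            else
                exit 1
        }
    ' "$meminfo_file"
}

# Run against the live system when procfs is available.
if [ -r /proc/meminfo ]; then
    slab_unreclaim_pct
fi
```

A value that keeps growing across repeated runs, without a matching growth in process memory, is the pattern that suggests a kernel-side leak.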
Solution
Before you proceed, create snapshots to back up instance data to prevent data loss. See Create snapshot manually.
- Check for a kernel memory leak.
  See What do I do if an instance has a high percentage of slab_unreclaimable memory?
- Verify the oom_score_adj settings.
  - Get the PID of the target process using ps, top, or pgrep:
      ps aux | grep <Process name>
    Replace <Process name> with the actual process name.
  - Check the oom_score_adj value:
      cat /proc/<PID>/oom_score_adj
    Replace <PID> with the actual PID.
  Evaluate whether the oom_score_adj values are appropriate for your environment. A process with oom_score_adj set to -1000 is exempt from OOM killing, which may cause available memory to run out.
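The check above inspects one process at a time. To review every exempt process at once, a sketch along the following lines may help. list_oom_exempt is a hypothetical helper name, and the script assumes the usual procfs files oom_score_adj, comm, and status under /proc/<PID>. It prints the PID, name, and resident memory (VmRSS) of each process whose oom_score_adj is -1000.

```shell
#!/bin/sh
# Sketch: list processes whose oom_score_adj is -1000, with resident memory.
# list_oom_exempt is a hypothetical helper name. The optional argument
# overrides the procfs root (default: /proc).
list_oom_exempt() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # A process can exit mid-scan, so skip entries that cannot be read.
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        [ "$adj" = "-1000" ] || continue
        pid=${dir##*/}
        name=$(cat "$dir/comm" 2>/dev/null)
        # VmRSS may be missing (for example, kernel threads); report 0 then.
        rss_kb=$(awk '$1 == "VmRSS:" { print $2 }' "$dir/status" 2>/dev/null)
        printf '%s\t%s\t%s kB\n' "$pid" "$name" "${rss_kb:-0}"
    done
}

# Run against the live system.
list_oom_exempt
```

Exempt processes with a large resident size are the ones to evaluate first, because the kernel cannot reclaim their memory by killing them.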
-