What do I do if an ECS instance goes down and the "Out of memory and no killable processes" error message appears?

Fix an ECS instance crash caused by the "Out of memory and no killable processes" kernel panic by checking for kernel memory leaks and reviewing oom_score_adj settings.

Symptom

An instance crashes at runtime, and the kernel log shows a task list followed by a panic message and call stack similar to the following:

[28663.625353] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[28663.625363] [ 1799]     0  1799    26512      245      56       3        0         -1000 sshd
[28663.625367] [29219]     0 29219    10832      126      26       3        0         -1000 systemd-udevd
[28663.625375] Kernel panic - not syncing: Out of memory and no killable processes...
[28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G           OE   3.10.0-1062.9.1.el7.x86_64 #1
[28663.676873] Call Trace:
[28663.679312]  [<ffffffff8139f342>] dump_stack+0x63/0x81
[28663.684421]  [<ffffffff811b2245>] panic+0xf8/0x244
[28663.689184]  [<ffffffff811b98db>] out_of_memory+0x2eb/0x550
[28663.694726]  [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0
[28663.700959]  [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40
[28663.707279]  [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260
[28663.713599]  [<ffffffff81216535>] alloc_pages_current+0x95/0x140
[28663.719573]  [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40
[28663.725113]  [<ffffffff81075dae>] pgd_alloc+0x1e/0x160
[28663.730225]  [<ffffffff810875e4>] mm_init+0x184/0x240
[28663.735249]  [<ffffffff81088102>] mm_alloc+0x52/0x60
[28663.740186]  [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780
[28663.759839]  [<ffffffff81257b9c>] do_execve+0x2c/0x30
[28663.764864]  [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150
[28663.777246]  [<ffffffff81741dd9>] ret_from_fork+0x39/0x50

Cause

The kernel fails to allocate memory, and the OOM killer finds no process that it is allowed to kill to free memory, so the kernel panics and the instance crashes. Possible causes:

A kernel memory leak exhausts available memory.
Processes with oom_score_adj set to -1000 consume excessive memory and cannot be killed.

Note
oom_score_adj is an integer from -1000 to 1000 that controls OOM kill priority. A lower value makes a process less likely to be killed; a higher value makes it more likely. A value of -1000 exempts the process from OOM killing entirely.
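
The task list that the kernel prints just before the panic (as in the Symptom section) already shows which processes were protected. The following is a minimal sketch, assuming the log has been saved to a file and uses the column layout shown above; the function name and the file name panic.log are hypothetical. It prints the name and rss (in pages) of every task whose oom_score_adj is -1000.

```shell
#!/bin/sh
# Sketch: list tasks with oom_score_adj = -1000 from a saved OOM task dump.
# Assumes the column order shown in the Symptom section:
#   ... rss nr_ptes nr_pmds swapents oom_score_adj name
# Fields are counted from the end of the line because "[ 1799]" and "[29219]"
# split into a different number of leading fields.
oom_exempt_from_log() {
    awk 'NF >= 10 && $(NF-1) == "-1000" {
             printf "%s\t%s pages\n", $NF, $(NF-5)
         }' "$1"
}

# Example invocation (panic.log is a hypothetical file name):
#   oom_exempt_from_log panic.log
```

If the listed tasks account for only a small amount of memory, as in the sample log, a kernel memory leak is the more likely cause. Process names that contain spaces are not handled by this sketch.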

Solution

Important

Before you proceed, create snapshots to back up instance data to prevent data loss. See Create snapshot manually.

Check for a kernel memory leak.

See What do I do if an instance has a high percentage of slab_unreclaimable memory?
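
As a quick first check before you follow the linked topic, you can compare the SUnreclaim value in /proc/meminfo with MemTotal. The following is a minimal sketch; the function name is hypothetical, and what counts as a high percentage depends on your workload.

```shell
#!/bin/sh
# Sketch: print unreclaimable slab memory as a percentage of total memory.
# Accepts an optional path so that it can also read a saved copy of meminfo.
sunreclaim_percent() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/   { total = $2 }
         /^SUnreclaim:/ { slab = $2 }
         END { if (total > 0) printf "%.1f\n", slab * 100 / total }' "$meminfo"
}

# Run against the live system only when /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    sunreclaim_percent
fi
```

A value that keeps growing over time while the workload stays constant suggests a leak in kernel memory rather than in a user process.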
Verify the oom_score_adj settings.
1. Get the PID of the target process using ps, top, or pgrep:
```
ps aux | grep <Process name>
```
  Replace <Process name> with the actual process name.
2. Check the oom_score_adj value:
```
cat /proc/<PID>/oom_score_adj
```
  Replace <PID> with the actual PID.
  
  Evaluate whether the oom_score_adj values are appropriate for your environment. A process with oom_score_adj set to -1000 is exempt from OOM killing, so if it consumes excessive memory, the kernel cannot reclaim that memory and available memory may run out.
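
To review all processes at once instead of one PID at a time, the two steps above can be combined. The following is a minimal sketch, assuming a Linux /proc file system; the function name is hypothetical. It prints the PID, name, and resident memory (VmRSS) of every process whose oom_score_adj is -1000.

```shell
#!/bin/sh
# Sketch: list processes that are exempt from OOM killing, with their RSS.
# Accepts an optional proc root so that it can be tested against a copy.
list_oom_exempt() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # A process may exit while the loop runs, so skip unreadable entries.
        [ -r "$dir/oom_score_adj" ] || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        [ "$adj" = "-1000" ] || continue
        name=$(cat "$dir/comm" 2>/dev/null)
        rss=$(awk '/^VmRSS:/ { print $2 }' "$dir/status" 2>/dev/null)
        printf '%s\t%s\t%s kB\n' "${dir##*/}" "${name:-?}" "${rss:-0}"
    done
}

# Run against the live system only when /proc is available.
if [ -d /proc/self ]; then
    list_oom_exempt
fi
```

If a listed process does not need this protection, a root user can write a higher value, for example `echo 0 > /proc/<PID>/oom_score_adj`. The change applies only until the process exits, so also update whatever sets the value at startup.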