What do I do if an ECS instance goes down and the "Out of memory and no killable processes" error message appears?

If your Elastic Compute Service (ECS) instance goes down and the "Out of memory and no killable processes" error message appears in an error log, you can use the solution described in this topic to fix the issue.

Problem description

The instance goes down while it is running, and the kernel log shows a process table and a call stack similar to the following:

[28663.625353] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[28663.625363] [ 1799]     0  1799    26512      245      56       3        0         -1000 sshd
[28663.625367] [29219]     0 29219    10832      126      26       3        0         -1000 systemd-udevd
[28663.625375] Kernel panic - not syncing: Out of memory and no killable processes...
[28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G           OE   3.10.0-1062.9.1.el7.x86_64 #1
[28663.676873] Call Trace:
[28663.679312]  [<ffffffff8139f342>] dump_stack+0x63/0x81
[28663.684421]  [<ffffffff811b2245>] panic+0xf8/0x244
[28663.689184]  [<ffffffff811b98db>] out_of_memory+0x2eb/0x550
[28663.694726]  [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0
[28663.700959]  [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40
[28663.707279]  [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260
[28663.713599]  [<ffffffff81216535>] alloc_pages_current+0x95/0x140
[28663.719573]  [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40
[28663.725113]  [<ffffffff81075dae>] pgd_alloc+0x1e/0x160
[28663.730225]  [<ffffffff810875e4>] mm_init+0x184/0x240
[28663.735249]  [<ffffffff81088102>] mm_alloc+0x52/0x60
[28663.740186]  [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780
[28663.759839]  [<ffffffff81257b9c>] do_execve+0x2c/0x30
[28663.764864]  [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150
[28663.777246]  [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
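The process table at the top of a dump like the one above already shows which processes the kernel could not kill. The following is a minimal sketch, not an official tool, that extracts those entries from a saved copy of the log. The function name `oom_dump_exempt` is hypothetical, and the conversion to kB assumes 4 KiB pages (the rss column in the dump is counted in pages) and process names without spaces.

```shell
# Print the name and resident memory of every process in an OOM task dump
# whose oom_score_adj is -1000.
# Fields are counted from the end of the line because the "[ pid ]" column
# may or may not contain a space. Assumes 4 KiB pages.
# Usage: oom_dump_exempt <saved-log-file>
oom_dump_exempt() {
    awk 'NF >= 9 && $(NF-1) == "-1000" {
             printf "%s\t%d kB\n", $NF, $(NF-5) * 4
         }' "$1"
}
```

For the sample output above, this reports sshd at 980 kB and systemd-udevd at 504 kB. Both are small, which suggests that the memory was consumed elsewhere, for example by the kernel itself.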

Cause

When the operating system kernel of an instance cannot allocate memory to a process, it attempts to kill a process to release memory. If no process that runs on the instance is eligible to be killed, the kernel panics and the instance goes down. The issue may be caused by the following reasons:

A memory leak occurs in the operating system kernel, which causes insufficient available memory in the system.
Processes whose oom_score_adj value is set to -1000 use excessive memory. Because these processes cannot be killed, the available memory in the system becomes insufficient.
Note
The value of oom_score_adj is an integer from -1000 to 1000 that indicates how likely the kernel is to select a process for killing under Out of Memory (OOM) conditions. A lower value makes a process less likely to be selected, and a higher value makes it more likely. A value of -1000 exempts the process from OOM killing entirely.

Solution

Important

Before you perform the operations in the solution on a Linux instance on which the issue occurred, we recommend that you create snapshots for the Linux instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot for a disk.

Check whether a memory leak occurs in the operating system kernel.
For more information, see What do I do if an instance has a high percentage of slab_unreclaimable memory?
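As a quick first check before you follow the linked topic, you can compare unreclaimable slab memory with total memory. This is a minimal sketch that assumes the `SUnreclaim` and `MemTotal` fields of /proc/meminfo are the relevant indicators. The function name `sunreclaim_percent` is hypothetical, and no particular threshold is implied.

```shell
# Print the percentage of total memory held by unreclaimable slab.
# An optional argument points the function at a saved copy of /proc/meminfo.
sunreclaim_percent() {
    awk '/^MemTotal:/   { total = $2 }
         /^SUnreclaim:/ { slab = $2 }
         END { if (total > 0) printf "%.1f\n", slab * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Check the running system.
sunreclaim_percent
```

A value that keeps growing over time while the workload stays constant is one possible sign of a kernel memory leak.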
Check whether the oom_score_adj value is properly set.
1. Run the ps, top, or pgrep command to obtain the PID of a specific process. Sample command:
```
ps aux | grep <Process name>
```
  Replace <Process name> with the name of the process whose PID you want to obtain.
2. Run the following command to check the oom_score_adj value:
```
cat /proc/<PID>/oom_score_adj
```
  Replace <PID> with the actual PID that you obtained.
  Based on your environment and requirements, use the oom_score_adj value to evaluate whether the OOM killing settings for the process are reasonable. If the oom_score_adj value of a process is -1000, the kernel never selects the process for OOM killing. If such a process uses excessive memory, the available memory in the system may become insufficient and no process can be killed to release it.
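Checking processes one at a time can be slow. The following is a minimal sketch, under the assumption that /proc is mounted in the usual place, that lists every process whose oom_score_adj is -1000 together with its resident memory. The function name `list_oom_exempt` and its optional `proc_root` argument are hypothetical and exist only for illustration and testing.

```shell
# List processes that are exempt from OOM killing (oom_score_adj = -1000)
# as: PID <tab> name <tab> resident memory.
# Kernel threads have no VmRSS line and are reported as 0 kB.
list_oom_exempt() {
    proc_root="${1:-/proc}"
    for f in "$proc_root"/[0-9]*/oom_score_adj; do
        [ -r "$f" ] || continue
        adj=$(cat "$f" 2>/dev/null) || continue   # process may have exited
        [ "$adj" = "-1000" ] || continue
        dir=${f%/oom_score_adj}
        pid=${dir##*/}
        name=$(cat "$dir/comm" 2>/dev/null) || true
        rss=$(awk '/^VmRSS:/ { print $2 }' "$dir/status" 2>/dev/null) || true
        printf '%s\t%s\t%s kB\n' "$pid" "${name:-?}" "${rss:-0}"
    done
}

# Scan the running system.
list_oom_exempt
```

Some system services, such as sshd, set this value on purpose, so an entry in the list is not a problem by itself. If a large process is exempt unintentionally, a user with root permissions can assign a different value by writing it to /proc/<PID>/oom_score_adj.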