This issue occurs because the operating system runs out of physical memory or because the Java Virtual Machine (JVM) process that runs the application exits unexpectedly. This topic uses Linux as an example to describe how to resolve the issue.
Out-of-memory (OOM) killer triggered by low physical memory on the operating system
By default, the OOM killer is enabled on Linux. If the operating system runs low on physical memory and swap space, the OOM killer selectively kills processes. Linux assigns each running process a score called oom_score, which you can view in /proc/<pid>/oom_score. A higher score means that the process is more likely to be killed: when the OOM killer is triggered, it kills the process with the highest score first.
When the OOM killer kills a process, it writes information about the process, such as the process ID (PID), to the operating system (OS) logs. You can therefore search the OS logs to check whether a process was killed by the OOM killer.
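To see which processes the OOM killer would choose first, you can rank the running processes by their oom_score. The following is a minimal sketch, not an official tool: the function name top_oom_scores and its optional proc_root argument are assumptions made for this example. It prints the score, PID, and command name, highest score first.

```shell
#!/bin/sh
# Rank processes by oom_score, highest (most likely to be killed) first.
# Usage: top_oom_scores [count] [proc_root]
# proc_root is a hypothetical parameter for this sketch; it defaults to /proc.
top_oom_scores() {
    count="${1:-10}"
    proc_root="${2:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # A process can exit while we scan, so skip entries we cannot read.
        [ -r "$dir/oom_score" ] || continue
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name="?"
        # Columns: score, PID, command name (tab-separated).
        printf '%s\t%s\t%s\n' "$score" "${dir##*/}" "$name"
    done | sort -k1,1nr | awk -v n="$count" 'NR <= n'
}

# Show the five most likely OOM killer targets on this host.
top_oom_scores 5
```

If the process that runs your application appears near the top of this list, it is a likely candidate the next time memory runs out.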
The following logs show that a process of the ECS cluster is killed by the OOM killer:
[Wed Aug 31 16:36:42 2017] Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child
[Wed Aug 31 16:36:42 2017] Killed process 43805 (keystone-all) total-vm:4446352kB, anon-rss:4053140kB, file-rss:68kB
[Wed Aug 31 16:56:25 2017] keystone-all invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[Wed Aug 31 16:56:25 2017] keystone-all cpuset=/ mems_allowed=0
[Wed Aug 31 16:56:25 2017] CPU: 2 PID: 88196 Comm: keystone-all Not tainted 3.10.0-327.13.1.el7.x86_64 #1
The following logs show that a process of the Swarm cluster is killed by the OOM killer:
Memory cgroup out of memory: Kill process 20911 (beam.smp) score 1001 or sacrifice child
Killed process 20977 (sh) total-vm:4404kB, anon-rss:0kB, file-rss:508kB
To search the OS logs for these entries, run the following command:
root# grep -i 'killed process' /var/log/messages
You can also run the following command:
root# grep -E "oom-killer|total-vm" /var/log/messages
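When the log contains many entries, it can help to reduce each "Killed process" line to the PID and the process name. The following is a small sketch that assumes the log line format shown in the examples above. The function name list_oom_kills is made up for this example.

```shell
# Print "PID NAME" for each process that the OOM killer reported killing.
# Usage: list_oom_kills [log_file]   (reads standard input if no file is given)
list_oom_kills() {
    # Match "Killed process <pid> (<name>)" and keep only the PID and the name.
    sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p' "$@"
}

# Example calls (run as root; which log source exists depends on the distribution):
#   list_oom_kills /var/log/messages
#   dmesg | list_oom_kills
#   journalctl -k | list_oom_kills
```

For the ECS sample log above, this prints "43805 keystone-all". Lines that only announce the decision ("Kill process ... or sacrifice child") are skipped on purpose, because they do not confirm that the process was actually killed.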
If a process was killed by the OOM killer, use one of the following methods to resolve the issue:
- Increase the physical memory size of the ECS instances. Alternatively, reduce the amount of memory that is allocated to the killed process.
- Check whether a swap partition is mounted on the ECS instance. If no swap partition is mounted, mount one. In most cases, the OOM killer is triggered because one or more ECS instances have no swap partition. Using swap space degrades the performance of the instance, but it helps keep processes from being killed when physical memory runs low.
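One way to check for swap space is to read the SwapTotal value from /proc/meminfo. The following is a minimal sketch. The function names swap_total_kb and has_swap and the optional file argument are assumptions made for this example.

```shell
# Print the configured swap size in kB, read from a meminfo-style file.
# Usage: swap_total_kb [meminfo_file]
# The file argument is hypothetical (useful for testing); it defaults to /proc/meminfo.
swap_total_kb() {
    awk '$1 == "SwapTotal:" { print $2; found = 1 }
         END { if (!found) print 0 }' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) only if some swap space is configured.
has_swap() {
    [ "$(swap_total_kb "$@")" -gt 0 ]
}

# Report the swap status of this host if /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    if has_swap; then
        echo "Swap is configured: $(swap_total_kb) kB"
    else
        echo "No swap space is configured"
    fi
fi
```

The swapon --show and free commands report the same information interactively.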
Unexpected exit of the JVM process that runs the application
A running JVM process may crash due to invalid Java Native Interface (JNI) calls, off-heap memory errors in native C code, or other errors. When this happens, the JVM generates a fatal error log in the working directory of the JVM process. By default, the file is named hs_err_pid<jvm_pid>.log. While the process is still running, you can run the pwdx <jvm_pid> command to query its working directory. In the log file, you can identify the cause of the error or the thread that was running when the error occurred. You can also enable core dumps for further analysis.
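The fatal error log can be long, so a quick first step is to pull out the error line, the problematic frame, and the current thread. The following is a rough sketch that assumes the usual HotSpot log layout (a "#  SIG..." style error line, a "# Problematic frame:" header followed by the frame, and a "Current thread" line). The exact layout varies between JVM versions, and the function name hs_err_summary is made up for this example.

```shell
# Print the key lines of a HotSpot fatal error log: the error or signal line,
# the problematic frame, and the thread that was running at the time.
# Usage: hs_err_summary <hs_err_log_file>
hs_err_summary() {
    awk '
        # Error line, for example "#  SIGSEGV (0xb) at pc=..., pid=..., tid=..."
        /^# +(SIG[A-Z]+|Internal Error|Out of Memory Error) / {
            sub(/^# +/, ""); print "error: " $0
        }
        # The line that follows "# Problematic frame:" names the failing frame.
        grab { sub(/^# +/, ""); print "frame: " $0; grab = 0 }
        /^# Problematic frame:/ { grab = 1 }
        /^Current thread/ { print "thread: " $0 }
    ' "$1"
}

# Example call (the directory is a placeholder for the JVM working directory):
#   hs_err_summary "$(ls -t /path/to/workdir/hs_err_*.log | head -n 1)"
```

A frame that starts with "C" points at native code, which is consistent with a JNI or native memory problem. To also get a core dump on the next crash, the core size limit is typically raised with ulimit -c unlimited in the shell that starts the JVM.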
In addition, you can enable Analysis of Abnormal Exit on the Basic Information tab of the application in the Enterprise Distributed Application Service (EDAS) console. If the application monitoring and alerting feature is enabled, an alert is triggered when the JVM process unexpectedly exits. In this case, you can log on to the ECS instance to query logs and perform analysis to identify the cause.