This problem usually occurs for the following reasons (the examples below are based on Linux operating systems).
When the operating system runs out of both physical memory and swap space, the OOM Killer mechanism (enabled by default) selectively kills processes. How does the OOM Killer decide which process to kill first? In Linux, every process has an oom_score (located in /proc/<pid>/oom_score). The higher the oom_score, the more likely the process is to be selected and killed by the OOM Killer.
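You can inspect these scores directly from procfs. The following sketch reads the current shell's own score and then ranks running processes by score; the paths are standard procfs, but the ranking one-liner is only an illustration:

```shell
# Print the oom_score of the current shell (any PID works the same way).
cat /proc/self/oom_score

# List the five highest-scoring processes, i.e. the likeliest OOM Killer victims.
# Processes may exit mid-scan, so read errors are silenced.
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -5
```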
When a process is killed by OOM Killer, its PID and other information are logged. You can query the operating system logs to check whether a process has been killed by OOM Killer.
The log indicating that an ECS cluster process is killed by OOM Killer is as follows:
[Wed Aug 31 16:36:42 2017] Out of memory: Kill process 43805 (keystone-all) score 249 or sacrifice child
[Wed Aug 31 16:36:42 2017] Killed process 43805 (keystone-all) total-vm:4446352kB, anon-rss:4053140kB, file-rss:68kB
[Wed Aug 31 16:56:25 2017] keystone-all invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[Wed Aug 31 16:56:25 2017] keystone-all cpuset=/ mems_allowed=0
[Wed Aug 31 16:56:25 2017] CPU: 2 PID: 88196 Comm: keystone-all Not tainted 3.10.0-327.13.1.el7.x86_64 #1
The log indicating that a Swarm cluster process is killed by OOM Killer is as follows:
Memory cgroup out of memory: Kill process 20911 (beam.smp) score 1001 or sacrifice child
Killed process 20977 (sh) total-vm:4404kB, anon-rss:0kB, file-rss:508kB
In summary, you can query the logs with either of the following commands:
root# grep -i 'killed process' /var/log/messages
root# egrep "oom-killer|total-vm" /var/log/messages
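Note that /var/log/messages applies to RHEL-style distributions; on Debian or Ubuntu, kernel messages typically land in /var/log/syslog instead (an assumption about your logging setup). On either family, the kernel ring buffer can be searched directly:

```shell
# Search the kernel ring buffer for OOM Killer activity, independent of which
# syslog file the distribution uses (-T prints human-readable timestamps).
dmesg -T | egrep -i 'killed process|oom-killer'
```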
The following solutions are available for this problem:
- Expand the physical memory of the ECS instance, or reduce the amount of memory allocated to the killed process.
- Check whether the ECS instance has a swap partition mounted. If not, look up how to create and mount a swap partition on Linux and mount one yourself. By default, ECS instances do not have swap partitions, and most OOM Killer problems in the Alibaba Cloud ECS environment are caused by this lack of swap space. Process health matters more than raw performance.
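A common way to add swap to an instance that has no swap partition is a swap file. The following is a minimal sketch; the 2 GiB size and the /swapfile path are examples only, and all commands require root:

```shell
# Create a 2 GiB file backed by zeros (adjust count to the size you need).
dd if=/dev/zero of=/swapfile bs=1M count=2048
# Swap files must not be world-readable.
chmod 600 /swapfile
# Format the file as swap and enable it.
mkswap /swapfile
swapon /swapfile
# Persist across reboots.
echo '/swapfile none swap defaults 0 0' >> /etc/fstab
# Verify that the swap space is active.
swapon --show
free -m
```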
A JVM process that crashes at runtime usually does so because of abnormal JNI calls, a C heap OOM, or other bugs. When this problem occurs, a crash log file is generated in the working directory of the JVM process (run the pwdx <jvm_pid> command to locate that directory).
Typically, the log file lets you identify the thread that was executing when the JVM process crashed, or the cause of the crash itself. If necessary, enable the generation of coredump files for further analysis.
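To enable coredump generation before reproducing the crash, raise the core file size limit in the shell that launches the JVM. A sketch, assuming a typical Linux setup; the core_pattern path shown is only an example and requires root:

```shell
# Show the current core file size limit; 0 means core dumps are disabled.
ulimit -c
# Remove the limit for this shell and every process it starts (e.g. the JVM).
ulimit -c unlimited
# Optionally choose where cores are written (example pattern; requires root):
# echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
```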
In addition, you can select the Abnormal Exit Analysis option on the Application Information page of the EDAS console. If the application monitoring and alarm function is enabled, an alarm is generated when the JVM process terminates unexpectedly; you can then log on to the ECS instance and check the logs to analyze the cause.