OOM Killed in Flink Containerized Environment

JVM memory partitions

Most Java developers deal with the JVM Heap far more often than with any other JVM memory partition in daily development, so the remaining partitions are often lumped together as Off-Heap memory. For Flink, however, excessive memory usage usually comes from Off-Heap memory, so a deeper understanding of the JVM memory model is necessary.

According to the JVM 8 Spec [1], the memory managed by the JVM covers the following standard partitions: Heap, Method Area (including the Run-Time Constant Pool), PC Register, JVM Stack, and Native Method Stack.

In addition to the standard partitions stipulated in the Spec, specific JVM implementations often add extra partitions for advanced functional modules. Taking the HotSpot JVM as an example, according to the Oracle NMT [5] standard, the JVM memory can be subdivided into the following areas:

● Heap: The memory area shared by all threads, which mainly stores objects created with the new operator. Its release is managed by GC, and it can be used by both user code and the JVM itself.

● Class: Class metadata, corresponding to the Method Area (excluding the Constant Pool) in the Spec, i.e. Metaspace in Java 8.

● Thread: Thread-level memory areas, corresponding to the sum of the PC Register, JVM Stack, and Native Method Stack in the Spec.

● Compiler: Memory used by the JIT (Just-In-Time) compiler.

● Code Cache: A cache used to store code generated by the JIT compiler.

● GC: Memory used by the garbage collector.

● Symbol: Memory for storing symbols (such as field names, method signatures, and interned strings), corresponding to the Constant Pool in the Spec.

● Arena Chunk: A temporary buffer for memory that the JVM requests from the operating system.

● NMT: Memory used by NMT itself.

● Internal: Other memory that does not fall into the above categories, including Native/Direct memory requested by user code.

● Unknown: Memory that cannot yet be classified.

Ideally, we could strictly cap the memory of every partition to ensure that the overall memory of the process stays within the container limit. However, overly strict management brings extra overhead and reduces flexibility. Therefore, in practice the JVM only provides hard upper limits for a few of the partitions exposed to users, while the remaining partitions can be treated as a whole as the memory consumption of the JVM itself.

The JVM parameters that can be used to limit partition memory are summarized below (it is worth noting that the industry has no precise definition of JVM Native memory; Native memory in this article refers to the non-Direct part of Off-Heap memory, so the terms Native and Native Non-Direct are used interchangeably):

● Heap: -Xms / -Xmx

● Class (Metaspace): -XX:MaxMetaspaceSize

● Thread: -Xss (per-thread stack size; the total number of threads is not limited)

● Direct: -XX:MaxDirectMemorySize

● Other Native (non-Direct) memory: no dedicated JVM limit parameter

As can be seen above, it is relatively safe to use Heap, Metaspace, and Direct memory, but the situation of non-Direct Native memory is more complicated: it may be memory used internally by the JVM itself (such as the MemberNameTable mentioned below), JNI dependencies introduced by user code, or Native memory requested by user code itself through sun.misc.Unsafe. In theory, Native memory requested by user code or third-party libraries has to be planned by the user, while the rest of Internal can be attributed to the memory consumption of the JVM itself. In fact, Flink's memory model follows a similar principle.
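To make the distinction concrete, here is a minimal sketch (assuming a JDK where sun.misc.Unsafe is still reachable via reflection, as on JDK 8) of the two ways user code typically acquires Off-Heap memory: Direct memory via ByteBuffer.allocateDirect, which is capped by -XX:MaxDirectMemorySize, and Native memory via Unsafe.allocateMemory, which no JVM parameter limits.

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // Direct memory: counted against -XX:MaxDirectMemorySize and released
        // by a Cleaner once the buffer becomes unreachable.
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        // Native (non-Direct) memory: not tracked by the Direct memory limit;
        // the caller is fully responsible for releasing it.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        long address = unsafe.allocateMemory(64 * 1024 * 1024);

        // Forgetting this call leaks Native memory that only shows up in the
        // process RSS, not in any JVM-managed partition.
        unsafe.freeMemory(address);
    }
}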

Flink TaskManager memory model

First, let's review the TaskManager memory model of Flink 1.10+.

Clearly, the Flink framework not only uses the Heap memory managed by the JVM, but also requests Native and Direct Off-Heap memory that it manages itself. In my opinion, Flink's management strategies for Off-Heap memory can be divided into three types (a configuration sketch follows the list below):

● Hard Limit: A hard-limited memory partition is Self-Contained, and Flink ensures that its usage never exceeds the configured threshold (if memory is insufficient, an OOM-like exception is thrown).

● Soft Limit: A soft limit means that memory usage stays below the threshold most of the time, but may briefly exceed the configured threshold.

● Reserved: Reserved means that Flink does not limit the partition's memory usage; it only sets aside space for it when planning memory, and cannot guarantee that actual usage stays within the reservation.
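As a rough sketch, the knobs that shape these partitions live in flink-conf.yaml; the keys below are the standard Flink 1.10+ options, while the sizes are purely illustrative:

# Total size of the TaskManager process, i.e. the container size to request
taskmanager.memory.process.size: 4096m
# Managed Memory: Native memory allocated and hard-limited by Flink itself
taskmanager.memory.managed.size: 1024m
# Off-Heap memory reserved for user code, counted into -XX:MaxDirectMemorySize
taskmanager.memory.task.off-heap.size: 128m
# Metaspace, translated into a hard limit via -XX:MaxMetaspaceSize
taskmanager.memory.jvm-metaspace.size: 256m
# JVM Overhead: reserved when planning memory, but not hard-limited
taskmanager.memory.jvm-overhead.fraction: 0.1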

Combined with the JVM's own memory management, what happens when a Flink memory partition overflows? The judgment logic is as follows:

1. If the partition has a hard limit enforced by Flink, Flink reports that the partition ran out of memory. Otherwise, go to the next step.

2. If the partition belongs to a JVM-managed partition, then when its usage keeps growing and eventually exhausts that JVM partition as well, the JVM reports an OOM for the JVM partition it belongs to (for example, java.lang.OutOfMemoryError: Java heap space; see the sketch after this list). Otherwise, go to the next step.

3. The partition's memory keeps overflowing until the overall memory of the process exceeds the container memory limit. In an environment with strict resource control, the resource manager (YARN, Kubernetes, etc.) kills the process.
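As a minimal illustration of case 2, the sketch below exhausts the Direct memory partition; run it with a small cap such as -XX:MaxDirectMemorySize=64m and the JVM itself reports the OOM, long before the container limit comes into play.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectOomDemo {
    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<>();
        while (true) {
            // Every allocation is counted against -XX:MaxDirectMemorySize;
            // once the cap is reached the JVM throws
            // java.lang.OutOfMemoryError: Direct buffer memory.
            buffers.add(ByteBuffer.allocateDirect(16 * 1024 * 1024));
        }
    }
}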

To show the relationship between Flink's memory partitions and the JVM's memory partitions more intuitively, the author has compiled the following memory partition mapping table.

According to the logic above, among all of Flink's memory partitions, only JVM Overhead, which is neither Self-Contained nor backed by a JVM partition with a hard memory limit, can cause the process to be OOM killed. As a hodgepodge of memory reserved for various purposes, JVM Overhead is indeed prone to problems, but it can also serve as a last-line buffer that absorbs memory overruns from other partitions.

Common Causes of OOM Killed

Consistent with the analysis above, the common causes of OOM Killed in practice are essentially leaks or overuse of Native memory. Because OOM Killed on virtual memory is easy to avoid through resource manager configuration and is usually not a problem, the following only discusses OOM Killed on physical memory.

RocksDB Native Memory Uncertainty

As is well known, RocksDB allocates Native memory directly through JNI, outside of Flink's control, so Flink can only influence its memory usage indirectly by setting RocksDB's memory parameters. However, Flink currently derives these parameters by estimation, and they are not very accurate, for the following reasons.

The first is that part of RocksDB's memory is inherently difficult to account for accurately.

The second is a bug in the RocksDB Block Cache that prevents the cache size from being strictly controlled, so it may briefly exceed the configured capacity; in effect it is only a soft limit.

For this problem, it is usually enough to increase the JVM Overhead threshold so that Flink reserves more memory, because RocksDB's memory overuse is only temporary.
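In configuration terms this typically amounts to something like the following (keys as in Flink 1.10+, values illustrative):

# Let RocksDB size its write buffers and block cache from Flink's managed memory
state.backend.rocksdb.memory.managed: true
# Enlarge the reserved JVM Overhead so temporary overuse stays within the container limit
taskmanager.memory.jvm-overhead.fraction: 0.2
taskmanager.memory.jvm-overhead.max: 2g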

glibc Thread Arena problem

Another common problem is glibc's famous 64 MB problem, which can cause the JVM process's memory usage to grow significantly until the process is eventually killed by YARN.

Specifically, the JVM allocates memory through glibc, and to improve allocation efficiency and reduce fragmentation, glibc maintains memory pools called Arenas, including a shared Main Arena and per-thread Thread Arenas. When a thread needs memory but the Main Arena is locked by another thread, glibc allocates a Thread Arena of roughly 64 MB (on 64-bit machines) for that thread. These Thread Arenas are invisible to the JVM, but they are counted into the process's overall virtual memory (VIRT) and physical memory (RSS).

By default, the maximum number of Arenas is 8 times the number of CPU cores; for an ordinary 32-core server that is up to 16 GB (256 Arenas x 64 MB), which is quite considerable. To control the total amount of memory consumed, glibc provides the environment variable MALLOC_ARENA_MAX to limit the total number of Arenas; Hadoop, for example, sets this value to 4 by default. However, this parameter is only a soft limit: when all Arenas are locked, glibc will still create a new Thread Arena to allocate memory [11], resulting in unexpected memory usage.

Generally speaking, this problem occurs in applications that create threads frequently. For example, the HDFS Client creates a new DataStreamer thread for each file being written, so it is more likely to run into the Thread Arena problem. If you suspect that your Flink application has hit this problem, a relatively simple way to verify it is to check whether the pmap output of the process contains many consecutive anon segments whose size is a multiple of 64 MB; segments of 65536 KB, for example, are very likely Arenas.
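A quick way to check, assuming pmap is available on the host (replace <pid> with the TaskManager process id):

pmap -x <pid> | awk '$2 == 65536'    # segments of exactly 65536 KB in the Kbytes column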


The fix for this problem is relatively simple: set MALLOC_ARENA_MAX to 1, that is, disable Thread Arenas and use only the Main Arena. The price, of course, is lower memory allocation efficiency for threads. It is worth mentioning, however, that overriding the default MALLOC_ARENA_MAX through Flink's process environment variable parameters may not work: when a non-whitelisted variable conflicts, NodeManager merges the original value and the appended value the same way it merges URLs (joining them with a colon), resulting in something like MALLOC_ARENA_MAX="4:1".

Finally, a more thorough option is to replace glibc's allocator with Google's tcmalloc or Facebook's jemalloc [12]. Besides having no Thread Arena issues, they offer better allocation performance and less fragmentation. In fact, the official Flink 1.12 image switched the default memory allocator from glibc to jemalloc [17].
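Switching the allocator is usually done by preloading the library before the JVM starts; the exact library path is distribution-specific, for example:

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2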

JDK8 Native memory leak

Oracle JDK versions prior to 8u152 have a Native memory leak bug [13] that causes the JVM's Internal memory partition to grow continuously.

Specifically, to speed up lookups the JVM caches mappings from string symbols (Symbol) to methods (Method) and fields (Field). Each mapping is called a MemberName, and the whole mapping structure is called the MemberNameTable, which is managed by the java.lang.invoke.MethodHandles class. Before JDK 8u152, the MemberNameTable used Native memory, so outdated MemberNames were not automatically cleaned up by GC, causing a memory leak.
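As a rough idea of the code paths involved, method handle lookups like the one below are what create MemberName entries; this is just an ordinary use of the API, not a reproduction of the leak itself.

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MemberNameDemo {
    public static void main(String[] args) throws Throwable {
        // Resolving a method through MethodHandles produces a MemberName entry,
        // which before JDK 8u152 was kept in the Native MemberNameTable.
        MethodHandle handle = MethodHandles.lookup()
                .findVirtual(String.class, "length", MethodType.methodType(int.class));
        System.out.println((int) handle.invokeExact("flink"));
    }
}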

To confirm this problem, you need to check the JVM's memory status through NMT. For example, the author once encountered a MemberNameTable of more than 400 MB in an online TaskManager.
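NMT is disabled by default; a common way to enable it for the TaskManager and inspect the breakdown (env.java.opts.taskmanager is the standard Flink option for passing extra JVM flags) is roughly:

# flink-conf.yaml: enable Native Memory Tracking on the TaskManager JVM
env.java.opts.taskmanager: -XX:NativeMemoryTracking=detail

# on the TaskManager host, query the per-partition breakdown
jcmd <pid> VM.native_memory summary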

With JDK-8013267 [14], the MemberNameTable was moved from Native memory to the Java Heap, which fixed this problem. However, this is not the only JVM Native memory leak; there are others, such as the C2 compiler leak [15]. So for users who, like the author, do not have a dedicated JVM team, upgrading to the latest JDK version is the best way to avoid such problems.
