Alibaba Dragonwell ZGC – Part 1: New Garbage Collector ZGC Unboxing and the First Experience of ZGC

Part 1 of this 3-part series introduces the basic concepts of GC and the large-scale practice of ZGC.

By Hao Tang

Java garbage collection (GC) is a powerful tool that improves the efficiency of Java development. However, the pauses caused by GC often hurt the response time of Java applications. Java 11 introduced the Z Garbage Collector (ZGC), which keeps GC pause times within a few milliseconds. This new garbage collector opens a new way to optimize the response time of Java applications and has become one of the reasons for many applications to switch from Java 8 to Java 11.

Alibaba Dragonwell 11 is a free and production-ready OpenJDK 11 release that provides long-term support with performance enhancements and security fixes. The first release of Dragonwell 11 already provided the experimental ZGC feature of Java 11. As more and more applications adopted ZGC to improve response time, more challenging problems surfaced in practice. Alibaba then released a version of Dragonwell 11 in which ZGC was upgraded from the experimental feature of OpenJDK 11 to a production-ready feature. The upgrade ensures the quality and stability of Dragonwell 11 long-term support.

This article is the first in a three-part technical series on ZGC in Alibaba Dragonwell 11. Part 1 introduces the basic concepts of GC and the large-scale practice of ZGC. Part 2 will introduce the principles and tuning of ZGC. Part 3 will introduce how Dragonwell made ZGC production-ready.

A Brief Introduction to Java GC

Garbage collection (GC) is the automatic memory management mechanism of the Java language. A garbage collector destroys dead objects (objects that can no longer be referenced) to free memory for later use. With the help of GC, Java developers only need to focus on their own business logic: they create objects with new expressions and never write statements to destroy them. This improves both development efficiency and code quality.

GC Performance Metrics

There is no such thing as a free lunch. GC brings convenience, but it also has significant side effects. For Java services, users care about two metrics: throughput (QPS, queries per second) and RT (response time). GC usually has a negative impact on both. GC pauses increase the RT, especially the RT P99/P999 of long-tail requests (the RT of the request at the 99th/99.9th percentile when requests are sorted from fast to slow). To ensure the correctness of the GC algorithm, GC needs to suspend all Java threads (to avoid data races between GC threads and Java threads). Java threads cannot respond to any requests during the pause, so the RT of the service becomes longer.
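
As a rough illustration of why long-tail metrics are so sensitive to pauses, the sketch below computes percentiles with the nearest-rank method. The class name, the sample latencies, and the 100 ms pause are hypothetical values chosen for the example, not measurements from this article.

```java
import java.util.Arrays;

public class LatencyPercentiles {
    /** Nearest-rank percentile: the smallest sample such that at least p% of all samples are <= it. */
    public static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical workload: 1000 requests that normally take 5 ms,
        // 15 of which were delayed by an assumed 100 ms GC pause.
        long[] samples = new long[1000];
        Arrays.fill(samples, 5);
        for (int i = 0; i < 15; i++) {
            samples[i] = 105;
        }
        System.out.println("P50  = " + percentile(samples, 50) + " ms");
        System.out.println("P99  = " + percentile(samples, 99) + " ms");
        System.out.println("P999 = " + percentile(samples, 99.9) + " ms");
    }
}
```

In this made-up data set only 1.5% of the requests overlap with a pause, so the median stays at 5 ms, yet P99 and P999 both land on the delayed requests.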

GC also reduces the throughput (the upper limit of QPS). GC threads occupy additional CPU resources and therefore reduce the share of CPU available to Java threads. People often think that GC must go hand in hand with a pause, but this view is not correct. Modern Java GCs can start concurrent GC threads that run at the same time as Java threads.
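
One way to get a first impression of how much GC activity a service has is the standard java.lang.management API, which exposes a collection count and an accumulated collection time per collector. The following is a minimal sketch; the class name and the allocation loop are made up for the example, and what exactly the reported time covers (pauses only, or whole cycles) depends on the collector and JDK version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class GcActivity {
    /**
     * Returns collector name -> {collection count, accumulated collection time in ms}.
     * A value of -1 means the collector does not report that figure.
     */
    public static Map<String, long[]> snapshot() {
        Map<String, long[]> stats = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            stats.put(gc.getName(), new long[] {gc.getCollectionCount(), gc.getCollectionTime()});
        }
        return stats;
    }

    public static void main(String[] args) {
        // Allocate some short-lived objects so that the collectors have work to do.
        long allocated = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[1024];
            allocated += junk.length;
        }
        System.out.println("Allocated about " + allocated / (1024 * 1024) + " MB of short-lived objects");
        snapshot().forEach((name, s) ->
                System.out.println(name + ": count=" + s[0] + ", time=" + s[1] + " ms"));
    }
}
```

Taking two snapshots some time apart and comparing them gives a coarse view of GC overhead over that interval; GC logs remain the more precise source.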

Smooth Experience with ZGC

The Java language provides several types of GC mechanisms to satisfy different needs. These GC mechanisms have different characteristics in terms of throughput and response time:

  • Parallel GC: High throughput and long pauses
  • G1 GC, CMS GC: Both throughput and pause time are relatively good. G1 GC is the default GC (with a target pause time of 200 ms), while CMS GC is not recommended in Java 11.
  • ZGC, Shenandoah GC: The GC pause time is short, and the throughput is average.

ZGC, the protagonist of this article, is a new generation of GC introduced in OpenJDK 11. Its pause time can be kept below 10 ms, and it can handle heaps at the terabyte level.

The GCs before Java 11 typically have pause times of more than 100 ms, which hurts metrics such as RT P99. A Java service running on them feels like a car stumbling over a potholed road. With millisecond-level pauses, ZGC helps reduce RT P99, and the Java service runs much more smoothly. In most cases, only the heap size and the number of concurrent GC threads need to be adjusted for ZGC, which makes GC tuning easier for developers.

ZGC Practice

This section first describes the scenarios in which ZGC is a good fit, so that you can select the right GC, and then shows how we put ZGC into practice. After evaluating the characteristics of our applications, we ran the suitable ones on ZGC and achieved improved RT. However, the ZGC of OpenJDK 11 is still in the experimental stage, and we encountered some problems in practice. We documented these issues as risk items and tried to address them in Dragonwell 11.

ZGC Applicable Scenarios

ZGC achieves excellent millisecond-level pauses, but the side effect is that it may reduce throughput. (The ZGC project homepage states a throughput reduction of up to 15%.) There are three reasons:

  1. The ZGC of Java 11 is a single-generation GC. Each round of ZGC has to process long-lived objects (objects that are still alive after multiple GCs). The GCs before Java 11 are generational GCs, which do not need to deal with long-lived objects in every collection.
  2. ZGC needs to enable concurrent GC threads, which could reduce the CPU share used by Java threads.
  3. ZGC's read barrier (described later) adds overhead to every operation that loads an object reference from the heap. In addition, since ZGC does not support compressed pointers, it cannot benefit from the performance improvement that compressed pointers bring on heaps smaller than 32 GB.
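
The 32 GB figure in the list above follows from how HotSpot compressed pointers work: a reference is stored as a 32-bit value that is scaled by the object alignment, which is 8 bytes by default. The sketch below only reproduces this arithmetic; the class and method names are made up for the example.

```java
public class CompressedOopsLimit {
    /** Largest heap that a 32-bit compressed reference can address for a given object alignment. */
    public static long maxHeapBytes(int objectAlignmentBytes) {
        return (1L << 32) * objectAlignmentBytes;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        // 2^32 distinct reference values, each pointing at an 8-byte-aligned address.
        System.out.println("Limit with 8-byte alignment: " + maxHeapBytes(8) / gb + " GB");
    }
}
```

Collectors that support compressed pointers can therefore use 4-byte references on heaps below this limit, while ZGC always uses full 8-byte references.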

Based on the description of the characteristics of ZGC, the author summarizes the applicable scenarios of ZGC for users that intend to switch to ZGC:

  1. Java services that have high requirements for long-tail requests such as RT P99/P999: These services require real-time responses and are sensitive to the slowest 1% or 0.1% requests.
  2. The memory and CPU resources of the machine are abundant. Rich computing resources can enable a larger heap and more concurrent GC threads.
  3. Tolerable Throughput Reduction: After weighing the trade-off, the service considers RT P99/P999 metrics to be more important than QPS metrics.
  4. Few Long-Lived Objects: The ZGC of Java 11 is not generational and cannot process such objects efficiently.

In addition, if the Java service is still running on Java 8, you must consider the cost of switching to Java 11.

ZGC Scale Practice

At Alibaba, many Java applications have strict requirements for the RT of long-tail requests. These applications upgraded to Java 11 and adopted ZGC in order to break through the RT bottleneck caused by GC pauses. The following examples show how Alibaba uses ZGC to improve RT. The "Concurrent Mark/Relocate" phases mentioned in this section will be explained in the second article of this series.

1. High-Performance Databases: Lindorm is an internal high-performance branch of Alibaba NoSQL HBase. Lindorm has been running steadily on ZGC for nearly two years, during which it passed the test of Double 11. In Lindorm, the ZGC pause is stable at about 5 ms, and the maximum is no longer than 8 ms. ZGC has improved the RT glitch metrics of online clusters, with the average RT improved by 15% to 20% and the P999 RT reduced to less than half. During Double 11 in 2019, thanks to ZGC, the Lindorm RT P999 was reduced from 12 ms to 5 ms. The following figure shows the GC pause performance of Lindorm on ZGC. (Unit: microseconds).


2. Message Queue Applications: RocketMQ no longer relies on local file systems and supports distributed file systems as storage to improve its auto scaling capability. RocketMQ originally used G1 GC, but the GC pause reached more than 200 ms and could not be lowered even after extensive tuning. Investigation showed that the main cause of the pause is that the C library has to be called through JNA when accessing the distributed file system, and JNA relies on finalizers to reclaim objects in native memory. These objects are reclaimed only after at least two G1 GC cycles, so a large number of objects are carried over between collections (about 500,000 objects are transferred in each GC), which leads to long pauses. ZGC reclaims native objects in the concurrent phase and avoids the long pauses. RocketMQ only sets the heap size of ZGC and the number of concurrent threads. With this setup, the current online GC pause is less than 2 ms, which reduces the glitches of system access. The following figure shows the RT metrics after RocketMQ uses ZGC.


3. Risk Control Calls: Some online applications are sensitive to the duration of risk control calls. These applications set a short service call timeout (< 50 ms). One Young GC of the risk control system takes around 60 ms, so a service call that runs into a GC will time out. Take the red envelope business as an example. If a timeout occurs, the red envelope will either not be issued or will be issued but taken away by bargain hunters, which hurts the business. ZGC is required to improve availability. The pause time of ZGC running online is within 10 ms, which meets the requirements of these RT-sensitive applications. Because the risk control system holds a large number of cached objects, the Concurrent Mark phase takes a long time, which affects the throughput. The risk control system therefore reduced its cache of long-lived objects so that ZGC runs more smoothly, thereby increasing the upper limit of QPS.

Risks of OpenJDK 11 ZGC

In the early days of the preceding practice, we adopted the ZGC of OpenJDK 11. However, this feature is only experimental in that release. Since OpenJDK 11 released the experimental ZGC, the stability of ZGC has been enhanced, and its functions have been improved. ZGC became a production-ready feature when OpenJDK 15 was released. OpenJDK 11 is a long-term support version, but the currently released OpenJDK 12 to 16 are not. Therefore, it is difficult to deploy OpenJDK 15/16 in production just to use the production-ready ZGC.

The preceding practice shows that for Java heaps from 10 GB to hundreds of GB, ZGC can indeed keep pauses within 10 ms. However, all of these applications report that QPS is not improved. This means that ZGC does not perform well in throughput-oriented scenarios. In such scenarios, the reclaim speed of ZGC cannot keep up with the allocation speed, and an allocation stall occurs: the thread that is creating an object is paused and waits for ZGC to free space. Lindorm also reported that the overall effect of ZGC on small heaps was not as good as G1 GC. In addition, the practices above encountered some problems with the experimental ZGC of OpenJDK 11:

1. Unforeseeable Crash: Lindorm found that two variables pointing to the same object were sometimes evaluated as not equal. The RocketMQ business found that the program crashed during operation without any warning signs. Careful investigation showed that the read barrier was separated from the load operation, so a GC pause could occur between them. This issue has been fixed in JDK 14.

2. In the Worst Case, OOM Occurs: The risk control service noticed that ZGC may throw OOM. This phenomenon occurs in the Concurrent Relocate phase. ZGC reserves space for Concurrent Relocate, but the ZGC code of JDK 11 cannot guarantee that the reserved space is sufficient. If objects are relocated quickly, OOM may be thrown. This issue is resolved in OpenJDK 16.

3. Page Cache Flush Affects RT: ZGC divides the heap into small, medium, and large ZPages. (Objects of different sizes are allocated to different types of ZPages.) If the allocation rate of objects of different sizes is unstable (for example, if medium-sized objects suddenly increase), small or large ZPages need to be converted into medium ZPages, which takes a long time. Lindorm noticed that this phenomenon can affect RT. The ZGC of OpenJDK 15 alleviates this phenomenon.
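
To make the ZPage size classes in item 3 more concrete, the following sketch maps an object size to a page type. The thresholds are assumptions taken from the upstream OpenJDK 11 defaults on Linux/x86_64 as far as we know (2 MB small pages for objects up to 256 KB, 32 MB medium pages for objects up to 4 MB, large pages for anything bigger); they may differ in other releases. The class and method names are hypothetical.

```java
public class ZPageClass {
    static final long KB = 1024;
    static final long MB = 1024 * KB;

    // Assumed object size limits, not read from a running JVM.
    static final long SMALL_OBJECT_LIMIT = 256 * KB;  // objects up to this size go to small pages
    static final long MEDIUM_OBJECT_LIMIT = 4 * MB;   // objects up to this size go to medium pages

    /** Returns the ZPage type that an object of the given size would be allocated in. */
    public static String pageTypeFor(long objectSizeBytes) {
        if (objectSizeBytes <= SMALL_OBJECT_LIMIT) {
            return "small";
        }
        if (objectSizeBytes <= MEDIUM_OBJECT_LIMIT) {
            return "medium";
        }
        return "large";
    }

    public static void main(String[] args) {
        long[] sizes = {64, 300 * KB, 2 * MB, 16 * MB};
        for (long size : sizes) {
            System.out.println(size + " bytes -> " + pageTypeFor(size) + " page");
        }
    }
}
```

Under these assumptions, a burst of objects between 256 KB and 4 MB (for example, large buffers) is what would shift demand toward medium pages.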

These are the problems that need to be solved before ZGC can be used in production. Alibaba Dragonwell 11 is a downstream of OpenJDK 11 and inherits all of its features, including ZGC. Part 3 of this 3-part series will introduce how ZGC was made production-ready on Dragonwell 11. Before that, readers can try to enable ZGC in Dragonwell 11.

Enable ZGC in Dragonwell 11

Java developers need to update the JDK to Alibaba Dragonwell or later. To enable ZGC, pass the -XX:+UseZGC option at Java startup. Readers can browse the relevant tuning options of Dragonwell ZGC.
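
As a minimal sketch of what such a startup command can look like, the helper below assembles a java command line with ZGC, a heap size, and a number of concurrent GC threads, and checks whether the current JVM was started with ZGC. The class name, the 16 GB heap, the 4 threads, and app.jar are hypothetical. In upstream OpenJDK 11, ZGC is experimental and also requires -XX:+UnlockExperimentalVMOptions before -XX:+UseZGC; whether a given Dragonwell 11 release still needs that flag is an assumption to verify against its release notes.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ZgcLaunch {
    /** Builds a java command line that enables ZGC with the given heap size and concurrent GC threads. */
    public static List<String> buildCommand(int heapGb, int concGcThreads, boolean experimental, String jar) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        if (experimental) {
            // Must appear before the experimental option that it unlocks.
            cmd.add("-XX:+UnlockExperimentalVMOptions");
        }
        cmd.add("-XX:+UseZGC");
        cmd.add("-Xmx" + heapGb + "g");
        cmd.add("-XX:ConcGCThreads=" + concGcThreads);
        cmd.add("-jar");
        cmd.add(jar);
        return cmd;
    }

    /** Returns true if the given JVM arguments explicitly select ZGC. */
    public static boolean usesZgc(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseZGC");
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", buildCommand(16, 4, true, "app.jar")));
        List<String> current = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Current JVM started with ZGC: " + usesZgc(current));
    }
}
```

The heap size and the number of concurrent GC threads are the two settings this article names as the usual tuning knobs; suitable values depend on the machine and the workload.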

Part 2 of this 3-part series will introduce the principles and tuning techniques of ZGC. We will also see how Alibaba Dragonwell 11 solves some problems from production practice by making ZGC production-ready.

Dragonwell has joined the Java language and virtual machine SIG in the Anolis community (OpenAnolis). At the same time, Anolis operating system (Anolis OS) 8 supports Dragonwell cloud-native Java. You are welcome to join the SIG and build the community together.

About the Author

Hao Tang joined the Alibaba Cloud programming language and compiler team in 2019 and is currently engaged in JVM memory management optimization.
