By Yuming

Seeing the title, many readers may wonder whether this is just clickbait. Let me explain in detail.
Our application is written in Java, so the flame graph tool we use is Arthas, and the profiling below is based on it.
Official Documentation: https://arthas.aliyun.com/doc/profiler.html
$ profiler start
Started [cpu] profiling
$ profiler stop --format flamegraph
profiler output file: /tmp/test/arthas-output/20211207-111550.html
OK
As shown above, an HTML flame graph file is generated by default. For details on specifying other output formats, please refer to the official documentation.
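If Arthas has not yet been attached to the target Java process, the usual first step (per the Arthas documentation) is to download arthas-boot and attach it; the profiler commands above are then run inside the Arthas console:
$ curl -O https://arthas.aliyun.com/arthas-boot.jar
$ java -jar arthas-boot.jar
Then choose the target Java process from the list that arthas-boot prints.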

The horizontal axis of the flame graph represents CPU time: the wider a frame, the more CPU time it consumes. You can also hover over a frame to see how much CPU a method has occupied.
The vertical axis represents the call stack, with the higher flame representing a deeper call stack.
The green sections represent Java code, the yellow sections represent JVM C++ code, the orange sections represent kernel-mode C language code, and the red represents user-mode C language code.
Our business scenario is simple: after receiving shipping messages, the application performs a series of operations, which fall into automatic and batch operations. When a large customer performs batch operations, tens of thousands or even hundreds of thousands of orders are generated at once, which is comparable to the traffic during a big promotion. As a result, CPU usage spikes frequently, sometimes exceeding 80% on a single machine. Moreover, even under daily traffic, CPU usage looks somewhat high relative to the traffic volume, so we decided to profile the application with a flame graph.
Below are two typical cases in the optimization process.

The simplest way to read a flame graph is to look for large flat tops. If a method is time-consuming and called many times, its frame will be wide, which shows up in the flame graph as a large flat top.
The red box in the graph shows the execution of the business code, and the blue box marks the time-consuming operation we located first. After clicking on it, you can see that the left part is the CPU usage of Sentinel sampling, which occupies 3% to 4% of the total CPU. This part should not grow as traffic rises, so we did not focus on it this time. The second part is executed in the MetaQ consumer code and deserves close attention, because our application is message-driven: it performs the corresponding operations after receiving shipping messages from TP, so as MetaQ traffic increases, these operations will increase as well.
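For context, the message consumption entry point is structured roughly like the sketch below. This is a minimal illustration only: MetaQ is Alibaba's internal messaging middleware (its open-source counterpart is Apache RocketMQ), so the sketch uses the RocketMQ client API with hypothetical group, topic, and handler names rather than our actual code.

import java.nio.charset.StandardCharsets;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class ShippingMessageConsumer {
    public static void main(String[] args) throws Exception {
        // Hypothetical consumer group, name server address, and topic, for illustration only.
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("shipping_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("SHIPPING_TOPIC", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            for (MessageExt msg : msgs) {
                // All business work happens inside the listener, so any per-message overhead
                // (sampling, trace collection, masking) scales with message traffic.
                handleShippingOrder(new String(msg.getBody(), StandardCharsets.UTF_8));
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }

    private static void handleShippingOrder(String orderJson) {
        // Placeholder for the automatic and batch order operations described above.
    }
}

Because every shipping message passes through this listener, any cost attached to message consumption is multiplied by the message volume, which is why those frames widen as traffic grows.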


It occupies an astonishing 9.3% of the total CPU usage.
Clicking on it reveals a desensitization (data-masking) tool whose impact on performance is almost on par with processing the shipping messages themselves. After troubleshooting, we found that it is a trace-collection tool used within our department: when it collects MetaQ messages, it masks sensitive data in them, matching names, email addresses, and mobile phone numbers with regular expressions. However, the transaction messages we receive contain the entire order information. This object is very large (it carries extended fields and much other information), so the regular-expression matching workload is enormous. Under normal circumstances, using this tool to collect online traffic has little impact on performance, but in our scenario the impact was unexpectedly large.
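To make the cost concrete, the masking step is conceptually similar to the sketch below. The patterns and replacement strings are my own simplified illustration, not the tool's actual implementation (which also matches names and uses more precise rules).

import java.util.regex.Pattern;

public class DesensitizationSketch {
    // Simplified patterns for illustration only.
    private static final Pattern EMAIL = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern MOBILE = Pattern.compile("\\b1\\d{10}\\b"); // mainland China mobile numbers

    public static String mask(String messageBody) {
        // Each replaceAll scans the entire serialized order object, so the cost grows
        // with the size of the message body and the number of patterns applied.
        String masked = EMAIL.matcher(messageBody).replaceAll("***@***");
        masked = MOBILE.matcher(masked).replaceAll("***********");
        return masked;
    }
}

With several patterns and very large order objects, these full-text scans add up quickly, which is consistent with the 9.3% observed above.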
Since we usually do not look at message contents on the trace graph and mainly use it to inspect HSF traces, we simply turned off MetaQ collection in DP.


A global search shows that it occupies nearly 6% of the CPU. This flame graph was captured under daily traffic; the proportion will be even larger when traffic increases.
Opening it, we can see that obtaining the Java call stack during the HSF call is quite time-consuming. When we wrote this code, we suspected that obtaining the call stack would cost some performance, but we did not expect it to take more time than the HSF call itself. This logic was used to obtain the call source and print it to the log for easier troubleshooting (legacy code). Later, we changed the HSF call logging to go through HSF filters, eliminating the logic that obtained the call stack.
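The legacy logic was roughly equivalent to the sketch below (a reconstruction for illustration, not the original code): it captured the whole Java stack just to log the immediate business caller of the HSF invocation. Capturing a stack trace forces the JVM to materialize every frame, which is why it stood out so clearly in the flame graph; an HSF filter can log the same information from the invocation context without walking the stack.

public class CallerLogger {
    // Legacy approach: walk the entire thread stack to find the business caller.
    public static String callerOfHsfInvocation() {
        // getStackTrace() materializes every frame of the current thread, which is expensive.
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        // Index 0 is getStackTrace itself, index 1 is this method, index 2 is the direct caller;
        // the real code searched further up the stack for the first business-package frame.
        return stack.length > 2
                ? stack[2].getClassName() + "#" + stack[2].getMethodName()
                : "unknown";
    }
}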
I did not locate Case 2 by reading from top to bottom at first, because HSF is called in more than one place and no single call takes long, so Case 2 is hard to spot from the top down.
If reading from top to bottom does not yield clear results, we can start from the main traffic entry points of the system and click upward step by step from those bottom entry points, which can also help identify the issue. Since we did not notice the problem at the top initially, we started from the message entry point and clicked our way up, and found that this call consumed more time than the HSF call itself.

As you can see, this call is not concentrated in one place; it appears in every section, so its overall impact on performance is significant.
A global search shows calls everywhere (the purple sections in the graph).

Reading from bottom to top is better suited to fine-grained work, that is, profiling and optimizing a specific call path.
Let's look at the numbers.

Overall, the optimization saves roughly 5 to 6 percentage points of CPU (there were 30 machines in July, reduced to 27 in August). Taking 5 points, with system-mode CPU usage at about 6.5%, user-mode CPU usage under daily traffic drops from 26% to 21%, a relative reduction of more than 19% ((26 − 21) / 26 ≈ 19.2%). Factoring in the reduction in the number of machines, the overall optimization is likely around 20%.
Since this is the result under daily traffic, the reduction in system load should be even more noticeable when traffic surges during a big promotion. After the optimization, even when large customers perform batch operations, the instantaneous traffic should no longer push any machine's CPU usage above 60%.
Subsequently, we will continue to focus on optimization during the Double 11 stress tests. As traffic increases, more issues may be exposed.
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.