
Unveiling ARMS Continuous Profiling: New Insights into Application Performance Bottlenecks

This article introduces continuous profiling, a new perspective for locating application performance bottlenecks.

By Zihao Rao and Long Yang

1. Increasing Application Complexity: Challenges in Root Cause Identification

With the continuous development of software technology, many enterprise software systems have evolved from monolithic applications to cloud-native microservices. This evolution enables applications to achieve high concurrency, easy scalability, and greater development agility. However, it also results in longer dependency chains and reliance on various external technologies, making online issues harder to troubleshoot.

Observability technologies for distributed systems have evolved rapidly over the past decade and have mitigated many of these problems to some extent. However, some issues remain hard to identify and resolve, as shown in the following figures.

Figure 1 Persistent CPU peak

Figure 2 Where the heap memory space is used

Figure 3 Unable to locate the root cause of time consumption in the trace

How do we locate root causes?

Some of you may have more experience with troubleshooting tools, so you may think of the following troubleshooting methods for the preceding problems:

  1. For CPU peak diagnosis, use the CPU hotspot flame graph tool.
  2. For memory problems, use memory snapshots for memory usage diagnosis.
  3. For missing time consumption during slow trace diagnosis, use the trace command provided by Arthas to measure method durations.

These solutions can sometimes solve problems, but if you have troubleshooting experience, you will know that each method has its preconditions and limits. For example:

  1. For online problems that are difficult to reproduce in a test environment, the CPU hotspot flame graph tool is of little help.
  2. Memory snapshots may affect the stable operation of online applications, and diagnosing problems with them requires rich experience with the relevant tools and their analysis.
  3. Troubleshooting with the Arthas trace command becomes very difficult when slow traces occur unpredictably and are hard to track, and it is even harder to locate requests that span multiple applications and machines.

2. Continuous Profiling: a New Perspective to Understand Applications

Is there a simple, efficient, and powerful diagnostic technology that can help us solve the preceding problems? Yes: continuous profiling, which we will introduce in this article.

What is Continuous Profiling?

Continuous profiling helps monitor and locate application performance bottlenecks by continuously collecting stack trace information about the application's CPU, memory, and other resource usage. With only this brief introduction, the concept may still seem vague. Even if you have not heard of continuous profiling before, you have probably heard of or used jstack, a tool provided by the JDK that prints thread method stacks and helps locate thread states when troubleshooting application problems.

Figure 4 jstack tool
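The jstack idea can be reproduced in a few lines with the JDK's own `Thread.getAllStackTraces()` API. The sketch below is illustrative only (the `StackDump` class name and output format are our own, not part of jstack or any profiler):

```java
import java.util.Map;

public class StackDump {
    // Print a jstack-like snapshot of every live thread's stack trace.
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

Running this once gives a single snapshot; continuous profiling differs in that it repeats such sampling automatically and aggregates the results.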

The idea of continuous profiling is similar to jstack. At a certain frequency or on certain thresholds, it captures stack trace information about how application threads use CPU, memory, and other resources, and then presents that information through visualization techniques so that we can see the application's resource usage more clearly. At this point, readers who often use performance analysis tools may think of flame graphs:

Figure 5 Flame graph tool

One-time performance diagnosis tools, such as the Arthas CPU hotspot flame graph tool that is manually turned on and off during stress testing, are a form of real-time profiling. The main difference between the two is that one-time diagnosis tools profile in real time but not continuously.

We usually use CPU hotspot flame graph tools in stress testing scenarios: to analyze performance under load, we capture a flame graph of the application over a period of the test. Continuous profiling, however, is not limited to observing stress tests. More importantly, through technical optimizations, it can continuously profile the application's resource usage at low overhead throughout the application's entire running lifecycle, and then use flame graphs and other visualizations to provide deeper insight than conventional observability technologies.

Principles of Continuous Profiling Implementation

After discussing the basic concept of continuous profiling, you may be interested in its implementation principles. Here is a brief introduction to some related implementation principles.

We know that Tracing collects information by instrumenting methods on the key execution path to restore details such as parameters, return values, exceptions, and call durations. However, it is impractical for business applications to instrument every method, and an excessive number of instrumented methods leads to high overhead. This can leave blind spots in Tracing, as shown in Figure 3. Continuous profiling goes deeper: it tracks resource allocation at key locations in the JDK library or relies on specific operating system events to collect information. This keeps overhead low while providing greater insight from the collected information.

For instance, the general approach to CPU hotspot profiling is to obtain, through low-level operating system calls, the threads currently executing on the CPU, and then collect their method stack information at fixed intervals. For example, with a 10 ms sampling interval, 100 stack trace samples are collected per second, as depicted in Figure 6. The stack traces are then processed and displayed with visualization techniques such as the flame graph. Note that this is only a brief description of the principle; different profiling engines and profiling targets typically involve different technical implementations.

Figure 6 Principle of continuous profiling collection
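The sampling loop described above can be sketched in plain Java. The sketch below is a simplified assumption-laden illustration (the `SamplingProfiler` class and its counting scheme are our own, not ARMS internals): it samples all RUNNABLE threads on a fixed interval and counts how often each top-of-stack frame is observed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SamplingProfiler {
    private final Map<String, Integer> hotFrames = new ConcurrentHashMap<>();
    private final ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();

    // Every intervalMillis, record the top stack frame of each RUNNABLE thread.
    public void start(long intervalMillis) {
        sampler.scheduleAtFixedRate(() -> {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (e.getKey().getState() == Thread.State.RUNNABLE && stack.length > 0) {
                    hotFrames.merge(stack[0].toString(), 1, Integer::sum);
                }
            }
        }, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    public void stop() { sampler.shutdownNow(); }

    // Frames with the highest counts are the CPU "hotspots" of the sampling window.
    public Map<String, Integer> snapshot() { return new HashMap<>(hotFrames); }
}
```

Production profilers sample in native code via OS events for lower overhead and keep full stacks rather than only the top frame, but the count-by-sampled-frame principle is the same.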

Beyond the common CPU hotspot flame graph, continuous profiling can profile the usage of various other system resources and help trace that usage back to its source in the application, following implementation principles similar to those introduced above. Note that different profiling implementation technologies may produce different results.


Visualization Technology After Continuous Profiling

In the discussion of continuous profiling above, we repeatedly mentioned the flame graph, one of the most widely used technologies for visualizing profiling data after collection. What are the secrets of the flame graph?

What Is a Flame Graph?

The flame graph is a visualization tool for program performance analysis that helps developers track and display a program's function calls and the time they take. The core idea is to convert the program's function call stack traces into a flame-shaped image of stacked rectangles. The width of each rectangle indicates the function's share of resource usage, and the vertical extent indicates the call depth. By comparing flame graphs taken at different times, you can quickly diagnose a program's performance bottlenecks and carry out targeted optimization.

In a broad sense, we can draw a flame graph in two ways:

(1) the flame graph in the narrow sense, with the bottom element of the stack trace at the bottom and the top element at the top, as shown in the left figure below

(2) the icicle-shaped flame graph, with the bottom element of the stack trace at the top and the top element at the bottom, as shown in the right figure below


Figure 7 Two types of flame graphs
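Whichever orientation is drawn, flame-graph renderers typically consume stack samples in a "folded" text format: one line per unique stack, root frame first, followed by its sample count, from which rectangle widths are derived. A minimal sketch of this aggregation step (the `FoldedStacks` class name is illustrative, and real tools work from native samples):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class FoldedStacks {
    // Collapse raw stack samples into the folded format many flame-graph
    // renderers consume: "root;child;leaf <sampleCount>", one line per unique stack.
    public static List<String> fold(List<String[]> samples) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] stack : samples) {
            // Samples arrive top-of-stack first (like Thread.getStackTrace),
            // but the folded format lists the root frame first, so reverse.
            List<String> frames = new ArrayList<>(Arrays.asList(stack));
            Collections.reverse(frames);
            counts.merge(String.join(";", frames), 1L, Long::sum);
        }
        return counts.entrySet().stream()
                .map(e -> e.getKey() + " " + e.getValue())
                .collect(Collectors.toList());
    }
}
```

A stack sampled twice produces a line ending in `2`, and the renderer draws its leaf rectangle twice as wide as a stack sampled once.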

How Do We Use the Flame Graph?

As a visualization technique for performance analysis, the flame graph is only useful if you understand how to read it. For example, for a CPU hotspot flame graph, one key check is whether the graph has a wide stack top. Why?

This is because the flame graph draws the stack traces of method execution. A function's call context is stored in a stack data structure, which is last-in-first-out (LIFO): the bottom of the stack is the initial calling function, the layers above it are the subfunctions it calls, and when the last subfunction at the top of the stack finishes, elements are popped off the stack from top to bottom in turn. A wide stack top therefore means a subfunction that executes for a long time, and the parent functions below it also take a long time because they cannot be popped off the stack until it completes.

Figure 8 Stack data structure

Therefore, the steps for analyzing the flame graph are as follows:

  1. Determine the type of the flame graph and find the top of the stack.
  2. If the flame graph shows high overall resource usage, check whether there is a wider frame at the stack top.
  3. If there is a wide stack top, search from the top of the stack toward the bottom, find the first method defined in the analyzed application's own package, and then examine whether that method needs optimization.
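Steps 2 and 3 can be sketched as a small routine over folded stack counts (stacks written root-first as `root;...;leaf` with sample counts). The `FlameAnalysis` class and its method names are hypothetical, and real tools operate on richer data, but the search order is the one described above:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class FlameAnalysis {
    // Given folded stacks (root;...;leaf -> sample count), find the widest
    // stack top, then walk from that leaf toward the root and return the first
    // frame belonging to the application's own package.
    public static String firstAppFrameUnderWidestTop(Map<String, Long> folded, String appPackage) {
        // 1. Aggregate sample counts by leaf (top-of-stack) frame.
        Map<String, Long> leafWidth = new HashMap<>();
        for (Map.Entry<String, Long> e : folded.entrySet()) {
            String[] frames = e.getKey().split(";");
            leafWidth.merge(frames[frames.length - 1], e.getValue(), Long::sum);
        }
        String widestLeaf = Collections.max(leafWidth.entrySet(), Map.Entry.comparingByValue()).getKey();

        // 2. In stacks ending at that leaf, search leaf -> root for an app-owned frame.
        for (Map.Entry<String, Long> e : folded.entrySet()) {
            String[] frames = e.getKey().split(";");
            if (!frames[frames.length - 1].equals(widestLeaf)) continue;
            for (int i = frames.length - 1; i >= 0; i--) {
                if (frames[i].startsWith(appPackage)) return frames[i];
            }
        }
        return null; // No application frame found under the widest stack top.
    }
}
```

The returned method is the candidate for optimization: the widest leaf is where time is spent, and the first application-owned frame below it is the code you can actually change.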

The following figure shows a flame graph with high resource usage. To analyze the performance bottlenecks in the flame graph, perform the following steps:

  1. The following figure is an icicle-shaped flame graph, with the bottom of the stack at the top and the top of the stack at the bottom, so it should be analyzed from bottom to top.
  2. Examine the stack tops at the bottom of the graph. The wider stack top on the right is the method java.util.LinkedList.node(int).
  3. This wider stack top is a JDK library function rather than a business method, so we search upward from java.util.LinkedList.node(int) toward the stack bottom, passing through java.util.LinkedList.get(int) -> com.alibaba.cloud.pressure.memory.HotSpotAction.readFile(). The readFile() method is the first method defined by the analyzed application itself. It takes 3.89 seconds, accounting for 76.06% of the entire flame graph, so it is the biggest resource consumer during the collection period; you can review the logic of this business method by name to see whether there is room for optimization. Applying the same approach to java.net.SocketInputStream in the lower left corner of the graph shows that its first application-defined parent method is com.alibaba.cloud.pressure.memory.HotSpotAction.invokeAPI, which accounts for about 23% in total.

Figure 9 Flame graph analysis process

3. Out-of-the-box ARMS Continuous Profiling Capability

After the preceding introduction, you may have a certain understanding of the concept of continuous profiling, collection principles, and visualization techniques. Then, we will introduce the out-of-the-box continuous profiling capability provided by ARMS (Application Real-time Monitoring Service) to help troubleshoot and locate various online problems.

ARMS provides a one-stop continuous profiling feature, with nearly 10,000 application instances continuously collected and monitored online.

Figure 10 ARMS continuous profiling capabilities

The left side of the figure shows an overview of the continuous profiling capabilities of ARMS: data collection, data processing, and data visualization, from top to bottom. As for specific features, ARMS provides solutions for the scenarios users need most urgently: CPU and memory hotspot features for CPU and heap memory analysis, and the code hotspot feature for slow trace diagnosis. Continuous profiling in ARMS is developed by the ARMS team together with the Alibaba Cloud Dragonwell team. Compared with general profiling solutions, it features low overhead, fine granularity, and complete method stacks.


The ARMS product documentation provides the best practices for the corresponding sub-features:

• For more information about how to diagnose high CPU utilization, see Use CPU hotspots to diagnose high CPU consumption [1].

• For more information about how to diagnose high heap memory utilization, see Use memory hotspots to diagnose high heap memory usage [2].

• For more information about how to diagnose the root cause of slow traces, see Use code hotspots to diagnose problems with slow traces [3].

Success Stories

Since their release, these features have helped users diagnose complicated problems that had plagued them online for a long time, and they have been well received. The following are some customer stories.

1.  User A found that when an application service had just started, the first few requests were slow, and Tracing showed a monitoring blind spot, so the time-consumption distribution could not be diagnosed. Using ARMS code hotspots, User A determined that the root cause of the slow traces was the time-consuming initialization of the Sharding-JDBC framework, which finally explained the problem.

Figure 11 User problem diagnosis Case 1

2.  User B found that during a stress test, the response time of some nodes among all instances of the application was much slower than that of other nodes, and the root cause could not be found with Tracing. Through code hotspots, User B found that the affected instances spent a large amount of time writing logs when under a certain level of pressure. User B then checked the resource utilization of the log collection component in the application environment and found that it occupied a large amount of CPU during the stress test, so the application instances' log writing was starved of resources and request processing slowed down.

Figure 12 User problem diagnosis Case 2

3.  User C found that heap memory usage stayed very high while the online application was running. Through memory hotspots, User C quickly discovered that, in the version of the microservice framework used by the application, the persistent processing of subscribed upstream service information caused the high heap memory usage. After consulting the framework's service provider, User C learned that the problem could be solved by upgrading the framework version.

Figure 13 User problem diagnosis Case 3


In fact, the utilization of CPU, memory, and other resources is quite low while many enterprise applications run. With minimal resource consumption, continuous profiling offers a new perspective on applications, enabling detailed root cause identification when exceptions occur.

If you are interested in the continuous profiling feature in ARMS, click here to learn more.


[1] Use CPU hotspots to diagnose high CPU consumption
[2] Use memory hotspots to diagnose high heap memory usage
[3] Use code hotspots to diagnose problems with slow traces
