
WASM Performance Analysis - Linux-perf

This article uses perf, the performance analysis tool built into Linux, to analyze the performance of WASM.

By Jiaxiang

This series of articles summarizes my research and practical experience with WASM performance analysis. The principles discussed may also apply to the performance analysis of other JIT-generated code, and I'd like to share some of my findings here.

The previous article WASM Performance Analysis - Instrumentation Solution briefly introduces the WASM language and shares an instrumentation-based approach for conducting WASM performance analysis. The article also points out the drawbacks of the instrumentation method: it significantly impacts performance and makes the analysis process more complex. In this article, we will use the performance analysis tool perf built in Linux to analyze the performance of WASM. Again, we will experiment with WAVM, which currently lacks support for performance analysis.

linux-perf

perf is a powerful performance analysis tool provided by the Linux kernel. It analyzes programs based on a predefined set of performance events known as perf_events. In our specific scenario, we focus on using perf record to sample the program execution. This means sampling call stacks, registers, and other state at regular intervals during program execution, and then aggregating and analyzing these samples afterward.

Performance analysis techniques generally fall into two categories. One is tracing, similar to the instrumentation discussed in the previous article, which records every execution event as the program runs. The other is the aforementioned sampling, which periodically captures data during program execution; it may lose some information but has a much smaller impact on program performance.

perf Basics

For an executable file fib compiled into machine code, performance analysis is quite simple, as follows. The -g parameter is used to record the call graph, that is, the call stack.

$ perf record -g ./fib 40
fibonacci(40)=102334155
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.412 MB perf.data (5010 samples) ]

This generates a binary file named perf.data that contains the sampling results. This file is unreadable but can be converted into a readable text file by using perf script, or you can analyze it directly with perf report:

$ perf script > perf.unfold
$ perf report

The following is a sample in perf.unfold. We can see the structure of the call stack, sampling time, and other information. Performance analysis can be conducted by aggregating all samples.

[Figure 1: a sample record in perf.unfold, showing the call stack and sampling time]

After you obtain perf.unfold, you can also use FlameGraph to obtain the readable function overhead ratio.

Call Stack Sampling Solution

When using perf record, we specifically added the -g parameter to output the call stack. Let's also try running it without this parameter. In the final perf.unfold, each sampling point only records the currently executing function:

[Figure 2: perf.unfold output without -g, where each sampling point records only the currently executing function]

We can check the perf documentation. For recording the call stack, in addition to the -g parameter above, perf provides the following three modes via the --call-graph parameter.

--call-graph=fp is equivalent to -g. The fp (frame pointer) is stored in a dedicated register and points to the current function's frame on the stack; it is traditionally used for backtraces. However, modern compilers can generate backtraces without the frame pointer, so they often repurpose that register as a general-purpose one. Therefore, if you choose this sampling method, you need to compile your code with the -fno-omit-frame-pointer parameter to prevent this optimization. During each sample, perf walks back through the call stack via the frame pointer, recording the current IP and the addresses of the functions on the call stack.

--call-graph=dwarf uses the information in the .debug sections of the ELF file to reconstruct the call stack. The recorded call stack is very complete and does not require the above compilation parameter. However, dwarf sampling is resource-intensive: it dumps the contents of registers and the stack directly, resulting in much larger perf.data files compared with fp sampling.

--call-graph=lbr leverages a hardware feature of Intel CPUs. Simply put, Intel added a set of registers called Last Branch Records (LBRs), which store information about recently executed branch instructions. The perf tool then uses these records to reconstruct the call stack.

The sampling results are similar to those obtained with the frame pointer, but no additional compilation parameters are required. Only hardware support (Broadwell or newer versions) is needed.

WASM and perf

Based on the information provided above, it appears that perf does not support WASM. perf record outputs function addresses or complete contents of the stack/register. For ELF files, this raw data can be directly pinpointed or parsed through the .debug section to obtain the function name. However, for WASM, the VM executes the machine code, and there is no direct mapping between the WASM code and the machine code. As a result, subsequent analysis tools provided by perf are unable to interpret the results.

Frame Pointer Sampling

Sample Content: Unknown

Let's run a random WASM program by using WAVM with frame pointer sampling and examine the output in perf.unfold:

[Figure 3: perf.unfold output for a WASM program run on WAVM, with all WASM-side functions shown as unknown]

As expected, all functions on the WASM side are marked as unknown. Without additional processing, you cannot see the details of WASM execution.

/tmp/perf-pid.map Solution

We can view the raw contents of perf.data with the following command:

$ perf report -D -i perf.data > perf.dump

[Figure 4: raw dump of perf.data produced by perf report -D]

The perf.data file only records address-based call stacks; the conversion to function names happens in perf script. Clearly, without a proper mapping from addresses to function names, it outputs unknown. Looking back at the output, you may notice that unknown is followed by /tmp/perf-23.map. What is this file?

Research reveals that as early as 2009, perf introduced support for JIT-generated code: "perf report: Add support for profiling JIT generated code". This JIT interface is implemented through a map file, which records function names, code sizes, and start addresses; perf uses it to resolve sampled addresses to function names. Examples of performance analysis tools that leverage this interface include the JVM's perf-map-agent and Python 3.12.

The solution is to modify the runtime to record function names and their address mappings in the format specified by the map file. Generally, WASM execution goes through a load module phase and an instantiation phase. JIT compilation to machine code occurs during instantiation, so our task is to write the map file during this phase. Taking WAVM as an example, we can directly write the map file at the end of instantiation and implement a few perf_map functions (refer to perf-map-agent).

pid_t pid = getpid();
// Open /tmp/perf-<pid>.map and write one entry per compiled function.
FILE *perf_fp = perf_map_open(pid);
for(FunctionMutableData* functionMutableData : functionDefMutableDatas)
{
    // Each entry maps the function's start address and code size to its debug name.
    perf_map_write_entry(perf_fp,
                         functionMutableData->function->code,
                         functionMutableData->numCodeBytes,
                         functionMutableData->debugName);
}
perf_map_close(perf_fp);

After making the modifications, let's test the effect and check perf.unfold again. Indeed, it no longer shows unknown, but it only displays function indexes:

[Figure 5: perf.unfold showing WASM function indexes instead of function names]

This is because the WASM file is compiled by using the Emscripten toolchain, which does not include function names in the output unless specified. Checking the documentation reveals that adding the -g2 parameter includes function names without affecting performance. It simply appends function names to the custom section of the WASM file, leaving the execution flow unchanged.

Additionally, Emscripten offers the --profiling-funcs parameter, which is equivalent to -O1 -g2. The optimization level O1 preserves the source code's function structure for easier analysis or debugging but may affect performance.

By contrast, the wasi-sdk's Clang compiler includes function names by default, so no extra steps are needed.

After applying these settings and converting perf.unfold into a flame graph, the call stack can now be seen with readable function names:

[Figure 6: flame graph showing WASM function names]

Limits

While we have added support for frame pointer sampling as described above, this method also has its drawbacks. For some performance-sensitive third-party libraries (including the libc library), the compilation process typically includes default optimizations that can lead to incomplete sampling. This results in scenarios like the one shown in the figure, where a significant portion of the data remains marked as unknown. Therefore, in large projects, it is necessary to compile each third-party library with the -fno-omit-frame-pointer parameter to reduce the proportion of unknown entries. However, this parameter comes at a cost: it can increase performance overhead by 1% to 2%.

[Figure 7: flame graph with a large unknown portion caused by optimized third-party libraries]

Dwarf Sampling

Principles and Problems

Compared with frame pointer sampling, dwarf sampling does not require additional compilation parameters. Let's first look at the raw data from dwarf sampling. This primarily includes register contents (user regs) and stack contents (user stack). The user stack contains call stack information starting from the stack pointer (SP) register address, typically encompassing 8192 bytes of stack content by default.

[Figure 8: raw dwarf sample containing user regs and user stack contents]

During the parsing phase with perf script, perf uses the user stack content in memory to query the corresponding dynamic shared object (DSO) files through libraries like libunwind or libdw. It then parses entry addresses by using the .debug section of the DSO file (similar to the information sampled by frame pointer). Finally, the function names are retrieved on the perf side. The following figure shows an example:

[Figure 9: perf script resolving a call stack from the stack dump via the DSO's .debug section]

However, testing has revealed that for WASM code, the corresponding DSO file is actually the previously mentioned map file, which cannot be used for unwinding. As a result, the obtained results only contain functions on the VM side.

Parse WASM Function Entries in the Stack

For the aforementioned issues, there are two potential solutions:

  1. Could libunwind or libdw support the unwinding of JIT-generated code?
  2. Could the runtime directly write the compiled results to an ELF file on disk and include debug structures?

Brute-force Solution

Regarding the first approach, I reviewed the source code of libunwind and found that it does define a data structure for unwind information specific to runtime-generated code (JIT-based runtimes), namely unw_dyn_proc_info_t (see https://www.nongnu.org/libunwind/man/libunwind-dynamic(3).html).

Unfortunately, this feature has not been fully implemented. There is also a blog post discussing this topic, which concludes that no definitive solution is available:

I'd like to finish on a positive note instead of that sad note, but alas, this is a tale of woe.

Since there is no native support, I decided to try a brute-force approach. Currently, perf record has provided me with a large amount of raw data (stack dump), and my task is to extract valuable information (WASM function entries) from it. Below is an illustration of this raw data. The red boxes contain the content of the stack pointer register, pointing to the stack; we can ignore this. The blue boxes contain function entries, which are the same function due to recursive calls. The green boxes contain other details such as function parameters.

[Figure 10: raw stack dump; red: stack pointer register content, blue: recursive WASM function entries, green: other data such as function parameters]

My approach is very simple and brute-force:

  1. Get the address ranges corresponding to the WASM functions through WAVM.
  2. Modify the _unwind__get_entries function in perf so that when parsing WASM functions, it bypasses libunwind's unw_init_remote-related callback functions and follows the logic described below.
  3. Directly from the current position in the stack, traverse through the remaining content in 8-byte segments. Convert these segments into addresses. If an address matches a WASM function range, add that function to the entry list. If it doesn't match, consider it irrelevant and ignore it.

The final result is also quite direct, as shown in the following figure: the top three layers represent WASM functions, while the rest belong to WAVM (unaffected by the above modifications).

[Figure 11: flame graph; the top three layers are WASM functions, the rest belong to WAVM]

Clearly, the brute-force method is not advisable for the following reasons:

  1. If the corresponding value of a parameter is exactly the address of a WASM function, the parameter will also be resolved as a function entry, resulting in call stack confusion.
  2. If a WASM function calls functions on the host side, this method would also fail to parse those correctly.

However, this approach does highlight one important point: stack dumps can be used to parse WASM call stacks. We can only hope that the developers of unwind libraries will step up and provide better support for this functionality.

JIT dump

Perf also provides a design to dump JIT compilation results for performance analysis, as detailed in jitdump-specification.txt. This document specifies the format of JIT dump files and includes debug information to enable performance analysis similar to that of standard ELF files. LLVM implements this functionality. Examples of performance analysis tools based on this approach include the JVM's perf-jitdump-agent. Previously, I noticed that WAMR (another WASM Runtime) added support for JIT dump. However, upon testing, it only supports frame pointer sampling and does not successfully implement the dwarf sampling mode. Upon reviewing LLVM's source code, it appears that the dwarf sampling implementation might not have been completed.

[Figure 12: WAMR JIT dump test result]

lbr Sampling

lbr sampling is quite similar to frame pointer sampling, except it uses Intel's newer set of registers to sample branch addresses. I tested this approach and found that the solution by using /tmp/perf-pid.map for frame pointer sampling also works with lbr sampling. Below is the flame graph generated from this sampling:

[Figure 13: flame graph generated from lbr sampling]

Surprisingly succinct. However, WASM function entries can be lost, which stems from the limited number of hardware LBR registers: the maximum recorded stack depth is 32, so calls deeper than 32 layers may be dropped.

There is a way to mitigate this limitation by using lbr advanced features. This requires processor support for virtual lbr through the Processor Trace (PT) feature, which you can verify with the command grep intel_pt /proc/cpuinfo. Unfortunately, while this mode allows recording deeper call stacks, it does not support parsing function names by using /tmp/perf-pid.map, leading to a loss of detailed WASM execution information.

Performance Optimization Case Study based on WASM perf + Flame Graphs

After evaluating the perf performance analysis methods mentioned above, we ultimately chose the more mature frame pointer sampling combined with a function name-to-address mapping file approach. Based on this, we worked with colleagues at Baxia to generate a flame graph for the application security detection engine and analyzed the embedded WASM programs. Below is a portion of one of the WASM strategies:

[Figure 14: flame graph of one WASM strategy, with JSON-related functions dominating]

From this analysis, we found that JSON-related functions accounted for approximately 60% of the total overhead, while other business logic only took up about 40%. These JSON-related functions came from the third-party library jsoncpp. Upon identifying this hotspot, our Baxia colleagues replaced it with the higher-performance rapidjson library. After completing the replacement, we retested and generated another flame graph:

[Figure 15: flame graph of the same strategy after switching to rapidjson]

The results show that JSON-related operations dropped significantly to 38.3%, reducing the overall cost of this strategy to 61.1% of its original value, with other components unchanged. Additionally, the team at Baxia replaced the JSON libraries in two similar strategies with rapidjson, achieving an average performance improvement of 36%. Ultimately, this led to an overall performance boost of 3.31% for the entire engine.

Summary

The current WASM performance analysis is still not fully mature:

  1. Frame pointer sampling requires disabling frame pointer optimization, which may introduce a performance overhead of around 1%.
  2. Dwarf sampling lacks complete support for unwinding.
  3. lbr sampling is limited by Intel's hardware constraints and sampling depth.

Comparatively, the frame pointer sampling approach, which has more industry implementation examples, is better suited to our scenario. Therefore, it became our final choice for WASM performance analysis.


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.

