ChaosBlade Java Scenario Performance Optimization


ChaosBlade is an open source fault injection tool from Alibaba that follows the principles of chaos engineering and the chaos experiment model. It helps enterprises improve the fault tolerance of distributed systems and ensures business continuity while they migrate to the cloud or to cloud native systems.

Currently, the supported scenarios include basic resources, Java applications, C++ applications, Docker containers, and the Kubernetes platform. The project encapsulates the scenario implementations of each domain in a separate project, which both standardizes how scenarios are implemented within a domain and makes it easier to extend scenarios horizontally and vertically. Because every scenario follows the chaos experiment model, all of them can be invoked in a unified way through the ChaosBlade CLI.

However, fault injection in the Java scenario has some performance problems. The main symptom is that CPU utilization fluctuates significantly during fault injection and can reach 100% in serious cases. This has little impact on offline services, but it is serious for online services, because high CPU utilization can degrade overall service performance and increase interface latency.

After the performance optimization of the ChaosBlade Java scenario, CPU jitter during fault injection is effectively controlled, and CPU utilization no longer spikes to 100%. In testing, a Dubbo custom exception fault was injected into an online service instance with roughly 8 cores, 4 GB of memory, and 3K QPS. CPU utilization stayed within an instantaneous jitter of about 40%, and overall performance improved nearly 2.5 times.

This article describes in detail the problems that affected performance and how they were optimized.

Java Scenario

Before going into the details, it helps to understand the injection process of the ChaosBlade Java scenario.

Fault injection in Java scenarios is implemented based on the bytecode enhancement framework JVM Sandbox. One fault injection is divided into two steps:

1. ChaosBlade executes the prepare command, which triggers the sandbox to mount the Java agent to the target JVM.

2. ChaosBlade executes the create command, which triggers the sandbox to enhance the bytecode of the target JVM and thereby inject the fault.

Prepare phase optimization


We simulated a simple HTTP service locally and kept its CPU Idle at about 50%. After running blade prepare jvm --pid to mount the agent, CPU Idle dropped quickly and sharply. In production, fault injection could drive CPU Idle low enough to trigger an alarm.


We collected a CPU profile and generated a flame graph to observe CPU usage during blade prepare. The flame graph shows that the loadPlugins method is the hot spot for resource consumption.

The loadPlugins method mainly loads all plug-ins supported in ChaosBlade Java, such as dubbo, redis, kafka, etc. After loading these plug-ins, you can perform fault injection. During plug-in loading, bytecode enhancement will be performed on the classes and methods defined in the plug-in.

The cause of the CPU consumption is that loading all plug-ins takes a lot of time. When we inject a fault, we choose one specific plug-in, so loading everything is clearly not the optimal solution.


Optimization idea: Since specific plug-ins will be selected during fault injection, lazy loading can be used to solve the problem. When we load a specific plug-in for fault injection, the granularity of loading will become smaller, and the CPU consumption will naturally be smaller:

Core code:

In the fault injection phase, the specified plug-in is used for lazy loading.
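The lazy-loading idea can be sketched as follows. This is a minimal illustration, not the actual ChaosBlade code: the LazyPluginRegistry class, the Plugin interface, and their method names are hypothetical. Plug-ins are only registered as factories during prepare, and the requested one is instantiated during create.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical plug-in abstraction; in a real agent, enhance() would
// perform bytecode enhancement for the plug-in's target classes.
interface Plugin {
    void enhance();
}

// Hypothetical registry that defers plug-in creation until first use.
class LazyPluginRegistry {
    private final Map<String, Supplier<Plugin>> factories = new LinkedHashMap<>();
    private final Map<String, Plugin> loaded = new LinkedHashMap<>();

    // Prepare step: only remember how to build the plug-in (cheap).
    synchronized void register(String name, Supplier<Plugin> factory) {
        factories.put(name, factory);
    }

    // Create step: build only the plug-in that the experiment asks for.
    synchronized Plugin load(String name) {
        Supplier<Plugin> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown plug-in: " + name);
        }
        return loaded.computeIfAbsent(name, key -> factory.get());
    }

    // Names of the plug-ins that have actually been loaded so far.
    synchronized List<String> loadedNames() {
        return new ArrayList<>(loaded.keySet());
    }
}
```

With this shape, registering dubbo, redis, and kafka costs almost nothing, and a dubbo experiment only pays for the dubbo plug-in.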

Effect after improvement

Decrease of CPU Idle

CPU usage in the flame graph almost "disappears"

Create phase optimization

In actual use, fault injection often caused CPU Idle to fall to the bottom for a relatively short time, typically around 20 seconds. Some cases are related to the business code of the target service or to its JVM parameter settings. This article only covers the cases in which ChaosBlade directly or indirectly caused CPU Idle to fall to the bottom.

CPU Idle falls to the bottom: This means that the CPU idle rate is reduced to 0, which means that the CPU utilization has reached 100%.

Dubbo fault optimization

• Problem description

ChaosBlade supports fault injection for dubbo providers or consumers (such as throwing exceptions). When a service is both a provider and a consumer, fault injection for the provider will trigger a bug, which may cause the CPU Idle to drop.

Normal situation: for a service that is both a provider and a consumer, traffic first enters the provider, is then processed by the business logic, and is finally forwarded through the consumer.

For consumer fault injection: when ChaosBlade injects a fault into the consumer, an exception is thrown when the traffic reaches the consumer and the request is not actually forwarded, which simulates a fault.

For provider fault injection: when ChaosBlade injects a fault into the provider, an exception is thrown when the traffic arrives at the provider and the request is not passed further down.

All of the above is the expected behavior. In practice, however, when ChaosBlade injects a fault into either the provider or the consumer, it injects the fault into both at the same time, which can waste additional resources:

1. More classes are bytecode-enhanced than necessary.

2. When injecting a provider fault, for example, we expect the traffic not to reach the business logic at all. If an exception is also thrown in the consumer, the traffic passes through the exception handling of the business logic on its way back (printing error logs, retrying, and so on), and that handling may cause CPU Idle to drop.

Cause of the problem: the bytecode enhancement logic of ChaosBlade works at plug-in granularity. dubbo, for example, is one plug-in, so for plug-ins such as dubbo and kafka that support both provider and consumer fault injection, the fault is injected into both the provider and the consumer.

• Optimization

When loading plug-ins, load them on demand according to the specific plug-in name, for example, execute the command:

./blade create dubbo throwCustomException --provider --exception java.lang.Exception --service org.apache.dubbo.UserProvider --methodname GetUser

This command asks for a fault to be injected into the dubbo provider, so only the provider plug-in is loaded for bytecode enhancement.
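The selection step can be sketched roughly as below. The SidePlugin and PluginSelector classes are hypothetical and exist only to illustrate the idea of filtering by the --provider and --consumer flags; they are not ChaosBlade classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical description of a plug-in that targets one side of an RPC call.
class SidePlugin {
    final String target; // for example "dubbo"
    final String side;   // "provider" or "consumer"

    SidePlugin(String target, String side) {
        this.target = target;
        this.side = side;
    }

    String name() {
        return target + "-" + side;
    }
}

class PluginSelector {
    // Returns only the plug-ins that match the target and the requested sides,
    // so that the other side is never bytecode-enhanced.
    static List<SidePlugin> select(List<SidePlugin> all, String target,
                                   boolean provider, boolean consumer) {
        List<SidePlugin> selected = new ArrayList<>();
        for (SidePlugin plugin : all) {
            if (!plugin.target.equals(target)) {
                continue;
            }
            boolean wanted = (provider && plugin.side.equals("provider"))
                    || (consumer && plugin.side.equals("consumer"));
            if (wanted) {
                selected.add(plugin);
            }
        }
        return selected;
    }
}
```

For the command above, select(all, "dubbo", true, false) would return only the dubbo provider plug-in.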

Custom script fault optimization

• Problem description

When ChaosBlade injects a custom script fault, CPU Idle drops to the bottom. Custom script is one of the JVM fault types that ChaosBlade supports: users write an arbitrary piece of Java code and inject it into the target classes and methods. This approach is very flexible, and many things can be done by injecting a custom script.

From the thread stack, we can see that the thread is blocked when decompressing the jar file. Why is it blocked here?

When ChaosBlade injects a custom script, the script (Java code) is initially treated only as a string. Only when the plug-in is actually activated is the string parsed and turned into Java code that the JVM loads, compiles, and executes.

The problem is that external traffic keeps calling the service while the fault is being injected. When the plug-in is activated, a large number of concurrent requests therefore try to parse the custom script at the same time, which blocks threads. Turning the script into something the JVM can load is a relatively complex and slow process, and parts of it must be thread-safe.

ChaosBlade does have a cache: once the custom script has been compiled, subsequent requests execute the compiled script directly. However, this cache does not work well under concurrent requests.

• Optimization

The troubleshooting above suggests an optimization: load the custom script earlier.

ChaosBlade fault injection is divided into two steps. The agent cannot obtain the custom script in the first step (mounting the agent), so the script is loaded in the second step, before the plug-in is activated. Once the plug-in is activated, traffic starts hitting the injected fault method, which would otherwise trigger script compilation.
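A minimal sketch of loading the script ahead of activation is shown below. The ScriptCache class is hypothetical, and the compile method is only a stand-in for the real parse, compile, and load work. The point is that the expensive step runs once before any traffic can reach it, and the request path is reduced to a map lookup.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache of compiled scripts, keyed by script id.
class ScriptCache {
    private final ConcurrentMap<String, Function<String, String>> compiled =
            new ConcurrentHashMap<>();
    final AtomicInteger compileCount = new AtomicInteger();

    // Stand-in for the slow parse/compile/load step of a real script engine.
    private Function<String, String> compile(String source) {
        compileCount.incrementAndGet();
        return input -> source + ":" + input;
    }

    // Called once in the create step, before the plug-in is activated.
    // computeIfAbsent runs the compile step at most once per script id.
    void preload(String scriptId, String source) {
        compiled.computeIfAbsent(scriptId, id -> compile(source));
    }

    // Called on the request path after activation: lookup only, no compilation.
    String execute(String scriptId, String input) {
        Function<String, String> script = compiled.get(scriptId);
        if (script == null) {
            throw new IllegalStateException("Script not preloaded: " + scriptId);
        }
        return script.apply(input);
    }
}
```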

This optimization applies not only to custom script faults but also to custom exception faults.

When a custom exception fault is executed, the exception class name entered by the user is loaded through reflection only when traffic arrives. Class loading also takes a lock internally, so it may cause threads to block as well.

Optimization: add a pre-execution interface for faults, which plug-ins can implement when they need to perform certain actions before fault injection.
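One possible shape for such an interface is sketched below. The names PreActionExecutor and ThrowCustomExceptionExecutor are hypothetical and chosen for illustration; the sketch only shows the principle of resolving the exception class before the fault is activated, so that the request path does not go through class loading.

```java
// Hypothetical hook that a plug-in implements when it needs to do work
// before the fault is activated.
interface PreActionExecutor {
    void preExecute(String flagValue);
}

// Hypothetical executor for a "throw custom exception" fault.
class ThrowCustomExceptionExecutor implements PreActionExecutor {
    private volatile Class<? extends Throwable> exceptionClass;

    // Runs before activation: resolve the user-supplied class name once.
    @Override
    public void preExecute(String exceptionClassName) {
        try {
            exceptionClass = Class.forName(exceptionClassName)
                    .asSubclass(Throwable.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(
                    "Unknown exception class: " + exceptionClassName, e);
        }
    }

    // Runs on the request path: uses the already resolved class.
    Throwable newException(String message) {
        try {
            return exceptionClass.getConstructor(String.class).newInstance(message);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create exception instance", e);
        }
    }
}
```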

Log printing optimization

• Problem description

There are two main problems caused by log printing:

1. The business system's own logging. If the business system prints logs synchronously with a framework such as log4j or logback, a large number of threads are likely to block after fault injection (for example, throwing exceptions), because the business system handles the exception and logs it. Synchronous logging requires a lock, and exception stack traces are long and slow to print, so many threads may block when QPS is high.

2. ChaosBlade's own logging. Each time a fault injection rule is matched, an info log is output:"Match rule: {}", JsonUtil.writer().writeValueAsString(model));

When this log is output, the fault model is serialized with jackson, which triggers class loading (a locking operation). With a large number of requests, this may cause many threads to block.

• Optimization

Thread blocking caused by the business system's own logging is outside the scope of ChaosBlade optimization. If you run into it, it can be addressed on the business side, for example:

1. Change synchronous logging to asynchronous logging.

2. Where possible, skip printing the stack trace of the custom exception thrown by ChaosBlade to reduce the volume of log output.

Optimizing ChaosBlade's own logging is relatively simple: replace the serialization of the fault model in the match rule log. Implement toString for the model and print the model directly.
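The toString approach might look like the sketch below. FaultModel here is a simplified, hypothetical stand-in for the real fault model, and its fields are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified fault model with a hand-written toString,
// so that logging it does not need a JSON serializer.
class FaultModel {
    private final String target;
    private final String action;
    private final Map<String, String> flags = new LinkedHashMap<>();

    FaultModel(String target, String action) {
        this.target = target;
        this.action = action;
    }

    FaultModel flag(String name, String value) {
        flags.put(name, value);
        return this;
    }

    @Override
    public String toString() {
        // Plain string concatenation: no reflection and no extra class loading.
        return "FaultModel{target='" + target + "', action='" + action
                + "', flags=" + flags + "}";
    }
}
```

Assuming an SLF4J-style logger, the model can then be passed directly as the argument for the {} placeholder, so the string is only built when the info level is enabled.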


• Log symptoms

After a ChaosBlade injection failed, we logged in to the target machine and checked the logs. First, the JVM Sandbox had failed to attach to the target JVM.

Next, a more critical log appeared: Metaspace overflowed!!!


As described at the beginning of the article, during fault injection the JVM Sandbox is dynamically attached to the target JVM process. After attaching, the sandbox's internal jars and its user-defined module jars are loaded, which loads a large number of classes. Each loaded class is allocated space in Metaspace to store its metadata.

Here are two considerations:

1. Is it because the Metaspace of the business service JVM is set too small?

2. Is Metaspace GC not being triggered, or is there a leak, so that class metadata is never reclaimed?

We logged in to the target machine and used jinfo to inspect the JVM parameters. MaxMetaspaceSize was set to 128M, which is quite small given that the default MaxMetaspaceSize is -1 (unlimited, bounded only by native memory).

The business service raised MaxMetaspaceSize to 256M and restarted the Java process. Fault injection then worked again, and the fault took effect normally.

However, the actual problem is not so simple. Metaspace OOM failures still occur after several consecutive injections. It seems that the corresponding space in Metaspace cannot be reclaimed when the fault is cleared.

• Local recurrence

ChaosBlade Java fault injection is essentially a JVM Sandbox plug-in, and the core logic for class loading, bytecode enhancement, and so on lives in JVM Sandbox. We therefore looked for the problem in JVM Sandbox directly and reproduced it with the demo project that JVM Sandbox provides.

Because the demo module has very few classes, MaxMetaspaceSize was set to 30M in the startup parameters so that the OOM could be reproduced quickly.

The TraceClassLoading and TraceClassUnloading parameters were used to observe which classes JVM Sandbox loads and unloads during fault injection and clearing.

After multiple injections and cleanups, the Metaspace OOM of online business is reproduced. It can be seen that the Metaspace has not been recycled during multiple injections, and the space occupation curve is rising all the way.
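To watch Metaspace usage from inside a JVM while repeating injections, one option is the standard java.lang.management API. The sketch below is not part of ChaosBlade or JVM Sandbox; the MetaspaceProbe class is hypothetical, and the pool name "Metaspace" is what HotSpot reports, which is an assumption that may not hold on other JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class MetaspaceProbe {
    // Returns the bytes currently used by the memory pool named "Metaspace",
    // or -1 if the running JVM does not expose such a pool.
    static long usedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }
}
```

Sampling this value after each injection and each cleanup gives a simple usage curve; if the value never drops after cleanup, class metadata is not being reclaimed.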

The Metaspace OOM occurs because Metaspace is not reclaimed. The precondition for reclaiming Metaspace is that the ClassLoader is closed, and JVM Sandbox closes its ClassLoader on shutdown. The custom ClassLoader in JVM Sandbox inherits from URLClassLoader. The official description of the close method of URLClassLoader:

JIT (just in time compilation) causes CPU jitter

Problem description

In Java, compilers are mainly divided into three categories:

1. Front-end compiler: JDK's javac, that is, the process of converting *.java files into *.class files

2. Just-in-time (JIT) compiler: the C1 and C2 compilers of the HotSpot virtual machine and the Graal compiler, that is, the process of converting bytecode into native machine code at JVM runtime

3. Ahead-of-time (AOT) compiler: JDK's jaotc, GNU Compiler for Java (GCJ), etc.

Fault injection through ChaosBlade essentially uses JVM Sandbox to enhance the bytecode of the target classes and methods, which also triggers JIT (just-in-time) compilation in the JVM.

The purpose of JIT compilation is to convert bytecode into machine code so that it executes more efficiently. However, the compilation itself consumes resources. The most typical example is that the CPU utilization of a Java service is relatively high right after startup and gradually recovers after a while. This is partly caused by JIT compilation.
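One way to check whether JIT compilation is active around a fault injection is to sample the accumulated compilation time through the standard CompilationMXBean. The JitProbe class below is a hypothetical helper for illustration, not something ChaosBlade provides.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

class JitProbe {
    // Returns the approximate accumulated JIT compilation time in milliseconds,
    // or -1 if the JVM has no JIT compiler or does not support this metric.
    static long totalCompilationMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1L;
        }
        return jit.getTotalCompilationTime();
    }
}
```

Comparing the value before and shortly after injection gives a rough indication of how much compilation work the enhanced bytecode triggered.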
