Why Can't Arthas Be Mounted in the Init Process in Container?

Part 4 of this series discusses the importance of understanding the underlying logic of arthas and init.

By Bubi

This is the fourth article in the Java in the Container series. You are welcome to follow the series. 😊

Recently, we found that in a container environment, arthas cannot attach to a Java process that runs as the init process (pid 1).

The attach fails with AttachNotSupportedException: Unable to get pid of LinuxThreads manager thread. The exact commands and errors are shown below:

# java -jar arthas-boot.jar
[INFO] arthas-boot version: 3.5.6
[INFO] Found existing java process, please choose one and input the serial number of the process, eg : 1. Then hit ENTER.
* [1]: 1 com.alibabacloud.mse.demo.ZuulApplication
1
[INFO] arthas home: /home/admin/.opt/ArmsAgent/arthas
[INFO] Try to attach process 1
[ERROR] Start arthas failed, exception stack trace:
com.sun.tools.attach.AttachNotSupportedException: Unable to get pid of LinuxThreads manager thread
    at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:86)
    at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:78)
    at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:250)
    at com.taobao.arthas.core.Arthas.attachAgent(Arthas.java:117)
    at com.taobao.arthas.core.Arthas.<init>(Arthas.java:27)
    at com.taobao.arthas.core.Arthas.main(Arthas.java:166)
[INFO] Attach process 1 success.

We have run into this before. The workaround at the time was to adjust the image so that some other process, rather than the Java process, runs as the init process. However, this is not a long-term solution, so it is worth taking the time to get to the root of the problem.

Reproduce the Issue

We create the following project to reproduce this problem:

public class Main {
  // Prints a message every 30 seconds so the JVM stays alive as the container's main process.
  public static void main(String[] args) throws Exception {
    while (true) {
      System.out.println("hello!");
      Thread.sleep(30 * 1000);
    }
  }
}

FROM openjdk:8u212-jdk-alpine
COPY ./ /app
WORKDIR /app/src/main/java/
RUN javac Main.java
CMD ["java", "Main"]

Then, start the application as usual and try to use arthas or jstack:

$ # Build an image
$ docker build . -t example-attach
$ # Start the container
$ docker run --name example-attach --rm example-attach

$ # Enter the container in another terminal and execute jstack
$ docker exec -it example-attach sh
/app/src/main/java # jstack 1
1: Unable to get pid of LinuxThreads manager thread

The problem is reproduced. Now let's analyze it.

How Does a Normal Attach Process Work?

The following steps summarize the JVM attach process:

  1. Look for the Unix socket /tmp/.java_pid${pid}. If it exists, check its permissions and connect to it.
  2. If it does not exist, create /proc/${pid}/cwd/.attach_pid${pid} and then signal the JVM.
  3. To send the signal, first determine whether the thread library is LinuxThreads. If it is, find the LinuxThreads manager thread and send SIGQUIT to all of its child processes.
  4. If it isn't, send SIGQUIT directly to the target process.
  5. When the target process receives the signal, it starts the Attach Listener thread, which listens on /tmp/.java_pid${pid}.
  6. Normal socket communication begins. Depending on the command sent, the JVM can dump its threads (jstack) or load a Java agent (such as arthas mentioned above).
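The steps above can be sketched in a few lines of Java. This is only a simplified model of the client-side decision, not the JDK's actual code. The class and method names (AttachFlowSketch, socketPath, triggerPath, nextAction) are hypothetical. Only the two file paths come from the steps listed above.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Simplified, hypothetical model of the client-side attach decision. */
final class AttachFlowSketch {

    /** Unix socket the Attach Listener listens on (steps 1 and 5). */
    static String socketPath(int pid) {
        return "/tmp/.java_pid" + pid;
    }

    /** Trigger file that marks the next SIGQUIT as an attach request (step 2). */
    static String triggerPath(int pid) {
        return "/proc/" + pid + "/cwd/.attach_pid" + pid;
    }

    /** Describes what the client does next, given the two facts it checks. */
    static String nextAction(boolean socketExists, boolean linuxThreads) {
        if (socketExists) {
            return "connect to socket";
        }
        return linuxThreads
                ? "create trigger file, SIGQUIT children of manager"
                : "create trigger file, SIGQUIT target";
    }

    public static void main(String[] args) {
        int pid = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        boolean socketExists = Files.exists(Paths.get(socketPath(pid)));
        System.out.println("socket:  " + socketPath(pid) + " exists=" + socketExists);
        System.out.println("trigger: " + triggerPath(pid));
        // Show the branch taken when the thread library is not LinuxThreads.
        System.out.println("next:    " + nextAction(socketExists, false));
    }
}
```

Running it inside the container with the target pid as the argument shows which branch a client would take before any signal is sent.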

Why Is an Attach Error Reported for the Init Process?

First of all, /tmp/.java_pid${pid} does not exist at this point, which can be confirmed by checking for the file. If it did exist, arthas would connect to it directly and load successfully.

Secondly, the .attach_pid${pid} file is created successfully, which the strace output confirms:

open("/proc/424/cwd/.attach_pid424", O_RDWR | O_CREAT | O_EXCL | O_LARGEFILE, 0666 <unfinished ...>

That narrows the problem down to the thread-model check and the signal sending. Let's take jstack as an example to find out why attach fails. As in our earlier investigations, we first wanted to step through the code with debugging symbols, but the debugging symbols on Alpine cannot display the source code, and setting up a build environment is tricky. Therefore, strace is the better choice. Note that the jstack logic includes a fork, so remember to use strace -f jstack 1 to follow the child processes as well.

The strace output contains no kill call at all, so the signal is never sent. The problem therefore lies in the thread-model check.

As mentioned above, the JVM determines whether the thread library is LinuxThreads. What is LinuxThreads? First, look at the source code of the check:

[Figure 1: JVM source code that checks for LinuxThreads]

In its early days, the Linux kernel did not support threads. LinuxThreads implemented threads with a fork-like mechanism (the clone system call) plus a shared memory space. To the kernel, however, LinuxThreads threads were just independent processes in a parent-child relationship, which caused many defects in signal handling and synchronization primitives. This logic had to be handled by a manager thread. Later, Red Hat initiated NPTL. With thread support in the kernel, signals, synchronization, and other logic could be handled in a more standard way.

You can use getconf GNU_LIBPTHREAD_VERSION to check which thread library is in use. For example, the output on my machine is NPTL 2.34.

As the code above shows, the JVM calls confstr(_CS_GNU_LIBPTHREAD_VERSION, ...) to obtain the current thread library.

See the confstr manual page [2] for more information.

  • If confstr(_CS_GNU_LIBPTHREAD_VERSION, ...) returns 0, the JVM assumes an old version of glibc and treats the process as LinuxThreads. It first finds the manager thread (by looking up the parent process) and then sends SIGQUIT to each child process of the manager (this requires traversing all processes in the system).
  • If the result of confstr(_CS_GNU_LIBPTHREAD_VERSION, ...) contains NPTL, the process is not treated as LinuxThreads and is handled the NPTL way: SIGQUIT is sent directly to it.
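The two cases above boil down to a small predicate. The following Java sketch models it under the assumption that a null or empty string stands for "confstr returned 0". The real check lives in the JDK's native code. The class name ThreadLibraryCheck is hypothetical.

```java
/** Hypothetical model of the thread-library check; the real check is native code. */
final class ThreadLibraryCheck {

    /**
     * @param libpthreadVersion the string confstr would yield, or null/empty
     *                          when confstr returns 0 (no value available)
     * @return true if the JVM would treat the process as LinuxThreads
     */
    static boolean isLinuxThreads(String libpthreadVersion) {
        if (libpthreadVersion == null || libpthreadVersion.isEmpty()) {
            // No answer from confstr: assumed to be an old glibc with LinuxThreads.
            return true;
        }
        // Any version string that mentions NPTL means the newer thread library.
        return !libpthreadVersion.contains("NPTL");
    }

    public static void main(String[] args) {
        System.out.println(isLinuxThreads("NPTL 2.34")); // glibc answer: false
        System.out.println(isLinuxThreads(null));        // no answer: true
    }
}
```

The second call is the interesting one: a libc that gives no answer is indistinguishable, in this model, from a very old glibc.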

Unfortunately, _CS_GNU_LIBPTHREAD_VERSION is a GNU extension and not part of the POSIX standard, so Alpine's musl returns 0 for this call.

Based on the logic above, the JVM treats the process as LinuxThreads and tries to find its parent process. For pid 1, no parent process can be found, so the error Unable to get pid of LinuxThreads manager thread is reported. This is the failure described at the beginning of the article, where arthas is unavailable.

See Linux threading models compared: LinuxThreads and NPTL [3] for a detailed comparison of the two thread models.
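To see why pid 1 is special, it helps to look at where a parent pid comes from. On Linux, the fourth field of /proc/<pid>/stat is the parent pid. The sketch below parses it and applies a simplified version of the manager lookup described above (parent found or error). The names ParentLookup, parsePpid, and managerOf are hypothetical, and the real JVM logic is native code with more cases.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical sketch: read a parent pid from /proc and apply a simplified manager lookup. */
final class ParentLookup {

    /** Extracts the ppid (4th field) from the content of /proc/<pid>/stat. */
    static int parsePpid(String statLine) {
        // The command name is wrapped in parentheses and may itself contain
        // spaces or parentheses, so only split what follows the last ')'.
        String rest = statLine.substring(statLine.lastIndexOf(')') + 1).trim();
        String[] fields = rest.split("\\s+");
        return Integer.parseInt(fields[1]); // fields[0] = state, fields[1] = ppid
    }

    /** Simplified lookup: the manager is the parent, or an error if there is none. */
    static int managerOf(int ppid) throws IOException {
        if (ppid <= 0) {
            throw new IOException("Unable to get pid of LinuxThreads manager thread");
        }
        return ppid;
    }

    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        byte[] raw = Files.readAllBytes(Paths.get("/proc", pid, "stat"));
        int ppid = parsePpid(new String(raw, StandardCharsets.UTF_8));
        System.out.println("ppid=" + ppid);
        System.out.println("manager=" + managerOf(ppid));
    }
}
```

Inside a container, the process with pid 1 reports a ppid of 0, which is exactly the case where this simplified lookup gives up.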

Why Does Attach Work When Java Is Not the Init Process?

First, start the container with a shell (so sh is the init process, pid 1), and then run java Main manually (its pid is 8). Now let's see how getLinuxThreadsManager behaves:

[Figure 2: source code of getLinuxThreadsManager]

In this case, the JVM takes the init process to be the manager thread.

Next, sendQuitToChildrenOf(mpid) is executed:

[Figures 3 and 4: source code of sendQuitToChildrenOf]

All child processes are traversed, and SIGQUIT is sent to every one of them. This logic is a bit strange. Please see this page for more information.

Let's run it again and verify it with strace -f.

Process tree (the entries shown in green are threads):
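The traversal can be modeled with a plain pid-to-parent table. This is a sketch of the selection logic only and sends no signals. The class name QuitTargets and the pids other than 1 and 8 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of picking the processes that would receive SIGQUIT. */
final class QuitTargets {

    /** Returns every pid whose parent is mpid, in ascending order. */
    static List<Integer> childrenOf(int mpid, Map<Integer, Integer> parentByPid) {
        List<Integer> targets = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : parentByPid.entrySet()) {
            if (entry.getValue() == mpid) {
                targets.add(entry.getKey());
            }
        }
        Collections.sort(targets);
        return targets;
    }

    public static void main(String[] args) {
        // pid -> ppid; pids 20 and 35 are made-up examples.
        Map<Integer, Integer> table = new TreeMap<>();
        table.put(1, 0);   // sh, the init process
        table.put(8, 1);   // java Main, started from the init shell
        table.put(20, 1);  // some other child of the init shell
        table.put(35, 0);  // a shell that was not started by pid 1
        System.out.println(childrenOf(1, table)); // prints [8, 20]
    }
}
```

The model makes the side effect visible: pid 20 is signaled even though it has nothing to do with the JVM.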

[Figure 5: process tree]

The kill signal sent by jstack shows that jstack sends SIGQUIT to all child processes of the init process:

[Figure 6: strace output showing the kill calls made by jstack]

This behavior is consistent with the logic described above. Coincidentally, the other processes that receive SIGQUIT here ignore it, so jstack works normally in this case.

How Do We Solve This Problem?

The Simplest and Fastest Way: Workaround

Note: This method is recommended as there's no need to adjust the container parameters or restart the container.

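Since attach fails at the signal-sending step, we can perform that step by hand in the shell:

# Create the trigger file first, then send SIGQUIT so that the JVM starts the Attach Listener.
# The final ls confirms that the socket file now exists.
pid=1 ;\
touch /proc/${pid}/cwd/.attach_pid${pid} && \
  kill -QUIT ${pid} && \
  sleep 2 && \
  ls /proc/${pid}/root/tmp/.java_pid${pid}
# Then attach arthas as usual: java -jar arthas-boot.jar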

After these operations, the Attach Listener is running and listens on the socket path. A second attach attempt then succeeds, and arthas can be used in the normal way.

Note: You must create the .attach_pid${pid} file before sending the signal. Otherwise, the JVM does not treat the signal as an attach request and falls back to the default handling of the signal, which may cause the container to exit because the target is pid 1.

There is also a tool called jattach [5] that works on similar principles. On Alpine, it can be installed directly with apk add jattach, and jattach ${pid} properties then triggers the attach in the same way.

Set Startup Parameters

Note: In this case, you need to adjust the startup parameters or environment variables and restart the application or container, which may destroy the faulty state you wanted to diagnose.

The JVM supports the -XX:+StartAttachListener option. With it, the Attach Listener thread is started automatically when the JVM starts and begins listening right away, so arthas can be used normally.

In a container environment, the better way is to add the environment variable JAVA_TOOL_OPTIONS=-XX:+StartAttachListener to the container. This achieves the same result without modifying the startup script.
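One way to check from inside the application whether the option took effect is to look for the listener thread. The sketch below assumes a HotSpot JVM on which the thread is named "Attach Listener" and is visible through Thread.getAllStackTraces(). Both points are assumptions about the runtime, and the class name AttachListenerCheck is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical check for a running Attach Listener thread (name assumed for HotSpot). */
final class AttachListenerCheck {

    /** Returns true if one of the given thread names is the assumed listener name. */
    static boolean hasAttachListener(Collection<String> threadNames) {
        return threadNames.contains("Attach Listener");
    }

    public static void main(String[] args) {
        // Collect the names of all live threads that the JVM exposes to Java code.
        List<String> names = new ArrayList<>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            names.add(thread.getName());
        }
        System.out.println("Attach Listener running: " + hasAttachListener(names));
    }
}
```

Running it once with and once without JAVA_TOOL_OPTIONS=-XX:+StartAttachListener shows whether the variable reached the JVM.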

Upstream Takes Precedence: Modify an Image

Note: This requires you to modify the image.

Upstream OpenJDK 8 has not fixed this problem, so it is unavoidable if you use openjdk:8-jdk-alpine directly. The problem is also discussed in the docker-library/openjdk issue tracker [6].

OpenJDK 11, however, solved it (see the Source Code [7]). It no longer checks for the old LinuxThreads model, which allows arthas to work.

In addition, the OpenJDK 8 package in Alpine's official repository has fixed this problem with a patch: https://gitlab.alpinelinux.org/alpine/aports/-/issues/13032

Eclipse Temurin is a relatively well-known JDK distribution, and the problem is also fixed in eclipse-temurin:8-jdk-alpine, so this image can be used directly. Related discussion: https://github.com/adoptium/jdk8u/pull/8

Summary

The arthas issues and related articles online keep repeating that Java should not run as the init process. As a result, in most cases we cannot attach diagnostic tools, so the faulty state is lost and faults cannot be located in time.

As technical staff, we need to understand the underlying logic. That gives us more freedom in troubleshooting and architecture design and makes it more likely that we can locate and solve problems.

There will be more articles in this series to solve JVM problems in the container environment. Please stay tuned!

Related Links

[1] Java Attach Mechanism - Native
https://my.oschina.net/u/3784034/blog/5526214

[2] Manuals for Reference
https://man7.org/linux/man-pages/man3/confstr.3.html

[3] Linux threading models compared: LinuxThreads and NPTL
http://cs.uns.edu.ar/~jechaiz/sosd/clases/extras/03-LinuxThreads%20and%20NPTL.pdf

[4] Sagan standard
https://en.wikipedia.org/wiki/Sagan_standard

[5] jattach
https://github.com/apangin/jattach

[6] The problem is also discussed in Docker Registry
https://github.com/docker-library/openjdk/issues/76

[7] Source Code
https://github.com/openjdk/jdk11u/blob/jdk-11%2B28/src/jdk.attach/linux/classes/sun/tools/attach/VirtualMachineImpl.java#L78
