Open-Source Self-Built/Hosted vs. Commercial Self-Developed Trace

With the rise of microservice architectures, server-side call dependencies have become increasingly complex. To quickly locate abnormal components and performance bottlenecks, adopting distributed tracing (Trace) has become a consensus in IT operations. However, what are the differences between open-source self-built, open-source hosted, and commercial self-developed Trace products, and how should you choose? This is a question many users face when evaluating Trace solutions, and it is also the one most prone to misconceptions.

To clarify this issue, we need to look at it from two angles. First, sort out the core risks and typical problem scenarios of online applications. Second, compare the capabilities of the three Trace solutions: open-source self-built, open-source hosted, and commercial self-developed. As the saying goes, "know yourself and know your enemy, and you will never be defeated"; only by taking your own actual situation into account can you choose the most suitable solution.

Two Types of Risks and Ten Typical Problems

Risks in online applications fall into two main categories: "wrong" and "slow". "Wrong" usually means the program does not run as expected, for example the JVM loads the wrong version of a class, the code enters an abnormal branch, or the environment is misconfigured. "Slow" is usually caused by insufficient resources, such as a CPU saturated by burst traffic, an exhausted microservice or database thread pool, or continuous FullGC triggered by a memory leak.

Whether it's a "wrong" problem or a "slow" problem. From the perspective of users, they all hope to quickly locate the root cause, stop losses in a timely manner, and eliminate hidden dangers. However, based on the author's over five years of experience in Trace development, operation and maintenance, and preparation for the 11th National Congress, the vast majority of online problems cannot be effectively located and solved solely through the basic ability of link tracking. The complexity of online systems determines that an excellent Trace product must provide more comprehensive and effective data diagnosis capabilities, such as code level diagnosis, memory analysis, thread pool analysis, etc; At the same time, in order to improve the usability and stability of Trace components, it is also necessary to provide capabilities such as dynamic sampling, lossless statistics, and automatic convergence of interface names. This is also why mainstream Trace products in the industry are gradually upgrading to the fields of APM and observable applications. For the convenience of understanding, this article still uses Trace to uniformly express the observability of the application layer.

In summary, to ensure the highest level of business stability, when selecting a link-tracing solution for online applications you should look beyond the general basic capabilities of Trace (such as call chains, service monitoring, and link topology) and also use the "Ten Typical Problems" listed below (using Java applications as an example) to comprehensively compare how open-source self-built, open-source hosted, and commercial self-developed Trace products differ.

1. [Code-level automatic diagnosis] An interface occasionally times out. The call chain only shows the name of the timed-out interface, not its internal methods, so the root cause cannot be located and the issue is hard to reproduce. What should we do?

Engineers responsible for stability should be familiar with this scenario: the system occasionally experiences interface timeouts at night or during peak hours. By the time the problem is discovered and investigated, the abnormal scene has already been lost and is difficult to reproduce, so it cannot be diagnosed with a manual jstack. Current open-source link-tracing implementations can only show the timed-out interface through the call chain; the specific cause and the responsible code segment cannot be located, and the issue is ultimately left unresolved. This scenario repeats itself until it causes a failure and significant business losses.

To solve this, we need a precise and lightweight automatic slow-call listener that can faithfully restore the scene of code execution without any pre-instrumentation and automatically record the complete method stack of slow calls. As shown in the following figure, when an interface call exceeds a certain threshold (for example 2 seconds), listening starts on the thread handling the slow request; when the request ends (at the 15th second in this example), listening stops immediately, accurately preserving the set of thread snapshots taken during the request's lifecycle and restoring the complete method stack and time consumption.
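To make the idea concrete, below is a minimal, illustrative Java sketch of such a slow-call listener (not the ARMS implementation): once a request runs past the threshold, a background scheduler periodically snapshots the worker thread's stack until the request ends.

```java
import java.util.List;
import java.util.concurrent.*;

// Illustrative sketch: sample the worker thread's stack only after a request
// has already run longer than the slow-call threshold, and stop when it ends.
public class SlowCallWatcher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long thresholdMillis;      // e.g. 2000 ms before listening starts
    private final long sampleIntervalMillis; // snapshot interval, e.g. 50 ms

    public SlowCallWatcher(long thresholdMillis, long sampleIntervalMillis) {
        this.thresholdMillis = thresholdMillis;
        this.sampleIntervalMillis = sampleIntervalMillis;
    }

    /** Wraps a request; returns the collected stack snapshots if the call was slow. */
    public List<StackTraceElement[]> watch(Runnable request) {
        Thread worker = Thread.currentThread();
        List<StackTraceElement[]> snapshots = new CopyOnWriteArrayList<>();
        // Delay sampling by the threshold so fast requests pay (almost) no cost.
        ScheduledFuture<?> sampling = scheduler.scheduleAtFixedRate(
                () -> snapshots.add(worker.getStackTrace()),
                thresholdMillis, sampleIntervalMillis, TimeUnit.MILLISECONDS);
        try {
            request.run();
        } finally {
            sampling.cancel(false); // request ended: stop listening immediately
        }
        return snapshots;           // replay these to reconstruct the slow method stack
    }
}
```

A production agent would typically trigger this sampling transparently through bytecode instrumentation and merge the snapshots back into the call chain, rather than requiring the request to be wrapped explicitly.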

2. [Pooled monitoring] The microservice/database thread pool is often full, causing service timeouts that are very difficult to troubleshoot. How do we solve this?

A full microservice or database thread pool causing business request timeouts is a problem that occurs almost daily. Engineers with rich diagnostic experience will instinctively check the corresponding component logs; for example, Dubbo outputs an exception record when its thread pool is full. However, if the component does not log thread pool information, or the operations engineer is not experienced enough, such problems become very difficult to troubleshoot. Currently, open-source Trace products generally provide only JVM-level overview monitoring; they cannot show the status of each individual thread pool, let alone determine whether a thread pool is exhausted.

The pooled monitoring provided by commercial self-developed Trace products shows the maximum, current, and active thread counts of a specified thread pool directly, so the risk of thread pool exhaustion or high water level is visible at a glance. In addition, you can set alerts on the thread pool usage percentage; for example, when the number of threads in the Tomcat thread pool exceeds 80% of the maximum, send an SMS notification in advance, and when it reaches 100%, trigger a phone alert.
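As a rough illustration of what pooled monitoring collects, the following sketch polls a JDK ThreadPoolExecutor and applies the 80%/100% thresholds from the example above; the sendSms and callOnDuty hooks are hypothetical placeholders, not a real alerting API.

```java
import java.util.concurrent.*;

// Illustrative sketch of pooled monitoring: poll a ThreadPoolExecutor and alert on usage.
public class ThreadPoolMonitor {
    public static void monitor(ThreadPoolExecutor pool, ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(() -> {
            int max = pool.getMaximumPoolSize();
            int current = pool.getPoolSize();
            int active = pool.getActiveCount();
            double usage = max == 0 ? 0 : (double) active / max;
            System.out.printf("pool max=%d current=%d active=%d usage=%.0f%%%n",
                    max, current, active, usage * 100);
            if (usage >= 1.0) {
                callOnDuty("thread pool exhausted");     // phone alert at 100%
            } else if (usage >= 0.8) {
                sendSms("thread pool usage above 80%");  // SMS warning at 80%
            }
        }, 0, 15, TimeUnit.SECONDS);
    }

    // Hypothetical notification hooks for the sake of the example.
    private static void sendSms(String msg)    { System.out.println("[SMS] " + msg); }
    private static void callOnDuty(String msg) { System.out.println("[CALL] " + msg); }
}
```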

3. [Thread analysis] After a promotion stress test or a release change, the CPU water level is very high. How can we analyze the application's performance bottlenecks and optimize accordingly?

When we run large-scale stress tests or release major version changes (containing a lot of new code logic), we may see the CPU water level rise sharply without being able to pinpoint which code segment is responsible. We can only run jstack repeatedly, compare thread state changes by eye, and then optimize based on experience, which consumes a lot of energy for mediocre results. So is there a way to quickly analyze application performance bottlenecks? The answer is yes, and there is more than one.

The most common approach is to manually trigger a ThreadDump over a period of time (for example 5 minutes) and then analyze the thread overhead and method stack snapshots collected during that period. The drawbacks of a manually triggered ThreadDump are its high performance overhead, the fact that it cannot run continuously as a routine practice, and its inability to automatically preserve snapshots of scenes that have already occurred. For example, the CPU is high during a stress test, but by the time the test is over and a dump could be taken, the scene is gone and there was no time for a manual ThreadDump.

The second approach is an always-on thread analysis function that automatically records the state, count, CPU consumption, and internal method stacks of each category of thread pool. For any time period, click "sort by CPU time" to locate the thread category with the largest CPU overhead, then click into its method stack to see the exact code bottleneck. As shown in the figure below, a large number of methods are in the BLOCKED state waiting for a database connection, which can be optimized by expanding the database connection pool.
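The sketch below shows, under simplified assumptions, how per-thread CPU time can be collected and ranked with the JDK's ThreadMXBean; this is essentially the raw data behind a "sort by CPU time" view like the one described above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

// Illustrative sketch: rank live threads by CPU time and print their hottest stacks.
public class ThreadCpuRanking {
    public static void printTopThreads(int topN) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.getAllThreadIds();
        Arrays.stream(ids).boxed()
              // Sort descending by per-thread CPU time (may be -1 if unsupported).
              .sorted((a, b) -> Long.compare(mx.getThreadCpuTime(b), mx.getThreadCpuTime(a)))
              .limit(topN)
              .forEach(id -> {
                  ThreadInfo info = mx.getThreadInfo(id, 20); // top 20 stack frames
                  if (info == null) return;
                  System.out.printf("%s state=%s cpu=%dms%n",
                          info.getThreadName(), info.getThreadState(),
                          mx.getThreadCpuTime(id) / 1_000_000);
                  for (StackTraceElement frame : info.getStackTrace()) {
                      System.out.println("    at " + frame);
                  }
              });
    }
}
```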

4. [Exception diagnosis] After a release or configuration change, a large number of interface errors occurred, but the cause could not be located in time, resulting in a business failure. What can we do?

The biggest "culprit" that affects online stability is change. Whether it is application release change or dynamic configuration change, it may cause abnormal program operation. So, how to quickly determine the risk of change, identify problems in the first place, and stop losses in a timely manner?

Here we share a practice used by Alibaba's internal release system to intercept abnormal releases. One of its most important monitoring indicators is the comparison of Java Exception/Error counts before and after a change. Whether it is an NPE (NullPointerException) or an OOM (OutOfMemoryError), monitoring and alerting on the total number of exceptions or on specific exception types can quickly reveal online anomalies, especially when comparing the periods on either side of the change timeline.
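A minimal sketch of this exception-count comparison idea follows; the "more than doubled" threshold is an illustrative assumption, not a documented rule of the release system.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: count exceptions per type, then compare against a baseline
// snapshot taken before the release to flag types that spiked afterwards.
public class ExceptionCounter {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void record(Throwable t) {
        counts.computeIfAbsent(t.getClass().getName(), k -> new LongAdder()).increment();
    }

    /** Snapshot current counts, e.g. right before the release starts. */
    public Map<String, Long> snapshot() {
        Map<String, Long> copy = new ConcurrentHashMap<>();
        counts.forEach((type, adder) -> copy.put(type, adder.sum()));
        return copy;
    }

    /** Flag exception types whose count has more than doubled since the baseline. */
    public void compareWithBaseline(Map<String, Long> baseline) {
        counts.forEach((type, adder) -> {
            long before = baseline.getOrDefault(type, 0L);
            long after = adder.sum();
            if (after > Math.max(before * 2, 10)) {
                System.out.println("ALERT: " + type + " rose from " + before + " to " + after);
            }
        });
    }
}
```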

A dedicated exception analysis and diagnosis page shows the trend and stack details of each type of exception and lets you drill down into the distribution of associated interfaces, as shown in the following figure.

5. [Memory diagnosis] FullGC occurs frequently and a memory leak is suspected, but the abnormal objects cannot be located. What should I do?

FullGC is one of the most common problems in Java applications. It can be triggered by various causes, such as objects being created too quickly or memory leaks. The most effective way to troubleshoot FullGC is to take a heap memory dump (HeapDump), after which the memory usage of each type of object becomes clear at a glance.

The console-based ("white screen") memory snapshot function lets you trigger a one-click HeapDump and analysis on a specified machine, greatly improving the efficiency of troubleshooting memory problems. It also supports automatically dumping and preserving the abnormal snapshot in memory-leak scenarios, as shown in the following figure:
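For reference, a programmatic heap dump can be triggered with the JDK's HotSpot diagnostic MXBean, as in the minimal sketch below; a console-based one-click feature would wrap something similar plus automated analysis of the resulting file.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Illustrative sketch of a programmatic HeapDump via the HotSpot diagnostic MBean.
public class HeapDumper {
    public static void dumpHeap(String filePath, boolean liveObjectsOnly) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // liveObjectsOnly = true forces a GC first and dumps only reachable objects.
        bean.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        dumpHeap("/tmp/heap-" + System.currentTimeMillis() + ".hprof", true);
    }
}
```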

6. [Online debugging] The same code behaves differently in production than in local debugging. How do we troubleshoot this?

Code that has been debugged locally throws all kinds of errors as soon as it reaches the production environment. What exactly is wrong? Every developer has probably experienced this nightmare. There are many possible causes, such as conflicting Maven dependency versions, dynamic configuration parameters that differ across environments, and differences in dependent components across environments.

To troubleshoot online code that does not behave as expected, we need an online debugging tool that can inspect, in real time, the running program's source code, input and output parameters, execution method stack and time consumption, and the values of static fields or object instances, making online debugging as convenient as local debugging, as shown in the following figure:
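As a greatly simplified illustration of capturing input/output parameters and elapsed time, the sketch below uses a JDK dynamic proxy; real online debuggers do this transparently through bytecode instrumentation (java.lang.instrument) rather than proxies, so treat this purely as a conceptual example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Arrays;

// Illustrative sketch: log arguments, return values, exceptions, and elapsed time
// for every call on an interface, without touching the target implementation.
public class WatchProxy {
    @SuppressWarnings("unchecked")
    public static <T> T watch(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                Object result = method.invoke(target, args);
                System.out.printf("%s args=%s return=%s cost=%dms%n",
                        method.getName(), Arrays.toString(args), result,
                        (System.nanoTime() - start) / 1_000_000);
                return result;
            } catch (Exception e) {
                Throwable real = e.getCause() != null ? e.getCause() : e;
                System.out.printf("%s args=%s threw=%s%n",
                        method.getName(), Arrays.toString(args), real);
                throw real;
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }
}
```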

7. [Full-link tracing] Users report that the website opens very slowly. How can we trace the full call trajectory from the web front end to the server?

The key to connecting the front-end and back-end into one full link is for both sides to follow the same context propagation protocol. Currently, open-source solutions generally support only back-end application instrumentation and lack front-end instrumentation (such as Web/H5 and mini programs). The front-to-back full-link tracing scheme is shown in the following figure:

• Header propagation format: the Jaeger format is used uniformly; the key is uber-trace-id and the value is {trace-id}:{span-id}:{parent-span-id}:{flags}.

• Front-end access: CDN (script injection) and NPM are the two low-code access methods, supporting Web/H5, Weex, and various mini-program scenarios.

• Back-end access:

  • Java applications: the ARMS Agent is recommended first; it is non-invasive and requires no code changes, and it supports advanced features such as edge diagnosis, lossless statistics, and precise sampling. Custom methods can be instrumented manually via the OpenTelemetry SDK.

  • Non-Java applications: access via Jaeger and report data to the ARMS Endpoint; ARMS is fully compatible with link context propagation and display across multi-language applications.

The current full-link tracing solution of Alibaba Cloud ARMS is based on the Jaeger protocol, and SkyWalking protocol support is under development to enable lossless migration for users with self-built SkyWalking deployments. The call-chain view of full-link tracing across the front end, Java applications, and non-Java applications is shown in the following figure:
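A minimal sketch of building and parsing the uber-trace-id propagation header described above follows; the helper names are illustrative only.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the Jaeger "uber-trace-id" header:
// value format is {trace-id}:{span-id}:{parent-span-id}:{flags}.
public class UberTraceIdHeader {
    public static final String HEADER_NAME = "uber-trace-id";

    /** Build a header value for a new child span continuing an existing trace. */
    public static String childOf(String traceId, String parentSpanId, boolean sampled) {
        String spanId = Long.toHexString(ThreadLocalRandom.current().nextLong() | 1L); // non-zero span id
        String flags = sampled ? "1" : "0";
        return traceId + ":" + spanId + ":" + parentSpanId + ":" + flags;
    }

    /** Parse an incoming header value into its four parts. */
    public static String[] parse(String headerValue) {
        String[] parts = headerValue.split(":");
        if (parts.length != 4) {
            throw new IllegalArgumentException("invalid uber-trace-id: " + headerValue);
        }
        return parts; // [traceId, spanId, parentSpanId, flags]
    }
}
```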

8. [Lossless statistics] The cost of call-chain logs is too high, but after enabling client-side sampling the monitoring charts become inaccurate. How do we solve this?

Call-chain log volume is positively correlated with traffic. For consumer-facing (To C) businesses the traffic is very large, and reporting and storing every call chain would be very expensive. However, if client-side sampling is enabled, the statistical metrics become inaccurate. For example, with a sampling rate of 1%, only 100 out of 10,000 requests are recorded; statistics aggregated from these 100 logs suffer from severe sample skew and cannot accurately reflect the actual service traffic or latency.

To address this, the client agent needs to support lossless statistics: no matter how many requests occur within a period (usually 15 seconds), the same metric is reported only once, pre-aggregated on the client. The statistical metrics are then always accurate and unaffected by the call-chain sampling rate. Users can confidently tune the sampling rate down, and call-chain cost can be reduced by 90% or more. The larger the traffic and cluster size, the more significant the cost savings.
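The sketch below illustrates the client-side aggregation idea under simplified assumptions: every request is counted locally and one aggregated data point per endpoint is flushed every 15 seconds, independent of the call-chain sampling rate. The reporting sink (here just a print statement) stands in for the agent's upload channel.

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch of lossless statistics: aggregate locally, flush once per window.
public class LosslessStats {
    private static class Bucket {
        final LongAdder count = new LongAdder();
        final LongAdder totalMillis = new LongAdder();
        final LongAdder errors = new LongAdder();
    }

    private final ConcurrentHashMap<String, Bucket> buckets = new ConcurrentHashMap<>();

    public LosslessStats(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(this::flush, 15, 15, TimeUnit.SECONDS);
    }

    /** Record every request, regardless of whether its call chain is sampled. */
    public void record(String endpoint, long costMillis, boolean error) {
        Bucket b = buckets.computeIfAbsent(endpoint, k -> new Bucket());
        b.count.increment();
        b.totalMillis.add(costMillis);
        if (error) b.errors.increment();
    }

    /** Emit one aggregated data point per endpoint for the finished window. */
    private void flush() {
        for (Map.Entry<String, Bucket> e : buckets.entrySet()) {
            Bucket b = buckets.remove(e.getKey());
            if (b == null) continue;
            long count = b.count.sum();
            double avg = count == 0 ? 0 : (double) b.totalMillis.sum() / count;
            System.out.printf("%s count=%d avgMs=%.1f errors=%d%n",
                    e.getKey(), count, avg, b.errors.sum());
        }
    }
}
```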

9. [Automatic convergence of interface names] RESTful interface names diverge because of parameters such as timestamps and UIDs, and the monitoring charts become meaningless scatter points. How do we solve this?

When interface names contain variable parameters such as timestamps or UIDs, the same type of interface ends up with many different names, each occurring only a few times and therefore of no monitoring value. It can also create hotspots in storage and computation that affect cluster stability. In this case we need to classify and aggregate the divergent interfaces to improve both the value of data analysis and cluster stability.

This requires an automatic convergence algorithm for interface names that can actively identify variable parameters and aggregate interfaces of the same class, so that category-level trends can be observed, which better matches users' monitoring needs. At the same time, it avoids the data hotspot issues caused by interface divergence, improving overall stability and performance. As shown in the following figure, /safe/getXXXInfo/xxxx is grouped into one category; otherwise each request would produce a chart with only a single data point, which is unreadable.
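A simple rule-based sketch of such convergence is shown below: path segments that look like variable parameters (pure digits, UUIDs, long hex strings) are collapsed into a wildcard. Real products may use statistical or learned convergence rules, so treat this purely as an illustration.

```java
import java.util.regex.Pattern;

// Illustrative sketch of interface-name convergence: collapse variable path segments
// so /safe/getXXXInfo/10086 and /safe/getXXXInfo/10087 aggregate into one series.
public class EndpointNormalizer {
    private static final Pattern NUMERIC = Pattern.compile("^\\d+$");
    private static final Pattern UUID = Pattern.compile(
            "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$");
    private static final Pattern LONG_HEX = Pattern.compile("^[0-9a-fA-F]{16,}$");

    public static String normalize(String path) {
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) continue;
            out.append('/');
            if (NUMERIC.matcher(segment).matches()
                    || UUID.matcher(segment).matches()
                    || LONG_HEX.matcher(segment).matches()) {
                out.append('*');      // collapse variable segments
            } else {
                out.append(segment);  // keep stable segments
            }
        }
        return out.length() == 0 ? "/" : out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("/safe/getXXXInfo/10086")); // -> /safe/getXXXInfo/*
        System.out.println(normalize("/safe/getXXXInfo/10087")); // -> /safe/getXXXInfo/*
    }
}
```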

10. [Dynamic configuration push-down] A sudden traffic burst causes a resource shortage, and non-core functions need to be degraded immediately. How can we degrade or tune dynamically without restarting the application?

Incidents always come suddenly: traffic bursts, external attacks, and data center failures can all lead to insufficient system resources. To ensure that the most important core business is not affected, we often need to dynamically degrade some non-core functions to free up resources without restarting the application, for example by lowering the client-side call-chain sampling rate or turning off diagnostic modules with high performance overhead. Conversely, sometimes we need to dynamically enable expensive deep diagnostic features, such as a memory dump, to analyze the current abnormal scene.

Whether degrading or enabling features dynamically, configuration must be pushed down without restarting the application. Open-source Trace implementations typically lack this capability; it requires building a metadata configuration center and making corresponding code changes. Commercial Trace products not only support dynamic configuration push-down but can also refine it to an independent configuration per application. For example, if application A has occasional slow calls, its automatic slow-call diagnosis switch can be turned on for monitoring, while application B is latency-sensitive to CPU overhead and can turn the switch off. Each application takes what it needs without affecting the other.
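The sketch below illustrates dynamic configuration push-down with a simple polling loop; fetchRemoteConfig stands in for a real configuration center and is hypothetical, as are the example keys. A caller could register, say, a listener on trace.sample.rate that adjusts the agent's sampling on the fly without a restart.

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Illustrative sketch of per-application dynamic configuration push-down via polling.
public class DynamicConfig {
    private final ConcurrentHashMap<String, String> current = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Consumer<String>> listeners = new ConcurrentHashMap<>();

    /** Register a callback applied whenever a key changes, e.g. adjusting the sampling rate. */
    public void onChange(String key, Consumer<String> listener) {
        listeners.put(key, listener);
    }

    public void start(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(() -> {
            Map<String, String> latest = fetchRemoteConfig(); // hypothetical config-center call
            latest.forEach((key, value) -> {
                String old = current.put(key, value);
                if (!value.equals(old) && listeners.containsKey(key)) {
                    listeners.get(key).accept(value);         // apply without restart
                }
            });
        }, 0, 30, TimeUnit.SECONDS);
    }

    private Map<String, String> fetchRemoteConfig() {
        // Placeholder: in practice this would read the per-application settings
        // (sampling rate, slow-call diagnosis switch, ...) from the config center.
        return Map.of("trace.sample.rate", "0.01", "slowCall.diagnosis.enabled", "true");
    }
}
```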
