Link Analysis K.O.'s Five Classic Problems

The "Third Game" of Link Tracking

When link tracing comes up, most people naturally think of using call chains to troubleshoot exceptions in a single request, or using pre-aggregated link metrics for service monitoring and alerting. There is in fact a third way to use link tracing: compared with call chains, it locates boundary problems faster; compared with pre-aggregated monitoring charts, it supports more flexible custom diagnostics. That third way is post-aggregation analysis based on detailed link data, or link analysis for short.

Link analysis works on stored full-detail link data, freely combining filter conditions and aggregation dimensions for real-time analysis, which meets customized diagnostic needs in different scenarios: for example, examining the latency distribution of slow calls that take more than 3 seconds, the distribution of failed requests across machines, or the traffic changes of VIP customers. Next, this article shows how link analysis quickly locates five classic online problems, to give a more intuitive sense of its usage and value.

Link Analysis K.O "Five Classic Problems"

Post-aggregation link analysis is very flexible to use. This article lists only the five most typical scenarios; you are welcome to explore and share others.

[Uneven Traffic] A load balancing misconfiguration sends a large number of requests to a small number of machines, and the resulting "hotspots" threaten service availability. What should we do?

The problem of "hot spot breakdown" caused by uneven traffic can easily lead to service unavailability, and there have been too many such cases in production environments. For example, load balancing configuration errors, registry anomalies that prevent the restart of node services from going online, DHT hash factor anomalies, and so on.

The biggest risk of uneven traffic lies in whether the "hotspot" can be detected in time. Its symptoms look more like slow responses or errors, and traditional monitoring cannot surface the hotspot directly, so most engineers do not consider this factor at first, wasting valuable response time and letting the fault's impact spread.

By grouping link data by IP, we can quickly see which machines the requests land on, and in particular how the traffic distribution changes before and after the problem starts. If a large number of requests suddenly concentrate on one or a few machines, it is very likely a hotspot caused by uneven traffic. Combined with the change events around the time the problem started, the faulty change can be located quickly and rolled back in time.
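
As an illustration, here is a minimal sketch that assumes detailed span records have been exported as JSON lines with host_ip and timestamp fields; the field names, the export format, and the pandas-based analysis are assumptions rather than any specific product's API. It compares each machine's share of requests before and after a suspected incident time.

```python
# Sketch: spot traffic concentration by grouping span records by machine IP.
import pandas as pd

spans = pd.read_json("spans.jsonl", lines=True, convert_dates=["timestamp"])
incident_time = pd.Timestamp("2023-06-01 10:30:00")  # illustrative incident time

before = spans[spans["timestamp"] < incident_time]
after = spans[spans["timestamp"] >= incident_time]

# Share of requests handled by each machine, before vs. after the incident.
dist = pd.DataFrame({
    "before_pct": before.groupby("host_ip").size() / len(before) * 100,
    "after_pct": after.groupby("host_ip").size() / len(after) * 100,
}).fillna(0).sort_values("after_pct", ascending=False)

print(dist.head(10))  # a sudden jump on one or two IPs suggests a hotspot
```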

[Single-Machine Failure] A damaged network card, oversold CPU, a full disk, or another single-machine fault causes some requests to fail or time out. How do we troubleshoot it?

Single-machine failures happen all the time, especially in core clusters, where the sheer number of nodes makes them almost inevitable from a statistical standpoint. A single-machine failure will not make the whole service unavailable, but it will cause a small share of user requests to fail or time out, continuously hurting user experience and creating support costs, so such issues need to be handled promptly.

Single-machine failures fall into two types: host failures and container failures (Node and Pod in a Kubernetes environment). For example, oversold CPU and hardware faults are host-level and affect every container on the host, whereas faults such as a full disk or a memory overflow affect only a single container. Troubleshooting can therefore proceed along two dimensions: host IP and container IP.

Facing such problems, link analysis can filter out the abnormal or timed-out requests and aggregate them by host IP or container IP to quickly determine whether a single-machine fault exists. If the abnormal requests are concentrated on one machine, you can try replacing the machine for quick recovery, or check the machine's system metrics, such as whether the disk is full or CPU steal time is too high. If the abnormal requests are scattered across many machines, single-machine failure can most likely be ruled out, and the analysis should turn to downstream dependencies or application logic.
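
The sketch below, under the same assumed export format and illustrative field names as above, filters failed or timed-out spans and groups them by host and Pod IP to show where they cluster.

```python
# Sketch: check whether failed/timed-out requests cluster on one host or Pod.
import pandas as pd

spans = pd.read_json("spans.jsonl", lines=True)

# Failed or timed-out requests only (status/duration fields are illustrative).
bad = spans[(spans["status"] == "ERROR") | (spans["duration_ms"] > 3000)]

# Error/timeout counts per host and per container (Pod).
by_host = bad.groupby("host_ip").size().sort_values(ascending=False)
by_pod = bad.groupby(["host_ip", "pod_ip"]).size().sort_values(ascending=False)

print(by_host.head())  # concentrated on one host -> likely a host-level fault
print(by_pod.head())   # concentrated on one Pod  -> likely a container-level fault
```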

[Slow Interface Governance] How do we quickly compile the list of slow interfaces and remove performance bottlenecks before launching a new application or ahead of a major promotion?

Systematic performance tuning is usually needed before a new application goes live or ahead of a major promotion. The first step is to analyze the current system's performance bottlenecks and compile the list of slow interfaces and how often they occur.

Here, link analysis can filter out calls whose duration exceeds a certain threshold and then group and count them by interface name. This quickly reveals the list and distribution of slow interfaces, and the slowest, most frequent interfaces can then be tackled one by one.
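
A sketch of this step, under the same assumed export format, filters spans slower than 3 seconds and groups them by span name to produce a slow-interface report.

```python
# Sketch: build a slow-interface list by filtering on duration and grouping
# by interface (span) name. Field names are illustrative.
import pandas as pd

spans = pd.read_json("spans.jsonl", lines=True)
slow = spans[spans["duration_ms"] > 3000]

# Slow-call count and latency profile per interface.
report = slow.groupby("span_name").agg(
    slow_calls=("duration_ms", "size"),
    p95_ms=("duration_ms", lambda s: s.quantile(0.95)),
    max_ms=("duration_ms", "max"),
).sort_values("slow_calls", ascending=False)

print(report.head(20))  # tackle the most frequent, slowest interfaces first
```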

Once a slow interface is found, the root cause of the slow calls can be located by combining the related call chains, method stacks, thread pool data, and so on. Common causes include the following:

• The database/microservice connection pool is too small, and many requests are stuck waiting for a connection. Increasing the pool's maximum size solves this.

• The N+1 problem: a single external request triggers hundreds of database calls internally. The fragmented requests can be merged into batches to cut network round-trip time (see the sketch after this list).

• A single request carries too much data, which lengthens network transfer and deserialization time and easily triggers full GC (FGC). Changing full queries into paginated queries avoids fetching too much data at once.

• The logging framework hits a hot lock; switching log output from synchronous to asynchronous relieves it.
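
To make the N+1 item concrete, here is a minimal sketch with an illustrative schema, using the standard-library sqlite3 module so it stays self-contained; it contrasts one query per order with a single batched query.

```python
# Sketch: merging an N+1 query pattern into one batched query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, "a"), (1, "b"), (2, "c"), (3, "d")])

order_ids = [1, 2, 3]

# N+1 pattern: one round trip per order -- hundreds of calls for a big request.
items_per_order = {
    oid: conn.execute("SELECT sku FROM items WHERE order_id = ?", (oid,)).fetchall()
    for oid in order_ids
}

# Batched pattern: a single query fetches everything in one round trip.
placeholders = ",".join("?" * len(order_ids))
rows = conn.execute(
    f"SELECT order_id, sku FROM items WHERE order_id IN ({placeholders})",
    order_ids,
).fetchall()
```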

[Business Traffic Statistics] How do we analyze the traffic and service-quality changes of key customers or channels that require priority assurance?

In real production environments, services are usually standardized, but business traffic needs to be classified and graded. For the same order service, we may need to break statistics down by category, channel, user, and other dimensions to support fine-grained operations. For example, in the offline retail channel, the stability of every order and every POS terminal can trigger public complaints, so the SLA requirements of the offline channel are far higher than those of the online channel. How, then, can we accurately monitor the traffic and service quality of the offline retail path within a general-purpose e-commerce service system?

Here, link analysis with custom attribute filtering and statistics enables low-cost business-level analysis. For example, we tag offline orders with {"attributes.channel": "offline"} in the entry service, and add further tags for different stores, user groups, and product categories. Then, by filtering on attributes.channel=offline and grouping by the various business tags to count calls, latency, and error rate, we can quickly analyze the traffic trend and service quality of each business scenario.
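
A minimal sketch of such tagging with the OpenTelemetry Python API; the attribute keys (and the exact key prefix, e.g. attributes.channel, which depends on the backend) are illustrative assumptions.

```python
# Sketch: tag entry-service spans with business attributes so link analysis
# can later filter on channel=offline and group by store or category.
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def create_order(order):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("channel", "offline")            # offline retail channel
        span.set_attribute("store_id", order["store_id"])   # per-store tag
        span.set_attribute("category", order["category"])   # product-category tag
        # ... business logic ...
```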

[Grayscale Release Monitoring] 500 machines are released in 10 batches. How can we quickly tell whether the first grayscale batch is abnormal?

The "three rules of change", grayscale release, monitoring, and rollback, are an important baseline for online stability. Among them, releasing changes in grayscale batches is a key means of reducing online risk and limiting the blast radius. Once the service behavior of a grayscale batch turns abnormal, the release should be rolled back promptly rather than continued. Yet many production failures have been caused by the absence of effective grayscale monitoring.

For example, when a microservice registry malfunctions, restarted machines cannot re-register and come back online. Because grayscale monitoring was missing in one such case, every machine in the first restarted batch failed to register and all traffic was routed to the remaining machines, yet the application's overall traffic and latency showed no obvious change. Only when the last batch also failed to register after restarting did the entire application become unavailable, ultimately causing a serious online failure.

In the case above, if the traffic from different machines had been tagged with a version label and link analysis had grouped and counted on attributes.version, the traffic and service quality before and after the release, or across versions, could have been clearly distinguished, and grayscale-batch anomalies would not have been masked by global monitoring.
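
A sketch of such version tagging with the OpenTelemetry Python SDK; the environment variable name and version string are illustrative. Every span emitted by the process then carries the version, so post-aggregation analysis can group on it.

```python
# Sketch: stamp every span from this process with the deployed version so
# link analysis can group by version during a grayscale release.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "order-service",
    "service.version": os.getenv("APP_VERSION", "1.0.0"),  # e.g. "1.1.0-gray-batch1"
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```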

Constraints on Link Analysis

Although link analysis is flexible enough to meet customized diagnostic needs in different scenarios, it comes with several usage constraints:

1. Analysis based on detailed link data is relatively costly. Link analysis presumes that detailed link data is reported and stored as completely as possible; if a low sampling rate leaves the detailed data incomplete, the effectiveness of link analysis is greatly compromised (see the sketch after this list). To reduce the cost of full storage, edge data nodes can be deployed inside the user cluster for temporary caching and processing, cutting cross-network reporting costs; alternatively, hot and cold data can be separated on the server side, with hot storage serving full link analysis and cold storage serving slow-call diagnosis.

2. Post-aggregation analysis has a high query cost and low concurrency, so it is not suitable for alerting. Link analysis scans and aggregates the full data in real time, and its query cost is far higher than that of pre-aggregated metrics, which makes it unsuitable for high-concurrency alert queries. To support alerting and dashboard customization, the post-aggregation statement needs to be pushed down to the client and combined with custom metrics, so that the statistics are produced as user-defined metrics.

3. Custom tagging maximizes the value of link analysis. Unlike the standard pre-aggregated metrics of application monitoring, many scenario-specific tags have to be added manually by users so that different business scenarios can be distinguished and analyzed precisely.
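
Regarding the sampling constraint in item 1, the sketch below shows where the sampling ratio is typically configured in the OpenTelemetry Python SDK; post-aggregation analysis presumes this ratio stays at or near 1.0 (full reporting).

```python
# Sketch: keep (near-)full trace reporting so post-aggregation analysis
# works on complete detail data.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# A ratio of 1.0 keeps every trace; lowering it reduces storage cost but
# leaves link analysis with an incomplete picture.
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(1.0)))
```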
