Cloud Application Performance Diagnosis of System O&M Tool SysAK

By SIG for OpenAnolis system operation & maintenance

System operation and maintenance (O&M) must keep services running stably while making the most of available resources, so evaluating application performance is an important part of the job. SysAK, a powerful O&M tool, provides this capability. However, diagnosing application performance is harder than diagnosing stability problems, and it is difficult for non-specialists to handle. Drawing on extensive performance diagnosis practice, this article introduces SysAK's methodology and related tools for performance diagnosis.

Diagnosis Method of SysAK Application Performance

In short, SysAK's basic approach to diagnosing application performance is top-down analysis combined with association expansion.

Top-down means working from the application to the OS to the hardware. Association expansion means widening the analysis to peer applications, system impact, and network topology. This is simple to state, but implementing it is a large undertaking.

1. Application Profile

The first step is to build a profile of the application that covers its business throughput and system resource usage. Then analyze, one by one, the performance bottlenecks that account for a large share of the profile. Specifically, the profile includes statistics on the application's concurrency, running behavior, and sleep behavior. Concurrency is simple: count the number of business tasks. This number serves mainly as a reference for the resource statistics that follow.

1.1 Operational Statistics

Operational statistics are classified statistics on the application's use of basic system resources. The basic resources an application occupies while running fall into four categories:

CPU

CPU usage shows whether the application's throughput is high. The split of CPU time between user and sys shows whether the run time is spent mainly in business logic or in kernel services. Therefore, at a minimum, the statistics here include the total run time and the respective proportions of user and sys time. If the sys proportion is high, continue by analyzing whether the corresponding kernel resources are abnormal. Otherwise, analyze whether there is a bottleneck in hardware resources.
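As a minimal sketch of the user/sys split, the following reads two samples of a process's CPU counters and computes the share of each over the interval. It assumes the counters come from a /proc/<pid>/stat line (utime is field 14 and stime is field 15, in clock ticks); the function names are illustrative and are not part of SysAK.

```python
def parse_cpu_times(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so utime (14) and stime (15) are
    # at indexes 11 and 12.
    return int(rest[11]), int(rest[12])


def user_sys_share(before, after):
    """Return (user_share, sys_share) of CPU time between two samples."""
    d_user = after[0] - before[0]
    d_sys = after[1] - before[1]
    total = d_user + d_sys
    if total == 0:
        return 0.0, 0.0
    return d_user / total, d_sys / total
```

A high sys share over the interval is the signal to move on to the kernel resource analysis described later.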

Memory

Memory usage statistics determine whether memory allocation and access limit business performance. Therefore, at a minimum, the statistics here include the total amount and frequency of memory allocation, the number of page faults, and the number and size of accesses across NUMA nodes.
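One of these figures, the page fault count, can be sampled without special tooling. The sketch below assumes the counters are taken from /proc/<pid>/stat (minflt is field 10, majflt is field 12) and turns two samples into per-second rates; the helper names are hypothetical.

```python
def parse_faults(stat_line):
    """Return (minor_faults, major_faults) from a /proc/<pid>/stat line."""
    # Split after the parenthesized comm field, which may contain spaces.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3, so minflt (10) and majflt (12) are
    # at indexes 7 and 9.
    return int(rest[7]), int(rest[9])


def fault_rates(before, after, interval_s):
    """Return (minor_per_s, major_per_s) between two samples."""
    minor = (after[0] - before[0]) / interval_s
    major = (after[1] - before[1]) / interval_s
    return minor, major
```

Major faults require I/O to resolve, so a rising major fault rate is usually the more expensive of the two.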

File

File access statistics determine whether file I/O limits business performance. At a minimum, the statistics here include the read/write frequency, the page cache hit rate, and the average I/O latency.
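Average I/O latency can be approximated at the device level from two snapshots of /proc/diskstats. This is a sketch under that assumption: it divides the change in time spent on reads and writes by the change in completed reads and writes, which is a device-wide figure rather than a per-application one. The function names are illustrative.

```python
def parse_diskstats(text):
    """Return {device: {"ios": completed I/Os, "ms": time spent in ms}}."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 11:
            continue
        # f[3]: reads completed, f[6]: ms spent reading,
        # f[7]: writes completed, f[10]: ms spent writing.
        stats[f[2]] = {
            "ios": int(f[3]) + int(f[7]),
            "ms": int(f[6]) + int(f[10]),
        }
    return stats


def avg_io_latency_ms(before, after, dev):
    """Average latency per completed I/O on dev between two snapshots."""
    d_ios = after[dev]["ios"] - before[dev]["ios"]
    d_ms = after[dev]["ms"] - before[dev]["ms"]
    return d_ms / d_ios if d_ios else 0.0
```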

Networking

Network traffic statistics determine whether the network limits business performance. At a minimum, the statistics here include the traffic volume and the network topology of the peer connections.
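For the traffic part, a rough sketch is to sample the per-interface byte counters twice and convert the difference into bits per second. The example assumes the text comes from /proc/net/dev; it covers interface-level traffic only, not the peer topology, and the function names are hypothetical.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_bps(before, after, iface, interval_s):
    """Return (rx_bits_per_s, tx_bits_per_s) between two snapshots."""
    rx = (after[iface][0] - before[iface][0]) * 8 / interval_s
    tx = (after[iface][1] - before[iface][1]) * 8 / interval_s
    return rx, tx
```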

1.2 Sleep Statistics

If sleep time accounts for a large share of the application's running cycle, it is likely a key factor affecting business performance, and the sleep details need to be analyzed. The statistics must cover at least the following three types of behavior, including the count and duration of each:

Active Sleep: If this type accounts for a high proportion, the sleep is the application's own choice, and the cause lies in the application logic.
User Critical Resource Competition: If this type accounts for a high proportion, the application itself needs to be optimized.
Kernel Resource Waiting: If this type accounts for a high proportion, the specific kernel resource bottleneck needs to be analyzed.

With the application profile, we have a basic understanding of how the application runs. If the bottleneck turns out not to be in the business logic, we continue by analyzing the corresponding system resources or hardware bottlenecks.
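The three sleep categories from section 1.2 can be tallied with a small aggregation. The sketch below is illustrative only: the input format (a list of wait reason and duration pairs) and the mapping from wait reason to category are assumptions made for the example, not SysAK's actual classification.

```python
from collections import defaultdict

# Hypothetical mapping from a wait reason to a sleep category.
# Anything not listed is treated as waiting on a kernel resource.
CATEGORY_OF = {
    "nanosleep": "active_sleep",
    "futex": "user_resource_competition",
}


def summarize_sleep(events):
    """events: iterable of (reason, duration_ms) pairs.

    Returns {category: {"count", "total_ms", "share"}} where share is the
    category's fraction of all sleep time.
    """
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for reason, duration_ms in events:
        category = CATEGORY_OF.get(reason, "kernel_resource_waiting")
        stats[category]["count"] += 1
        stats[category]["total_ms"] += duration_ms
    grand_total = sum(s["total_ms"] for s in stats.values())
    for s in stats.values():
        s["share"] = s["total_ms"] / grand_total if grand_total else 0.0
    return dict(stats)
```

The category with the largest share indicates which of the three follow-up actions above applies.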

2. System Kernel Resources

The constraints that system kernel resources place on application performance can be divided into three categories.

2.1 Interference

While a server operating system is running, there are many sources of interference with the application, but not all of them affect the service. Therefore, at a minimum, the frequency and running time of each interference source must be collected to evaluate whether it is a key factor.

The statistics of the following interference sources must be included at a minimum:

Device Hardware Interrupts

If a certain type of interrupt fires too frequently, is concentrated on one CPU, or takes a long time per invocation, service performance may suffer. You can spread the interrupt across CPUs or bind it to specific CPUs, then observe the effect.
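A simple way to spot interrupts concentrated on one CPU is to look at the per-CPU counters. The following sketch assumes the input is the content of /proc/interrupts; the thresholds (min_total, min_share) and function names are arbitrary choices for illustration.

```python
def parse_interrupts(text):
    """Return {irq_name: [count_per_cpu, ...]} from /proc/interrupts content."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts = rest.split()[:ncpu]
        # Skip summary rows (such as ERR) that lack per-CPU columns.
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue
        table[name.strip()] = [int(c) for c in counts]
    return table


def concentrated_irqs(table, min_total=1000, min_share=0.9):
    """Return [(irq_name, cpu_index)] for busy IRQs handled mostly by one CPU."""
    result = []
    for irq, counts in table.items():
        total = sum(counts)
        if total >= min_total and max(counts) / total >= min_share:
            result.append((irq, counts.index(max(counts))))
    return result
```

Because the counters are cumulative since boot, comparing two snapshots taken during the workload gives a more relevant picture than a single reading.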

System Timer Interrupts

Too many system timers may also delay process wakeups. You can analyze whether the process uses a large number of high-resolution timers.

Softirqs

Softirq load can surge, for example during a burst of network traffic.

Kernel Threads

Other High Priority Applications

2.2 Bottlenecks

The system has a wide variety of kernel resources, and different application models depend on them in different ways. Not every bottleneck can be covered, but at a minimum the following common kernel resources must be included:

Running Queue Length

This indicates whether the business processes/threads are too concurrent or whether the CPU binding is unreasonable.
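As a coarse, system-wide proxy for run queue length, the number of currently runnable scheduling entities can be compared with the number of CPUs. This sketch assumes the input is the content of /proc/loadavg, whose fourth field has the form runnable/total; it does not show per-CPU queues, which is what matters when diagnosing unreasonable CPU binding. The function name is hypothetical.

```python
def runnable_per_cpu(loadavg_text, ncpus):
    """Return currently runnable entities per CPU from /proc/loadavg content."""
    # The fourth field looks like "16/1024": runnable / total entities.
    runnable = int(loadavg_text.split()[3].split("/")[0])
    return runnable / ncpus
```

A value persistently well above 1 suggests that tasks are queuing for CPU time.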

fs/block Layer Latency

Different file systems, devices, and I/O scheduling algorithms may have different bottlenecks. Locating them requires measuring latency stage by stage.

Memory Allocation Latency

Depending on memory usage and fragmentation, memory allocation latency can be high.

The Duration and Frequency of Pagefault

Memory allocation, remapping, and TLB flushes caused by page shortages are expensive. If the application enters the page fault path frequently, adjusting its memory strategy, for example by pre-allocating memory pools or using huge pages, may be a good choice.

Competition for Critical Path Kernel Locks

Locks are an unavoidable mechanism. Lock contention in kernel mode drives up sys CPU time and requires specific analysis in context.

2.3 Policy

The kernel resources listed above cannot be covered completely, but there is another way to gain insight. Because different kernel policies can lead to relatively large performance differences, you can compare systems and look for the points where their configurations differ. The following system configuration is usually collected:

Kernel startup parameters
Kernel configuration interface – sysctl/procfs/sysfs
Kernel module differences
cgroup configuration
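The comparison itself can be as simple as a key-by-key diff of two collected configurations. The sketch below assumes each system's settings were saved as "key = value" lines, the format printed by sysctl -a; the function names are illustrative.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines into a dict."""
    conf = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def diff_config(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs.

    A key missing on one side is reported with None for that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

The same diff applies to kernel startup parameters, module lists, or cgroup settings once they are flattened into key and value pairs.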

3. Virtualization

When no bottleneck can be found, or when we want to squeeze out the remaining performance, we usually turn to the hardware side. Because business is now deployed in the cloud, the virtualization layer (the host side) must be considered before going down to the hardware layer. For host-side performance analysis, the methods described above for system kernel resource constraints can be reused, while less work is needed for the application profile: unlike business applications, the logic of the virtualization layer does not vary endlessly. The virtualization solutions offered by cloud vendors can be learned from various channels, and the current mainstream is Linux KVM. Therefore, we can analyze the technical points of KVM specifically. The statistics should include:

The frequency and duration of qemu thread preemption, the frequency and triggering events of guest traps, and the running time of qemu threads on the host.

These are used to determine whether the performance loss is due to the virtualization layer or whether there is a possibility for improvement.
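On the host, the running time of a qemu thread and the time it spent runnable but waiting for a CPU can be sampled from the scheduler statistics. This sketch assumes the input is the content of /proc/<pid>/task/<tid>/schedstat, which holds three numbers: time on CPU in nanoseconds, time waiting on a run queue in nanoseconds, and the number of timeslices run. Run queue wait is only a proxy for preemption, not an exact preemption count, and the function names are hypothetical.

```python
def parse_schedstat(text):
    """Return (run_ns, wait_ns, timeslices) from a schedstat line."""
    run_ns, wait_ns, timeslices = (int(x) for x in text.split()[:3])
    return run_ns, wait_ns, timeslices


def sched_delta(before, after):
    """Summarize scheduling activity between two samples of one thread."""
    run = after[0] - before[0]
    wait = after[1] - before[1]
    busy = run + wait
    return {
        "run_ms": run / 1e6,
        "wait_ms": wait / 1e6,
        "timeslices": after[2] - before[2],
        # Share of runnable time spent waiting for a CPU.
        "wait_share": wait / busy if busy else 0.0,
    }
```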

4. Hardware Performance

Diagnosis usually reaches the hardware layer because no further room for optimization could be found at the application or system layer. There are two approaches. One is to look at the hot spots of hardware utilization and see whether the application can be adjusted in return to reduce its dependence on the hardware or to spread out those hot spots. The other, when the application cannot be adjusted, is to evaluate whether the hardware itself has reached its performance limit. The former can be developed into a complete methodology, such as Ahmed Yasin's TMAM. SysAK does not go very far in this direction, but some work is still necessary. Besides collecting data such as cache misses, TLB misses, and CPI, the more important task is to analyze these data together with the behavior of the application process. For example, when there is heavy contention for cache or memory bandwidth on the same CPU, is it caused by the design of the current business program, or by contention from other processes? The latter can be optimized with techniques such as CPU binding and RDT.
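The derived metrics mentioned here are simple ratios of hardware counter values. The sketch below assumes the four raw counts were collected elsewhere, for example as the cycles, instructions, cache-references, and cache-misses events of a profiling tool; the function name is illustrative.

```python
def hw_ratios(cycles, instructions, cache_refs, cache_misses):
    """Return (cpi, cache_miss_rate) from raw hardware counter values."""
    # CPI: average cycles spent per retired instruction.
    cpi = cycles / instructions if instructions else float("inf")
    # Miss rate: fraction of cache references that missed.
    miss_rate = cache_misses / cache_refs if cache_refs else 0.0
    return cpi, miss_rate
```

Comparing these ratios for the same workload with and without co-located processes helps separate the application's own behavior from contention caused by neighbors.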

5. Interactive Application Environment

That is not all; one part is still missing. Most applications today do not run alone, and the applications they interact with also affect performance. This is why the application profile includes the network connection topology mentioned above. All the preceding diagnostic methods can be applied in the same way to the objects that interact with the current application.

Summary

Let's summarize this article with a picture:

The tools involved in the figure will appear in the following articles. Stay tuned for more information!

Community