How to Monitor and Locate Slow Calls in Kubernetes
Slow Call Hazards and Common Causes
Slow calls are a very common type of anomaly in software development. Their potential hazards include:
Front-end business dimension: slow calls lead to slow front-end loading, which in turn drives up the application uninstall rate and damages brand reputation.
Project delivery dimension: slow interfaces make it impossible to meet the SLO, resulting in project delays.
Business architecture stability: slow interface calls easily lead to timeouts. When other business services depend on this interface, they trigger large numbers of retries, which exhaust resources and can ultimately snowball into partial or complete service unavailability.
So seemingly insignificant slow calls may hide huge risks, and we should stay vigilant. Rather than ignoring slow calls, it is best to analyze the underlying causes as much as possible in order to keep the risk under control.
What causes slow calls? The possible reasons are countless, but they can be summarized into five common categories.
The first is high resource utilization, for example of CPU, memory, disk, or network cards. When utilization is too high, the service easily slows down.
The second is code design. Generally speaking, a SQL statement that joins or queries many tables will greatly hurt SQL execution performance.
The third is dependencies. The service itself has no issues, but a downstream service it calls responds slowly, leaving the service waiting, which also makes its own calls slow.
The fourth is design. For example, querying billions of rows from a very large table without database or table sharding easily produces slow queries. A similar case is performing time-consuming operations without caching.
The fifth is network-related issues, such as cross-continent calls where the physical distance is so large that round-trip times grow and calls slow down, or simply poor network quality between two points, for example packet loss and a high retransmission rate.
Today's examples revolve around these five aspects; let's look at them together.
What are the general steps, or best practices, for locating slow calls? I have summarized three aspects: golden signals + resource metrics + global architecture.
Let's look at the golden signals first. The golden signals come from Google's Site Reliability Engineering book, often regarded as the SRE bible. They are the minimum set of indicators for characterizing the health of a system:
Latency - the time the system takes to serve a request. Common indicators include average response time and quantiles such as P90/P95/P99, which show quite intuitively whether the system responds quickly or slowly.
Traffic - how busy the service is. Typical indicators include QPS and TPS.
Errors - for example HTTP 500 and 400 status codes. A large number of errors indicates that a problem has probably already occurred.
Saturation - the resource water level. Generally speaking, services close to saturation are more prone to problems; for example, a full disk prevents logs from being written, leaving the service unable to respond. Typical resources include CPU, memory, disk, queue length, number of connections, and so on.
In addition to the golden signals, we also need to focus on resource metrics. Brendan Gregg, the well-known performance analysis expert, proposed the USE method in his performance-analysis methodology. The USE method analyzes from the perspective of resources: for every resource, check Utilization, Saturation, and Errors, the three letters that form USE. Checking these three items can solve roughly 80% of service problems while costing only about 5% of the time.
With golden signals and resource metrics in hand, what else do we need to pay attention to? As Brendan Gregg put it in his methodology, we cannot see only the trees and miss the forest. Zhuge Liang also said, "He who does not plan for the whole is not fit to plan for a single part." We should draw out the system architecture and look at performance issues from a global perspective, rather than staring at one particular resource or service. Taking everything into account, identifying bottlenecks, and solving problems systematically through design is the better approach. So what we need is the combination of golden signals, resource metrics, and global architecture.
Slow Call Best Practices
Next, I will walk through three cases. The first is a full node CPU, a typical case of slow calls caused by the service's own resources. The second is slow calls to dependent middleware services. The third is poor network performance. In other words, the first case checks whether the problem lies in the service itself, the second checks for problems in downstream services, and the third checks the network between the service and its dependency.
Let's take an e-commerce application as the example. The traffic entry point is Alibaba Cloud SLB, and from there traffic enters the microservice system. Inside the microservices, all traffic is received by a gateway, which forwards it to the corresponding internal services such as ProductService, CartService, and PaymentService. Underneath, these services depend on middleware such as Redis and MySQL. We monitor the entire architecture with Alibaba Cloud's ARMS Kubernetes Monitoring product, and we use ChaosBlade to inject different types of faults, such as a full CPU and network anomalies.
Case 1: Node CPU Full Problem
What problems does a full node CPU cause? When the node's CPU is exhausted, the Pods on it cannot obtain more CPU, so the threads inside them sit waiting to be scheduled, which results in slow calls. Besides CPU, a node also has other resources, such as disk and memory, that can be exhausted in the same way.
Next, let's look at the characteristics of CPU in a Kubernetes cluster. First, CPU is a compressible resource. Kubernetes has several common related configurations: Requests, which are mainly used for scheduling, and Limits, which set a runtime cap beyond which the container is throttled. So the principle of this experiment is to drive the node's CPU to 100%, so that Pods cannot obtain more CPU, which in turn slows the service down. A hedged configuration sketch is shown below.
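For reference, here is a minimal sketch of how CPU requests and limits might be declared on a Deployment; the service name, image, and resource values are illustrative assumptions rather than the demo's actual configuration.

```yaml
# Illustrative only: CPU requests/limits on a hypothetical gateway Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway                # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: gateway
          image: registry.example.com/gateway:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"      # used by the scheduler to place the Pod
              memory: "512Mi"
            limits:
              cpu: "1"         # CPU is compressible: exceeding this throttles the container
              memory: "1Gi"    # memory is not: exceeding this can get the container OOM-killed
```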
Before the experiment officially starts, we identify the critical path through the topology diagram and configure alarms on it. For example, for the gateway and payment links we configure alarms such as P90 average response time and slow-call count. After the configuration is complete, I inject a CPU-full fault into the node that hosts the gateway. After waiting about five minutes, we receive the alarm, which is the second step: validating that the alarm is effective. An illustrative alert rule is sketched below.
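As an illustration of such a latency alarm, here is a hedged Prometheus-style sketch. The demo itself uses Kubernetes Monitoring's built-in alarm templates rather than PrometheusRule objects, and the metric and label names (http_request_duration_seconds_bucket, service="gateway") are assumptions.

```yaml
# Hypothetical P90 latency alert for the gateway, expressed as a Prometheus rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gateway-latency-alerts
spec:
  groups:
    - name: slow-calls
      rules:
        - alert: GatewayP90LatencyHigh
          # P90 request latency over the last 5 minutes, above 1 second
          expr: |
            histogram_quantile(0.90,
              sum(rate(http_request_duration_seconds_bucket{service="gateway"}[5m])) by (le)
            ) > 1
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "Gateway P90 latency has exceeded 1s for 3 minutes"
```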
Next, we move on to root cause localization. First, we open the application details of the gateway and check the relevant golden signal, the response time. We can see very intuitively that the response time rises sharply. Below that, the number of slow calls exceeds one thousand and jumps suddenly, and P90/P95 rise significantly to more than one second, indicating that the whole service has slowed down.
Next, we analyze the resource metrics. The Pod CPU usage chart shows that Pod usage rose rapidly during this period, meaning the Pod needed to obtain more CPU from the host, that is, the node. Looking at the node's CPU usage, we see that it is close to 100% during this period, so the Pod cannot obtain more CPU, which further slows the service and drives the average response time up significantly.
Having identified the problem, we can think about the concrete solution: configure elastic scaling based on CPU usage. Because we cannot predict the traffic, nor know in advance when resources will suddenly run short, the best way to handle this scenario is to configure elastic scaling for both applications and nodes, so that resources expand dynamically as the load grows. To configure elastic scaling for the application, we can use the CPU metric and configure a scale-out action that adds replicas to share the traffic; for example, a maximum of ten replicas and a minimum of three.
The effect is as follows: when the CPU-full fault is injected, slow calls increase; once CPU utilization exceeds the threshold, for example 70%, elastic scaling is triggered. Replicas are automatically added to share the traffic, and we can see the slow-call count gradually decrease until it disappears, showing that elastic scaling did its job. A minimal autoscaling sketch follows.
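Here is a minimal sketch of the application-level autoscaling described above, assuming the workload is a Deployment named gateway: between three and ten replicas, targeting roughly 70% average CPU utilization. Node-level elasticity (for example, a cluster autoscaler) would be configured separately and is not shown.

```yaml
# Illustrative HorizontalPodAutoscaler matching the settings described in the text.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway            # hypothetical target workload
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out once average CPU utilization exceeds ~70%
```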
Case 2: Slow Calls to Dependent Middleware
Next, let's look at the second case. First, the preparation work. In the diagram on the left, two downstream services hang off the gateway: one is MySQL and the other is ProductService. So we configure an alarm directly on the gateway for P99 response time greater than one second. Second, ProductService is also on the critical path, so I give it a P99-greater-than-one-second alarm as well. The third, on MySQL, is likewise an alarm for response time greater than one second. After the configuration, I inject a MySQL slow-query fault into the product service. After about two minutes, the alarms fire one after another: both the gateway and the product show a red dot and a gray dot, which are the reported fault and the reported alarm event. Kubernetes Monitoring automatically matches the alarm event to the node through the namespace and application, so we can quickly see which services and applications are abnormal and locate the problem. Now that we have received the alarm, the next step is root cause localization.
First, a word about the localization process. It is alarm-driven: because prevention is always better than remediation, we configure alarms first and then perform root cause localization when they fire. Next, we use the topology for visual analysis, since the topology supports architecture awareness and upstream/downstream analysis. After receiving an alarm, we look at what happened to the corresponding application. The first one we check is the gateway: its P99 has risen above 1800 milliseconds, triggering the greater-than-one-second alarm, and several other quantiles are rising as well. Then we look more closely at the other alarming service, Product. Clicking on that node, the panel shows that Product has also had slow calls: P99 and P95 both rose to varying degrees, mostly above one second. At this point we check Product's own resource usage, since the problem might lie in Product itself. We then look at Product's downstream: one dependency is Nacos, the other is MySQL. Inspecting the interaction with MySQL, we find a large number of slow calls, and clicking into the details lets us drill down to see what happened during those calls. Digging further into the data, we find that when calling MySQL, Product executed a very complex statement: a single SQL query joining multiple tables. From the call trace we can see that it takes a long time, so we can conclude that the problem is basically caused by this SQL statement.
To summarize the whole process: first, we identify the critical path through architecture awareness and configure alarms on it to proactively detect anomalies. When an anomaly is found, we use golden signals and the application's own resource metrics to check the service itself. If the service itself is fine, we follow the dependency downstream and look at the downstream's resource metrics, and in this way we locate a slow-call problem in a dependency such as middleware.
Case 3: Poor Network Performance
Next, let's talk about the last example: poor network performance. Kubernetes' network architecture is quite complex, covering container-to-container, Pod-to-Pod, Pod-to-Service, and external-to-Service communication, and so on. The complexity is high and the learning curve steep, which makes problems harder to locate. So how do we deal with this? We use key network metrics to detect network anomalies. Which metrics are key? The first is rate and bandwidth, the second is throughput, the third is latency, and the fourth is RTT. A sketch of alerts on two of these metrics follows.
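To illustrate alerting on such network metrics, here is a hedged Prometheus-style sketch covering packet drops and the TCP retransmission rate. The metric names follow node_exporter conventions and are assumptions here; the demo itself relies on Kubernetes Monitoring's built-in network metrics and panels.

```yaml
# Hypothetical node-level network alerts (node_exporter metric names assumed).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-network-alerts
spec:
  groups:
    - name: network
      rules:
        - alert: NodePacketDrops
          # received packets dropped per second on the node's interfaces
          expr: rate(node_network_receive_drop_total[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node is dropping received packets"
        - alert: HighTcpRetransmissionRate
          # retransmitted segments as a share of all outgoing segments
          expr: |
            rate(node_netstat_Tcp_RetransSegs[5m])
              / rate(node_netstat_Tcp_OutSegs[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "TCP retransmission rate above 5%"
```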
First, I configure an alarm here and then inject a high-packet-loss fault into the node where MySQL runs. After waiting a few minutes, we receive slow-call alarms: the response times of both the gateway and the product exceed one second. Next, root cause localization. We can see that the gateway's P99 response time rose suddenly with slow calls, and the product's average response time also rose suddenly; that is the service slow call we just triggered. We then look further at the product's downstream, its three dependencies Nacos, Redis, and MySQL, and see that the slow calls are quite obvious. Looking at that downstream interaction, we find that Product has serious slow calls when calling MySQL, and its RTT and retransmissions are clearly elevated as well.
Under normal circumstances RTT is very stable, since it reflects the round-trip time between upstream and downstream; if it rises sharply, the problem can basically be treated as a network problem. So we can see the three affected components: Gateway, Product, and MySQL. From this we can conclude that identifying the critical path and configuring alarms on the topology lets us locate problems quickly. There is no need to verify lots of scattered information in different places; we only need to check the corresponding performance and network metrics on the topology. So this is our best practice of golden signals + resource metrics + resource topology for locating anomalies such as slow calls.
Finally, let's summarize the best practices from this session:
1. Proactively discover anomalies through default alarms. The default alarm templates cover RED and common resource-type metrics; beyond the default rules, users can also customize configurations based on the templates.
2. Detect and locate anomalies through golden signals and resource metrics, and drill down with traces to find the root cause.
3. Use topology diagrams for upstream/downstream analysis, dependency analysis, and architecture awareness. Examining the architecture from a global perspective helps find the optimal solution, achieve continuous improvement, and build a more stable system.