How to spot anomalies in services and workloads in Kubernetes

Pain points of anomaly localization in Kubernetes

In today's internet landscape, more and more companies are adopting a microservices + Kubernetes architecture, which has the following characteristics:

1. First, the application layer is built on microservices, which are composed of decoupled services that call one another. Services generally have clear responsibilities and boundaries, so even a simple product can consist of dozens or even hundreds of microservices. The dependencies and calls between them are very complex, which makes locating problems costly. At the same time, the owners of each service may come from different teams and different developers and may use different languages; for monitoring, this means tools have to be integrated language by language, which lowers the return on investment. Another characteristic is the variety of protocols: almost every piece of middleware (Redis, MySQL, Kafka) has its own protocol, and observing these protocols quickly is a significant challenge.

2. Kubernetes and containers hide the complexity of the underlying infrastructure from applications, but this has two consequences: the infrastructure stack keeps growing taller, and the information gap between upper-layer applications and the infrastructure keeps widening. For example, a user reports that website access is slow; the administrator checks the access logs, service status, and resource utilization and finds nothing wrong, and at that point has no idea where the problem lies. They may suspect the infrastructure, but can only check it piece by piece, which is inefficient. The root cause is the lack of correlation between upper-layer applications and the infrastructure, which makes end-to-end analysis impossible.

The last pain point is scattered data, too many tools, and no correlation between them. For example, we receive an alert and look at the metrics in Grafana, but metrics only describe the problem roughly, so we need to check the logs. We go to the SLS log service and find the logs look fine too. Next we want to log in to the machine to investigate, but the container may have restarted and its logs are gone. After a round of checking we suspect the problem may not be in this application at all, so we open the tracing tool to see whether a downstream service is at fault. In short, many tools were used, the browser ended up with more than ten windows open, and the whole process was inefficient and frustrating.

These three pain points boil down to cost, efficiency, and experience. With them in mind, let's look at the data model behind Kubernetes monitoring and see how it can better address these three problems.

How Kubernetes monitoring detects anomalies

The following diagram shows the density, or level of detail, of the information from top to bottom: the lower the layer, the more detailed the information. Starting from the bottom, Traces are application-layer protocol data (such as HTTP, MySQL, and Redis) collected via eBPF in a non-invasive, multi-protocol, multi-language way. The protocol data is further parsed into easily understandable request details, response details, and per-stage timing information.

The next level up is metrics, which consist mainly of golden signals, network metrics, and metrics from the Kubernetes system. The golden signals and network metrics are collected via eBPF, so they are also non-invasive and support a wide range of protocols. With the golden signals we can tell whether a service as a whole is erroring, slow, or affecting users. The network metrics are socket-level measurements such as packet loss rate, retransmission rate, and RTT, used to judge whether the network is healthy. The Kubernetes system metrics are the cAdvisor / metrics-server / Node Exporter / NPD metrics from the existing Kubernetes monitoring ecosystem.

The next level up is events, which tell us directly and explicitly what happened; the most common ones are probably Pod restarts and image pull failures. We persist Kubernetes events for a period of time to make troubleshooting easier, and our inspections and health checks can also report their results as events.

The top level is alerting, the final link in the monitoring loop. When we discover specific anomalies that may harm the business, we configure alerts on the relevant metrics and events. Alerting currently supports PromQL, and intelligent alerting applies anomaly-detection algorithms to historical data to surface potential problems. Alert configuration also supports dynamic thresholds, tuned via a sensitivity setting instead of a hard-coded threshold.

Once we have Traces, metrics, events, and alerts, we use topology diagrams to associate this data with Kubernetes entities. Each node corresponds to a service or workload in Kubernetes, and calls between services are shown as edges. A topology map is like a city map: it lets us quickly spot anomalies, analyze them further, examine upstream and downstream dependencies and the blast radius, and thereby gain more comprehensive control over the system.
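As a rough illustration of the kind of alert logic described above, here is a minimal sketch in Python that runs a PromQL query against a Prometheus-compatible endpoint and flags services whose error ratio exceeds a fixed threshold. The endpoint URL, metric names, and threshold are hypothetical placeholders, not part of the product, and a real setup would use alerting rules or dynamic thresholds rather than a script.

```python
# Minimal sketch: evaluate a PromQL expression over the Prometheus HTTP API
# and flag services whose 5xx error ratio exceeds a threshold.
# PROM_URL, the metric name, and the threshold are hypothetical placeholders.
import requests

PROM_URL = "http://prometheus.example.com/api/v1/query"  # hypothetical endpoint

# Hypothetical PromQL: per-service ratio of 5xx responses over the last 5 minutes.
QUERY = (
    'sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)'
    ' / sum(rate(http_requests_total[5m])) by (service)'
)

def check_error_ratio(threshold: float = 0.05) -> None:
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        service = series["metric"].get("service", "unknown")
        ratio = float(series["value"][1])
        if ratio > threshold:
            print(f"ALERT: {service} error ratio {ratio:.2%} exceeds {threshold:.0%}")

if __name__ == "__main__":
    check_error_ratio()
```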

Best Practices & Scenario Analysis

Next, we will discuss the best practices for discovering anomalies in services and workloads in Kubernetes.

First, there should be metrics that reflect the health of a service. Collect as many metrics as possible, the more comprehensive the better, and not only golden signals, USE metrics, and Kubernetes-native metrics. Metrics are macro-level data; root cause analysis also requires Trace data, and with multiple languages and protocols in play we have to weigh the cost of collecting those Traces while supporting as many protocols and languages as possible. Finally, use a topology to tie the metrics, Traces, and events together into a topology diagram for architecture awareness and upstream/downstream analysis.

Analysis along these three lines usually exposes anomalies in services and workloads, but we should not stop there: if the anomaly comes back, we would have to repeat all of this work. The best approach is to configure alerts for such anomalies so that they are handled automatically.

We will elaborate on several specific scenarios:

(1) Network performance monitoring

Let's take retransmission as an example of network performance monitoring: retransmission means the sender believes packets have been lost and resends them. Take the transmission process in the figure as an example:

1. The sender sends packet number 1; the receiver accepts it and returns ACK 2

2. The sender sends packet number 2, but it is lost on the way, so the receiver is still waiting for packet 2

3. The sender sends packets 3, 4, and 5; since packet 2 has not arrived, the receiver returns ACK 2 for each of them (duplicate ACKs)

4. Once the sender has received the same ACK three times, the retransmission mechanism is triggered, and retransmission increases latency

Retransmission is invisible in application code and logs, which makes this kind of root cause hard to find. To locate such problems quickly, we need a set of network performance metrics to provide evidence, including P50, P95, and P99 latency percentiles, plus traffic, retransmission, RTT, and packet loss metrics to characterize the network.
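To make the retransmission metric concrete, here is a minimal sketch that derives a host-level TCP retransmission ratio from the kernel's cumulative counters in /proc/net/snmp (Linux only). The sampling interval is arbitrary, and this is only an approximation of the idea; the product described here collects such signals via eBPF rather than this way.

```python
# Minimal sketch: compute a TCP retransmission ratio over a sampling interval
# from the cumulative counters in /proc/net/snmp (Linux).
import time

def read_tcp_counters() -> dict:
    # /proc/net/snmp contains a "Tcp:" header line followed by a "Tcp:" value line.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return dict(zip(header, map(int, values)))

def retransmission_ratio(interval: float = 5.0) -> float:
    before = read_tcp_counters()
    time.sleep(interval)
    after = read_tcp_counters()
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent else 0.0

if __name__ == "__main__":
    print(f"TCP retransmission ratio over 5s: {retransmission_ratio():.2%}")
```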

Take a service with high RT as an example. First, we see that an edge in the topology is red; edges are marked red based on latency and errors. When we find a red edge, we click on it to see the corresponding golden signals.

Clicking the leftmost button at the bottom opens the list of network data for the current service, which can be sorted by average response time, retransmissions, or RTT. We can see that the first call has relatively high latency, with a response time as high as one second, and that its retransmissions are much higher than those of other services. (In this case a fault-injection tool was used to inject high retransmission, which makes it stand out.) From this analysis we know it is probably a network issue and can investigate further. Experienced developers will take the network metrics, service name, IP address, domain name, and other details to their network colleagues, rather than just saying "my service is slow." With so little information the other side does not know where to start and will not actively investigate; with concrete data, they have a reference point and can push the investigation forward.

(2) DNS resolution exception

The second scenario is DNS resolution exceptions. DNS is usually the first step of protocol communication: an HTTP request, for example, first has to resolve an IP address, which is the familiar service discovery process. If this first step fails, the entire call fails directly; as the saying goes, the critical path must not break. In a Kubernetes cluster, all DNS resolution goes through CoreDNS, so CoreDNS easily becomes a bottleneck, and when it has problems the impact is large: the whole cluster may become unusable. A vivid example is the incident two months ago, when the well-known CDN company Akamai had a DNS failure that made many websites such as Airbnb inaccessible for about an hour.

There are three core scenarios for DNS resolution in a Kubernetes cluster:

1. Calling an external API gateway

2. Calling cloud services, which are generally accessed over the public network

3. Calling external middleware

Here is a summary of common issues with CoreDNS; you can check whether your own cluster has any of them:

1. Configuration issues (the ndots problem). ndots is a number: if a domain name contains fewer dots than ndots, the resolver queries the domains in the search list first. This results in multiple queries and can have a significant impact on performance (see the sketch after this list).

2. Because all DNS resolution in Kubernetes goes through CoreDNS, it very easily becomes a performance bottleneck. Some have estimated that once QPS reaches roughly 5,000-8,000 you should start worrying about performance, especially for workloads that generate heavy traffic to external Redis and MySQL.

3. Stability issues in older versions of CoreDNS are also a concern.

4. Some languages handle connection pooling poorly. PHP, for example, does not support connection pools well, so every request triggers a DNS resolution and a new connection. This phenomenon is also common.
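To illustrate the ndots issue from item 1, here is a minimal sketch that parses /etc/resolv.conf and lists the queries a resolver would attempt for a given name. The expansion logic is a simplification of typical glibc/musl behavior; in a default Kubernetes Pod ndots is 5, so an external name such as redis.example.com (two dots) is first tried against every search domain, producing several extra queries.

```python
# Minimal sketch: show how ndots and the search list multiply DNS queries.
# Simplified resolver behavior; the example name is illustrative.
def parse_resolv_conf(path: str = "/etc/resolv.conf"):
    search, ndots = [], 1
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts or parts[0].startswith("#"):
                continue
            if parts[0] == "search":
                search = parts[1:]
            elif parts[0] == "options":
                for opt in parts[1:]:
                    if opt.startswith("ndots:"):
                        ndots = int(opt.split(":", 1)[1])
    return search, ndots

def queries_attempted(name: str, search: list, ndots: int) -> list:
    # Names with fewer dots than ndots go through search-list expansion first.
    if name.endswith(".") or name.count(".") >= ndots:
        return [name]
    return [f"{name}.{domain}" for domain in search] + [name]

if __name__ == "__main__":
    search, ndots = parse_resolv_conf()
    print(queries_attempted("redis.example.com", search, ndots))
```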

Next, let's look at where problems can occur around CoreDNS in Kubernetes. First, there may be network issues between the application and CoreDNS. Second, CoreDNS itself may have problems, such as returning error codes like SERVFAIL or REFUSED, or even returning wrong answers because of a misconfigured Corefile. Third, when CoreDNS communicates with external DNS, the network may be interrupted or slow. Finally, the external DNS itself may be unavailable.

These issues can be investigated with the following steps:

First, from the client side, look at the request content and the return code. If an error code is returned, the problem is on the server side. If resolution is slow, look at the timing waterfall to see which stage the time is spent in. (A client-side check is sketched after these steps.)

Second, check whether the network is healthy: traffic, retransmission, packet loss, and RTT.

Third, check the server: look at metrics such as traffic, errors, latency, and saturation, and then at resource metrics such as CPU, memory, and disk, to determine whether there is a problem.

Fourth, look at the external DNS. Likewise, we can locate problems through request Traces, return codes, network traffic, retransmissions, and other metrics.
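As a sketch of the client-side check in the first step, the following uses the third-party dnspython package (not part of the product) to time a lookup against a specific DNS server and report the failure mode. The service name and the CoreDNS ClusterIP are placeholders.

```python
# Minimal sketch: time a DNS lookup against a chosen server and classify the
# outcome (success, NXDOMAIN, timeout, or server failure). Requires dnspython 2.x.
import time
from typing import Optional

import dns.exception
import dns.resolver

def check_dns(name: str, nameserver: Optional[str] = None) -> None:
    resolver = dns.resolver.Resolver()
    if nameserver:
        resolver.nameservers = [nameserver]  # e.g. the CoreDNS ClusterIP
    start = time.monotonic()
    try:
        answer = resolver.resolve(name, "A")
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{name}: {[r.address for r in answer]} in {elapsed_ms:.1f} ms")
    except dns.resolver.NXDOMAIN:
        print(f"{name}: NXDOMAIN -- the domain does not exist; check the name and upstream DNS")
    except dns.exception.Timeout:
        print(f"{name}: timeout -- check the network path to the DNS server")
    except dns.resolver.NoNameservers as exc:
        print(f"{name}: all nameservers failed (e.g. SERVFAIL/REFUSED): {exc}")

if __name__ == "__main__":
    check_dns("my-service.default.svc.cluster.local", nameserver="10.96.0.10")
```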

Now let's look at the topology. A red edge indicates an abnormal DNS resolution call; click on it to see the golden signals for that call. Click "view list" and a details page pops up showing the request details: the requested domain name went through three stages, sending, waiting, and downloading, and those numbers look normal. Then we click through to the response and find that it says the domain name does not exist. At this point we can investigate further whether the external DNS has a problem; the steps are the same, so I will show them in the demo later rather than expanding here.

(3) Full-link stress testing

The third typical scenario is full-link stress testing. During major promotion events, peak traffic is several times the usual level, so to guarantee stability we need a series of stress tests to verify system capacity, evaluate stability, plan capacity, and identify bottlenecks. The process generally has several steps: warm up first and verify that the links are healthy; gradually increase traffic until it reaches the expected peak, which tells us the maximum TPS the system can sustain; then keep increasing traffic beyond that point. At this stage the main thing to watch is whether services rate-limit properly, because the maximum TPS has already been found and the extra pressure is destructive traffic. So what should we focus on during this process?

First, for a multi-language, multi-protocol microservice architecture, with Java, Golang, and Python applications and application-layer protocols such as RPC, MySQL, Redis, and Kafka, we need golden signals for every language and protocol in order to verify system capacity. For bottlenecks and capacity planning, we need resource saturation metrics at each traffic level to decide whether to scale out: at every traffic step, check the USE metrics and adjust capacity accordingly, optimizing gradually. For a complex architecture, we need a global map to sort out upstream and downstream dependencies and the full-link architecture, and to judge the blast radius. For example, CheckoutService is a key service here; if it has a problem, the impact will be large.
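The ramp-up itself can be sketched roughly as follows: step up the concurrency, and at each step record latency percentiles and the error rate, watching for the point where errors or saturation appear. The target URL, step sizes, and request counts below are illustrative; a real full-link stress test would use a dedicated load platform rather than this toy script.

```python
# Minimal sketch: a stepped load ramp that reports P50/P95/P99 latency and
# error rate at each concurrency level. TARGET and the steps are placeholders.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET = "http://checkout.example.com/health"  # hypothetical endpoint

def one_request():
    start = time.monotonic()
    try:
        ok = requests.get(TARGET, timeout=5).status_code < 500
    except requests.RequestException:
        ok = False
    return (time.monotonic() - start) * 1000, ok

def run_step(concurrency: int, requests_per_step: int = 200) -> None:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(requests_per_step)))
    latencies = sorted(ms for ms, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    cuts = statistics.quantiles(latencies, n=100)  # cuts[49]=P50, cuts[94]=P95, cuts[98]=P99
    print(f"c={concurrency:3d}  P50={cuts[49]:.0f}ms  P95={cuts[94]:.0f}ms  "
          f"P99={cuts[98]:.0f}ms  errors={errors/len(results):.1%}")

if __name__ == "__main__":
    for concurrency in (5, 10, 20, 40, 80):  # gradually increase the load
        run_step(concurrency)
```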

First, golden signals for every language and protocol; by opening the list view we can drill into the details of individual calls.

Second, click into node details to drill down into resource metrics such as CPU and memory usage.

Third, the topology map reflects the shape of the whole architecture. With this global view we can identify which services are likely to become bottlenecks, how large the blast radius would be, and whether high-availability safeguards are needed.

(4) Accessing external MySQL

The fourth scenario is accessing an external MySQL database. Let's first look at the common issues when accessing external MySQL:

1. Slow queries. A slow query shows up as a high latency metric. When that happens, go to the Trace to see what the request actually is: which table and which fields are being queried, and then check whether the query volume is too large, the table is too big, or an index is missing (a quick EXPLAIN check is sketched after this list).

2. Oversized query statements. Overly large statements cause high transmission times, and a slight network jitter can lead to failures and retries as well as bandwidth pressure. This is usually caused by batch updates and inserts, and when it happens the latency metric spikes. In this case we can pick some Traces with higher RT and check how the statement is written and whether it is simply too long.

3. Error codes in responses. For example, if a table does not exist, parsing out the error code is very helpful; looking further into the details and the statement makes it easier to locate the root cause.

4. Network issues, which we have already discussed at length; we usually combine the latency metrics with RTT, retransmission, and packet loss to judge whether the network has a problem.
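For the slow-query case in item 1, once the statement is visible in the Trace, a quick follow-up is to run EXPLAIN against it and check whether an index is used. Here is a minimal sketch using the third-party PyMySQL package; the connection parameters and the statement are placeholders.

```python
# Minimal sketch: run EXPLAIN on a statement seen in a high-RT Trace and
# inspect whether an index is used. Connection details are placeholders.
import pymysql

def explain(statement: str) -> None:
    conn = pymysql.connect(
        host="mysql.example.com", user="app", password="***", database="shop"
    )
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("EXPLAIN " + statement)
            for row in cur.fetchall():
                # type=ALL usually means a full table scan; 'key' shows the index used.
                print(row["table"], row["type"], row["key"], row["rows"])
    finally:
        conn.close()

if __name__ == "__main__":
    explain("SELECT * FROM orders WHERE customer_id = 42")  # illustrative statement
```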

Next, look at the topology diagram. The highlighted application in the middle depends on an external MySQL service. Click on the topology edge to view the golden signals, and click "view list" to see the details of requests, responses, and more. We can also look at the network performance metrics: this table aggregates the network data in the current topology by source and target, showing the number of requests, number of errors, average response time, socket retransmissions, and socket RTT, and you can click the arrows in the header to sort by each column.

(5) Multi-tenant architecture

The fifth typical scenario is a multi-tenant architecture. Multi-tenancy means different tenants, workloads, or teams share one cluster, usually with one namespace per tenant, while resources remain logically or physically isolated so that tenants do not affect or interfere with each other. Common cases include: internal enterprise users, where one team corresponds to one tenant, the network is unrestricted within a namespace, and network policies control traffic between namespaces; and the multi-tenant architecture of SaaS providers, where each customer corresponds to a namespace and the tenants and the platform live in different namespaces. Although Kubernetes namespaces make multi-tenant architectures convenient, they also bring observability challenges. First, the sheer number of namespaces makes finding information cumbersome and increases the cost of management and understanding. Second, tenants require traffic isolation, and with many namespaces it is often hard to detect abnormal traffic accurately and comprehensively. Third, Traces need to support multiple protocols and languages: I once worked with a customer who had more than 400 namespaces in a single cluster, which was very painful to manage, and whose applications spoke multiple protocols and languages, so supporting Trace would have required modifying them one by one.

This is the cluster homepage of our product. Kubernetes entities are grouped by namespace, and search is supported so that you can quickly locate what you want to look at. The bubble chart shows the number of entities in each namespace and the number of entities with anomalies; for example, the three namespaces in the box contain Pods with anomalies, and you can click in to examine them further. Below that, the homepage shows a performance overview sorted by golden signals, a Top view for this scenario, so you can quickly see which namespaces have anomalies.
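The per-namespace anomaly overview can be approximated outside the product with the official Kubernetes Python client: count the Pods in each namespace whose phase is not Running or Succeeded. This is a deliberately simplified notion of "abnormal" and assumes a working kubeconfig context.

```python
# Minimal sketch: count not-Running/not-Succeeded Pods per namespace to spot
# which namespaces deserve a closer look. Uses the official kubernetes client.
from collections import Counter

from kubernetes import client, config

def abnormal_pods_by_namespace() -> Counter:
    config.load_kube_config()  # use config.load_incluster_config() inside a Pod
    v1 = client.CoreV1Api()
    abnormal = Counter()
    for pod in v1.list_pod_for_all_namespaces().items:
        if pod.status.phase not in ("Running", "Succeeded"):
            abnormal[pod.metadata.namespace] += 1
    return abnormal

if __name__ == "__main__":
    for namespace, count in abnormal_pods_by_namespace().most_common():
        print(f"{namespace}: {count} abnormal pod(s)")
```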

In the topology map, if there are many namespaces, you can use the filter to view only the namespaces you care about, or locate one quickly through search. Because nodes are grouped by namespace, traffic between namespaces is shown as edges between the namespace groups, which makes it easy to see where a namespace's traffic comes from and whether any of it is abnormal.

The scenarios above can be summarized as follows:

1. Network monitoring: how to analyze service errors, slowness, and interruptions caused by the network, and how to analyze the impact of network problems

2. Service monitoring: how to determine whether a service is healthy through golden signals, and how to view details through Traces that support multiple protocols

3. Middleware and infrastructure monitoring: how to use golden signals and Traces to analyze anomalies in middleware and infrastructure, and how to quickly determine whether a problem lies in the network, in the service itself, or in a downstream service

4. Architecture awareness: how to perceive the entire architecture through topology, sort out upstream/downstream and internal/external dependencies, and keep the overall picture under control; how to ensure the architecture is sufficiently observable and stable; and how to identify system bottlenecks and the blast radius through topology analysis

Common cases drawn from these scenarios include: availability testing and health checks for networks and services; observability guarantees for middleware and architecture upgrades; verification of new business launches; service performance optimization; middleware performance monitoring; solution selection; full-link stress testing; and more.

Product Value

Following the introduction above, we can summarize the value of Kubernetes monitoring as follows:

1. Collect service metrics and Trace data in a multi-protocol, multi-language, non-invasive way, minimizing the cost of adoption while providing comprehensive coverage of metrics and Traces;

2. With these metrics and Traces, analyze services and workloads scientifically and drill down into them;

3. Associate these metrics and Traces into a topology diagram, enabling architecture awareness, upstream/downstream analysis, and context correlation on a single large graph, so that the architecture can be fully understood, potential performance bottlenecks evaluated, and further architecture optimization supported;

4. Provide a simple way to configure alerts, so that experience can be captured as alerts and detection becomes proactive.
