Best Practices for Kubernetes Monitoring and Analysis

In recent years, Kubernetes has become the container orchestration platform of choice for many companies' cloud-native transformation, and more and more development and operations work is carried out around it. Ensuring the stability and availability of Kubernetes is the most basic requirement, and the key to meeting it is monitoring the cluster effectively so that the entire cluster has good observability. This article introduces how to carry out comprehensive monitoring and analysis on Kubernetes.

Kubernetes monitoring can be divided into four layers, from bottom to top: infrastructure monitoring, ServiceMesh monitoring, access layer monitoring, and business monitoring. K8s events can also reveal some of the status of business containers, which provides a certain degree of business monitoring. The higher the layer, the closer it is to the business, and the better its indicators reflect whether the business is healthy. For example, the Ingress logs of the access layer directly show key indicators such as the success rate and latency of the current service, while monitoring of business logs is more targeted still, so its monitoring value is higher.

In practice, monitoring is usually rolled out from the bottom up. The lower layers are relatively fixed: SLS provides standard monitoring templates that can be used directly, so deployment complexity is low. Business monitoring at the upper layers generally depends on the log and monitoring data that the business side outputs, and because data formats differ across companies and technology stacks, the implementer has to do a lot of customization.

The monitoring architecture above involves many data sources and data formats. The sources include hardware, the operating system, Kubernetes system components, ServiceMesh, Ingress, and business Pods; the formats include logs (Logging), monitoring indicators (Metrics), and distributed tracing (Tracing) data. Alibaba Cloud SLS can support such a complete monitoring system: it provides a storage and query engine for Logging, Metrics, and Tracing data, supports a variety of analysis, visualization, and alerting methods, and exposes all functions through APIs, so it is highly customizable. Based on the DevOps data center provided by SLS, you can quickly build a Kubernetes monitoring solution suited to your company.

Currently, SLS provides monitoring templates for the basic monitoring, ServiceMesh, and access layers of Kubernetes, and a monitoring solution can be deployed quickly with these templates. The sections below introduce how to deploy them.

Basic Indicator Monitoring - Prometheus

Kubernetes was the first project to graduate from CNCF and remains its most popular one; Prometheus was the second to graduate and is the most popular CNCF project after Kubernetes. It is no exaggeration to say that Prometheus has become the de facto standard for monitoring in the cloud-native field. If the first step toward cloud native is having a Kubernetes environment, then Prometheus is the first step in cloud-native monitoring.

Deploying Prometheus on Kubernetes is straightforward: you only need to deploy the Prometheus Operator. The Alibaba Cloud Kubernetes application market has a built-in Prometheus Operator that you can install directly. For the specific procedure, refer to "Using Prometheus to collect Kubernetes monitoring data". The overall steps are:

* Create a Namespace named monitoring

* Create a Secret in the monitoring Namespace and fill in an AccessKey (AK) that has only SLS permissions

* Install the Prometheus Operator from the Container Service Kubernetes application market and modify the RemoteWrite parameters

* Configure Grafana to connect to SLS for visualization
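
The first three steps can be sketched as plain Kubernetes manifests built in Python. This is a minimal sketch under assumptions: the Secret name `sls-ak`, its key names, and the remote-write URL are hypothetical placeholders and do not come from the Alibaba Cloud documentation. Only the shapes of the Namespace and Secret manifests, and the secret-reference form of `basicAuth` in a Prometheus Operator `remoteWrite` entry, are standard.

```python
import base64
import json


def namespace_manifest(name="monitoring"):
    """Step 1: a Namespace manifest for the monitoring components."""
    return {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": name}}


def secret_manifest(access_key_id, access_key_secret,
                    name="sls-ak", namespace="monitoring"):
    """Step 2: a Secret holding an AccessKey that only has SLS permissions.

    The Secret name and key names are hypothetical. Values under `data`
    must be base64-encoded.
    """
    def encode(value):
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            "username": encode(access_key_id),
            "password": encode(access_key_secret),
        },
    }


def remote_write_fragment(url, secret_name="sls-ak"):
    """Step 3: a remoteWrite entry that reads its credentials from the Secret."""
    return {
        "url": url,  # placeholder: take the real endpoint from the SLS documentation
        "basicAuth": {
            "username": {"name": secret_name, "key": "username"},
            "password": {"name": secret_name, "key": "password"},
        },
    }


if __name__ == "__main__":
    docs = [
        namespace_manifest(),
        secret_manifest("example-id", "example-secret"),
        {"remoteWrite": [remote_write_fragment("https://sls.example.invalid/write")]},
    ]
    print(json.dumps(docs, indent=2))
```

Referencing the Secret from `remoteWrite` instead of writing the AccessKey into the Operator parameters keeps the credential out of the chart values.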

Basic Event Monitoring - Kubernetes Event Center

To give users a better understanding of the internal state of Kubernetes, Kubernetes introduces an event (Events) system. When Kubernetes resources change, the changes are recorded in the APIServer as events, which can be viewed through the API or kubectl commands.

A Kubernetes event includes the time of occurrence, the reporting component, a level (Normal or Warning), a type, and detailed information. Through events we can follow the entire life cycle of an application (deployment, scheduling, running, and stopping) and also learn about exceptions occurring in the system. The event types that a component may trigger are defined in that component's source code, for example in the event source code of kubelet.
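
To make the event fields concrete, here is a small sketch that summarizes Warning events from JSON shaped like the output of `kubectl get events -A -o json`. The sample payload is made up for illustration; the field names (`type`, `reason`, `message`, `involvedObject`) are those of the core Event resource.

```python
import json
from collections import Counter

# Made-up sample shaped like the output of `kubectl get events -A -o json`.
SAMPLE = """
{"items": [
  {"type": "Normal", "reason": "Scheduled",
   "message": "Successfully assigned default/web-1 to node-a",
   "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-1"}},
  {"type": "Warning", "reason": "BackOff",
   "message": "Back-off restarting failed container",
   "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-2"}},
  {"type": "Warning", "reason": "FailedScheduling",
   "message": "0/3 nodes are available",
   "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-3"}},
  {"type": "Warning", "reason": "BackOff",
   "message": "Back-off restarting failed container",
   "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-2"}}
]}
"""


def warning_summary(events_json):
    """Count Warning events by reason and list the affected objects."""
    items = json.loads(events_json)["items"]
    warnings = [e for e in items if e.get("type") == "Warning"]
    by_reason = Counter(e["reason"] for e in warnings)
    affected = sorted({
        "{kind}/{namespace}/{name}".format(**e["involvedObject"])
        for e in warnings
    })
    return by_reason, affected


if __name__ == "__main__":
    by_reason, affected = warning_summary(SAMPLE)
    for reason, count in by_reason.most_common():
        print(reason, count)
    print(affected)
```

Grouping by `reason` is usually the first cut when triaging, because a burst of one reason on one object tends to point at a single root cause.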

To make Kubernetes events easier to use, Alibaba Cloud Container Service Kubernetes and Log Service (SLS) jointly launched the Kubernetes Event Center. It collects Kubernetes events into Log Service in real time and packages the event monitoring and alerting indicators that Alibaba engineers have distilled from years of Kubernetes operations, so that this accumulated experience is available out of the box. For a related article, refer to "Kubernetes Observability: All-round Event Monitoring".

Deploying the event center is simple. The option is selected by default when Alibaba Cloud Kubernetes is activated, and the event center is created automatically after activation. If the option was not selected, you can install ack-node-problem-detector from the Alibaba Cloud Kubernetes application market, and the event center is enabled automatically after installation. For details, refer to "Create and Use Kubernetes Event Center".

Access Layer Monitoring - Ingress Access Log Monitoring and Analysis

In K8s, components are exposed externally through Services; common approaches include NodePort, LoadBalancer, and Ingress. Ingress mainly provides HTTP-layer (layer 7) routing, which has many advantages over TCP (layer 4) load balancing: more flexible routing rules, support for canary, blue-green, and A/B test releases, SSL support, logging, monitoring, and custom extensions. It is currently the mainstream way to expose HTTP/HTTPS services in K8s.

An Ingress in K8s is only a declaration of an API resource. To make it take effect, you need to install a corresponding Ingress Controller, which takes over the Ingress definitions and forwards traffic to the corresponding Services. There are many Ingress Controller implementations (for details, refer to the official Ingress Controller documentation); the more popular ones are Nginx, Traefik, Istio, and Kong, and the most widely adopted one in China is the Nginx Ingress Controller.
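
As an illustration of Ingress being only a declarative resource, the sketch below builds a minimal `networking.k8s.io/v1` Ingress manifest in Python. The host, Service name, and the `nginx` class name are hypothetical; the controller that actually routes the traffic still has to be installed separately.

```python
import json


def ingress_manifest(name, host, service_name, service_port,
                     ingress_class="nginx", namespace="default"):
    """Build a minimal Ingress that routes all paths on `host` to one Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects which installed Ingress Controller handles this resource.
            "ingressClassName": ingress_class,
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {
                        "name": service_name,
                        "port": {"number": service_port},
                    }},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    manifest = ingress_manifest("web", "web.example.com", "web-svc", 80)
    print(json.dumps(manifest, indent=2))
```

Nothing in this manifest moves traffic by itself; it only records the desired routing, which the controller named by `ingressClassName` then implements.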

Building an analysis and monitoring solution for Ingress logs yourself requires several modules (collection agent, data queue, indexing, visualization, and alerting), which is a huge amount of work. To lower the barrier for users, Alibaba Cloud Container Service and Log Service have integrated Ingress logging (see the official documentation): applying a single yaml resource deploys a complete Ingress log solution covering log collection, analysis, and visualization.

Deploying the Ingress monitoring solution is simple. It is supported by default when Alibaba Cloud Kubernetes is activated, and the Ingress solution is created automatically after activation; if the option was not selected, you only need to apply a yaml according to the documentation. Ingress monitoring provides second-level monitoring information across many dimensions, including PV, UV, geographical distribution, success rate, average latency, and P99/P9999 latency. It also supports blue-green version comparison, which makes it easy to compare the key indicators of the new and old versions during a grayscale release.
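
The indicators listed above can be derived from parsed access-log records. The following is a sketch of the arithmetic only, not of how SLS computes them: the record fields (`client_ip`, `status`, `request_time`) are assumed names, "success" is assumed to mean any status below 500, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def ingress_metrics(records):
    """Summarize parsed Ingress access-log records (the list must be non-empty)."""
    latencies = [r["request_time"] for r in records]
    ok = sum(1 for r in records if r["status"] < 500)
    return {
        "pv": len(records),                            # total requests
        "uv": len({r["client_ip"] for r in records}),  # distinct client IPs
        "success_rate": ok / len(records),
        "avg_latency": sum(latencies) / len(latencies),
        "p99_latency": percentile(latencies, 99),
    }


if __name__ == "__main__":
    sample = [
        {"client_ip": "10.0.0.1", "status": 200, "request_time": 0.010},
        {"client_ip": "10.0.0.1", "status": 200, "request_time": 0.020},
        {"client_ip": "10.0.0.2", "status": 502, "request_time": 1.500},
        {"client_ip": "10.0.0.3", "status": 404, "request_time": 0.030},
    ]
    print(ingress_metrics(sample))
```

The sample shows why tail percentiles are reported alongside the average: one slow request dominates P99 while most requests complete in tens of milliseconds.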

ServiceMesh Monitoring

More and more enterprises are choosing ServiceMesh, and Istio has gradually become the mainstream implementation. SLS already supports direct integration with Alibaba Cloud Service Mesh (ASM). As with the Ingress solution, installation can be completed by selecting the option in the console or by deploying a yaml manually. For the detailed procedure, refer to "Using Log Service to Collect Data Plane AccessLog".

Business Monitoring

The best way to monitor business workloads on Kubernetes is log-based analysis and monitoring. Log collection in Kubernetes is more complicated than in traditional environments: it has to deal with dynamic workloads, many collection targets, and many log formats, and mainstream collection software struggles to run stably under these conditions. With Logtail, provided by SLS, Kubernetes logs can be collected stably. Logtail supports a CRD-based Operator extension mode and is convenient to use: you only need to deploy a yaml that defines the data source and the target storage, and it supports stdout, files, Host, Journal, and other sources. For the detailed advantages and features, refer to "Tackling Pain Points Directly and Explaining K8s Log Collection Best Practices".

The most common collection methods in Kubernetes are stdout and file collection, both of which can be configured through CRDs. For details, refer to:

"Installing Kubernetes Log Collection Components"
"Kubernetes CRD collection standard output log"
"Kubernetes CRD Collection File Log"

After data collection is complete, SLS supports various ways to view, analyze, visualize, and monitor logs. The recommended functions are:

* Use log query, LiveTail, context, log clustering, and other functions when troubleshooting

* Configure visual reports to display user business indicators

* Use the keyword alarm function to raise real-time alarms on Errors, Exceptions, and similar entries that appear in the logs

* Use the business indicator alarm function to raise real-time alarms on key indicators such as business traffic, latency, and error rate
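
The two alarm types above can be sketched as simple checks over one window of logs. This illustrates the logic only and is not the SLS alerting API: the keyword pattern, the record fields (`status`, `latency_ms`), and the thresholds are all hypothetical.

```python
import re

# Hypothetical keyword pattern; tune it to the application's log format.
KEYWORDS = re.compile(r"ERROR|FATAL|Exception")


def keyword_alarms(lines):
    """Return the log lines that contain an alarm keyword."""
    return [line for line in lines if KEYWORDS.search(line)]


def indicator_alarms(requests, max_error_rate=0.05, max_avg_latency_ms=500):
    """Check the business indicators of one time window against thresholds."""
    if not requests:
        return []
    error_rate = sum(1 for r in requests if r["status"] >= 500) / len(requests)
    avg_latency = sum(r["latency_ms"] for r in requests) / len(requests)
    alarms = []
    if error_rate > max_error_rate:
        alarms.append(
            "error rate {:.0%} exceeds {:.0%}".format(error_rate, max_error_rate))
    if avg_latency > max_avg_latency_ms:
        alarms.append(
            "average latency {:.0f} ms exceeds {} ms".format(
                avg_latency, max_avg_latency_ms))
    return alarms


if __name__ == "__main__":
    logs = [
        "INFO order created id=1",
        "ERROR payment failed id=2",
        "WARN java.lang.IllegalStateException: bad state",
    ]
    window = ([{"status": 200, "latency_ms": 100}] * 9
              + [{"status": 500, "latency_ms": 100}])
    print(keyword_alarms(logs))
    print(indicator_alarms(window))
```

Keyword alarms catch individual failures as they are logged, while indicator alarms catch degradation that no single log line reveals, so the two are usually configured together.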


Kubernetes provides powerful functions that greatly reduce the complexity of releasing, operating, and maintaining services. However, because the architecture gains an additional orchestration layer, monitoring has to be configured specifically for Kubernetes, and the best approach is to deploy a complete, bottom-up monitoring system. With the templates and functions provided by SLS, you can quickly build Kubernetes monitoring adapted to your own business scenarios.
