How should we monitor containers as they become more and more widely used?
Why do we need container monitoring?
When containers are used at scale, teams face highly dynamic containerized environments that require continuous monitoring, so establishing a monitoring system is essential for keeping operations stable and optimizing resource costs. Each container image may have many running instances, and because new images and versions are introduced quickly, faults can easily spread across containers, applications, and the architecture. It is therefore crucial to locate the root cause as soon as a problem occurs, before the fault spreads. Based on extensive practice, we believe the following components must be monitored when using containers:
• Host server;
• Container runtime;
• Orchestrator control plane;
• Middleware dependencies;
• Applications running inside containers.
With a complete monitoring system in place, a deeper understanding of metrics, logs, and traces lets teams not only see what is happening in the cluster, the container runtime, and the applications, but also back business decisions with data, such as when to scale instances, tasks, or Pods out or in, or when to change instance types. DevOps engineers can also make troubleshooting and resource management more efficient by adding automated alerts and related configuration. For example, they can actively monitor memory utilization and, when resource consumption approaches a set threshold, notify the operations team to add nodes before the available CPU and memory are exhausted. The benefits include:
• Early detection of issues to avoid system interruption;
• Analyze container health status across cloud environments;
• Identify clusters with excessive/insufficient allocation of available resources and adjust applications for better performance;
• Create intelligent alerts to improve alarm accuracy and avoid false alarms;
• Optimize with monitoring data to achieve optimal system performance and reduce operational costs.
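The threshold-based alerting described above can be illustrated with a minimal Python sketch. The `NodeUsage` type, the `nodes_needing_attention` function, and the 80% default threshold are hypothetical names and values chosen for illustration; they are not part of any product API mentioned in this article.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    """Hypothetical snapshot of one node's usage, as fractions of capacity."""
    name: str
    cpu: float     # 0.0 to 1.0
    memory: float  # 0.0 to 1.0


def nodes_needing_attention(nodes, threshold=0.8):
    """Return (node name, resource, usage) for each resource at or above the threshold."""
    alerts = []
    for node in nodes:
        for resource, usage in (("cpu", node.cpu), ("memory", node.memory)):
            if usage >= threshold:
                alerts.append((node.name, resource, usage))
    return alerts


if __name__ == "__main__":
    # Assumed sample data; a real system would read these values from its metrics store.
    sample = [NodeUsage("node-a", 0.55, 0.91), NodeUsage("node-b", 0.40, 0.35)]
    for name, resource, usage in nodes_needing_attention(sample):
        print(f"{name}: {resource} at {usage:.0%}, consider adding nodes")
```

Setting the threshold below full utilization is what gives the operations team time to add nodes before resources run out.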
In practice, however, the operations team may find these benefits unremarkable, since existing operations tools seem able to deliver them. But in container scenarios, if a suitable monitoring system is not built, two hard problems will have to be faced as the business grows:
1. Troubleshooting takes longer, and SLAs cannot be met.
It is difficult for the development and operations teams to understand what is running and how it is executing. Maintaining applications, meeting SLAs, and troubleshooting become exceptionally difficult.
2. Scalability is hindered, and elasticity cannot be achieved.
The ability to quickly scale applications or microservice instances on demand is an important requirement of containerized environments. The monitoring system is the only visual means of measuring demand and user experience. Scaling out too late degrades performance and user experience; scaling in too late wastes resources and money.
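As one illustration of how monitoring data can drive a scaling decision, the sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)). The article does not say which rule its product uses, so this is an assumption; the function name, the replica bounds, and the 10% tolerance are illustrative values.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Proportional scaling: grow or shrink replicas by the metric-to-target ratio."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band, keep the current size to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Without a reliable current metric, neither the scale-out nor the scale-in branch can be taken at the right time, which is the point the paragraph above makes.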
Therefore, as the problems and the value of container monitoring keep accumulating and surfacing, more and more operations teams are paying attention to building container monitoring systems. However, when actually putting container monitoring into practice, teams run into various unexpected problems.
One example is the tracking difficulty caused by the transient nature of containers. Containers are complex: a container holds not only the application code but also all the underlying services the application needs to run. As new deployments go into production and the code and underlying services change, containerized applications are updated frequently, which increases the likelihood of errors. Because containers are created and destroyed so quickly, tracking changes in a large, complex system is extremely difficult.
Another example is the monitoring difficulty caused by shared resources. Because containers share memory and CPU across one or more hosts, it is difficult to monitor resource consumption on the physical hosts, and equally difficult to obtain good indicators of container performance or application health.
Finally, traditional tools struggle to meet container monitoring needs. Traditional monitoring solutions often lack the metrics, tracing, and logging tools that virtualized environments require, especially those for container health and performance.
Therefore, taking the above benefits, problems, and difficulties into account, a container monitoring system should be designed along the following dimensions:
• Non-invasiveness: Is integrating the monitoring SDK or probes into business code invasive, and does it affect business stability;
• Completeness: Can the whole application be observed, across both the business and the technical platform;
• Multiple sources: Can relevant metrics and logs be collected from different data sources for aggregated display, analysis, and alerting;
• Convenience: Can events and logs be correlated to detect anomalies, troubleshoot proactively, and reduce losses, and is configuring the related alerting policies convenient.
While clarifying business requirements and designing the monitoring system, the operations team has many open-source tools to choose from, but it also needs to assess the potential business and project risks. These include:
• Unknown risks to business stability: can the monitoring service be "traceless", that is, does the monitoring process itself affect the normal operation of the system?
• Unpredictable manpower and time investment for open-source or self-built solutions: related components and resources have to be configured or built in-house, without corresponding support and services. As the business keeps changing, will even more manpower and time be consumed? And can the open-source community or the enterprise's own team quickly resolve performance problems in large-scale scenarios?
Alibaba Cloud Kubernetes monitoring: making container cluster monitoring more intuitive and simple
Based on these insights and extensive practical experience, Alibaba Cloud has launched the Kubernetes monitoring service. Alibaba Cloud Kubernetes Monitoring is a one-stop observability product developed for Kubernetes clusters. Building on the metrics, application traces, logs, and events of a Kubernetes cluster, it aims to give IT development and operations staff a complete observability solution. It has the following six major features:
• Non-invasive to code: Using bypass technology, network performance data is obtained without instrumenting the code.
• Multi-language support: Network protocols are parsed at the kernel layer, so any language and framework is supported.
• Low overhead, high performance: Built on eBPF, network performance data is collected with extremely low overhead.
• Automatic resource topology: Network topology and resource topology show how related resources are connected.
• Multidimensional data presentation: Supports several types of observability data (metrics, traces, logs, and events).
• Closed loop of correlations: Fully correlates observability data across the architecture layer, application layer, container runtime layer, container control layer, and basic resource layer.
At the same time, compared with open-source container monitoring, Alibaba Cloud Kubernetes monitoring offers differentiated value that is closer to business scenarios:
• Unlimited data volume: Metrics, traces, logs, and other data are stored independently, and cloud storage provides low-cost, high-capacity storage.
• Efficient resource correlation: By monitoring network requests, a complete network topology is built, making it easy to view service dependencies and improving operations efficiency. Beyond the network topology, the 3D topology feature shows the network topology and the resource topology at the same time, speeding up problem localization.
• Diverse data combination: Metrics, traces, logs, and other data can be displayed visually and combined freely to uncover opportunities for operations optimization.
• A complete monitoring system: Together with the other sub-products of Application Real-Time Monitoring Service (ARMS), Kubernetes monitoring forms a complete monitoring system. Application monitoring focuses on the application language runtime, the application framework, and business code; Kubernetes monitoring focuses on the container runtime, the container control layer, and the system calls of containerized applications. Both services serve the application but focus on different levels of it, so the two products complement each other. Prometheus is the infrastructure for collecting, storing, and querying metrics, and both application monitoring and Kubernetes monitoring rely on Prometheus for metric data.
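Since both services rely on Prometheus for metric data, a short sketch of how a metric could be queried may help. It builds a request URL for the instant-query endpoint of the open-source Prometheus HTTP API (`/api/v1/query`). The base URL below is a placeholder, the helper name is hypothetical, and whether the managed service exposes this endpoint at a given address is an assumption to check against the product documentation.

```python
from urllib.parse import urlencode

# Per-pod CPU usage rate over the last five minutes, based on the
# cAdvisor metric container_cpu_usage_seconds_total.
PROMQL = "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))"


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


if __name__ == "__main__":
    # Placeholder address; replace it with a reachable Prometheus endpoint.
    print(instant_query_url("http://prometheus.example:9090", PROMQL))
```

The resulting URL can be fetched with any HTTP client; the response is JSON containing one sample per pod.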
Based on these product features and differentiated value, Kubernetes monitoring can be applied in the following scenarios:
• Detect anomalies in nodes, services, and workloads using the default or custom inspection rules of Kubernetes monitoring. Kubernetes monitoring inspects nodes, services, and workloads for anomalies along three dimensions: performance, resources, and management. The results are shown in distinct colors for the normal, warning, and critical states, helping operations staff see the running status of nodes, services, and workloads at a glance.
• Use Kubernetes monitoring to locate the root cause of failed responses from services and workloads. Kubernetes monitoring analyzes network protocols to capture and store the details of failed requests, and the failed-request details associated with the failed-request metrics are used to locate the cause of failure.
• Use Kubernetes monitoring to locate the root cause of slow responses from services and workloads. By capturing metrics along the key paths of the network, Kubernetes monitoring tracks DNS resolution performance, TCP retransmission rate, network packet RTT, and similar metrics. These key-path metrics are used to identify the cause of slow responses and optimize the affected services.
• Use Kubernetes monitoring to explore the application architecture and discover unexpected network traffic. Kubernetes monitoring supports viewing the topology map built from global traffic and supports configuring static port identification for specific services. The intuitive, interactive topology map can be used to explore the application architecture and to verify whether traffic matches expectations and whether the architecture is reasonable.
• Use Kubernetes monitoring to identify uneven resource utilization across nodes and allocate node resources in advance, reducing operational risk to the business.
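To make the slow-response and inspection scenarios above more concrete, the following sketch derives a TCP retransmission rate from two snapshots of cumulative counters and maps it to the normal, warning, and critical states. The dictionary keys, function names, and the 1% and 5% thresholds are assumptions for illustration; the article does not specify how the product computes or classifies this metric.

```python
def retransmission_rate(prev, curr):
    """Fraction of TCP segments retransmitted between two counter snapshots."""
    sent = curr["out_segs"] - prev["out_segs"]
    retransmitted = curr["retrans_segs"] - prev["retrans_segs"]
    if sent <= 0:
        return 0.0  # no traffic (or a counter reset) during the interval
    return retransmitted / sent


def classify(rate, warning=0.01, critical=0.05):
    """Map a rate to an inspection state using assumed thresholds."""
    if rate >= critical:
        return "critical"
    if rate >= warning:
        return "warning"
    return "normal"


if __name__ == "__main__":
    # Assumed sample counters taken at the start and end of an interval.
    before = {"out_segs": 1000, "retrans_segs": 10}
    after = {"out_segs": 3000, "retrans_segs": 50}
    rate = retransmission_rate(before, after)
    print(f"retransmission rate {rate:.2%}: {classify(rate)}")
```

Working from counter deltas rather than absolute values keeps the rate tied to a specific time window, which is what makes it useful for spotting when a slowdown began.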