Observable Monitoring Solutions: SLS Full Stack Monitoring

This blog describes the importance of monitoring solutions and the entire Log Service (SLS) full stack monitoring process.

1. Preface

Since the advent of computers, monitoring has been a necessary and long-established part of every company's IT infrastructure. Over decades of development, IT technology and architecture have changed significantly, including the development model, system architecture, deployment model, and infrastructure. Today, the mainstream technologies are microservices, containerization, cloud computing, and DevOps.

With these architectural changes, the entire system has become more complex, and deployment models and operating environments have become more dynamic and uncertain. Development also involves more people and departments than before. Because of this complexity, the IT industry has reached a stage that requires more systematic observation and faster monitoring. Monitoring systems have changed as well and are evolving towards cloud-native design, data fusion, and intelligence.

2. Monitoring System Development History

The entire development process of IT monitoring can be divided into the following four stages:

  • Unix Era: With the popularity of Unix and Linux, IT systems in the true sense emerged. In the 1980s and 1990s, applications were usually deployed on a single machine, and the deployment process was very simple. To help locate problems in standalone applications, Unix exposed many metrics, such as CPU, memory, and I/O usage. To read these metrics quickly, Unix and Linux provided command-line tools, such as top, vmstat, and iostat. They also offered graphical tools for desktop users, which were the earliest use of line charts in IT monitoring. At this stage, people cared more about service availability and functionality than about performance and user experience.
  • Data Center Era: In the 1990s, more companies began building their own data centers, with anywhere from a few to hundreds of computers, and dedicated IT O&M specialists appeared. Later, Simple Network Management Protocol (SNMP) was developed to manage and monitor the status of every machine in the data center. The monitoring architecture was still mainly standalone: SNMP was used to collect each host's network and hardware information. Cross-host applications and web applications that provide external services also emerged. Some monitoring systems tracked network latency, but they could not yet measure the latency of actual user requests.
  • Distributed Era: In the early 21st century, the Internet became popular, and application scenarios changed and diversified. Single machines could no longer support the surging number of requests, so layered distributed architectures were gradually adopted. Monitoring systems became layered in the same way, with host monitoring, network monitoring, middleware monitoring, and application monitoring. Among them, application monitoring was a new category that focuses on application availability and on detecting and solving performance issues. The architecture of the monitoring system at this stage also became distributed and complex. The backend consists of multiple machines and modules, such as data processing, storage, and alerting, and each module may in turn be built on distributed stream processing and distributed databases.
  • Cloud-native Era: As cloud computing and containerization matured, many companies began to develop applications with microservices and containers and to deploy them on public or private clouds. Cloud-native environments are dynamic and highly virtualized, so traditional monitoring methods are no longer suitable. Instead, a monitoring system must connect to Kubernetes, microservices, and cloud resources. The purpose of monitoring is now to improve user experience and troubleshooting efficiency. In addition to collecting accurate monitoring information, a monitoring system must perform association analysis with other observable data (such as logs and traces) to quickly locate problems. At the same time, AI-enabled technologies are used to automate exception discovery, location, and repair.

3. Monitoring Solutions in the Cloud-native Era

In the cloud-native era, a monitoring solution must cover more ground and support fast operations. It must offer the following features:

  1. Wide coverage: Monitoring solutions must be able to support infrastructure, containers and Kubernetes, cloud vendors, middleware, and databases.
  2. Unified view: Data from every layer must be accessible through a unified entry point and view.
  3. Unified alarm: Alarms are an important part of monitoring. They must be managed in a unified manner and include advanced features, such as intelligent noise reduction, dynamic shift schedules, and alarm merging/routing to reduce management and usage costs.
  4. Intelligence: An enterprise's IT system involves a large number of components, so a monitoring system must be intelligent enough to surface the problems that require immediate attention. AIOps techniques, such as time-series anomaly detection, must be used to automatically detect abnormal curves and report alarms.
  5. Data fusion analysis: The monitoring solution must be able to conveniently perform association analysis with other observable data, such as traces, logs, and events, to quickly locate and solve problems.
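
To make the "intelligence" requirement above more concrete, here is a minimal sketch of time-series anomaly detection using a trailing-window z-score. This is an illustrative technique only, not the algorithm SLS uses; the function name, window size, and threshold are all assumptions made for the example.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return the indices of points that deviate from the mean of the
    preceding `window` points by more than `threshold` standard deviations.

    Hypothetical helper for illustration; not part of any SLS API.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: treat any change as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Example: CPU utilization samples with one obvious spike at index 7.
cpu_utilization = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(detect_anomalies(cpu_utilization))
```

A production system would typically also account for seasonality and trend, which a plain z-score ignores.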

4. SLS Full Stack Monitoring

As an observable data engine, Log Service (SLS) provides comprehensive collection, storage, and analysis of observable data: logs, metrics, distributed traces, and events. To help users quickly connect and monitor their business systems, SLS provides the full stack monitoring app, which collects all types of monitoring data into one instance for unified management and monitoring. Full stack monitoring builds on SLS capabilities for collecting, storing, analyzing, and visualizing monitoring data, as well as alerting and AIOps. The detailed features are as follows:

  • Real-time monitoring of various system components, such as hosts, Kubernetes, databases, and middleware.
  • One-click installation on ECS and Kubernetes, with graphical management of the monitoring configuration. You do not need to log on to the host to configure the monitoring items to collect.
  • Dozens of built-in reports that distill the experience of O&M staff, such as resource overview, water-level monitoring, hotspot analysis, and detailed metrics.
  • Custom analysis with multiple query syntaxes, such as PromQL and SQL-92.
  • AIOps metric inspection, which detects abnormal metrics by using machine learning.
  • Custom alert configuration, with alerts delivered directly to the message center, text messages, emails, voice calls, DingTalk, and custom Webhooks.
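
As an illustration of the alarm merging mentioned in Section 3, the sketch below groups repeated alerts from the same host and rule into one notification before it would be sent to a webhook. The alert fields (`host`, `rule`, `time`), the function name, and the payload shape are all hypothetical; this is not the SLS alerting format, and no request is actually sent.

```python
import json


def merge_alerts(alerts, window_seconds=300):
    """Merge alerts that share (host, rule) and fire within
    `window_seconds` of the first alert in their group.

    Hypothetical helper for illustration; field names are assumptions.
    """
    merged = []
    open_groups = {}  # (host, rule) -> group currently collecting alerts
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["host"], alert["rule"])
        group = open_groups.get(key)
        if group is None or alert["time"] - group["first_seen"] > window_seconds:
            # Start a new group when none is open or the window has expired.
            group = {
                "host": alert["host"],
                "rule": alert["rule"],
                "first_seen": alert["time"],
                "last_seen": alert["time"],
                "count": 0,
            }
            open_groups[key] = group
            merged.append(group)
        group["count"] += 1
        group["last_seen"] = alert["time"]
    return merged


raw_alerts = [
    {"host": "web-1", "rule": "cpu_high", "time": 0},
    {"host": "web-1", "rule": "cpu_high", "time": 60},
    {"host": "db-1", "rule": "disk_full", "time": 90},
    {"host": "web-1", "rule": "cpu_high", "time": 120},
    {"host": "web-1", "rule": "cpu_high", "time": 900},
]

# The merged list is what a custom webhook body might carry.
payload = json.dumps({"alerts": merge_alerts(raw_alerts)}, indent=2)
print(payload)
```

Here five raw alerts collapse into three notifications, which is the kind of noise reduction that keeps on-call staff from being flooded by one recurring problem.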

5. Full Stack Monitoring Feature Overview

5.1 Host Monitoring

Dashboard | Description
Resource overview | Real-time display of configuration and metric data of hosts in a visualized manner. The data includes the number of CPU cores, total disk space, average CPU utilization, and average memory usage.
Host list | Real-time display of each host's configuration data and metric data in a visualized manner. The data includes the number of CPU cores, memory size, CPU utilization, and memory usage.
Hotspot analysis | Real-time display of resource usage information of hotspot hosts in a visualized manner. The resources include CPUs and memory. The information includes the distribution of CPU utilization among hotspot hosts, distribution of memory usage among hotspot hosts, top CPU utilization, and top memory usage.
Standalone metrics-simplified | Real-time display of resource usage trends of a host in a visualized manner. The resources include CPUs and memory. The usage information includes CPU, disk space, and memory usage.
Standalone metrics-detailed | Real-time display of usage trends of host resources in different states in a visualized manner. The resources include CPUs and memory. CPU usage is broken down into Total, System, User, and IOWait. Memory usage is broken down into Total, Available, and Used.
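
The Total, System, User, and IOWait breakdown shown in the detailed dashboard can be derived from two samples of the aggregate `cpu` line in Linux's `/proc/stat`. The sketch below shows one way to do it; it is not how the SLS collector is implemented, the function names are hypothetical, and treating "Total" as all time that is neither idle nor iowait is an assumption, since tools differ on this definition.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_usage(before, after):
    """Return usage percentages between two samples of the counters."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    # Assumption: "total" means time spent neither idle nor waiting on I/O.
    busy = total - delta["idle"] - delta["iowait"]
    return {
        "total": 100.0 * busy / total,
        "user": 100.0 * delta["user"] / total,
        "system": 100.0 * delta["system"] / total,
        "iowait": 100.0 * delta["iowait"] / total,
    }


# Two hand-written samples; on a real host, read /proc/stat twice instead.
first = parse_cpu_line("cpu 1000 0 500 8000 500 0 0 0")
second = parse_cpu_line("cpu 1300 0 600 8500 600 0 0 0")
print(cpu_usage(first, second))
```

Because the counters are cumulative since boot, the percentages only make sense as a difference between two samples.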

5.2 Kubernetes Monitoring

Dashboard | Description
Resource overview | Display the resource usage in Kubernetes in a visualized manner in real time. The resources include Pod, Host, Service, and Deployment.
Water level monitoring | Display the resource usage information in Kubernetes in a visualized manner in real time. The information includes the number of running Pods, total number of CPUs, and file system usage.
Runtime monitoring | Display information about running resources in Kubernetes in a visualized manner in real time. The information includes the number of running Deployments and the number of running DaemonSets.
Core components monitoring | Display information about the core components in Kubernetes in a visualized manner in real time. The information includes the number of etcd objects and the queries per second (QPS) of etcd.
Node list | Display overall information about nodes, and the configuration data and metric data of each node, in a visualized manner in real time. The information includes the total number of nodes and the total number of running Pods.
Node metrics | Display the metric data of a node in a visualized manner in real time. The data includes the number of requested Pods and CPU utilization.
Pod tab | Display overall information about Pods, and the configuration data and metric data of each Pod, in a visualized manner in real time. The information includes the total number of Pods that can be requested.
Pod metrics | Display the metric data of Pods in real time. The data includes the basic information about Pods and the containers.
Deployment tab | Display the configuration data and metric data of each Deployment in a visualized manner in real time. The data includes the namespace and cluster to which a Deployment belongs.
Deployment metrics | Display the metric data of a Deployment in a visualized manner in real time. The data includes the CPU Limit usage and Memory Limit usage.
StatefulSet tab | Display the configuration data and metric data of each StatefulSet in a visualized manner in real time. The data includes the namespace and cluster to which a StatefulSet belongs.
StatefulSet metrics | Display the metric data of a StatefulSet in a visualized manner in real time. The data includes the CPU Limit usage and Memory Limit usage.
DaemonSet tab | Display the configuration data and metric data of each DaemonSet in a visualized manner in real time. The data includes the namespace and cluster to which a DaemonSet belongs.
DaemonSet metrics | Display the metric data of a DaemonSet in a visualized manner in real time. The data includes the CPU Limit usage and Memory Limit usage.
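
"CPU Limit usage" and "Memory Limit usage" are naturally read as the ratio of what a workload currently consumes to the limit configured for it. The following sketch computes that ratio from Kubernetes-style quantity strings such as `500m` and `1Gi`. The helper names are hypothetical, only a subset of quantity suffixes is handled, and the exact formula the SLS dashboards use is an assumption.

```python
# Subset of Kubernetes quantity suffixes (binary and decimal) for memory.
_MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4,
}


def parse_cpu(quantity):
    """Convert a CPU quantity to cores: '500m' -> 0.5, '2' -> 2.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity to bytes: '256Mi' -> 268435456.0."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[:-len(suffix)]) * factor
    return float(quantity)  # plain number of bytes


def limit_usage_percent(used, limit):
    """Return usage as a percentage of the configured limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit


# Example values for one container; real values would come from metrics.
print(limit_usage_percent(parse_cpu("250m"), parse_cpu("500m")))
print(limit_usage_percent(parse_memory("384Mi"), parse_memory("1Gi")))
```

A value close to 100 percent for CPU suggests throttling, while the same value for memory suggests the container is at risk of being killed for exceeding its limit.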

5.3 Database Monitoring

Dashboard | Description
MySQL monitoring | Display the metric data of the MySQL database in a visualized manner in real time. The data includes the startup time, number of Query operations, and number of connections.
Redis monitoring | Display the metric data of the Redis database in a visualized manner in real time. The data includes the number of cluster instances that are enabled, Redis runtime, and connected clients.
Elasticsearch monitoring | Display the metric data of Elasticsearch in a visualized manner in real time. The data includes the cluster health and Node.
ClickHouse monitoring | Display the metric data of ClickHouse databases in a visualized manner in real time. The data includes Query and Merge.
MongoDB monitoring | Display the metric data of MongoDB databases in a visualized manner in real time. The data includes Available Connections and Query Operations.

5.4 Middleware Monitoring

Dashboard | Description
JVM monitoring | Display the metric data of the JVM in a visualized manner in real time. The data includes the running time, total memory, heap memory, and CPU utilization.
Nginx monitoring | Display the metric data of Nginx in a visualized manner in real time. The data includes the number of processed connections and QPS.
Tomcat monitoring | Display the metric data of Tomcat in a visualized manner in real time. The data includes the running time, QPS, number of errors, and CPU utilization.
Kafka monitoring | Display the metric data of Kafka in a visualized manner in real time. The data includes the status of the controller, the total number of topics, and the number of messages per second.
NVIDIA GPU monitoring | Display the metric data of NVIDIA GPUs in a visualized manner in real time. The data includes GPU utilization and memory utilization.

6. Coming Soon

At this stage, full stack monitoring provides host monitoring, Kubernetes monitoring, database monitoring, and middleware monitoring. Further extensions, both to new areas and in more depth within existing ones, will be available soon. For example:

  1. Cloud resource monitoring, including monitoring of various cloud services, ranging from Alibaba Cloud to other cloud providers.
  2. More host monitoring features, such as process-level monitoring, kernel monitoring, and process/kernel profiling.
  3. More Kubernetes monitoring capabilities, such as performance, change, and service topology monitoring; diagnostics and plan monitoring for databases; and support for more types of middleware.
  4. Monitoring capabilities related to user experience and applications, such as synthetic monitoring (dial testing), frontend monitoring, and mobile monitoring.

References

  1. Full stack monitoring: https://sls.console.aliyun.com/lognext/app/monitor
  2. Monitoring data access: https://www.alibabacloud.com/help/en/doc-detail/354229.html
  3. Description: https://www.alibabacloud.com/help/en/doc-detail/364237.html
  4. Intelligent exception diagnosis: https://www.alibabacloud.com/help/en/doc-detail/356467.html
  5. Alert configuration: https://www.alibabacloud.com/help/en/doc-detail/209951.html
DavidZhang