How to Build an End-to-End Observable System

The past and present of observability

Observability and fault analysis are key concerns in system operations and maintenance (O&M). As systems have evolved in architecture, resource units, resource acquisition methods, and communication methods, both have faced significant challenges, and these challenges have in turn driven the development of O&M technologies. Before getting into the main content, let's look at the past and present of observability. Monitoring and observability have been developing for nearly 30 years.

In the late 1990s, as computing shifted from mainframes to desktop computers, client-server (C/S) architecture became popular, and people began to pay attention to network performance and host resources. To better monitor these C/S applications, the first generation of APM software was born. Because application architecture was still very simple at the time, operations and maintenance teams focused on network and host performance, and these tools were also referred to simply as monitoring tools.

By 2000, the internet had developed rapidly and the browser had become the new user interface. Applications evolved into a browser-based three-tier Browser-App-DB architecture. At the same time, Java became popular as the leading programming language for enterprise software, and its "write once, run anywhere" concept greatly improved development productivity. However, the Java Virtual Machine also hides the details of how code runs, making tuning and troubleshooting more difficult. Code-level tracing and diagnosis, along with database tuning, therefore became new concerns, giving birth to a new generation of monitoring tools: APM (Application Performance Monitoring).

After 2005, distributed applications became the first choice for many enterprises, and applications based on SOA and ESB became popular. At the same time, virtualization technology gradually took hold, and the traditional physical server faded into an invisible, intangible virtual resource. Third-party components such as message queues and caches also began to be used in production environments. In this environment, enterprises needed full-link tracing as well as monitoring of virtual resources and third-party components, and these needs defined the core capabilities of the new generation of APM software.

After 2010, as cloud native architecture began to be put into practice, application architecture gradually shifted from monolithic systems to microservices, with business logic expressed as calls and requests between microservices. At the same time, virtualization became more thorough, container management platforms were increasingly accepted by enterprises, third-party components gradually evolved into cloud services, and the entire application architecture became cloud native. Service call paths grew longer, making traffic flow harder to control and troubleshooting more difficult. A new capability, observability, was needed to continuously analyze full-stack observable data (metrics, logs, traces, and events) across the entire application lifecycle: development, testing, operation, and maintenance.

As this history shows, observability has become part of cloud native infrastructure. Observability capabilities have expanded from the operations and maintenance stage alone into the testing and development stages. Their purpose has likewise expanded from supporting the normal operation of the business to accelerating business innovation and enabling rapid iteration.

Monitoring, APM, and observability: similarities and differences

As the history above shows, the move from monitoring to APM to observability has been a continuous evolution. Next, we look at the specific relationship between the three. To explain it, we introduce a classic cognitive model: for anything in the world, we can ask two questions, whether we are aware of it ("know") and whether we understand it ("understand").

First, what we both know and understand is called a fact. In the context of this discussion, this corresponds to monitoring. For example, operations and maintenance work is designed from the start to monitor a server's CPU utilization; whether it is 80% or 90%, the value is an objective fact. This is what monitoring solves: knowing in advance what to monitor, defining and collecting the corresponding metrics, and building a monitoring system around them.

Next, there are things we know but don't understand. For example, monitoring shows CPU utilization reaching 90%, but why is it so high? Answering that is a verification process. With APM, application performance on the host can be collected and analyzed, revealing, for instance, that a high-latency logging framework on the application's call path caused the surge in CPU utilization on the host. This is the cause behind the high CPU utilization, found through application-layer analysis with APM.

Then, there are things we understand but don't know. Take the same high-CPU scenario: if we can predict, by learning from historical data and related events, that CPU utilization will surge at some point in the future, we can raise an early warning.

Finally, there are things we don't know and don't understand. Continuing the same example, suppose monitoring shows that CPU usage has spiked, and APM traces the cause to the application's logging framework. Further analysis of user access data for that period, however, reveals that requests from Apple devices in the Shanghai region have response times 10 times longer than other requests, and that these requests generate a massive volume of Info logs because of how the logging framework is configured, causing CPU surges on some machines. This is the observability process. Observability addresses what you did not know in advance (the performance problem for Apple device access from Shanghai) and did not understand (the logging framework misconfiguration that generates massive Info logs).
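The early-warning case described above (understood but not yet known) can be illustrated with a minimal sketch. It assumes CPU utilization is sampled at a fixed interval and grows roughly linearly, and it extrapolates a least-squares trend line to estimate when a threshold will be crossed. The function name, the 90% threshold, and the sample values are illustrative assumptions; a real system would likely use a more robust forecasting model and take related events into account.

```python
from typing import Optional, Sequence


def steps_until_threshold(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Fit a least-squares line to equally spaced samples and return how many
    sampling intervals after the last sample the line reaches the threshold.

    Returns None when there are too few samples or the trend is flat or
    falling, and 0.0 when the fitted line is already at or above the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - (n - 1))


# Hypothetical CPU utilization samples (percent), one per sampling interval.
cpu_history = [50, 55, 60, 65, 70]
remaining = steps_until_threshold(cpu_history, threshold=90)
if remaining is not None:
    print(f"Trend reaches 90% in about {remaining:.1f} intervals")
```

An alert raised from such a forecast fires before the threshold is actually crossed, which is the difference between early warning and ordinary threshold monitoring.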

To summarize, monitoring focuses on metrics, mostly at the infrastructure layer, such as machine and network performance metrics. Dashboards and alert rules are then built on these metrics to monitor events within a known range. Once monitoring has discovered a problem, APM uses application-level diagnostic tools such as traces, memory analysis, and thread analysis to locate the root cause of the abnormal metrics.

Observability is application-centric. By correlating and analyzing observable data sources such as logs, traces, metrics, and events, it identifies root causes more quickly and directly, and it provides an interface through which users can flexibly explore and analyze this data. Observability is also connected to cloud services, so that elastic scaling and high-availability capabilities can be applied immediately, allowing problems to be resolved and application services restored faster once an issue is found.

Key points for building an observable system

Observability brings enormous business value, but building an observable system also poses significant challenges. It is not only a matter of selecting tools or technologies; it is also an operations and maintenance philosophy. Building it involves three parts: collecting observable data, analyzing it, and delivering value from it.

Observable data collection

The observable data widely promoted in the industry rests on three pillars: logging, tracing, and metrics. All three share some common requirements that need attention.

1) Full stack coverage

Observable data, including metrics, traces, and events, needs to be collected from the infrastructure layer, the container layer, the cloud applications built on top of them, and user terminals.

2) Unified standards

The entire industry is promoting standardization, starting with metrics. Prometheus has become the consensus metrics standard of the cloud native era. Standards for trace data have gradually become mainstream with the adoption of OpenTracing and OpenTelemetry. Log data is less structured, which makes a data standard hard to establish, but open source newcomers such as Fluentd and Loki have emerged for log collection, storage, and analysis. Meanwhile, Grafana is increasingly established as the display standard for all kinds of observable data.

3) Data quality

Data quality is an important aspect that is easily overlooked. Data sources from different monitoring systems need defined data standards to ensure that analysis is accurate. In addition, a single event may produce a large number of duplicate metrics, alerts, and logs, so filtering, noise reduction, and aggregation are needed to extract the data that has analytical value. This is often where the gap between open source and commercial tools is relatively large. For example, when collecting an application's call traces, how deep should collection go? What is the trace sampling strategy? Can every trace be captured when an error or slowdown occurs? Can the sampling strategy be adjusted dynamically based on rules? Questions like these determine the quality of observable data collection.
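The sampling questions raised above can be made concrete with a minimal sketch of a rule-based sampling decision. It assumes the decision is made after a trace has finished (so its duration and error status are known), keeps every erroneous or slow trace, and keeps only a fraction of the rest. The class names, field names, and default values are hypothetical and do not correspond to any particular tracing product.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of one finished trace."""
    trace_id: str
    duration_ms: float
    has_error: bool


@dataclass
class SamplingPolicy:
    """Rules that can be replaced at runtime to adjust what is kept."""
    slow_threshold_ms: float = 1000.0
    base_sample_rate: float = 0.1  # fraction of normal traces to keep


def should_keep(trace: TraceSummary, policy: SamplingPolicy, rng: random.Random) -> bool:
    """Keep all error and slow traces; sample the remaining ones."""
    if trace.has_error:
        return True
    if trace.duration_ms >= policy.slow_threshold_ms:
        return True
    return rng.random() < policy.base_sample_rate


policy = SamplingPolicy()
rng = random.Random()
failed = TraceSummary("t-1", duration_ms=120.0, has_error=True)
print(should_keep(failed, policy, rng))  # error traces are always kept
```

Because the policy is plain data, swapping in a new `SamplingPolicy` (for example, a lower slow threshold during an incident) is one way to adjust sampling dynamically without changing the decision logic.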

Observable data analysis

1) Horizontal and vertical correlation

In current observable systems, the application is a very good entry point for analysis. First, applications are related to one another and can be linked through call chains: how microservices call each other, how applications call cloud services, and how they call third-party components can all be associated through traces. At the same time, an application can be mapped vertically to the container layer and the resource layer. With the application at the center, observable data is thus correlated globally, both horizontally and vertically. When a problem needs to be located, a unified analysis can be carried out from the application's perspective.

2) Domain knowledge

How can problems be discovered more quickly and accurately in the face of massive data? In addition to application-centric data correlation, domain knowledge for locating and analyzing problems is needed. For observability tools and products, the most important thing is to keep accumulating best-practice troubleshooting paths, methods for locating common problems, and root-cause decision paths, and to build that experience into the product. This is equivalent to giving the operations and maintenance team an experienced engineer who can quickly identify problems and locate root causes, and it is also what distinguishes this approach from traditional AIOps capabilities.
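The application-centric horizontal and vertical correlation described in this section can be pictured as a small data model. The sketch below is only an illustration under assumed names: `calls` stands for relationships derived from call chains (horizontal), while `pods` and `pod_node` stand for the mapping from an application to containers and hosts (vertical). None of these names come from a specific product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AppTopology:
    # Horizontal: application -> components it calls (services, cloud services, middleware).
    calls: Dict[str, List[str]] = field(default_factory=dict)
    # Vertical: application -> Pods, and Pod -> host node.
    pods: Dict[str, List[str]] = field(default_factory=dict)
    pod_node: Dict[str, str] = field(default_factory=dict)

    def downstream(self, app: str) -> Set[str]:
        """Every component reachable from the application through call chains."""
        seen: Set[str] = set()
        stack = list(self.calls.get(app, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.calls.get(current, []))
        return seen

    def nodes(self, app: str) -> Set[str]:
        """Host nodes that run the application's Pods."""
        return {self.pod_node[p] for p in self.pods.get(app, []) if p in self.pod_node}


topology = AppTopology(
    calls={"web": ["cart"], "cart": ["kafka", "db"]},
    pods={"web": ["web-0", "web-1"]},
    pod_node={"web-0": "node-a", "web-1": "node-b"},
)
print(sorted(topology.downstream("web")))  # components to inspect horizontally
print(sorted(topology.nodes("web")))       # hosts to inspect vertically
```

Starting from one application, a troubleshooting path can then walk both directions: along the call chain to downstream dependencies, and down the stack to the Pods and nodes that host it.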

Observable value output

1) Unified presentation

As mentioned above, observable data needs to cover every layer, and each layer has its own observable data. However, observability tools are currently very fragmented, and presenting the data they generate in a unified way has become a major challenge. Unifying the data itself is relatively difficult, because it involves formats, encoding rules, dictionary values, and other issues. Presenting the results uniformly is feasible, though, and the current mainstream solution is to use Grafana to build a unified monitoring dashboard.

2) Collaborative processing

Once display and alerting are unified, using collaboration platforms such as DingTalk and Enterprise WeChat to discover, handle, and track problems more efficiently (ChatOps) has gradually become a necessity.

3) Cloud service linkage

Observability has become part of cloud native infrastructure. When an observability platform discovers and locates a problem, it needs to work with cloud services to quickly scale out or rebalance load so that the problem is resolved faster.

Prometheus+Grafana Practice

Thanks to the booming cloud native open source ecosystem, it is easy to build a monitoring system: Prometheus+Grafana for basic monitoring, SkyWalking or Jaeger for tracing, and ELK or Loki for logging. However, for the operations and maintenance team, the different types of observable data are scattered across different backend systems, and troubleshooting still requires jumping between multiple systems, which hurts efficiency. For this reason, Alibaba Cloud provides enterprises with a one-stop observability platform called ARMS (Application Real-Time Monitoring Service). As a product family, ARMS includes multiple products for different observability scenarios, such as:

For the infrastructure layer, the Prometheus monitoring service monitors cloud services including ECS, VPC, and containers, as well as third-party middleware.

For the application layer, application monitoring based on Alibaba Cloud's self-developed Java probe covers application monitoring needs, with a significant improvement in data quality and other aspects over open source tools. Through link tracing, data can also be reported to the application monitoring platform from open source SDKs or probes.

For the user experience layer, modules such as mobile monitoring, front-end monitoring, and cloud testing comprehensively cover user experience and performance on different terminals.

Unified alerting performs unified alerting and root cause analysis on the data and alert information collected at each layer, and presents its findings directly through Insight.

A unified interface: whether the data is reported by ARMS and Prometheus or comes from other data sources such as log services, Elasticsearch, and MongoDB, it can be presented through a fully managed Grafana service. This establishes a unified monitoring system and links with Alibaba Cloud services to provide CloudOps capabilities.

As a one-stop product, ARMS has many capabilities. Many enterprises have already built capabilities similar to ARMS themselves, or have adopted some ARMS products such as application monitoring and front-end monitoring. A complete observable system is still crucial for them, however, and they hope to build one that meets their own business needs on top of open source. The following examples focus on how Prometheus and Grafana can be used to build an observable system.

Fast data access

In ARMS, we can quickly create a dedicated Grafana instance, and the ARMS Prometheus, SLS log service, and CMS cloud monitoring data sources can all be synchronized to it conveniently. Open Configuration to view the corresponding data sources. This makes it quick to connect various data sources while keeping the day-to-day workload of data source management as low as possible.

Preset dashboards

Once the data is connected, Grafana automatically creates the corresponding dashboards. Taking application monitoring and container monitoring as examples, basic data such as the three golden metrics and interface changes is provided by default.

As you can see, although Grafana helps build various data dashboards, what users see is still a set of fragmented dashboards. Daily operations and maintenance also requires a unified dashboard organized by business domain or application, one that displays data from the infrastructure layer, container layer, application layer, and user terminal layer in one place, so that the system can be monitored as a whole.

Unified full-stack dashboard

When building a unified full-stack dashboard, we organize it by dimensions such as user experience, application performance, the container layer, cloud services, and underlying resources.

1) User experience monitoring

Common key data such as PV and UV, JS error rate, first rendering time, API request success rate, and TopN page performance is presented in the first section.

2) Application performance monitoring

The three golden metrics (request volume, error rate, and response time) are shown here, broken down by application and service.

3) Container layer monitoring

The performance and resource usage of each Pod are listed, along with the Deployments that these applications run on. Pod performance information related to these Deployments is presented in this section.

4) Cloud service monitoring

Next is cloud service monitoring. Taking the message queue Kafka as an example, this section shows common message service metrics such as message backlog and consumption volume.

5) Host node monitoring

For each host node, data such as CPU usage and the Pods running on it is shown.
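As an illustration of the application performance section above, the three golden metrics can be derived from raw request records and broken down by service. This is a minimal sketch with hypothetical record fields (`service`, `duration_ms`, `is_error`); in practice these values would normally come from APM or Prometheus queries rather than from in-memory records.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class RequestRecord:
    """Hypothetical record of one handled request."""
    service: str
    duration_ms: float
    is_error: bool


def golden_metrics(records: Iterable[RequestRecord]) -> Dict[str, Dict[str, float]]:
    """Return request volume, error rate, and average response time per service."""
    totals: Dict[str, Dict[str, float]] = {}
    for record in records:
        t = totals.setdefault(record.service, {"requests": 0, "errors": 0, "total_ms": 0.0})
        t["requests"] += 1
        t["errors"] += 1 if record.is_error else 0
        t["total_ms"] += record.duration_ms
    return {
        service: {
            "requests": t["requests"],
            "error_rate": t["errors"] / t["requests"],
            "avg_response_ms": t["total_ms"] / t["requests"],
        }
        for service, t in totals.items()
    }


sample = [
    RequestRecord("cart", 100.0, False),
    RequestRecord("cart", 300.0, True),
    RequestRecord("pay", 50.0, False),
]
for service, metrics in golden_metrics(sample).items():
    print(service, metrics)
```

Grouping by service is what allows a dashboard to show the same three numbers for the whole system and for any single service the user switches to.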

In this way, the dashboard covers overall performance monitoring from the user experience layer through the application layer down to the container and infrastructure layers. More importantly, it contains the relevant data for all microservices: when you switch to a particular service, the performance data associated with that service is displayed on its own, and you can filter at different levels such as containers, applications, and cloud services. Briefly, this is how it works. When Prometheus collects monitoring data from these cloud services, it also collects all the tags on them. By tagging cloud services, they can be distinguished by business dimension or application. Building a unified dashboard also inevitably raises data source management issues. For this we provide the GlobalView capability, which aggregates all Prometheus instances under the account and runs unified queries across them, whether the data comes from the application layer or from cloud services.
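The two ideas in the paragraph above, filtering by tags and querying all instances at once, can be sketched in a few lines. This is an in-memory illustration only, not the GlobalView API: the instance names, the `app` label key, and the series layout are assumptions made for the example.

```python
from typing import Dict, List, Mapping


def global_query(
    instances: Mapping[str, List[dict]],
    selector: Mapping[str, str],
) -> List[dict]:
    """Return every series, from every instance, whose labels match the selector.

    Each result is annotated with the name of the instance it came from.
    """
    matched = []
    for instance_name, series_list in instances.items():
        for series in series_list:
            labels = series["labels"]
            if all(labels.get(key) == value for key, value in selector.items()):
                matched.append({**series, "instance": instance_name})
    return matched


# Hypothetical series held by two separate Prometheus instances.
instances: Dict[str, List[dict]] = {
    "prom-apps": [
        {"labels": {"app": "cart", "metric": "request_rate"}, "value": 120.0},
        {"labels": {"app": "pay", "metric": "request_rate"}, "value": 30.0},
    ],
    "prom-cloud": [
        {"labels": {"app": "cart", "metric": "kafka_backlog"}, "value": 42.0},
    ],
}

for series in global_query(instances, {"app": "cart"}):
    print(series["instance"], series["labels"]["metric"], series["value"])
```

Because the cloud service series carry the same `app` tag as the application series, a single selector returns both, which is what lets one dashboard variable drive every panel.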

Based on the scenario above, we propose a design direction for observability platforms: integrate and analyze the different kinds of data in the backend from the perspective of observing systems and services, rather than deliberately emphasizing that the platform supports separate queries for the three types of observability data. In product functionality and interaction logic, we try to shield users from the fragmentation of Metrics, Tracing, and Logging. The goal is a complete, closed-loop observability system, from anomaly detection before an incident, to fault diagnosis during it, to proactive alerting and monitoring afterward, that provides an integrated platform for continuous business monitoring and service performance optimization.
