Observability is a process in which you can infer the internal running status of a system based on external outputs.
This concept was first proposed by Hungarian-American engineer Rudolph Kalman for linear dynamic systems. Judging by the signal flow graph, if all internal states can be output to the output signal, this system is observable.
Let's use a real-world case as an example:
Cars are equipped with various sensors to measure various metrics, such as mileage, speed, engine speed, and oil. Safety sensors, such as the airflow meter, ABS sensor, oil pressure sensor, water temperature sensor, and collision sensor, are also included. These are all necessary metrics to measure whether the car is safe. Just think, if a car had no metrics, would you still dare to drive it?
As IT technologies are applied gradually to all aspects of production, system stability is becoming increasingly important. The experience related to electrical engineering is replicated to the IT field to monitor all aspects of the system, which mainly starts with network monitoring based on the simple network management protocol (SNMP) and system (CPU, memory, disk, and other basic metrics) monitoring. As IT systems become increasingly large and fields become more segmented, a large number of open-source monitoring software (such as Zabbix, Nagios, and RRDTool) and commercial companies (such as NewRelic, DataDog, and Dynatrace) have emerged. Monitoring is more focused on business and service.
In recent years, with the popularization of cloud-native technologies, the implementation of PaaS and SaaS has increased. Traditional monitoring systems are evolving towards observability systems. From the data perspective, observability covers a wider scope than monitoring. Observability not only involves system metrics for monitoring alarms but also records the internal operations of the system. From a practical point of view, traditional monitoring data can tell you whether an exception has occurred in the system and reflect at most the module where the exception occurs. Based on observability-related data, you can quickly locate the module where a problem occurs and find the root cause.
The preceding figure is very familiar to everyone. It is an excerpt from a blog post published by Peter Bourgon after he attended the 2017 Distributed Tracing Summit and briefly introduced the definitions and relationships of Metrics, Tracing, and Logging. Each of the three types of data plays a role in observability, and each type of data cannot be used for other purposes.
The following figure shows a typical troubleshooting process described in Grafana Loki:
The preceding example illustrates how to use Metrics, Tracing, and Logging for joint troubleshooting. Different combination solutions can be used in different scenarios. For example, a simple system can directly trigger alerts based on error messages from Logging and locate problems. It can also trigger alerts based on Metrics (latency and error code) extracted from Tracing. On the whole, a system with good observability must have the three types of data listed above.
When we learned that we needed to expose objective data, such as Metrics, Tracing, and Logging for the system, we started to find some open-source and commercial solutions on the Internet. Various systems and solutions came into view. We found that for each type of data, several solutions are available. For example:
After reading these plans, you might feel confused and do not know where to start. However, the reality is somewhat more chaotic. Each scheme and company has defined its protocol format and data type, making compatibility and inter-operability with different schemes difficult. It is acceptable if all the components used in a project use the same solution. For example, all components in the company use OpenTracing, and all associated open-source and commercial components are also based on OpenTracing. However, hybrid solutions are often used in a project. Therefore, developers had to develop various types of adapters to ensure compatibility, which is a headache.
This problem is especially prominent in the Tracing field. The purpose of Tracing is to connect all modules and components in a system to interact with each other using a Trace ID. If the Trace format of some components is different, the call records of the component are lost, and the Trace is useless. At the very beginning, OpenTracing specifications defined the data format of Trace, and companies could implement Trace based on the specifications. Components based on different implementations can be fully compatible with each other when they are finally combined.
However, the community also used Google's OpenCensus, which defines Metrics in addition to Trace. OpenTracing and OpenCensus were eventually merged into OpenTelemetry under the influence of the Cloud-Native Computing Foundation (CNCF) and have become a quasi-standard protocol for observability today.
OpenTelemetry will provide us with a unified standard for Metrics, Tracing, and Logging. This standard is under development, and the LogModel has already been defined. All three have the same metadata structure and can be easily correlated to each other. In addition, OpenTelemetry has brought us many benefits:
As shown in the preceding figure, OpenTelemetry covers the specification definition, API definition, specification implementation of various Telemetry data types, and data acquisition and transmission. In the future, you can only use one SDK to produce all types of data. You only need to deploy an OpenTelemetry Collector in a cluster to collect all types of data. In addition, Metrics, Tracing, and Logging have the same meta information and can be associated seamlessly.
Everything looks good so far, but from the perspective of the overall solution of observability, OpenTelemetry only produces unified data. There is no clear definition of how to store data or how to use it. As a result, the following problems stand out:
From the analysis above, you can see the orientation of OpenTelemetry is the infrastructure for observability and the solution for data specification and acquisition problems. Subsequent implementations rely on vendors. The best way is to have a unified engine to store all Metrics, Logging, and Tracing, and a unified platform to analyze, display, and correlate the data.
These coincide with the development direction of Alibaba Cloud Log Service (SLS). Since 2017, SLS has been supporting cloud-native-related observability work and has entered the CNCF Landscape as a log solution.
Currently, SLS also supports other schemes of the CNCF observability domain (Observability and Analysis), including programs officially maintained by CNCF, such as Prometheus, Fluentd, and Jaeger. The goal is to be compatible with a variety of observable data sources and provide a unified storage and analysis service for data, and build a business observability platform with higher efficiency, higher reliability, lower costs, and lower requirements.
OpenTelemetry, as the unified standard for Logging, Tracing, and Metrics in the future, is currently the most active project in CNCF besides Kubernetes. SLS is also continuously following the progress of the community. Now, Alibaba has joined the official Collectors as a Vendor. You can use Collectors to store various observability data of OpenTelemetry directly in the backend of SLS. SLS can provide the following benefits for observability:
SLS MetricStore adopts storage-computing separation architecture. Data is distributed to multiple servers for distributed storage using shards. The QueryEngine module of Prometheus is integrated at the computing layer. It reads data from each shard in parallel through the internal high-speed network and supports operator pushdown to easily cope with the pressure of ultra-large-scale data.
It is worth mentioning that the officially recommended backend storage for OpenTelemetry Metrics is Prometheus. SLS MetricStore is fully compatible with the data format and query syntax of Prometheus and solves the single point of failure (SPOF) problem of the standard Prometheus. For more information, please see Cloud-Native Prometheus Solution: High Performance, High Availability, and Zero O&M. It is the best Metrics storage solution for OpenTelemetry.
Back in 2018, the connection to Jaeger that met the OpenTracing specifications was already supported. Currently, Jaeger is a Tracing implementation officially maintained by CNCF and the storage backend recommended by OpenTelemetry. SLS serves as a storage backend for Jaeger and has many advantages, including high reliability, high performance, zero O&M, and low costs.
The capabilities of SLS in the log field will not be introduced here. If you are interested, you can search for "Log Service SLS" online. SLS does not completely define a specific log model. It can be regarded as a universal log storage engine. Currently, the Logging of OpenTelemetry is in the incubating state, but it is developing quickly. LogModel is already developed. Related protocols and SDK implementation are coming soon, which will be supported by SLS as soon as possible.
The OpenTelemetry project is currently in the incubation stage, and the SDKs and collectors of various languages are not in the Production Ready state. However, from the popularity of the community (OpenTelemetry is the most active project in CNCF except Kubernetes), you can see that major companies think very highly of OpenTelemetry. We believe OpenTelemetry will achieve unification in the field of observability. In the future, we will continue to pay attention to the development of OpenTelemetry and share different ways to implement system observability based on OpenTelemetry.
DavidZhang - April 30, 2021
Alibaba Developer - August 2, 2021
Alibaba Clouder - April 12, 2021
DavidZhang - December 30, 2020
DavidZhang - June 24, 2021
Alibaba Clouder - July 15, 2020
A unified, efficient, and secure platform that provides cloud-based O&M, access control, and operation audit.Learn More
Build business monitoring capabilities with real time response based on frontend monitoring, application monitoring, and custom business monitoring capabilitiesLearn More
Plan and optimize your storage budget with flexible storage servicesLearn More
A secure image hosting platform providing containerized image lifecycle managementLearn More
More Posts by DavidZhang