A Unified Solution for Observability - Make SLS Compatible with OpenTelemetry

By Yuanyi

Observability in IT

Observability is a measure of how well you can infer the internal running status of a system from its external outputs.

This concept was first proposed by the Hungarian-American engineer Rudolf Kálmán for linear dynamic systems. In terms of the signal flow graph, a system is observable if all of its internal states are reflected in its output signals.

Let's use a real-world case as an example:

Cars are equipped with various sensors to measure various metrics, such as mileage, speed, engine speed, and oil. Safety sensors, such as the airflow meter, ABS sensor, oil pressure sensor, water temperature sensor, and collision sensor, are also included. These are all necessary metrics to measure whether the car is safe. Just think, if a car had no metrics, would you still dare to drive it?

As IT is gradually applied to all aspects of production, system stability is becoming increasingly important. Experience from electrical engineering has been carried over to the IT field to monitor all aspects of a system. This started with network monitoring based on the Simple Network Management Protocol (SNMP) and system monitoring (CPU, memory, disk, and other basic metrics). As IT systems grew larger and fields became more segmented, a large number of open-source monitoring tools (such as Zabbix, Nagios, and RRDTool) and commercial vendors (such as NewRelic, DataDog, and Dynatrace) emerged, and monitoring became more focused on business and service.

In recent years, with the popularization of cloud-native technologies, the adoption of PaaS and SaaS has increased, and traditional monitoring systems are evolving towards observability systems. From the data perspective, observability covers a wider scope than monitoring. It involves not only the system metrics used for monitoring and alerting but also records of the internal operations of the system. From a practical point of view, traditional monitoring data can tell you whether an exception has occurred in the system and, at most, which module it occurred in. With observability data, you can quickly locate the module where a problem occurs and also find the root cause.

Metrics, Tracing, and Logging

The preceding figure is probably familiar. It comes from a blog post that Peter Bourgon published after attending the 2017 Distributed Tracing Summit, in which he briefly introduced the definitions and relationships of Metrics, Tracing, and Logging. Each of the three types of data plays its own role in observability, and none of them can fully replace the others.

The following figure shows a typical troubleshooting process described in Grafana Loki:

At the beginning, exceptions are detected through various preset alarms, which are usually from Metrics and Logging.
After an exception is found, open the monitoring dashboard to find the exception curve and identify the exception module based on various queries and statistics (Metrics).
Perform query and statistical analysis on this module and the associated logs to find the core error information through Logging.
Finally, locate the code that caused the exception based on the detailed data of Tracing.

The preceding example illustrates how to use Metrics, Tracing, and Logging for joint troubleshooting. Different combination solutions can be used in different scenarios. For example, a simple system can directly trigger alerts based on error messages from Logging and locate problems. It can also trigger alerts based on Metrics (latency and error code) extracted from Tracing. On the whole, a system with good observability must have the three types of data listed above.
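As a rough illustration of triggering alerts from Metrics extracted from Tracing, the sketch below aggregates latency and error rate per service from a list of spans and flags the services that cross a threshold. The `Span` fields, the thresholds, and the sample data are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record (not the OpenTelemetry data model).
    trace_id: str
    service: str
    duration_ms: float
    status_code: int


def metrics_from_spans(spans):
    """Aggregate request count, error rate, and mean latency per service."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.service].append(span)
    result = {}
    for service, items in grouped.items():
        errors = sum(1 for s in items if s.status_code >= 500)
        result[service] = {
            "count": len(items),
            "error_rate": errors / len(items),
            "avg_latency_ms": sum(s.duration_ms for s in items) / len(items),
        }
    return result


def services_to_alert(metrics, max_error_rate=0.05, max_latency_ms=500.0):
    """Return the services whose derived metrics exceed either threshold."""
    return sorted(
        name
        for name, m in metrics.items()
        if m["error_rate"] > max_error_rate or m["avg_latency_ms"] > max_latency_ms
    )


# Illustrative sample data only.
SPANS = [
    Span("t1", "checkout", 120.0, 200),
    Span("t2", "checkout", 80.0, 500),
    Span("t3", "cart", 40.0, 200),
    Span("t4", "cart", 60.0, 200),
]
```

Calling `services_to_alert(metrics_from_spans(SPANS))` flags only the `checkout` service here, because half of its sample requests failed. The `trace_id` kept on each span is what would let you jump from the alert back to the detailed trace.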

The Past and Present of OpenTelemetry

When we learned that we needed to expose objective data, such as Metrics, Tracing, and Logging for the system, we started to find some open-source and commercial solutions on the Internet. Various systems and solutions came into view. We found that for each type of data, several solutions are available. For example:

Metric: Zabbix, Nagios, Prometheus, InfluxDB, OpenFalcon, and OpenCensus
Tracing: Jaeger, Zipkin, SkyWalking, OpenTracing, and OpenCensus
Logging: ELK, Splunk, SumoLogic, Loki, and Loggly

After reading about these options, you might feel confused and not know where to start. The reality is even more chaotic. Each solution and company has defined its own protocol format and data types, which makes compatibility and interoperability between solutions difficult. This is acceptable if all the components used in a project use the same solution. For example, all components in the company use OpenTracing, and all associated open-source and commercial components are also based on OpenTracing. However, a project often mixes several solutions, so developers have to write various adapters to ensure compatibility, which is a headache.

This problem is especially prominent in the Tracing field. The purpose of Tracing is to link the interactions between all modules and components in a system using a Trace ID. If some components use a different Trace format, their call records are lost and the Trace becomes useless. At the very beginning, the OpenTracing specifications defined the data format of a Trace, and companies could implement Tracing based on the specifications. Components built on different implementations could then be fully compatible with each other when they were finally combined.
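To make the role of a shared Trace format concrete, here is a minimal sketch of passing a Trace ID from one component to the next. It assumes the components agree on the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, flags); the article itself does not name a specific format, and the function names are hypothetical. The parsing is simplified and does not cover every validation rule of that specification.

```python
import re
import secrets

# Assumed shared format: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace with a random trace ID and span ID (sampled flag set)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def parse_traceparent(header):
    """Return the header fields, or None if the header is not in the shared format."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header):
    """Keep the incoming trace ID but mint a new span ID for the outgoing call."""
    ctx = parse_traceparent(header)
    if ctx is None:
        # The incoming context is unreadable, so the original trace is lost
        # and a new, unrelated trace starts at this component.
        return new_traceparent()
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"
```

A component that cannot parse the incoming header falls into the `None` branch and starts a fresh trace, which is exactly the broken-chain situation described above.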

However, the community also used Google's OpenCensus, which defines Metrics in addition to Trace. OpenTracing and OpenCensus were eventually merged into OpenTelemetry under the Cloud Native Computing Foundation (CNCF), and OpenTelemetry has become a quasi-standard protocol for observability today.

OpenTelemetry will provide us with a unified standard for Metrics, Tracing, and Logging. This standard is under development, and the LogModel has already been defined. All three have the same metadata structure and can be easily correlated to each other. In addition, OpenTelemetry has brought us many benefits:

Unified SDK: OpenTelemetry provides a corresponding SDK for each common language. In the future, only one SDK will be needed to record three types of observability data in your system.
Automatic Code Injection Technology: OpenTelemetry has also begun to provide the implementation of automatic code injection and currently supports automatic injection of various mainstream Java frameworks.
Independence from Vendors: OpenTelemetry provides a Collector that receives data sent by the various SDKs and supports connections to various backend storage systems.
Cloud-Native: OpenTelemetry was designed from the beginning with cloud-native features in mind. It provides Kubernetes operators for rapid deployment and use.

Is OpenTelemetry Omnipotent?

As shown in the preceding figure, OpenTelemetry covers the specification definition, API definition, specification implementation of various Telemetry data types, and data acquisition and transmission. In the future, you will need only one SDK to produce all types of data, and you will only need to deploy an OpenTelemetry Collector in a cluster to collect all of them. In addition, Metrics, Tracing, and Logging share the same meta information and can be associated seamlessly.
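The idea of shared meta information can be sketched as follows: if every metric, log, and span carries the same resource metadata, one filter selects all three signal types for a given service. The `Signal` class and its fields are hypothetical simplifications rather than the OpenTelemetry data model, and the attribute keys and values are only sample data.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    # Hypothetical, simplified record used only for this illustration.
    kind: str       # "metric", "log", or "span"
    resource: dict  # metadata shared by all three signal types
    body: dict      # signal-specific payload


def signals_for(signals, wanted):
    """Return every signal whose resource metadata matches all wanted pairs."""
    return [
        s for s in signals
        if all(s.resource.get(key) == value for key, value in wanted.items())
    ]


# Illustrative sample data only.
CHECKOUT = {"service.name": "checkout", "host.name": "node-1"}
CART = {"service.name": "cart", "host.name": "node-2"}

SIGNALS = [
    Signal("metric", CHECKOUT, {"name": "latency_ms", "value": 950.0}),
    Signal("log", CHECKOUT, {"level": "ERROR", "message": "payment declined"}),
    Signal("span", CHECKOUT, {"name": "POST /pay", "duration_ms": 950.0}),
    Signal("log", CART, {"level": "INFO", "message": "item added"}),
]
```

For example, `signals_for(SIGNALS, {"service.name": "checkout"})` returns the metric, the log, and the span of the same service together, with no per-format adapter in between.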

Everything looks good so far, but from the perspective of the overall solution of observability, OpenTelemetry only produces unified data. There is no clear definition of how to store data or how to use it. As a result, the following problems stand out:

Storage Methods of Various Data Types: Metrics can be stored in Prometheus, InfluxDB, and various commercial software. Tracing can be connected to Jaeger, OpenCensus, and Zipkin. It is difficult to choose among and maintain these backends.
Data Analysis: How can the collected data be analyzed uniformly after the storage problem is solved? You need to analyze different data separately in different software, and you may need an additional database to store the intermediate results so that they can be processed further.
Visualization and Association: An observability system is made to be observable, so visualization and interaction are important. A lot of customized development work is required to display Metrics, Logging, and Tracing on one platform and implement the associated jump between the three.
Fault Detection and Diagnosis: After solving the basic problem, quality improvement becomes the main pursuit. Therefore, how to implement more effective fault detection and root cause diagnosis in the system is included in the research. At this time, OpenTelemetry data needs to be integrated into AIOps-related technologies.

SLS Integrated with OpenTelemetry

From the analysis above, you can see that OpenTelemetry is positioned as the infrastructure for observability: it solves the problems of data specification and acquisition, and everything after that relies on vendors. The best approach is to have a unified engine to store all Metrics, Logging, and Tracing data, and a unified platform to analyze, display, and correlate the data.

These coincide with the development direction of Alibaba Cloud Log Service (SLS). Since 2017, SLS has been supporting cloud-native-related observability work and has entered the CNCF Landscape as a log solution.

Currently, SLS also supports other solutions in the CNCF observability domain (Observability and Analysis), including projects officially maintained by CNCF, such as Prometheus, Fluentd, and Jaeger. The goal is to be compatible with a variety of observability data sources, provide a unified storage and analysis service for the data, and build a business observability platform with higher efficiency, higher reliability, lower costs, and a lower barrier to entry.

OpenTelemetry, as the unified standard for Logging, Tracing, and Metrics in the future, is currently the most active project in CNCF besides Kubernetes. SLS is also continuously following the progress of the community. Now, Alibaba has joined the official Collectors as a Vendor. You can use Collectors to store various observability data of OpenTelemetry directly in the backend of SLS. SLS can provide the following benefits for observability:

Unified Storage: SLS natively supports Tracing and Logging storage, with dozens of petabytes of data written every day. A time series storage engine has also been released and put into large-scale use. Therefore, all Telemetry data can be stored in SLS alone.
Unified Analysis: SLS provides the SQL92 analysis method for all data. The unified analysis of Metrics, Logging, and Tracing can be implemented using this SQL syntax. Union queries across Stores are also supported. Therefore, an SQL statement can associate Metrics, Logging, and Tracing at the same time, making data connections more convenient.
Lower Costs: From the perspective of labor costs, SLS is fully service-oriented, so there are no instances for you to maintain. From the perspective of usage, SLS adopts the pay-as-you-go model, so you do not need to purchase computers and disks separately for data computing and storage.
Higher Speed: The SLS architecture with storage and computing separated gives full play to the cluster capabilities. The end-to-end speed of large amounts of data is improved significantly.
More Intelligent Algorithms: SLS provides a variety of AIOps algorithms, such as multi-period estimation, prediction, anomaly detection, and time series classification, allowing you to build an intelligent alerting and diagnosis platform that is suitable for your business.
A More Complete Ecosystem: Telemetry data can be connected to Stream Computing for faster alarm capability, data warehouses for offline statistical analysis, and OSS for archiving and storage using the upstream and downstream ecosystem of SLS.
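To give a feel for how a single SQL statement can associate different kinds of Telemetry data, the sketch below joins a log table and a span table on a shared trace ID. It uses Python's built-in `sqlite3` module with hypothetical table and column names purely as a stand-in; it is not SLS query syntax and does not reflect how SLS Stores are named or addressed.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical log and span tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE logs  (trace_id TEXT, level TEXT, message TEXT);
        CREATE TABLE spans (trace_id TEXT, service TEXT, duration_ms REAL);
        INSERT INTO logs VALUES ('t1', 'ERROR', 'payment declined'),
                                ('t2', 'INFO',  'order placed');
        INSERT INTO spans VALUES ('t1', 'checkout', 950.0),
                                 ('t2', 'checkout', 120.0);
        """
    )
    return conn


# One statement correlates the two data types through the shared trace ID.
QUERY = """
    SELECT s.service, s.duration_ms, l.message
    FROM spans AS s
    JOIN logs AS l ON l.trace_id = s.trace_id
    WHERE l.level = 'ERROR'
    ORDER BY s.duration_ms DESC
"""


def slow_error_requests(conn):
    """Return (service, duration_ms, message) rows for requests that logged errors."""
    return conn.execute(QUERY).fetchall()
```

Running `slow_error_requests(build_demo_db())` returns the one request whose log line is an error, together with the latency recorded in its span.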

Metrics

SLS MetricStore adopts storage-computing separation architecture. Data is distributed to multiple servers for distributed storage using shards. The QueryEngine module of Prometheus is integrated at the computing layer. It reads data from each shard in parallel through the internal high-speed network and supports operator pushdown to easily cope with the pressure of ultra-large-scale data.
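The general idea of reading shards in parallel and pushing an operator down can be sketched as follows: each shard computes a small partial result (a sum and a count) close to its own data, and the computing layer only merges the partial results. This is a generic illustration with hypothetical names and in-memory sample data, not the actual MetricStore or Prometheus QueryEngine implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative in-memory "shards" of (metric name, value) samples.
SHARDS = [
    [("cpu", 0.2), ("cpu", 0.4), ("mem", 0.5)],
    [("cpu", 0.6), ("mem", 0.7)],
    [("cpu", 0.8)],
]


def partial_sum(shard, metric):
    """Pushed-down operator: filter and aggregate inside one shard."""
    values = [value for name, value in shard if name == metric]
    return sum(values), len(values)


def average(shards, metric):
    """Read every shard in parallel, then merge the small partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: partial_sum(shard, metric), shards))
    total = sum(part_total for part_total, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count if count else None
```

Only one `(sum, count)` pair per shard crosses over to the merging step, rather than every raw sample, which is why pushdown helps as the data volume grows.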

It is worth mentioning that the officially recommended backend storage for OpenTelemetry Metrics is Prometheus. SLS MetricStore is fully compatible with the data format and query syntax of Prometheus and solves the single point of failure (SPOF) problem of the standard Prometheus. For more information, please see Cloud-Native Prometheus Solution: High Performance, High Availability, and Zero O&M. It is the best Metrics storage solution for OpenTelemetry.

Tracing

Back in 2018, SLS already supported connecting to Jaeger, which conforms to the OpenTracing specifications. Currently, Jaeger is a Tracing implementation officially maintained by CNCF and the storage backend recommended by OpenTelemetry. As a storage backend for Jaeger, SLS has many advantages, including high reliability, high performance, zero O&M, and low costs.

Logging

The capabilities of SLS in the log field will not be introduced here. If you are interested, you can search for "Log Service SLS" online. SLS does not completely define a specific log model. It can be regarded as a universal log storage engine. Currently, the Logging of OpenTelemetry is in the incubating state, but it is developing quickly. LogModel is already developed. Related protocols and SDK implementation are coming soon, which will be supported by SLS as soon as possible.

Notes for the Future

The OpenTelemetry project is currently in the incubation stage, and the SDKs and collectors of various languages are not in the Production Ready state. However, from the popularity of the community (OpenTelemetry is the most active project in CNCF except Kubernetes), you can see that major companies think very highly of OpenTelemetry. We believe OpenTelemetry will achieve unification in the field of observability. In the future, we will continue to pay attention to the development of OpenTelemetry and share different ways to implement system observability based on OpenTelemetry.
