
Technical Practice of Alibaba Cloud Observability Data Engine

The article explains the concept of IT system observability and introduces the technical practice and architecture of the Alibaba Cloud observability engine.

1. Preface

The concept of observability first appeared in the electrical engineering field in the 1970s. A system is said to be observable if, for any possible evolution of state and control vectors, the current system state can be estimated using the information from its outputs.


Compared with traditional alarms and monitoring, observability helps us understand a complex system in a more "white box" manner, giving us an overview of how the system operates so that we can locate and solve problems quickly. Consider the example of an engine. An alarm only tells us whether there is a problem with the engine, and the associated dashboards show speed, temperature, and pressure. However, neither helps us determine which part of the engine is at fault. We need to observe each component's sensor data to locate the problem accurately.

2. Observability of IT Systems

The age of electricity originated from the Second Industrial Revolution in the 1870s. The main signs are the broad application of electricity and internal combustion engines. But why was the concept of observability proposed after nearly 100 years? Was there no need to rely on the output of various sensors to locate and troubleshoot faults and problems before? Obviously not. The need to troubleshoot has always been there. But as modern systems are becoming more complicated, we need a more systematic way to support troubleshooting. This is why observability was proposed. Its core points include:

  • The system is more complicated: In the past, a car needed an engine, conveyor belt, and brakes to run. Now, at least hundreds of components and systems are on any car, making it more challenging to locate faults.
  • Development involves more people: With the advent of globalization, the division of labor among companies is becoming finer. This means that the development and maintenance of a system require more departments and people to cooperate. Therefore, the cost of coordination is increasing.
  • Various operating environments: The working conditions of each component change from one operating environment to another. We need to effectively record the system's status at any stage to help us analyze problems and optimize products.


After decades of rapid development, the IT system's development model, system architecture, deployment model, and infrastructure have undergone several rounds of optimization. The optimization has brought higher efficiency for system development and deployment. However, the entire system has also become more complex: the development requires more people and departments today, and the deployment model and operating environment are more dynamic and uncertain. Therefore, the IT industry has reached a stage that requires more systematic observation.


Implementing the IT system's observability is similar to that of electrical engineering. The core is to observe the output of systems and applications and judge the overall working status through data. Generally, we classify these outputs into traces, metrics, and logs. We will discuss the characteristics, application scenarios, and relationships of these three data types in detail in the upcoming sections.


3. Evolution of IT Observability


The observability technology of IT systems has been developing rapidly. From a broad perspective, observability-related technologies can be applied not only in IT operation and maintenance (O&M) scenarios but also in general and special scenarios related to the company.

  1. IT operation & maintenance scenario: From the horizontal and vertical perspective, the observation target has changed from the basic computer room and network to the user end. The observation scenario has also changed from focusing only on errors and slow requests to the actual product experience of users.
  2. General scenario: Observation is essentially a general behavior. Besides O&M scenarios, it applies to the company's security, user behavior, operation growth, and transactions. We can build application forms such as attack detection, attack traceability, ABTest, and advertisement effect analysis for these scenarios.
  3. Special scenarios: In addition to general scenarios within the company, we can also build observation scenarios and applications for different industries according to their unique characteristics. For example, Alibaba Cloud's City Brain observes information such as road congestion, traffic lights, and traffic accidents, then controls traffic light timing and plans routes for drivers to reduce overall congestion.

4. How to Implement Pragmatic Observability


In terms of implementing the observability scheme, we may not be able to build an observable engine that applies to every industry at this stage. Still, we can focus more on DevOps and general corporate business. The two core tasks include:

  1. The data coverage is wide enough: It can include different data types in various scenarios. In addition to the log, monitoring, and trace, it also needs to include CMDB, change data, customer information, order/transaction information, network flow, API call, and so on.
  2. Data association and unified analysis: The value of data is rarely discovered from one kind of data alone. Often, we need to associate several kinds of data to achieve the purpose. For example, with user information tables and access logs, we can analyze the behavior characteristics of users of different ages and genders and then make targeted recommendations. By using login logs and CMDB in conjunction with the rule engine, we can implement security attack detection.


We can divide the whole process of the observability work into four parts:

  1. Sensors: The prerequisite for obtaining data is to have enough sensors to generate data. Various types of sensors in the IT field include SDK, tracking point, external probe, etc.
  2. Data: After the sensor generates data, we need to have enough capacity to collect, classify, and analyze all data types.
  3. Computing power: The core of observable scenarios is to gather enough data. Since the data volume is humongous, the system must have enough computing power to calculate and analyze this data.
  4. Algorithms: The ultimate application of observable scenarios is the value discovery of data. Therefore, various algorithms need to be used, including some basic numerical algorithms, different AIOps-related algorithms, and combinations of these algorithms.

5. The Classification of Observability Data


Logs, traces, and metrics can meet various requirements such as monitoring, alerting, analysis, and troubleshooting of the IT system. However, in actual scenarios, we often confuse the applicable scenarios of these three data types. Outlined below are the characteristics, transformation methods, and applicable areas of observability telemetry data:

  • Logs: In the broad sense, a log is the carrier that records changes in events. Common text types such as access logs, transaction logs, and kernel logs, as well as generic data such as GPS, audio, and video, all belong to it. In the call chain scenario, we can convert logs into traces by structuring them. After aggregation and downsampling, logs become metrics.
  • Metrics: It is the aggregated value, which is relatively discrete and consists of name, labels, time, and values. A metric generally has a small data volume, lower cost, and faster query.
  • Traces: It is the most standard call log. In addition to defining the parent-child relationship of a call (usually through TraceID, SpanID, and ParentSpanID), it also defines the service, method, attribute, status, duration, and other details of an operation. Trace can replace some of the functions of logs, and the aggregation of traces can help us obtain the metrics of each service and method.
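The conversion from logs to metrics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event fields (`time`, `service`, `status`, `latency_ms`) and the metric name `http_requests` are hypothetical, and the one-minute bucket is an arbitrary choice. The output follows the name, labels, time, and values layout mentioned in the Metrics item.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical structured access-log events (field names are illustrative).
access_logs = [
    {"time": "2023-05-04T10:00:05", "service": "pay", "status": 200, "latency_ms": 12},
    {"time": "2023-05-04T10:00:40", "service": "pay", "status": 500, "latency_ms": 90},
    {"time": "2023-05-04T10:01:10", "service": "pay", "status": 200, "latency_ms": 18},
]


def logs_to_metrics(logs):
    """Aggregate raw log events into one metric point per (service, minute)."""
    buckets = defaultdict(list)
    for event in logs:
        # Downsample: truncate the timestamp to the start of its minute.
        minute = datetime.fromisoformat(event["time"]).replace(second=0, microsecond=0)
        buckets[(event["service"], minute)].append(event)

    points = []
    for (service, minute), events in sorted(buckets.items()):
        errors = sum(1 for e in events if e["status"] >= 500)
        points.append({
            "name": "http_requests",
            "labels": {"service": service},
            "time": minute.isoformat(),
            "values": {
                "count": len(events),
                "error_rate": errors / len(events),
                "avg_latency_ms": sum(e["latency_ms"] for e in events) / len(events),
            },
        })
    return points


metric_points = logs_to_metrics(access_logs)
```

Three log events collapse into two metric points, which is why metrics are smaller and faster to query than the logs they came from.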

6. "Divided" Observability Solutions


Various observability-related products are available to monitor complex systems and software environments, including many open source and commercial projects. Some of the examples include:

  1. Metrics: Zabbix, Nagios, Prometheus, InfluxDB, OpenFalcon, and OpenCensus
  2. Traces: Jaeger, Zipkin, SkyWalking, OpenTracing, and OpenCensus
  3. Logs: ELK, Splunk, SumoLogic, Loki, and Loggly

Combining these solutions can help us solve one or several specific types of problems. However, when using such solutions, we also encounter various problems:

  1. Absence of a unified solution: We may need to use at least three solutions, namely metrics, logging, and tracing. The maintenance cost of such solutions is also high.
  2. Data shareability: The data generated by these commercial and open-source tools is generally non-sharable with other vendor solutions under different scenarios. Hence, the data value is not given full play.

In a scenario where multiple solutions are adopted, troubleshooting needs to deal with various systems. If these systems belong to different teams, we need to cooperate with these teams to solve the problem. Therefore, it is better to utilize a solution that collects, stores and analyzes all types of observability data.


7. Observability Data Engine Architecture

Based on the above discussion, let's return to the essence of observability. Our observability engine meets the following requirements:

  1. Comprehensive data coverage: The engine covers all types of observable data and supports data collection from all ends and systems.
  2. Unified system: The engine helps prevent data fragmentation. It supports unified storage and analysis of traces, metrics, and logs in a single location.
  3. Data correlation: The engine can correlate each type of data internally, and it also facilitates cross-data type correlation and can use one analysis language to perform data fusion analysis.
  4. Large computing power: Our engine is distributed, scalable, and has enough capacity to analyze PB-level data.
  5. Flexible and intelligent algorithms: In addition to the basic algorithms, the engine includes AIOps-related exception detection and prediction algorithms and supports the orchestration of these algorithms.

The overall architecture of the observability data engine is shown in the following figure. From bottom to top, its four layers follow the guiding principle for putting observability into practice: sensor + data + computing power + algorithm:

  • Sensor: The data source is based on OpenTelemetry and supports the collection of various data forms, devices/ends, and data formats, with wide enough coverage.
  • Data + computing power: The collected data first goes to our pipeline system (similar to Kafka) to build different indexes based on various data types. Currently, dozens of PB of new data are written and stored on our platform every day. In addition to common query and analysis capabilities, we have built-in ETL functions responsible for cleaning and formatting data and supporting connection to external stream computing and offline computing systems.
  • Algorithms: In addition to basic numerical algorithms, we currently support more than a dozen exception detection/prediction algorithms along with streaming exception detection. Furthermore, we support the orchestration of data using Scheduled SQL, which helps us generate more new data.
  • Value discovery: The value discovery process is mainly realized through visualization, alarms, interactive analysis, and other human-computer interaction methods. At the same time, our observability engine provides OpenAPI to connect to external systems or for users to realize some custom functions.


8. Data Source and Protocol Compatibility


After Alibaba's full adoption of cloud-native technologies, we began to solve compatibility issues with open source and cloud-native protocols and solutions in the observable field. Compared with the closed mode of protocols, being compatible with open source and standard protocols allows you to capture data from various data sources seamlessly. Our platform can effectively optimize this, reducing "reinventing the wheel" work. The preceding figure shows the overall progress of our compatibility with external protocols and agents:

  • Traces: In addition to the internal Apsara trace and Hawkeye trace, we also support open-source traces, including Jaeger, OpenTracing, Zipkin, SkyWalking, OpenTelemetry, and OpenCensus.
  • Logs: Logs have fewer protocols, but many log collection agents exist. Besides Alibaba's in-house Logtail, our platform is also compatible with Logstash, Beats (Filebeat, Auditbeat), Fluentd, and Fluent Bit. It also supports the Syslog protocol, which routers and switches can use to report data to servers.
  • Metrics: The new version of the time series engine was designed from the start to be compatible with Prometheus, and it supports data access from sources such as Telegraf, OpenFalcon, OpenTelemetry Metrics, and Zabbix.

9. Unified Storage Engine

For storage engines, our primary design goal is unification: we use one set of engines to store all types of observable data. Our second goal is speed: write and query performance must hold up in ultra-large-scale scenarios inside and outside Alibaba Cloud (tens of petabytes of data written per day).


In the case of observability telemetry data, the formats and query features of logs and traces are similar; therefore, we will explain them together:

  • Logs/Traces

    • The query method allows us to query by keyword/TraceID and filter according to certain tags, such as hostname, region, and app.
    • The number of hits per query is relatively small, especially when using the TraceID query method, and the hit data is very likely to be discrete.
    • Generally, this type of data is suitable for storage in search engines, and the core technology is inverted indexing.
  • Metrics

    • Metric queries are usually range queries: each query reads a single metric/timeline or aggregates a set of timelines, for example, the average CPU usage of all machines of an application.
    • Time series queries generally have high QPS, mainly because many alert rules run such queries continuously. To adapt to high-QPS queries, we need to improve data aggregation.
    • This type of data is usually supported by special time series engines. Currently, mainstream time series engines are implemented with ideas similar to LSM Tree to adapt to high-throughput writes and queries (Update and Delete operations are rare).
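To make the inverted-index idea for logs and traces concrete, here is a minimal Python sketch. It is a toy built on assumptions rather than the engine's implementation: the class name `InvertedIndex`, whitespace tokenization, and indexing tags as `key=value` terms are all illustrative choices.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: each term maps to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text, tags=None):
        self.docs[doc_id] = text
        for token in text.lower().split():
            self.postings[token].add(doc_id)
        # Tags such as hostname or region are indexed as "key=value" terms.
        for key, value in (tags or {}).items():
            self.postings[f"{key}={value}".lower()].add(doc_id)

    def search(self, *terms):
        """Return ids of documents that contain every term (AND semantics)."""
        if not terms:
            return []
        sets = [self.postings.get(term.lower(), set()) for term in terms]
        return sorted(set.intersection(*sets))


index = InvertedIndex()
index.add(1, "payment failed trace_id=abc123", {"hostname": "host-a", "region": "cn-hangzhou"})
index.add(2, "payment ok trace_id=def456", {"hostname": "host-b", "region": "cn-hangzhou"})
index.add(3, "login failed trace_id=abc123", {"hostname": "host-a", "region": "cn-beijing"})

by_trace = index.search("trace_id=abc123")
failed_in_region = index.search("failed", "region=cn-hangzhou")
```

A TraceID lookup touches one posting list and returns a handful of scattered documents, which matches the small, discrete hit sets described in the list above.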

At the same time, observability data also has some common features, such as high throughput (high traffic, QPS, and burst), ultra-large-scale query capabilities, and time access features (hot and cold features, access locality, etc.).


We designed a unified observable data storage engine for the above feature analysis. Its overall architecture is as follows:

  1. The access layer supports writing various protocols. The written data first enters a FIFO pipeline, similar to the MQ model of Kafka. The pipeline supports data consumption to connect to various downstream.
  2. There are two sets of index structures on top of the pipeline: inverted index and SortedTable, which provide fast query capabilities for traces/logs and metrics, respectively.
  3. The two index structures differ only in layout; the mechanisms around them are shared, including the storage engine, FailOver logic, cache policy, and hot and cold data tiering policy.
  4. All of the above is implemented in the same process, which significantly reduces O&M and deployment costs.
  5. The entire storage engine is implemented based on a purely distributed framework and supports scale-out. A single store supports up to a PB level of data writing per day.
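The layering in this list (one FIFO pipeline feeding two index structures in the same process) can be sketched as follows. This is a single-machine toy under stated assumptions: `ObservabilityStore`, the record fields, and the use of a sorted Python list as a stand-in for SortedTable are all hypothetical, and nothing here reflects the real distributed design.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict, deque


class ObservabilityStore:
    """Toy store: a FIFO pipeline plus an inverted index and a sorted table."""

    def __init__(self):
        self.pipeline = deque()                # FIFO write path
        self.records = []                      # append-only log/trace storage
        self.inverted = defaultdict(set)       # token -> offsets into records
        self.sorted_table = defaultdict(list)  # series key -> sorted (ts, value)

    def write(self, record):
        self.pipeline.append(record)

    def build_indexes(self):
        """Consume the pipeline in arrival order and index each record by type."""
        while self.pipeline:
            record = self.pipeline.popleft()
            if record["type"] == "metric":
                key = (record["name"], tuple(sorted(record["labels"].items())))
                insort(self.sorted_table[key], (record["ts"], record["value"]))
            else:
                offset = len(self.records)
                self.records.append(record)
                for token in record["text"].lower().split():
                    self.inverted[token].add(offset)

    def search_logs(self, keyword):
        offsets = sorted(self.inverted.get(keyword.lower(), ()))
        return [self.records[o] for o in offsets]

    def range_query(self, name, labels, start, end):
        series = self.sorted_table.get((name, tuple(sorted(labels.items()))), [])
        lo = bisect_left(series, (start, float("-inf")))
        hi = bisect_right(series, (end, float("inf")))
        return series[lo:hi]


store = ObservabilityStore()
store.write({"type": "log", "text": "payment failed trace_id=abc123"})
store.write({"type": "metric", "name": "cpu", "labels": {"host": "host-a"}, "ts": 20, "value": 0.7})
store.write({"type": "metric", "name": "cpu", "labels": {"host": "host-a"}, "ts": 10, "value": 0.4})
store.build_indexes()

hits = store.search_logs("trace_id=abc123")
window = store.range_query("cpu", {"host": "host-a"}, 0, 15)
```

Both query paths share the write path and the process, which is the property the list credits with lowering O&M and deployment cost.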

10. Unified Analysis Engine


If the storage engine is compared to fresh food materials, then the analysis engine is the knife for processing these food materials. Based on this analogy, for different food items, we need different types of knives to achieve the best results, such as cutting knives for vegetables, bone-cutting knives for ribs, and peeling knives for fruits. Similarly, there are corresponding analysis methods for different types of observable data and scenarios:

  1. Metrics: We can use metrics for alarms and graphical displays. Metrics can be obtained directly or supplemented by simple calculations, such as PromQL and TSQL.
  2. Traces/Logs: The simplest and most direct way is the keyword query, including the trace ID query, which is a special case of the keyword query.
  3. Data analysis (generally for traces and logs): Traces and logs are useful for data analysis and mining. Here we need a Turing-complete language, and SQL is the one most widely accepted by programmers.

The above analysis methods have corresponding applicable scenarios. Using a particular syntax/language to implement all functions is difficult while ensuring good convenience. Although the capabilities similar to PromQL and keyword query can be realized by extending SQL, a simple PromQL operator may need a large string of SQL statements to implement. Therefore, our analysis engine chooses to be compatible with keyword query and PromQL syntax. At the same time, to facilitate the association of various types of observable data, we have realized the capability to connect keyword queries, PromQL, external DB, and ML models based on SQL. It makes SQL a top-level analysis language, realizing the fusion capability for observable data.


Here are a few application examples of our query/analysis. The first three examples are simple and can be used with pure keyword query, PromQL, or together with SQL. The last one shows an example of fusion analysis in actual scenarios:

  • Background: There are payment failure errors found online. We need to analyze whether there are any problems with the CPU indicators of these machines with payment failure errors.
  • Implementation

    • First, query the CPU metrics of the machines
    • Associate the Region information of machines (we need to check whether there are problems with a certain Region)
    • Join with the machines whose logs contain payment failure, and focus only on these machines
    • Finally, use the time-series anomaly detection algorithm to analyze the CPU metrics of these machines quickly
    • Visualize the final results with line charts, making the results more intuitive

In the preceding example, LogStore and MetricStore are queried at the same time, and the CMDB and ML models are associated. One statement achieves a complex analysis effect, which is common in actual scenarios, especially for analyzing complex applications and exceptions.
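The steps of this fusion analysis can be imitated in plain Python to show the data flow. This is a stand-in, not the engine's query: the in-memory dictionaries replace LogStore, MetricStore, and the CMDB, all host names and values are made up, and the anomaly check is a simple k-sigma rule against the preceding points rather than the engine's ML model. The visualization step is omitted.

```python
from statistics import mean, pstdev

# Hypothetical stand-ins for MetricStore, the CMDB, and LogStore.
cpu_metrics = {
    "host-a": [20, 22, 21, 23, 22, 21, 20, 95],
    "host-b": [30, 31, 29, 30, 32, 31, 30, 31],
    "host-c": [40, 41, 39, 40, 42, 41, 40, 41],
}
cmdb_region = {"host-a": "cn-hangzhou", "host-b": "cn-hangzhou", "host-c": "cn-beijing"}
logs = [
    {"host": "host-a", "message": "payment failed: timeout"},
    {"host": "host-b", "message": "payment failed: declined"},
    {"host": "host-c", "message": "payment ok"},
]


def ksigma_anomalies(series, k=3.0, min_history=5):
    """Indexes of points more than k standard deviations from the mean of earlier points."""
    flagged = []
    for i in range(min_history, len(series)):
        history = series[:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


def analyze_payment_failures():
    # Keep only machines whose logs contain a payment failure.
    failed_hosts = sorted({e["host"] for e in logs if "payment failed" in e["message"]})
    report = []
    for host in failed_hosts:
        report.append({
            "host": host,
            "region": cmdb_region[host],                       # join with CMDB
            "anomalies": ksigma_anomalies(cpu_metrics[host]),  # join with metrics
        })
    return report


report = analyze_payment_failures()
```

Here only `host-a` shows a CPU anomaly, at the last sample, while `host-b` failed payments with normal CPU, so CPU is unlikely to explain every failure.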


11. Data Orchestration


Compared with traditional monitoring, the advantage of observability lies in its stronger capability of data value discovery. Observability allows us to infer the operating state of a system according to output. Therefore, it is similar to data mining, where we collect all complex data types. After formatting, preprocessing, analyzing, and testing the collected data, it "tells stories" based on the conclusions reached. Therefore, when constructing the observability engine, we focus on the capability of data orchestration. This capability can make the data "flow" continuously, providing high-value data from the raw logs. In the end, observability tells us about a system's operational state and helps find answers to questions like "why the system or application is not working." To enable data to "flow," we have developed several functions:

  1. Data processing: The function of T (namely transform) in big data ETL (extract, transform, and load) can help us convert unstructured and semi-structured data into structured data for easier analysis.
  2. Scheduled SQL: As its name implies, it is SQL that runs on a schedule. The core idea is to reduce large amounts of data to smaller results that are easier to query. For example, we can periodically compute per-minute website request counts from access logs, aggregate CPU and memory metrics by app and Region, and calculate the trace topology.
  3. AIOps inspection: The inspection capability based on the time series anomaly algorithm is specially developed for time series data. It uses machines and computing power to help us detect the exact indicator with problems and the particular dimension of the indicator having errors.
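As an illustration of the data processing (transform) step, the sketch below turns an unstructured access-log line into a structured record. It assumes a common web server log layout; the pattern, field names, and sample line are hypothetical and do not represent the engine's own processing syntax.

```python
import re

# Assumed layout: ip - - [time] "method path protocol" status bytes
LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)


def transform(raw_line):
    """Parse one raw line into a dict, or return None if it does not match."""
    match = LINE_PATTERN.match(raw_line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so that later aggregation can treat them as numbers.
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record


raw = '203.0.113.7 - - [04/May/2023:10:00:05 +0800] "GET /pay HTTP/1.1" 500 512'
structured = transform(raw)
```

Once lines are structured like this, a scheduled job can roll them up by minute, app, or Region without re-parsing text on every query.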

12. Observability Engine Application Practice

Our data engine currently has over 100,000 internal and external users. It also processes over 40PB of data daily. Many companies are building their own observable platforms based on our data engine to carry out full-stack observability and business innovation. Outlined below are some common scenarios that our engine supports:

12.1 Observability in the Comprehensive Procedure

Observability in the comprehensive procedure has always been an important step in DevOps. In addition to the usual monitoring, alarm, and problem troubleshooting, it also undertakes functions, such as user behavior playback/analysis, version release verification, and A/B Test. The following figure shows the comprehensive-procedure observable architecture of one of the products of Alibaba Cloud.

  1. Data sources include mobile end, web end, and back-end data, as well as monitoring system data and third-party data.
  2. Data collection is achieved through Logtail and TLog of SLS.
  3. Based on the online-offline hybrid data processing, the data is preprocessed, including tagging, filtering, association, distribution, and so on.
  4. All types of data are stored in the SLS observable data engine, which mainly uses SLS's indexing, query, and aggregate analysis capabilities.
  5. The upper layers build the comprehensive-procedure data display and monitoring system based on the SLS interface.


12.2 Observable Cost

The priority of a commercial company is always revenue and profit. We all know that profit is revenue minus cost, and the cost in the IT sector is usually huge, especially for Internet companies. Now that Alibaba has fully migrated to the cloud, its internal teams also need to observe their IT costs closely and reduce them as much as possible. The following example shows the monitoring system architecture of a customer of Alibaba Cloud. In addition to monitoring the IT infrastructure and business, the system is also responsible for analyzing and optimizing the IT costs of the entire company. The main data gathered include:

  1. Collect fees for each product (such as virtual machine, network, storage, database, SaaS) on the cloud, including detailed billing information
  2. Collect monitoring information for each product, including usage, utilization, etc.
  3. Create a Catalog/CMDB, including the business unit, team, and usage to which each resource/instance belongs

Using Catalog and product billing information, we can calculate the IT cost of each department. Similarly, based on each instance's usage and utilization information, we can calculate the IT resource utilization of each department, such as the CPU and memory usage of each ECS. Finally, we can also determine the reasonable degree of the use of IT resources by each department/team as a whole. Based on this information, we can create operation reports to promote the optimization of departments/teams with low reasonable degrees.
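The calculation described above amounts to joining three data sets on the instance ID. The following Python sketch shows one way to do it; the instance names, fees, utilization figures, and the 0.3 threshold used to flag low utilization are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical Catalog/CMDB, billing details, and average CPU utilization.
catalog = {
    "ecs-1": {"department": "payments"},
    "ecs-2": {"department": "payments"},
    "rds-1": {"department": "search"},
}
billing = [
    {"instance": "ecs-1", "fee": 120.0},
    {"instance": "ecs-2", "fee": 80.0},
    {"instance": "rds-1", "fee": 300.0},
]
utilization = {"ecs-1": 0.125, "ecs-2": 0.25, "rds-1": 0.5}


def department_report(catalog, billing, utilization, low_threshold=0.3):
    """Per-department cost and average utilization, with a low-utilization flag."""
    cost = defaultdict(float)
    usage = defaultdict(list)
    for item in billing:
        cost[catalog[item["instance"]]["department"]] += item["fee"]
    for instance, value in utilization.items():
        usage[catalog[instance]["department"]].append(value)

    report = {}
    for dept, total in cost.items():
        values = usage.get(dept, [])
        avg = sum(values) / len(values) if values else None
        report[dept] = {
            "cost": total,
            "avg_utilization": avg,
            "needs_optimization": avg is not None and avg < low_threshold,
        }
    return report


cost_report = department_report(catalog, billing, utilization)
```

In this made-up data the payments department spends 200.0 at under 19% average utilization, so it would be the first candidate for an operation report.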


12.3 Observable Trace

With the implementation of cloud-native and microservices in various industries, distributed tracing analysis (trace) is adopted by more and more companies. For trace, its most basic capability is to record the propagation of a user request in a distributed system and determine the dependency among multiple services. In terms of its features, a trace is a regular, standardized access log with dependency. Therefore, we can use trace for the calculation to mine more value.

The following is the implementation architecture of the SLS OpenTelemetry trace. The core idea here is to calculate trace raw data through data orchestration, as well as obtain aggregated data and implement additional features of various types of traces based on the interfaces provided by SLS. These additional features include:

  1. Dependency: This is a feature supported by most trace systems. Based on the parent-child relationship in the trace, an aggregate calculation is performed to obtain trace Dependency.
  2. Service/port golden indicators: Trace records the call latency and status code of both service and port. We can calculate the QPS, latency, error rate, and other golden indicators based on these data.
  3. Upstream and downstream analysis: Based on the dependency information, aggregation is performed based on a service, and the upstream and downstream metrics on which the service depends are unified.
  4. Middleware analysis: In a trace, calls to middleware (such as database and MQ) are recorded as Spans. Based on the statistics of these Spans, we can obtain the middleware's QPS, latency, and error rate.
  5. Alarm-related: Monitoring and alarming are usually set based on the golden metrics of the service/interface. Alternatively, we can focus only on the alarms of the overall service entry (generally, spans with an empty parent are considered service entry calls).
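The dependency and golden-indicator features above are aggregations over spans, which the following Python sketch illustrates. The span fields are simplified and hypothetical (a real OpenTelemetry span carries more attributes), and the 60-second window is an arbitrary choice.

```python
from collections import Counter, defaultdict

# Simplified, hypothetical spans; parent_span_id None marks a service entry call.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "gateway", "duration_ms": 120, "error": False},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "service": "order", "duration_ms": 80, "error": False},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b", "service": "mysql", "duration_ms": 30, "error": False},
    {"trace_id": "t2", "span_id": "d", "parent_span_id": None, "service": "gateway", "duration_ms": 200, "error": True},
    {"trace_id": "t2", "span_id": "e", "parent_span_id": "d", "service": "order", "duration_ms": 150, "error": True},
]


def dependencies(spans):
    """Count caller -> callee edges using the parent-child relationship."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get((s["trace_id"], s["parent_span_id"]))
        if parent is not None and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return edges


def golden_metrics(spans, window_seconds):
    """QPS, average latency, and error rate per service over one time window."""
    grouped = defaultdict(list)
    for s in spans:
        grouped[s["service"]].append(s)
    return {
        service: {
            "qps": len(items) / window_seconds,
            "avg_latency_ms": sum(i["duration_ms"] for i in items) / len(items),
            "error_rate": sum(1 for i in items if i["error"]) / len(items),
        }
        for service, items in grouped.items()
    }


edges = dependencies(spans)
metrics = golden_metrics(spans, window_seconds=60)
entry_spans = [s for s in spans if s["parent_span_id"] is None]
```

Middleware such as `mysql` appears in the same output as any other service, because its calls are recorded as spans too.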


12.4 Orchestration-based Root Cause Analysis

In the early stage of observability, a lot of work requires manual execution. Ideally, we need an automated system to help us automatically diagnose exceptions based on the observed data when problems occur. It should also determine a reliable root cause and automatically fix issues according to the root cause diagnosis. At this stage, automatic exception recovery is difficult to achieve, but the location of the root cause can be identified through some algorithms and orchestration methods.


The following figure shows the observation abstraction of a typical IT system architecture. Each application will have its golden metrics, business access log, error log, basic monitoring metrics, call middleware metrics, and associated middleware metrics and logs. Concurrently, tracing can help determine the dependency between upstream and downstream apps and services. With such data and some algorithms and orchestration, we can perform automatic root cause analysis to some extent. The core dependencies are as follows:

  1. Correlation: We can use tracing to calculate the dependencies between apps and services and CMDB to obtain the dependencies between apps, PaaS, and IaaS. Based on the correlation, we can find out the underlying cause of the problem.
  2. Time series anomaly detection algorithm: It automatically detects whether a specified curve or group of curves is abnormal. Supported algorithms include ARMA, KSigma, and Time2Graph. For detailed information about the algorithms, see anomaly detection algorithm and streaming anomaly detection.
  3. Log clustering analysis: It helps aggregate logs with high similarity and extract common patterns to get an overall picture of the logs. We can also use the comparison functionality of Pattern to compare the patterns in normal and abnormal periods to find exceptions in logs.
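The idea behind log clustering and pattern comparison can be shown with a deliberately simplified Python sketch. It is not the engine's clustering algorithm: it assumes that masking numbers, IP-like tokens, and hex ids with a placeholder is enough to group similar messages, and the sample messages are invented.

```python
import re
from collections import Counter

# Numbers, dotted numbers (such as IPs), and 0x-prefixed hex ids count as variables.
VARIABLE_TOKEN = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_pattern(message):
    """Replace variable-looking tokens with a placeholder to expose the pattern."""
    return VARIABLE_TOKEN.sub("<*>", message)


def cluster(messages):
    """Count how many messages share each pattern."""
    return Counter(to_pattern(m) for m in messages)


def new_patterns(normal_messages, abnormal_messages):
    """Patterns seen in the abnormal period that never occurred in the normal one."""
    baseline = cluster(normal_messages)
    return {p: n for p, n in cluster(abnormal_messages).items() if p not in baseline}


normal = ["order 1001 paid", "order 1002 paid"]
abnormal = [
    "order 1003 paid",
    "connect to 10.0.0.1 timeout after 300 ms",
    "connect to 10.0.0.2 timeout after 305 ms",
]
suspicious = new_patterns(normal, abnormal)
```

Comparing the two periods surfaces the timeout pattern without anyone having to know which keyword to search for in advance.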


The exception analysis of time series and logs can help us determine whether there is a problem with a component, and the correlation can enable us to find out the cause of the problem. Combining these three core functionalities can help us build a root cause analysis system for exceptions. The following figure is a simple example: First, analyze the golden indicator of the entry from the alarm, and then analyze the data of the service, the dependency middleware indicator, and the application Pod / virtual machine indicator. We can use trace dependency to recursively analyze whether there is a problem with the downstream dependency. Further, some change information can be associated to quickly locate whether the change causes an exception. The abnormal events found are concentrated on the timeline for analysis. Alternatively, we can rely on O&M and development staff to determine the root cause.
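The walk over downstream dependencies can be sketched as a graph traversal. The version below is an assumption-laden illustration: the dependency map and the set of abnormal services are hypothetical inputs that would come from tracing, CMDB, and the anomaly detection steps, and the traversal is iterative rather than recursive so that dependency cycles cannot cause unbounded recursion.

```python
def find_root_causes(entry, depends_on, abnormal):
    """Abnormal services reachable from the entry that have no abnormal dependency."""
    if entry not in abnormal:
        return []
    reachable, stack = set(), [entry]
    while stack:
        service = stack.pop()
        if service in reachable:
            continue
        reachable.add(service)
        # Only follow edges into services that are themselves abnormal.
        stack.extend(d for d in depends_on.get(service, []) if d in abnormal)
    return sorted(
        s for s in reachable
        if not any(d in abnormal for d in depends_on.get(s, []))
    )


# Hypothetical dependencies derived from trace data and the CMDB.
depends_on = {
    "gateway": ["order", "user"],
    "order": ["mysql", "mq"],
    "user": ["mysql"],
}
abnormal_services = {"gateway", "order", "mysql"}

candidates = find_root_causes("gateway", depends_on, abnormal_services)
```

The result is a list of candidates rather than a verdict, so change events and human judgment still decide the final root cause, as the paragraph above notes.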


13. Summary

The concept of observability is not a "black technology" invented overnight but a word that "evolved" from our daily work, similar to monitoring, problem troubleshooting, and prevention. Likewise, at first, we only worked on the log engine (Log Service of Alibaba Cloud), then we gradually optimized and upgraded it to an observability engine. For "observability," we have to put aside the concept itself to discover its essence, which is often related to business. For example, the goals of observability include:

  1. Make the system more stable, and the user experience better
  2. Observe IT expenditure to eliminate unreasonable use and save more costs
  3. Observe trading behavior and find click farming and cheating behaviors in time to prevent further losses
  4. Use automatic means such as AIOps for problem discovery, thus saving manpower and improving O&M efficiency

For the R&D of the observability engine, our main concern is how to serve more departments and companies for the rapid and effective implementation of observability solutions. We have made continuous efforts in sensors, data, computing, and algorithms of the engine. The achievements of our efforts include more convenient eBPF collection, data compression algorithms with higher compression ratios, parallel computing with higher performance, and root cause analysis algorithms with higher recall rates. We will continue updating our work on the observability engine for everyone. Stay tuned.
