The Development Direction of Observability Under Cloud Native


I was fortunate to take part in the Beijing stop of the Cloud Native Community Meetup and to discuss cloud native technologies and applications with experts from many industries. At the meetup I gave a talk on cloud native observability. The recording is available as "Bilibili video replay: Observability of Cloud Native". This article is mainly a written summary of that talk. Comments and discussion are welcome.

Origin of observability

The concept of observability originated in electrical engineering. As systems grew more complex, engineers needed a mechanism for understanding a system's internal operating state so that they could monitor it and repair problems. They therefore designed many sensors and instrument panels to expose that internal state.

Electrical engineering has been developing for more than a hundred years, and observability in each of its sub-fields keeps improving. Vehicles (cars, planes, and so on) are among the most complete showcases of observability. Leaving aside mega-engineering projects such as aircraft, even a small car carries hundreds of sensors that monitor its internal and external conditions so that it can run stably, comfortably, and safely.

The future of observability

After more than a hundred years of development, observability in electrical engineering is no longer used only to help people check and locate problems. Taking automotive engineering as an example, observability has gone through several stages:

Blindness: On January 29, 1886, the German engineer Carl Benz filed the patent for what is widely regarded as the first automobile. At that time the car had only the most basic ability to drive, and nothing related to observability.

Sensors: Once cars entered the market, drivers needed to know whether the car was running low on fuel or water, so the basic sensor dashboard was invented.

Warning: To better ensure driving safety, cars began to use self-checks and real-time warning systems that actively notify the driver of abnormal conditions, such as low battery, high water temperature, low tire pressure, and worn brake pads.

Assistance: Even when a warning arrives immediately, people sometimes have no time to deal with it or do not want to. This is where assistance systems come in, such as cruise control, active safety, and automated parking. They combine sensors with automatic control and can partly handle what the driver cannot or does not want to do.

Autonomous driving: All of the functions above ultimately require a human in the loop. Autonomous driving removes that need entirely: the observability system plus the control system drive the car on their own.

Core elements of autonomous driving

As the peak of observability in electrical engineering, autonomous driving makes full use of all the internal and external data the car collects. It has several core elements:

Rich data sources: multiple lidar and imaging sensors are distributed around the vehicle and observe surrounding objects and their state in real time at a high frame rate. Internally, the car knows its current speed, wheel angle, tire pressure, and other information in real time, so it knows both itself and its surroundings.

Data centralization: compared with driver assistance, a core breakthrough of autonomous driving is that all data from inside and outside the vehicle is gathered and processed together. This realizes the full value of the data, instead of leaving each module's data isolated as an island.

Powerful computing power: centralized data also means a rapid growth in data volume. Every autonomous driving system is backed by a powerful chip, because only enough computing power can complete the necessary computation in the shortest possible time.

Software iteration: computing power plus algorithms are what ultimately deliver intelligence. Algorithms are never perfect, however, so they are continuously updated based on accumulated driving data, and the software system keeps being upgraded to drive better.

Observability of IT systems

After decades of development, monitoring and troubleshooting in IT systems have gradually been abstracted into observability engineering. At present, the most mainstream approach is to combine Metrics, Logging, and Tracing.

The picture above is familiar to most readers. It comes from a blog post that Peter Bourgon published after attending the 2017 Distributed Tracing Summit, which concisely describes the definitions of Metrics, Tracing, and Logging and how they relate. Each of the three types of data has its own place in observability, and none can be fully replaced by the others.

Take the typical troubleshooting process described in the introduction to Grafana Loki:

1. First, an anomaly is detected through preset alerts (usually based on Metrics or Logging).

2. Open the monitoring dashboard, find the abnormal curve, and narrow it down to the abnormal module through queries and statistics (Metrics).

3. Query and analyze the logs of that module and related ones to find the core error message (Logging).

4. Finally, use detailed call chain data to locate the code that caused the problem (Tracing).

The example above shows how Metrics, Tracing, and Logging are used together to troubleshoot a problem. Other combinations are possible depending on the scenario. For example, a simple system can alert directly on error messages in its logs and locate the problem from them, or it can trigger alerts from basic indicators extracted from the call chain (Latency, ErrorCode). On the whole, though, a system with good observability needs all three kinds of data.
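The four steps above can be sketched in a few lines of Python. This is only an illustration over in-memory sample data: the field names (`service`, `error_rate`, `trace_id`, `self_ms`), the 10% alert threshold, and the helper functions are all assumptions made for the example, not the API of Loki or of any other tool.

```python
# Hypothetical telemetry for one incident; all field names are illustrative.
METRICS = [
    {"service": "cart", "minute": 1, "error_rate": 0.01},
    {"service": "cart", "minute": 2, "error_rate": 0.02},
    {"service": "payment", "minute": 1, "error_rate": 0.01},
    {"service": "payment", "minute": 2, "error_rate": 0.35},
]
LOGS = [
    {"service": "payment", "level": "INFO", "trace_id": "t1", "msg": "charge ok"},
    {"service": "payment", "level": "ERROR", "trace_id": "t2", "msg": "db timeout"},
    {"service": "cart", "level": "ERROR", "trace_id": "t3", "msg": "item missing"},
]
SPANS = [  # self_ms = time spent in the span itself, excluding child spans
    {"trace_id": "t2", "name": "POST /pay", "self_ms": 50},
    {"trace_id": "t2", "name": "charge_card", "self_ms": 150},
    {"trace_id": "t2", "name": "db.query orders", "self_ms": 5000},
]

ERROR_RATE_THRESHOLD = 0.10  # assumed alert threshold


def firing_alerts(metrics, threshold=ERROR_RATE_THRESHOLD):
    """Step 1: a preset alert fires when a metric crosses its threshold."""
    return [m for m in metrics if m["error_rate"] > threshold]


def abnormal_services(alerts):
    """Step 2: narrow the alerts down to the abnormal modules."""
    return sorted({a["service"] for a in alerts})


def error_logs(logs, service):
    """Step 3: find the error messages logged by that module."""
    return [l for l in logs if l["service"] == service and l["level"] == "ERROR"]


def slowest_span(spans, trace_id):
    """Step 4: inside the failing trace, find the span with the most self time."""
    candidates = [s for s in spans if s["trace_id"] == trace_id]
    return max(candidates, key=lambda s: s["self_ms"], default=None)


def troubleshoot(metrics, logs, spans):
    """Chain the four steps; returns (service, error message, suspect span)."""
    findings = []
    for service in abnormal_services(firing_alerts(metrics)):
        for log in error_logs(logs, service):
            span = slowest_span(spans, log["trace_id"])
            findings.append((service, log["msg"], span["name"] if span else None))
    return findings


print(troubleshoot(METRICS, LOGS, SPANS))
# prints [('payment', 'db timeout', 'db.query orders')]
```

The chain only works because the logs carry the same `service` name as the metrics and the same `trace_id` as the spans. Without those shared keys, each step would have to be repeated by hand in a separate tool.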

Cloud native observability

Cloud native is not just about being able to deploy applications on the cloud. It defines an upgrade of the whole IT system architecture, including the evolution of development models, system architecture, deployment models, and infrastructure. These changes raise several challenges for observability:

Higher efficiency requirements: with the spread of the DevOps model, the efficiency requirements for planning, development, testing, and delivery keep rising. As a result, we need to know quickly whether a release succeeded, what problems occurred, where they are, and how to resolve them.

More complex systems: architecture has evolved from the initial monolithic model to the layered model and now to microservices. Each upgrade has improved development efficiency, release efficiency, flexibility, and robustness, but it has also made systems more complex and problems harder to locate.

More dynamic environments: both microservice architectures and containerized deployment make the environment more dynamic and shorten the life cycle of each instance. By the time a problem is noticed, the scene has often already been destroyed, so logging in to a machine to troubleshoot is no longer an option.

More upstream and downstream dependencies: locating a problem ultimately means checking upstream and downstream. With microservices, the cloud, and K8s there are many more of them, including other business applications, cloud products, middleware, K8s itself, the container runtime, virtual machines, and so on.

Savior: OpenTelemetry

I believe many readers know these problems well, and the industry has launched a variety of observability products to address them, including many open source and commercial projects.

Combining these projects can solve one or several specific problems to some degree, but when you actually put them to use you run into various issues:

Multiple interwoven solutions: you may need at least three separate solutions for Metrics, Logging, and Tracing, and the maintenance cost is huge.

Data silos: even for the same business component in the same system, the data it produces is hard to connect across different solutions, so the value of the data cannot be fully realized.

Vendor lock-in: data collection, transmission, storage, computation, visualization, and alerting may each be tied to a vendor. Once the observability system is online, replacing it is very expensive.

Not cloud native friendly: many of these solutions target traditional systems and have relatively weak cloud native support. They are also costly to deploy and use, which does not match the one-click deployment and out-of-the-box experience expected of "cloud native".

Against this background, the OpenTelemetry project was born under the Cloud Native Computing Foundation (CNCF), aiming to unify Logging, Tracing, and Metrics and make their data interoperable.

Create and collect telemetry data from your services and software, then forward them to a variety of analysis tools.

The core function of OpenTelemetry is to generate and collect observability data and to support sending it to various analysis tools. In the overall architecture, the Library generates observability data in a unified format, and the Collector receives that data and forwards it to various types of back-end systems.
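The division of labor between the Library and the Collector can be illustrated with a toy pipeline. The classes below (`Record`, `Library`, `Collector`) are hypothetical stand-ins written for this example and are not the OpenTelemetry SDK or Collector API. The two Python lists play the role of two different back ends.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Record:
    """One piece of telemetry in a single unified shape (illustrative only)."""
    kind: str  # "metric", "log", or "span"
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


Exporter = Callable[[Record], None]


class Collector:
    """Receives records and forwards each one to every registered back end."""

    def __init__(self) -> None:
        self._exporters: List[Exporter] = []

    def add_exporter(self, exporter: Exporter) -> None:
        self._exporters.append(exporter)

    def receive(self, record: Record) -> None:
        for export in self._exporters:
            export(record)


class Library:
    """Stand-in for in-process instrumentation that produces unified records."""

    def __init__(self, service: str, collector: Collector) -> None:
        self._service = service
        self._collector = collector

    def emit(self, kind: str, name: str, **attributes: str) -> Record:
        # Every record gets the same service attribute, whatever its kind.
        record = Record(kind, name, {"service": self._service, **attributes})
        self._collector.receive(record)
        return record


# Two in-memory lists stand in for two different vendors' back ends.
backend_a: List[Record] = []
backend_b: List[Record] = []

collector = Collector()
collector.add_exporter(backend_a.append)
collector.add_exporter(backend_b.append)

library = Library("checkout", collector)
library.emit("metric", "http.requests", route="/pay")
library.emit("log", "payment failed", level="ERROR")
library.emit("span", "POST /pay", trace_id="t1")

print(len(backend_a), len(backend_b))
```

In this sketch, adding or replacing a back end only touches the `add_exporter` calls. The instrumented application code that calls `emit` stays the same, which is the property the architecture is designed to provide.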

The revolutionary progress that OpenTelemetry brings to cloud native includes:

Unified protocol: OpenTelemetry brings a unified standard for Metrics, Tracing, and Logging (the logging part is still under development, though the LogModel has been defined). All three share the same metadata structure, so they can easily be correlated with one another.

Unified agent: a single agent collects and transmits all observability data, so there is no need to deploy a separate agent for each system. This greatly reduces resource usage and makes the overall observability architecture simpler.

Cloud native friendly: OpenTelemetry was born in the CNCF and supports cloud native systems well. In addition, many cloud vendors have announced support for OpenTelemetry, which will make it more convenient to use on the cloud in the future.

Vendor neutral: the project is completely neutral and does not favor any vendor, so everyone is free to choose or change service providers without being subject to monopoly or lock-in by particular vendors.

Compatibility: OpenTelemetry is supported by the various observability projects under the CNCF. In the future it is expected to be highly compatible with OpenTracing, OpenCensus, Prometheus, Fluentd, and others, which makes it possible to migrate to OpenTelemetry seamlessly.
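To make the "same metadata structure" point concrete, here is a minimal sketch of a data model in which metrics, logs, and spans all carry the same resource description and an optional trace identifier. The `Resource` and `Signal` types and their fields are assumptions chosen for illustration and are much simpler than the real OpenTelemetry data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Resource:
    """Who produced the data; attached unchanged to every signal type."""
    service: str
    host: str


@dataclass(frozen=True)
class Signal:
    kind: str  # "metric", "log", or "span"
    resource: Resource
    body: str
    trace_id: Optional[str] = None  # not every signal belongs to a trace


def correlate(signals: List[Signal], resource: Resource, trace_id: str) -> Dict[str, List[str]]:
    """Group the bodies of all signals from one resource and one trace by kind."""
    related: Dict[str, List[str]] = defaultdict(list)
    for signal in signals:
        if signal.resource == resource and signal.trace_id == trace_id:
            related[signal.kind].append(signal.body)
    return dict(related)


checkout = Resource(service="checkout", host="node-1")
signals = [
    Signal("metric", checkout, "http.server.duration=5200ms", trace_id="t2"),
    Signal("log", checkout, "ERROR db timeout", trace_id="t2"),
    Signal("span", checkout, "db.query orders", trace_id="t2"),
    Signal("log", Resource("cart", "node-2"), "INFO item added", trace_id="t9"),
]

print(correlate(signals, checkout, "t2"))
```

Because all three kinds of data share one `Resource` type and one `trace_id` field, a single equality check is enough to join them. With separate tools that each name hosts and requests differently, the same join needs a translation layer per tool.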

Limitations of OpenTelemetry

From the analysis above, OpenTelemetry is positioned as observability infrastructure: it solves the problems of data specification and data collection, while everything after that depends on individual vendors. Ideally there would be a unified engine to store all Metrics, Logging, and Tracing data and a unified platform to analyze, display, and correlate it. At present, no vendor supports a unified OpenTelemetry back end very well, so we still have to combine products from different vendors. This makes correlating the data more complex, because the data also has to be correlated across vendors. I expect this problem to be solved within one to two years, since many vendors are now working on unified solutions for all OpenTelemetry data types.

The future direction of observability

Our team has been responsible for monitoring, logging, distributed tracing, and other observability work since the Apsara (Feitian) 5K project began in 2009. We have lived through architectural changes from minicomputers to distributed systems to microservices and the cloud, and our observability solutions have changed many times along the way. In our view, the overall development of observability maps closely onto the levels defined for autonomous driving.

Autonomous driving is divided into six levels. At levels 0 to 2, decisions are mainly made by the human driver. From level 3 on, the driver no longer needs to pay constant attention and can temporarily take hands and eyes off driving. At level 5, people are completely freed from the tedious work of driving and can move about freely in the car.

Observability of IT systems can similarly be divided into six levels:

Level 0: manual analysis. Alerting and analysis are done by hand using basic dashboards, alerts, log queries, distributed tracing, and similar tools. This is where most companies are today.

Level 1: intelligent alerting. The system automatically scans all observability data, uses machine learning to identify anomalies, and raises alerts automatically, which avoids setting and tuning baseline alerts by hand.

Level 2: anomaly correlation plus a unified view. Automatically detected anomalies are correlated with their context to form a unified business view, which makes it easier to locate problems quickly.

Level 3: root cause analysis plus self-healing. The root cause is located automatically from the anomaly and the system's CMDB information. Once root cause location is accurate, self-healing can be built on top of it. This stage is a qualitative leap: in some scenarios, problems can be healed without human involvement.

Level 4: fault prediction. A fault always causes some loss once it happens, so the best outcome is to avoid it. Fault prediction uses previously accumulated precursor signals to anticipate faults and so better ensures system reliability.

Level 5: change impact prediction. Most failures are caused by changes. If we can simulate the impact of each change on the system and the problems it may cause, we can evaluate in advance whether the change should be allowed.
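As a rough illustration of the step from Level 0 to Level 1, the sketch below replaces a hand-tuned fixed threshold with a baseline learned from recent data. It uses a rolling z-score, which is a basic statistical technique and only a stand-in for the machine learning methods mentioned above. The function name, the window size, and the threshold of 3 standard deviations are assumptions for the example.

```python
from statistics import mean, stdev
from typing import List


def zscore_anomalies(values: List[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return the indexes of points that deviate from the mean of the
    preceding `window` points by more than `threshold` standard deviations."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike at index 7.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
print(zscore_anomalies(latency_ms))
# prints [7]
```

Nobody had to decide in advance that 250 ms is too slow for this service: the baseline comes from the previous five points. The sketch also shows a weakness of such simple methods, since the spike inflates the baseline for the points that follow it, which is one reason production systems use more robust models.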

Alibaba Cloud SLS's work on observability

At present, SLS is working on cloud native observability. Building on OpenTelemetry, the future standard for cloud native observability, it collects all kinds of observability data in a unified way, covering all kinds of data sources and data types, with multi-language support, multi-device support, and unified types. Above that layer, we provide unified storage and computing for all kinds of observability data, supporting petabyte-scale storage, ETL, stream computing, and analysis of tens of billions of records within seconds, which gives the algorithms above strong computing support. Problems in IT systems are very complex, especially across different scenarios and architectures, so we combine algorithms and experience to analyze anomalies. The algorithms include basic statistics, logic-based algorithms, and AIOps-related algorithms. The experience includes manually entered expert knowledge, problem solutions accumulated across the Internet, and externally generated events. At the top layer, we provide decision-support functions such as alert notification, data visualization, and Webhooks. We also provide rich external integration capabilities, such as connecting to third-party visualization, analysis, and alerting systems and offering OpenAPI for integration with different applications.
