How to build an observability data middle platform on the cloud?

Feitian is a huge software system. It has many modules and runs on tens of thousands of physical machines. Making distributed software run efficiently is inseparable from monitoring, performance analysis, and related work. Therefore, when research and development of Feitian first began, we started developing the Feitian monitoring system "Shennong" at the same time. By collecting a large amount of system data, Shennong helps us better understand the relationships behind the complex collaboration between systems and software. Shennong has also grown along with the Feitian operating system, covering more and more clusters at home and abroad and supporting the businesses of Alibaba Group and Alibaba Cloud. In 2015, after some thought, we decided to abstract Shennong into a lower-level service, SLS (Log Service, which initially focused mainly on the log scenario), hoping to support more Ops scenarios through SLS, including AIOps (intelligent analysis engine).

1. The background of building observability middle platform
Let me first describe the changes from an engineer's perspective. Five years ago, an engineer's work was finely divided, and the job of R&D was simply to write good code. However, as the Internet has developed, business systems have grown larger and larger, with higher requirements for quality, availability, and operability. To keep improving the business, more operational work must also be taken on, such as collecting statistics on system access, retention, and user experience.

The same trend appears when we move from a personal perspective to an industry perspective. More than a decade ago, R&D time was spent on three parts: innovation (coding), deployment + going online, and observation + analysis, with deployment + going online taking up a lot of the time. In recent years, the rise of cloud computing and cloud native has freed development and operations from much of the effort of deployment, going online, and environment standardization. However, higher business requirements mean that each stage must cover a larger scope and look at problems from multiple perspectives, which brings a large amount of fragmented data analysis work.

5-year change in an engineer's career

If we break down the specific data analysis work, it can be reduced to a simple black box. On the left of the black box is the data source, and on the right are the actions we take after observing and judging the data. For example:

· In security scenarios, security operation engineers will collect logs of firewalls, hosts, systems, etc., model the logs based on experience, identify high-risk operations, generate key events, and the system will issue alarms based on multiple events.

· In monitoring and operations scenarios, the process is similar; only the data source and the modeling method differ.

So although the roles and data sources differ from scenario to scenario, the mechanism is the same, and we can establish a systematic analysis framework to support this kind of observability requirement.
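The black box described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` type, the `failed_login_rule` rule, the record fields, and the alarm threshold are all hypothetical and are not part of SLS.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Event:
    source: str
    kind: str
    detail: str


# A rule models one raw log record and returns a key event, or None.
Rule = Callable[[dict], Optional[Event]]


def failed_login_rule(record: dict) -> Optional[Event]:
    """Hypothetical rule: treat a failed login as a high-risk operation."""
    if record.get("action") == "login" and record.get("status") == "failed":
        return Event(record.get("source", "unknown"), "failed_login",
                     record.get("user", "unknown"))
    return None


def analyze(records: Iterable[dict], rules: List[Rule],
            alarm_threshold: int) -> Tuple[List[Event], bool]:
    """Left side of the box: raw records. Right side: events and an alarm decision."""
    events: List[Event] = []
    for record in records:
        for rule in rules:
            event = rule(record)
            if event is not None:
                events.append(event)
    # Raise an alarm only when enough key events have accumulated.
    return events, len(events) >= alarm_threshold


# Made-up records from several sources (hosts and a firewall).
sample_records = [
    {"source": "host-1", "action": "login", "status": "failed", "user": "root"},
    {"source": "host-1", "action": "login", "status": "ok", "user": "alice"},
    {"source": "firewall", "action": "login", "status": "failed", "user": "root"},
    {"source": "host-2", "action": "login", "status": "failed", "user": "root"},
]
events, alarm = analyze(sample_records, [failed_login_rule], alarm_threshold=3)
```

Swapping in a different record source and a different set of rules turns the same skeleton into a monitoring or operations pipeline, which is the point the paragraph above makes.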

2. Technical challenges of the middle platform
The idea of building a middle platform seems very straightforward. What are the challenges in doing this?

We can analyze this across three stages: data source, analysis, and judgment:

· The first major challenge comes from data source access. Taking monitoring scenarios as an example, the industry has different visualization, collection, and analysis tools for different data sources. In order to establish a monitoring observability system, a large number of vertical systems need to be introduced. These systems have different storage formats, inconsistent interfaces, and different software experiences, and it is often difficult to form a synergy.

· The second major challenge comes from performance and speed. Data analysis is really a process of accumulating expert experience (Domain Knowledge), and Ops scenarios are generally mission-critical, so very fast analysis and WYSIWYG capabilities are required.

· The third major challenge comes from analytical capabilities. After accessing enough data, we often face problems such as too many monitoring items, too much data, and too many clues. We need a set of methods to help us reduce dimensionality, discover, correlate, and reason. AIOps algorithms currently focus on this layer.

The first two challenges are essentially system problems, while the latter two are related to algorithms and computing power. A middle platform can solve the first and second problems.

3. Alibaba Cloud SLS, self-developed and self-used observability middle platform
In 2015, we developed SLS. After several years of refinement and evolution, it is developing into a unified observability middle platform. Downward, SLS connects to various open source protocols and data sources; upward, it supports various scenarios. Its core capability is providing unified storage and computing for all kinds of observability monitoring data. The platform can be summed up as "1, 2, 3, 4".

· "1" represents one middle platform.

· "2" represents two basic storage models, Logstore and MetricStore: log storage (Logstore) suited to Trace/Log data, and time series storage (MetricStore) suited to Metric monitoring data. The two kinds of storage are not isolated; they are built on a unified storage concept and can be converted into each other very flexibly.

· "3" represents three types of analysis engines: the data processing engine (DSL), the SQL query analysis engine (SQL), and the intelligent analysis engine (AIOps). DSL mainly targets data processing and preprocessing scenarios and solves the problem of diverse formats; the SQL query analysis engine provides cleaning and computing capabilities for stored data; and the embedded AIOps engine provides intelligent algorithms for specific problems.

The platform always provides supporting capabilities to users and is compatible with various data sources and protocols; it supports businesses but does not build business products itself.

SLS builds an observability data middle platform: 1-2-3

1. Storage design

To build an observability middle platform, let's first look at the current state of storage systems. When building AIOps systems in the operations and maintenance field, four types of storage systems have long coexisted:

· Hadoop/Hive: stores historical logs, metrics, and other data. Storage is cheap and analysis capability is strong, but latency is high.

· ElasticSearch: stores Trace and Log information that needs real-time access. Retrieval is fast, but the cost is high. It is suitable for near-line hot data and has moderate analysis capabilities.

· NoSQL: used to store aggregated metric data. TSDB is an extension of NoSQL storage. Aggregated metrics can be retrieved quickly and at relatively low cost. The disadvantage is weak analysis capability.

· Kafka: used to import, export, and route various data. It mainly stores temporary data, has rich upstream and downstream interfaces, and has no analysis capability.

These four separate categories of storage systems do a good job of addressing four different types of needs, but there are two major challenges:

· Data mobility

Once data is stored, it can serve a given scenario, but the problem that comes with it is mobility. Data lives in multiple systems, and it has to be moved whenever data is associated, compared, or integrated, which often takes a lot of time.

· Interface ease of use

The interfaces for different storage objects are not uniform. For example, Logs are generally accessed through wrappers around the ES APIs, while Metrics are generally called directly through the Prometheus protocol or through NoSQL interfaces. Integrating data therefore often involves different APIs and interaction methods, which increases the overall complexity of the system.

With the four storage systems as they stand, putting data to use takes a long lead time and a fair amount of development work, which limits AIOps, DataOps, and other scenarios from playing a greater role.

2. How to abstract storage

If we abstract the process that generates monitoring data, we find that it generally consists of two parts: change + status. Everything is a process of continuous change. For example, the state of a database table at a certain moment (such as 2 o'clock) is actually the accumulated result of all changes in its history. The same is true in the field of monitoring. We can save (or sample) the changes in system state as completely as possible through Log, Trace, and so on. For example, if a user performs 5 operations within 1 hour, we can capture the logs or traces of these 5 operations. When we need a status value (such as the system status at 2 o'clock), we can replay all these operation logs to form a summary value at a certain point in time; for example, within a window of 1 hour, the operation count is 5. This is a simple Log-to-Metric relationship. We can also apply other logic, such as taking an Avg over the Latency field in the Log to obtain the latency of the window.
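The Log-to-Metric relationship can be illustrated with a small sketch. The record layout (`time` in seconds, `latency_ms`) and the function name `replay_window` are assumptions made for this example, not an SLS interface.

```python
from statistics import mean


def replay_window(logs, window_start, window_size):
    """Replay the change records that fall in one window and summarize them."""
    in_window = [
        log for log in logs
        if window_start <= log["time"] < window_start + window_size
    ]
    return {
        # Number of operations observed in the window.
        "count": len(in_window),
        # Average of the latency field, or None if the window is empty.
        "avg_latency_ms": mean(log["latency_ms"] for log in in_window) if in_window else None,
    }


# Made-up data: five operations by one user within one hour (times in seconds).
operation_logs = [
    {"time": 0, "latency_ms": 10},
    {"time": 600, "latency_ms": 20},
    {"time": 1200, "latency_ms": 30},
    {"time": 2400, "latency_ms": 40},
    {"time": 3000, "latency_ms": 50},
]
metric = replay_window(operation_logs, window_start=0, window_size=3600)
```

With this data, the one-hour window yields a count of 5 and an average latency of 30 ms; a window that contains no records yields a count of 0.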

In designing SLS storage, we followed the same principle:

· The bottom layer provides a FIFO Binlog queue. Data is written and read sequentially, strictly ordered by write time (Arrival Time).

· On top of the Binlog, we can select certain fields to generate a Logstore, which can be thought of as a database table: it has a Schema with at least the field EventTime (the original time when the event occurred), and the types and names of the columns can be specified. In this way, we can retrieve the content of the Logstore through keywords and SQL.

· In addition, we can generate multiple metric stores from certain columns in the Logstore as required. For example, based on Host+Method+Time, we can build a monitoring data table with Host+Method as the Instance, so that the data can be retrieved by instance.

Let's look at an example: The following is a visit record of a site, which experienced 4 visits in 1 second.

When these data are written into the Logstore, it is equivalent to writing into a database for storing logs, and any field in it can be queried and analyzed through SQL. For example, "select count(1) as qps" to get the current aggregated QPS.
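As a rough illustration of querying log records with SQL, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a Logstore. The table name `access_log`, its columns, and the four sample rows are made up for illustration; this is not the SLS query syntax or API.

```python
import sqlite3

# In-memory table standing in for a Logstore with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_log ("
    "event_time INTEGER, host TEXT, method TEXT, latency_ms INTEGER)"
)

# Four made-up visits that all happened within the same second.
rows = [
    (1600000000, "host-a", "GET", 10),
    (1600000000, "host-a", "GET", 30),
    (1600000000, "host-a", "POST", 20),
    (1600000000, "host-b", "GET", 40),
]
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", rows)

# Aggregate QPS for that one second, in the spirit of "select count(1) as qps".
qps = conn.execute(
    "SELECT count(1) AS qps FROM access_log WHERE event_time = ?",
    (1600000000,),
).fetchone()[0]
```

Because every field is a column, the same table can also be filtered or grouped by `host`, `method`, or `latency_ms` without any extra preparation.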

You can also pre-define some dimensions. For example, if you want the combination of host+method to be the minimum monitoring granularity and want QPS, Latency, and other data every 1 second, you can define a MetricStore accordingly. When data is written, results are generated automatically according to the rules.

In this way, we can store and aggregate raw data so that the conversion between logs and metrics happens within a single storage system.
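The rule-based rollup from log records to per-instance metrics might look like the following sketch. The field names (`event_time`, `host`, `method`, `latency_ms`), the function name `rollup`, and the sample records are assumptions for illustration, not the MetricStore definition format.

```python
from collections import defaultdict


def rollup(logs):
    """Aggregate raw logs into one metric point per (host, method, second)."""
    groups = defaultdict(list)
    for log in logs:
        # Truncate the event time to a whole second to form the time bucket.
        key = (log["host"], log["method"], int(log["event_time"]))
        groups[key].append(log["latency_ms"])
    return {
        key: {"qps": len(latencies), "avg_latency_ms": sum(latencies) / len(latencies)}
        for key, latencies in groups.items()
    }


# Made-up access records: four visits within the same second.
access_logs = [
    {"event_time": 1600000000.1, "host": "host-a", "method": "GET", "latency_ms": 10},
    {"event_time": 1600000000.4, "host": "host-a", "method": "GET", "latency_ms": 30},
    {"event_time": 1600000000.6, "host": "host-a", "method": "POST", "latency_ms": 20},
    {"event_time": 1600000000.9, "host": "host-b", "method": "GET", "latency_ms": 40},
]
metrics = rollup(access_logs)
```

Here the four raw records collapse into three metric points, one per host+method instance, while the raw records remain available for ad hoc queries.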

3. Computational design

According to the usual scenarios, we abstract the calculation of monitoring data into three types of problems:

· How can unstructured data be turned into structured data?

· Faced with complex systems, can we design a WYSIWYG, low-threshold language for data analysis?

· Faced with massive amounts of information, is there a dimensionality reduction algorithm that reduces the complexity of the problem?

We have constructed three types of computing methods to deal with the above problems respectively:

The first problem is really one of business complexity, rooted in the gap between the people who generate the data and the people who use it. In most development processes, developers write the logs, but operations and maintenance staff and business operators analyze them. Logs are not written with enough foresight about how they will be used, so the data cannot be used directly. Here we need a low-code language for data conversion, distribution, and enrichment that simplifies data in different formats from multiple business systems. To this end, we designed a language (DSL) oriented to data processing (ETL) scenarios, which provides more than 300 commonly used operators built to handle all kinds of awkward log format problems.

For example, in the original log, project_id appears only as a parameter inside the access url, and the ip field carries no information about what it corresponds to. In the DSL language of SLS, we only need to write 3 lines of code to extract the parameters from the url and enrich them with fields from the database. The seemingly useless access logs immediately become useful, and the access relationship between the host and the user can be analyzed.
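The extract-then-enrich step can be approximated in plain Python. This is not the SLS DSL; the lookup table `PROJECT_TABLE`, its `owner` field, and the sample log are hypothetical stand-ins for the database and the raw access log.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for the database table used for enrichment.
PROJECT_TABLE = {
    "1001": {"owner": "alice"},
    "1002": {"owner": "bob"},
}


def enrich(log, project_table):
    """Extract project_id from the url parameters and join database fields onto the log."""
    query = parse_qs(urlparse(log["url"]).query)
    project_id = query.get("project_id", [None])[0]
    enriched = dict(log, project_id=project_id)
    # Add the matching database fields; unknown projects are left unenriched.
    enriched.update(project_table.get(project_id, {}))
    return enriched


raw_log = {"ip": "10.0.0.1", "url": "/api/v1/list?project_id=1001&page=2"}
enriched_log = enrich(raw_log, PROJECT_TABLE)
```

After enrichment the record carries both the `ip` of the host and the `owner` of the project, which is what makes the host-to-user access relationship analyzable.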

The second problem is the integration of multiple languages. Our choice is to use SQL as the query and analysis framework and to integrate PromQL and various machine learning functions into it. In this way, a subquery and a main query can be nested to compute and predict results.

Here is an example of a complex analysis:

· First get the monitoring value of the host per minute by calling the promql operator

· Downsample raw data by window function, e.g. to values per second

· Predict the query results through the outer prediction function
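The nesting of "fetch, downsample, predict" can be mimicked in Python to show the data flow. Everything here is an assumption for illustration: the input list stands in for the result of the promql step, the window width is arbitrary, and a least-squares line stands in for the prediction function, whose actual algorithm is not described here.

```python
def downsample(points, window):
    """Average (timestamp, value) points into fixed windows of `window` seconds."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % window
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


def predict_next(series, step):
    """Extrapolate one step ahead with a least-squares straight line."""
    n = len(series)
    xs = [t for t, _ in series]
    ys = [v for _, v in series]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
        if denom else 0.0
    )
    next_t = xs[-1] + step
    return next_t, mean_y + slope * (next_t - mean_x)


# Made-up per-minute values for one host: 0, 1, ..., 9 at 60-second intervals.
per_minute = [(60 * i, float(i)) for i in range(10)]
windowed = downsample(per_minute, window=300)        # inner steps
next_ts, forecast = predict_next(windowed, step=300)  # outer step
```

The inner calls play the role of the subquery and the outer call plays the role of the main query, so each stage only needs to understand the output shape of the stage before it.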

The third problem is the algorithm. We have built in a large number of AI-based algorithms for inspection, prediction, clustering, and root cause analysis, which can be directly used in manual analysis and automatic inspection alarms. These algorithms are provided to users through SQL/DSL functions and can be used in various scenarios.

4. Middle platform support cases
SLS has tens of thousands of users inside and outside the Alibaba Group, and it is widely used in various AIOps data analysis scenarios. Here are two interesting cases.

Case 1: Traffic Solution

Traffic log is the most common type of access log. Whether it is Ingress, Nginx or CDN Access Log, it can be abstracted into an access log type. In the SLS solution:

· Collect a copy of the original log and keep it for 7 days (LogStore) for query, and back it up to the object storage (OSS) for a longer period of time.

· Perform data processing + aggregation of each dimension on the log through SLS native SQL, such as Group By according to the microservice interface.

· Store the aggregated data in time-series storage (MetricStore).

· Perform intelligent inspection on thousands of interfaces through the AIOps inspection function and generate alarm events.

The entire process from acquisition to configuration to operation only takes 5 minutes, which meets diverse requirements.
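The last step of this pipeline, inspecting many per-interface time series and raising alarm events, can be sketched with a very simple rule. The deviation check below is only a stand-in chosen for brevity and is not the SLS inspection algorithm; the interface names and values are made up.

```python
from statistics import mean, pstdev


def inspect(series_by_interface, k=3.0):
    """Flag interfaces whose latest value deviates more than k standard deviations from history."""
    alarms = []
    for interface, values in series_by_interface.items():
        history, latest = values[:-1], values[-1]
        if len(history) < 2:
            continue  # Not enough history to judge this interface.
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0 and abs(latest - mu) > k * sigma:
            alarms.append(interface)
    return alarms


# Made-up aggregated request counts per interface, oldest first.
aggregated = {
    "/api/order": [100, 102, 98, 101, 99, 300],
    "/api/user": [50, 52, 48, 51, 49, 50],
}
alarm_interfaces = inspect(aggregated)
```

In this made-up data only `/api/order` is flagged, because its latest value jumps far outside the range of its own history.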

Case 2: Cloud cost monitoring and analysis

Alibaba Cloud users face a large amount of billing data every day. The cloud cost center has used the collection, analysis, visualization, and AIOps functions of SLS to develop a cost steward application, which can be used to find the reason for exceptions.

5. Closing thoughts
Although we have not directly built AIOps applications in the past few years, by turning data capabilities and AI capabilities into a middle platform we have instead supported more users and scenarios. Finally, here is a brief summary of our experience with observability over the past two years:

AIOps = AI + DevOps/ITOps/SecOps/BusinessOps…

At present, most people think that AIOps solves operation and maintenance problems. In fact, this set of methods can be applied seamlessly to various Ops scenarios, such as DevOps, SecOps (big data security), and operations and user growth; the method is general. As in any field of AI, data and computing power are the foundation and algorithms are the core, and none of them can be left out.

Domain Knowledge is the Key to AIOps Implementation

An experienced operations engineer or analyst has deep insight into the system and experience modeling it. Therefore, to implement AIOps, we must respect the accumulated experience of experts, for example through templating, knowledge representation and reasoning, or the use of transfer learning in some scenarios.
