Observability Practices of the Cloud Native Gateway


The term "observability" comes from control theory, where it refers to the extent to which a system's internal state can be inferred from its external outputs. Over decades of development in the IT industry, monitoring, alerting, troubleshooting, and related practices have gradually matured, and the industry has abstracted them into a complete observability engineering system. Today, observability is not just a specific tool or technology but also a way of thinking. It has become a key component of successfully managing complex distributed systems, providing the ability to understand, explore, and schedule a system while it is running.

The cloud native gateway is a managed gateway product under Alibaba Cloud's Microservices Engine (MSE). It combines the traditional traffic gateway and the microservice gateway into one. This article describes how to build an observability system for gateway scenarios based on the cloud native gateway.

Difficulties in building observability for gateway scenarios

As the entry point for business traffic, the gateway's observability is closely tied to the stability of the overall business. At the same time, the gateway serves many user scenarios, offers many functions, and runs in a complex network environment, all of which make its observability harder to build. The main difficulties are described below.

Many roles focus on gateway observability

The core of observability is to meet the needs of different roles, who understand the system's status by observing data. Because the gateway is the traffic entry point, business, R&D, SRE, and other roles all pay attention to its status, so the gateway's observability system can only be built on an in-depth understanding of what each role needs. The following figure outlines a simplified life cycle of observability data. The data is generated by the application, stored after intermediate processing, and then exposed through a query service. It serves different groups of people, such as product users, business staff, R&D, and SRE, and each group consumes it in a different form.

Basic life cycle of observability

Instrumentation is not precise enough, and statistics collection is costly

Instrumentation is not precise enough. Adding instrumentation points is not hard in itself; the hard part is deciding which data actually fits the usage scenario. This requires designers with rich experience, or continuous iteration and refinement once the system is online.

Statistics collection is costly. Implementing observability is often a trade-off among time, space, and granularity. If the time granularity of the statistics is too fine, storage grows rapidly; if it is too coarse, problems become hard to locate. This makes observability difficult to implement well.
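The granularity trade-off can be made concrete with a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function names, the 16 bytes per sample (an uncompressed 8-byte timestamp plus an 8-byte value), and the series count are assumptions, not figures from the cloud native gateway.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds: int, series_count: int) -> int:
    """Number of data points written per day at a given aggregation interval."""
    return (SECONDS_PER_DAY // interval_seconds) * series_count


def estimate_daily_bytes(
    interval_seconds: int,
    series_count: int,
    bytes_per_sample: int = 16,  # assumption: 8-byte timestamp + 8-byte value, no compression
) -> int:
    """Rough uncompressed storage needed per day."""
    return samples_per_day(interval_seconds, series_count) * bytes_per_sample


if __name__ == "__main__":
    # Compare three aggregation intervals for a hypothetical 10,000 time series.
    for interval in (1, 15, 60):
        size = estimate_daily_bytes(interval, series_count=10_000)
        print(f"{interval:>2}s interval -> {size / 1e9:.2f} GB/day")
```

Under these assumed numbers, a 1-second interval needs about 13.8 GB per day while a 60-second interval needs about 0.23 GB per day, which is the storage side of the trade-off; the cost of the coarser interval is that a short spike inside one minute is averaged away.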

The network environment is complex, and troubleshooting is difficult

In traffic gateway scenarios, the public network environment is complex and gateway traffic is huge, so troubleshooting intermittent problems is very difficult.

Observability Practice of Cloud Native Gateway

At present, the industry commonly builds observability on three pillars: logging, distributed tracing, and metrics.

• Metrics are quantitative measurements of each dimension recorded over a period of time, used to observe the state and trends of the system

• Logs are records of discrete events generated while a program runs

• Traces are records of the call chain across the whole life cycle of a request, from receipt to completion of processing

Based on these three pillars, the cloud native gateway has built its foundational observability capabilities.
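To make the difference between the three pillars more tangible, the following minimal sketch models each one as a small data structure. All class and field names are hypothetical and chosen for illustration; they are not the gateway's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MetricSample:
    """Metric: one numeric measurement of a named dimension at a point in time."""
    name: str
    labels: Dict[str, str]
    timestamp: float
    value: float


@dataclass
class LogRecord:
    """Log: one discrete event emitted while the program runs."""
    timestamp: float
    level: str
    message: str
    trace_id: Optional[str] = None  # optional link back to a trace


@dataclass
class Span:
    """Trace: one hop in a request's call chain; spans sharing a trace_id form a trace."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


if __name__ == "__main__":
    # A single request can contribute to all three pillars.
    span = Span("t1", "s1", None, "GET /orders", start=10.0, end=10.25)
    log = LogRecord(10.25, "INFO", "request finished", trace_id=span.trace_id)
    metric = MetricSample("request_duration_seconds", {"route": "/orders"}, 10.25, span.duration)
    print(span, log, metric, sep="\n")
```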

Determine the gateway's core metrics and build the observability foundation

Core metrics are metrics that accurately describe the internal operation of a system. In the cloud native gateway scenario, they are the metrics that accurately describe how the gateway is running at a given moment, such as QPS (queries per second), RT (response time), and success rate. The cloud native gateway integrates with both Prometheus and SLS (Alibaba Cloud's log service). Users can obtain more detailed and accurate data through ETL processing of the gateway's access logs, and can also monitor the gateway in real time through Prometheus.

Dashboard built from ETL-processed access logs
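As an illustration of how core metrics can be derived from access logs, the sketch below aggregates JSON log lines into per-window QPS, average RT, and success rate. The log field names (`ts`, `status`, `duration_ms`) and the rule that any status below 500 counts as a success are assumptions made for this example; the gateway's real access log schema and the SLS processing pipeline may differ.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def aggregate(lines: Iterable[str], window_seconds: int = 60) -> Dict[int, dict]:
    """Group access log lines into fixed time windows and compute core metrics."""
    buckets = defaultdict(lambda: {"count": 0, "ok": 0, "total_ms": 0.0})
    for line in lines:
        rec = json.loads(line)
        # Align the request timestamp to the start of its window.
        window_start = int(rec["ts"]) // window_seconds * window_seconds
        bucket = buckets[window_start]
        bucket["count"] += 1
        bucket["total_ms"] += rec["duration_ms"]
        if rec["status"] < 500:  # assumption: 5xx responses count as failures
            bucket["ok"] += 1
    return {
        start: {
            "qps": b["count"] / window_seconds,
            "avg_rt_ms": b["total_ms"] / b["count"],
            "success_rate": b["ok"] / b["count"],
        }
        for start, b in sorted(buckets.items())
    }


if __name__ == "__main__":
    sample = [
        '{"ts": 120, "status": 200, "duration_ms": 10}',
        '{"ts": 130, "status": 502, "duration_ms": 30}',
        '{"ts": 185, "status": 200, "duration_ms": 20}',
    ]
    for start, stats in aggregate(sample).items():
        print(start, stats)
```

Because the aggregation runs over logs that are already being written, it adds no work to the request path, at the price of the delay between a request and the moment its log line is processed.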

To address the high cost of statistics collection, the cloud native gateway derives metrics that are expensive to collect from access logs through ETL, which lowers the collection cost, and uses in-process instrumentation for metrics that must be available in real time.

Grafana dashboard of the cloud native gateway

Because different roles have different requirements for gateway observability, the cloud native gateway presents data along different dimensions. Enterprise users who need more detailed analysis can further process the data through SLS.
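For contrast with log-based ETL, in-process instrumentation keeps a few cheap counters on the request path so that a monitoring system can read them at any time. The following is a minimal sketch of that idea in plain Python; the `RequestStats` class and its field names are hypothetical and do not represent the gateway's implementation or any Prometheus client API.

```python
import threading


class RequestStats:
    """Thread-safe in-process counters updated once per request."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0
        self._errors = 0
        self._total_ms = 0.0

    def record(self, status: int, duration_ms: float) -> None:
        """Called on the request path, so it only does constant-time work."""
        with self._lock:
            self._count += 1
            self._total_ms += duration_ms
            if status >= 500:  # assumption: 5xx responses count as errors
                self._errors += 1

    def snapshot(self) -> dict:
        """Called by a scraper or exporter to read the current values."""
        with self._lock:
            count = self._count
            return {
                "requests_total": count,
                "errors_total": self._errors,
                "avg_rt_ms": self._total_ms / count if count else 0.0,
            }


if __name__ == "__main__":
    stats = RequestStats()
    stats.record(200, 10.0)
    stats.record(503, 30.0)
    print(stats.snapshot())
```

The counters are always current, which suits real-time metrics, but every additional dimension adds work and memory inside the process, which is why expensive breakdowns are better left to log processing.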

Define system boundaries and locate problems quickly

The gateway usually handles a large number of requests, and in microservice scenarios the call chain is complex. Under these conditions it is hard to determine why a particular request failed. To address this, the cloud native gateway connects to the out-of-the-box ARMS distributed tracing service, and also supports delivering trace data to a user-built SkyWalking deployment to avoid lock-in to cloud products.

For users who have not enabled distributed tracing, the cloud native gateway provides a detailed breakdown of log details and visualizes the reasons for request failures as charts, which helps users identify the problem boundary and shortens troubleshooting time.

Error reason details for failed requests
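The idea of confirming the problem boundary from logs can be sketched as a simple classification over access log records. The field names (`status`, `upstream_status`) and the classification rules below are assumptions for illustration only; the gateway's actual failure analysis is not described in this article and is likely more detailed.

```python
from collections import Counter
from typing import Iterable, Optional


def classify(status: int, upstream_status: Optional[int]) -> str:
    """Attribute a request outcome to one side of the system boundary."""
    if status < 400:
        return "ok"
    if status < 500:
        return "client"  # e.g. bad request, unauthorized, not found
    if upstream_status is not None and upstream_status >= 500:
        return "upstream"  # the backend service itself returned an error
    # Assumption: a 5xx with no upstream error was produced at the gateway
    # (for example, no healthy backend or a timeout before any response).
    return "gateway"


def failure_breakdown(records: Iterable[dict]) -> Counter:
    """Count failed requests per boundary, ready to be drawn as a chart."""
    counts: Counter = Counter()
    for rec in records:
        side = classify(rec["status"], rec.get("upstream_status"))
        if side != "ok":
            counts[side] += 1
    return counts


if __name__ == "__main__":
    records = [
        {"status": 200, "upstream_status": 200},
        {"status": 404},
        {"status": 502, "upstream_status": 502},
        {"status": 503},
    ]
    print(failure_breakdown(records))
```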

Risk management and regular risk inspection

The cloud native gateway combines data on the user's instance, specifications, performance, and more to identify existing risks for the current instance and suggest improvements. This greatly increases the automation of maintaining cloud native gateway instances and reduces the customer's cost of use.

Automatic risk screening for risk management
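A rule-based inspection of this kind can be sketched as a set of checks over instance data that each produce a finding and a suggestion. The `InstanceSnapshot` fields, the rules, and the 80% thresholds below are all hypothetical; the article does not state which rules the product actually applies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceSnapshot:
    """Hypothetical summary of one gateway instance's spec and recent load."""
    name: str
    replicas: int
    peak_cpu_percent: float
    peak_memory_percent: float


def inspect(
    instance: InstanceSnapshot,
    cpu_limit: float = 80.0,     # assumed threshold
    memory_limit: float = 80.0,  # assumed threshold
) -> List[str]:
    """Return a list of risk findings, each with a suggested improvement."""
    findings: List[str] = []
    if instance.replicas < 2:
        findings.append("single replica: add at least one more node for availability")
    if instance.peak_cpu_percent > cpu_limit:
        findings.append(
            f"peak CPU {instance.peak_cpu_percent:.0f}% exceeds {cpu_limit:.0f}%: consider scaling up"
        )
    if instance.peak_memory_percent > memory_limit:
        findings.append(
            f"peak memory {instance.peak_memory_percent:.0f}% exceeds {memory_limit:.0f}%: consider scaling up"
        )
    return findings


if __name__ == "__main__":
    risky = InstanceSnapshot("gw-demo", replicas=1, peak_cpu_percent=90.0, peak_memory_percent=50.0)
    for finding in inspect(risky):
        print(f"[{risky.name}] {finding}")
```

Running such checks on a schedule is what turns a one-off review into regular inspection.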

Future planning for cloud native gateway observability

At present, the cloud native gateway has built a basic observability system, and its data visualization, monitoring, and related capabilities are relatively complete.

Users can quickly find and locate problems with the current observability system.

In line with the direction of the industry, the cloud native gateway has the following plans for observability:

• For the three data pillars of observability, a unified collection framework for Metrics, Logs, and Traces is the clear trend, because it addresses the complexity of cross-platform solutions and the data interoperability problems that arise in deployment. Supporting a unified observability framework such as OpenTelemetry is the next priority.

• For root cause analysis, we are following the most advanced algorithms in the industry and continuing to explore intelligent root cause analysis in practice.
