Observability Best Practices Based on Cloud Native Gateways

Why should we build observability?

Observability is not a new word. It comes from control theory and refers to the extent to which a system's internal state can be inferred from its external outputs. Over decades of development, as monitoring, alerting, troubleshooting, and related fields have matured, the IT industry has abstracted these practices into an observability engineering system. The term has become more popular in recent years largely because of the spread of cloud native architecture, microservices, DevOps, and related technologies, which pose greater challenges to observability.

The microservice and DevOps models advocated by cloud native architecture have improved efficiency and availability, but they have also greatly increased system complexity. Enhancing observability has therefore become an essential means of keeping that complexity manageable.

Note that observability is not the same as monitoring. Traditional monitoring can only discover problems passively. Observability is built on exploring attributes and patterns that were not defined in advance. It is about more than discovering problems: it also provides the ability to understand, inspect, and control the system while it is running.

The cornerstones of observability, namely metrics, logs, events, and trace data, help us better understand the running system and provide an important basis for decisions on prevention before an incident, handling during it, and recovery afterwards. An observability system can also accelerate the continuous delivery of applications. This is why we need to build observability.

How to build observability for gateway scenarios

Determine the objectives of the observability system

Without monitoring, or with chaotic monitoring, developers gauge the running state of the system from a few fragmented metrics. Like the blind men and the elephant, they cannot see the whole picture. When problems occur, it is mostly veteran developers who piece together the global business state from multiple metrics, relying on their own experience. That experience is rarely reusable.

Therefore, we need to establish observability through technical means, so that we can clearly "see" the comprehensive and detailed state of the running system, lower the experience threshold, reduce uncertainty, and make timely and effective decisions. An observability solution should achieve the following objectives:

• Support the decision on whether to degrade or interrupt a service.

• Quickly discover when a service is unavailable, degraded, or failing.

• Help with debugging when a service is unavailable or failing.

• Identify long-term trends in capacity planning and business objectives.

• Expose unexpected side effects of changed or added functions.

Build general-purpose observability metrics for the gateway

The goal of the observability system is clear, but using a few tools and adding some monitoring is not enough to reach it. Depending on the business scenario, engineers with different experience may choose different types of monitoring systems. What matters is that the monitoring behind the observability system correctly reflects the real state of the system.

When building a gateway observability system, monitoring falls into two main categories, black-box monitoring and white-box monitoring:

• Black-box monitoring: a sampling-based method. The black-box system monitors, from the outside, the same system that serves user requests. The most common approach is dial testing (synthetic probing), which simulates normal user requests to access the service.

• White-box monitoring: monitoring and observability depend on the signals that the monitored workload sends to the monitoring system. These signals usually take three common forms: metrics, logs, and traces.

In white-box monitoring, the choice of metrics is fairly subjective. Whether the chosen metrics accurately reflect the real state of the system has a major effect on whether the observability system meets its objectives.

Here it helps to return to the most basic function of the gateway. Because the gateway is a proxy, common logic such as authentication naturally accumulates in it, but its essence is still forwarding traffic.

Here, we call the request initiator the downstream of the gateway, and the destination service that requests are forwarded to the upstream. The downstream requester has the most direct view of how the overall system is behaving. Therefore, among the gateway's service metrics, we take the downstream success rate, request volume, and response time (RT) as the core metrics for measuring the entire gateway.
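As a minimal sketch of how the three core metrics could be derived from one window of gateway access records, consider the following. The `AccessRecord` type, its field names, and the rule that any status below 500 counts as a success are assumptions made for illustration, not the gateway's actual log schema or definition of success.

```python
import math
from dataclasses import dataclass


@dataclass
class AccessRecord:
    """Hypothetical access record; field names are illustrative only."""
    status: int         # HTTP status code returned to the downstream caller
    duration_ms: float  # time the gateway took to answer the request


def core_metrics(records):
    """Return request volume, success rate, and RT figures for one window."""
    total = len(records)
    if total == 0:
        return {"requests": 0, "success_rate": None,
                "rt_avg_ms": None, "rt_p99_ms": None}
    # Assumption: any status below 500 counts as a success.
    ok = sum(1 for r in records if r.status < 500)
    durations = sorted(r.duration_ms for r in records)
    # Nearest-rank 99th percentile of the response time.
    rank = max(1, math.ceil(0.99 * total))
    return {
        "requests": total,
        "success_rate": ok / total,
        "rt_avg_ms": sum(durations) / total,
        "rt_p99_ms": durations[rank - 1],
    }
```

Whether 4xx responses should count as failures depends on the business, so the success rule is the first thing to adapt.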

Of course, these three metrics are core metrics in most systems. After determining the core metrics, we need to determine the path metrics of the system, that is, the metrics we consult to quickly locate the cause when a core metric changes. For example, when CPU usage keeps rising, gateway response time rises along with it. Therefore, we treat the system metrics (CPU, memory, network traffic, number of connections) and the upstream and downstream dependencies among the service metrics (such as endpoint changes of the back-end service) as the secondary metrics of the gateway.
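One way to picture the role of the secondary metrics: when a core metric moves, compare each secondary metric with its baseline and look first at the ones that deviate most. The sketch below assumes hypothetical metric names and a 20% tolerance chosen only for illustration. It is not how the gateway itself performs this analysis.

```python
def suspect_path_metrics(current, baseline, tolerance=0.2):
    """List secondary metrics whose relative change exceeds the tolerance.

    Both arguments map a metric name to a numeric value. The result is a
    list of (name, relative_change) pairs, largest deviation first.
    """
    suspects = []
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # nothing to compare against
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            suspects.append((name, round(change, 3)))
    suspects.sort(key=lambda item: abs(item[1]), reverse=True)
    return suspects
```

For example, if CPU usage has doubled and the number of upstream endpoints has dropped by 40%, both are returned, with CPU first.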

With the metrics above, we have largely determined the observability metrics of the gateway. However, the gateway is only one part of the business system. Complete business observability needs to be built around the business scenario, for example by using the gateway log to record business-side status codes and deriving business metrics from them.

Best practices for building observability based on the cloud native gateway

The cloud native gateway is a managed gateway product under Alibaba Cloud's Microservices Engine (MSE). It integrates the traditional traffic gateway with the microservice gateway. As a cloud product, it works out of the box with the cloud observability products ARMS and SLS, aiming for a zero barrier to entry for customers. Because it is built on open source, the gateway is also compatible with open source observability products such as Zipkin, SkyWalking, and Prometheus. Following the approach described above, this section builds an observability system on these cloud products and uses concrete scenarios to show how gateway observability is applied.

Observability in a grayscale release scenario

We take the release of a new service version as a concrete example of cloud native gateway observability.

In the release process shown in the figure above, front-end traffic switches between service v1 and v2. In this scenario, the core requirement is that exceptions caused by the release can be observed promptly while the release is in progress.

Before deploying the new version, we first enable the observability capabilities that the gateway provides by default: turn on monitoring of the gateway's basic metrics (CPU, memory, overall success rate), then turn on service-level monitoring in the alert configuration for this scenario, so that we find out promptly when the success rate of the httpbin service drops.
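The service-level alert described here is configured in the gateway console. To make the underlying idea concrete, the following sketch implements a sliding-window success-rate rule. The class name, window size, threshold, and the "status below 500 is a success" rule are all assumptions for illustration and do not reflect the actual alert rule syntax of the product.

```python
from collections import deque


class SuccessRateAlert:
    """Fire when the success rate over the last `window` requests is too low."""

    def __init__(self, window=100, threshold=0.99, min_samples=20):
        self.samples = deque(maxlen=window)  # True for success, False for failure
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, status):
        """Record one response status and return whether the alert is firing."""
        # Assumption: any status below 500 counts as a success.
        self.samples.append(status < 500)
        return self.firing()

    def firing(self):
        # Stay quiet until there are enough samples to judge.
        if len(self.samples) < self.min_samples:
            return False
        rate = sum(self.samples) / len(self.samples)
        return rate < self.threshold
```

The `min_samples` guard keeps a single early failure from triggering an alert before the window has meaningful data.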

Before we start, httpbin v1 is already deployed.

We then deploy the httpbin v2 application in the Alibaba Cloud ACK cluster where httpbin v1 is already running.

In the service details of the gateway, add a v2 sub-version for the go-httpbin service.

By modifying the routing configuration, shift 10% of the traffic to v2 of the go-httpbin service. Then call the gateway with HTTP requests and view how the gateway traffic is distributed between the versions.
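The effect of a 90/10 split can be illustrated with a small simulation of weighted version selection. This is only a sketch of the general technique. The function names are hypothetical, and the gateway's real routing implementation is not described in this article.

```python
import random
from collections import Counter


def pick_version(weights, rng):
    """Pick one version name with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate_split(weights, requests, seed=0):
    """Count how many of `requests` simulated calls land on each version."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return Counter(pick_version(weights, rng) for _ in range(requests))
```

Calling `simulate_split({"v1": 90, "v2": 10}, 10000)` yields a v2 share close to 10%, which is the kind of distribution the gateway's per-version traffic view should show after the routing change.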

The built-in monitoring in the cloud native gateway's routing interface helps us observe the running state of the new and old versions, and alerts detect exceptions promptly when the new release has a problem.
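Comparing the two versions during the release comes down to computing the success rate per version and checking whether the new version is noticeably worse. A minimal sketch follows, assuming records arrive as hypothetical `(version, status)` pairs and that a drop of more than one percentage point is treated as unhealthy. Both are illustrative choices.

```python
from collections import defaultdict


def success_rate_by_version(records):
    """Map each version to its success rate, given (version, status) pairs."""
    totals = defaultdict(int)
    ok = defaultdict(int)
    for version, status in records:
        totals[version] += 1
        if status < 500:  # assumption: below 500 counts as a success
            ok[version] += 1
    return {version: ok[version] / totals[version] for version in totals}


def canary_is_healthy(rates, baseline="v1", canary="v2", max_drop=0.01):
    """Return True/False, or None when either version has no data yet."""
    if baseline not in rates or canary not in rates:
        return None
    return rates[baseline] - rates[canary] <= max_drop
```

A `False` result is the signal to stop shifting traffic to the new version and investigate.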

Use ARMS dial testing to verify the business in a real environment

In a real environment, the gateway's own metrics cannot fully confirm the state of the entire service. Problems such as DNS hijacking and network failures can make the service completely unavailable to some users. In this case, we need black-box metrics that fully simulate real user access to verify the stability of the system. Here we use the cloud dial test product provided by ARMS to observe the real state of the service.

Create a scheduled dial test task in Alibaba Cloud and set different observation points to observe how users in different regions actually experience the service.
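ARMS runs these probes from managed observation points. To show what a single dial test does, here is a minimal probe sketch that uses only the Python standard library. It is not the ARMS API. The function names and result fields are hypothetical, and a probe run this way reflects only the location it runs from.

```python
import time
import urllib.error
import urllib.request


def http_fetch(url, timeout):
    """Default fetcher: return the HTTP status code for a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def probe(url, fetch=http_fetch, timeout=5.0):
    """Run one probe and report reachability, status, and elapsed time."""
    started = time.monotonic()
    try:
        status = fetch(url, timeout)
        ok = 200 <= status < 400
        error = None
    except Exception as exc:  # DNS failures, timeouts, refused connections
        status, ok, error = None, False, repr(exc)
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"url": url, "ok": ok, "status": status,
            "elapsed_ms": elapsed_ms, "error": error}
```

The `fetch` parameter exists so the probe logic can be exercised without network access. A scheduler would call `probe` at a fixed interval and forward the results to the monitoring system.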

After a period of time, we can get the details of the dial test results.

Combine gateway observability with the business to build service observability

Because services differ from one another, the gateway by itself can only provide observability of the gateway. However, through the extension capabilities the gateway provides, we can combine gateway observability with the business to build an observability system for the business itself.

For example, for a bookstore service, how do we monitor access by different users at the gateway layer? Here we can use the structured logs and log consumption provided by SLS. The gateway can extract the necessary information from custom request headers and deliver it to the log service.

In practice, the most common approach is to write the user ID from a request header into the log, which associates the user ID with each gateway request. Once custom logging is enabled, we can use the data processing feature of the log service to extract the fields we need from the delivered raw logs.

In this way, we can not only associate meaningful business information with gateway requests, but also discard unnecessary access information to reduce costs.
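The transformation described above can be sketched as a small function that keeps a few fields from a raw access log entry and attaches the user ID taken from a request header. The header name `x-user-id`, the field names, and the shape of the raw entry are assumptions for illustration. They are not the SLS log schema or its data processing syntax.

```python
import json

# Hypothetical set of fields worth keeping; everything else is discarded.
KEPT_FIELDS = ("path", "status", "duration_ms")


def to_business_log(raw_entry, header_name="x-user-id"):
    """Reduce a raw access log entry to a compact, user-tagged JSON line."""
    # Header names are case-insensitive, so normalize before the lookup.
    headers = {key.lower(): value
               for key, value in raw_entry.get("request_headers", {}).items()}
    record = {field: raw_entry.get(field) for field in KEPT_FIELDS}
    record["user_id"] = headers.get(header_name)
    return json.dumps(record, sort_keys=True)
```

Dropping fields such as the user agent before storage is what reduces cost, while the `user_id` field is what makes per-user queries possible.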

Summary and outlook

This article introduces best practices for building observability on the cloud native gateway. Through three practices, it covers white-box observation, black-box observation, and building business observability on top of the gateway. Although users can already find and locate problems quickly with the current observability system, there is still a lot for us to do.

In line with the direction of the industry, the cloud native gateway has the following plans in the observability field:

• For the three data pillars of observability, a unified collection framework for metrics, logs, and traces is the clear trend, because it addresses the complexity of cross-platform deployment and data interoperability. Supporting a unified observability framework such as OpenTelemetry is the next priority.

• For root cause analysis, we are following the most advanced algorithms in the industry and continuing to explore intelligent root cause analysis in practice.
