Building Observability Capabilities on the Elastic Computing Cloud

On July 4, 2022, the first session of the CloudOps salon series on automated cloud operations and maintenance, "Observable, Reliable", was officially launched. High reliability and stability are the two system properties people care about most after moving to the cloud. How to continuously improve them through the automated CloudOps product system on the cloud is an important direction for R&D and O&M to work on together, and continuously improving observability is one of the most direct and effective means to that end.

The Alibaba Cloud CloudOps salon series takes "observability and reliability" as its first theme. The live broadcast of this salon runs over four days. The first speaker is Yang Zeqiang, a technical expert on Alibaba Cloud's Elastic Computing SRE team, and his topic is "Building the Observability Capability on Elastic Computing Cloud". The following is a summary of his talk.

01 Why Observe?

Observability originated with meteorological observation in the agricultural era, and observability products followed in the electrical era and the automation era. In control theory, observability refers to the extent to which the internal state of a system can be inferred from its external outputs. Observability and controllability of a system are mathematically dual concepts.
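For readers who want the control-theory definition spelled out, the standard textbook statement for a linear time-invariant system is sketched below. The symbols A, B, C, x, u, y and the state dimension n are conventional notation and are assumptions here, not something taken from the talk.

```latex
% Linear time-invariant system with state x(t) in R^n
\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t)

% The pair (A, C) is observable iff the observability matrix has rank n
\operatorname{rank}
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n

% Duality: (A, C) is observable iff (A^T, C^T) is controllable
\operatorname{rank}
\begin{bmatrix} C^{\top} & A^{\top} C^{\top} & \cdots & (A^{\top})^{n-1} C^{\top} \end{bmatrix} = n
```

The second rank condition is the transpose of the first, which is exactly the duality the paragraph above refers to.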

Take a car as an example. While driving, we cannot directly perceive the internal state of the vehicle, but the instrument panel tells us the current engine speed, vehicle speed, fuel level, and other operating states.

In software engineering, observability means understanding the internal state of a system by collecting three kinds of data: logs, metrics, and traces.

Observability is valuable across the whole software life cycle. It lets you view the current system load, abnormal links, abnormal conditions, and alarms. Early warnings can be raised from observability data and then analyzed, which minimizes the time needed to perceive and to locate a fault and ultimately shortens MTTR.
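To make the relationship between perception time, location time, and MTTR concrete, here is a minimal sketch. The `Incident` record, its field names, and the sample timestamps are hypothetical, and MTTR is assumed to be measured from fault start to recovery (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical incident record; the field names are illustrative only.
    started: datetime    # fault begins
    detected: datetime   # alert fires (end of perception time)
    located: datetime    # cause identified (end of location time)
    recovered: datetime  # service restored


def phase_minutes(incident):
    """Split one incident into perception, location and repair time, in minutes."""
    return {
        "perception": (incident.detected - incident.started).total_seconds() / 60,
        "location": (incident.located - incident.detected).total_seconds() / 60,
        "repair": (incident.recovered - incident.located).total_seconds() / 60,
    }


def mttr_minutes(incidents):
    """Mean time from fault start to recovery across incidents, in minutes."""
    return mean((i.recovered - i.started).total_seconds() / 60 for i in incidents)


t0 = datetime(2022, 7, 4, 10, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=6), t0 + timedelta(minutes=10)),
]
print(phase_minutes(incidents[0]))
print(mttr_minutes(incidents))
```

Shortening either the perception phase (faster alerting) or the location phase (better traces and logs) lowers the overall figure, which is the effect the paragraph above describes.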

Observability is the basis of software system stability assurance. From the perspective of software engineering, observability can provide far more than stability assurance.

In the earliest requirements analysis stage, capacity budgets can be evaluated through observability. In the CI stage, it supports R&D quality control, such as build success rate and test coverage. During delivery, it helps guarantee delivery quality. Cost and security can also be effectively controlled through observability.

Across the whole software life cycle, there is no standard answer to how to build your own ideal observability model. However, for typical software architectures, from monolithic applications through distributed applications to microservices, a standard model can be abstracted. As shown in the figure above, it is divided into five layers from bottom to top:

⚫ Resource layer: including host, storage, network, runtime, etc.

⚫ Platform layer: including RPC, DB, message, cache, scheduling, etc.

⚫ Application layer: including availability, latency, error count, traffic, saturation, logs, etc.

⚫ Product layer: including order quantity, order success rate, production success rate, production time, etc.

⚫ Customer layer: including business continuity, SLA, dial testing (synthetic probing), etc. The customer layer is the most neglected but also a highly valuable layer. We need to pay more attention to the continuity of the customer's business and to observability from the user's perspective.

At present, the ecosystem of observability technologies and products is very rich. On the logging side there are Logstash, iLogtail, SLS, etc. On the metrics side there are Prometheus, Grafana, Kibana, etc. On the tracing side there are Elastic, OpenTelemetry, SkyWalking, etc.

There is no standard answer to which technologies to select when building observability. You need to choose according to your actual needs.

02 Observability on Cloud

In 2011, in the initial stage of elastic computing, there was no observability system to speak of: the architecture was mainly monolithic, with only a small amount of monitoring and alerting. In 2016, early warning was connected to Alibaba's self-developed monitoring platform, which relied mainly on a traditional monitoring system and a metrics collection and display system based on a time series database. In 2019, Alibaba began to gradually move ECS core applications to the cloud and to build a system based on cloud products, including Cloud Monitor and SLS. In 2021, Alibaba started the cloud native transformation, of which about 90% has been completed. On top of cloud native, we have also switched the technology stack to open source standards.

The elasticity, reliability, and natural multi-region isolation of the cloud are great advantages for a monitoring platform. In addition, cloud native technology tracks the latest open source standards in the industry, so its solutions are general and advanced enough, at a pace that a self-developed platform cannot match.

Infrastructure monitoring is mainly handled by Cloud Monitor (CMS) and ARMS. ARMS is an APM tool; here it is responsible for collecting node-level metrics from the machines.

The platform layer includes a large amount of middleware, databases, and so on. Previously, this often meant integrating with many different systems. On the cloud, the eBPF-based collection provided by ARMS can gather all of these metrics non-intrusively, such as the three golden indicators and database (for example MySQL) metrics, and produce standard metrics data.

Application-layer indicators, such as the three golden indicators, availability, latency, error rate, and call count, can be built by using ARMS and SLS to complement each other.
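As an illustration of how such application-layer indicators are derived from raw request data, here is a minimal sketch. It does not use ARMS or SLS; the `Request` record, the 90th percentile choice, and the sample window are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request record, e.g. parsed from an access log or a trace span.
    latency_ms: float
    ok: bool


def nearest_rank(sorted_values, percent):
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    rank = max(1, (percent * len(sorted_values) + 99) // 100)  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def golden_indicators(requests):
    """Summarize call count, error rate, availability and latency for one window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window")
    failures = sum(1 for r in requests if not r.ok)
    latencies = sorted(r.latency_ms for r in requests)
    return {
        "call_count": total,
        "error_rate": failures / total,
        "availability": (total - failures) / total,
        "avg_latency_ms": sum(latencies) / total,
        "p90_latency_ms": nearest_rank(latencies, 90),
    }


# Ten sample requests with latencies 10..100 ms; requests 3 and 7 fail.
window = [Request(latency_ms=10.0 * i, ok=(i not in (3, 7))) for i in range(1, 11)]
print(golden_indicators(window))
```

In practice these aggregates would be computed by the monitoring backend over sliding time windows; the sketch only shows the arithmetic behind each indicator.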

At the business level, data such as the production success rate and production time of ECS instances are built on the trace and metrics capabilities of ARMS and SLS.

The customer layer is also built mainly on ARMS and SLS. Here we need to change our thinking: observability can be provided not only to PMs, but also to operations staff, finance staff, and managers. It carries a different meaning and value for each role, and that breadth is itself part of the value of observability.

The above figure shows the overall upgrade scheme of ECS.

Monitor and Sunfire on the old platform are migrated to cloud native: basic monitoring moves to Cloud Monitor (CMS), and business monitoring and application monitoring move to ARMS. Tracing is likewise migrated from the original log orchestration service to ARMS tracing and to SLS-based log orchestration.

Beyond adopting cloud native open source standards, we have also developed an automated operation and maintenance system on top of these basic capabilities, covering alerting, fault diagnosis, and rapid recovery. The underlying capabilities are built with the public OpenAPIs of SLS and ARMS.

The most natural perspective for observability is the application. In addition, we build observability along business dimensions where the business has special characteristics. For example, because ECS is organized into clusters, we also build observability at the cluster dimension.

With observability built at every layer, we also built a unified early warning platform and automated operation and maintenance capabilities.

The original Monitor and Sunfire on the left were built on logs, while the latest observability system on the right is built on the open source standards Prometheus and Grafana. This realizes the transition from customized, diverse tooling to cloud native, and from complexity to standardized simplicity.

Take monitoring and early warning as an example. The left side of the figure above shows the monitoring and early warning system before the move to the cloud, which consisted of Alibaba Group monitoring, the Sunfire monitoring platform, and SLS alarms.

The monitoring and early warning system after the move to the cloud is shown on the right. Standardized calculations, such as the impact scope of a fault, can be derived from metrics data. Data needed during operation and maintenance can also be obtained through dynamic calculation, and you can view the original stack, key on-site indicators, recent changes, and so on. In addition, we have used the native APIs to build precise alert routing to owners, early warning, and standardized operations.

03 Beyond Observability

From the perspective of the software life cycle, the CI process includes daily builds and tests as well as real-time management. Automated release uses metrics in the Prometheus data format as automated checkpoints (gates). Automated operation and maintenance and ChatOps are provided at run time. We are currently building security measurement and cost control on top of observability.

Take efficiency and quality as an example. The two core links in the DevOps loop are continuous integration and continuous delivery, and they are also the two factors that most directly affect software engineering quality.

For continuous integration, there is a CI Dashboard. Every code commit triggers CI, which calculates line coverage, branch coverage, cyclomatic complexity, and success or failure status, as shown on the right side of the figure above.

For continuous delivery, the difficulty of automated release is how to establish trust in each release, so we implemented canary releases. We also believe that observability is needed during the release process itself, so core indicators of the release stage are shown on a metrics dashboard. Combined with application-level metrics, this provides the atomic capabilities and checkpoints (gates) that the release system needs to automate releases.
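A metrics-based release checkpoint for a canary can be sketched as follows. This is a simplified illustration, not the release system described in the talk: the `Snapshot` fields, the `canary_gate` function, and the threshold defaults are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Hypothetical metrics summary for one group of instances during a release.
    error_rate: float
    p99_latency_ms: float


def canary_gate(baseline, canary, max_error_increase=0.01, max_latency_ratio=1.2):
    """Return (passed, reasons): block the rollout if the canary regresses.

    The thresholds are illustrative defaults, not recommended values.
    """
    reasons = []
    if canary.error_rate > baseline.error_rate + max_error_increase:
        reasons.append("error rate regressed")
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        reasons.append("p99 latency regressed")
    return (not reasons, reasons)


baseline = Snapshot(error_rate=0.001, p99_latency_ms=200.0)
healthy_canary = Snapshot(error_rate=0.002, p99_latency_ms=210.0)
broken_canary = Snapshot(error_rate=0.05, p99_latency_ms=400.0)

print(canary_gate(baseline, healthy_canary))  # gate passes, rollout may continue
print(canary_gate(baseline, broken_canary))   # gate fails, rollout should stop
```

Comparing the canary against a live baseline rather than against fixed limits keeps the gate meaningful when overall load or latency shifts during the day.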

The essence of these functions is to shift observability left, from the run-time phase of software engineering to the code release and delivery stages, in order to ensure the quality of the delivered code.

Most application scenarios of observability are operation and maintenance scenarios, such as viewing capacity, utilization (water level), early warnings, and other indicators. The figure above shows how observability is applied in operation and maintenance after the move to the cloud.

Observability also applies to chaos engineering and is one of its core dependencies. The core of chaos engineering is the fault drill: faults are injected into a service to check whether exceptions occur. Fault injection presupposes that the system is in a steady state, and confirming this depends on checking metrics through the observability system. Then, when exploring factors that may cause instability, we use observability to compare behavior before and after injection and check which systems and links have problems, so that hidden weaknesses are uncovered early.
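The two roles of observability in a fault drill, confirming steady state before injection and comparing metrics afterwards, can be sketched as below. The metric names, thresholds, and the in-memory `system` dictionary standing in for a real service are all assumptions for illustration.

```python
def violations(metrics, thresholds):
    """Names of metrics that exceed their upper threshold, sorted for stable output."""
    return sorted(name for name, limit in thresholds.items() if metrics[name] > limit)


def run_experiment(read_metrics, inject_fault, thresholds):
    """Inject a fault only if the system is steady, then report what degraded."""
    before = read_metrics()
    if violations(before, thresholds):
        # The system is already unhealthy, so the experiment would prove nothing.
        return {"status": "skipped", "violations": violations(before, thresholds)}
    inject_fault()
    after = read_metrics()
    broken = violations(after, thresholds)
    return {"status": "weakness_found" if broken else "passed", "violations": broken}


# Simulated service state standing in for a real metrics backend.
system = {"error_rate": 0.001, "p99_latency_ms": 180.0}
thresholds = {"error_rate": 0.01, "p99_latency_ms": 500.0}


def read_metrics():
    return dict(system)


def inject_fault():
    # Pretend a dependency became slow.
    system["p99_latency_ms"] = 950.0


result = run_experiment(read_metrics, inject_fault, thresholds)
print(result)
```

A real drill would also remove the fault and verify recovery afterwards; that step is omitted here to keep the sketch short.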

Cost management and security observability are two scenarios we are exploring.

How to reduce costs is one of the main concerns of managers. First, we need to make the current cost clear. After moving to the cloud there may be hybrid cloud or multi-cloud scenarios, so financial data has to be gathered from multiple places. Observability provides the system utilization (water level), resource consumption, and similar data needed for further cost optimization. For example, using SLS involves a large amount of log storage, and observability lets you see SLS usage and which indexes consume the most resources.
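The log storage example can be sketched as a small aggregation that ranks the largest consumers. The record layout, the Logstore names, and the figures are invented for illustration; real usage data would come from the billing or metering data of the platform in use.

```python
from collections import defaultdict


def top_consumers(usage_records, n=3):
    """Rank Logstores by total storage and report each one's share of the total."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["logstore"]] += record["storage_gb"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, gb, gb / grand_total) for name, gb in ranked[:n]]


# Hypothetical daily usage records.
usage_records = [
    {"logstore": "api-access", "storage_gb": 120.0},
    {"logstore": "api-access", "storage_gb": 80.0},
    {"logstore": "audit", "storage_gb": 50.0},
    {"logstore": "debug", "storage_gb": 250.0},
]

for name, gb, share in top_consumers(usage_records, n=2):
    print(f"{name}: {gb:.0f} GB ({share:.0%})")
```

A ranking like this points at the first candidates for shorter retention or reduced indexing.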

Security observability requires that security be moved earlier in the process. Through observability, potential security risks such as abnormal traffic and abnormal attacks can be found in advance, and the security system can be improved through gradual iteration.

04 Future

In the future, the development trend of observability is standardization and diversification.

Observability will gradually move from a diversity of products toward standardized, open source models. Open source and commercial products for logging, metrics, and tracing are abundant, but the wider the choice, the harder it is to make decisions or settle on best practices. We therefore believe a standard observability system should be established as soon as possible, for example OpenTelemetry for tracing, Prometheus for metrics, and Grafana for displaying multiple data sources. Alibaba Cloud has also proposed the OPLG model, where O is OpenTelemetry, P is Prometheus, L is Loki, and G is Grafana.

Diversification means that the application fields of observability will keep broadening. From the initial observation of single applications to later APM monitoring, a variety of monitoring scenarios have gradually emerged, moving toward a state in which everything can be monitored and observed.
