Building Observability Capabilities on the Elastic Computing Cloud

On July 4, 2022, the first session of the CloudOps Series Salon on cloud automation operations and maintenance, themed "Observable, Reliable", was officially launched. High reliability and high stability are the two indicators that people care about most after moving to the cloud. Continuously improving both through the automated CloudOps product system on the cloud is an important direction in which R&D and O&M work together, and continuously improving observability is one of the most direct and effective means of achieving that goal.

The Alibaba Cloud Elastic Computing CloudOps Series Salon takes "observability and reliability" as the theme of its first session. The live broadcast runs over four days. The first speaker is Yang Zeqiang, an SRE technical expert at Alibaba Cloud Elastic Computing, whose talk is titled "Building Observability Capacity on the Elastic Computing Cloud". The following is a summary of his talk.

01 Why Observe?

Observability originated in the meteorological observation of the agricultural era, and observability products later appeared in the electrical and automation eras. In control theory, observability refers to the degree to which the internal state of a system can be inferred from its external outputs. Observability and controllability are mathematically dual concepts.

Take a car as an example. While driving, we cannot directly perceive the internal state of the car, but the instrument panel shows the current engine speed, vehicle speed, fuel level and the running state of other systems.

In software engineering, observability means understanding the internal state of a system by collecting logs, metrics and traces.

Observability is valuable across the whole software life cycle. It lets us view the current system load, abnormal call links, abnormal conditions and alarms. Early warnings can be raised based on observability data and then analyzed, so that the time to perceive a fault and the time to locate it are reduced as much as possible, which ultimately shortens the mean time to repair (MTTR).
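As a rough illustration of how perception time and location time feed into MTTR, the sketch below averages each phase over a list of incident records. The `Incident` record, its field names and the sample numbers are hypothetical and not from the talk; MTTR is measured here from the start of the fault to recovery.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    # All values are minutes since the fault started (hypothetical data model).
    detected_at: float   # when an alert fired (perception)
    located_at: float    # when the cause was identified (location)
    recovered_at: float  # when service was restored


def phase_means(incidents):
    """Return the mean duration of each phase and the resulting MTTR."""
    detect = mean(i.detected_at for i in incidents)
    locate = mean(i.located_at - i.detected_at for i in incidents)
    recover = mean(i.recovered_at - i.located_at for i in incidents)
    return {
        "detect": detect,
        "locate": locate,
        "recover": recover,
        "mttr": detect + locate + recover,
    }
```

Breaking MTTR down this way shows which phase dominates, and therefore whether better alerting or better diagnosis would shorten it most.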

Observability is the basis of stability assurance for software systems. From the perspective of software engineering, however, it can provide more than stability assurance.

In the earliest requirements analysis stage, the capacity budget can be evaluated through observability. In the CI phase, it supports R&D quality control, for example build success and test coverage. During delivery, it helps guarantee delivery quality. Cost and security can also be controlled effectively through observability.

From the perspective of the whole software life cycle, there is no standard answer to how to build your own ideal observability model. However, across typical software architectures, from monolithic and distributed applications to microservices, a standard model can be abstracted. As shown in the figure above, it is divided into five layers from bottom to top:

⚫ Resource layer: including host, storage, network, runtime, etc.

⚫ Platform layer: including RPC, DB, message, cache, scheduling, etc.

⚫ Application layer: including availability, latency, error count, traffic, saturation, logs, etc.

⚫ Product layer: including order volume, order success rate, production success rate, production time, etc.

⚫ Customer layer: including business continuity, SLA, dial testing (synthetic probing), etc. The customer layer is the most easily overlooked but also a highly valuable layer. We need to pay more attention to the continuity of customer business and to how observability can be used from the user's perspective.
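For the customer layer, an SLA target translates directly into a downtime budget. The following is a minimal sketch of that arithmetic; the function names and the 30-day window are assumptions for illustration, not values from the talk.

```python
def downtime_budget_minutes(sla_percent, days=30):
    """Minutes of downtime allowed in the period for a given availability SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def measured_availability(downtime_minutes, days=30):
    """Availability percentage achieved, given the observed downtime."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)
```

For example, a 99.95% target over 30 days leaves roughly 21.6 minutes of allowed downtime, which is the number that dial tests and continuity monitoring would be checked against.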

The ecosystem of observability technologies and products is now very rich. On the logging side there are Logstash, iLogtail, SLS, etc. On the metrics side there are Prometheus, Grafana, Kibana, etc. On the tracing side there are Elastic, OpenTelemetry, SkyWalking, etc.

There is no standard answer to which tools to select when building observability; you need to choose according to your actual needs.

02 Observability on Cloud

In 2011, in the initial stage of elastic computing, the observability system was lacking: applications were mainly monolithic, with only a small amount of monitoring and early warning. In 2016, early warning was connected to Alibaba's self-developed monitoring platform, relying mainly on traditional monitoring systems and on metrics collection and display systems based on time series databases. In 2019, Alibaba began to gradually move ECS core applications to the cloud and to build the observability system on cloud products, including Cloud Monitor and SLS. In 2021, Alibaba started the cloud native transformation, of which about 90% is now complete. On the basis of cloud native, we also switched the technology stack to open source standard technologies.

The flexibility, reliability and natural multi-region isolation of the cloud give a monitoring platform great advantages. In addition, cloud native technology keeps pace with the latest open source standards in the industry, and its solutions are sufficiently general and advanced, which a self-developed platform cannot match at its own pace of development.

Infrastructure monitoring is carried out mainly through Cloud Monitor and ARMS. ARMS is an APM tool and is responsible for collecting node-level indicators from the machines.

The platform layer includes a large amount of middleware, databases, etc. Previously, it was often necessary to connect to many different systems. On the cloud, the eBPF-based cloud native collection provided by ARMS can seamlessly gather these indicators, such as the three golden indicators and database (for example MySQL) indicators, and finally generates standard metrics data.

The application layer's golden indicators, namely availability, latency, error rate and call count, can be built by using ARMS and SLS to complement each other.
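To make these application-layer indicators concrete, here is a minimal sketch that derives them from raw request records. The `Request` type and the output field names are hypothetical; in practice these values would come from ARMS or SLS rather than being computed by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds):
    """Compute traffic, error rate, availability and P99 latency for a window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the window")
    errors = sum(1 for r in requests if not r.ok)
    return {
        "traffic_qps": total / window_seconds,
        "error_rate": errors / total,
        "availability": 1 - errors / total,
        "p99_latency_ms": percentile([r.latency_ms for r in requests], 99),
    }
```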

Business (product) layer data, such as the production success rate and production time of ECS instances, is built on the trace and metrics capabilities of ARMS and SLS.

The customer layer is also built mainly on ARMS and SLS. Here we need to change our mindset: observability can serve not only PMs but also operations staff, finance personnel and managers. It has different meanings and value for different roles, and that is where much of its value lies.

The figure above shows the overall upgrade scheme of ECS.

Monitor and Sunfire, the old platforms, are being migrated to cloud native: basic monitoring moves to CMS (Cloud Monitor), and business monitoring and application monitoring move to ARMS. Trace capabilities are also being migrated from the original log editing service to ARMS trace and the SLS-based log database editing capabilities.

In addition to adopting cloud native open source technical standards, we have developed an automated operation and maintenance system on top of these basic capabilities, covering alarms, fault diagnosis and rapid recovery. The underlying capabilities are built on the external OpenAPIs of SLS and ARMS.

A good observability system is organized primarily around the application perspective. In addition, we build observability along business dimensions where the business has particular characteristics. For example, because ECS is organized into clusters, we also build observability along the cluster dimension.

After building these observability capabilities, we also built a unified early warning platform and automated operation and maintenance capabilities.

The original Monitor and Sunfire, on the left, were built on logs. The latest observability system, on the right, is built on the open source standards Prometheus and Grafana. This completes the shift from customized, diverse tooling to cloud native, and from complexity to standardized simplicity.

Take monitoring and early warning as an example. The left side of the figure above shows the monitoring and early warning system before the move to the cloud, which consisted of Alibaba Group monitoring, the Sunfire monitoring platform and SLS alarms.

The monitoring and early warning system after the move to the cloud is shown on the right. Standardized calculations, such as the scope of impact of a fault, can be derived from metrics data. Data can also be calculated dynamically during operation and maintenance, and you can view the original stack, key on-site indicators, recent changes, etc. In addition, based on the native APIs, we have built the ability to push alerts precisely to owners and to standardize early warning operations.
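The idea of computing how far a fault has spread from metrics data, and then pushing the alert to the right owner, can be sketched as follows. The ownership table, application names, region names and threshold are all hypothetical; this is an illustration of the approach, not the team's actual implementation.

```python
from collections import defaultdict

# Hypothetical ownership table: application name -> owner to notify.
OWNERS = {"ecs-api": "alice", "ecs-scheduler": "bob"}


def impact_surface(samples, threshold):
    """Collect the regions in which each application exceeds the error threshold.

    `samples` is an iterable of (app, region, error_rate) tuples.
    """
    affected = defaultdict(set)
    for app, region, error_rate in samples:
        if error_rate > threshold:
            affected[app].add(region)
    return {app: sorted(regions) for app, regions in affected.items()}


def route_alerts(surface, owners=OWNERS, fallback="on-call"):
    """Build one alert per affected application, addressed to its owner."""
    return [
        {"to": owners.get(app, fallback), "app": app, "regions": regions}
        for app, regions in sorted(surface.items())
    ]
```

Applications without a registered owner fall back to a shared on-call target, so that no alert is silently dropped.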

03 Beyond Observability

From the perspective of the software life cycle, the CI process includes daily builds, testing and real-time management; automated release uses data in the Prometheus metrics format as automated checkpoints; automated operation and maintenance and ChatOps are provided at runtime; and we are currently building security measurement and cost control on top of observability.
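A checkpoint driven by data in the Prometheus metrics format might look roughly like the sketch below. It parses only simple `name{labels} value` lines of the text exposition format (no timestamps, and no `}` inside label values), and the metric names and limits in the usage example are invented for illustration. A real pipeline would more likely query Prometheus directly.

```python
import re

# Matches "metric_name{optional="labels"} value" on a single line.
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Parse simple Prometheus text-format samples into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


def checkpoint(samples, rules):
    """Return the series that violate their limit; an empty list passes the gate.

    A series that is missing from `samples` is treated as a violation.
    """
    return [
        series for series, limit in rules.items()
        if samples.get(series, float("inf")) > limit
    ]
```

Usage, with hypothetical series names:

```python
text = 'release_error_ratio{app="ecs-api"} 0.002\nrelease_p99_seconds{app="ecs-api"} 1.8'
rules = {'release_error_ratio{app="ecs-api"}': 0.01, 'release_p99_seconds{app="ecs-api"}': 1.0}
violations = checkpoint(parse_metrics(text), rules)  # only the latency series violates its limit
```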

Take efficiency and quality as examples. The two core links in the DevOps loop are continuous integration and continuous delivery, which are also the two factors that directly affect the quality of software engineering.

For continuous integration, there is a CI dashboard. Each code commit triggers CI, which calculates line coverage, branch coverage, cyclomatic complexity, build success or failure and other statuses, as shown on the right side of the figure above.
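The statuses shown on such a CI dashboard reduce to a few ratios and a pass/fail decision. A minimal sketch follows, assuming a hypothetical report dictionary and illustrative thresholds (80% line coverage, 70% branch coverage) that are not taken from the talk.

```python
def coverage(covered, total):
    """Coverage ratio; an empty denominator counts as fully covered."""
    return 1.0 if total == 0 else covered / total


def ci_status(report, min_line=0.8, min_branch=0.7):
    """Summarize one CI run from raw counts in `report`."""
    line = coverage(report["lines_covered"], report["lines_total"])
    branch = coverage(report["branches_covered"], report["branches_total"])
    passed = report["build_ok"] and line >= min_line and branch >= min_branch
    return {"line": line, "branch": branch, "passed": passed}
```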

For continuous delivery, the difficulty of automated release is deciding how much to trust a release, so we have implemented canary release. We also believe that observability is required during the release process, so core indicators of the release phase are shown on a metrics dashboard. Combined with application-dimension metrics, this provides atomic capabilities and checkpoints for the release system, which makes release automation possible.
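One simple way to decide whether a canary batch can be trusted is to compare its error rate with that of the stable baseline. The sketch below is an assumption about how such a check could work; the function name, the 1.5x tolerance, the minimum sample size and the 0.1% floor are all illustrative values.

```python
def canary_verdict(baseline, canary, max_ratio=1.5, min_requests=100):
    """Decide whether to promote, hold, or roll back a canary batch.

    `baseline` and `canary` are dicts with `requests` and `errors` counts.
    """
    if canary["requests"] < min_requests:
        return "hold"  # not enough traffic to judge yet
    base_rate = baseline["errors"] / max(baseline["requests"], 1)
    canary_rate = canary["errors"] / canary["requests"]
    # A small absolute floor keeps a zero-error baseline from blocking every release.
    allowed = max(base_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"
```

Returning "hold" for low traffic avoids promoting or rolling back on the strength of a handful of requests.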

In essence, these functions shift observability to the left, from the running period of software to the code release and delivery stages, to ensure the quality of delivered code.

Most application scenarios of observability are operation and maintenance scenarios, such as viewing capacity, resource utilization (water level), early warnings and other indicators. The figure above shows how observability is applied in operation and maintenance after the move to the cloud.

Observability can also be applied to chaos engineering, where it is a core dependency. The core of chaos engineering is the fault drill mechanism, which injects faults into a service to check whether exceptions occur. First, fault injection presupposes that the system is in a steady state, and verifying this depends on checking metrics in the observability system. Second, when exploring factors that may lead to instability, we need observability to compare the differences, find out which internal systems and which links have problems, and so uncover hidden dangers early.
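The two roles that metrics play in a fault drill, confirming a steady state before injection and comparing behavior afterwards, can be sketched as follows. The function names and the idea of a fixed acceptable range are assumptions for illustration; the samples stand in for any metric, such as a success rate.

```python
from statistics import mean


def is_steady(samples, lower, upper):
    """Steady-state check: every sample stays within [lower, upper]."""
    return all(lower <= s <= upper for s in samples)


def run_experiment(before, during, lower, upper):
    """Evaluate a fault drill from metric samples taken before and during injection."""
    if not is_steady(before, lower, upper):
        # Injecting into an already unstable system would make results meaningless.
        return {"status": "aborted", "reason": "system not steady before injection"}
    drift = mean(during) - mean(before)
    status = "passed" if is_steady(during, lower, upper) else "weakness found"
    return {"status": status, "drift": drift}
```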

Cost management and security observability are two scenarios we are currently exploring.

How to reduce costs is one focus of managers. First, you need to identify current costs. After moving to the cloud, there may be hybrid cloud or multi-cloud scenarios, in which financial and usage data must be gathered from multiple places. Observability provides the system utilization and resource consumption needed for further cost optimization. For example, using SLS can involve a large amount of log storage; through observability you can view SLS utilization and which indexes consume the most resources.
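Finding which indexes consume the most resources is essentially a ranking over per-index usage. A minimal sketch, assuming usage has already been exported as gigabytes per index; the index names and the flat per-GB price are made up and do not reflect actual SLS billing.

```python
def top_consumers(usage_gb, price_per_gb, top_n=3):
    """Rank log indexes by estimated storage cost, largest first.

    Returns (name, cost, share_of_total) tuples.
    """
    costs = {name: gb * price_per_gb for name, gb in usage_gb.items()}
    total = sum(costs.values())
    ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return [(name, cost, cost / total if total else 0.0) for name, cost in ranked]
```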

Security observability requires moving security earlier in the process. Through observability, security risks such as abnormal traffic and attacks should be discovered in advance, and the security system should be improved through gradual iteration.

04 Future

In the future, the development trend of observability is standardization and diversification.

Observability will gradually move from a wide variety of products to standardized models based on open source standards. Logging, metrics and tracing each have a rich set of open source and commercial products. However, the larger the choice space, the harder it is to make decisions or settle on best practices. We therefore think it is necessary to establish a standard observability system as soon as possible, such as OpenTelemetry for tracing, Prometheus for metrics and Grafana for displaying multiple data sources. Alibaba Cloud has also proposed the OPLG model, where O is OpenTelemetry, P is Prometheus, L is Loki and G is Grafana.

Diversification means that the application fields of observability will become more diverse. From the initial observation of single applications to later APM monitoring, a variety of monitoring scenarios has gradually emerged, so that everything can be monitored and observed.

Q&A: audience questions

Q1 At which stages of software development is observability applied?

A: Traditionally, observability has been applied in the operation and maintenance stage, focusing on online system utilization, operational monitoring, etc. Now observability tends to shift left and is applied in the software architecture design, CI and CD stages. For example, a CI dashboard is provided, and during delivery and release, metrics are collected and used to block a release automatically.

Q2 How is observability data of different dimensions obtained?

A: At the resource and platform layers, open source eBPF can collect data natively, such as MySQL, Redis and Kafka traffic and node data. Product-side data requires some development work to collect.

Q3 What commercial products are used to build Alibaba Cloud's observability system?

A: SLS and ARMS. SLS provides both logging and trace functions, and it can also build metrics and dashboards, providing a complete closed loop.

ARMS currently has no logging, only metrics and tracing, but its advantage is a lower cost than SLS. In addition, ARMS provides a complete APM tool chain that covers not only metric data but also insight data. For example, you can view and analyze JVM indicators. It also provides Arthas profiling capabilities and can intelligently identify system exceptions.

Q4 What is the relationship between Prometheus, Grafana, tracing and ARMS?

A: There is no direct relationship. ARMS provides a managed Prometheus service. It also has a commercial partnership with Grafana Labs, and some Grafana capabilities are hosted directly on Alibaba Cloud ARMS. Based on these two hosted capabilities and its own tracing service, ARMS can combine the three well, so that the whole is greater than the sum of its parts.

Q5 Which indicator systems need to be unified to achieve end-to-end monitoring?

A: First, the core indicators along the link must not be lost, including those of resources, platforms, applications, etc. Second, trace indicators are also necessary. Stitching end-to-end links together is technically difficult, and can be achieved with OpenTelemetry, SkyWalking and similar products. In addition, the three golden indicators, user-side dial testing of web services, and an observability path from the user's perspective are also necessary. Take a shopping order as an example: from the user's perspective, the complete link runs from the request at the consumer end to payment completion, covering gateway monitoring, payment monitoring, order monitoring, etc., and these indicators should be displayed in a unified way.
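The shopping-order example can be illustrated with a much simplified sketch of joining spans into one end-to-end link. The span fields used here (`trace_id`, `service`, `start_ms`, `end_ms`) are a hypothetical flat structure, not the OpenTelemetry or SkyWalking data model, and parent-child relationships between spans are ignored.

```python
from collections import defaultdict


def stitch_traces(spans):
    """Group spans by trace id and order each trace by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return {
        trace_id: sorted(items, key=lambda s: s["start_ms"])
        for trace_id, items in traces.items()
    }


def summarize(trace):
    """Return the service path and end-to-end duration of one stitched trace."""
    path = [span["service"] for span in trace]
    duration = max(s["end_ms"] for s in trace) - min(s["start_ms"] for s in trace)
    return {"path": path, "duration_ms": duration}
```

With spans from a gateway, an order service and a payment service sharing one trace id, `summarize` yields the path in call order plus the total time the user waited, which is the view a unified dashboard would present.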
