
Exploration From Failure Drills to O&M Tool Capability Evaluation

This article explores the need for failure drills and O&M tool evaluation in the context of evolving system architecture and its impact on system stability.

By OpenAnolis System Operation & Maintenance Alliance (SOMA)

As AI and cloud-native technologies continue to evolve, O&M tools play an increasingly vital role in rapid assessment of system stability, timely alerting, and root cause analysis. Among the many O&M tools available in the industry, some are ineffective: low quality, redundant, difficult to use, or poorly compatible. To identify excellent options, the OpenAnolis System Operation & Maintenance Alliance (SOMA) conducts comprehensive evaluations and comparisons of O&M tools through fault injection.


Evolution of System Architecture

With the development of the Internet industry, the software architecture behind applications such as e-commerce and short video has undergone leapfrog upgrades. In short, the general system architecture has evolved from a monolithic architecture to a distributed architecture and finally to a microservices architecture. At present, service mesh is emerging as a new technology pattern.

Earlier websites only needed to handle a small traffic volume, so they used a monolithic architecture in which all required functions, modules, and components of the system were coupled within a single application. These tightly coupled elements usually ran in the same process and shared the same database. This design pattern is simple and intuitive, with lower costs of development, deployment, and maintenance, and it is suitable for the rapid development of small applications. However, its disadvantages are also evident: it does not allow horizontal scaling and exhibits high coupling.


Distributed architecture abstracts away repetitive code into unified functionality that other business code or modules can call. It consists of a service layer and a presentation layer. The service layer encapsulates specific business logic for use by the presentation layer, while the presentation layer handles interactions with web pages. Distributed architecture improves code reusability and overall access performance. However, it involves complex call relationships, which makes the system difficult to maintain.


Microservices architecture breaks a large application into multiple small, independent applications (microservices) that run in separate processes and communicate with one another through messaging or remote calls. Each microservice can be independently developed, deployed, and maintained. These microservices are loosely coupled, which facilitates scaling and maintenance. However, development can be costly and complicated, and maintaining data consistency across services is challenging.


Problems Caused by Architecture Changes

As system architecture becomes more complex and applications diversify, especially in Kubernetes scenarios, the number of call levels between different services increases. This poses the following challenges to the stability of business systems and the effectiveness of O&M tools:

  • Large-scale changes to a module may cause frequent stability issues.
  • The complexity of architecture renders traditional maintenance methods inadequate to meet stability requirements.
  • Infrastructure such as monitoring, alerting, and O&M tools may fail to work properly when faults occur.


Stability issues resulting from architecture changes can have a more severe impact, especially as traffic and the number of users grow. In addition to necessary pre-launch tests, targeted assurance activities are equally important. For instance, when a new game project goes live, it is accompanied by a one- or two-month special assurance operation.

However, current assurance measures are limited to post-incident remediation. Is it possible to enable proactive O&M through fast detection and early warning of issues? The answer lies in enhanced O&M tools that support system monitoring and observability (such as today's cloud-native observability products). Such tools should be able to collect system operation metrics in real time, predict faults from abnormal metric trends, conduct root cause analysis using intelligent algorithms, and offer suggestions for the issues found. This way, problems can be found and solved quickly.
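
To make the idea of early warning from abnormal metric trends concrete, here is a minimal Python sketch, assuming a stream of CPU utilization samples and a simple rolling mean/standard deviation rule. The window size, threshold, and metric source are arbitrary illustrative choices, not parameters of any particular product.

```python
from collections import deque
import statistics

class TrendDetector:
    """Flag metric samples that deviate sharply from the recent trend."""

    def __init__(self, window=60, z_threshold=3.0):
        self.window = deque(maxlen=window)   # recent samples, e.g. one per second
        self.z_threshold = z_threshold       # how many std-devs count as abnormal

    def observe(self, value):
        """Return True if the new sample looks anomalous compared with recent history."""
        anomalous = False
        if len(self.window) >= 10:           # require a minimal history first
            mean = statistics.fmean(self.window)
            stdev = statistics.pstdev(self.window) or 1e-9
            anomalous = abs(value - mean) / stdev > self.z_threshold
        self.window.append(value)
        return anomalous

# Example: feed CPU utilization samples and raise an early warning on a sudden spike.
detector = TrendDetector()
for cpu_percent in [12, 14, 13, 15, 12, 13, 14, 13, 12, 14, 95]:
    if detector.observe(cpu_percent):
        print(f"early warning: abnormal CPU trend, sample={cpu_percent}%")
```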

To ensure effective O&M, we also need to carry out failure drills to uncover and address potential problems, such as insufficient metrics, ineffective alerts, and inadequate root cause analysis, in the O&M tools.

Why Do We Need Failure Drills?

The development of cloud-native technologies and the wide adoption of microservices architecture and containerization have made software architecture increasingly complex, and the uncertainty caused by dependencies between services is growing exponentially. Any unexpected or abnormal change in one component may exert a significant impact on other services. Therefore, it is necessary to build a failure drill platform and mechanism to improve the fault tolerance and resilience of the system architecture and to verify the end-to-end fault location and recovery capabilities.

The following figure shows the drill tasks performed in different environments in each stage. The purpose is to verify the system stability and the effectiveness of the O&M alert management platform in the test environment, staging environment, and production environment through fault injection.

[Figure: drill tasks performed in the test, staging, and production environments]

According to the general process of fault handling, a failure drill can also be divided into the following three stages:

  • Pre-event: Identify risks beforehand, establish frameworks and plans, and conduct drills.
  • During-event: Promptly detect faults, locate faults, and mitigate loss.
  • Post-event: Investigate the root causes and implement the improvement items identified in the review.

Failure drills are conducted to test the stability of a business system or the overall performance of O&M tools in response to various potential problems that may occur in online environments.

Failure drills conducted in the production environment are of the highest level. They provide key insights into the diversity of injected cases, the control and orchestration capabilities (chaos engineering) of the system, the alerting and root cause analysis capabilities of the observability platform, and the effectiveness of data isolation. This is where full-link stress testing comes into play, which will be discussed in the next section.

How Do We Perform Full-Link Stress Testing?

Why do we need full-link stress testing if we already have failure drills? This is because some problems are only exposed under extreme traffic conditions, such as those related to network bandwidth, inter-system impacts, and infrastructure dependencies.

Many people may be curious about how full-link stress testing is conducted for major sales events like the Double 11 Shopping Festival. Beyond a mere stress test, full-link stress testing is more of a simulated high-traffic sales event that aims to verify high-availability solutions, such as plan drills, throttling validation, and destructive drills.

Special attention must be paid to the isolation of production data during full-link stress testing, to ensure that stress testing in the production environment does not affect the normal operation of online business. The solution is to store the data generated by stress testing separately from the actual data used in production to prevent dirty reads. Currently, there are two isolation approaches: shadow tables and shadow databases.

Here we will take a closer look at shadow databases, which are created on the same instance as the production databases. A stress testing probe mounted on the user service performs bypass processing when it detects the traffic markers: for shadow traffic, it retrieves shadow connections from the shadow connection pool so that data generated by the stress testing traffic is routed to the corresponding shadow database. This way, stress testing data is isolated from the production databases and will not contaminate production data.
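
As a minimal sketch of the routing idea, assuming a hypothetical x-stress-test traffic marker and simplified connection pools (real probes usually do this transparently at the driver or agent level):

```python
# Minimal sketch of marker-based routing between production and shadow databases.
# The marker name, pool class, and DSNs are illustrative assumptions.

PRODUCTION_DSN = "mysql://app@db-prod/orders"
SHADOW_DSN = "mysql://app@db-prod/orders_shadow"   # shadow database on the same instance

class ConnectionPool:
    def __init__(self, dsn):
        self.dsn = dsn

    def get_connection(self):
        # A real pool would hand out live connections; here we just echo the target.
        return f"connection to {self.dsn}"

production_pool = ConnectionPool(PRODUCTION_DSN)
shadow_pool = ConnectionPool(SHADOW_DSN)

def route_connection(request_headers: dict):
    """Bypass processing: shadow traffic goes to the shadow connection pool."""
    if request_headers.get("x-stress-test") == "1":   # traffic marker set by the load generator
        return shadow_pool.get_connection()
    return production_pool.get_connection()

print(route_connection({"x-stress-test": "1"}))   # -> shadow database
print(route_connection({}))                       # -> production database
```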


A business can be safely launched after failure drills and full-link stress testing. Does that make O&M tools unnecessary? Of course not. Large-scale cloud-native systems still rely on robust observability tools to safeguard business operations. Therefore, O&M tools that have been validated through failure drills and full-link stress testing are good candidates, and it is helpful to design various failure scenarios for a comprehensive assessment of the available O&M tools.

Failure Scenarios

There are many scenarios for failure drills. We can design failure scenarios at the level of a single system application or of cluster components to test the stability of a business system and, more importantly, its ability to detect problems early. This places higher demands on O&M tools and thus poses greater challenges.

Design failure scenarios based on vertical layering:

[Figure: failure scenarios designed by vertical layer]

Design failure scenarios at the cluster and component levels:

[Figure: failure scenarios at the cluster and component levels]

Conducting drills for different business and deployment scenarios allows us to comprehensively assess O&M tools. For example, by simulating packet loss using chaos engineering, we can evaluate whether the tools can identify and alert on issues within milliseconds, detect connection failures and timeouts from application monitoring, and pinpoint errors in the OS kernel, cloud environments, or physical environments.
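
A packet-loss fault of this kind can be injected with Linux tc/netem, which chaos platforms such as Chaos Mesh also build on. The Python sketch below is a simplified stand-alone version: the interface name, loss rate, and duration are illustrative assumptions, and a real drill would normally be orchestrated through a chaos platform with proper blast-radius controls.

```python
# Sketch of a packet-loss drill using Linux tc/netem (requires root and the iproute2 tools).
import subprocess
import time

INTERFACE = "eth0"        # assumed network interface of the target node
LOSS = "10%"              # assumed packet-loss rate
DURATION_SECONDS = 60     # assumed fault duration

def inject_packet_loss():
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "loss", LOSS],
        check=True,
    )

def clear_packet_loss():
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=True,
    )

if __name__ == "__main__":
    inject_packet_loss()
    try:
        # While the fault is active, watch whether the O&M tool raises a timely,
        # correctly attributed alert (connection failures, timeouts, retransmits).
        time.sleep(DURATION_SECONDS)
    finally:
        clear_packet_loss()
```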

A scenario in which a null pointer dereference causes a system crash offers an all-around test of O&M tools. It checks whether cluster-based O&M tools can quickly detect single-node failures, capture vmcore information, analyze the root cause of the crash, report on the health status of the node, and perform migration actions based on an impact assessment.

Tool Evaluation

From the above discussion on failure drills, it is evident that O&M tools play an important role in the early warning and detection of problems, as well as in root cause analysis. They are essential for quickly identifying stability issues in business systems, raising timely alerts, and conducting root cause analysis. Other metrics of interest for evaluating O&M tools include feature diversity, alert timeliness, metric validity, stability, footprint, ease of use, functionality, and community support. Failure drills create a suitable environment for effectively evaluating O&M tools against these considerations.

Deploying a complete set of O&M tools on a mature business system, especially tools that run constant monitoring, typically incurs a performance overhead. For instance, profiling, which is often used for performance analysis in observability scenarios, can affect overall system performance. Therefore, it is essential to minimize the impact of O&M tools on the system after deployment. To this end, we must select an O&M product that offers rich functionality while ensuring minimal performance overhead, reasonable storage costs, effective fault prediction and alerting, and intelligent root cause analysis and repair suggestions.
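
One practical way to quantify that overhead during an evaluation is to sample the resource usage of the tool's agent process while the drill runs. The sketch below uses the psutil library; the agent process name and the sampling parameters are assumptions for illustration, not the name of any real product.

```python
# Rough overhead check for a monitoring agent, using psutil (pip install psutil).
import time
import psutil

AGENT_NAME = "observability-agent"   # hypothetical process name of the tool under test

def sample_agent_overhead(duration=60, interval=5):
    """Return average (CPU %, resident memory MiB) of the agent over the sampling window."""
    samples = []
    for _ in range(duration // interval):
        for proc in psutil.process_iter(["name", "memory_info"]):
            if proc.info["name"] == AGENT_NAME:
                samples.append((
                    proc.cpu_percent(interval=None),          # CPU % since the previous call
                    proc.info["memory_info"].rss / 1024**2,   # resident memory in MiB
                ))
        time.sleep(interval)
    if not samples:
        return None
    avg_cpu = sum(c for c, _ in samples) / len(samples)
    avg_mem = sum(m for _, m in samples) / len(samples)
    return avg_cpu, avg_mem

print(sample_agent_overhead())
```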

Considerations for the evaluation of O&M tools are summarized as follows:

  • Low resource usage. O&M tools should use minimal CPU, memory, and other resources, and never consume more resources than the business system.
  • No impact on the original business system. Always consider stability before deploying any tool, ensuring that it will not cause system failures. After all, what use is a tool if it spoils rather than improves your business?
  • Timely response and valid alerts. The O&M tool should alert users to valid issues; unnecessary alerts should be avoided.
  • Accurate analysis with consistency between functions. Monitoring and profiling functions should be stable, with accurate and consistent root cause analysis results.
  • Low data costs. Logs, metrics, and traces should be concise but effective for locating problems. Excessive metrics are not affordable.
  • Ease of use, deployment, and upgrade. The tools should be user-friendly and capable of solving problems rather than being unnecessarily complex.
  • Good architecture portability. Compatibility across x86 and Arm platforms and different kernel versions (including eBPF support) is essential.

The evaluated items under functionality, alerting, and observability metrics are further discussed in the following section.

Functionality

1. Comprehensive functionality:

  • System monitoring: monitors basic resources such as CPU, memory, disk, and network traffic in real time.
  • Application layer monitoring: monitors API calls, database performance, and container or microservice status.
  • Metric customization: allows users to define custom metrics and thresholds.

2. Flexibility in alert configuration:

  • Alert rule setting based on multiple conditions: supports alert triggering based on multi-dimensional data aggregation and logical operations.
  • Alert policy management: includes features such as duplicate alert suppression, alert escalation mechanism, and alert merging.

3. Fault location accuracy:

  • Root cause analysis: provides the tools or functions for quickly locating the root causes of problems.
  • Failure path tracing: records and displays the sequence of events leading to a failure to facilitate investigation.

Alerting

1. Alert latency detection:

  • The time between actual fault occurrence and real-time alert triggering.
  • Statistics on average alert latency (a measurement sketch follows this list).

2. Coverage of alerting channels:

  • Multiple notification methods such as SMS, phone, email, WeCom, and DingTalk.
  • Third-party integration: the ability to integrate with other alerting channels and services.

3. Threshold sensitivity:

  • Sensitivity to anomalies and rationality of threshold settings.

4. Alert notification frequency control:

  • The policy for adjusting the alert frequency during continuous exceptions to prevent excessive repeated alerts.
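
As noted under alert latency detection, the following Python sketch measures the time from fault injection to the first matching alert by polling a hypothetical alert API of the tool under test. The endpoint URL, the response schema, and the label matching are assumptions for illustration, not the API of any specific product.

```python
# Sketch of alert-latency measurement: record the fault injection time, then poll the
# tool's (hypothetical) alert endpoint until a matching alert appears.
import json
import time
import urllib.request

ALERT_ENDPOINT = "http://oam-tool.local/api/v1/alerts"   # hypothetical API of the tool under test

def wait_for_alert(label, timeout=300, poll_interval=2):
    """Return seconds elapsed until an alert carrying `label` shows up, or None on timeout."""
    injected_at = time.monotonic()
    deadline = injected_at + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(ALERT_ENDPOINT) as resp:
            alerts = json.load(resp)                     # assumed: a JSON list of alert objects
        if any(label in a.get("labels", {}).values() for a in alerts):
            return time.monotonic() - injected_at
        time.sleep(poll_interval)
    return None

# Usage: inject the fault (e.g. packet loss), then immediately call wait_for_alert("network_loss").
```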

Evaluation Items of Monitoring Metrics

1. Richness of observable data sources:

  • Log observation: provides log collection, parsing, search, and association analysis capabilities.
  • Time series data observation: visualizes various time series data, such as system performance metrics and key business metrics.
  • Distributed tracing: provides the tracing capability for distributed system call chains.

2. Data visualization:

  • Customization capabilities of the visualization dashboard: supports the customization of chart type, layout, and color coding.
  • Real-time data update speed: supports high interface refresh rate and low data synchronization latency.

3. Insight and analysis depth:

  • Effectiveness of the anomaly detection algorithm: checks whether it can accurately identify potential problems.
  • Intelligent diagnosis suggestion: provides AI/ML-based fault prediction and solution recommendation.

4. Extensibility and compatibility:

  • Support for different architectures, such as cloud-native and hybrid cloud environments.
  • Compatibility with third-party systems, such as Prometheus, OpenTelemetry, and other standard protocols.

The quantitative evaluation of the above items provides a full picture of the functional accuracy, alert timeliness, and observability effectiveness of O&M products. At the same time, the evaluation process should be combined with specific scenarios and user needs to ensure that the results are targeted and practical.

Tool Evaluation by the System O&M Alliance

Business systems span thousands of industries, many kinds of chaos engineering products exist on the market, O&M tools vary widely in quality, domestic O&M technology has long been playing catch-up with that of other countries, and the development of the industry is hindered. The O&M ecosystem also exhibits a familiar pattern: no one pays attention when nothing goes wrong, and O&M personnel take the blame when something does. There is little reverence for faults and insufficient understanding of the role of O&M tools.

With the development of AI and cloud-native technologies, systems are becoming increasingly complex, and the number of call levels is rising. Therefore, the exploration of observability and AIOps has never ceased, both domestically and internationally, leading to the emergence of many excellent tools. However, some of these tools are of low quality, repetitive, difficult to use, and poorly compatible. To foster growth in the O&M industry and distinguish excellent tools, the OpenAnolis community has established SOMA. One of its key tasks is to inject failures into specific business systems to assess and evaluate different O&M tools. These tools are then compared through a scoring and ranking mechanism to determine which performs best.

SOMA is an organization initiated and established by the OpenAnolis community together with platform manufacturers, O&M manufacturers, universities and research institutes, institutions, and industry users, in accordance with the principles of equality and voluntariness, with the aim of promoting the progress of system O&M technology and facilitating industry-university-research cooperation. By establishing a fault injection platform and an O&M product capability evaluation system, SOMA bridges communication between platform manufacturers, O&M manufacturers, and customers, enabling users to gain a comprehensive understanding of the O&M product landscape.

The O&M tool evaluation framework set up by SOMA mainly comprises four systems: case injection, the system under test, the evaluation system, and the report and scoring system. Different types of cases are injected into the system under test, which runs a standard microservice application, and the fault expectations are passed to the evaluation system through a standardized API. The evaluation system then collects field metrics from test points, such as the standard API exposed by the O&M tool or a third-party standard observation system, and matches them against the expected metric ranges. After the evaluation is completed, the evaluation system generates a score and a test report for the product. Later, the scores are ranked, industry reports are released, and some commercialization actions are carried out.

Workflow:

1. Input fault cases and chaos engineering controls.

The fault cases include customer cases, manufacturer cases, and evaluation cases. The first two types mainly allow customers and manufacturers to simulate case injection and fault behaviors at any time, so that customers can gain a comprehensive understanding and evaluation of fault manifestations and the capabilities of O&M tools. Evaluation cases are mainly used to review, rank, and score O&M tools, and are standardized and credible.

2. Collect metrics.

When selecting a business system for evaluation, the following criteria should be considered: sustainable maintenance, microservice-based applications, and rich debugging APIs. The Train-ticket microservice application developed by Professor Peng Xin's team at Fudan University is currently used as the business benchmark; an online boutique application could also serve this role. The benchmark metrics of the evaluation system come from the fault injection system, and the business system metrics have to be collected from the APIs provided by the O&M tool. At present, most O&M tools do not provide these APIs, or the APIs are incomplete, so we need to create standards to unify them.

3. Carry out metric comparison and range matching.

The evaluation system obtains the expected output metrics from the injection system as a benchmark and compares them with the metrics obtained by the O&M tool itself. A set of criteria is developed for each metric category, and the allowed metric deviation range determines the tool's score (a minimal sketch follows this workflow).

4. Score, output tool level distribution, and generate a single test report.

5. Publish the rankings and the annual O&M industry report according to the scores.
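
As referenced in step 3, here is a minimal Python sketch of range matching and scoring, assuming hypothetical metric names, expected ranges, and a simple one-point-per-metric rule; a real evaluation system would weight metrics by category and tolerance.

```python
# Minimal sketch of range matching and scoring: compare expected metric ranges from the
# fault injection system with values reported by the O&M tool under test.

expected = {
    # metric name: (lower bound, upper bound) expected after fault injection
    "cpu_utilization_percent": (85.0, 100.0),
    "memory_usage_percent": (40.0, 60.0),
    "p99_latency_ms": (500.0, 2000.0),
}

reported = {
    # values the O&M tool exposed through its API during the drill
    "cpu_utilization_percent": 92.3,
    "memory_usage_percent": 48.1,
    "p99_latency_ms": 260.0,   # the tool under-reported the latency spike
}

def score(expected_ranges, reported_values):
    """Award one point per metric whose reported value falls inside the expected range."""
    results = {}
    for name, (low, high) in expected_ranges.items():
        value = reported_values.get(name)
        results[name] = value is not None and low <= value <= high
    total = sum(results.values())
    return results, 100.0 * total / len(expected_ranges)

per_metric, overall = score(expected, reported)
print(per_metric)               # per-metric pass/fail
print(f"score: {overall:.1f}/100")
```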

Workflow and data APIs of failure drills and O&M tool evaluation:

[Figure: workflow and data APIs of failure drills and O&M tool evaluation]

From the above analysis, we can see that the key aspect of an evaluation system is formulating a set of evaluation items and standards that determine which metrics can be evaluated automatically. The metric ranges need to be standardized for specific scenarios, ideally determined at the time of fault injection. Metrics such as CPU utilization, memory usage, and system latency should have deterministic expectations.

The most challenging work for the O&M Alliance is to define this set of standards and implement automated evaluation capabilities based on them. We hope that more individuals and enterprises interested in this area will contribute.

At present, the Chaos Mesh-based fault injection system developed by the Alliance member Yunguan Qiuhao team has been open sourced in the OpenAnolis community. It currently provides fault cases for network, storage, and Kubernetes, and we hope everyone will contribute more cases. You can visit the SOMA webpage through the OpenAnolis community homepage to try it out: https://soma.openanolis.cn/

We have formed the following evaluation knowledge graph for O&M platforms of Alliance members according to different evaluation types.

[Figure: evaluation knowledge graph for Alliance members' O&M platforms]

Evaluation Direction of SOMA

As can be seen from the above evaluation knowledge graph, we have divided evaluations conducted by SOMA into the following fields:

  • OS System O&M
  • Cloud-native Observability
  • AIOps Intelligent O&M
  • Server O&M
  • Traditional O&M

Each field has its own evaluation items, evaluation content, and evaluation methods, some of which are shared and cross-cutting. But in any case, a good evaluation includes the following key points:

  • Definition of the preconditions under which an evaluation is conducted.
  • Definitions of evaluation criteria and evaluation items, such as metric types, API definitions, and metric range definitions.
  • Case injection design, including the expected metric ranges and an API connecting to the evaluation system.
  • Review results, scores, and ratings. Evaluate the resource usage of the system and the tested tool, and formulate scoring rules and tool levels.

The focus of O&M tools varies across businesses. SOMA will concentrate on two popular types of tools: observability and AIOps.

For observability, we choose a specific function point for evaluation: feedback on the health status of a node based on aggregated metrics. The health status can be healthy, sub-healthy, or unhealthy, and the metrics include saturation, system errors, system latency, and network traffic. The following list describes the metrics (a sketch of such aggregation follows the list):

  • Saturation: continuous monitoring that supports dynamic adjustment of kernel parameters before problems occur, as well as resource leak analysis.
  • Errors: system error events or logs, impact on the business, and identification of potential risks.
  • Latency: the impact of system latency on applications.
  • Traffic: monitoring processes and applications that cause traffic bursts.
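
As mentioned above, here is a minimal Python sketch of aggregating such signals into a node health status. The thresholds and weights are chosen purely for illustration and are not part of any standard.

```python
# Sketch of aggregating saturation, error, latency, and traffic signals into a node health status.

def classify_node_health(metrics: dict) -> str:
    """Return 'healthy', 'sub-healthy', or 'unhealthy' from aggregated node metrics."""
    issues = 0
    if metrics.get("cpu_saturation", 0.0) > 0.9 or metrics.get("memory_saturation", 0.0) > 0.9:
        issues += 2                          # resource saturation weighs heavily
    if metrics.get("error_rate_per_min", 0) > 10:
        issues += 2                          # frequent system errors
    if metrics.get("p99_latency_ms", 0) > 1000:
        issues += 1                          # latency impact on applications
    if metrics.get("traffic_burst_ratio", 1.0) > 3.0:
        issues += 1                          # sudden traffic burst versus baseline
    if issues == 0:
        return "healthy"
    return "sub-healthy" if issues <= 2 else "unhealthy"

print(classify_node_health({"cpu_saturation": 0.95, "error_rate_per_min": 3}))   # sub-healthy
```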

In addition, when an O&M tool is deployed, how can we judge its impact on the original system, that is, its consumption of system resources? We first take a third-party calibration tool, such as a Prometheus-based collector, as a trusted source, and then compare the calibration tool's metrics with the metrics output by the tested tool to score and evaluate the tested tool. At present, this evaluation item is being developed by the Alliance member Yunshan network team and will soon be open sourced at https://gitee.com/anolis/soma.
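
Below is a minimal sketch of this calibration comparison, assuming a reachable Prometheus instance and an illustrative PromQL expression. It uses the standard Prometheus HTTP query API, and the tested tool's value is shown as a placeholder.

```python
# Sketch of cross-checking a tested tool's metric against a Prometheus calibration source.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.local:9090"                 # assumed calibration endpoint
QUERY = 'rate(node_cpu_seconds_total{mode!="idle"}[1m])'    # illustrative PromQL expression

def prometheus_instant_query(expr):
    """Return the first series value of an instant query, or None if there is no result."""
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def relative_deviation(calibration_value, tested_value):
    """Relative deviation of the tested tool's reading from the calibration source."""
    if calibration_value in (None, 0) or tested_value is None:
        return None
    return abs(tested_value - calibration_value) / calibration_value

calibration_cpu = prometheus_instant_query(QUERY)
tested_cpu = 0.42   # placeholder for the value reported by the tool under test
print(relative_deviation(calibration_cpu, tested_cpu))
```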

Related evaluation items:

  • Basic resources: CPU, memory, traffic, file descriptors, IO resources, and disk space.
  • Application resources: connection pool, thread pool, middleware, call library, and other libraries.
  • Logs and storage resources: log volume, metric data volume, and other data volume.

In the end, we will evaluate the level of the tool, which will include the following aspects:

[Figure: aspects included in the tool level evaluation]

The resulting score and tool level may vary with the evaluation item. Research results in this field will be introduced in the future.

AIOps Tool Evaluation

When evaluating AIOps tools, we analyze the root causes of specific problems and use chaos engineering to simulate multiple fault scenarios to test the effectiveness and robustness of AIOps solutions. By collecting metrics and logs and performing trace analysis, we provide a set of standardized evaluation metrics and datasets that give users a reference for selecting an appropriate AIOps tool, and we provide real data and scenarios to O&M application developers and researchers for academic research, product testing, and evaluation ranking.

SOMA cooperated with the OpenAIOps community initiated by Professor Pei Dan from Tsinghua University to jointly establish evaluation capabilities in the AIOps field.


References:

  1. 云原生背景下故障演练体系建设的思考与实践—云原生混沌工程系列之指南篇 (Thinking and Practice of Failure Drill System Construction in a Cloud-Native Context: A Guide in the Cloud-Native Chaos Engineering Series):
    https://developer.aliyun.com/article/851555?utm_content=m_1000318166
  2. 全链路压测:影子库与影子表之争 (Full-Link Stress Testing: Shadow Databases vs. Shadow Tables):
    https://developer.aliyun.com/article/982802