Building an Integrated Performance Monitoring Platform
As Internet technology continues to develop, the scale and complexity of enterprise businesses keep growing. To ensure stable and reliable business operations, enterprises need comprehensive performance monitoring of their systems. An integrated performance monitoring platform is a comprehensive solution that combines multiple monitoring tools and technologies, helping enterprises monitor system performance more comprehensively and efficiently.
Improving monitoring efficiency: Traditional performance monitoring solutions often rely on multiple separate tools for network monitoring, server monitoring, database monitoring, and so on. Each tool must be configured and managed independently, and the resulting data is scattered across different systems, which keeps monitoring efficiency low. An integrated performance monitoring platform brings these tools together under unified management, greatly improving efficiency, reducing the workload of monitoring personnel, and enabling more comprehensive coverage of system performance.
Improving monitoring accuracy: Traditional solutions often track only basic system metrics such as CPU and memory utilization. By integrating multiple tools and technologies, an integrated platform can monitor many more aspects of a system, such as network traffic, disk I/O, and database response time. This gives a more complete picture of system performance, helps identify and address issues promptly, and improves monitoring accuracy.
Improving failure diagnosis efficiency: When a system or application fails, traditional solutions usually require IT operations personnel to analyze monitoring data manually to determine the cause, which consumes considerable time and effort. An integrated platform can automatically analyze and correlate related monitoring data, helping operations personnel quickly locate the root cause and improving failure diagnosis efficiency.
Improving monitoring visualization: An integrated platform presents different types of performance data in a unified visual interface, making the data more intuitive and easier to understand, so monitoring personnel can identify and resolve issues quickly. In addition, an alerting mechanism can notify personnel of problems promptly, improving real-time responsiveness.
An integrated performance monitoring platform is therefore an indispensable part of modern enterprise IT. By consolidating multiple performance monitoring tools into a unified platform, it improves monitoring efficiency, accuracy, failure diagnosis efficiency, and visualization. This helps enterprises better understand how their business systems operate and improves those systems' stability and reliability.
Steps to Build an Integrated Performance Monitoring Platform
An integrated performance monitoring platform refers to the integration of multiple performance monitoring tools into a unified platform to better monitor and manage the performance of a system. Building an integrated performance monitoring platform typically involves the following steps:
Identifying monitoring requirements: First, clarify the goals and requirements of monitoring, including the objects, metrics, and frequency to be monitored. These requirements drive the selection and configuration of monitoring tools. Common monitoring objects include request response time, cache hit rate, full GC count, database connection count, CPU utilization, and so on. Coverage spans almost the entire IT stack, from end-user devices and gateways to microservice applications, databases, containers, and physical machines.
Choosing monitoring tools: Based on the monitoring requirements, select suitable monitoring tools. Common monitoring tools include Zabbix, Nagios, Grafana, etc. These tools can monitor various performance metrics of servers, networks, databases, applications, etc.
Configuring monitoring tools: Configure the selected monitoring tools based on the monitoring requirements. This includes adding monitoring objects, setting monitoring metrics, adjusting monitoring frequency, etc. Additionally, configure alert rules to promptly notify administrators of any abnormalities in the system.
Integrating monitoring tools: Integrate multiple monitoring tools into a unified monitoring platform. This can be achieved by using open-source monitoring integration tools such as Prometheus, Grafana, etc. These tools can integrate different monitoring data into a unified monitoring view.
Data visualization: Visualize the monitoring data to help administrators gain a more intuitive understanding of the system's performance. Tools such as Grafana can display monitoring data in the form of charts, dashboards, etc., making it easier for administrators to analyze and make decisions.
Automated operations: Combine the monitoring platform with automation tools such as Ansible or SaltStack to automate operations and maintenance. These tools can perform fault troubleshooting, performance optimization, and other actions automatically based on monitoring data, improving system stability and performance.
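To make the alert-rule configuration step concrete, here is a minimal sketch of threshold-based alert evaluation in Python. The `AlertRule` class and its fields are hypothetical; real tools such as Zabbix or Prometheus express equivalent rules declaratively in their own configuration formats.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """A hypothetical alert rule: fire when a metric stays above a
    threshold for a number of consecutive samples."""
    metric: str
    threshold: float
    consecutive: int

def evaluate(rule: AlertRule, samples: list[float]) -> bool:
    """Return True when the most recent `rule.consecutive` samples
    all breach the threshold."""
    if len(samples) < rule.consecutive:
        return False
    return all(v > rule.threshold for v in samples[-rule.consecutive:])

# Example rule: alert when CPU utilization exceeds 90% for 3 samples in a row.
cpu_rule = AlertRule(metric="cpu_utilization", threshold=90.0, consecutive=3)
print(evaluate(cpu_rule, [85.0, 92.0, 95.0, 97.0]))  # True
print(evaluate(cpu_rule, [85.0, 92.0, 80.0, 97.0]))  # False
```

Requiring several consecutive breaches is a common way to suppress alert noise from transient spikes.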
In short, building an integrated performance monitoring platform means selecting appropriate monitoring tools, configuring and integrating them, and adding data visualization and automated operations to improve system stability and performance. Building such a platform from scratch requires a large investment in research and development, operations, and trial and error. Based on a large number of user scenarios, Alibaba Cloud's observability team has developed a set of best practices for integrated performance monitoring that conform to open-source standards and are stable, reliable, and easily extensible. Using Alibaba Cloud's ARMS (Application Real-Time Monitoring Service), an integrated performance monitoring platform can be built to meet the core requirements of business continuity, architectural stability, and business growth in an all-in-one manner.
Achieving End-to-End Tracing Analysis
The value of tracing analysis lies in "association". End users, backend applications, and cloud components (databases, message queues, etc.) together form the trace topology. The wider this topology's coverage, the greater the value tracing can bring. End-to-end tracing analysis is a best practice that covers the entire associated IT system, completely recording user behaviors, call paths, and states.
End-to-end tracing analysis can provide three core values for businesses: end-to-end problem diagnosis, dependency analysis between systems, and custom tag propagation.
End-to-end problem diagnosis: Order failures for VIP customers, request timeouts for beta users, and many other end-user experience issues are often caused by abnormalities in backend applications or cloud components. End-to-end tracing analysis is the preferred approach for diagnosing such problems.
Dependency analysis between systems: With the launch of new businesses, decommissioning of old ones, and data-center migrations or architectural upgrades, dependencies between IT systems have become too complex for manual analysis. Based on the trace topology discovered by end-to-end tracing analysis, decision-making in these scenarios becomes more agile and reliable.
Custom tag propagation: Scenarios such as full-trace load testing, user-level canary releases, order traceability, and traffic isolation all depend on propagating custom tags along the trace and associating data based on them, and a rich full-trace ecosystem has grown around this capability. However, once data is disconnected or tags are lost, unpredictable logic failures may occur.
Challenges of End-to-End Tracing Analysis
The value of end-to-end tracing analysis is proportional to its coverage, and so are its challenges. To maximize trace integrity, every component, whether a front-end application, a cloud component, a Java application, or a Go application, must follow unified tracing specifications and implement data interoperability. The three major challenges to achieving end-to-end tracing analysis are multi-language protocol stack unification, front/back/cloud-side linkage, and cross-cloud data integration.
Multi-Language Protocol Stack Unification
In the cloud-native era, multi-language application architectures are becoming more common, and using different language features to achieve optimal performance and development experience has become a trend. However, the differences in the maturity of different languages make it impossible to achieve completely consistent capabilities for end-to-end tracing analysis. The current mainstream approach in the industry is to ensure the uniformity of the remote call protocol format and implement call interception and context propagation within each language's application internally. This ensures the integrity of the basic trace data.
However, most online issues cannot be effectively located and resolved with basic tracing capabilities alone. The complexity of online systems means that an excellent Trace product must provide more comprehensive and effective diagnostic capabilities, such as code-level diagnostics, memory analysis, thread pool analysis, and lossless statistics. Fully utilizing the diagnostic interfaces provided by each language is the foundation for the continued development of tracing.
Standardizing protocol transmission: All applications on the full trace need to follow the same transmission protocol standard so that trace context propagates completely between applications in different languages, preventing broken traces or lost context. Mainstream open-source transmission protocols include W3C TraceContext, Jaeger, B3, and SkyWalking.
Maximizing the potential of multi-language products: Tracing analysis has evolved beyond the basic capability of call chains into advanced capabilities such as application/service monitoring, method stack tracing, and performance profiling. However, differences in the maturity of different languages lead to large gaps in product capabilities. For example, Java agents can implement many advanced edge-side diagnostics based on JVMTI. An excellent end-to-end tracing analysis solution should maximize each language's differentiated technical advantages rather than blindly pursuing uniformity at the level of the lowest common denominator.
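As a concrete illustration of a unified transmission standard, the following sketch parses a W3C TraceContext `traceparent` header, which carries the trace ID, parent span ID, and sampling flag that must survive every hop. The helper function and its return schema are illustrative, not part of any particular SDK.

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C TraceContext `traceparent` header.
    Format: version-traceid-parentid-flags, all lowercase hex."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # bit 0 is the sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["sampled"])   # True
```

As long as every service, regardless of language, forwards this header intact, the basic trace remains unbroken even when internal instrumentation differs.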
Front/Back/Cloud-side Linkage
Currently, open-source tracing analysis implementations mainly focus on the backend application layer, lacking effective means of tracing in user terminals and cloud components (such as cloud databases). This is mainly because the latter is usually provided by cloud service providers or third-party vendors and depends on the compatibility and adaptability of the providers to open source. It is difficult for business entities to be directly involved in the development.
The direct impact of this situation is that it is difficult to pinpoint which backend application or service is causing a slow frontend page response, and difficult to directly correlate cloud component anomalies with application exceptions. This is especially true when multiple applications share the same database instance, where more indirect means of verification are needed, lowering the efficiency of fault troubleshooting.
To address this issue, cloud service providers need to better support open-source tracing standards and add core method instrumentation together with support for open-source protocol propagation and data flow-back (e.g., Alibaba Cloud's frontend monitoring capability in ARMS supports Jaeger protocol propagation and method stack tracing).
Additionally, due to factors such as business ownership between different systems, it is difficult to achieve a unified protocol stack for the entire trace. To achieve multi-side linkage, the Trace system needs to provide a solution to bridge the gap between heterogeneous protocol stacks.
To achieve interoperability between heterogeneous protocol stacks, the Trace system needs to support the following capabilities:
Protocol stack conversion and dynamic configuration: For example, frontend traces use the Jaeger protocol, which must be propagated downstream, while a newly onboarded external downstream system uses the B3 protocol. A Node.js application in the middle can receive Jaeger context and emit B3 context downstream, preserving complete tag propagation.
Server-side data format conversion: Convert reported data in different formats into a unified format for storage or perform compatibility at the query side. The former has lower maintenance costs, while the latter has higher compatibility costs but is more flexible.
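The Jaeger-to-B3 conversion scenario above can be sketched as a small header-translation step. The `uber-trace-id` field layout shown in the comment follows the open-source Jaeger client convention; the helper function itself is hypothetical.

```python
def jaeger_to_b3(uber_trace_id: str) -> dict:
    """Convert a Jaeger `uber-trace-id` header value into B3 multi-headers.
    Jaeger format: trace-id:span-id:parent-span-id:flags (hex fields)."""
    trace_id, span_id, parent_id, flags = uber_trace_id.split(":")
    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1" if int(flags, 16) & 0x01 else "0",
    }
    if parent_id != "0":  # Jaeger uses "0" to mean "no parent span"
        headers["X-B3-ParentSpanId"] = parent_id
    return headers

b3 = jaeger_to_b3("4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:0:1")
print(b3["X-B3-TraceId"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(b3["X-B3-Sampled"])  # 1
```

A bridging service performing this translation lets systems on different protocol stacks share one unbroken trace.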
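Server-side format conversion can be sketched as normalizing differently shaped span records into one internal schema. The input field names loosely follow Jaeger and Zipkin JSON conventions, but the target schema and the function are illustrative assumptions, not an actual storage format.

```python
def normalize_span(raw: dict) -> dict:
    """Normalize spans reported in different formats into one internal
    schema: {trace_id, span_id, name, duration_ms}."""
    if "traceID" in raw:  # Jaeger-style JSON export
        return {
            "trace_id": raw["traceID"],
            "span_id": raw["spanID"],
            "name": raw["operationName"],
            "duration_ms": raw["duration"] / 1000,  # Jaeger durations are in microseconds
        }
    if "traceId" in raw:  # Zipkin-style span
        return {
            "trace_id": raw["traceId"],
            "span_id": raw["id"],
            "name": raw["name"],
            "duration_ms": raw["duration"] / 1000,  # Zipkin durations are in microseconds
        }
    raise ValueError("unknown span format")

span = normalize_span({"traceID": "abc", "spanID": "def",
                       "operationName": "GET /orders", "duration": 12500})
print(span["duration_ms"])  # 12.5
```

Converting at ingestion time, as here, keeps maintenance costs low; the alternative of converting at query time is more flexible but costs more to keep compatible.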
Cross-Cloud Data Integration
Many large enterprises choose multi-cloud deployment due to considerations such as stability or data security. Under such deployment architecture, network isolation between different cloud environments and differences in infrastructure pose huge challenges for operation and maintenance personnel.
With multi-cloud deployment, trace integrity can be achieved through cross-cloud data reporting or cross-cloud querying. Whichever method is used, the goal is unified visibility of multi-cloud data, so that problems can be quickly located or analyzed through complete trace data.
Cross-Cloud Reporting
Cross-cloud reporting is relatively easy to implement, maintain, and manage, and it is currently the mainstream practice adopted by cloud service providers, including Alibaba Cloud's ARMS. Its advantage is low deployment cost, since only one set of server-side components needs to be maintained. However, cross-cloud transmission consumes public network bandwidth, so public network traffic fees and stability are important constraints. Cross-cloud reporting best suits a one-primary, multiple-secondary architecture, where most nodes are deployed in one cloud environment and other clouds or self-built data centers carry a small share of business traffic. For example, a company's consumer-facing business may run on Alibaba Cloud while internal applications run in a self-built data center.
Cross-Cloud Query
Cross-cloud querying means storing the raw trace data within each cloud's own network, issuing a separate query to each environment for a user request, and aggregating the results for unified processing, thereby reducing public network transmission costs.
The advantage of cross-cloud querying is that little data crosses the network: the volume actually queried is usually less than one ten-thousandth of the raw data, which greatly saves public network bandwidth. However, it requires deploying multiple data processing endpoints and does not support complex computations such as quantiles or global top-N. Cross-cloud querying is better suited to multi-primary architectures, since it can support simple trace stitching, max/min/avg statistics, and the like.
There are two modes of cross-cloud querying implementation. One is to build a centralized data processing endpoint within the cloud network and connect to user networks through dedicated lines, which can process data from multiple users simultaneously. The other is to build a dedicated data processing endpoint in the VPC for each user. The former has lower maintenance costs and greater capacity elasticity, while the latter has better data isolation.
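Which statistics cross-cloud querying can and cannot support follows directly from whether an aggregate is mergeable across endpoints. A minimal sketch, assuming each cloud-side endpoint returns `{min, max, sum, count}` partial aggregates (the field names are illustrative):

```python
def merge_stats(partials: list[dict]) -> dict:
    """Merge per-endpoint {min, max, sum, count} aggregates into global
    min/max/avg. Quantiles cannot be merged from such partials, which is
    why cross-cloud querying typically omits them."""
    total = sum(p["count"] for p in partials)
    return {
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
        "avg": sum(p["sum"] for p in partials) / total,
    }

# Latency aggregates returned by two hypothetical cloud-side query endpoints:
cloud_a = {"min": 3.0, "max": 120.0, "sum": 5000.0, "count": 100}
cloud_b = {"min": 5.0, "max": 300.0, "sum": 9000.0, "count": 100}
merged = merge_stats([cloud_a, cloud_b])
print(merged["max"])  # 300.0
print(merged["avg"])  # 70.0
```

Min, max, and averages decompose cleanly across endpoints; exact quantiles and global top-N would require shipping the underlying data, defeating the bandwidth savings.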
Other Approaches
In addition to the two solutions above, a mixed mode or a passthrough mode can be used in practice.
Mixed mode refers to reporting statistical data over the public network for centralized processing (small data volume, high accuracy requirements), while using cross-cloud querying for trace data retrieval (large data volume, low query frequency).
Passthrough mode means that each cloud environment only guarantees the integrity of the trace context, while the storage and querying of trace data are implemented independently. The advantage is very low implementation cost: each environment only needs to follow the same passthrough protocol, and the concrete implementation can be completely independent. Manual correlation can then be done using the same TraceId or application name. This mode suits rapid integration of existing systems with minimal transformation cost.
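Manual correlation in passthrough mode can be sketched as querying each cloud's independent store for the same TraceId and stitching the results together. The store layout and field names here are hypothetical.

```python
def stitch_by_trace_id(trace_id: str, stores: list[list[dict]]) -> list[dict]:
    """In passthrough mode each cloud stores its own spans. Correlation is
    done by querying every store for the same TraceId and merging the
    matching spans, ordered by start timestamp."""
    spans = [s for store in stores for s in store if s["trace_id"] == trace_id]
    return sorted(spans, key=lambda s: s["start_ts"])

# Two hypothetical per-cloud span stores sharing one passthrough protocol:
cloud_a = [{"trace_id": "t1", "name": "gateway", "start_ts": 1}]
cloud_b = [{"trace_id": "t1", "name": "order-svc", "start_ts": 2},
           {"trace_id": "t2", "name": "other", "start_ts": 3}]
full = stitch_by_trace_id("t1", [cloud_a, cloud_b])
print([s["name"] for s in full])  # ['gateway', 'order-svc']
```

This is cheap to adopt precisely because each environment keeps its own storage; the cost is that stitching happens per request rather than in a unified view.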
Best Practices of Alibaba Cloud Full-Chain Tracing
In the Alibaba Cloud environment, a comprehensive tracing system spanning the front end, gateway, server, containers, and cloud components can be quickly built on the OpenTelemetry version of Tracing Analysis to achieve end-to-end full-chain tracing.
Header transmission format: Adopt a unified header transmission protocol for the full chain, which can be W3C TraceContext, B3, or Jaeger format.
Frontend integration: Low-code integration can be achieved using two methods: CDN (injecting scripts) or NPM. This supports scenarios such as Web, H5, and mini-programs.
Backend integration:
For Java applications, it is recommended to use the ARMS Agent, which supports non-intrusive instrumentation without requiring code modification. It supports advanced features such as edge diagnostics, lossless statistics, and precise sampling. The SDK provided by OpenTelemetry can be used for custom method instrumentation.
For non-Java applications, it is recommended to integrate the OpenTelemetry version of Observability, report data to the corresponding access point, and achieve trace transmission and display between multi-language applications.
End-to-end tracing analysis is just the beginning, far from the end. On top of the tracing ecosystem, additional metrics, logs, events, profiling data, and tools can be associated to improve the efficiency of problem diagnosis and business analysis, further unleashing the value of end-to-end tracing analysis.
Based on Alibaba Cloud user practices, a typical process for selecting and diagnosing a trace includes the following steps:
1. Filtering target traces based on any combination of criteria such as TraceId, application name, interface name, response time, status code, and custom tags.
2. Selecting a specific trace from the list of call chains that satisfy the filtering criteria and viewing its details.
3. Combining request call trajectories, local method stacks, and associated data (such as SQL queries and business logs) to perform a comprehensive analysis of the trace.
4. If the above information is still insufficient to determine the root cause, using additional tools such as memory snapshots and Arthas for secondary analysis.
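Step 1 above, filtering traces by an arbitrary combination of criteria, can be sketched as follows. The field names (`app`, `status`, `response_ms`) are illustrative, not the actual ARMS query model.

```python
def filter_traces(traces: list[dict], **criteria) -> list[dict]:
    """Filter traces by any combination of exact-match criteria, plus an
    optional minimum response time (hypothetical field names)."""
    min_rt = criteria.pop("min_response_ms", None)
    result = []
    for t in traces:
        if any(t.get(k) != v for k, v in criteria.items()):
            continue  # fails an exact-match criterion
        if min_rt is not None and t["response_ms"] < min_rt:
            continue  # faster than the slow-request threshold
        result.append(t)
    return result

traces = [
    {"app": "order-svc", "status": 200, "response_ms": 35},
    {"app": "order-svc", "status": 500, "response_ms": 1200},
    {"app": "user-svc",  "status": 200, "response_ms": 80},
]
# Find slow requests of one application:
slow = filter_traces(traces, app="order-svc", min_response_ms=500)
print(len(slow))           # 1
print(slow[0]["status"])   # 500
```

In practice such filters run against the trace store's query engine rather than in application code, but the combinable-criteria model is the same.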