Best Practices for Next-Generation Cloud-Native Observability

What is OPLG?

With the rise of cloud-native architecture, the boundaries and division of labor in observability have been redefined. The traditional layered monitoring boundaries between containers, applications, and business have broken down, and the division of labor among Dev, Ops, and Sec has gradually blurred. It is now widely recognized that an IT system is an organic whole and needs an integrated solution for monitoring and diagnosing its status. After years of exploration and practice, the new generation of cloud-native observability systems based on OPLG has gradually become a popular choice for communities and enterprises.

OPLG refers to the unified display of (O)penTelemetry Traces, (P)rometheus Metrics, and (L)oki Logs through (G)rafana Dashboards, which covers most enterprise-level monitoring and analysis scenarios, as shown in the following figure (image sourced from the Grafana Labs YouTube channel). Based on the OPLG system, a unified observability platform covering the full stack of cloud-native applications can be built quickly, monitoring infrastructure, containers, middleware, applications, and the end-user experience. Traces, metrics, logs, and events are integrated so that stability operations and business analysis goals can be achieved more efficiently.

OPLG Self-Built Solution

Xiao Ming joined a trendy-brand purchasing company that specializes in helping young people find high-quality trendy goods. As the business has continued to grow, the requirements for global observability, for both system stability and business analysis, have also grown, and failures in the underlying systems directly affect business revenue and customer satisfaction. Xiao Ming's IT department therefore built a new observability platform on the OPLG system, which offers advantages such as "fast access, flexible expansion, seamless migration, and heterogeneous integration".

OPLG Advantages

Fast access: The OpenTelemetry and Prometheus communities provide a large number of mature open-source SDKs, agents, and exporters, so distributed tracing and metrics monitoring can be enabled quickly for mainstream components and frameworks without significant code changes.

Flexible expansion: The flexible query syntax of PromQL and LogQL, together with Grafana's rich customization capabilities, can meet the custom observability needs of different business lines and operation and maintenance teams.

Seamless migration: Considering data security and plans for future overseas business, the instrumentation and customized dashboards of the observability platform can be migrated seamlessly between cloud service providers. Unlike commercial products that lock users in deeply, Grafana can integrate multiple data sources, truly achieving "end-to-end migration freedom".

Heterogeneous integration: Observable data from applications written in different languages, such as Java, Go, and Node.js, and from multi-cloud environments can be interconnected and displayed in a unified way.

OPLG Challenges

Although the OPLG system has many advantages, enterprises that build it themselves also face multiple challenges. Especially with in-depth use, many non-functional problems, such as large-scale operation and maintenance, performance, and cost, gradually emerge.

Large-scale component upgrades and configuration: Managing client probes at scale is almost a "nightmare" for the operation and maintenance team, and faults caused by probe abnormalities are common. In addition, last-resort protection mechanisms such as dynamic configuration delivery and feature degradation usually require enterprises to build, develop, and operate their own configuration center.

Full collection and storage cost of traces: The daily call volume of production systems in large and medium-sized enterprises can reach hundreds of millions, and reporting and storing every call chain in full is a significant expense. Which traces are the most cost-effective to sample? How can the inaccurate metrics and alerts caused by trace sampling be avoided?

Query performance for large volumes of metrics: The more metrics a single query scans, the worse its performance. When the query time range exceeds one week or one month, queries commonly become slow or fail to return results at all. In addition, APM metrics often suffer from an excessive number of time series caused by divergent URL or SQL label values, which overwhelms the storage and query layers.

Scheduling latency and performance for massive numbers of alarms: Each alarm rule represents a periodic polling task. When the number of alarm rules reaches the thousands or tens of thousands, alarms are often delayed or cannot be sent at all, and the best window for troubleshooting is missed.

Weak component disaster tolerance: Disaster tolerance across multiple regions or availability zones is an important means of ensuring high service availability. However, because building disaster recovery capabilities requires investment in both technology and resources, many enterprises' self-built systems lack them.
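The URL divergence problem mentioned in the metrics challenge above can be illustrated with a minimal Python sketch. It is not taken from any OPLG component: the rule set, the placeholder names, and the helper functions converge_url and series_count are hypothetical, and real convergence rules would need tuning for each application. The idea is simply to replace high-cardinality path segments with a fixed placeholder before the URL is used as a metric label.

```python
import re
from typing import Iterable, Tuple

# Hypothetical convergence rules (an assumption, not a product default):
# each pattern rewrites one kind of high-cardinality path segment.
_UUID = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
_RULES = [
    (re.compile(r"/" + _UUID + r"(?=/|$)"), "/{uuid}"),  # UUID segments
    (re.compile(r"/\d+(?=/|$)"), "/{id}"),               # purely numeric segments
]


def converge_url(path: str) -> str:
    """Return a low-cardinality label for a request path."""
    path = path.split("?", 1)[0]  # the query string never belongs in a label
    for pattern, placeholder in _RULES:
        path = pattern.sub(placeholder, path)
    return path


def series_count(paths: Iterable[str]) -> Tuple[int, int]:
    """Count distinct label values before and after convergence."""
    paths = list(paths)
    return len(set(paths)), len({converge_url(p) for p in paths})


if __name__ == "__main__":
    sample = [f"/api/orders/{i}?page=1" for i in range(1000)]
    print(series_count(sample))
```

Under these assumed rules, a thousand distinct order URLs collapse into a single label value, which is the effect that keeps the number of time series bounded.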

The problems above are classic ones that enterprises encounter when building their own observability systems. These scale-induced performance and availability issues take a large amount of research, development, and time to resolve, which significantly increases operation and maintenance costs. As a result, more and more enterprises choose to host the observability backend with a cloud service provider: they still enjoy the technological dividends of open source solutions while receiving sustained and stable service guarantees.

OPLG Hosted Solution

To reduce operation and maintenance costs and provide more stable observability services, Xiao Ming's IT department decided to adopt the OPLG hosted solution provided by Alibaba Cloud ARMS. While retaining the advantages of open source solutions, this solution provides high-performance, highly available, and maintenance-free backend services, helping Xiao Ming's team solve the large-scale operation and maintenance challenges of massive data scenarios. In addition, the coverage and integration of observable data are further improved through eBPF network detection, Satellite edge computing, and Insights intelligent diagnosis, as shown in the following figure.

High performance: Supports lossless statistics, data compression, connection optimization, automatic convergence of divergent metrics, downsampling, and other techniques, significantly reducing performance overhead in massive data scenarios.

High availability: The client supports resource quotas and automatic rate-limiting protection, ensuring cluster stability under high load. Backend services support elastic horizontal scaling and disaster recovery across multiple regions and availability zones to maximize service availability.

Flexible and easy to use: Upgrades of the JavaAgent, OT Collector, and Grafana Dashboard Templates are managed and adapted automatically, with no user management required. Dynamic configuration delivery is supported, allowing real-time adjustment of parameters such as traffic switches, the call chain sampling rate, and interface filtering and convergence rules.

Network detection: eBPF is used to analyze network requests non-invasively, parse network protocols automatically, build the network topology, and display network performance between specific containers or between containers and specific cloud product instances.

Edge computing: Satellite (OpenTelemetry Collector) provides edge collection and computing for observable data inside user clusters. It standardizes data formats, unifies data labels, and effectively improves the correlation among Traces, Metrics, and Logs data.

Intelligent diagnosis: Drawing on a domain knowledge base and algorithm models accumulated over many years, the service runs periodic inspections for common online faults (such as slow SQL and uneven traffic) and automatically provides specific root cause analysis and suggestions.

ARMS for OpenTelemetry Satellite

In the past two years, communities such as OpenTelemetry and SkyWalking have been vigorously developing Satellite solutions for edge collection and computing. ARMS for OpenTelemetry Satellite (referred to as ARMS Satellite) is a unified edge-side collection and processing platform for observable data (Traces, Metrics, Logs), developed on top of OpenTelemetry Collector. It is secure, reliable, and easy to use, and is suitable for production environments.

Through data collection, processing, caching, and routing on the edge side, ARMS Satellite can standardize multi-source heterogeneous data, enhance the correlation among Traces, Metrics, and Logs, and support lossless statistics to reduce data reporting and persistent storage costs.

Scenario 1: One-click collection and analysis of panoramic monitoring data (container environment)

In the Container Service ACK environment, ARMS Satellite deeply integrates the Alibaba Cloud Kubernetes monitoring component and the Prometheus monitoring component. After a one-click installation, it automatically collects Kubernetes container resource data and network performance data. These are combined with the application-layer data reported by users (which requires only an Endpoint change and no code changes) and with automatically pre-aggregated metrics. All of this data is reported to the fully managed server-side data center and then displayed in a unified way through Grafana. The ultimate goal is panoramic monitoring data collection and analysis covering applications, containers, networks, and cloud components.

Scenario 2: Data association and unified display across multi-cloud/hybrid-cloud networks and heterogeneous tracing frameworks

In a multi-cloud/hybrid-cloud architecture, different clusters or applications may choose different distributed tracing technologies, for example A using Jaeger and B using Zipkin. The data formats reported by different tracing protocols are incompatible with each other, so spans cannot be stitched into a single trace, which greatly reduces the efficiency of end-to-end diagnosis.

Through ARMS Satellite, traces from different sources can be uniformly converted into the OpenTelemetry Trace format and reported to a unified server for processing and storage. Users can then easily query and analyze data jointly across networks and heterogeneous tracing frameworks.
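The kind of conversion described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ARMS Satellite implementation: the target dictionary is a simplified stand-in rather than the actual OpenTelemetry data model, the input field names follow the commonly seen Zipkin v2 JSON and Jaeger JSON layouts in reduced form (the inline "process" object on the Jaeger span is an assumption), and the function names are hypothetical.

```python
from typing import Any, Dict, Optional


def _norm_trace_id(trace_id: str) -> str:
    """Lower-case the ID and left-pad 64-bit IDs to 128 bits (32 hex chars)."""
    return trace_id.lower().zfill(32)


def _unified(trace_id: str, span_id: str, parent: Optional[str], name: str,
             service: str, start_us: int, duration_us: int,
             attributes: Dict[str, Any]) -> Dict[str, Any]:
    # Simplified common shape (an assumption), not the real OTLP schema.
    return {
        "trace_id": _norm_trace_id(trace_id),
        "span_id": span_id.lower(),
        "parent_span_id": parent.lower() if parent else None,
        "name": name,
        "service": service,
        "start_us": start_us,
        "duration_us": duration_us,
        "attributes": attributes,
    }


def from_zipkin(span: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a Zipkin v2 style JSON span (tags are a flat dict)."""
    return _unified(
        span["traceId"], span["id"], span.get("parentId"),
        span.get("name", ""),
        span.get("localEndpoint", {}).get("serviceName", "unknown"),
        span.get("timestamp", 0), span.get("duration", 0),
        dict(span.get("tags", {})),
    )


def from_jaeger(span: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a Jaeger style JSON span (tags are a list of key/value items)."""
    parent = next((ref["spanID"] for ref in span.get("references", [])
                   if ref.get("refType") == "CHILD_OF"), None)
    return _unified(
        span["traceID"], span["spanID"], parent,
        span.get("operationName", ""),
        span.get("process", {}).get("serviceName", "unknown"),
        span.get("startTime", 0), span.get("duration", 0),
        {tag["key"]: tag["value"] for tag in span.get("tags", [])},
    )
```

Once both sources share one shape and one trace ID format, spans emitted by a Jaeger-instrumented service and a Zipkin-instrumented service can be grouped by trace_id and linked through parent_span_id, provided the trace context was actually propagated between them.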

Scenario 3: Trace sampling plus lossless statistics for low-cost, accurate application monitoring and alerting

The daily call volume of a production system can reach the hundred-million level, and reporting and storing every call chain in full is a significant expense, so sampling call chains before storage is a good choice. However, traditional trace sampling significantly reduces the accuracy of the metrics derived from traces. For example, after 10% sampling of one million real calls, statistics computed from the remaining 100,000 calls show significant "sample skew", which ultimately leads to a high false alarm rate and makes monitoring alerts essentially unusable.

ARMS Satellite supports lossless statistics for Trace data. It automatically pre-aggregates the received Trace data locally and only samples and reports traces after accurate statistical results have been computed. This reduces network overhead and persistent storage costs while keeping application monitoring and alerting metrics accurate. The default integrated Satellite APM Dashboard is shown in the following figure.
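The order of operations described above (aggregate first, sample second) can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the ARMS Satellite implementation: the class name PreAggregatingSampler, the span fields, and the CRC32-based sampling decision are all hypothetical choices made for the example.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class PreAggregatingSampler:
    """Update exact statistics from every span, then keep only a sample."""

    def __init__(self, sample_rate: float) -> None:
        self.sample_rate = sample_rate
        # Exact per-(service, operation) statistics, independent of sampling.
        self.stats: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(
            lambda: {"count": 0, "errors": 0, "total_ms": 0.0}
        )
        self.sampled: List[Dict[str, Any]] = []

    def _keep(self, trace_id: str) -> bool:
        # Deterministic decision per trace ID, so all spans of one trace
        # share the same fate (hashing scheme is an assumption).
        bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
        return bucket < self.sample_rate * 10_000

    def on_span(self, span: Dict[str, Any]) -> None:
        # Step 1: aggregate every span before any sampling happens.
        entry = self.stats[(span["service"], span["operation"])]
        entry["count"] += 1
        entry["errors"] += 1 if span["error"] else 0
        entry["total_ms"] += span["duration_ms"]
        # Step 2: only then decide whether the span detail is reported.
        if self._keep(span["trace_id"]):
            self.sampled.append(span)


if __name__ == "__main__":
    sampler = PreAggregatingSampler(sample_rate=0.1)
    for i in range(10_000):
        sampler.on_span({
            "trace_id": f"trace-{i}", "service": "order", "operation": "create",
            "duration_ms": 20.0, "error": i % 100 == 0,
        })
    print(sampler.stats[("order", "create")], len(sampler.sampled))
```

Because the counters are updated before the sampling decision, the request count, error count, and average latency stay exact at any sampling rate, while only the sampled span details need to be shipped and stored.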

Summary

The OPLG system has a mature and vibrant open source community ecosystem and has been tested in a large number of enterprise production environments. It is currently a popular choice for building a new generation of unified cloud-native observability platforms. However, OPLG only provides a technical system. How to apply it flexibly, solve practical problems, and distill best practices for general industries or scenarios still needs to be explored together.
