OPLG: Best Practices for New-Generation Cloud-Native Observability

By Xia Ming (Yahai)

What Is OPLG?

With the rise of cloud-native architectures, the boundaries of observability and the division of responsibilities are being redefined. The traditional layered monitoring boundaries between containers, applications, and business are breaking down, and the division of labor among Dev, Ops, and Sec is blurring. The IT system is an organic whole, so monitoring and diagnosing its status requires an integrated approach. After years of exploration and practice, a new generation of cloud-native observability systems based on OPLG has become a popular choice for communities and enterprises.

OPLG refers to the unified display of (O)penTelemetry Traces, (P)rometheus Metrics, and (L)oki Logs through (G)rafana Dashboards, which covers most enterprise-level monitoring and analysis scenarios, as shown in the following figure (images from Grafana Labs on YouTube). Based on the OPLG system, you can quickly build a unified observability platform covering the full stack of cloud-native applications. The platform can comprehensively monitor infrastructure, containers, middleware, applications, and user experience, and it integrates traces, metrics, logs, and events to support both stable O&M and business analysis.

OPLG Self-Built Solution

Xiao Ming joined a fashion brand buyer company that helps young people find good-quality fashion goods. As the business grew, system stability and business analysis placed ever higher demands on global observability, because failures in the underlying system directly affect revenue and customer satisfaction. Therefore, Xiao Ming's IT department built a brand-new observability platform on the OPLG system, which offers fast access, flexible extension, seamless migration, and heterogeneous convergence.

OPLG Benefits

Fast Access: Thanks to the many mature open-source SDKs, agents, and exporters provided by the OpenTelemetry and Prometheus communities, you can quickly enable tracing analysis and metric monitoring for mainstream components and frameworks without extensive code changes.
Flexible Extension: The flexible PromQL and LogQL query syntax, combined with Grafana's rich dashboard customization features, meets the individual observability requirements of different business and O&M teams.
Seamless Migration: For reasons such as data security and overseas business expansion, the components embedded in the observability platform and its custom dashboards can be migrated seamlessly between cloud service providers. Unlike commercial dashboards, Grafana can integrate multiple data sources, so dashboards can be migrated freely end to end.
Heterogeneous Convergence: Applications in different languages (such as Java, Go, and Node.js) and observability data from multi-cloud environments can be interconnected and displayed in a unified manner.
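
To make the PromQL/LogQL point more concrete, here is a minimal sketch of how a client might build query URLs for the Prometheus instant-query endpoint (/api/v1/query) and the Loki range-query endpoint (/loki/api/v1/query_range). The host names, the metric http_requests_total, and the label app="checkout" are hypothetical placeholders, not values from this article. The sketch only builds the URLs and sends no requests.

```python
from urllib.parse import urlencode

# Hypothetical endpoints; replace with your own Prometheus and Loki addresses.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
LOKI_URL = "http://loki.example.internal:3100"


def prometheus_instant_query_url(promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def loki_range_query_url(logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a URL for the Loki range-query endpoint (timestamps in Unix nanoseconds)."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


# PromQL: per-second request rate over 5 minutes, grouped by HTTP status.
promql = 'sum by (status) (rate(http_requests_total{app="checkout"}[5m]))'
# LogQL: log lines of the same (hypothetical) app that contain "error".
logql = '{app="checkout"} |= "error"'

print(prometheus_instant_query_url(promql))
print(loki_range_query_url(logql, 1_700_000_000_000_000_000, 1_700_000_300_000_000_000))
```

In a Grafana dashboard the same two expressions would typically be entered directly in panels backed by a Prometheus and a Loki data source, so a shared label such as app lets metrics and logs be viewed side by side.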

OPLG Challenges

The OPLG system has many advantages, but enterprises that build it themselves also face multiple challenges. In particular, as usage deepens, many non-functional issues become prominent (such as large-scale O&M, performance, and cost).

Large-Scale Upgrade and Configuration of Components: Managing client probes at scale is a nightmare for the O&M team, and faults caused by probe exceptions are common. In addition, emergency measures (such as dynamic configuration push-down and feature degradation) usually require enterprises to build, develop, and manage their own configuration center.
Full Collection and Storage Cost of Traces: The average daily call volume of the production systems of medium and large enterprises can reach hundreds of millions. Reporting and storing every trace is expensive. Which traces are the best to sample? How can we solve the problem of inaccurate monitoring and alerting caused by trace sampling?
Large-Volume Query Performance of Metrics: The more metrics a query scans, the worse its performance. When the query time range exceeds one week or one month, queries often stall or even fail to return results. APM metrics often produce too many time series because of URL/SQL divergence, which overwhelms the storage and query layers.
Massive Alert Scheduling Latency and Performance: Each alert rule represents a periodic polling task. When the number of alert rules reaches the thousands (or even millions), alerts are often delayed or cannot be sent at all, and the best time for troubleshooting is missed.
Weak Component Disaster Recovery: Multi-region/zone disaster recovery is an important means of ensuring high service availability. However, because building disaster recovery capabilities requires investment in both technology and resources, many self-built enterprise systems have no disaster recovery capabilities.
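
The URL divergence problem mentioned in the list above can be illustrated with a small sketch. It assumes a simple rule set (purely numeric path segments and long hex or UUID-like segments are identifiers) and collapses them into a placeholder so that similar URLs share one metric label. The function converge_url and its rules are hypothetical and only show the general idea, not how any particular APM product converges metrics.

```python
import re
from collections import Counter

# Assumed rules: purely numeric segments and long hex/UUID-like segments are
# treated as identifiers and replaced with a placeholder.
_NUMERIC = re.compile(r"^\d+$")
_HEX_ID = re.compile(r"^[0-9a-fA-F-]{16,}$")


def converge_url(path: str) -> str:
    """Collapse identifier-like path segments so similar URLs share one label."""
    path = path.split("?", 1)[0]  # drop the query string entirely
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment) or _HEX_ID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


raw_paths = [
    "/orders/1001",
    "/orders/1002?expand=items",
    "/orders/1003/items/7",
    "/users/550e8400-e29b-41d4-a716-446655440000/profile",
    "/health",
]
# Each distinct converged path would become one time series label value.
series = Counter(converge_url(p) for p in raw_paths)
print(series)
```

Without convergence, every order ID or user ID would create its own time series. With it, the number of series is bounded by the number of route shapes rather than by the number of entities.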

These are all classic problems that enterprises face when building their own observability systems. The performance and availability problems caused by scale take substantial R&D investment and time to resolve, which increases O&M costs. Therefore, more and more enterprises choose to host the observability backend with a cloud service provider. This lets them enjoy the technical dividends of open-source solutions while obtaining continuous and stable service guarantees.

OPLG Hosting Solution

The IT department where Xiao Ming works decided to adopt the OPLG hosting solution provided by Alibaba Cloud ARMS to reduce O&M costs and provide observability services stably. This solution provides high-performance, highly available, and O&M-free backend services while retaining the advantages of the open-source solution, helping Xiao Ming's team solve the problem of large-scale O&M in massive data scenarios. In addition, the coverage and integration of observability data are improved through eBPF network detection, Satellite edge computing, and Insights intelligent diagnosis, as shown in the following figure.

High Performance: Techniques such as lossless statistics, data compression, connection optimization, automatic convergence of divergent metrics, and downsampling reduce performance overhead in massive data scenarios.
High Availability: The client supports resource quotas and automatic throttling protection to keep clusters stable under high load. Backend services support elastic horizontal scale-out and multi-region disaster recovery to maximize service availability.
Flexibility and Ease of Use: Upgrades of the JavaAgent, OT Collector, and Grafana Dashboard Templates are hosted, so they adapt to updates automatically without user management. Dynamic configuration push-down is also supported, which allows real-time adjustment of parameters (such as traffic switches, the trace sampling rate, interface filtering, and convergence rules).
Network Detection: eBPF is used to analyze network requests non-intrusively, automatically parse network protocols, and build a network topology that shows network performance between specific containers or between containers and specific cloud product instances.
Edge Computing: The Satellite (OpenTelemetry Collector) provides edge collection and computing capabilities for observability data in user clusters. A standardized data format and unified data labels effectively improve the correlation among traces, metrics, and logs.
Intelligent Diagnosis: Drawing on a domain knowledge base and algorithm models accumulated over the years, the service regularly inspects for common online faults (such as slow SQL and uneven traffic) and automatically provides specific root cause analysis and suggestions.

ARMS for OpenTelemetry Satellite

Communities (such as OpenTelemetry and SkyWalking) have been developing Satellite solutions for edge-cluster collection and computing over the past two years. ARMS for OpenTelemetry Satellite (ARMS Satellite) is a unified edge-side collection and processing platform for observability data (traces, metrics, and logs) built on the OpenTelemetry Collector. It is safe, reliable, easy to use, and suitable for production environments.

ARMS Satellite enables the standardization of multi-source heterogeneous data through data collection, processing, caching, and routing at the edge. It enhances the correlation among the observability data of traces, metrics, and logs. It supports lossless statistics and reduces the cost of data reporting and persistent storage.

Applicable Scenario One: One-Click Collection and Analysis of Panoramic Monitoring Data (Container Environment)

ARMS Satellite is deeply integrated with the Alibaba Cloud Kubernetes monitoring component and the Prometheus monitoring component in the Container Service ACK environment. After the one-click installation completes, data on the Kubernetes container resource layer and on network performance is collected automatically. This data, together with the application data reported by users (which only requires changing the endpoint, not the code) and the automatically pre-aggregated metrics, is reported to the fully hosted server-side data center and then displayed through Grafana. The result is panoramic monitoring data collection and analysis covering applications, containers, networks, and cloud components.

Applicable Scenario Two: Multi-Cloud/Hybrid Cloud Network, Heterogeneous Tracing Framework Data Association, and Unified Display

In a multi-cloud or hybrid cloud architecture, different clusters or applications may choose different tracing technologies. For example, A uses Jaeger and B uses Zipkin. The data formats reported through different tracing protocols are incompatible with each other, so the spans cannot be joined into a single trace, which reduces the efficiency of end-to-end trace diagnosis.

ARMS Satellite can convert traces from different sources into the OpenTelemetry Trace format and report them to a unified server for processing and storage. Users can easily query and analyze federated data across networks or heterogeneous trace frameworks.
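
The kind of conversion described above can be sketched as follows. This is a simplified illustration, not the ARMS Satellite or OpenTelemetry Collector implementation: it maps a Zipkin v2 JSON span and a Jaeger JSON span (in the shape returned by the Jaeger query API) onto one hypothetical common dictionary whose field names loosely follow OpenTelemetry conventions. The helper names from_zipkin and from_jaeger and the sample spans are assumptions made for illustration.

```python
from typing import Any, Dict


def from_zipkin(span: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Zipkin v2 JSON span onto the simplified common shape."""
    start_us = span["timestamp"]  # Zipkin reports microseconds
    return {
        # Left-pad 64-bit trace IDs so they compare equal to 128-bit ones.
        "trace_id": span["traceId"].lower().zfill(32),
        "span_id": span["id"].lower(),
        "parent_span_id": span.get("parentId", "").lower(),
        "name": span.get("name", ""),
        "service": span.get("localEndpoint", {}).get("serviceName", "unknown"),
        "start_unix_nano": start_us * 1000,
        "end_unix_nano": (start_us + span.get("duration", 0)) * 1000,
        "attributes": dict(span.get("tags", {})),
    }


def from_jaeger(span: Dict[str, Any], service: str) -> Dict[str, Any]:
    """Map a Jaeger JSON span onto the same shape (service name passed in)."""
    parent = next(
        (r["spanID"] for r in span.get("references", []) if r.get("refType") == "CHILD_OF"),
        "",
    )
    start_us = span["startTime"]  # Jaeger JSON also uses microseconds
    return {
        "trace_id": span["traceID"].lower().zfill(32),
        "span_id": span["spanID"].lower(),
        "parent_span_id": parent.lower(),
        "name": span.get("operationName", ""),
        "service": service,
        "start_unix_nano": start_us * 1000,
        "end_unix_nano": (start_us + span.get("duration", 0)) * 1000,
        "attributes": {t["key"]: t["value"] for t in span.get("tags", [])},
    }


# Hypothetical spans: service A reports via Zipkin, service B via Jaeger.
zipkin_span = {
    "traceId": "5AF7183FB1D4CF5F", "id": "352BFF9A74CA9AD2", "name": "get /orders",
    "timestamp": 1700000000000000, "duration": 2500,
    "localEndpoint": {"serviceName": "service-a"}, "tags": {"http.method": "GET"},
}
jaeger_span = {
    "traceID": "5af7183fb1d4cf5f", "spanID": "6b221d5bc9e6496c",
    "operationName": "query-stock", "startTime": 1700000000000500, "duration": 1200,
    "references": [{"refType": "CHILD_OF", "traceID": "5af7183fb1d4cf5f",
                    "spanID": "352bff9a74ca9ad2"}],
    "tags": [{"key": "db.system", "type": "string", "value": "mysql"}],
}

unified = [from_zipkin(zipkin_span), from_jaeger(jaeger_span, "service-b")]
trace_ids = {s["trace_id"] for s in unified}
print(trace_ids)  # both spans now share one normalized trace ID
```

Once IDs, timestamps, and attributes share one representation, the Jaeger span can be attached to its Zipkin parent, which is what makes cross-framework trace queries possible.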

Applicable Scenario Three: Trace Sampling + Lossless Statistics to Achieve Accurate Statistics of Application Monitoring Alerts at Low Costs

The average daily call volume of a production system can be on the order of 100 million, so reporting and storing every trace is expensive, and sampling traces before storage is a good choice. However, traditional trace sampling significantly reduces the accuracy of trace-derived statistical metrics. For example, if only 10% of one million real calls are retained after sampling, statistics computed from the retained calls show obvious sample skew. This leads to a false alarm rate too high for monitoring and alerting to be usable.

ARMS Satellite supports lossless statistics on trace data: it automatically pre-aggregates the received trace data locally and performs trace sampling and reporting only after accurate statistical results have been obtained. This reduces network overhead and persistent storage costs while ensuring the accuracy of application monitoring and alert metrics. The following figure shows the Satellite APM Dashboard that is integrated by default.
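
The aggregate-then-sample order described in this scenario can be sketched as follows. This is a minimal illustration under stated assumptions, not the ARMS Satellite implementation: the Span and Stats types, the process function, and the deterministic "keep one trace in N" sampler are all hypothetical, and the error pattern in the demo data is deliberately contrived so that the sample misses every error.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Span:
    trace_id: int
    service: str
    duration_ms: float
    is_error: bool


@dataclass
class Stats:
    calls: int = 0
    errors: int = 0
    total_ms: float = 0.0


def aggregate(spans: Iterable[Span]) -> Dict[str, Stats]:
    """Compute per-service call count, error count, and total latency."""
    stats: Dict[str, Stats] = {}
    for span in spans:
        s = stats.setdefault(span.service, Stats())
        s.calls += 1
        s.errors += int(span.is_error)
        s.total_ms += span.duration_ms
    return stats


def sample(spans: Iterable[Span], keep_one_in: int) -> List[Span]:
    """Deterministic sampler (an assumption): keep trace IDs divisible by N."""
    return [s for s in spans if s.trace_id % keep_one_in == 0]


def process(spans: Iterable[Span], keep_one_in: int) -> Tuple[Dict[str, Stats], List[Span]]:
    """Aggregate over ALL spans first, then sample what gets stored."""
    all_spans = list(spans)
    return aggregate(all_spans), sample(all_spans, keep_one_in)


# Contrived demo data: 1,000 calls, 20 of which are errors.
spans = [Span(i, "checkout", 10.0, is_error=(i % 50 == 7)) for i in range(1000)]

exact, kept = process(spans, keep_one_in=10)
from_sample = aggregate(kept)  # what statistics on sampled data alone would see

print("exact calls/errors:  ", exact["checkout"].calls, exact["checkout"].errors)
print("sampled calls/errors:", from_sample["checkout"].calls, from_sample["checkout"].errors)
```

In this contrived run, the pre-aggregated statistics report all 1,000 calls and 20 errors, while statistics computed only from the 100 retained spans report no errors at all. Only the sampled spans need to be reported and stored in full; the aggregated metrics are small and stay accurate regardless of the sampling rate.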

Summary

The OPLG system has a mature and vibrant open-source community ecosystem and has been tested in a large number of enterprise production environments. It is a popular choice for building a new generation of unified cloud-native observability platforms. However, OPLG only provides a technical system. How to use it flexibly, solve practical problems, and distill best practices for common industries or scenarios still needs to be explored together.