This topic compares multiple monitoring and analysis platforms and provides O&M and site reliability engineering (SRE) teams with multiple solutions to build monitoring and analysis platforms based on their business requirements.

Background information

O&M and SRE teams play key roles and undertake multiple complex tasks, such as application deployment, performance and availability monitoring, alert monitoring, shift scheduling, capacity planning, and business support. The rapid development of cloud native services, containerized services, and microservices has accelerated the iteration of related features and posed the following challenges to O&M and SRE teams:
  • Widespread business lines
    • Business lines are widely spread to clients, frontend web applications, and backend applications.
    • Up to dozens of business lines need to be supported at the same time.
  • Manpower shortage

    In some companies, the O&M and SRE team is only 1% the size of the R&D team, or smaller.

  • High pressure on online service stability
    • O&M and SRE teams must frequently respond to emergencies and fix urgent issues.
    • Complex business processes and the tremendous number of system components have put great pressure on service troubleshooting and recovery.
  • Dispersed and inefficient monitoring and analysis platforms
    • Various types of data are monitored in different dimensions. This results in excessive scripts, monitoring tools, and data silos.
    • Various types of data are stored in different monitoring systems that do not support associated analysis. As a result, root causes cannot be located in an efficient manner.
    • Each system may have thousands of alert rules. This increases management costs and results in excessive alerts. In addition, the specified conditions in an alert rule may be incorrectly or incompletely evaluated.

To address the preceding challenges, O&M and SRE teams require an easy-to-use and powerful monitoring and analysis platform. Such a platform helps the teams analyze data efficiently, improve work efficiency, locate root causes quickly and accurately, and ensure service continuity.

Data-related issues to be fixed by a monitoring and analysis system

To ensure service stability and support business development, O&M and SRE teams need to collect and analyze large amounts of data. The data includes hardware and network metrics, business data, and user behavior data. After data is collected, O&M and SRE teams also need a suitable system to convert, store, process, and analyze the data based on business requirements. O&M and SRE teams face the following data-related issues:
  • Diversified data types
    • System data: hardware metrics related to CPU, memory, network, and disk, and system logs
    • Golden signals: latency, traffic, errors, and saturation
    • Access logs
    • Application logs: Java application logs and error logs
    • User behavior data: website clicks
    • Tracking points in apps: statistics of tracking points in Android and iOS apps
    • Framework data: Kubernetes framework data
    • Call chains: trace data
  • Diversified requirements
    O&M and SRE teams use the preceding types of data to ensure service stability. The teams also need to support other business teams based on the following data management requirements:
    • Monitoring and alerting: Small amounts of stream data can be processed within seconds or minutes.
    • Customer service and troubleshooting: Data can be queried by keyword within seconds.
    • Risk management: Traffic can be processed within seconds.
    • Operations and data analysis: Large-scale data can be analyzed within seconds to hours. For example, data can be analyzed in online analytical processing (OLAP) scenarios.
  • Difficult resource estimation
    The data volume of a fast-growing business is difficult to estimate at an early stage for the following reasons:
    • No reference can be used to estimate the data volume of newly connected data sources.
    • The rapid business growth results in data surges.
    • The previous data storage methods and data retention periods no longer meet the changing data management requirements.

Solutions to build a monitoring and analysis platform

O&M and SRE teams need a monitoring and analysis platform to process data that is collected from various sources in different formats. To meet diversified business requirements, the O&M and SRE teams may need to use and maintain multiple systems based on the following open source services:

  • Telegraf+InfluxDB+Grafana

    Telegraf is a lightweight, plugin-driven server agent that collects metric data from operating systems, databases, and middleware. The collected time series data is written to InfluxDB, which stores and analyzes the data. Grafana then renders the analysis results as charts and supports interactive queries.

  • Prometheus

    Prometheus is a foundational tool for processing time series data in the Kubernetes cloud native ecosystem. Prometheus ingests metric data that is flexibly collected by using exporters, and it can connect to Grafana for data visualization.

  • ELK

    The Elasticsearch, Logstash, and Kibana (ELK) stack is one of the most commonly used open source stacks for multidimensional log query and analysis. The ELK stack provides fast, flexible, and powerful query capabilities to meet most query requirements of R&D, O&M, and customer service teams.

  • Tracing systems

    In a distributed system built on microservices, call chains are complex. Without a suitable tracing system, root causes of errors cannot be located efficiently. Tracing systems such as Zipkin, Jaeger, OpenTelemetry, and SkyWalking are applicable solutions. OpenTelemetry is the industry standard and SkyWalking is a tracing system that originated in China. However, these tracing systems do not provide data storage components and must be used together with Elasticsearch or Cassandra to store trace data.

  • Kafka+Flink

    To meet the requirements for data cleansing and risk management, a platform is required to support real-time stream processing. In most cases, the platform is built by combining Kafka and Flink.

  • ClickHouse, Presto, and Druid

    In scenarios where operational analytics and data analysis reports are required, data is often imported to an OLAP engine for higher capabilities in real-time data analysis. This way, large amounts of data can be analyzed and various ad hoc queries can be performed within seconds to minutes.

Different system components are used to process different types of data. The data is distributed to these components or a data copy is stored in multiple systems, increasing system complexity and usage costs.

When the amount of data increases, O&M and SRE teams face big challenges in terms of component stability, system performance, cost control, and support for a large number of business lines.

Challenges for monitoring and analysis platforms

To maintain multiple systems and support various business lines, O&M and SRE teams must address challenges in the following aspects:

  • System stability
    • System dependencies: Data is distributed to multiple systems that depend on each other. If one system has an issue, the associated systems are affected. For example, if writes to the downstream Elasticsearch system slow down, the storage usage of the Kafka clusters that cache the data rises, and the Kafka clusters may run out of space for incoming data.
    • Traffic bursts: Traffic bursts frequently occur on the Internet. Traffic bursts also occur when large amounts of data are written to a system. O&M and SRE teams must ensure that the system runs as expected and that read and write operations are not affected by traffic bursts.
    • Resource isolation: Data can be stored on separate physical storage resources based on priority. However, if data is isolated by using only this method, a large number of cluster resources are wasted and O&M costs increase greatly. If data instead shares cluster resources, the interference between workloads must be minimized. For example, a single query on a large amount of data may bring down all clusters of the system.
    • Technical issues: O&M and SRE teams need to tune a large number of parameters for multiple systems and spend a long time comparing and adjusting optimization solutions to meet various requirements in different scenarios.
  • Predictable performance
    • Data size: The size of data in a system has a significant impact on the system performance. For example, O&M and SRE teams need to predict whether tens of millions to hundreds of millions of rows of time series data can be read from and written to a system, and whether 1 billion to 10 billion rows of data can be queried in Elasticsearch.
    • Quality of service (QoS) control: Each system has limited resources, which must be allocated and managed based on the QPS and concurrent requests for different data types. In some cases, service must be degraded for some requests to prevent other performance metrics from declining. However, most open source components are developed without considering QoS control.
  • Cost control
    • Resource costs: The deployment of each system component consumes hardware resources. If a data copy exists in multiple systems at the same time, large amounts of hardware resources are consumed. In addition, the amount of business data is difficult to estimate. In most cases, it is overestimated, which wastes resources.
    • Data access costs: O&M and SRE teams require a tool or platform to support automatic access to business data. The tool or platform can help O&M and SRE teams adapt data formats, manage environments, maintain configurations, and estimate resources. This way, O&M and SRE teams can focus on more important business.
    • Support costs: Technical support is required to resolve issues that may occur when multiple systems are used. These issues include unsuitable usage modes and invalid parameter settings. Bugs in open source software may result in additional costs.
    • O&M costs: O&M and SRE teams need to spend a long time fixing hardware and software issues of each system. They also need to replace hardware, scale the system capacity up or down, and upgrade software.
    • Cost sharing: To effectively utilize resources and make strategic budgets and plans, O&M and SRE teams must estimate resource consumption based on actual business lines as accurately as possible. This requires a monitoring and analysis platform to support separate billing based on effective data metering.
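
The Kafka and Elasticsearch dependency described under system stability can be illustrated with a rough calculation. The following Python sketch uses a deliberately simplified queue model: it assumes that data the downstream system has not yet consumed must stay on the Kafka disks, and it ignores Kafka retention policies, replication, and compression. The function name and all figures are hypothetical and do not come from the scenario in this topic.

```python
import math


def hours_until_buffer_full(free_disk_gb: float,
                            ingest_gb_per_hour: float,
                            drain_gb_per_hour: float) -> float:
    """Estimate how long a buffer can absorb a slow downstream consumer.

    Simplified model (an assumption, not Kafka's actual storage behavior):
    the backlog grows by the difference between the write rate and the
    rate at which the downstream system drains data.
    """
    growth = ingest_gb_per_hour - drain_gb_per_hour
    if growth <= 0:
        # The consumer keeps up, so the backlog never grows.
        return math.inf
    return free_disk_gb / growth


# Hypothetical figures: 400 GB/h written, downstream slowed to 300 GB/h,
# 2,000 GB of free disk space left for the backlog.
slow_consumer = hours_until_buffer_full(2000, 400, 300)
healthy_consumer = hours_until_buffer_full(2000, 400, 400)
print(f"slow consumer: {slow_consumer:.1f} h, healthy consumer: {healthy_consumer}")
```

Even under this rough model, a modest slowdown in the downstream system translates into a fixed deadline for the upstream buffer, which is why the systems cannot be operated independently.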

Example scenario

Background information
  • In this example, a company has developed more than 100 applications, and each application generates NGINX access logs and Java application logs.
  • The log size of each application varies greatly, from 1 GB to 1 TB per day. All the applications together generate a total of 10 TB of logs per day. The data retention period of each application ranges from 7 to 90 days, and the average data retention period is 15 days.
  • Log data is used for business monitoring and alerting, online troubleshooting, and real-time risk management.
Business architecture
  • Beats: collects data and sends the data to Kafka in real time.
  • Kafka: temporarily stores data and delivers it to Flink for real-time consumption and to Elasticsearch.
  • Flink: monitors data, analyzes data, and manages risks in real time.
  • Elasticsearch: stores and indexes logs for query, analysis, and troubleshooting.
The preceding architecture describes how data is processed in a monitoring and analysis platform. The following workloads must be taken into consideration when the O&M and SRE team of the company creates solutions based on Elasticsearch.
  • Capacity planning: Disk capacity = Size of raw data × Expansion coefficient × (1 + Number of replicas) × (1 + Percentage of reserved space). The expansion coefficient ranges from 1.1 to 1.3. The number of replicas is 1. The percentage of reserved space for temporary files is 25%. In this case, the actual disk capacity is 2.75 to 3.25 times the size of raw data (1.1 × 2 × 1.25 = 2.75 and 1.3 × 2 × 1.25 = 3.25). If the _all field is enabled, more reserved space is required for data expansion.
  • Cold and hot data isolation: If all data is stored on SSDs, the storage cost is high. Some indexed data can be stored on HDDs based on the priority and timestamp of the indexed data. The indexed data can also be migrated by using the rollover feature of Elasticsearch.
  • Index setting: The NGINX access logs and Java application logs of each application are periodically indexed in chronological order. The number of shards is specified based on the actual scenario and the storage size of each shard can be 30 GB to 50 GB. However, the amount of log data of each application is difficult to accurately estimate. If issues occur when data is written and queried, the number of shards may need to be adjusted and the reindexing process consumes more resources.
  • Kafka consumer setting: Logstash is used to consume data from Kafka and write the data to Elasticsearch. During this process, the number of partitions of each Kafka topic must match the value of the consumer_threads parameter of the Logstash Kafka input. Otherwise, data cannot be evenly consumed from each partition.
  • Parameter optimization for Elasticsearch: The O&M and SRE team must first evaluate the performance of Elasticsearch in write throughput, latency visibility, data security, and data query. Then, the team can make full use of Elasticsearch by optimizing the related parameters based on the CPU and memory of clusters. Common parameters include the number of threads, memory control, translog settings, queue length, the intervals between various operations, and merge scheduling.
  • Memory: The JVM heap size must be kept below 32 GB, and the remaining memory is used as the cache memory of the operating system. Frequent garbage collection (GC) operations affect application performance and may make the Java application unavailable.
    • The memory usage of a master node is related to the number of shards in a cluster. In most cases, the number of shards in the cluster is less than 100,000. By default, Elasticsearch limits the number of shards on each data node to 1,000. The numbers of indexes and shards need to be specified based on the actual scenario.
    • The memory of a data node depends on the size of index data. The finite-state transducer (FST) of Elasticsearch resides in memory for a long period of time. Elasticsearch 7.3 and later can load the FST off-heap by using the mmap method, but query performance is reduced if the system reclaims the cache memory. If an earlier version is used, the data size of a node must be controlled.
  • Query and analysis: The O&M and SRE team must spend a long time continuously testing the data query and analysis performance of Elasticsearch by trial and error.
    • The mappings must be configured based on the actual scenario. For example, nested mappings need to be minimized when the text and keyword fields are specified.
    • Queries of large amounts of data or complex query statements (for example, deeply nested GROUP BY clauses) must be avoided to prevent sharp consumption of system resources. To prevent the out of memory (OOM) issue that may occur when an aggregate query or fuzzy search is performed on a large data set, the size of the data set must be limited.
    • The number of segments needs to be controlled. The force merge API can be called as needed. In addition, the amount of disk I/O and resource consumption after a force merge needs to be evaluated.
    • A filter context or a query context must be selected based on the actual scenario. In scenarios where relevance scoring is not required, a filter context is more efficient than a query context because it can use the query cache feature of Elasticsearch.
    • If scripts are used to query and analyze data, system instability or other performance issues may occur.
    • If the routing parameter is used to route search requests to a specific shard, the performance can be improved.
  • Data corruption: If a crash occurs, files may be corrupted. If a segment or translog file is corrupted, the data in the related shard cannot be loaded, and damaged data must be cleaned up manually or by using tools. However, data loss occurs during this process.
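
The capacity planning formula and the shard size guideline above can be worked through for the example company. The following Python sketch is an estimate under stated assumptions: it applies the 15-day average retention period uniformly to the 10 TB daily volume (this topic does not say whether the average is weighted by volume), and the 240 GB daily index used for the shard calculation is a hypothetical figure. The function names are illustrative only.

```python
import math


def disk_capacity(raw_size: float, expansion: float,
                  replicas: int, reserved_ratio: float) -> float:
    """Disk capacity = raw size x expansion x (1 + replicas) x (1 + reserved)."""
    return raw_size * expansion * (1 + replicas) * (1 + reserved_ratio)


def primary_shards(index_size_gb: float, target_shard_gb: float) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


# Values from the scenario: 10 TB of logs per day, 15 days average retention,
# expansion coefficient 1.1 to 1.3, 1 replica, 25% reserved space.
raw_tb = 10 * 15
low_tb = disk_capacity(raw_tb, 1.1, 1, 0.25)
high_tb = disk_capacity(raw_tb, 1.3, 1, 0.25)
print(f"raw data retained: {raw_tb} TB")
print(f"disk capacity needed: {low_tb:.1f} TB to {high_tb:.1f} TB")

# Hypothetical application whose daily index occupies 240 GB on disk,
# with each shard kept between 30 GB and 50 GB.
daily_index_gb = 240
fewest = primary_shards(daily_index_gb, 50)
most = primary_shards(daily_index_gb, 30)
print(f"primary shards per daily index: {fewest} to {most}")
```

Because the daily log size of an application can range from 1 GB to 1 TB, the same calculation yields very different shard counts per application, which is why a single index template rarely fits all of them.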

The preceding issues may occur when the O&M and SRE team uses and maintains an Elasticsearch cluster. If the data size increases to hundreds of terabytes and a large number of data access requests occur, the stability of the cluster is difficult to maintain. The same issues also exist in other systems. Therefore, the O&M and SRE team must spend more time fixing these issues.
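
As an illustration of the filter context item in the query and analysis list, the following Python sketch builds an Elasticsearch search body in which exact-match and time-range conditions are placed in the filter clause of a bool query, so that they are not scored and are eligible for caching. The field names status, @timestamp, and message and the function name are assumptions for this example, not names taken from the scenario.

```python
import json
from typing import Optional


def build_log_search(status: int, start: str, end: str,
                     keyword: Optional[str] = None) -> dict:
    """Build a bool query that keeps exact-match conditions in filter context."""
    bool_query: dict = {
        "filter": [
            # Filter context: yes/no matching, no relevance score.
            {"term": {"status": status}},
            {"range": {"@timestamp": {"gte": start, "lt": end}}},
        ]
    }
    if keyword:
        # Full-text matching runs in query context and is scored.
        bool_query["must"] = [{"match": {"message": keyword}}]
    return {"query": {"bool": bool_query}}


# Hypothetical search: HTTP 500 responses in the last 15 minutes
# whose message mentions "timeout".
body = build_log_search(500, "now-15m", "now", keyword="timeout")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the request body of a search call. Which client library sends it is outside the scope of this sketch.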

Solution for an integrated cloud service

To meet the requirements of O&M and SRE teams for a monitoring and analysis platform, the Alibaba Cloud Log Service team provides a high-performance and cost-effective solution that is easy to use, stable, and reliable. This solution helps O&M and SRE teams fix the issues that may occur when they build the platform, and improves work efficiency. In the early stage of its development, Log Service was an internal logging system that supported only Alibaba Group and Ant Group. After years of evolution, Log Service has become a cloud native observability and analysis platform that supports petabytes of data of multiple types, such as logs, metrics, and traces.

  • Simple data import methods
    • Logtail: Logtail has proven easy to use, reliable, and powerful in deployments on millions of servers. Logtail can be managed in the Log Service console.
    • SDKs or Producers: Data from mobile terminals can be collected by using SDK for Java, C, Go, iOS, or Android or by using the web tracking feature.
    • Cloud native data collection: Custom Resource Definitions (CRDs) can be created to collect data from Container Service for Kubernetes (ACK).
  • Real-time data consumption and ecosystem connection
    • Log Service supports elastic scalability that can be implemented within seconds. Petabytes of data can be written to and consumed from Log Service in real time.
    • Log Service is natively compatible with Flink, Storm, and Spark Streaming.
  • Large-scale data query and analysis
    • Tens of billions of rows of data can be queried within seconds.
    • Log Service supports the SQL-92 syntax, interactive queries, machine learning algorithms, and security check functions.
  • Data transformation
    • Compared with the traditional extract, transform, and load (ETL) process, the data transformation feature of Log Service can reduce development costs by up to 90%.
    • Log Service is a fully-managed service that provides high availability and elastic scalability.
  • Metric data

    Cloud native metric data can be imported to Log Service. A hundred million rows of metric data from Prometheus can be collected and stored.

  • Unified trace data import methods

    The OpenTelemetry protocol can be used to collect traces. The OpenTracing protocol can be used to import traces from Jaeger or Zipkin to Log Service. Traces from OpenCensus and SkyWalking can also be imported to Log Service.

  • Data monitoring and alerting

    Log Service provides an integrated alerting feature that can be used to monitor data, trigger alerts, denoise alerts, manage alert incidents, and dispatch alert notifications.

  • AIOps-based anomaly detection

    Log Service offers unsupervised process monitoring and fault diagnosis and supports manual data labeling. These features greatly improve monitoring efficiency and accuracy.

Log Service architecture
Compared with the solutions that combine multiple open source systems, Log Service is a one-stop logging service. Log Service provides a single system that meets all data monitoring and analysis requirements of O&M and SRE teams. O&M and SRE teams can use Log Service instead of multiple combined systems, such as Kafka, Elasticsearch, Prometheus, and OLAP engines. Log Service has the following advantages:
  • Lower O&M complexity
    • Log Service is an out-of-the-box and maintenance-free cloud service.
    • Log Service supports visualization management. Data can be accessed within 5 minutes. The cost of business support can be greatly reduced by using Log Service.
  • Lower cost
    • Data is retained as only one copy, and the data copy does not need to be transferred among multiple systems.
    • Resources are scaled in or out based on the actual scenario. No reserved resources are required.
    • Comprehensive technical support is provided to reduce labor costs.
  • Better resource permission management
    • Log Service provides complete consumption data for internal separate billing and cost optimization.
    • Log Service supports full permission control and resource isolation to prevent information leakage.

To help O&M and SRE teams support business with higher efficiency, Log Service is committed to providing a large-scale, low-cost, and real-time monitoring and analysis platform for logs, metrics, and traces.