Observability | Best Practices for Centralized Data Management of Multiple Prometheus Instances

This article introduces Prometheus and addresses the common challenge of achieving a global view with data scattered across different Prometheus instances.

By Kenwei (Wei Dan) and Yiling (Qikai Yang)

1. Introduction

Prometheus, one of the mainstream open-source observability projects, has become a de facto standard for cloud-native monitoring and is widely used by enterprises. When using Prometheus, we often need a global view, but the data is scattered across different Prometheus instances. How can we solve this problem? This article lists the common community solutions, presents the global view solutions of Alibaba Cloud, and closes with a customer case based on Alibaba Cloud Managed Service for Prometheus, which we hope will inspire and help you.

2. Background

When using Alibaba Cloud Managed Service for Prometheus, regional restrictions and business reasons often lead to multiple Prometheus instances, which brings the following problems.

2.1. Problem 1: Single Grafana Dashboard Data Source

Grafana dashboards are the most common way to observe Prometheus data. Typically, each Prometheus cluster being observed requires its own data source, so 100 Prometheus clusters mean 100 data sources, which is tedious to create and maintain.

When editing the Grafana panel and entering PromQL, we can select a data source. However, to ensure the consistency and simplicity of data query and display, only one data source is allowed for one Grafana panel.

If we need to plot data from multiple data sources in one dashboard, each data source needs its own panel, so 100 data sources produce 100 panels. We would have to edit a panel 100 times and write the PromQL 100 times, which is a heavy O&M burden. Ideally, the data would be merged into one panel with one timeline per data source, which not only makes metric monitoring easier but also greatly reduces dashboard maintenance.

2.2. Problem 2: Data Calculation and Query among Instances

When different businesses use different Prometheus instances that report the same metrics, we want to apply operations such as sum and rate across their data. However, because storage is isolated between instances, such cross-instance operations are not possible. At the same time, we do not want to report all the data to one instance: depending on the business scenario, the data may come from different ACK clusters, ECS instances, Flink instances, or even different regions. Therefore, instance-level isolation must be maintained.

3. Community Solutions

So, how does the community solve the preceding problems that exist in multiple Prometheus instances?

3.1. Federation Mechanism

The Prometheus Federation mechanism is a clustering extension provided by Prometheus itself, and it can also be used to solve the problem of centralized data query. When we need to monitor a large number of services, we deploy many Prometheus nodes, each of which pulls the metrics exposed by a subset of these services. The Federation mechanism aggregates the metrics collected by these separately deployed Prometheus nodes and stores them in a central Prometheus. The following figure shows a common Federation architecture:

Each Prometheus instance on an edge node exposes a /federate endpoint that returns the monitoring data of a specified set of time series. You only need to configure one collection task on the global node to obtain the monitoring data from the edge nodes. To illustrate the Federation mechanism, the following code shows the configuration file of the global Prometheus.

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 10s

    honor_labels: true
    metrics_path: '/federate'

    # Select the series to pull with the match[] parameter, according to the actual business needs
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{job="node"}'

    static_configs:
      # Other Prometheus nodes
      - targets:
        - 'Prometheus-follower-1:9090'
        - 'Prometheus-follower-2:9090'
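
Once series from several edge nodes land in one global node, they need a label that identifies the node they came from. The following is a minimal sketch of one common way to do this, assuming that each edge node sets external_labels in its own configuration file. The label names and values are hypothetical. Prometheus attaches these labels to the series it exposes on /federate, and honor_labels: true on the global node keeps them intact.

```yaml
# prometheus.yml on edge node 1 (label names and values are hypothetical)
global:
  external_labels:
    cluster: edge-1
    region: cn-hangzhou
```

With such labels in place, a query on the global node can group or filter by cluster, for example sum by (cluster) (up).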

3.2. Thanos Mechanism

For open-source Prometheus, we can use Thanos to implement aggregate queries. The following figure shows the Sidecar deployment mode of Thanos:

This figure shows several core components of Thanos (but not all of them):

Thanos Sidecar: connects to Prometheus, provides Prometheus data to Thanos Query for query, and uploads it to the object storage for long-term storage.
Thanos Query: implements the Prometheus API and provides a global query view to aggregate data provided by the StoreAPI and finally return it to the client such as Grafana.
Thanos Store Gateway: exposes the object storage data to Thanos Query for query.
Thanos Compact: compacts the object storage data and downsamples it to speed up queries of monitoring data over large time ranges.
Thanos Ruler: evaluates and alerts on monitoring data. It can also calculate the new monitoring data, provide these new data to Thanos Query, and/or upload the new data to the object storage for long-term storage.
Thanos Receiver: receives data from the remote write WAL of Prometheus, exposes it, and/or uploads it to the cloud storage.

How Does Thanos Implement Global Queries?

Thanos Query implements the Prometheus HTTP API. Therefore, a client that queries Prometheus monitoring data does not query Prometheus directly. Instead, the client queries Thanos Query, which in turn queries the multiple downstream locations where data is stored, merges and deduplicates the results, and returns them to the client. This is how it implements a global query. To let Thanos Query reach these distributed downstream stores, Thanos defines an internal gRPC interface called the Store API. Other components use this interface to expose data to Thanos Query.

In the preceding architecture, each Prometheus stores the collected data on its local disk and is paired with a Thanos Sidecar that implements the Thanos Store API. Because the local disk of Prometheus is limited, the Thanos Sidecar uploads long-term data to the object storage, where the Thanos Store Gateway exposes it. Both the data query on a single Prometheus and the object storage query are based on the Store API. The query is further abstracted as follows.

3.3. Prometheus Remote Write Mechanism

The Remote Write mechanism is another effective way to solve the global query problem of multiple Prometheus instances. Its basic idea is very similar to the Prometheus Federation mechanism: the metrics collected by separately deployed Prometheus nodes are stored in a central Prometheus or in third-party storage, in this case by using Remote Write.

Users specify the Remote Write URL in the Prometheus configuration file. Once it is set, Prometheus sends the collected samples to an Adaptor over HTTP, and the Adaptor can connect to any external service, such as an open-source Prometheus, a storage system, or a public cloud storage service.

The following example shows how to modify prometheus.yml to add the configuration related to remote storage.

remote_write:
  - url: "http://*****:9090/api/v1/write"
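
Forwarding every series to the central storage is rarely necessary. As a hedged sketch, open-source Prometheus can restrict what is sent by using write_relabel_configs. The endpoint host and the metric names in the regex below are placeholders chosen for illustration, not values from this article.

```yaml
remote_write:
  - url: "http://central-prometheus.example.com:9090/api/v1/write"  # hypothetical endpoint
    write_relabel_configs:
      # Forward only the series whose metric name matches the regex; drop the rest
      - source_labels: [__name__]
        regex: "up|apiserver_request_total"
        action: keep
```

If the destination is itself an open-source Prometheus, it must be started with the --web.enable-remote-write-receiver flag to accept these writes.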

4. Alibaba Cloud Solutions

4.1. Solution of Alibaba Cloud Managed Service for Prometheus Global Aggregation Instances

4.1.1. Introduction to the Solution of Alibaba Cloud Managed Service for Prometheus Global Aggregation Instances

Alibaba Cloud has launched Prometheus Global Aggregation Instance (Global View) to aggregate data across multiple Alibaba Cloud Managed Service for Prometheus instances. When you query data, you can read data from multiple instances at the same time. The principle is metric aggregation during the query.

A global aggregation instance (Global View) keeps the data of the Alibaba Cloud Managed Service for Prometheus instances isolated. Each Prometheus instance has its own independent backend storage. Instead of pooling data into central storage, Global View dynamically retrieves the required data from the storage of each Prometheus instance at query time. Therefore, when a user or a frontend application initiates a query request, Global View queries all relevant Prometheus instances in parallel and merges the results to provide a centralized view.

4.1.2. Comparison and Analysis

The following section compares open-source Prometheus Federation, Thanos, and the Alibaba Cloud global aggregation instance.

(1) Prometheus Federation

Although Prometheus Federation can solve the problem of global aggregation queries, there are still some problems.

The edge nodes and the global node are still single points. You need to decide whether to run duplicate nodes that collect the same data for high availability at each layer, and even then a single-node bottleneck remains.
The storage problem of historical data is still unresolved, and it must rely on third-party storage, which lacks the capability of downsampling for historical data.
The overall O&M cost is relatively high.
The scalability is poor. To add or remove a Prometheus instance, you must modify the configuration file.

(2) Thanos Federation

The architecture is complex and the O&M costs are high.
There is still a single-point issue with Prometheus replicas.
Timeline divergence (high cardinality) is poorly handled: the supported upper limit is insufficient, and no optimization is provided for dimension divergence scenarios.
Downsampling is not supported and the performance of long-term queries is not high.
Operator pushdown is not supported. The performance of requests with large data volumes is limited and the processing overhead is high.

(3) Alibaba Cloud Global Aggregation Instance

Prometheus instances are managed and O&M-free.
It supports graphical interfaces for managing multiple instances, with high flexibility and scalability. This mode allows the system to easily add or remove Alibaba Cloud Managed Service for Prometheus instances without the need to reconfigure the entire storage system.
It does not take up additional storage space. This approach saves storage space because data is not replicated in the centralized storage. Each Prometheus instance only needs to maintain its own dataset. If no additional storage is configured, the queried data is only used for temporary display, and the actual data persistence is still attributed to the aggregated instance.
Isolation: the autonomy of each instance can improve the fault tolerance of the system because the problem of a single instance does not directly affect other instances.
It supports aggregation of instances across regions and accounts to meet the personalized needs of enterprises.

However, note that both Thanos Federation and the Alibaba Cloud global aggregation instance implement global queries without pooling data. Because data must be retrieved from multiple data sources at query time, query performance may decrease. In particular, when a query touches a large amount of unneeded data, it must wait for every data source to filter out the needed data, and this waiting may cause query timeouts or long delays.

4.1.3. Practice of Alibaba Cloud Prometheus Global Aggregation Instance

Alibaba Cloud Managed Service for Prometheus greatly simplifies users' operations. You do not need to manually deploy Prometheus extensions. Instead, you can use the console to enable the global view feature. When you create an Alibaba Cloud Managed Service for Prometheus instance, select Global Aggregation Instance. Select the instances to be aggregated, then select the region where the query frontend is located (which affects the generated query domain name), and click Save.

Enter the created global aggregation instance and click any dashboard. You can see that the instance can query the data of other instances just aggregated. This meets the requirement of querying data from multiple instances in one Grafana data source.

4.2. Solution of Alibaba Cloud Managed Service for Prometheus Remote Write

4.2.1. Introduction to the Solution of Alibaba Cloud Managed Service for Prometheus Remote Write

The Remote Write capability of Alibaba Cloud Managed Service for Prometheus is an atomic capability of Prometheus data delivery, which works on the principle of metric aggregation during the storage. Data delivery extracts data from multiple Prometheus instances through an ETL service and then writes the data to the storage of a central aggregation instance.

In this way, the same Prometheus data can be stored in different instances at the same time:

Each source Prometheus instance (an instance being aggregated) stores all of its own raw data, including both the data to be aggregated and all other data. It serves single-instance queries in the original business scenario.
The central Prometheus instance stores only the data to be aggregated from the source instances. In the scenario of centralized management, you can query the global view through this instance and perform cross-instance data searches.

4.2.2. Alibaba Cloud Managed Service for Prometheus Remote Write VS Community Prometheus Remote Write

(1) Prometheus Remote Write

The biggest disadvantage of open-source Remote Write is its impact on the Prometheus Agent. Enabling Remote Write on the Agent increases its resource consumption and degrades data collection performance, which is often fatal.

(2) Alibaba Cloud Managed Service for Prometheus Remote Write

The advantages of Alibaba Cloud Managed Service for Prometheus Remote Write are still very obvious.

High query performance: since only the necessary aggregated data is stored, the query response time of the aggregated Prometheus instance is shorter, which greatly improves user experience. In addition, the query is performed only on one Prometheus instance instead of multiple instances. This provides higher read-write performance and computing performance.
High data quality: the filtered data is cleaner and free of unnecessary "dirty data", which contributes to more accurate and effective data analysis.
Rich ETL capabilities: it provides rich processing capabilities before data is written to the aggregated instance, such as filtering and metric enrichment.
Graphical configuration: it is simple and convenient to operate.

At the same time, of course, there are some disadvantages, and we need to make a comprehensive trade-off.

Fees: an additional Prometheus instance is used as a storage point for aggregation and global queries. It means that an additional TSDB backend is required to store the data to be aggregated. These independent storage spaces are billed.
Network consumption: during data delivery, data transmission across networks increases bandwidth usage, especially in cross-data-center or bandwidth-limited environments. Therefore, a reasonable evaluation is required.

4.2.3. Use of Alibaba Cloud Managed Service for Prometheus Remote Write

1. In the left-side navigation pane, choose Prometheus Monitoring > Data Delivery (beta) to enter the Data Delivery page of Managed Service for Prometheus.

2. In the top navigation bar of the Data Delivery page, select a region and click Create Task.

3. In the dialog box, set Task Name and Task Description and click OK.

4. On the Edit Task page, configure the data source and delivery destination.

Configuration item	Description	Example
Prometheus instance	Delivered Prometheus data source	c78cb8273c02*
Data filtering	Enter the metrics to filter in whitelist or blacklist mode, and use labels to filter the delivered data. Regular expressions are supported. Put multiple conditions on separate lines; they are combined with AND (&&).	__name__=rpc. / job=apiserver / instance=192. (one condition per line)

5. Configure the Prometheus Remote Write endpoints and authentication method.

Prometheus type	Endpoint acquisition method	Requirement
Alibaba Cloud Managed Service for Prometheus	See Remote Write and Remote Read endpoint usage. If the source and destination instances are in the same region, use the internal endpoint.	Select Basic Auth for Authentication Method, and enter the AK and SK of a RAM user that has the AliyunARMSFullAccess permissions. For more information about how to obtain the AK and SK, see View the Information about AccessKey Pairs of a RAM User. Username: the AccessKey ID, which is used to identify the user. Password: the AccessKey Secret, which is used to authenticate the user. You must keep your AccessKey Secret confidential.
Self-managed Prometheus	Official documentation	1. The version of self-managed Prometheus is later than 2.39. 2. You must configure the out_of_order_time_window. For more information, see PromLabs.

6. Configure the network.

Prometheus type	Network model	Network requirement
Alibaba Cloud Managed Service for Prometheus	Public Internet	N/A
Self-managed Prometheus	Public Internet	N/A
	VPC	Select the VPC where the self-managed Prometheus instance resides, and make sure that the Alibaba Cloud Managed Service for Prometheus Remote Write endpoint that you enter is accessible from the VPC, vSwitch, and security group.
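
For a self-managed destination, the out_of_order_time_window requirement in step 5 maps to the TSDB section of the Prometheus configuration file. The following is a minimal sketch, assuming a 30-minute window. The value is an illustrative assumption and should be sized to the delivery delay that you expect.

```yaml
# prometheus.yml on the self-managed destination instance
storage:
  tsdb:
    # Accept samples that arrive up to 30 minutes out of order (assumed value)
    out_of_order_time_window: 30m
```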

4.3. Summary and Selection of the Alibaba Cloud Solutions

Alibaba Cloud provides global aggregation instances and data delivery Remote Write solutions with their own advantages and disadvantages.

Alibaba Cloud Managed Service for Prometheus global aggregation instances are designed to provide a centralized interface to query multiple instances to implement a global view while maintaining the storage independence of Prometheus instances. The core concept of this solution is "metric aggregation during the query", that is, the data is stored in multiple instances intact and the data of multiple instances is obtained and aggregated only during the centralized query. This method has obvious advantages, such as saving storage space, but it also faces some challenges. For scenarios with a large number of instances and data, the query performance will be greatly affected.
Alibaba Cloud Managed Service for Prometheus data delivery Remote Write is designed to convert query traffic into data write traffic. It consumes additional storage space to provide aggregated data across multiple instances, and it keeps the central instance compact by filtering data before writing. The core concept of this solution is "metric aggregation during the storage". In this case, replicas of the data of multiple instances are stored in a centralized instance. Queries on multiple instances are converted into single-instance queries, which greatly improves query speed and data quality.

Solution	Core concept	Storage space	Query method
Global aggregation instance	Metric aggregation during the query	No additional storage space is consumed	Multi-instance query
Data Delivery Remote Write	Metric aggregation during the storage	Additional storage space is required to aggregate data	Single-instance query

5. Case Analysis

5.1. Current Observability of a Customer O&M Platform

5.1.1. Introduction

The following figure shows the internal O&M platform of a customer, which is temporarily referred to as O&M platform A. The customer company uses the O&M platform A to manage the lifecycle of its internal Kubernetes clusters. On the O&M platform A, you can only view the monitoring data of a single cluster. If multiple clusters have problems and you need to troubleshoot, you can only handle them one by one.

Similarly, when using Grafana, the current dashboard can only view the specific data of a cluster, but cannot monitor multiple clusters at the same time.

In this case, the SRE team cannot get a global view of the states of all clusters and cannot accurately assess the health of the product. In most O&M jobs, alert events are used to indicate that a cluster is unhealthy. At present, more than 500 clusters are hosted by the O&M platform A. If all clusters depend on alert events, there is a risk of too many messages, and high-level faults cannot be located quickly.

5.1.2. Demand

The current O&M management on the O&M platform A faces one challenge: it lacks a global view of the states of clusters in all regions. The goal of the O&M platform A is to configure a single Grafana dashboard backed by a single data source to monitor, in real time, the running states of all tenant clusters in each product line. This includes visualizing key metrics, such as the overall state of the clusters (the number of clusters, the number of nodes and pods, and the CPU usage across all clusters) and the SLO (Service Level Objective) state of the APIServer (such as the global proportion of requests per verb with non-500 responses, details of 50X errors, and the request success rate).

With this well-designed dashboard, the O&M team can quickly locate any cluster that is in an unhealthy state, quickly overview the business volume, and quickly investigate potential problems. It will greatly improve the O&M efficiency and response speed. Such integration not only optimizes the monitoring process but also provides a powerful tool for the O&M team to ensure the stability of the system and the continuity of services.
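
As an illustration of the APIServer success-rate metric described above, the following is a sketch of a standard Prometheus recording rule. It assumes that the series of each cluster carry a cluster label and it uses the Kubernetes apiserver_request_total metric. The group name, rule name, and cluster label are hypothetical and are not taken from the customer's setup.

```yaml
groups:
  - name: apiserver-slo            # hypothetical group name
    rules:
      # Share of APIServer requests per cluster and verb that did not return a 5xx code
      - record: cluster_verb:apiserver_request_success:ratio_rate5m
        expr: |
          sum by (cluster, verb) (rate(apiserver_request_total{code!~"5.."}[5m]))
          /
          sum by (cluster, verb) (rate(apiserver_request_total[5m]))
```

Because the expression groups by cluster, it only gives a global picture when the series of all clusters are visible to one query endpoint.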

5.1.3. Difficulty

Cross-continent data transmission service: the O&M platform A scenario involves all regions around the world. During O&M, the SRE team wants to view the instance data of all regions around the world on the dashboard of the Hangzhou region, which involves cross-continent data transmission service. When you use Grafana to query instances across continents, queries frequently time out due to network transmission latency.

(Note: when you use Prometheus to configure data cross-border, you agree and confirm that you have all the disposal rights of the business data and are fully responsible for the behavior of the data transmission. You shall ensure that your data transmission complies with all applicable laws, including providing adequate data security protection technologies and policies, fulfilling legal obligations such as obtaining personal full express consent, and completing data exit security assessment and declaration, and you undertake that your business data does not contain any content that is restricted, prohibited from transmission or disclosure by applicable laws. If you fail to comply with the aforementioned statements and warranties, you will be liable for the corresponding legal consequences. If any losses are suffered by Alibaba Cloud and other affiliates as a result, you shall be liable for compensation.)

A large amount of data on a single instance: not all data of every instance needs to be aggregated and queried globally. In most cases, O&M from a global perspective focuses on only a few metrics that indicate the state of a cluster, or, for some metrics, on only a few specific labels (such as namespaces). As the number of clusters and tenants hosted by the O&M Platform A grows, the labels of the reported metrics become more and more diverse, which leads to metric dimension divergence (high cardinality). Currently, the industry still has no unified solution to this problem. Queries on such metrics consume a large amount of TSDB memory. The TSDB instance is already under great pressure when a single Prometheus instance is queried for such divergent metrics. When it has to fetch the data of all Prometheus instances on the O&M Platform A at once, the pressure on the server is even greater.
Extra-large query: for several metrics, you need to calculate the sum over all instances in the current region or in all regions globally. The single-instance data volume described in the previous difficulty is multiplied across more than 500 Prometheus instances on the O&M Platform A. When the TSDB performs query, filtering, and calculation on this volume, it occupies a large amount of memory, and the general computing resource quota is not enough.

5.2. Implement Centralized Data Queries through Data Delivery

5.2.1. Solution Selection

Global Aggregation Instance or Data Delivery? In the O&M platform A scenario, to meet the requirements and cope with the difficulties discussed above, data delivery is a better solution. The reasons are as follows:

(1) Transmission Latency Tolerance

When data delivery is used, the process can withstand greater network latency.

When a global aggregation instance query is used:
- Each request generates multiple network latencies across continents. During the test, the latency of cross-continent network transmission ranges from 500 ms to 700 ms. In special periods and network fluctuations, the latency can even be more than 1 minute, which can easily cause query timeout.
- O&M Platform A instances are deployed in various regions around the world. If 99% of the data is queried successfully but the query to one region times out because of network fluctuations, the whole query fails and the 99% of data that was retrieved becomes unavailable. A global query therefore demands high data completeness.
- The PromQL and time span of the customer are not fixed during the query, resulting in an arbitrary amount of data to be queried. If the amount of query data is too large, the data may be split into multiple HTTP packets for transmission (limited by the network provider). In this case, the network latency is high.
When data delivery is used:
- The network transmission of data delivery does not change with the number of user queries. Instead, the data collected by each Prometheus instance is delivered to the central Prometheus instance in real time. In this case, the data packet size does not exceed 1 MB, and the network latency is maintained within a fixed range.
- Aggregated data is stored in a centralized Prometheus instance. Therefore, you only need to ensure that queries of this instance are error-free. You do not need to consider the issue of query completeness.
- Even when cross-continent network transmission latency is very high, batching and retries ensure that the data is eventually written to the central Prometheus instance. Although the latest data in the central instance lags behind the current time by up to a few minutes, the query success rate can be guaranteed.
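
The delivery component is a managed service, so its batching and retry settings are not shown in this article. For comparison only, open-source Prometheus exposes the same idea through queue_config in remote_write. The endpoint and all values below are illustrative assumptions.

```yaml
remote_write:
  - url: "http://central-prometheus.example.com:9090/api/v1/write"  # hypothetical endpoint
    queue_config:
      capacity: 10000              # samples buffered per shard before reading from the WAL blocks
      max_samples_per_send: 2000   # upper bound on the number of samples in one batch
      batch_send_deadline: 5s      # longest time a sample waits in the buffer before a partial batch is sent
      min_backoff: 30ms            # initial retry delay, doubled on every retry
      max_backoff: 5s              # ceiling for the retry delay
```

Smaller batches keep each request short on a slow link, while the backoff settings bound how aggressively failed writes are retried.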

(2) Computing Resource Saving

When a PromQL query is executed, the number of timelines of the metric determines the CPU and memory resources required for the query. In other words, the more diverse the labels of the metrics are, the more resources are consumed.

When a global aggregation instance query is used:
- Aggregated instances store all raw data and the resource consumption required by the query is relatively large. Due to the characteristics of TSDB, even if the label filtering is performed, the full data within the period may still be loaded into the memory. In the O&M platform A scenario, each query involves a large amount of data, which consumes a large amount of memory and often triggers query throttling.
- In our test, a query over a 1-hour time span returned results after 30 seconds.
When data delivery is used:
- Only one instance is queried, and the data stored in this instance is pre-filtered, which is the relevant data that we need to aggregate. If we want to query the global data of more than 500 instances in the Hangzhou region, the underlying layer is equivalent to querying only one instance in the Hangzhou region, which is highly efficient.
- In our test, a query over a 1-hour time span returned results after only 1 second.

In general, when we choose the solution of centralized data management for multiple instances, in addition to whether additional storage space is required, the query success rate is a more important reference metric for business scenarios.

In the O&M Platform A scenario, a large number of instances across continents and a large amount of data are involved. Therefore, metric aggregation during the query may cause network request timeout, database query throttling, and excessive memory consumption of the database. This reduces the query success rate.

The data delivery solution, which uses metric aggregation during the storage, stores data in the centralized instance in advance. It converts the network transmission of queries into the network transmission of data writes, and converts query requests to multiple instances around the world into queries of one instance in the current region. This solution has a high query success rate and meets the business requirements.

5.2.2. Solution Architecture

The following figure shows the product form of Prometheus data delivery - Remote Write. The data delivery service consists of two components. One is the Prometheus delivery component which is used to obtain data from the source Prometheus instance and then send it to the Internet forwarding service component after the metric filtering and formatting. The other is the Internet forwarding service component which is used to route data to the centralized instance in the Hangzhou region over the Internet.

In the future, we plan to use EventBridge to replace the existing Internet forwarding service component to support more delivery target ecosystems.

5.3. Effect

With the Remote Write function, data from more than 500 instances on the O&M platform A in 21 regions around the world is delivered to a centralized instance in the Hangzhou region, which is configured as a single Grafana data source. After configuring the dashboard, you can monitor all clusters managed by the O&M platform A. This removes the previous need to configure one data source per cluster and greatly simplifies O&M operations.
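
A single data source of this kind can be declared through a Grafana provisioning file. The sketch below assumes that the central instance is reachable at a placeholder URL and uses basic authentication. The data source name, URL, and environment variable names are hypothetical.

```yaml
apiVersion: 1
datasources:
  - name: global-prometheus                       # hypothetical name
    type: prometheus
    access: proxy
    url: https://central-prometheus.example.com   # placeholder for the central instance endpoint
    basicAuth: true
    basicAuthUser: $PROM_AK                       # for example, the AccessKey ID
    secureJsonData:
      basicAuthPassword: $PROM_SK                 # for example, the AccessKey Secret
    isDefault: true
```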