Cloud-Native Prometheus Solution: High Performance, High Availability, and Zero O&M

This article describes how Log Service supports Prometheus to provide a high-performance, highly available, and easy-to-manage cloud-native Prometheus engine.

By Yuanyi

Alibaba Cloud Log Service (SLS) strives to develop itself into a DevOps data mid-end that provides rich capabilities including host data access, storage, analysis, and visualization. This article describes how SLS supports the Prometheus solution to provide a cloud-native Prometheus engine that features high performance, high availability, and zero O&M.

Prometheus - De-facto Standard for Cloud-Native Monitoring

Cloud-native technologies have boomed worldwide in recent years, strongly backed by the Cloud Native Computing Foundation (CNCF), one of the most influential organizations in the IT field. As a non-profit organization under the Linux Foundation, CNCF manages a dozen projects related to cloud-native technologies, the best known of which is Kubernetes, the de-facto standard for container orchestration.

Prometheus is the second CNCF graduated project, and has become the most popular one apart from Kubernetes. It is no exaggeration to say that Prometheus has become a de-facto standard of cloud-native monitoring. If the first step of enabling cloud native is to build a Kubernetes environment, then Prometheus is the first step to implement cloud-native monitoring.


After you deploy apps in Kubernetes, you will find it necessary to check the running statuses of the cluster and the apps. However, some of the monitoring methods in the virtual machine (VM) environment are no longer applicable. Although there are several alternatives to Prometheus, it is the best choice for many applications due to these advantages:

  1. Prometheus is easy to deploy, especially in a Kubernetes environment. After Prometheus Operator is deployed, you only need a few YAML files to configure Prometheus and its monitoring items. You can then get an overview of the monitored products in a monitoring dashboard by using Grafana and its rich set of Prometheus templates.
  2. Prometheus has a wealth of service discovery mechanisms. In particular, it can collect Kubernetes pod metrics after you declare a single simple annotation.
  3. Prometheus exporters cover almost all open-source software systems and are supported by many commercial software products, such as Alibaba Cloud CloudMonitor, which provides a Prometheus Exporter module.
  4. Prometheus provides software development kits (SDKs) for almost all languages so that you can expose metrics in an app. These SDKs are elegantly designed and make it convenient to expose metrics.
  5. Prometheus is an open-source project under CNCF. Therefore, you do not need to worry that the software updates will be stopped in a few years.
  6. If you take a closer look at Kubernetes code, you will find that all Kubernetes components expose Prometheus metrics, and that Prometheus is indispensable for monitoring Kubernetes.
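
To make the SDK point concrete, the following minimal Python sketch renders a counter in the Prometheus text exposition format, which is roughly what a scrape of an app's metrics endpoint returns. It uses only the standard library and does not escape special characters in label values. In practice you would use the official client SDK for your language. The metric name, help text, and labels are hypothetical.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and label names, for illustration only.
text = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {(("method", "get"), ("code", "200")): 1027},
)
print(text)
```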

Prometheus' Challenges in a Production Environment

When we first move an app and its monitoring from a test environment to an online cluster, everything goes smoothly: the app runs properly and the monitoring metrics look normal. However, as more and more apps are deployed in the production environment and the access pressure increases, some of the pain points of Prometheus gradually become apparent:

  1. Memory usage: Prometheus caches all the data from the last two hours in memory. As the number of pods increases, the number of metrics in the system also rises, which may eventually cause an out-of-memory (OOM) issue. In some cases, a 100-node cluster requires 64 GB of dedicated memory to run Prometheus.
  2. Recovery from exceptions: Prometheus persists the data written in real time by using a write-ahead log (WAL). It replays the log to recover data when the system crashes. However, the recovery may take a long time because up to two hours of data must be replayed into memory. If Prometheus restarts because of an OOM issue, the replay may exhaust memory again, leaving Prometheus in an endless restart loop.
  3. Storage duration: This is the most common complaint about Prometheus. By default, Prometheus retains data for 15 days. You can adjust the startup parameters to set a longer retention period, but truly persistent storage may be infeasible due to the restrictions of a single instance.
  4. Single instance: Prometheus is deployed on a single instance. Its data capture, storage, and calculation are all implemented at a single point. This makes it difficult to use Prometheus in a large-scale cluster. The community provides a variety of distributed solutions to address this issue, such as Cortex, Thanos, and M3DB.
  5. AIOps related: Prometheus still adopts traditional monitoring metrics. PromQL focuses on arithmetical operations and does not support time series AI algorithms, such as prediction, outlier detection, change point detection, break point detection, and multi-cycle estimation algorithm.
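
The memory pressure described in the first item can be estimated with simple arithmetic: the number of samples held in memory is roughly the number of active series multiplied by the number of scrapes in the two-hour window. The sketch below is only a back-of-the-envelope estimate. The workload figures (one million active series, a 15-second scrape interval) are assumptions for illustration, and the sketch makes no claim about bytes per sample.

```python
def in_memory_samples(active_series, scrape_interval_s, window_s=2 * 60 * 60):
    """Estimate how many samples sit in memory for a given time window."""
    samples_per_series = window_s // scrape_interval_s
    return active_series * samples_per_series


# Assumed workload: 1,000,000 active series scraped every 15 seconds.
estimate = in_memory_samples(1_000_000, 15)
print(estimate)  # 480000000
```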


SLS and Cloud-native Technologies

Alibaba Cloud Log Service (SLS) strives to develop itself into a DevOps data mid-end that provides rich capabilities such as host data access, storage, analysis, and visualization. It provides an all-in-one platform where you can easily handle data-related tasks in DevOps scenarios and quickly build your enterprise's observability platform.


SLS provides a wide range of data access methods and supports many data access approaches related to cloud-native observability. The preceding figure shows the projects that are supported by SLS for data access in the CNCF landscape. The monitoring, logging, and tracing features all support CNCF graduated projects, such as Prometheus, Fluentd, and Jaeger. The main reasons for using SLS to store Prometheus monitoring data include:

  1. SLS data can be stored persistently. Many users want to store key Prometheus metrics persistently in SLS.
  2. Many users now store their logging and tracing data in SLS and want to do so for Prometheus data as well, so as to implement integrated observability data solutions and reduce O&M workloads.
  3. SLS provides many metric-related AIOps algorithms, such as multi-cycle estimation, prediction, outlier detection, and time series classification. Clients also expect more intelligent use of Prometheus data.
  4. SLS also supports data pipeline models. Prometheus can enable faster alarming if its metrics are interconnected with downstream systems for stream computing. In addition, Prometheus can enable offline statistical analysis if its metrics are interconnected with data warehouses.

SLS Solutions for Prometheus

The SLS MetricStore provides native support for PromQL. All data is distributed to multiple hosts for distributed storage as shards. The computing layer integrates a Prometheus QueryEngine module to separate storage and computing, so that massive data processing can be carried out easily.


Compared with the distributed Prometheus extensions provided by the community, such as Cortex, Thanos, M3DB, FiloDB, and VictoriaMetrics, SLS's distributed implementation is closer to the community's goal of removing the restrictions of native Prometheus.

  1. Compatibility: SLS implementation reuses the Prometheus code without any modification needed. This ensures the SLS implementation keeps pace with the official updates in real time.
  2. Global view: SLS is a SaaS-based service and supports multitenancy and multiple instances. Therefore, it can write the data of multiple clusters to the same instance to display a global view.
  3. Persistent storage: SLS data supports the TTL mechanism and persistent storage.
  4. High availability: Each instance contains multiple shards, and different shards are allocated to different hosts. The failure of hosts where some shards are stored does not compromise the overall writing performance. Each shard has three replicas on Apsara Distributed File System to ensure the reliability of each shard.

In addition to supporting these requirements of the community, SLS can provide the following advantages for Prometheus:

  1. Larger storage: SLS is a fully cloud-based service. The storage space for each user is unlimited.
  2. Lower costs: In terms of labor cost, SLS's Prometheus access method does not require the operation and maintenance of Prometheus instances. In terms of usage, SLS MetricStore uses a pay-as-you-go model without the need to separately purchase hosts and disks for data calculation and storage.
  3. Faster speed: The storage and computing separation architecture of SLS gives full play to cluster capabilities, enabling faster end-to-end processing especially in processing of massive data.
  4. More intelligent algorithms: All of SLS's metric-related AI algorithms can be applied to Prometheus data, such as multi-cycle estimation, prediction, outlier detection, and time series classification, to add AI power to Prometheus.
  5. More extensive ecosystems: SLS features sound connectivity with upstream and downstream ecosystems. Therefore, you can integrate Prometheus metrics with stream computing for faster alarming, with data warehouses for offline statistical analysis, and with OSS for archival storage.
  6. Better support: Observability requires full connectivity between metrics, logging data, and tracing data. SLS is committed to building a unified OpenTelemetry storage platform to act as an underlying data foundation for all kinds of intelligent data apps.

Cloud-native Kubernetes Monitoring

As cloud-native monitoring software, Prometheus provides sound native support for Kubernetes. In Kubernetes, almost all components provide Prometheus metrics interfaces. Therefore, Prometheus has become the de-facto standard for Kubernetes monitoring. The next section describes how to deploy Prometheus monitoring for Kubernetes and how to use SLS MetricStore as the storage backend.

Before You Begin

  1. Set up a Kubernetes cluster of V1.10 or later.
  2. Create a MetricStore instance in SLS.

Installation of Independently Built Kubernetes

We recommend that you register a cluster to connect an independently built Kubernetes to Alibaba Cloud. After the registration is complete, you can follow the Alibaba Cloud Kubernetes installation procedure to install the cluster.

If you opt for other connection approaches, see the official instructions of Helm package for installation. Before the installation, you need to create a secret and change the default configuration. For more information, see the following description of installing Alibaba Cloud Kubernetes.

Installing Alibaba Cloud Kubernetes

If you use Alibaba Cloud Kubernetes, you can install and configure Prometheus in the app directory to store data to SLS. The configuration procedure is as follows:

1. Create a Secret

  • Log on to the Container Service for Kubernetes console.
  • In the left-side navigation pane, select Namespace and create a namespace named monitoring.
  • In the left-side navigation pane, choose Configuration > Secrets. Select the monitoring namespace you just created. If this namespace does not appear, forcibly refresh the entire page.
  • Click Create to start creating a secret. Set the secret name to sls-ak, and add two key-value pairs with the keys username and password. Populate them with your Alibaba Cloud AccessKeyId and AccessKeySecret, respectively. Use a RAM user account and grant only the SLS write permission to the account. For more information about authorization, see Permission to write data to a specified project.
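
If you prefer to create the secret outside the console, the following Python sketch builds a Secret manifest that should be equivalent to the steps above, assuming a standard Opaque Kubernetes Secret. The secret name, namespace, and keys come from this article. The credential values are placeholders.

```python
import base64
import json


def build_secret(access_key_id, access_key_secret):
    """Build a Kubernetes Secret manifest equivalent to the console steps."""
    def b64(value):
        # Kubernetes expects values under "data" to be base64-encoded.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "sls-ak", "namespace": "monitoring"},
        "type": "Opaque",
        "data": {
            "username": b64(access_key_id),
            "password": b64(access_key_secret),
        },
    }


# Placeholder credentials; never commit a real AccessKey pair.
manifest = build_secret("my-access-key-id", "my-access-key-secret")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output can be saved to a file and applied with kubectl apply -f.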


2. Create a Prometheus Operator

  1. Log on to the Container Service for Kubernetes console.
  2. In the left-side navigation pane, choose Marketplace > App Catalog.
  3. Click ack-prometheus-operator.
  4. In the pop-up installation page, click the Parameters tab and modify the configuration items. Major modifications include the following.

    • Change the value of retention under prometheusSpec. The value 1d or 12h is recommended.
    • Set enable under prometheusSpec to true, and add the remoteWrite configuration. Modify the URL parameters as well.
      - basicAuth:
          username:
            name: sls-ak
            key: username
          password:
            name: sls-ak
            key: password
        queueConfig:
          batchSendDeadline: 20s
          maxBackoff: 5s
          maxRetries: 10
          minBackoff: 100ms
        ### The URL is https://{sls-endpoint}/prometheus/{project}/{metricstore}/api/v1/write.
        ### For the sls-endpoint settings, see https://help.aliyun.com/document_detail/29008.html.
        ### Replace project and metricstore values with your own project and metricstore.
        url: https://cn-beijing.log.aliyuncs.com/prometheus/sls-zc-test-bj-b/prometheus-raw/api/v1/write
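
The remote write URL follows the fixed pattern given in the configuration comments, so it can be composed from its three parts. The helper below is a small illustrative sketch. The function name is hypothetical, and the validation is only a basic guard against stray slashes.

```python
def remote_write_url(endpoint, project, metricstore):
    """Compose the SLS remote write URL from its three parts."""
    for part in (endpoint, project, metricstore):
        if not part or "/" in part:
            raise ValueError(f"invalid URL component: {part!r}")
    return f"https://{endpoint}/prometheus/{project}/{metricstore}/api/v1/write"


# Values taken from the example configuration in this article.
url = remote_write_url("cn-beijing.log.aliyuncs.com", "sls-zc-test-bj-b", "prometheus-raw")
print(url)
```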

Query and Analysis of Diversified Time Series Data


SLS provides three query modes for time series data. SQL plays a dominant role in time series data queries, and SQL's support for calling PromQL ensures both easier syntax and powerful functionality. In addition, SLS supports calling PromQL directly for compatibility with the open-source ecosystem, such as integration with Grafana.

1. Pure PromQL Queries

In its MetricStore implementation, SLS supports the Prometheus remote write protocol for data writes and supports PromQL queries through the Prometheus APIs. This allows SLS to act as a Prometheus data source in Grafana and stay compatible with the open-source ecosystem. If your data is written by using Prometheus, this mode is well suited to your scenarios.
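
As a rough illustration of a pure PromQL query, the Python sketch below builds an instant-query request with BasicAuth by using only the standard library. It assumes that SLS serves the standard Prometheus HTTP API path /api/v1/query under the Prometheus URL described later in this article, which is how a Grafana Prometheus data source would call it. The credentials are placeholders, and the request is built but not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(base_url, promql, ak_id, ak_secret):
    """Build (but do not send) an instant-query request with BasicAuth."""
    query = urllib.parse.urlencode({"query": promql})
    token = base64.b64encode(f"{ak_id}:{ak_secret}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{base_url}/api/v1/query?{query}")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Example URL from this article; placeholder AccessKey pair.
req = build_query_request(
    "https://cn-beijing.log.aliyuncs.com/prometheus/sls-prometheus-test/prometheus",
    "up",
    "my-access-key-id",
    "my-access-key-secret",
)
print(req.full_url)
```

Passing req to urllib.request.urlopen would send the query. The response format is assumed to follow the Prometheus HTTP API.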

2. Pure SQL Queries

MetricStore reuses the underlying architecture of SLS, so it supports SQL queries by design. For example, the long SQL statements in the preceding example are pure SQL queries. Nevertheless, pure SQL queries require a lot of optimization to handle time series data with ease, which is time-consuming. In view of this, SLS offers a third solution.

3. SQL + PromQL Hybrid Queries

SLS encapsulates PromQL into several functions, which can serve as subqueries to support nesting complete SQL statements at the outer layer. The following shows an example.

Pure PromQL queries:

SELECT promql_query('up') FROM metrics
SELECT promql_query_range('up', '1m') FROM metrics

PromQL as subqueries:

SELECT sum(value) FROM (SELECT promql_query('up') FROM metrics)

Complicated SQL queries with PromQL as subqueries:

SELECT ts_predicate_arma(time, value, 5, 1, 1, 1, 1, true)
FROM (
  SELECT (time/1000) AS time, value
  FROM (
    SELECT promql_query_range('1 - avg(irate(node_cpu_seconds_total{instance=~".*",mode="idle"}[10m]))', '10m') AS t
    FROM metrics
  )
  ORDER BY time ASC
)
LIMIT 10000

Currently, SLS supports the following frequently used APIs in PromQL: query(varchar), query_range(varchar, varchar?), labels(), label_values(varchar), and series(varchar).

In particular, query_range selects the step automatically when the second parameter is empty.
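
When hybrid queries are generated from code, the PromQL expression has to be embedded as a SQL string literal. The sketch below shows one way to do that. It assumes standard SQL quoting, where an embedded single quote is doubled, and it assumes that an "empty" step means the second argument can be omitted, as the varchar? signature suggests. Both helper names are hypothetical.

```python
def sql_literal(text):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def promql_range_sql(promql, step=None, table="metrics"):
    """Wrap a PromQL expression in a promql_query_range() SQL statement."""
    args = [sql_literal(promql)]
    if step is not None:
        # Assumption: omitting the step lets SLS choose it automatically.
        args.append(sql_literal(step))
    return f"SELECT promql_query_range({', '.join(args)}) FROM {table}"


print(promql_range_sql("up", "1m"))
print(promql_range_sql("up"))
```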

Multiple Visualization Features

Access to Prometheus Data in SLS

SLS provides multiple visualization features for time series scenarios by default, and supports analysis in the standard SQL and the PromQL + SQL modes. For more information about SLS visualization, see Log Service Visualization Dashboard.


Access to Prometheus Data in Grafana

In addition to native visualization features, SLS also supports access to time series data in Grafana by connecting SLS to Grafana as a Prometheus data source. In this way, SLS is compatible with all Prometheus dashboard templates.


Open-source Prometheus provides no authentication by default. In contrast, the Prometheus interface provided by SLS supports the HTTPS protocol and requires BasicAuth authentication, making data more secure.

Note: Make sure you are using HTTPS.

Information | Description | Example
Entrance (Endpoint) | https://{endpoint}/prometheus/{project-name}/{logstore-name} | https://cn-beijing.log.aliyuncs.com/prometheus/sls-prometheus-test/prometheus
BasicAuth | The username is the AK ID, and the password is the AK Secret. We recommend that you use the AK of a RAM user that has only read-only permission on this project and LogStore. |

1.  Add a data source, and select Prometheus.


2.  Configure the URL.


Enter the aforementioned URL.

3.  Enable Basic Auth, and enter the AK information.


Information | Description | Example
Entrance (URL) | https://{endpoint}/prometheus/{project-name}/{metricstore-name}. Here, endpoint is the domain name of the SLS region. For more information, see Service endpoint. |
BasicAuth | The username is the AK ID, and the password is the AK Secret. We recommend that you use the AK of a RAM user that has only read-only permission on this project and MetricStore. |
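
The same data source settings can also be prepared in code. The Python sketch below builds a data source definition in the shape accepted by Grafana's data source HTTP API (POST /api/datasources). The data source name and credentials are placeholders, and the HTTPS check simply mirrors the note above. Treat it as an illustrative sketch rather than a required setup step.

```python
import json


def grafana_datasource(name, url, ak_id, ak_secret):
    """Build a Prometheus data source definition for Grafana's HTTP API."""
    if not url.startswith("https://"):
        raise ValueError("the SLS Prometheus interface requires HTTPS")
    return {
        "name": name,
        "type": "prometheus",
        "access": "proxy",
        "url": url,
        "basicAuth": True,
        "basicAuthUser": ak_id,
        # Secrets go in secureJsonData so that Grafana stores them encrypted.
        "secureJsonData": {"basicAuthPassword": ak_secret},
    }


# Hypothetical data source name; example URL from this article.
payload = grafana_datasource(
    "sls-prometheus",
    "https://cn-beijing.log.aliyuncs.com/prometheus/sls-prometheus-test/prometheus",
    "my-access-key-id",
    "my-access-key-secret",
)
print(json.dumps(payload, indent=2))
```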