
New Evolution of SLS Prometheus Time Series Storage Technology

This article introduces recent technical updates to the Prometheus storage engine of SLS, which deliver a performance improvement of more than 10 times while remaining compatible with PromQL.

Observability has been a popular topic in recent years. It focuses on driving business innovation with large amounts of diverse data (such as Logs, Traces, and Metrics). To support such data-driven innovation, a stable, powerful, and versatile storage and computing engine is crucial.

In the industry, observability is often assembled from separate solutions (such as ES for Logs, ClickHouse for Traces, and Prometheus for Metrics). SLS, however, adopts a unified architectural design for all observable data at the data engine level. In 2018, we released the first version of PromQL syntax support based on the Log model, which confirmed the feasibility of this approach. We then redesigned the storage engine to store Logs, Traces, and Metrics in one unified architecture and process.

This article introduces the recent technical updates to the Prometheus storage engine of Simple Log Service (SLS). Compatible with PromQL, the Prometheus storage engine achieves a performance improvement of more than 10 times.

Technical Challenges

With a growing number of customers and large-scale applications like ultra-large-scale Kubernetes cluster monitoring and high-QPS business platform alerting, there is an increasing demand for higher overall performance and resource efficiency in SLS time series storage. These requirements can be summarized as follows:

  1. Faster: The system should maintain high QPS for point queries even with large data sizes, and reduce the latency of range queries.
  2. Easier: The solution should be user-friendly without requiring excessive manual optimizations such as HashKey aggregation writing, manual downsampling, and cross-store data aggregation.
  3. Lower cost: The goal is to achieve cost reduction and efficiency improvement simultaneously, with a particular focus on reducing costs as the performance increases.

To meet these requirements, we have implemented various optimizations to the overall solution of the Prometheus time series engine over the past year. These optimizations have resulted in a tenfold improvement in query performance and overall cost reduction.

Technical Upgrade of the SLS Time Series Engine

Smarter Aggregation Writing

Because data from the same time series is highly similar, storing it in one shard results in highly efficient compression, which in turn significantly improves storage and read efficiency.

Initially, users were required to aggregate data by time series with the SLS Producer on the client side and then specify a Shard Hash Key to write the data to a specific shard. However, this approach demanded considerable computing and memory capacity on the client, making it unsuitable for write paths such as RemoteWrite and iLogtail. As the online usage of iLogtail, RemoteWrite, and data distribution scenarios increased, the overall resource efficiency of SLS clusters decreased.

To address this issue, we have implemented an aggregation writing solution for all MetricStores on the SLS gateway. Clients no longer need to use the SDK for data aggregation (though the current SDK-based aggregation writing is not affected). Instead, data is randomly written to an SLS gateway node, and the gateway automatically aggregates the data to ensure that data from the same time series is stored on one shard. The following table compares the advantages and disadvantages of client SDK aggregation writing and SLS gateway aggregation writing.

| | Client SDK aggregation | SLS gateway aggregation |
| --- | --- | --- |
| Client load | High | Low |
| Configuration method | Aggregation buckets are statically configured. If the number of shards after splitting exceeds the number of buckets, the configuration must be adjusted. | No configuration is required. |
| Shard splitting | The hash range must be rebalanced after each shard split; otherwise, writes become unbalanced. | No operation is needed; all shards are balanced automatically. |
| Aggregation policy | The client can aggregate based on time series features such as instanceID. | Not aware of time series features; aggregation is performed uniformly based on metric names and labels. |
| Query policy | Can be optimized based on aggregation features. For example, if the instanceID in a query is known, a specific shard can be queried. | Not aware of query policies; all shards are queried by default. |

As shown in the table above, SLS gateway aggregation writing has certain disadvantages in specific query policies compared to client SDK control. However, this only applies to a small number of users who have high QPS requirements (such as tens of thousands of QPS). In most scenarios, global queries do not affect performance.
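To make the idea of gateway aggregation concrete, the sketch below hashes a sample's metric name and sorted labels to pick a shard, so that all samples of one time series land on the same shard. The key encoding, hash function, and shard count here are illustrative assumptions, not the actual SLS implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// seriesKey builds a canonical key for a time series from its metric
// name and labels, with label names sorted so the key is order-independent.
func seriesKey(metric string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	key := metric
	for _, name := range names {
		key += "," + name + "=" + labels[name]
	}
	return key
}

// shardFor maps a time series to one of numShards shards by hashing its
// canonical key, so every sample of that series is routed to one shard.
func shardFor(metric string, labels map[string]string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(seriesKey(metric, labels)))
	return h.Sum32() % numShards
}

func main() {
	labels := map[string]string{"instance": "10.0.0.1:9100", "job": "node"}
	fmt.Println(shardFor("cpu_util", labels, 16))
}
```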

The Solution for Dashboard Monitoring: Global Cache

Dashboards are a critical application in time series scenarios and often come under heavy query pressure. In scenarios such as stress testing, large-scale promotions, and troubleshooting, multiple users may access the same dashboard simultaneously, and caching these hotspots becomes crucial. However, PromQL requests are precise to the second or millisecond, and PromQL evaluates every step of the computation at the exact timestamps of the request. Directly caching the calculation results therefore yields almost no hits, even for requests issued within the same second.

To address this issue, we attempt to align the range of PromQL queries based on the step. By doing so, the internal results of each step can be reused, significantly improving the cache hit rate. The overall policy is as follows:

  1. When a user request is sent to a compute node with the cache feature enabled, the range is adjusted based on the step.
  2. Use the adjusted range to query the SLS Cache Server and retrieve the portions of the range that hit the cache.
  3. For the portions that miss, query the data in the SLS backend and perform the computation.
  4. Concatenate the results and return them to the client. The incremental computation result is simultaneously updated in the cache.

It is important to note that this method deviates to some extent from standard PromQL behavior. In actual tests, however, aligning the range has little effect on the results. The method can be enabled per MetricStore or per request (for example, through a URL parameter).
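A minimal sketch of the range-alignment step described above: the query's start and end are snapped to step boundaries so that repeated dashboard requests map to the same per-step evaluation timestamps and can reuse cached results. The alignment policy (truncate the start down, round the end up) is an assumption for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// alignRange snaps a PromQL range query's start and end to the step
// boundary. Queries issued at slightly different wall-clock times then
// share the same evaluation timestamps, so per-step results can be cached.
func alignRange(start, end time.Time, step time.Duration) (time.Time, time.Time) {
	alignedStart := start.Truncate(step)
	// Round the end up so the aligned range still covers the request.
	alignedEnd := end.Truncate(step)
	if alignedEnd.Before(end) {
		alignedEnd = alignedEnd.Add(step)
	}
	return alignedStart, alignedEnd
}

func main() {
	step := time.Minute
	start := time.Date(2023, 10, 1, 10, 0, 17, 0, time.UTC)
	end := start.Add(30 * time.Minute)

	s, e := alignRange(start, end, step)
	fmt.Println(s, e) // both snapped to whole-minute boundaries
}
```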

PromQL Can Perform Distributed Parallel Computing

In open-source Prometheus, PromQL computation is performed on a single node in a single coroutine. This approach is suitable for small business scenarios. However, as a cluster grows, the number of time series involved in a computation increases significantly, and single-coroutine, standalone computation is no longer sufficient (with hundreds of thousands of time series, a query spanning several hours usually takes more than 10 seconds).

To address this issue, we introduce a layer of parallel computing architecture to the PromQL computational logic. In this architecture, the majority of the computing workload is distributed to worker nodes, while the master node only aggregates the final results. Additionally, computing concurrency is decoupled from the number of shards, allowing storage and computing to scale independently. The overall computational logic is as follows:

  1. When a user request is sent to a compute node, the node decides whether to use parallel computing based on certain policies, such as whether the query supports it, whether the user has enabled it, and whether historical queries involved a large number of time series.
  2. If parallel computing is used, the compute node is upgraded to a master node (a virtual role), and the query is split in parallel.
  3. The master node sends the subquery to another worker node (a virtual role) for execution.
  4. The worker node executes the subquery and returns the result to the master node.
  5. The master node then consolidates all the results and computes the final result.

Not all queries support parallel computing due to the nature of PromQL, and not all queries that support it can guarantee good results. However, based on our analysis of actual online requests, more than 90% of queries can support parallel computing and achieve acceleration.
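To illustrate why many queries split well, the sketch below shows the master-side merge for an aggregation such as sum(rate(http_requests_total[5m])): each worker computes partial sums over a disjoint subset of time series, and the master adds the partial results per timestamp. The data structures and the two-worker example are hypothetical simplifications, not the actual SLS protocol; aggregations like avg would first have to be rewritten as sum/count before splitting.

```go
package main

import "fmt"

// A partial result from one worker: per-timestamp partial sums over the
// disjoint subset of time series that the worker is responsible for.
type partialResult map[int64]float64

// mergeSums combines partial sums from all workers on the master node.
// Merging by addition is only valid for aggregations that distribute over
// disjoint series partitions (sum, count, min, max, ...).
func mergeSums(parts []partialResult) partialResult {
	merged := make(partialResult)
	for _, part := range parts {
		for ts, v := range part {
			merged[ts] += v
		}
	}
	return merged
}

func main() {
	// Hypothetical partial results for sum(rate(http_requests_total[5m]))
	// computed by two workers over disjoint halves of the time series.
	worker1 := partialResult{1000: 12.5, 1060: 13.0}
	worker2 := partialResult{1000: 7.5, 1060: 7.0}

	fmt.Println(mergeSums([]partialResult{worker1, worker2}))
	// map[1000:20 1060:20]
}
```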

Extraordinary Performance: Computing Pushdown

The previous parallel computing solution solves the issue of computing concurrency. However, for storage nodes, the amount of data sent to computing nodes remains the same whether the computing is standalone or parallel. The overhead of serialization, network transmission, and deserialization still exists, and the parallel computing solution even introduces a bit more overhead. Therefore, the overall resource consumption of the cluster doesn't change significantly. We also analyzed the performance of Prometheus in the current SLS storage-computing separation architecture and found that most of the overhead comes from PromQL computing, Golang GC, and deserialization.

To address this problem, we started exploring whether part of the PromQL computation can be pushed down to the shards of SLS, with each shard acting as a Prometheus worker. There are two options for pushdown computing:

  1. Use the standard Prometheus Golang computing engine: it supports most queries, but the serialization and deserialization overheads remain unavoidable.
  2. Implement some Prometheus operators in C++ and execute them directly on the C++ storage side. This option avoids serialization and deserialization and reduces GC overhead.

After comparison, the second option proves more effective, although it requires more engineering effort. To validate this, we re-analyzed the queries of all online users and found that more than 80% of scenarios use only a few common queries, which makes the implementation cost of the second option relatively low. As a result, we chose the second option: hand-writing a C++ PromQL engine that supports the common operators (see the performance comparison below).
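Because only common operators are pushed down, each query first needs an eligibility check before choosing the C++ path. One way such a check could look is sketched below, using the open-source Prometheus PromQL parser to walk the query AST against an allow-list; the supported-operator set and the fallback decision are assumptions for illustration, not the actual SLS implementation.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

// Hypothetical allow-lists of aggregations and functions that a pushdown
// engine implements; the real set supported by SLS is not listed here.
var supportedAggs = map[parser.ItemType]bool{
	parser.SUM: true, parser.AVG: true, parser.MIN: true,
	parser.MAX: true, parser.COUNT: true,
}

var supportedFuncs = map[string]bool{
	"rate": true, "increase": true, "irate": true,
	"last_over_time": true, "avg_over_time": true, "max_over_time": true,
}

// canPushDown parses a PromQL query and reports whether every aggregation
// and function call in its AST is in the allow-lists. Unsupported queries
// would fall back to the standard Go engine.
func canPushDown(query string) (bool, error) {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return false, err
	}

	ok := true
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		switch n := node.(type) {
		case *parser.AggregateExpr:
			if !supportedAggs[n.Op] {
				ok = false
			}
		case *parser.Call:
			if !supportedFuncs[n.Func.Name] {
				ok = false
			}
		}
		return nil
	})
	return ok, nil
}

func main() {
	for _, q := range []string{
		`sum(rate(http_requests_total[5m])) by (service)`,
		`histogram_quantile(0.99, sum(rate(latency_bucket[5m])) by (le))`,
	} {
		ok, err := canPushDown(q)
		fmt.Println(q, "->", ok, err)
	}
}
```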

Before pushing down computing, it is necessary to ensure that data from the same time series is stored on one shard. This can be achieved by enabling the aggregate writing feature or manually writing data using an SDK.

Comparison between pushdown and parallel computing is as follows:

| | Parallel computing | Pushdown computing |
| --- | --- | --- |
| Concurrency | Can be set dynamically. | Equal to the number of shards. |
| Performance at the same concurrency | Moderate | High |
| Overall resource consumption | Moderate | Low |
| Additional limits | None | Data from the same time series must be stored on one shard. |
| QPS | Low | High |

The Much-Anticipated Built-in Downsampling

In time series scenarios, metrics are usually stored for a long time to analyze overall trends. However, storing the full set of high-precision metrics can be costly, so metric data is often downsampled (reducing metric precision) before long-term storage. The previous solution in SLS was largely manual:

1.  Users needed to use the ScheduledSQL feature to regularly query the original metric database, extract the latest/average value as the downsampled value, and store it in the new MetricStore.

2.  During a query, the appropriate MetricStore for the query range needed to be determined. If the MetricStore was a downsampled MetricStore, some query rewriting was required.

  • For example, if the downsampling precision is 10 minutes and the query is avg(cpu_util), the default LookBackDelta of 3 minutes means the query may find no data point at the 10-minute precision. The query therefore needs to be rewritten as avg(last_over_time(cpu_util[15m])).

This method demanded a lot of configuration and usage know-how, and professional engineering skills were needed to implement it. As a result, we have introduced built-in downsampling capabilities in the SLS backend:

1.  For the downsampling library, you only need to configure the downsampling interval and the duration for which metrics should be stored.

2.  The internal SLS system regularly downsamples the data according to the configuration (retaining the latest point within the downsampling range, similar to last_over_time), and stores it in the new metric library.

3.  During a query, the appropriate metric library is automatically selected based on the query's step and time range.

  • For example, if the metric precision is 1 minute, 10 minutes, and 1 hour respectively, and the query has a step of 30 minutes, a 10-minute library will be selected.
  • If the original metric library only stores data for 3 days and the downsampled library stores data for 30 days, the downsampling library will be selected for a 7-day query.

4.  If you query a downsampled library, SLS automatically rewrites the query or performs data fitting computations. There is no need for you to manually modify the query.
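A rough sketch of the resolution-selection rule described above: given the configured stores, pick the coarsest store whose interval still fits the query step, and fall back to a longer-lived store when the query window exceeds a store's retention. The store structure, field names, and thresholds are illustrative assumptions, not the SLS API.

```go
package main

import (
	"fmt"
	"time"
)

// A metric store with a sampling interval and a retention period.
// These fields are illustrative only.
type metricStore struct {
	name      string
	interval  time.Duration // downsampling precision (1m for the raw store)
	retention time.Duration
}

// pickStore assumes stores are ordered from finest to coarsest interval.
// It prefers the coarsest store whose interval does not exceed the query
// step, then moves to coarser stores if retention cannot cover the window.
func pickStore(stores []metricStore, step, window time.Duration) metricStore {
	chosen := stores[0]
	for _, s := range stores {
		if s.interval <= step {
			chosen = s // a coarser store is still precise enough for this step
		}
	}
	// Retention check: a 7-day query cannot be served by a store that only
	// keeps 3 days of data, so fall back to a coarser, longer-lived store.
	for _, s := range stores {
		if s.interval >= chosen.interval && s.retention >= window {
			return s
		}
	}
	return stores[len(stores)-1]
}

func main() {
	stores := []metricStore{
		{"raw-1m", time.Minute, 3 * 24 * time.Hour},
		{"ds-10m", 10 * time.Minute, 30 * 24 * time.Hour},
		{"ds-1h", time.Hour, 365 * 24 * time.Hour},
	}
	// step=30m -> the 10-minute store; a 7-day window also rules out raw-1m.
	fmt.Println(pickStore(stores, 30*time.Minute, 7*24*time.Hour).name)
}
```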

UnionMetricStore

SLS serves numerous internal and external customers across various regions of China, and data is typically stored in the region where it is generated. However, when building global dashboards and running queries, users often have to query each region manually or develop a gateway that queries and aggregates data from multiple projects and stores. Thanks to the distributed computing and computing pushdown capabilities of SLS, we can provide high-performance, low-resource-consumption cross-project and even cross-region queries at the engine layer: Union MetricStore.

  1. Union MetricStore allows you to associate multiple projects and MetricStores under the same account.
  2. Union MetricStore supports full PromQL.

Note: Due to compliance requirements, Union MetricStore does not support associating projects across countries.

User Experience

[GIF 1: Performance comparison of SLS in different computing modes with 40,000 time series]

  • Because of the network latency along the path local browser > Grafana backend > SLS backend, the cached mode still shows a slight latency in the recording. When the cache is hit, the backend latency is only at the microsecond level.

[GIF 2: Performance comparison of SLS in different computing modes with a maximum of more than 2 million time series]

  • With more than 2 million time series, the errors reported in the normal mode of SLS are deliberate: they prevent a single query from driving a standard Prometheus compute node into an OOM state.

Performance Comparison with Open-Source Prometheus

We conducted a comprehensive performance comparison between SLS and several popular PromQL engines on the market: the open-source standard Prometheus, the distributed solution Thanos, and VictoriaMetrics, which claims to be the fastest PromQL engine on the market.

  • Note: We did not enable the global cache or downsampling during the test. When they take effect, the performance improvement is so large that comparing them with the others under the same standard would not be meaningful.
  • Note: The parallel computing and computing pushdown modes of SLS above remain fully compatible with PromQL syntax, with a compatibility test pass rate of 100%. The following figure shows the compatibility test results of open-source and commercial Prometheus implementations.

[Figure: PromQL compatibility test results of open-source and commercial Prometheus implementations]

Test load:

  • Open source data source: Prometheus Benchmark

Test environment:

  • Prometheus: a standalone instance deployed on a 32 Core 128 GB ECS instance. Storage: PL1 ESSD
  • Thanos: components deployed on five 32 Core 128 GB ECS instances. Storage: PL1 ESSD and OSS
  • VictoriaMetrics: components deployed on five 32 Core 128 GB ECS instances. Storage: PL1 ESSD
  • SLS: three different scenarios with 4, 16, and 64 shards

Test scenario:

  • 100Target: The upper limit of the time series at the same time is 25,600. Replacement rate: 1% (updated every 10 minutes).
  • 1000Target: The upper limit of the time series at the same time is 256,000. Replacement rate: 40% (updated every 10 minutes).
  • 5000Target: The upper limit of the time series at the same time is 1,280,000. Replacement rate: 80% (updated every 10 minutes).
  • Note: The replacement rate indicates how quickly metrics are replaced with new ones. In traditional scenarios, the set of time series usually remains relatively stable. However, in microservice, cloud-native, and machine learning scenarios, metrics are replaced extremely quickly because pods, services, and jobs have short, dynamic lifecycles (each generating a new ID and therefore a new time series). This rapid replacement corresponds to the commonly discussed problems of time series churn and time series expansion.

The performance of three versions of SLS is tested:

  • sls-normal: the basic version of the SLS Prometheus engine, which uses the open-source Prometheus computing engine on top of SLS storage.
  • sls-parallel-32: the SLS Prometheus parallel computing version, which runs parallel computing with a concurrency of 32 (the computing engine is still the open-source Prometheus).
  • sls-pushdown: the SLS computing pushdown version, which uses the C++ implementation of the Prometheus computing engine for common queries.

| Time series | Prometheus | Thanos | VictoriaMetrics | sls-normal | sls-parallel-32 | sls-pushdown |
| --- | --- | --- | --- | --- | --- | --- |
| 6,400 | 500.1 | 953.6 | 24.5 | 497.7 | 101.2 | 54.5 |
| 359,680 | 12260.0 | 26247.5 | 1123.0 | 12800.8 | 1182.2 | 409.8 |
| 1,280,000 | 36882.7 | Mostly failed | 5448.3 | 33330.4 | 3641.4 | 1323.6 |

The table above presents the average latency (in ms) of common queries in three different scenarios. Here are the key observations:

  1. VictoriaMetrics performs well overall thanks to its own implementation of the PromQL logic (at the cost of lower compatibility). It exhibits particularly low latency when the number of time series is small.
  2. The sls-normal version, using the same computing engine as open-source Prometheus and computing in a single coroutine, shows similar performance to Prometheus.
  3. Thanos is focused on distributed solutions. Its PromQL computing performance is average, and large queries may result in timeouts or encounter out-of-memory errors.
  4. The sls-pushdown feature significantly improves performance, especially for ultra-large-scale time series. It shows an overall improvement of more than 10 times compared to the sls-normal version. If computing is fully pushed down, the actual performance improvement can reach 100 times.

Summary

In the same scenario, the longer the storage time, the better the cost-effectiveness of downsampling.

Technical upgrades have been implemented in some SLS clusters and will gradually cover all clusters in the future. However, more time is needed to ensure overall stability. Please be patient.

The pursuit of better performance, cost, stability, and security is a continuous process. SLS will keep improving in these areas and continue to provide reliable storage engines for observability data.
