
Prometheus: The Unicorn in Metrics

In this article, we will learn how Prometheus stores large amounts of data through its time series database layer.

By Jiao Xian

Overview

Prometheus is an all-in-one monitoring and alerting platform developed by SoundCloud, with few external dependencies and comprehensive functionality.

It joined the CNCF (Cloud Native Computing Foundation) in 2016 and is widely used for monitoring Kubernetes clusters. In August 2018, it became the second project to "graduate" from CNCF after Kubernetes. As an important member of the CNCF ecosystem, Prometheus is second only to Kubernetes in its level of activity.

Key Functions:

  • Multi-dimensional data model: metrics and labels.
  • Flexible query language: PromQL. In a single query statement, operations such as multiplication, addition, concatenation, and quantile calculation can be performed across multiple metrics.
  • It can be deployed independently. It is out-of-the-box, and does not rely on distributed storage.
  • It collects data based on HTTP using the pull method.
  • It is compatible with the push method through the push gateway.
  • It obtains monitored items through static configuration or service discovery.
  • It supports charts, dashboards and more.

Core Components:

  • Prometheus server: To collect and store time series data.
  • Client library: To connect to the Prometheus server. It can be used to query and report data.
  • Push gateway (for short-lived tasks): an aggregation point for metrics from batch jobs that run only for a short period of time. It is mainly used for business data reporting. (A minimal push example follows this list.)
  • Customized exporters (such as the HAProxy, StatsD, and Graphite exporters): plug-ins that report machine and third-party system data.
  • Alert manager: Prometheus can configure rules, and then periodically query the data. When the condition is triggered, the alert is pushed to the configured alert manager.
  • A variety of support tools
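As an illustration of the push gateway component, the following is a minimal sketch in Go using the client_golang library. The push gateway address, the job name, and the metric name are illustrative values, not part of the original article.

package main

import (
    "log"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

func main() {
    // A gauge produced by a short-lived batch job (metric name is illustrative).
    lastRun := prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "demo_batch_last_success_timestamp_seconds",
        Help: "Unix timestamp of the last successful batch run.",
    })
    lastRun.SetToCurrentTime()

    // Push the metric to the push gateway; Prometheus later pulls it from there.
    if err := push.New("http://pushgateway:9091", "demo_batch_job").
        Collector(lastRun).
        Push(); err != nil {
        log.Fatal(err)
    }
}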

Advantages and Disadvantages:

  • In terms of scenarios, compared with InfluxDB, PTSDB is suitable for numerical time series data. However, it is not suitable for log-type time series data or for metric statistics used in billing. InfluxDB aims to be a universal time series platform covering scenarios such as log monitoring, while Prometheus focuses more on a metrics solution. Many similarities exist between the two systems, including collection, storage, alerting, and display.

    • The combination of InfluxData: Telegraf + InfluxDB + Kapacitor + Chronograf
    • The combination of Prometheus: Exporter + Prometheus server + Alert Manager + Grafana
    • The collection side of Prometheus mainly adopts the pull method, while also supporting the push method through the push gateway. Telegraf, the collection tool of InfluxData, mainly adopts the push method.
    • In terms of storage, the two are similar in their basic ideas. However, differences exist in key points such as timeline indexing and the way out-of-order data is handled.
    • InfluxDB supports a multi-valued model, the String type, and more, so the content it can store is richer.
    • Kapacitor combines, in a single tool, the data processing, storage rules, alert rules, and alert notification functions that Prometheus provides separately. Prometheus's alert manager further provides grouping, deduplication, and more.
    • The cluster mode previously provided by the open-source InfluxDB has been removed; only relay-based high availability is retained, and the cluster mode is now released as a feature of the commercial version. Prometheus provides its own cluster mode, which aggregates multiple Prometheus nodes through multiple levels of proxying (federation) to scale out.
      Prometheus also supports remote storage, so the two systems can be integrated with each other, with InfluxDB serving as the remote storage of Prometheus.
  • The data model of OpenTSDB is almost the same as that of Prometheus. The PromQL query language is simpler and more concise, while OpenTSDB offers richer functionality. OpenTSDB relies on the Hadoop ecosystem, whereas Prometheus grew up in the Kubernetes ecosystem.

Data Model

  • Single-valued model is adopted. The core concepts of the data model are metrics, labels, and samples.
  • Format: metric_name{label_name="label_value", …}

    • For example: http_requests_total{method="POST",endpoint="/api/tracks"}.
  • The metric name has business implications. For example: http_requests_total. (A minimal instrumentation sketch in Go follows this list.)

    • The types of metrics are divided into Counter, Gauge, Histogram, and Summary.
  • Labels are used to represent dimensions. Samples consist of timestamps and numeric values.
  • jobs and instances

    • Prometheus automatically generates job and instance labels for scraped targets. For example:
  • job: api-server
  • instance 1: 1.2.3.4:5670
  • instance 2: 1.2.3.4:5671
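To make the data model concrete, here is a minimal instrumentation sketch in Go using the client_golang library. The metric and label names follow the example above; the listen address is an arbitrary choice.

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // A counter with two label dimensions, matching the example metric above.
    httpRequestsTotal := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests.",
        },
        []string{"method", "endpoint"},
    )
    prometheus.MustRegister(httpRequestsTotal)

    // Increment the series http_requests_total{method="POST",endpoint="/api/tracks"}.
    httpRequestsTotal.WithLabelValues("POST", "/api/tracks").Inc()

    // Expose the metrics endpoint for Prometheus to pull.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

When Prometheus scrapes this endpoint, the job and instance labels are attached automatically according to the scrape configuration.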

Overall Design Concept

[Figure 1]

The overall technical architecture of Prometheus can be divided into several important modules:

  • Main function: as the entry point, it is responsible for starting, connecting, and managing the various components. It coordinates the operation of the components in an Actor-like manner.
  • Configuration: it handles the parsing, validating, and loading of configuration items.
  • Scrape discovery manager: the service discovery manager communicates with the scrape manager through a synchronous channel. When the configuration changes, the service must be restarted for the change to take effect.
  • Scrape manager: It scrapes metrics and sends them to storage components.
  • Storage:

    • Fanout storage: the proxy abstraction layer of storage. It shields the details of the local storage and the remote storage at the underlying layer, writes samples in double write mode, and reads them in a merged way.
    • Remote storage: the remote storage creates a queue manager, which sends data in turn based on the load shards, and reads data from remote endpoints through the client and merges the results.
    • Local storage: a lightweight time series database based on the local disk.
  • PromQL engine: The query expression is parsed into an abstract syntax tree and an executable query, and the data is loaded in the Lazy Load mode.
  • Rule manager: manages the alert rules.
  • Notifier: the distribution manager for alert notifications.
  • Notifier discovery: service discovery for notification receivers (alert manager instances).
  • Web UI and API: the embedded management interface. It can run query expression analysis and show the results.

PTSDB Overview

This article focuses on the analysis of Local Storage PTSDB. The core of PTSDB includes: inverted index + window storage block.

Data is written using a two-hour time window. The data generated within the current two hours is stored in a head block. Each block contains all the sample data (chunks), the metadata file (meta.json), and the index file (index) for its time window.

The newly written data is kept in a memory block and written to the disk two hours later. A background thread eventually merges the two-hour blocks into larger blocks. For a general database, if the memory size is fixed, the write and read performance of the system is limited by this configured memory size. The memory usage of PTSDB is instead determined by the minimum time window, the collection period, and the number of timelines.

A WAL (write-ahead log) mechanism is implemented to prevent the loss of in-memory data. Deleted records are stored in a separate tombstone file.

Core Data Structure and Storage Format

The core data structure of PTSDB is the HeadAppender. When the Appender commits, the samples are encoded into the WAL log, flushed to the disk, and written to the head block.

[Figure 2]

PTSDB local storage uses a custom file structure. It mainly includes: WALs, metadata files, indexes, chunks, and tombstones.
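For reference, a typical on-disk layout looks roughly like the following; block directory names are ULIDs, and the exact names shown here are illustrative:

data/
├── 01BKGV7JBM69T2G1BGBGM6KB12      # one block (a two-hour or compacted window)
│   ├── chunks
│   │   └── 000001                  # chunk segment file
│   ├── index                       # inverted index
│   ├── meta.json                   # block metadata
│   └── tombstones                  # deletion markers
└── wal
    ├── 000000001                   # WAL segment files
    └── 000000002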

[Figure 3]

Write Ahead Log

The WAL has three record encoding formats: timelines (series), data points (samples), and delete points (tombstones). The general policy is to roll over segments based on file size and to perform cleanup based on the minimum in-memory time.

  • Logs are written in units of segments. Each segment is 128 MB by default, and records are flushed in 32 KB pages. When the remaining space of a segment is smaller than the size of the new record, a new segment is created. (A sketch of this rollover policy follows this list.)
  • During compaction, the WAL executes a time-based cleanup policy: WAL logs whose time is earlier than the minimum in-memory time of the blocks are deleted.
  • On restart, the latest segment is opened first, and data is restored from the log into memory.
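The page and segment rollover policy described above can be sketched roughly as follows. The sizes match the defaults mentioned above, but the writer structure is illustrative and not the actual PTSDB implementation.

package main

import "fmt"

// Illustrative constants matching the defaults described above.
const (
    pageSize    = 32 * 1024         // 32 KB page
    segmentSize = 128 * 1024 * 1024 // 128 MB segment
)

// walWriter is a hypothetical writer that tracks page and segment usage.
type walWriter struct {
    pageUsed    int // bytes used in the current page
    segmentUsed int // bytes used in the current segment
    segments    int // number of segments created so far
}

func (w *walWriter) write(record []byte) {
    // Flush the current page when the record does not fit into it.
    if w.pageUsed+len(record) > pageSize {
        w.pageUsed = 0
    }
    // Start a new segment when the record does not fit into the remaining space.
    if w.segmentUsed+len(record) > segmentSize {
        w.segments++
        w.segmentUsed, w.pageUsed = 0, 0
    }
    w.pageUsed += len(record)
    w.segmentUsed += len(record)
}

func main() {
    w := &walWriter{segments: 1}
    for i := 0; i < 10000; i++ {
        w.write(make([]byte, 16*1024)) // 16 KB records
    }
    fmt.Printf("segments in use: %d\n", w.segments)
}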

[Figure 4]

Metadata File

The meta.json file records the details of a chunk block: for example, which smaller chunks the newly compacted chunk comes from, and the statistical information of this chunk, such as the minimum and maximum time range, the number of timelines, and the number of data points.

The compaction thread uses this statistical information to determine whether the block should be compacted: whether (maxTime - minTime) accounts for 50% of the overall compaction time range, and whether the number of deleted timelines accounts for 5% of the total number.
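As a rough illustration of this heuristic, the sketch below applies the two thresholds from the paragraph above. The struct and function are illustrative, not the actual PTSDB code.

package main

import "fmt"

// blockMeta is an illustrative subset of the statistics kept in meta.json.
type blockMeta struct {
    MinTime, MaxTime int64  // time range covered by the block (milliseconds)
    NumSeries        uint64 // total number of timelines
    NumTombstones    uint64 // number of deleted timelines
}

// shouldCompact applies the heuristic described above against a compaction range.
func shouldCompact(m blockMeta, rangeMillis int64) bool {
    coversHalfRange := (m.MaxTime - m.MinTime) >= rangeMillis/2
    manyDeletions := m.NumSeries > 0 && float64(m.NumTombstones)/float64(m.NumSeries) >= 0.05
    return coversHalfRange && manyDeletions
}

func main() {
    m := blockMeta{MinTime: 0, MaxTime: 4 * 3600 * 1000, NumSeries: 10000, NumTombstones: 600}
    // Covers more than half of a 6h compaction range and has more than 5% deleted timelines.
    fmt.Println(shouldCompact(m, 6*3600*1000)) // true
}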

[Figure 5]

Index

The index is first written partially in the head block, and is flushed to the disk when compaction is triggered.

The index uses an inverted structure. The IDs in the posting list are locally auto-incremented and serve as reference IDs for the timelines. During compaction, the index is flushed to the disk in 6 steps: Symbols -> Series -> LabelIndex -> Posting -> OffsetTable -> TOC.

  • Symbols stores a string table in which label keys and values are sorted in increasing alphabetical order. For example: __name__, go_gc_duration_seconds, instance, and localhost:9090. The strings are uniformly encoded in UTF-8.
  • Series stores two parts of information. One part is the symbol table reference of the label key-value pair. The other part is the index from the timeline to the data file, which cuts and stores the specific location information of the data block records according to the time window, so that a large number of records in non-query windows can be quickly skipped during query.

To save space, the timestamp range and the location information of the data block are stored using difference (delta) encoding; a small encoding sketch follows this list.

  • LabelIndex stores the label keys and all the label values corresponding to each label key. The stored data is, again, references into the symbol table.
  • Posting stores the inverted list of timeline reference IDs corresponding to each label pair.
  • OffsetTable is a layer of mapping that speeds up lookups; this part of the data is loaded into memory. It is mainly associated with the LabelIndex and Posting data blocks. TOC stores the position offset of each data block; if a block has no data, the lookup can be skipped.
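As an illustration of the delta encoding mentioned above, the sketch below writes a sequence of increasing timestamps as varint-encoded differences using Go's encoding/binary package. The sample timestamps are made up.

package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    // Increasing timestamps (milliseconds), as within one time window.
    ts := []int64{1600000000000, 1600000015000, 1600000030000, 1600000045000}

    buf := make([]byte, 0, 64)
    tmp := make([]byte, binary.MaxVarintLen64)
    var prev int64
    for _, t := range ts {
        // Store only the difference from the previous timestamp.
        n := binary.PutVarint(tmp, t-prev)
        buf = append(buf, tmp[:n]...)
        prev = t
    }
    // Each small delta takes only a few bytes instead of 8 bytes per raw value.
    fmt.Printf("encoded %d timestamps into %d bytes\n", len(ts), len(buf))
}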

[Figure 6]

Chunks

Data points are stored in the chunks directory. Each chunk segment file is at most 512 MB by default. The data encoding method supports XOR compression. Chunks are indexed by a reference ID, which consists of a segment ID and an offset inside the file.
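One common way to pack such a reference is to place the segment ID in the upper bits and the in-file offset in the lower bits of a single integer; the 32/32 split below is an assumption for illustration.

package main

import "fmt"

// packRef combines a segment ID and an in-file offset into a single 64-bit reference.
func packRef(segment, offset uint32) uint64 {
    return uint64(segment)<<32 | uint64(offset)
}

// unpackRef recovers the segment ID and offset from a packed reference.
func unpackRef(ref uint64) (segment, offset uint32) {
    return uint32(ref >> 32), uint32(ref)
}

func main() {
    ref := packRef(3, 0x1F40) // a chunk at offset 8000 in segment file 3 (illustrative values)
    seg, off := unpackRef(ref)
    fmt.Printf("ref=%#x segment=%d offset=%d\n", ref, seg, off)
}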

[Figure 7]

Tombstones

Records are deleted by marking, and the data is physically cleared when compaction and reloading are performed. The deleted records are stored in units of time windows.

[Figure 8]

Query PromQL

The query language of Prometheus is PromQL. The PromQL engine handles syntax parsing into an AST, the execution plan, and data aggregation. The fanout module sends the query to both the local and remote endpoints simultaneously. PTSDB is responsible for local data retrieval.

PTSDB implements the defined adaptors, including Select, LabelNames, LabelValues, and Querier.

PromQL defines three types of queries:

Instant vector: contains a set of time series, and each time series has only one point. For example: http_requests_total

Range vector: contains a set of time series, and each time series has multiple points. For example: http_requests_total[5m]

Scalar: only has one number and no time series. For example: count(http_requests_total)
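The following sketch runs an instant query and a range query against a Prometheus server through the Go HTTP API client from client_golang. The server address and query strings are illustrative.

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
    if err != nil {
        log.Fatal(err)
    }
    promAPI := v1.NewAPI(client)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Instant vector: one point per matching time series.
    result, warnings, err := promAPI.Query(ctx, `http_requests_total{code="200"}`, time.Now())
    if err != nil {
        log.Fatal(err)
    }
    if len(warnings) > 0 {
        fmt.Println("warnings:", warnings)
    }
    fmt.Println("instant:", result)

    // Range query: multiple points per series over the past 5 minutes.
    rangeResult, _, err := promAPI.QueryRange(ctx, "http_requests_total", v1.Range{
        Start: time.Now().Add(-5 * time.Minute),
        End:   time.Now(),
        Step:  15 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("range:", rangeResult)
}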

Some typical queries include:

  • Query all current data
http_requests_total
select * from http_requests_total where timestamp between xxxx and xxxx
  • Criteria query
http_requests_total{code="200", handler="query"}
select * from http_requests_total where code="200" and handler="query" and timestamp between xxxx and xxxx
  • Fuzzy query. To query data with code 2xx:
http_requests_total{code=~"2.."}
select * from http_requests_total where code like "2__" and timestamp between xxxx and xxxx
  • Value filtering. To query data with the value greater than 100:
http_requests_total > 100
select * from http_requests_total where value > 100 and timestamp between xxxx and xxxx
  • Range query. To query data for the past 5 minutes:
http_requests_total[5m]
select * from http_requests_total where timestamp between xxxx-5m and xxxx
  • Count query. To count the total number of current records:
count(http_requests_total)
select count(*) from http_requests_total where timestamp between xxxx and xxxx
  • Sum query. To count the total value of the current data:
sum(http_requests_total)
select sum(value) from http_requests_total where timestamp between xxxx and xxxx
  • Top query. To query the top three values:
topk(3, http_requests_total)
select * from http_requests_total where timestamp between xxxx and xxxx order by value desc limit 3
  • Irate query. To query the rate:
irate(http_requests_total[5m])
select code, handler, instance, job, method, sum(value)/300 AS value from http_requests_total where timestamp between xxxx and xxxx group by code, handler, instance, job, method;

Key Technical Points of PTSDB

Handling Out-of-Order Data

PTSDB uses a minimum time window to handle out-of-order data: it specifies a valid minimum timestamp, and data older than this timestamp is discarded and not processed.

The valid minimum timestamp depends on the earliest timestamp in the current head block and the chunk range that can be stored.

This behavioral limitation on data greatly simplifies the design, and lays the foundation for efficient compaction and data integrity.
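A minimal sketch of this rejection rule follows. The half-chunk-range rule used to derive the bound, and the head structure itself, are assumptions for illustration rather than the actual PTSDB code.

package main

import (
    "errors"
    "fmt"
)

var errOutOfBounds = errors.New("timestamp too old: out of bounds")

// head is an illustrative stand-in for the in-memory head block.
type head struct {
    maxTime    int64 // newest timestamp seen so far (milliseconds)
    chunkRange int64 // size of a chunk time window (milliseconds)
}

// minValidTime derives the oldest timestamp still accepted (illustrative rule).
func (h *head) minValidTime() int64 {
    return h.maxTime - h.chunkRange/2
}

// append rejects samples older than the valid minimum timestamp.
func (h *head) append(t int64, v float64) error {
    if t < h.minValidTime() {
        return errOutOfBounds
    }
    if t > h.maxTime {
        h.maxTime = t
    }
    // ... store the sample (omitted) ...
    return nil
}

func main() {
    h := &head{maxTime: 7_200_000, chunkRange: 7_200_000} // 2h window
    fmt.Println(h.append(7_300_000, 1)) // accepted
    fmt.Println(h.append(1_000_000, 1)) // rejected as out of bounds
}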

Memory Management

mmap is used to read the compacted, merged large files (without occupying too many file handles).

A mapping is established between the process's virtual address space and the file offsets. The data is actually read into physical memory only when a query touches the corresponding location.

Because the file is mapped directly, the extra copy from the file system page cache into a user-space buffer is avoided.

After the query completes, the corresponding memory is automatically reclaimed by Linux based on memory pressure, and before reclamation it can still serve the next query as a cache hit.

Therefore, using mmap to automatically manage the memory cache required for queries has the advantage of simple management and efficient processing.
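For illustration, here is a minimal read-only memory mapping of a file on Linux using Go's syscall package. The file path is arbitrary and error handling is kept short.

package main

import (
    "fmt"
    "log"
    "os"
    "syscall"
)

func main() {
    f, err := os.Open("data/chunks/000001") // illustrative path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }

    // Map the whole file read-only; pages are faulted in only when accessed.
    data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        log.Fatal(err)
    }
    defer syscall.Munmap(data)

    // Reading a slice of the mapping pulls pages in from the page cache.
    fmt.Printf("mapped %d bytes, first byte: %#x\n", len(data), data[0])
}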

Compaction

The main operations of the compaction include merging blocks, deleting expired data, and refactoring chunk data.

  • Merging multiple blocks into a larger block can effectively reduce the number of blocks, and can avoid merging the query results of many blocks when the query covers a long time range.
  • To improve the deletion efficiency, the location of the deleted time series data is recorded when it is deleted. The entire directory of the block is deleted when all data of the block needs to be deleted.
  • The size of block merging also needs to be limited to avoid retaining excessive deleted space (extra space usage).

It is better to compute the maximum block duration as a percentage (for example, 10%) of the data retention period. Compaction is executed when the minimum and maximum durations of the blocks exceed the time range of 2 or 3 blocks, respectively.

Snapshot

PTSDB provides a snapshot-based data backup function. You can use the admin snapshot API to generate snapshots. The snapshot data is stored in the data/snapshots/<datetime>-<rand> directory.

PTSDB Best Practices

  • Generally, each sample stored in Prometheus occupies about 1 to 2 bytes. To plan the local disk capacity of the Prometheus server, use the following formula (a worked example follows this list):
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
  • With the retention time (retention_time_seconds) and the sample size (bytes_per_sample) unchanged, if you want to reduce the capacity requirement of the local disk, you can only reduce the number of samples obtained per second (ingested_samples_per_second).
    Therefore, two methods can be adopted. One is to reduce the number of time series, and the other is to increase the time interval for collecting samples.
    Considering that Prometheus compacts the time series, the effect of reducing the number of time series is more obvious.
  • The limitations of PTSDB are that it supports neither clustering nor replication. Therefore, when a node goes down, the data in a certain window is lost.
    If the business does not require extremely high data reliability, the local disk can still store persistent data for several years.
    When PTSDB corruption occurs, it can be recovered by removing the data directory or the directory of a certain time window.
  • For PTSDB, the high availability, and the preservation of clusters and historical data can be achieved through external solutions, which are not covered in this article.
  • Due to historical limitations, early versions of PTSDB stored each timeline in a separate file. This solution has many drawbacks. For example:
    the disk-flushing pressure of snapshots; the burden of regular file cleanup; low-cardinality, long-period queries that need to open a large number of files; and timeline churn that may exhaust inodes.
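As a worked example of the capacity formula above, the numbers below (retention, ingestion rate, and bytes per sample) are made up for illustration.

package main

import "fmt"

func main() {
    // Illustrative inputs: 15 days of retention, 100,000 samples/s, 2 bytes per sample.
    retentionSeconds := 15 * 24 * 3600.0
    samplesPerSecond := 100_000.0
    bytesPerSample := 2.0

    neededDiskSpace := retentionSeconds * samplesPerSecond * bytesPerSample
    fmt.Printf("needed_disk_space = %.0f bytes (about %.0f GB)\n", neededDiskSpace, neededDiskSpace/1e9)
}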

Challenges for PTSDB

During usage, PTSDB has also encountered some problems in some aspects. For example:

  • Compaction has an impact on I/O, CPU, and memory.
  • After a cold start, CPU and memory usage increase during the WAL replay phase.
  • Issues such as CPU spikes may occur during high-speed writing.

Summary

PTSDB, as the de facto standard for storing time series data in the Kubernetes monitoring solution, has gradually gained influence and popularity in the time series field. Alibaba TSDB currently supports serving as the remote storage of Prometheus through an adapter.
