Prometheus Remote Write configuration - Time Series Database

Alibaba Cloud TSDB instances of different specifications have different maximum write TPS limits. These limits prevent excessive write traffic from making a TSDB instance unavailable and keep the instance running normally. When the write TPS exceeds the maximum allowed by the TSDB instance, the instance's traffic protection rule is triggered and writes fail with errors. Therefore, the Prometheus remote_write configuration needs to be adjusted according to the TSDB instance specification, so that the metrics collected by Prometheus are written to TSDB smoothly and reliably.

All remote_write configuration items are described in the official Prometheus documentation. This article covers only the recommended write configuration for connecting Prometheus to Alibaba Cloud TSDB. To improve write efficiency, Prometheus buffers the collected samples in an in-memory queue and then sends them to remote storage in batches. The parameters of this queue have a large impact on how efficiently Prometheus writes to remote storage. The main configuration items are as follows.

# Configures the queue used to write to remote storage.
queue_config:
  # Number of samples to buffer per shard before we start dropping them.
  [ capacity: <int> | default = 10000 ]
  # Maximum number of shards, i.e. amount of concurrency.
  [ max_shards: <int> | default = 1000 ]
  # Minimum number of shards, i.e. amount of concurrency.
  [ min_shards: <int> | default = 1 ]
  # Maximum number of samples per send.
  [ max_samples_per_send: <int> | default = 100 ]
  # Maximum time a sample will wait in buffer.
  [ batch_send_deadline: <duration> | default = 5s ]
  # Maximum number of times to retry a batch on recoverable errors.
  [ max_retries: <int> | default = 3 ]
  # Initial retry delay. Gets doubled for every retry.
  [ min_backoff: <duration> | default = 30ms ]
  # Maximum retry delay.
  [ max_backoff: <duration> | default = 100ms ]

In the configuration above, min_shards is supported only in Prometheus V2.6.0 and later. Versions earlier than V2.6.0 use 1 by default, so unless you have a special requirement you do not need to set this parameter.

Of the parameters above, max_shards and max_samples_per_send together determine the maximum TPS at which Prometheus writes to remote storage. Assuming that sending 100 samples takes 100 ms, the default configuration above allows a maximum write TPS of 1000 * 100 / 0.1 s = 1,000,000 samples/s. If the maximum write TPS of the purchased TSDB instance is lower than 1,000,000 samples/s, the instance's rate limiting protection rule is easily triggered, and writes fail with errors. The table below gives reference remote_write configurations for connecting Prometheus to TSDB instances of different specifications. Adjust them as needed for your usage scenario.
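
The arithmetic above can be sketched in a few lines of Python. This is only an illustrative estimate and not part of Prometheus: the function name estimate_max_write_tps and its send_latency_ms parameter are hypothetical, and the 100 ms send time is the same assumption used in the text rather than a measured value.

```python
def estimate_max_write_tps(max_shards: int,
                           max_samples_per_send: int,
                           send_latency_ms: float) -> float:
    """Upper bound on samples written per second across all shards.

    Assumes every shard sends full batches back to back and that each
    send takes send_latency_ms milliseconds (an assumption, not a
    measurement).
    """
    if send_latency_ms <= 0:
        raise ValueError("send_latency_ms must be positive")
    sends_per_second_per_shard = 1000 / send_latency_ms
    return max_shards * max_samples_per_send * sends_per_second_per_shard


# Prometheus defaults listed above: max_shards=1000, max_samples_per_send=100.
print(estimate_max_write_tps(1000, 100, 100))
# A single shard sending 500 samples per batch.
print(estimate_max_write_tps(1, 500, 100))
```

Under these assumptions the defaults allow up to 1,000,000 samples/s, while a single shard sending 500 samples per batch allows up to 5,000 samples/s. Real throughput depends on the actual network and server latency.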

TSDB specification ID	Maximum write rate (data points/second)	Reference configuration
mlarge	5000	capacity:10000 max_samples_per_send:500 max_shards:1
large	10000	capacity:10000 max_samples_per_send:500 max_shards:2
3xlarge	30000	capacity:10000 max_samples_per_send:500 max_shards:6
4xlarge	40000	capacity:10000 max_samples_per_send:500 max_shards:8
6xlarge	60000	capacity:10000 max_samples_per_send:500 max_shards:12
12xlarge	120000	capacity:10000 max_samples_per_send:500 max_shards:24
24xlarge	240000	capacity:10000 max_samples_per_send:500 max_shards:48
48xlarge	480000	capacity:10000 max_samples_per_send:500 max_shards:96
96xlarge	960000	capacity:10000 max_samples_per_send:500 max_shards:192
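
The max_shards values in the table are consistent with each shard delivering about 5,000 data points per second, that is, 500 samples per send at roughly 100 ms per send. The following is a minimal sketch of that relationship, assuming the 100 ms send time holds. The function name suggest_max_shards is hypothetical and is not part of Prometheus or TSDB.

```python
def suggest_max_shards(write_limit_per_second: int,
                       max_samples_per_send: int = 500,
                       send_latency_ms: int = 100) -> int:
    """Largest shard count whose estimated throughput stays within the limit.

    Assumes each shard sends one full batch every send_latency_ms
    milliseconds (an assumption inferred from the table, not a guarantee).
    Always returns at least 1, because Prometheus needs one shard to write.
    """
    if min(write_limit_per_second, max_samples_per_send, send_latency_ms) <= 0:
        raise ValueError("all arguments must be positive")
    # Per-shard rate is max_samples_per_send * 1000 / send_latency_ms.
    # Integer arithmetic avoids floating-point rounding.
    shards = (write_limit_per_second * send_latency_ms) // (max_samples_per_send * 1000)
    return max(1, shards)


# Write limits taken from the table above.
for spec, limit in [("mlarge", 5000), ("large", 10000), ("3xlarge", 30000)]:
    print(spec, suggest_max_shards(limit))
```

If the observed send latency differs noticeably from 100 ms, the suggested value should be treated as a starting point and tuned against the instance's actual write limit.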

Taking a TSDB instance of the mlarge specification as an example, a complete Prometheus reference configuration is as follows:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
# Remote write configuration (TSDB).
remote_write:
  - url: "http://ts-xxxxxxxxxxxx.hitsdb.rds.aliyuncs.com:3242/api/prom_write"
    # Configures the queue used to write to remote storage.
    queue_config:
      # Number of samples to buffer per shard before we start dropping them.
      capacity: 10000
      # Maximum number of shards, i.e. amount of concurrency.
      max_shards: 1
      # Maximum number of samples per send.
      max_samples_per_send: 500

# Remote read configuration (TSDB).
remote_read:
  - url: "http://ts-xxxxxxxxxxxx.hitsdb.rds.aliyuncs.com:3242/api/prom_read"
    read_recent: true