Prometheus Remote Write configuration

Last Updated: May 19, 2022

Alibaba Cloud TSDB instances of different specifications have different maximum write TPS limits. These limits keep excessive write traffic from making a TSDB instance unavailable and so protect its normal operation. When the write TPS exceeds the maximum allowed by the TSDB instance, the instance's traffic protection rule is triggered and writes fail. Therefore, Prometheus’ remote_write configuration needs to be adjusted according to the TSDB instance specification, so that the metrics collected by Prometheus can be written to TSDB smoothly and reliably.

All configuration items of Prometheus’ remote_write are described in the official Prometheus documentation. This article covers only the best practices for the write configuration when Prometheus is connected to Alibaba Cloud TSDB. To improve write efficiency, Prometheus buffers the collected samples in an in-memory queue and then sends them to the remote storage in batches. The parameters of this queue strongly affect how efficiently Prometheus writes to remote storage. The main configuration items are as follows.

    # Configures the queue used to write to remote storage.
    queue_config:
      # Number of samples to buffer per shard before we start dropping them.
      [ capacity: <int> | default = 10000 ]
      # Maximum number of shards, i.e. amount of concurrency.
      [ max_shards: <int> | default = 1000 ]
      # Minimum number of shards, i.e. amount of concurrency.
      [ min_shards: <int> | default = 1 ]
      # Maximum number of samples per send.
      [ max_samples_per_send: <int> | default = 100 ]
      # Maximum time a sample will wait in buffer.
      [ batch_send_deadline: <duration> | default = 5s ]
      # Maximum number of times to retry a batch on recoverable errors.
      [ max_retries: <int> | default = 3 ]
      # Initial retry delay. Gets doubled for every retry.
      [ min_backoff: <duration> | default = 30ms ]
      # Maximum retry delay.
      [ max_backoff: <duration> | default = 100ms ]
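
The retry comments in the listing above (an initial delay that doubles on every retry, up to a maximum) can be illustrated with a short Python sketch. It is based only on those comments and is not taken from the Prometheus source; the function name `retry_delays_ms` is hypothetical, and the real implementation may differ in detail.

```python
def retry_delays_ms(min_backoff_ms: int, max_backoff_ms: int, max_retries: int) -> list:
    """Delay before each retry, in milliseconds.

    Assumed model (from the config comments only): start at min_backoff,
    double after every retry, and never exceed max_backoff.
    """
    delays = []
    delay = min_backoff_ms
    for _ in range(max_retries):
        delays.append(min(delay, max_backoff_ms))
        delay *= 2
    return delays


# Defaults listed above: min_backoff = 30ms, max_backoff = 100ms, max_retries = 3.
print(retry_delays_ms(30, 100, 3))  # [30, 60, 100]
```

Under this model a failing batch is retried for well under a second in total with the default values, so a sustained throttling period on the TSDB side is not bridged by retries alone.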

Of the configuration items above, min_shards is supported only in Prometheus V2.6.0 and later. In versions earlier than V2.6.0 the value is 1 by default, so unless you have a special need, you do not have to set this parameter.

The max_shards and max_samples_per_send parameters determine the maximum TPS at which Prometheus writes to remote storage. Assuming that sending 100 samples takes 100 ms, the default configuration above gives a maximum write TPS of 1000 * 100 / 0.1 s = 1,000,000 samples per second. If the maximum write TPS of the purchased TSDB instance is lower than this, the instance's rate-limiting protection rule is easily triggered and writes fail. Reference remote_write configurations for connecting Prometheus to TSDB instances of different specifications are given below. They can be adjusted for different usage scenarios.
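
The arithmetic in the preceding paragraph can be written as a small Python sketch. The function name `max_write_tps` and the `send_ms` parameter are illustrative only; the 100 ms send time is the assumption stated in the text, not a measured value.

```python
def max_write_tps(max_shards: int, max_samples_per_send: int, send_ms: int) -> float:
    """Upper bound on samples written per second across all shards.

    Each shard is assumed to send one batch of max_samples_per_send samples
    every send_ms milliseconds, and all shards are assumed to run in parallel.
    """
    return max_shards * max_samples_per_send * 1000 / send_ms


# Default queue settings (max_shards = 1000, max_samples_per_send = 100), 100 ms per send.
print(max_write_tps(1000, 100, 100))  # 1000000.0
```

If sends actually take longer than the assumed 100 ms, the real ceiling is lower than this estimate.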

TSDB specification ID    Maximum write rate (data points/second)    Reference configuration
mlarge                   5000                                       capacity: 10000, max_samples_per_send: 500, max_shards: 1
large                    10000                                      capacity: 10000, max_samples_per_send: 500, max_shards: 2
3xlarge                  30000                                      capacity: 10000, max_samples_per_send: 500, max_shards: 6
4xlarge                  40000                                      capacity: 10000, max_samples_per_send: 500, max_shards: 8
6xlarge                  60000                                      capacity: 10000, max_samples_per_send: 500, max_shards: 12
12xlarge                 120000                                     capacity: 10000, max_samples_per_send: 500, max_shards: 24
24xlarge                 240000                                     capacity: 10000, max_samples_per_send: 500, max_shards: 48
48xlarge                 480000                                     capacity: 10000, max_samples_per_send: 500, max_shards: 96
96xlarge                 960000                                     capacity: 10000, max_samples_per_send: 500, max_shards: 192
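
The max_shards values in the table are consistent with one shard sending 500 samples every 100 ms, that is, 5000 data points per second per shard. This is an inference from the table combined with the 100 ms assumption used earlier, not a documented sizing rule. Under that assumption, a hypothetical helper such as `max_shards_for` reproduces the reference values:

```python
def max_shards_for(limit_points_per_s: int,
                   max_samples_per_send: int = 500,
                   send_ms: int = 100) -> int:
    """Largest shard count whose estimated write rate stays within the limit.

    Assumes each shard sends max_samples_per_send samples every send_ms
    milliseconds. Always returns at least 1, so for a limit below one
    shard's estimated rate the result can still exceed the limit.
    """
    per_shard_ms_budget = max_samples_per_send * 1000  # samples/s per shard, scaled by send_ms
    return max(1, limit_points_per_s * send_ms // per_shard_ms_budget)


# A few rows from the table above: specification ID -> maximum write rate.
for spec, limit in {"mlarge": 5000, "large": 10000, "3xlarge": 30000, "96xlarge": 960000}.items():
    print(spec, max_shards_for(limit))
```

For a measured send time other than 100 ms, passing it as `send_ms` gives a shard count adjusted to that latency.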

Taking a TSDB instance of the mlarge specification as an example, a complete Prometheus reference configuration is as follows:

    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          # - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
        static_configs:
        - targets: ['localhost:9090']

    # Remote write configuration (TSDB).
    remote_write:
      - url: "http://ts-xxxxxxxxxxxx.hitsdb.rds.aliyuncs.com:3242/api/prom_write"
        # Configures the queue used to write to remote storage.
        queue_config:
          # Number of samples to buffer per shard before we start dropping them.
          capacity: 10000
          # Maximum number of shards, i.e. amount of concurrency.
          max_shards: 1
          # Maximum number of samples per send.
          max_samples_per_send: 500

    # Remote read configuration (TSDB).
    remote_read:
      - url: "http://ts-xxxxxxxxxxxx.hitsdb.rds.aliyuncs.com:3242/api/prom_read"
        read_recent: true