Use Prometheus Agent automatic scaling to prevent data latency or loss - Managed Service for Prometheus

An insufficient number of Agent replicas can cause frequent restarts due to out-of-memory (OOM) errors, leading to data latency or data loss. The horizontal auto-scaling feature for Agent replicas in Managed Service for Prometheus helps prevent this issue.

Triggers and policies for Prometheus Agent automatic scaling (HPA)

After a Prometheus Agent starts, it scrapes targets to determine the number of time series, then calculates the required number of replicas based on the scraping capacity of each replica. If the Agent determines that multiple replicas are needed for data collection, the Horizontal Pod Autoscaler (HPA) automatically scales out. The policy depends on the mode in which the Agent is running:

When the Agent runs in single-replica mode: The master replica performs both target service discovery and target scraping. When the master's memory usage reaches 75%, the Agent automatically switches to multi-replica mode. However, if a single scrape job is too large, it can cause an OOM error on the master replica before the switch occurs.
When the Agent runs in multi-replica mode: The master replica only performs target service discovery, while worker replicas perform target scraping. If a worker replica's memory usage exceeds 60%, scrape jobs are reassigned. The system then calculates the required number of worker replicas and automatically scales out. This ensures that the average memory usage across all worker replicas does not exceed 60%.
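The two policies above can be sketched as a small decision routine. This is only an illustration under stated assumptions, not the Agent's actual implementation: the function names are hypothetical, memory usage is modeled as whole percentages, and the worker count is estimated by spreading the total observed load so that the average stays at or below 60%.

```python
from typing import Sequence

# Thresholds taken from the policy text above.
SINGLE_REPLICA_SWITCH_PERCENT = 75  # master memory usage that triggers multi-replica mode
WORKER_MEMORY_TARGET_PERCENT = 60   # target average memory usage per worker replica


def should_switch_to_multi_replica(master_memory_percent: int) -> bool:
    """Single-replica mode: switch once the master's memory usage reaches 75%."""
    return master_memory_percent >= SINGLE_REPLICA_SWITCH_PERCENT


def required_workers(worker_memory_percents: Sequence[int]) -> int:
    """Multi-replica mode: estimate the worker count that keeps the average at or below 60%.

    Assumption (hypothetical model): total memory load is the sum of the
    per-worker percentages and can be redistributed evenly across workers.
    """
    current = len(worker_memory_percents)
    if current == 0:
        raise ValueError("at least one worker replica is required")
    # No worker exceeds the target: nothing to reassign, keep the current count.
    if max(worker_memory_percents) <= WORKER_MEMORY_TARGET_PERCENT:
        return current
    total_load = sum(worker_memory_percents)
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-total_load // WORKER_MEMORY_TARGET_PERCENT)
    # The result never drops below the current count.
    return max(current, needed)
```

For example, two workers at 90% each carry a combined load of 180%, so this model asks for three workers. Two workers at 70% and 50% average exactly 60%, so reassigning scrape jobs is enough and the count stays at two.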
Note
The multi-factor collaborative scheduling algorithm sets the following limits for each Agent per round:
The product of the total number of targets and the total number of metrics cannot exceed 4 billion.
The memory usage limit is 70%.
The maximum number of metrics that each Agent can scrape is 4,000,000.
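To make the limits in this note concrete, the following sketch checks a proposed assignment against them. The function and parameter names are hypothetical, and how the scheduler actually measures each quantity is not described here, so treat this as a reading of the stated numbers rather than the scheduler's real logic.

```python
from typing import List

# Per-round limits quoted in the note above.
MAX_TARGETS_TIMES_METRICS = 4_000_000_000
MAX_MEMORY_PERCENT = 70
MAX_METRICS_PER_AGENT = 4_000_000


def violated_limits(
    total_targets: int,
    total_metrics: int,
    memory_percent: int,
    agent_metrics: int,
) -> List[str]:
    """Return the names of the limits that a proposed assignment would exceed."""
    violations = []
    if total_targets * total_metrics > MAX_TARGETS_TIMES_METRICS:
        violations.append("targets_times_metrics")
    if memory_percent > MAX_MEMORY_PERCENT:
        violations.append("memory_usage")
    if agent_metrics > MAX_METRICS_PER_AGENT:
        violations.append("metrics_per_agent")
    return violations
```

An empty list means the assignment stays within all three limits. Returning every violated limit, rather than stopping at the first, makes it easier to see why a scrape job was rejected.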

How to enable

Upgrading the Prometheus Helm chart to version 1.0.0 or later automatically enables Prometheus Agent HPA. For more information about how to upgrade the Helm chart, see Component Upgrade: Upgrading to Helm 1.1.17/Agent v4.0.0.

Prometheus Agent automatic scaling does not increase the number of scrape replicas indefinitely: by default, the maximum number of scrape replicas is 30. The Prometheus Agent does not automatically scale in, because scaling in can cause data loss.
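The combined effect of the replica cap and the no-scale-in rule can be illustrated with a short sketch. The function name and its parameters are hypothetical; only the default maximum of 30 and the scale-out-only behavior come from the text above.

```python
MAX_SCRAPE_REPLICAS = 30  # default maximum stated above


def next_replica_count(
    current: int,
    required: int,
    max_replicas: int = MAX_SCRAPE_REPLICAS,
) -> int:
    """Pick the next scrape replica count: scale out up to the cap, never scale in."""
    capped = min(required, max_replicas)
    # Keeping the larger of the two values means the count never decreases.
    return max(current, capped)
```

With these rules, a demand of 45 replicas from a current count of 28 yields 30, and a drop in demand from 10 to 3 leaves the count at 10.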