
Self-monitoring dashboard of the Prometheus agent

Last Updated: Aug 28, 2023

The self-monitoring dashboard of the Prometheus agent provides information such as the agent status, the time consumed to capture real-time and historical metrics, the number of metrics captured, the amount of data delivered, and resource usage. You can use the self-monitoring dashboard to monitor your cluster, discover issues, and analyze causes in a timely manner. This simplifies O&M and improves efficiency.

Prerequisites

The cluster that you want to monitor is an ACK Serverless cluster or an Alibaba Cloud Container Service for Kubernetes (ACK) managed cluster. The Helm version of your Prometheus instance is upgraded to V1.1.9 or later, and the version of the Prometheus agent is upgraded to V3.2.0 or later. For more information, see Upgrade the component version.

Procedure

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance that you want to manage. In the left-side navigation pane of the page that appears, click Dashboards.
  4. Click the Prometheus Agent dashboard to view monitoring data on the dashboard details page.

    The self-monitoring dashboard contains four sections: Overview, Agent Status, Job Service Discovery, and Targets Capture. These sections let you view monitoring data at different levels of detail.

    • Overview: provides an overview of the cluster.

    • Agent Status: displays the monitoring data of the Prometheus agent.

    • Job Service Discovery: displays the monitoring data of jobs.

    • Targets Capture: displays the monitoring data of targets.

    You can also click the name of the Prometheus instance. In the left-side navigation pane, click Settings. On the Self-Monitoring tab, click the Agent Self-monitoring tab to view the self-monitoring data of the Prometheus agent.

Self-monitoring dashboard

Generally, the Prometheus agent runs only the master replica. If you need to collect a large number of metrics, you can scale out the agent by adding worker replicas.

  1. If only the master replica is running, it discovers services and collects targets.

  2. If the master replica and multiple worker replicas are running, the master replica discovers services and then delivers the targets to the worker replicas for collection based on the specified replication strategy.

Therefore, the dashboard displays the monitoring data of targets, series, WriteARMS, and RemoteWrite collected or sent by the master replica only if no worker replica is running.

Note

WriteARMS indicates that data is written to the default backend storage of ARMS.

Overview section

The following list describes the monitoring metrics in the Overview section of the self-monitoring dashboard.

• Cluster Status (Current) (1): Shows the current health status of the cluster, which is determined by the health-related metrics.

• Agent Status (Current) (2): Shows the current status of the replicas, the number of collected targets, and the number of captured series.

• Agent Discovery Errors & Capture Errors (Current) (3): Shows the number of job service discovery errors, the number of job capture errors, and the number of target capture errors in the cluster.

• Agent Distribution Config Errors & Targets Errors (Current) (4): Shows the number of distribution configuration errors and the number of target collection errors that occur when multiple replicas are running.

• Agent Status Overview (5): Shows the number of collected targets, the number of collected series, the used memory, and the resource usage of the agent.

Agent Status section

Expand the Agent Status section to view the status data of the Prometheus agent, including changes in the number of replicas, whether the agent has been restarted, changes in the collection of targets and series, whether data delivery is continuous, and errors that occur during data delivery.

The following list describes the monitoring metrics in the Agent Status section of the self-monitoring dashboard.

• Expected & Actual Running Replicas (1): Shows the number of expected replicas and the actual number of running replicas. You can use this metric to check whether replicas are added to the agent.

• Heartbeats Reported by Replicas (2): Shows the status of each replica.

• Success Rates of Data Delivery (3): Shows the success rates of data delivery.

• Accumulated Data Delivery Errors (7): Shows the data delivery errors.

• Number of Captured Targets & Series (4), Number of Targets (5), and Number of Series (6): Show the number of collected targets and the number of captured series.

• CoreDNS Unavailability (8): Indicates that data delivery errors may occur due to CoreDNS unavailability.

• Config & Targets Distribution Errors (Multiple Replicas) (9): Shows the configuration and target distribution errors that occur when multiple replicas are running.

Job Service Discovery section

Expand the Job Service Discovery section to view the service discovery data and collection status of jobs, including all configured jobs, jobs with no targets found, jobs on which collection fails, the number of targets found for each job, the number of series captured in a single round, and the time consumed for collection.

The following list describes the monitoring metrics in the Job Service Discovery section of the self-monitoring dashboard.

• Job Service Discovery (1): Shows the total number of configured jobs, the number of jobs with no targets found, the number of jobs that capture targets and series as expected, and the number of jobs that fail to capture targets and series.

• Jobs with No Targets Found (2): Shows the jobs on which no targets are found.

• Jobs Capturing No Series in Current Round (3): Shows the jobs that do not capture series in the current round.

• Number of Targets Found in Current Round (4): Shows the number of targets found in the current round and the corresponding jobs.

• Number of Series Captured in Current Round (5): Shows the number of series captured in the current round and the corresponding jobs.

• Number of Historical Collected Targets (6): Shows the changes in target collection and the corresponding jobs.

• Number of Historical Captured Series (7): Shows the changes in series capture and the corresponding jobs.

Targets Capture section

Expand the Targets Capture section to view the monitoring data of target capture, including slow captures, the number of series captured in the current round and the time consumed, the number of historical captured series and the time consumed, changes in the number of relabelConfigs, and changes in the number of addSeries. A hedged configuration sketch that illustrates how relabeling relates to these counters follows the list below.

The following list describes the monitoring metrics in the Targets Capture section of the self-monitoring dashboard.

• Targets Capture: Shows the total number of targets, the number of captured targets, and the number of targets that fail to be captured in the cluster.

• Slow Captures (ScrapeDuration > 10s): Shows the slow captures and the corresponding targets.

• Number of Series Captured in Current Round: Shows the number of series captured in the current round.

• Number of Historical Captured Series: Shows the number of historical captured series in a time series chart.

• Time Consumed for Series Capture in Current Round: Shows the time consumed for series capture in the current round in a time series chart.

• Time Consumed for Historical Captured Series: Shows the time consumed for historical series capture in a time series chart.

• Number of scrapeSeries: Shows the changes in the number of scrapeSeries.

• Number of relabelConfigs: Shows the changes in the number of relabelConfigs.

• Number of addSeries: Shows the changes in the number of addSeries.
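If the scrapeSeries, relabelConfigs, and addSeries counters follow the standard Prometheus scrape pipeline, relabel rules are applied to every scraped sample and only the series that survive relabeling are written to storage. The following is a minimal sketch of a scrape job with a metric_relabel_configs rule that drops part of the scraped series; the job name, target address, and metric pattern are placeholder values, not values from your cluster, and your Prometheus instance may manage this configuration through ServiceMonitor or PodMonitor resources instead.

  scrape_configs:
    - job_name: example-app                    # placeholder job name
      static_configs:
        - targets: ['example-app:8080']        # placeholder target address
      metric_relabel_configs:
        # Samples dropped here are scraped but not written to storage.
        - source_labels: [__name__]
          regex: example_request_duration_seconds_bucket   # placeholder metric name
          action: drop

Under this assumption, a large gap between the scrapeSeries and addSeries counters for a target usually indicates that relabel rules are dropping many of the scraped series.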

FAQ and troubleshooting

You can use the self-monitoring dashboard of the Prometheus agent to monitor your cluster, identify issues, and analyze causes in a timely manner.

This section provides answers to frequently asked questions (FAQ) about the self-monitoring dashboard of the Prometheus agent. Before you troubleshoot an issue, make sure that the Prometheus agent is running as expected.

How do I check whether the Prometheus agent is running as expected?

If Cluster Status (Current) in the Overview section of the self-monitoring dashboard is displayed as Normal (marked as 1), the Prometheus agent is generally running as expected. However, because the agent may still be frequently restarted, also make sure that the following requirements are met:

  1. The actual number of running replicas and the number of expected replicas displayed in Agent Status (Current) are the same. The number of collected targets and the number of captured series are not 0 (marked as 2).

  2. All data in the Agent Status Overview (Current) list is in the Normal state, and no data is displayed as Abnormal in red (marked as 3).

Issue 1: What do I do if some metrics are missing or breakpoints exist?

The following list describes the possible causes of the issue and the corresponding troubleshooting methods.

Cause: Metrics cannot be collected.

Method: In the Job Service Discovery section of the self-monitoring dashboard, view the Jobs with No Targets Found metric and the Jobs Capturing No Series in Current Round metric. If no targets are collected or no series are captured, some metrics may be missing or breakpoints may exist.

Cause: Metrics are discarded.

Method:

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance. In the left-side navigation pane, click Service Discovery. On the Metrics tab, click Configure Discarded Metrics. In the dialog box that appears, check whether the expected metrics are discarded.

If the expected metrics are discarded, some metrics may be missing or breakpoints may exist.

Cause: Data delivery is unstable.

Method: In the Agent Status section of the self-monitoring dashboard, view the Success Rates of Data Delivery metric to check whether data delivery to ARMS is stable, and view the Accumulated Data Delivery Errors metric to check whether some data fails to be delivered. If data delivery to ARMS is unstable or some data fails to be delivered, some metrics may be missing or breakpoints may exist.

Cause: The agent has been restarted or evicted.

Method: View the value of the restartTotal field in the Agent Status Overview (Current) list in the Overview section of the self-monitoring dashboard. If the value is not 0, some metrics may be missing or breakpoints may exist. You can also view the Expected & Actual Running Replicas and Heartbeats Reported by Replicas metrics in the Agent Status section and select a time range to check whether the agent has been restarted. If a new podName appears and the old one disappears, the agent has been restarted. In this case, some metrics may be missing or breakpoints may exist.

Issue 2: What do I do if ServiceMonitor or PodMonitor does not take effect?

The following list describes the possible causes of the issue and the corresponding troubleshooting methods.

Cause: The configurations are invalid.

Method:

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance. In the left-side navigation pane, click Settings. On the Settings tab, click Agent Logs. In the dialog box that appears, enter |= "<sm/pm Name>" to view the agent logs.

If the configurations fail to be loaded, ServiceMonitor or PodMonitor does not take effect. In the left-side navigation pane, click Service Discovery. On the Configurations tab, view the configuration items of ServiceMonitor or PodMonitor and modify them.

Cause: No matching pods exist.

Method: Check the services and pods in the cluster against the matching conditions configured in ServiceMonitor and PodMonitor to determine whether valid matching pods exist (see the sketch after this list). If no valid matching pods exist, ServiceMonitor or PodMonitor does not take effect.

Cause: Service discovery is not enabled.

Method:

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance. In the left-side navigation pane, click Service Discovery. On the Configurations tab, click the ServiceMonitor and PodMonitor tabs to check whether the service discovery of ServiceMonitor or PodMonitor is enabled.

If the service discovery of ServiceMonitor or PodMonitor is not enabled, ServiceMonitor or PodMonitor does not take effect. Enable the service discovery to fix the issue.
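If you suspect that no matching pods exist, it can help to compare the ServiceMonitor selector with the labels on the target Service. The following is a minimal ServiceMonitor sketch; the name, namespace, labels, and port are placeholder values, not values from your cluster. The selector must match the labels of an existing Service, and the endpoint port must match a named port on that Service; otherwise, no targets are discovered.

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: example-app                # placeholder name
    namespace: default               # placeholder namespace
  spec:
    selector:
      matchLabels:
        app: example-app             # must match the labels of the target Service
    namespaceSelector:
      matchNames:
        - default                    # namespaces in which the Service is searched for
    endpoints:
      - port: metrics                # must match a named port on the Service
        path: /metrics
        interval: 30s

A PodMonitor works the same way, except that its selector and podMetricsEndpoints must match pod labels and pod ports instead of Service labels and ports.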

Issue 3: What do I do if some metrics are missing or breakpoints exist in the default Grafana dashboard?

Cause: Data fails to be delivered to the default backend storage of ARMS.

Method: In the Agent Status section of the self-monitoring dashboard, view the Success Rates of Data Delivery metric to check whether data delivery to ARMS is stable, and view the Accumulated Data Delivery Errors metric to check whether some data fails to be delivered. If data delivery to ARMS is unstable or some data fails to be delivered, some metrics may be missing or breakpoints may exist in the default Grafana dashboard.

Cause: Metrics cannot be collected.

Method: In the Job Service Discovery section of the self-monitoring dashboard, view the Jobs with No Targets Found metric and the Jobs Capturing No Series in Current Round metric. If no targets are collected or no series are captured, some metrics may be missing or breakpoints may exist in the default Grafana dashboard.

Cause: Metrics are discarded.

Method:

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance. In the left-side navigation pane, click Service Discovery. On the Metrics tab, click Configure Discarded Metrics. In the dialog box that appears, check whether the expected metrics are discarded.

If the expected metrics are discarded, some metrics may be missing or breakpoints may exist in the default Grafana dashboard.

Issue 4: What do I do if Remote Write metrics are missing or breakpoints exist in Remote Write data?

Cause: The Remote Write server is unavailable or unstable. As a result, data cannot be delivered.

Method: In the Agent Status section of the self-monitoring dashboard, view the Accumulated Data Delivery Errors metric to check whether some data fails to be delivered. If some data fails to be delivered, Remote Write metrics may be missing or breakpoints may exist in Remote Write data.

Cause: The AccessKey pair configurations or other configurations of Remote Write are invalid, or the extraction rules in write_relabel_configs do not match the expected metrics.

Method:

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance. In the left-side navigation pane, click Settings. On the Settings tab, click Edit Prometheus.yaml. In the dialog box that appears, view the configurations of Remote Write (a hedged configuration sketch follows this list).

  • If the Remote Write configuration is in an invalid format or the AccessKey pair configurations are invalid, Remote Write metrics may be missing or breakpoints may exist in Remote Write data.

  • If the extraction rules in write_relabel_configs do not match the expected metrics, Remote Write metrics may be missing or breakpoints may exist in Remote Write data.
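For reference, the following is a minimal sketch of a Prometheus remote_write configuration with a write_relabel_configs rule; the endpoint URL, credentials, and metric pattern are placeholder values, and the exact authentication fields required by your Remote Write server (for example, how an AccessKey pair is supplied) may differ.

  remote_write:
    - url: https://example-remote-write-endpoint/api/v1/write   # placeholder endpoint
      basic_auth:                                                # placeholder credentials
        username: EXAMPLE_ACCESS_KEY_ID
        password: EXAMPLE_ACCESS_KEY_SECRET
      write_relabel_configs:
        # Only series whose metric names match this regex are delivered.
        # A regex that does not match the expected metrics silently drops them.
        - source_labels: [__name__]
          regex: example_app_.*                                  # placeholder metric pattern
          action: keep

If metrics are missing from the Remote Write backend, check both that the endpoint and credentials are valid and that the keep or drop rules in write_relabel_configs actually match the metric names that you expect.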