ACK Pro managed clusters expose metrics for their control plane components and provide component dashboards. This topic describes how to use a self-managed Prometheus instance to collect metrics from control plane components. You can then use these metrics to configure alerts and integrate them with your own monitoring system.
Before you begin
Ensure that your self-managed Prometheus instance can access the API Server of the ACK Pro managed cluster and has read permissions for the /metrics path. You can deploy the self-managed Prometheus instance either inside or outside the cluster.
ACK Pro managed clusters expose metrics for control plane components, including kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and cloud-controller-manager. Review the metrics documentation for each component to understand the exposed metrics and their descriptions.
You can also use Alibaba Cloud Managed Service for Prometheus for monitoring within your cluster. Managed Service for Prometheus automatically monitors and collects data, provides real-time Grafana dashboards, and lets you create alerts delivered through channels such as email, SMS, and DingTalk.
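The read permission on the /metrics path mentioned in the first prerequisite is usually granted through Kubernetes RBAC as a rule on a non-resource URL. The following is a minimal sketch, assuming Prometheus runs inside the cluster under a ServiceAccount named prometheus in a namespace named monitoring. Those two names and the role name ack-metrics-reader are placeholders for illustration, not values taken from this topic.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ack-metrics-reader        # hypothetical name
rules:
  # Needed by kubernetes_sd_configs with role: endpoints to discover targets.
  - apiGroups: [""]
    resources: ["endpoints", "services", "pods"]
    verbs: ["get", "list", "watch"]
  # Read access to the API Server /metrics path.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ack-metrics-reader        # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ack-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus              # assumed ServiceAccount of the Prometheus pod
    namespace: monitoring         # assumed namespace
```

If your Prometheus deployment already binds a role with equivalent rules, no extra objects are needed.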
Configure the Prometheus scrape configuration
To collect metrics from the control plane components with a self-managed Prometheus instance, configure a scrape job for each component in the prometheus.yaml file. In the following example, each component corresponds to one job_name. For component-specific configuration details, refer to the respective metrics documentation.
For information about how to configure prometheus.yaml for a standard Prometheus instance, see the official Prometheus documentation.
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# One scrape job per control plane component.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: ack-api-server
    ......
  - job_name: ack-etcd
    ......
  - job_name: ack-scheduler
    ......
Configure Prometheus alert rules
To learn how to configure alert rules for open source Prometheus, see Alerting rules.
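Alert rules such as the ones recommended later in this topic are kept in a separate rule file that prometheus.yaml loads through rule_files. The following is a minimal sketch of that layout. The file path and the group name are assumptions chosen for illustration, and the AckSchedulerWarning rule is copied from the kube-scheduler section of this topic.

```yaml
# prometheus.yaml (fragment): load the rule file.
rule_files:
  - /etc/prometheus/rules/ack-control-plane.yaml   # hypothetical path
```

```yaml
# /etc/prometheus/rules/ack-control-plane.yaml (hypothetical path)
groups:
  - name: ack-control-plane        # hypothetical group name
    rules:
      - alert: AckSchedulerWarning
        expr: (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
        for: 3m
        labels:
          severity: critical
        annotations:
          message: "Scheduler is unavailable. Check the Prometheus job and target status."
```

If promtool is available, running promtool check rules against the rule file validates its syntax before Prometheus loads it.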
In-cluster monitoring
If your self-managed Prometheus is deployed within the cluster, use the following configurations to collect metrics via the API Server proxy.
Step 1: Verify API Server Architecture
Run the following command to determine your cluster's networking architecture:
kubectl get endpoints kubernetes
ENI Direct Connection: The output shows two or more IP addresses (such as 10.0.0.1:6443, 10.0.0.2:6443).
CLB Forwarding: The output shows only one IP address (the internal IP of the CLB).
Step 2: Component-specific configurations
kube-apiserver
The API server is the gateway for all metrics. Based on your architecture (ENI versus CLB), ensure you are targeting the correct endpoints.
Scrape configuration:
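- job_name: ack-api-server
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  # Authenticate with the pod's ServiceAccount token and verify the API server
  # certificate against the cluster CA, as in the other component jobs.
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
  # Keep only the https port of the kubernetes Service in the default namespace.
  - source_labels: [__meta_kubernetes_service_label_component]
    action: keep
    regex: apiserver
  - source_labels: [__meta_kubernetes_service_label_provider]
    action: keep
    regex: kubernetes
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    action: keep
    regex: https
Recommended alerting rules: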
- alert: AckApiServerWarning
  expr: (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    message: "APIServer is not available. Please check the Prometheus job and target status."
etcd
The etcd job uses the API server as a proxy to fetch metrics from the distributed key-value store.
Scrape configuration:
- job_name: ack-etcd
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  # Query parameters that ask the API server to return etcd metrics.
  params:
    hosting: ["true"]
    job: ["etcd"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    action: keep
    regex: apiserver
  - source_labels: [__meta_kubernetes_service_label_provider]
    action: keep
    regex: kubernetes
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    action: keep
    regex: https
  # Overwrites the job label with the component label value (apiserver).
  # Selectors on job, such as those in alert rules, must match this value.
  - source_labels: [__meta_kubernetes_service_label_component]
    action: replace
    target_label: job
    replacement: ${1}
Recommended alerting rules:
- alert: AckETCDLeaderMissing
  expr: sum_over_time(etcd_server_has_leader[5m]) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    message: "Etcd cluster has no leader in the last 5 minutes. Check if the cluster is overloaded."
- alert: AckETCDDown
  # bool makes the comparison return 0 or 1, so the outer == 1 also matches when two targets remain.
  expr: (absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= bool 2)) == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    message: "Etcd is unavailable. Check the Prometheus job and target status."
kube-scheduler
This job monitors the health and scheduling latency of the Kubernetes scheduler.
Scrape configuration:
- job_name: ack-scheduler
  scrape_interval: 30s
  scheme: https
  # Query parameters that ask the API server to return kube-scheduler metrics.
  params:
    hosting: ["true"]
    job: ["ack-scheduler"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: false
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    action: keep
    regex: apiserver
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    action: keep
    regex: https
  # Overwrites the job label with the component label value (apiserver).
  # Selectors on job, such as those in alert rules, must match this value.
  - source_labels: [__meta_kubernetes_service_label_component]
    action: replace
    target_label: job
    replacement: ${1}
Recommended alerting rules:
- alert: AckSchedulerWarning
  expr: (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
  annotations:
    message: "Scheduler is unavailable. Check the Prometheus job and target status."
kube-controller-manager (KCM)
Monitor the controllers that manage core Kubernetes objects like nodes and namespaces.
Scrape configuration:
- job_name: ack-kcm
  scrape_interval: 30s
  scheme: https
  # Query parameters that ask the API server to return kube-controller-manager metrics.
  params:
    hosting: ["true"]
    job: ["ack-kube-controller-manager"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    action: keep
    regex: apiserver
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    action: keep
    regex: https
  # Overwrites the job label with the component label value (apiserver).
  # Selectors on job, such as those in alert rules, must match this value.
  - source_labels: [__meta_kubernetes_service_label_component]
    action: replace
    target_label: job
    replacement: ${1}
Recommended alerting rules:
- alert: AckKCMWarning
  expr: (absent(up{job="ack-kcm",pod!=""}) or (count(up{job="ack-kcm",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
  annotations:
    message: "KCM is unavailable. Check the Prometheus job and target status."
cloud-controller-manager (CCM)
This job collects metrics from cloud-controller-manager through the API server. The recommended alert rule follows the standard form for ACK Pro clusters: the pod!="" selector restricts it to valid control plane pods, and the for: 3m clause avoids flapping (false alarms during minor network blips).
Scrape configuration:
- job_name: ack-cloud-controller-manager
  scrape_interval: 30s
  scheme: https
  # Query parameters that ask the API server to return cloud-controller-manager metrics.
  params:
    hosting: ["true"]
    job: ["ack-cloud-controller-manager"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  # Same token as before, in the authorization form used by the other jobs
  # (bearer_token_file is the deprecated spelling).
  authorization:
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    action: keep
    regex: apiserver
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    action: keep
    regex: https
Recommended alerting rules:
- alert: AckCCMWarning
  expr: (absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
  annotations:
    message: "CCM is unavailable. Check the Prometheus job and target status."
Verify the results
Log in to your Prometheus console and navigate to the Graph page.
Run the up query. Verify that up{job="ack-api-server"} and other component jobs return a value of 1.
Verify specific metrics, such as apiserver_request_total, to ensure time-series data is populating correctly.