Centrally manage and distribute alert rules for multiple clusters - Container Service for Kubernetes

Manage alert rules centrally on a Fleet instance and distribute them automatically to all associated clusters, so that every cluster uses the same rules without manual per-cluster configuration.

Prerequisites

Before you begin, make sure you have:

Fleet management enabled
Two clusters associated with the Fleet instance (a service provider cluster and a service consumer cluster)
The latest version of Alibaba Cloud CLI installed and configured

How it works

The Fleet instance acts as the central control plane for alert rules. You create an AckAlertRule custom resource (an instance of the AckAlertRule CRD) on the Fleet instance, then create a distribution rule (backed by KubeVela) to push the rule to the clusters you select. Any cluster newly associated with the Fleet instance can receive the same rules through the same distribution mechanism.

Step 1: Create a contact and contact group

Contacts and contact groups are shared across all ACK clusters under your Alibaba Cloud account.

1. Log on to the ACK console and click Clusters in the left navigation pane.
2. Click the name of any cluster. In the left pane, choose Operations > Alerts.
3. On the Alert Configuration page, click Start Installation. The console checks the prerequisites and automatically installs or upgrades the required components.
4. On the Alerts page, create a contact:
   1. Click the Alert Contacts tab, then click Create.
   2. In the Create Alert Contact panel, fill in Name, Phone Number, and Email, then click OK. The system sends an activation message or email to the contact. Activate the contact as prompted.
5. Create a contact group:
   1. Click the Alert Contact Groups tab, then click Create.
   2. In the Create Alert Contact Group panel, set Group Name, select contacts in the Contacts section, and click OK. You can add contacts to or remove contacts from the Selected Contacts column.

Step 2: Get the contact group ID

Run the following command to query your contact groups:

aliyun cs GET /alert/contact_groups

Expected output:

{
    "contact_groups": [
        {
            "ali_uid": 14783****,
            "binding_info": "{\"sls_id\":\"ack_14783****_***\",\"cms_contact_group_name\":\"ack_Default Contact Group\",\"arms_id\":\"1****\"}",
            "contacts": null,
            "created": "2021-07-21T12:18:34+08:00",
            "group_contact_ids": [
                2***
            ],
            "group_name": "Default Contact Group",
            "id": 3***,
            "updated": "2022-09-19T19:23:57+08:00"
        }
    ],
    "page_info": {
        "page_number": 1,
        "page_size": 100,
        "total_count": 1
    }
}

Map the output fields to the contactGroups parameters you will use in the alert rule:

contactGroups:
- arms_contact_group_id: "1****"                       # contact_groups.binding_info.arms_id
  cms_contact_group_name: ack_Default Contact Group    # contact_groups.binding_info.cms_contact_group_name
  id: "3***"                                           # contact_groups.id
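
The mapping above can also be scripted. The following is a minimal sketch, assuming that `jq` is installed (it is not a prerequisite of this topic) and that the CLI output has the shape shown in the expected output. The function name `contact_groups_yaml` is hypothetical.

```shell
# contact_groups_yaml: read the JSON printed by
# `aliyun cs GET /alert/contact_groups` on stdin and print one
# contactGroups entry per contact group.
# binding_info is itself a JSON document stored as a string, so it is
# decoded with jq's fromjson before its fields are read.
contact_groups_yaml() {
  jq -r '
    .contact_groups[]
    | (.binding_info | fromjson) as $b
    | "- arms_contact_group_id: \"\($b.arms_id)\"",
      "  cms_contact_group_name: \($b.cms_contact_group_name)",
      "  id: \"\(.id)\""
  '
}
```

For example, pipe the query into the function with `aliyun cs GET /alert/contact_groups | contact_groups_yaml`, then paste the printed entries under `contactGroups:` in the alert rule.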

Step 3: Create an alert rule

The AckAlertRule CRD groups all supported alert rules under a single resource. The following constraints apply:

Important

The alert rule name must be default and the namespace must be kube-system. For the full list of supported rules, see the Configure alert rules by using CRDs section in the Alert management topic.
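
Because a manifest with a different name or namespace violates this constraint, it can help to check the file before you apply it. The following is a minimal sketch, assuming a block-style manifest in which `name` and `namespace` appear directly under a top-level `metadata:` key, as in the template in this topic. The function name `check_alertrule_metadata` is hypothetical.

```shell
# check_alertrule_metadata FILE: succeed only if metadata.name is
# "default" and metadata.namespace is "kube-system".
check_alertrule_metadata() {
  awk '
    /^metadata:/ { in_meta = 1; next }   # start of the metadata block
    /^[^ ]/      { in_meta = 0 }         # any other top-level key ends it
    in_meta && $1 == "name:"      { name = $2 }
    in_meta && $1 == "namespace:" { ns = $2 }
    END { exit !(name == "default" && ns == "kube-system") }
  ' "$1"
}
```

Once the manifest is saved (this topic uses ackalertrule.yaml), a call such as `check_alertrule_metadata ackalertrule.yaml && kubectl apply -f ackalertrule.yaml` applies the file only when the check passes.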

Choose which rule groups to enable

The template includes 11 rule groups. Enable only the groups relevant to your cluster configuration:

| Rule group | What it monitors | Enable when |
| --- | --- | --- |
| `error-events` | Cluster error events (SLS-based) | Always recommended |
| `warn-events` | Cluster warning events (SLS-based) | High-noise environments |
| `cluster-core-error` | API server, etcd, Scheduler, kube-controller-manager, cloud-controller-manager, CoreDNS, Ingress health | Core component monitoring is required |
| `cluster-error` | Node failures, GPU errors, image pull failures, node pool (NLC) errors | Node-level fault detection is needed |
| `res-exceptions` | CPU, memory, disk, network, inode, SLB utilization (default threshold: 85%) | Resource saturation alerting is needed |
| `cluster-scale` | Cluster Autoscaler scale-up, scale-down, and timeout events | Autoscaling is enabled |
| `workload-exceptions` | Job failures, Deployment replica errors, DaemonSet scheduling errors | Workload health monitoring is needed |
| `pod-exceptions` | Pod OOM, pod start failures, pod crash loops | Pod-level fault detection is needed |
| `cluster-storage-err` | CSI disk errors, PersistentVolume (PV) failures | Persistent storage is used |
| `cluster-network-err` | SLB sync failures, route errors, Terway allocation errors, Ingress reload errors | Terway CNI or SLB-backed services are used |
| `security-err` | Config audit high-risk findings | Security auditing is enabled |

Apply the alert rule

Set rules.enable to enable for the rule groups you want to activate. In the example below, error-events is enabled.
Add the contactGroups block from Step 2.
Save the file as ackalertrule.yaml and apply it:
```
kubectl apply -f ackalertrule.yaml
```
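
In a template of this size, switching every rule of a group by hand is easy to get wrong. The following is a minimal sketch, assuming the indentation used by the template in this topic (group entries start with `  - name:` and rule entries with `    - enable:`). The function name `enable_group` is hypothetical.

```shell
# enable_group GROUP FILE: print FILE with every rule of GROUP switched
# from "enable: disable" to "enable: enable". Other groups are untouched.
enable_group() {
  awk -v group="$1" '
    /^  - name: / { in_group = ($3 == group) }   # a new rule group starts
    in_group && /^    - enable: disable$/ { sub(/disable$/, "enable") }
    { print }
  ' "$2"
}
```

For example, `enable_group cluster-scale ackalertrule.yaml > ackalertrule.new.yaml` writes a copy with all cluster-scale rules enabled. The sketch does not add the contactGroups block, so that step still has to be done separately.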

The following is a complete example with error-events enabled:

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
  namespace: kube-system
spec:
  groups:
  - name: error-events
    rules:
    - enable: enable
      contactGroups:
      - arms_contact_group_id: "1****"
        cms_contact_group_name: ack_Default Contact Group
        id: "3***"
      expression: sls.app.ack.error
      name: error-event
      notification:
        message: kubernetes cluster error event.
      type: event
  - name: warn-events
    rules:
    - enable: disable
      expression: sls.app.ack.warn
      name: warn-event
      notification:
        message: kubernetes cluster warn event.
      type: event
  - name: cluster-core-error
    rules:
    - enable: disable
      expression: prom.apiserver.notHealthy.down
      name: apiserver-unhealthy
      notification:
        message: "Cluster APIServer not healthy. \nPromQL: ((sum(up{job=\"apiserver\"})
          <= 0) or (absent(sum(up{job=\"apiserver\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.etcd.notHealthy.down
      name: etcd-unhealthy
      notification:
        message: "Cluster ETCD not healthy. \nPromQL: ((sum(up{job=\"etcd\"}) <= 0)
          or (absent(sum(up{job=\"etcd\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.scheduler.notHealthy.down
      name: scheduler-unhealthy
      notification:
        message: "Cluster Scheduler not healthy. \nPromQL: ((sum(up{job=\"ack-scheduler\"})
          <= 0) or (absent(sum(up{job=\"ack-scheduler\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.kcm.notHealthy.down
      name: kcm-unhealthy
      notification:
        message: "Cluster kube-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-kube-controller-manager\"})
          <= 0) or (absent(sum(up{job=\"ack-kube-controller-manager\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.ccm.notHealthy.down
      name: ccm-unhealthy
      notification:
        message: "Cluster cloud-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-cloud-controller-manager\"})
          <= 0) or (absent(sum(up{job=\"ack-cloud-controller-manager\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.coredns.notHealthy.requestdown
      name: coredns-unhealthy-requestdown
      notification:
        message: "Cluster CoreDNS not healthy, continuously request down. \nPromQL:
          (sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or
          (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0)"
      type: metric-prometheus
    - enable: disable
      expression: prom.coredns.notHealthy.panic
      name: coredns-unhealthy-panic
      notification:
        message: "Cluster CoreDNS not healthy, continuously panic. \nPromQL: sum(rate(coredns_panic_count_total{}[3m]))
          > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.ingress.request.errorRateHigh
      name: ingress-err-request
      notification:
        message: Cluster Ingress Controller request error rate high (default error
          rate is 85%).
      type: metric-prometheus
    - enable: disable
      expression: prom.ingress.ssl.expire
      name: ingress-ssl-expire
      notification:
        message: "Cluster Ingress Controller SSL will expire in a few days (default
          14 days). \nPromQL: ((nginx_ingress_controller_ssl_expire_time_seconds -
          time()) / 24 / 3600) < 14"
      type: metric-prometheus
  - name: cluster-error
    rules:
    - enable: disable
      expression: sls.app.ack.docker.hang
      name: docker-hang
      notification:
        message: kubernetes node docker hang.
      type: event
    - enable: disable
      expression: sls.app.ack.eviction
      name: eviction-event
      notification:
        message: kubernetes eviction event.
      type: event
    - enable: disable
      expression: sls.app.ack.gpu.xid_error
      name: gpu-xid-error
      notification:
        message: kubernetes gpu xid error event.
      type: event
    - enable: disable
      expression: sls.app.ack.image.pull_back_off
      name: image-pull-back-off
      notification:
        message: kubernetes image pull back off event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.down
      name: node-down
      notification:
        message: kubernetes node down event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.restart
      name: node-restart
      notification:
        message: kubernetes node restart event.
      type: event
    - enable: disable
      expression: sls.app.ack.ntp.down
      name: node-ntp-down
      notification:
        message: kubernetes node ntp down.
      type: event
    - enable: disable
      expression: sls.app.ack.node.pleg_error
      name: node-pleg-error
      notification:
        message: kubernetes node pleg error event.
      type: event
    - enable: disable
      expression: sls.app.ack.ps.hang
      name: ps-hang
      notification:
        message: kubernetes ps hang event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.fd_pressure
      name: node-fd-pressure
      notification:
        message: kubernetes node fd pressure event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.pid_pressure
      name: node-pid-pressure
      notification:
        message: kubernetes node pid pressure event.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.del_node_failed
      name: node-del-err
      notification:
        message: kubernetes delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.add_node_failed
      name: node-add-err
      notification:
        message: kubernetes add node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.run_command_fail
      name: nlc-run-cmd-err
      notification:
        message: kubernetes node pool nlc run command failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.empty_task_cmd
      name: nlc-empty-cmd
      notification:
        message: kubernetes node pool nlc delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.url_mode_unimpl
      name: nlc-url-m-unimp
      notification:
        message: kubernetes node pool nlc delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.op_not_found
      name: nlc-opt-no-found
      notification:
        message: kubernetes node pool nlc delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.destroy_node_fail
      name: nlc-des-node-err
      notification:
        message: kubernetes node pool nlc destroy node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.drain_node_fail
      name: nlc-drain-node-err
      notification:
        message: kubernetes node pool nlc drain node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.restart_ecs_wait_fail
      name: nlc-restart-ecs-wait
      notification:
        message: kubernetes node pool nlc restart ecs wait timeout.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.restart_ecs_fail
      name: nlc-restart-ecs-err
      notification:
        message: kubernetes node pool nlc restart ecs failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.reset_ecs_fail
      name: nlc-reset-ecs-err
      notification:
        message: kubernetes node pool nlc reset ecs failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.repair_fail
      name: nlc-sel-repair-err
      notification:
        message: kubernetes node pool nlc self repair failed.
      type: event
  - name: res-exceptions
    rules:
    - enable: disable
      expression: cms.host.cpu.utilization
      name: node_cpu_util_high
      notification:
        message: kubernetes cluster node cpu utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.memory.utilization
      name: node_mem_util_high
      notification:
        message: kubernetes cluster node memory utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.disk.utilization
      name: node_disk_util_high
      notification:
        message: kubernetes cluster node disk utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.public.network.utilization
      name: node_public_net_util_high
      notification:
        message: kubernetes cluster node public network utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.fs.inode.utilization
      name: node_fs_inode_util_high
      notification:
        message: kubernetes cluster node file system inode utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.qps.utilization
      name: slb_qps_util_high
      notification:
        message: kubernetes cluster slb qps utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.traffic.tx.utilization
      name: slb_traff_tx_util_high
      notification:
        message: kubernetes cluster slb traffic utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.max.connection.utilization
      name: slb_max_con_util_high
      notification:
        message: kubernetes cluster max connection utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.drop.connection
      name: slb_drop_con_high
      notification:
        message: kubernetes cluster drop connection count per second too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: count
        value: "1"
      type: metric-cms
    - enable: disable
      expression: sls.app.ack.node.disk_pressure
      name: node-disk-pressure
      notification:
        message: kubernetes node disk pressure event.
      type: event
    - enable: disable
      expression: sls.app.ack.resource.insufficient
      name: node-res-insufficient
      notification:
        message: kubernetes node resource insufficient.
      type: event
    - enable: disable
      expression: sls.app.ack.ip.not_enough
      name: node-ip-pressure
      notification:
        message: kubernetes ip not enough event.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.no_enough_disk_space
      name: disk_space_press
      notification:
        message: kubernetes csi not enough disk space.
      type: event
  - name: cluster-scale
    rules:
    - enable: disable
      expression: sls.app.ack.autoscaler.scaleup_group
      name: autoscaler-scaleup
      notification:
        message: kubernetes autoscaler scale up.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaledown
      name: autoscaler-scaledown
      notification:
        message: kubernetes autoscaler scale down.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaleup_timeout
      name: autoscaler-scaleup-timeout
      notification:
        message: kubernetes autoscaler scale up timeout.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaledown_empty
      name: autoscaler-scaledown-empty
      notification:
        message: kubernetes autoscaler scale down empty node.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaleup_group_failed
      name: autoscaler-up-group-failed
      notification:
        message: kubernetes autoscaler scale up failed.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.cluster_unhealthy
      name: autoscaler-cluster-unhealthy
      notification:
        message: kubernetes autoscaler error, cluster not healthy.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.delete_started_timeout
      name: autoscaler-del-started
      notification:
        message: kubernetes autoscaler delete node started long ago.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.delete_unregistered
      name: autoscaler-del-unregistered
      notification:
        message: kubernetes autoscaler delete unregistered node.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaledown_failed
      name: autoscaler-scale-down-failed
      notification:
        message: kubernetes autoscaler scale down failed.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.instance_expired
      name: autoscaler-instance-expired
      notification:
        message: kubernetes autoscaler scale down instance expired.
      type: event
  - name: workload-exceptions
    rules:
    - enable: disable
      expression: prom.job.failed
      name: job-failed
      notification:
        message: "Cluster Job failed. \nPromQL: kube_job_status_failed{job=\"_kube-state-metrics\"}
          > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.deployment.replicaError
      name: deployment-rep-err
      notification:
        message: "Cluster Deployment replication status error. \nPromQL: kube_deployment_spec_replicas{job=\"_kube-state-metrics\"}
          != kube_deployment_status_replicas_available{job=\"_kube-state-metrics\"}"
      type: metric-prometheus
    - enable: disable
      expression: prom.daemonset.scheduledError
      name: daemonset-status-err
      notification:
        message: "Cluster Daemonset pod status or scheduled error. \nPromQL: ((100
          - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{}
          * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{}))
          > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.daemonset.misscheduled
      name: daemonset-misscheduled
      notification:
        message: "Cluster Daemonset misscheduled. \nPromQL: kube_daemonset_status_number_misscheduled{job=\"_kube-state-metrics\"}
          \ > 0"
      type: metric-prometheus
  - name: pod-exceptions
    rules:
    - enable: disable
      expression: sls.app.ack.pod.oom
      name: pod-oom
      notification:
        message: kubernetes pod oom event.
      type: event
    - enable: disable
      expression: sls.app.ack.pod.failed
      name: pod-failed
      notification:
        message: kubernetes pod start failed event.
      type: event
    - enable: disable
      expression: prom.pod.status.notHealthy
      name: pod-status-err
      notification:
        message: 'Pod status exception. \nPromQL: min_over_time(sum by (namespace,
          pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed", job="_kube-state-metrics"})[${mins}m:1m])
          > 0'
      type: metric-prometheus
    - enable: disable
      expression: prom.pod.status.crashLooping
      name: pod-crashloop
      notification:
        message: 'Pod status exception. \nPromQL: sum_over_time(increase(kube_pod_container_status_restarts_total{job="_kube-state-metrics"}[1m])[${mins}m:1m])
          > 3'
      type: metric-prometheus
  - name: cluster-storage-err
    rules:
    - enable: disable
      expression: sls.app.ack.csi.invalid_disk_size
      name: csi_invalid_size
      notification:
        message: kubernetes csi invalid disk size.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.disk_not_portable
      name: csi_not_portable
      notification:
        message: kubernetes csi not portable.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.deivce_busy
      name: csi_device_busy
      notification:
        message: kubernetes csi disk device busy.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.no_ava_disk
      name: csi_no_ava_disk
      notification:
        message: kubernetes csi no available disk.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.disk_iohang
      name: csi_disk_iohang
      notification:
        message: kubernetes csi ioHang.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.latency_too_high
      name: csi_latency_high
      notification:
        message: kubernetes csi pvc latency load too high.
      type: event
    - enable: disable
      expression: prom.pv.failed
      name: pv-failed
      notification:
        message: 'Cluster PersistentVolume failed. \nPromQL: kube_persistentvolume_status_phase{phase=~"Failed|Pending",
          job="_kube-state-metrics"} > 0'
      type: metric-prometheus
  - name: cluster-network-err
    rules:
    - enable: disable
      expression: sls.app.ack.ccm.no_ava_slb
      name: slb-no-ava
      notification:
        message: kubernetes slb not available.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.sync_slb_failed
      name: slb-sync-err
      notification:
        message: kubernetes slb sync failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.del_slb_failed
      name: slb-del-err
      notification:
        message: kubernetes slb delete failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.create_route_failed
      name: route-create-err
      notification:
        message: kubernetes create route failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.sync_route_failed
      name: route-sync-err
      notification:
        message: kubernetes sync route failed.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.invalid_resource
      name: terway-invalid-res
      notification:
        message: kubernetes terway have invalid resource.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.alloc_ip_fail
      name: terway-alloc-ip-err
      notification:
        message: kubernetes terway allocate ip error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.parse_fail
      name: terway-parse-err
      notification:
        message: kubernetes terway parse k8s.aliyun.com/ingress-bandwidth annotation
          error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.allocate_failure
      name: terway-alloc-res-err
      notification:
        message: kubernetes parse resource error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.dispose_failure
      name: terway-dispose-err
      notification:
        message: kubernetes dispose resource error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.virtual_mode_change
      name: terway-virt-mod-err
      notification:
        message: kubernetes virtual mode changed.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.config_check
      name: terway-ip-check
      notification:
        message: kubernetes terway execute pod ip config check.
      type: event
    - enable: disable
      expression: sls.app.ack.ingress.err_reload_nginx
      name: ingress-reload-err
      notification:
        message: kubernetes ingress reload config error.
      type: event
  - name: security-err
    rules:
    - enable: disable
      expression: sls.app.ack.si.config_audit_high_risk
      name: si-c-a-risk
      notification:
        message: kubernetes high risks have been found after running config audit.
      type: event
  ruleVersion: v1.0.9

The alert rule is created on the Fleet instance but does not take effect on any cluster until you create a distribution rule in Step 4.

Step 4: Distribute the alert rule to clusters

Distribution rules use KubeVela to push Kubernetes resources from the Fleet instance to associated clusters. For more information about application distribution, see Application distribution.

Choose a distribution method based on how you want to target clusters:

| Method | Use when |
| --- | --- |
| By label | You want to target a dynamic set of clusters (for example, all production clusters) |
| By cluster ID | You want to target specific, fixed clusters |

Method 1: Distribute by label

Query associated cluster IDs and add a label to the clusters you want to target:

kubectl get managedclusters
kubectl label managedclusters <clusterid> production=true

Create ackalertrule-app.yaml with the following content:

apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1
spec:
  components:
    - name: alertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules
            name: default
  policies:
    - type: topology
      name: prod-clusters
      properties:
        clusterSelector:
          production: "true"  # Selects clusters with this label

Method 2: Distribute by cluster ID

Create ackalertrule-app.yaml with the target cluster IDs:

apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1
spec:
  components:
    - name: alertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules
            name: default
  policies:
    - type: topology
      name: prod-clusters
      properties:
        clusters: ["<clusterid1>", "<clusterid2>"]  # Replace with actual cluster IDs

Apply and verify the distribution rule

Apply the distribution rule:
```
kubectl apply -f ackalertrule-app.yaml
```

Check the distribution status:

kubectl amc appstatus alertrules -n kube-system --tree --detail

If the distribution succeeds, the output shows updated for each cluster:

CLUSTER                  NAMESPACE       RESOURCE             STATUS    APPLY_TIME          DETAIL
c565e4**** (cluster1)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **
cbaa12**** (cluster2)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **

If a cluster shows a status other than updated, verify that the cluster is still associated with the Fleet instance and that the AckAlertRule resource was created successfully on the Fleet instance. For alert management details, see Alert management.
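
With many clusters, scanning the status table by eye does not scale. The following is a minimal sketch, assuming the output keeps the layout shown above, with the cluster ID as the first field and the status directly after `AckAlertRule/default`. The function name `not_updated_clusters` is hypothetical.

```shell
# not_updated_clusters: read the appstatus table on stdin and print the
# cluster ID of every AckAlertRule/default row whose status is not "updated".
not_updated_clusters() {
  awk '
    /AckAlertRule\/default/ &&
    $0 !~ /AckAlertRule\/default +updated( |$)/ { print $1 }
  '
}
```

For example, run `kubectl amc appstatus alertrules -n kube-system --tree --detail | not_updated_clusters`. An empty result means that no listed row is out of date. It does not prove that every expected cluster appears in the table.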

Update alert rules

To change an alert rule after distribution:

Edit ackalertrule.yaml and apply the changes:
```
kubectl apply -f ackalertrule.yaml
```
Increment the app.oam.dev/publishVersion annotation value in ackalertrule-app.yaml (for example, change version1 to version2), then apply:
```
kubectl apply -f ackalertrule-app.yaml
```
Updating the annotation triggers KubeVela to re-distribute the modified rule to all targeted clusters.
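
The version bump can be scripted so that it is not forgotten. The following is a minimal sketch, assuming the annotation value has the form `versionN` with a numeric suffix, as in the examples in this topic. The function name `bump_publish_version` is hypothetical.

```shell
# bump_publish_version FILE: print FILE with the numeric suffix of the
# app.oam.dev/publishVersion annotation incremented by one.
bump_publish_version() {
  file="$1"
  # Extract the current number N from "publishVersion: versionN".
  current=$(sed -n 's/^ *app\.oam\.dev\/publishVersion: *version\([0-9][0-9]*\) *$/\1/p' "$file" | head -n 1)
  [ -n "$current" ] || { echo "no publishVersion of the form versionN found" >&2; return 1; }
  next=$((current + 1))
  # Rewrite only the annotation line and leave everything else unchanged.
  sed "s/^\( *app\.oam\.dev\/publishVersion: *\)version${current} *\$/\1version${next}/" "$file"
}
```

For example, `bump_publish_version ackalertrule-app.yaml > ackalertrule-app.new.yaml` writes a copy with version1 changed to version2. Review the copy, replace the original file with it, and then apply the file as described above.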

What's next

Alert management: configure alert rules directly on individual clusters
Configure alert rules by using CRDs: full reference for AckAlertRule fields and supported expressions
Application distribution: distribute other Kubernetes resources across clusters using the same mechanism