By default, alert rules propagated from a Fleet instance are identical across all associated clusters. When different clusters need different alert configurations—such as separate thresholds, contact groups, or GPU-specific alerts—use override policies to deliver cluster-specific alert rules.
Prerequisites
Before you begin, ensure that you have:
- The Fleet management feature enabled
- Two clusters associated with the Fleet instance: the service provider cluster and the service consumer cluster
How it works
ACK uses KubeVela to define and propagate override policies at the Fleet level, following the same model as application differentiated configuration. Define a baseline set of alert rules on the Fleet instance, then apply override policies for specific clusters.
The following figure shows how alert rules are differentiated across clusters. An override policy is created on the Fleet instance. The cluster-specific configuration is delivered to ACK Cluster 2, while ACK Cluster 1 keeps the original alert rules.
YAML resource types
The ackalertrule-app-override.yaml file in this topic defines four types of KubeVela resources:
| Resource type | Kind | Purpose |
|---|---|---|
| Topology policy | `Policy` (type: `topology`) | Specifies which clusters receive the alert rules |
| Override policy | `Policy` (type: `override`) | Defines the cluster-specific configuration changes |
| Workflow | `Workflow` | Orchestrates the deployment steps: deploys baseline rules to some clusters and overridden rules to others |
| Application | `Application` | Ties all components, policies, and the workflow together |
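In skeleton form (names taken from the example file later in this topic, with most details elided), the four resources reference one another like this:

```yaml
apiVersion: core.oam.dev/v1alpha1
kind: Policy                     # Topology policy: selects target clusters.
metadata: {name: cluster-gpu, namespace: kube-system}
type: topology
---
apiVersion: core.oam.dev/v1alpha1
kind: Policy                     # Override policy: patches components by name.
metadata: {name: override-gpu, namespace: kube-system}
type: override
---
apiVersion: core.oam.dev/v1alpha1
kind: Workflow                   # Pairs an override policy with a topology policy per step.
metadata: {name: deploy-ackalertrules, namespace: kube-system}
steps:
  - type: deploy
    properties:
      policies: ["override-gpu", "cluster-gpu"]  # Apply the override, then deliver to the selected clusters.
---
apiVersion: core.oam.dev/v1beta1
kind: Application                # Top-level resource: components plus workflow.
metadata: {name: alertrules, namespace: kube-system}
spec:
  workflow:
    ref: deploy-ackalertrules    # References the Workflow above by name.
```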
Step 1: Create a contact and a contact group
Step 2: Propagate differentiated alert rules
- Get the IDs of the clusters to which you want to propagate the alert rules:

  ```shell
  kubectl get managedcluster
  ```

  Expected output:

  ```
  NAME         HUB ACCEPTED   MANAGED CLUSTER URLS   JOINED   AVAILABLE   AGE
  c565e4****   true                                  True     True        12d
  cbaa12****   true                                  True     True        12d
  ```

  Note: To select clusters by label instead of by ID, see Select a cluster to distribute applications.

- Create a file named `ackalertrule-app-override.yaml` with the following content. In this example, `ack-cluster-1` is a CPU-accelerated cluster and `ack-cluster-2` is a GPU-accelerated cluster. The override policy targets `ack-cluster-2`, enabling GPU alerting, modifying the alert threshold, and changing the contact group.

  ```yaml
  apiVersion: core.oam.dev/v1alpha1 # Topology policy: routes baseline alert rules to ack-cluster-1.
  kind: Policy
  metadata:
    name: cluster-cpu
    namespace: kube-system
  type: topology
  properties:
    clusters: ["<ack-cluster-1>"] # Replace <ack-cluster-1> with the cluster ID of ack-cluster-1.
  ---
  apiVersion: core.oam.dev/v1alpha1 # Topology policy: routes overridden alert rules to ack-cluster-2.
  kind: Policy
  metadata:
    name: cluster-gpu
    namespace: kube-system
  type: topology
  properties:
    clusters: ["<ack-cluster-2>"] # Replace <ack-cluster-2> with the cluster ID of ack-cluster-2.
  ---
  apiVersion: core.oam.dev/v1alpha1 # Override policy: defines the cluster-specific changes for ack-cluster-2.
  kind: Policy
  metadata:
    name: override-gpu
    namespace: kube-system
  type: override
  properties:
    components:
      - name: ackalertrules # Must match the component name in the Application.
        traits:
          - type: alert-rule # The alert-rule trait modifies alert rules on the target cluster.
            properties:
              groups: # Override configurations. The structure mirrors that of the alert rules.
                - name: res-exceptions # Name of the alert group to override.
                  rules:
                    - contactGroups: # Override the contact group.
                        - arms_contact_group_id: "12345"
                          cms_contact_group_name: ack_Default Contact Group
                          id: "1234"
                      enable: enable # Set to enable.
                      name: node_cpu_util_high # Name of the alert rule to override.
                      thresholds: # Override the alert threshold.
                        - key: CMS_ESCALATIONS_CRITICAL_Threshold
                          unit: percent
                          value: "60"
                - name: cluster-error # Name of the alert group to override.
                  rules:
                    - enable: enable # Set to enable.
                      name: gpu-xid-error # Name of the alert rule to override.
  ---
  apiVersion: core.oam.dev/v1alpha1 # Workflow: defines the deployment steps.
  kind: Workflow
  metadata:
    name: deploy-ackalertrules
    namespace: kube-system
  steps:
    - type: deploy
      name: deploy-cpu
      properties:
        policies: ["cluster-cpu"] # Deploy baseline alert rules to ack-cluster-1.
    - type: deploy
      name: deploy-gpu
      properties:
        policies: ["override-gpu", "cluster-gpu"] # Apply the override policy and deploy to ack-cluster-2.
  ---
  apiVersion: core.oam.dev/v1beta1 # Application: the top-level KubeVela resource.
  kind: Application
  metadata:
    name: alertrules
    namespace: kube-system
    annotations:
      app.oam.dev/publishVersion: version1 # Increment this value each time you update the alert rules to trigger re-propagation.
  spec:
    components:
      - name: ackalertrules
        type: ref-objects
        properties:
          objects:
            - resource: ackalertrules # References the alert rules created in Step 3.
              name: default
    workflow:
      ref: deploy-ackalertrules
  ```

- Apply the override policy:

  ```shell
  kubectl apply -f ackalertrule-app-override.yaml
  ```

- Check the propagation status:

  ```shell
  kubectl amc appstatus alertrules -n kube-system --tree --detail
  ```

  Expected output:

  ```
  CLUSTER                      NAMESPACE        RESOURCE                  STATUS   APPLY_TIME            DETAIL
  c565e4**** (ack-cluster-1)   ─── kube-system  ─── AckAlertRule/default  updated  2022-**-** **:**:**   Age: **
  cbaa12**** (ack-cluster-2)   ─── kube-system  ─── AckAlertRule/default  updated  2022-**-** **:**:**   Age: **
  ```

  Both clusters show `updated`, confirming that the override policy was applied and the differentiated alert rules were propagated successfully.
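To roll out changes later, note the publishVersion annotation on the Application: increment it each time you update the alert rules or override policy to trigger re-propagation. A minimal update, assuming the file from this topic:

```yaml
# In ackalertrule-app-override.yaml, after editing the rules or overrides,
# change the annotation to any new value:
metadata:
  annotations:
    app.oam.dev/publishVersion: version2 # was version1
```

Then run `kubectl apply -f ackalertrule-app-override.yaml` again and re-check the propagation status with `kubectl amc appstatus`.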