How to configure differentiated alert rules across multiple clusters - Container Service for Kubernetes

The multi-cluster alert management feature allows you to centrally create or modify alert rules on a Fleet instance. By default, the propagated alert rules are identical across all associated clusters. If clusters need distinct alert rules, create override policies so that each cluster can use its own alert configuration.

Prerequisites

The Fleet management feature is enabled.
Two clusters are associated with the Fleet instance (the service provider cluster and service consumer cluster).

Background Information

Differentiated alert rule configuration works on the same principle as differentiated application configuration: the open-source KubeVela project is used to define and propagate override policies on a Fleet instance. You define unified alert rules on the Fleet instance and create override policies for the clusters that need differentiated configurations, such as enabling GPU-related alerts, setting different alert thresholds, or configuring different contacts. The alert rules with the override policies applied are then propagated to the targeted associated clusters.

The following figure shows how alerting configurations are differentiated for clusters. An override policy is created on the Fleet instance. The differentiated configurations with the override policy applied are delivered to ACK Cluster 2, while ACK Cluster 1 retains the original alert configurations.

Step 1: Create a contact and a contact group

Step 2: Propagate differentiated alert rules

Alert rule differentiation is implemented through KubeVela, where override policies are defined and propagated at the Fleet level.

Run the following command to query the IDs of the clusters to which you want to propagate the alert rules:
```
kubectl get managedcluster 
```
Expected output:
```
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS   JOINED   AVAILABLE   AGE
c565e4****      true                                  True     True        12d
cbaa12****      true                                  True     True        12d
```
Note
You can also select clusters by specifying cluster labels. For more information, see the Method 2: Specify a label in the cluster selector section of the "Select a cluster to distribute applications" topic.
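
The cluster IDs appear in the NAME column of the output above. As a minimal sketch for scripting, the helper below prints that column from the table form of the output. The function name `extract_cluster_ids` is hypothetical and not part of kubectl or the Fleet tooling, and the sketch assumes the table layout matches the expected output shown.

```shell
# Hypothetical helper: read the table printed by `kubectl get managedcluster`
# from stdin and print the first column (the cluster ID) of each data row.
extract_cluster_ids() {
  # NR > 1 skips the header row; NF skips blank lines.
  awk 'NR > 1 && NF { print $1 }'
}
```

For example, `kubectl get managedcluster | extract_cluster_ids` would print one cluster ID per line.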

Create a file named ackalertrule-app-override.yaml with the following content to define the configurations to override.

In this example, ack-cluster-1 is a CPU cluster and ack-cluster-2 is a GPU-accelerated cluster. The override policy applies only to ack-cluster-2: it enables GPU alerting, modifies the alert threshold, and changes the contacts.

```
apiVersion: core.oam.dev/v1alpha1  # Specify the cluster to which the alert rules are propagated by cluster ID.
kind: Policy
metadata:
  name: cluster-cpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-1>"] # Replace <ack-cluster-1> with the cluster ID of ack-cluster-1.
---
apiVersion: core.oam.dev/v1alpha1 # Specify the cluster to which the alert rules are propagated by cluster ID.
kind: Policy
metadata:
  name: cluster-gpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-2>"] # Replace <ack-cluster-2> with the cluster ID of ack-cluster-2.
---
apiVersion: core.oam.dev/v1alpha1 # Define an override policy.
kind: Policy
metadata:
  name: override-gpu
  namespace: kube-system
type: override
properties:
  components:
  - name: ackalertrules  # The component name in the associated application.
    traits:
    - type: alert-rule   # The alert-rule trait modifies the alert rules.
      properties:
        groups:           # The override configurations, whose structure is the same as that of the alert rules. You can define multiple groups and alert rules to override.
        - name: res-exceptions      # Specify the name of the alert group to override.
          rules:
          - contactGroups:           # Override the contact group.
            - arms_contact_group_id: "12345"
              cms_contact_group_name: ack_Default Contact Group
              id: "1234"
            enable: enable           # Change the value to enable.
            name: node_cpu_util_high # Specify the name of the alert rule to override.
            thresholds:              # Modify the threshold.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: "60"
        - name: cluster-error    # Specify the name of the alert group to override.
          rules:
          - enable: enable       # Change the value to enable.
            name: gpu-xid-error  # Specify the name of the alert rule to override.
---
apiVersion: core.oam.dev/v1alpha1  # Define a KubeVela workflow.
kind: Workflow
metadata:
  name: deploy-ackalertrules
  namespace: kube-system
steps:
  - type: deploy
    name: deploy-cpu
    properties:
      policies: ["cluster-cpu"]   # Deploy the unmodified alert rules to the cluster selected by cluster-cpu.
  - type: deploy
    name: deploy-gpu
    properties:
      policies: ["override-gpu", "cluster-gpu"]  # Apply the override policy to the alert rules deployed to the cluster selected by cluster-gpu.
---
apiVersion: core.oam.dev/v1beta1   # Define a KubeVela application.
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1  # To repropagate the alert rules after resources are updated, change the value of publishVersion.
spec:
  components:
    - name: ackalertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules    # Reference the alert rules created in Step 3.
            name: default
  workflow:
    ref: deploy-ackalertrules  # Use the propagation rules defined in the workflow to propagate the alert rules.
```
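
The two topology policies in the file carry the placeholders `<ack-cluster-1>` and `<ack-cluster-2>`, which must be replaced with real cluster IDs before the file is applied. As a rough sketch of automating that substitution, the helper below rewrites only the quoted placeholders in the `clusters` lists and leaves the comments untouched. The function name `fill_cluster_ids` and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: fill the cluster-ID placeholders in the override file.
# $1: path to the YAML file
# $2: cluster ID to use for ack-cluster-1
# $3: cluster ID to use for ack-cluster-2
# The result is written to stdout; the input file is not modified.
fill_cluster_ids() {
  # Match the placeholders together with their surrounding double quotes so
  # that the unquoted mentions inside the YAML comments stay as they are.
  sed -e "s/\"<ack-cluster-1>\"/\"$2\"/" \
      -e "s/\"<ack-cluster-2>\"/\"$3\"/" "$1"
}
```

The output can be redirected to a new file for review, or piped to `kubectl apply -f -` once the IDs have been checked.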

Run the following command to apply the override policy and override the alert rules:
```
kubectl apply -f ackalertrule-app-override.yaml
```
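
The comment on the `app.oam.dev/publishVersion` annotation in the Application manifest states that its value must change for the alert rules to be repropagated after an update. A minimal sketch of bumping that value in the file before reapplying it follows. The function name `set_publish_version` is hypothetical, and the sketch assumes the annotation sits on a single line, as it does in the example file.

```shell
# Hypothetical helper: print the manifest with a new publishVersion value.
# $1: path to the YAML file
# $2: new version string, for example version2
# The result is written to stdout; the input file is not modified.
set_publish_version() {
  # Replace only the value after the key and keep any trailing comment.
  sed -e "s|\(app\.oam\.dev/publishVersion:[[:space:]]*\)[^[:space:]#]*|\1$2|" "$1"
}
```

For example, `set_publish_version ackalertrule-app-override.yaml version2 | kubectl apply -f -` would reapply the manifest with the new version value.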

Run the following command to view the propagation progress of the alert rules:

```
kubectl amc appstatus alertrules -n kube-system --tree --detail
```

Expected output:

```
CLUSTER                       NAMESPACE       RESOURCE             STATUS    APPLY_TIME          DETAIL
c565e4**** (ack-cluster-1)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **
cbaa12**** (ack-cluster-2)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **
```
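
In the expected output, both clusters report the `updated` status. As a small sketch for use in scripts, the filter below succeeds only if every data row of that output contains `updated`. The function name `all_updated` is hypothetical, and the check assumes the output layout matches the sample above. Other status values that the command may print are not covered here.

```shell
# Hypothetical helper: read the appstatus table from stdin and return success
# only if every non-empty row after the header contains the word "updated".
all_updated() {
  awk 'NR > 1 && NF && $0 !~ / updated / { bad = 1 } END { exit bad }'
}
```

It could be combined with the status command, for example `kubectl amc appstatus alertrules -n kube-system --tree --detail | all_updated && echo "propagation complete"`.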