The multi-cluster alert management feature allows you to centrally create or modify alert rules on a Fleet instance. However, the propagated alert rules are identical across all associated clusters. When distinct alert rules are required per cluster, override the alert rules to allow different clusters to use different alert configurations.
Prerequisites
The Fleet management feature is enabled.
Two clusters are associated with the Fleet instance (the service provider cluster and service consumer cluster).
Background Information
Differentiated alert rule configuration works the same way as differentiated application configuration: it uses open-source KubeVela to define and propagate override policies on a Fleet instance. You define unified alert rules on the Fleet instance and then create override policies that adjust the configurations for specific clusters. Examples include enabling GPU-related alerts, setting different alert thresholds, and configuring different contacts. The alert rules with the override policies applied are then propagated to the targeted associated clusters.
The following figure shows how alerting configurations are differentiated for clusters. An override policy is created on the Fleet instance. The alert configurations with the override policy applied are delivered to ACK Cluster 2, while ACK Cluster 1 retains the original alert configurations.
Step 1: Create a contact and a contact group
Step 2: Propagate differentiated alert rules
Alert rule differentiation is implemented through KubeVela, where override policies are defined and propagated at the Fleet level.
Run the following command to query the IDs of the clusters to which you want to propagate the alert rules:
kubectl get managedcluster
Expected output:
NAME         HUB ACCEPTED   MANAGED CLUSTER URLS   JOINED   AVAILABLE   AGE
c565e4****   true                                  True     True        12d
cbaa12****   true                                  True     True        12d
Note
You can also select clusters by specifying cluster labels. For more information, see the Method 2: Specify a label in the cluster selector section of the "Select a cluster to distribute applications" topic.
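If you want to capture the cluster IDs for scripting, the NAME column of the output above can be extracted with a small filter. The following is a minimal sketch; the function name cluster_ids is hypothetical, and it assumes that the first column of every non-header row is the cluster ID, as in the expected output.

```shell
# Hypothetical helper: read `kubectl get managedcluster` output on stdin
# and print only the first column (the cluster ID) of each data row.
cluster_ids() {
  # NR > 1 skips the header row; NF > 0 skips blank lines.
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Usage (requires kubectl access to the Fleet instance):
#   kubectl get managedcluster | cluster_ids
```

The printed IDs are the values that replace the <ack-cluster-1> and <ack-cluster-2> placeholders in the topology policies.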
Create a file named ackalertrule-app-override.yaml based on the following content to define the configurations to override:
In this example, ack-cluster-1 is a CPU-accelerated cluster and ack-cluster-2 is a GPU-accelerated cluster. This example shows how to override the alert rules of ack-cluster-2. The override policy enables GPU alerting, modifies the alert thresholds, and changes the contacts.
apiVersion: core.oam.dev/v1alpha1 # Specify the cluster to which the alert rules are propagated by cluster ID.
kind: Policy
metadata:
  name: cluster-cpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-1>"] # Replace <ack-cluster-1> with the cluster ID of ack-cluster-1.
---
apiVersion: core.oam.dev/v1alpha1 # Specify the cluster to which the alert rules are propagated by cluster ID.
kind: Policy
metadata:
  name: cluster-gpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-2>"] # Replace <ack-cluster-2> with the cluster ID of ack-cluster-2.
---
apiVersion: core.oam.dev/v1alpha1 # Define an override policy.
kind: Policy
metadata:
  name: override-gpu
  namespace: kube-system
type: override
properties:
  components:
    - name: ackalertrules # The component name in the associated application.
      traits:
        - type: alert-rule # The alert-rule trait is used to modify the alert rules.
          properties:
            groups: # The override configurations, whose structure is the same as that of the alert rules. You can define multiple groups and alert rules to be overridden.
              - name: res-exceptions # Specify the name of the alert group to be overridden.
                rules:
                  - contactGroups: # Override the contact group.
                      - arms_contact_group_id: "12345"
                        cms_contact_group_name: ack_Default Contact Group
                        id: "1234"
                    enable: enable # Change the value to enable.
                    name: node_cpu_util_high # Specify the name of the alert rule to be overridden.
                    thresholds: # Modify the threshold.
                      - key: CMS_ESCALATIONS_CRITICAL_Threshold
                        unit: percent
                        value: "60"
              - name: cluster-error # Specify the name of the alert group to be overridden.
                rules:
                  - enable: enable # Change the value to enable.
                    name: gpu-xid-error # Specify the name of the alert rule to be overridden.
---
apiVersion: core.oam.dev/v1alpha1 # Define a KubeVela workflow.
kind: Workflow
metadata:
  name: deploy-ackalertrules
  namespace: kube-system
steps:
  - type: deploy
    name: deploy-cpu
    properties:
      policies: ["cluster-cpu"] # Deploy the alert rules to cluster-cpu.
  - type: deploy
    name: deploy-gpu
    properties:
      policies: ["override-gpu", "cluster-gpu"] # Apply the override policy to override the alert rules of cluster-gpu.
---
apiVersion: core.oam.dev/v1beta1 # Define a KubeVela application.
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1 # To repropagate the alert rules after resources are updated, change the value of publishVersion.
spec:
  components:
    - name: ackalertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules # Reference the alert rules created in Step 3.
            name: default
  workflow:
    ref: deploy-ackalertrules # Use the propagation rules defined in the workflow to propagate the alert rules.
Run the following command to apply the override policy and override the alert rules:
kubectl apply -f ackalertrule-app-override.yaml
Run the following command to view the propagation progress of the alert rules:
kubectl amc appstatus alertrules -n kube-system --tree --detail
Expected output:
CLUSTER                        NAMESPACE       RESOURCE              STATUS   APPLY_TIME           DETAIL
c565e4**** (ack-cluster-1)───  kube-system───  AckAlertRule/default  updated  2022-**-** **:**:**  Age: **
cbaa12**** (ack-cluster-2)───  kube-system───  AckAlertRule/default  updated  2022-**-** **:**:**  Age: **
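The publishVersion annotation in the Application must change before updated alert rules are propagated again. The following sketch shows one way to script that change before re-applying the file. The function name set_publish_version and the value version2 are hypothetical, and the sketch assumes the annotation sits on a single line, as in the manifest in this topic. Any trailing comment on that line is dropped.

```shell
# Hypothetical helper: set the value of the app.oam.dev/publishVersion
# annotation in a manifest file. Usage: set_publish_version FILE VERSION
set_publish_version() {
  file="$1"
  version="$2"
  tmp="$(mktemp)" || return 1
  # Replace everything after the annotation key on its line with the new
  # version; the leading indentation before the key is left untouched.
  if sed "s|\(app\.oam\.dev/publishVersion:\).*|\1 ${version}|" "$file" > "$tmp"; then
    # Write back through redirection so the original file permissions are kept.
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Usage (the kubectl step requires access to the Fleet instance):
#   set_publish_version ackalertrule-app-override.yaml version2
#   kubectl apply -f ackalertrule-app-override.yaml
```

After re-applying, the same kubectl amc appstatus command can be used to confirm that both clusters report the updated status.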