The multi-cluster alert management feature allows you to create or modify alert rules on a Fleet instance. By default, the Fleet instance propagates the same alert rules to every cluster that is associated with it. However, your clusters may need different alert rules to meet business requirements. This topic describes how to override alerting configurations so that different clusters use different alert rules.
Prerequisites
The Fleet management feature is enabled. For more information, see Enable Fleet management.
Multiple clusters are associated with the Fleet instance. For more information, see Associate clusters with a Fleet instance.
Components required for alert management are installed in the clusters that you want to manage. For more information, see Install and update the components.
Background information
The multi-cluster management feature allows you to create KubeVela override policies on a Fleet instance to override alerting configurations or application configurations. You can create alert rules on a Fleet instance and then create an override policy to override the alert rules of specific clusters. For example, you can create an override policy to enable GPU alerting, set different alert thresholds, and specify different contacts. After you complete the configuration, you can use the Fleet instance to propagate the alert rules to the associated clusters and then apply the override policy.
The following figure shows how alerting configurations are overridden for specific clusters. An override policy is created on the Fleet instance and applied to ACK Cluster 2 to override its alerting configurations. ACK Cluster 1 still uses the original alerting configurations.
Step 1: Create a contact and a contact group
Create a contact and a contact group. For more information, see Step 1: Create a contact and a contact group.
Step 2: Obtain the contact group ID
Obtain the contact group ID. For more information, see Step 2: Obtain the contact group ID.
Step 3: Create alert rules
Create alert rules. For more information, see Step 3: Create an alert rule.
Step 4: Create an override policy and apply the policy to override the alert rules
Use KubeVela to create an override policy on the Fleet instance, and then apply the policy from the Fleet instance to override the alert rules. Perform the following steps.
Run the following command to query the IDs of the clusters to which you want to propagate the alert rules:
kubectl get managedcluster
Expected output:
NAME         HUB ACCEPTED   MANAGED CLUSTER URLS   JOINED   AVAILABLE   AGE
c565e4****   true                                  True     True        12d
cbaa12****   true                                  True     True        12d
Note: You can also select clusters by specifying cluster labels. For more information, see Method 2: Specify a label as the cluster ID.
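The cluster IDs are the values in the NAME column of the output. As a minimal sketch, the following shell snippet extracts that column from the tabular output. The helper name list_cluster_ids is hypothetical, and the sample text is copied from the expected output above so that the snippet runs without a live Fleet instance.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get managedcluster` table output on stdin
# and print only the NAME column (the cluster IDs), skipping the header row.
list_cluster_ids() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Sample input copied from the expected output in this topic.
sample_output='NAME         HUB ACCEPTED   MANAGED CLUSTER URLS   JOINED   AVAILABLE   AGE
c565e4****   true                                  True     True        12d
cbaa12****   true                                  True     True        12d'

cluster_ids=$(printf '%s\n' "$sample_output" | list_cluster_ids)
printf '%s\n' "$cluster_ids"
```

Against a live Fleet instance, the same helper could be used as: kubectl get managedcluster | list_cluster_ids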
Create a file named ackalertrule-app-override.yaml with the content shown below to define the configurations to override.
In this example, ack-cluster-1 is a CPU-accelerated cluster and ack-cluster-2 is a GPU-accelerated cluster. This example shows how to override the alert rules of ack-cluster-2. The override policy enables GPU alerting, modifies the alert thresholds, and changes the contacts.
apiVersion: core.oam.dev/v1alpha1 # Specify the cluster to which the alert rules are propagated by cluster ID.
kind: Policy
metadata:
  name: cluster-cpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-1>"] # Replace <ack-cluster-1> with the cluster ID of ack cluster 1.
---
apiVersion: core.oam.dev/v1alpha1 # Specify the cluster to which the alert rules are propagated by cluster ID.
kind: Policy
metadata:
  name: cluster-gpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-2>"] # Replace <ack-cluster-2> with the cluster ID of ack cluster 2.
---
apiVersion: core.oam.dev/v1alpha1 # Define an override policy.
kind: Policy
metadata:
  name: override-gpu
  namespace: kube-system
type: override
properties:
  components:
  - name: ackalertrules # The component name in the associated application.
    traits:
    - type: alert-rule # alert-rule trait is used to modify the alert rules.
      properties:
        groups: # The configurations to override, whose structure is the same as that of the alert rules. You can define multiple groups and alert rules to override.
        - name: res-exceptions # Specify the alert group to override.
          rules:
          - contactGroups: # Override the contact group.
            - arms_contact_group_id: "12345"
              cms_contact_group_name: ack_Default Contact Group
              id: "1234"
            enable: enable # Change the value to enable.
            name: node_cpu_util_high # Specify the name of the alert to override.
            thresholds: # Modify the threshold.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: "60"
        - name: cluster-error # Specify the alert group to override.
          rules:
          - enable: enable # Change the value to enable.
            name: gpu-xid-error # Specify the name of the alert to override.
---
apiVersion: core.oam.dev/v1alpha1 # Define a KubeVela workflow.
kind: Workflow
metadata:
  name: deploy-ackalertrules
  namespace: kube-system
steps:
  - type: deploy
    name: deploy-cpu
    properties:
      policies: ["cluster-cpu"] # Deploy the alert rules to cluster-cpu.
  - type: deploy
    name: deploy-gpu
    properties:
      policies: ["override-gpu", "cluster-gpu"] # Apply the override policy to override the alert rules of cluster-gpu.
---
apiVersion: core.oam.dev/v1beta1 # Define a KubeVela application.
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1 # Repropagate the alert rules when resources are updated. The value of publishVersion must be modified.
spec:
  components:
    - name: ackalertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules # Reference the alert rules created in Step 3.
            name: default
  workflow:
    ref: deploy-ackalertrules # Use the propagate rules defined in the workflow to propagate the alert rules.
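The <ack-cluster-1> and <ack-cluster-2> placeholders in the manifest must be replaced with real cluster IDs before you apply the file. One way to do this, sketched below, is a small sed wrapper. The function name fill_cluster_ids and the IDs c111example and c222example are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a manifest on stdin and replace the two
# placeholders with the cluster IDs passed as $1 and $2.
# Assumes the IDs contain no '/', '&', or '\' characters, which are
# special in a sed replacement.
fill_cluster_ids() {
  sed -e "s/<ack-cluster-1>/$1/g" -e "s/<ack-cluster-2>/$2/g"
}

# Two sample lines in the same shape as the topology policies.
sample='  clusters: ["<ack-cluster-1>"]
  clusters: ["<ack-cluster-2>"]'

filled=$(printf '%s\n' "$sample" | fill_cluster_ids c111example c222example)
printf '%s\n' "$filled"
```

For the full file, the helper could be used as: fill_cluster_ids ID1 ID2 < ackalertrule-app-override.yaml > filled.yaml, where ID1 and ID2 are the IDs returned by kubectl get managedcluster.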
Run the following command to apply the override policy and override the alert rules:
kubectl apply -f ackalertrule-app-override.yaml
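According to the comment in the Application manifest, the value of the app.oam.dev/publishVersion annotation must be modified for the alert rules to be repropagated after an update. The following sketch increments the numeric suffix of that value. It assumes the value follows the versionN pattern used in this example, and the function name bump_publish_version is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a manifest on stdin and increment the number in
# "app.oam.dev/publishVersion: versionN". All other lines pass through unchanged.
bump_publish_version() {
  awk '
    /app\.oam\.dev\/publishVersion:/ && match($0, /version[0-9]+/) {
      # "version" is 7 characters long; the digits follow it.
      n = substr($0, RSTART + 7, RLENGTH - 7) + 1
      $0 = substr($0, 1, RSTART - 1) "version" n substr($0, RSTART + RLENGTH)
    }
    { print }
  '
}

sample_line='    app.oam.dev/publishVersion: version1'
bumped=$(printf '%s\n' "$sample_line" | bump_publish_version)
printf '%s\n' "$bumped"
```

After bumping the value in ackalertrule-app-override.yaml, run the kubectl apply command again to repropagate the rules.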
Run the following command to view the propagation progress of the alert rules:
kubectl amc appstatus alertrules -n kube-system --tree --detail
Expected output:
CLUSTER                       NAMESPACE       RESOURCE              STATUS    APPLY_TIME            DETAIL
c565e4**** (ack-cluster-1)─── kube-system─── AckAlertRule/default  updated   2022-**-** **:**:**   Age: **
cbaa12**** (ack-cluster-2)─── kube-system─── AckAlertRule/default  updated   2022-**-** **:**:**   Age: **
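If you want to check the propagation result in a script, the following sketch reports whether every resource row shows the updated status. It assumes that updated is the status of a successfully propagated rule, which is inferred only from the expected output above. The function name all_updated is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the appstatus table on stdin and succeed only if
# every non-header, non-empty row contains the word "updated".
all_updated() {
  awk '
    NR > 1 && NF > 0 {
      total++
      if ($0 ~ /(^|[ \t])updated([ \t]|$)/) ok++
    }
    END { exit !(total > 0 && ok == total) }
  '
}

# Sample input copied from the expected output in this topic.
sample_status='CLUSTER NAMESPACE RESOURCE STATUS APPLY_TIME DETAIL
c565e4**** (ack-cluster-1)─── kube-system─── AckAlertRule/default updated 2022-**-** **:**:** Age: **
cbaa12**** (ack-cluster-2)─── kube-system─── AckAlertRule/default updated 2022-**-** **:**:** Age: **'

if printf '%s\n' "$sample_status" | all_updated; then
  result="all clusters updated"
else
  result="propagation still in progress"
fi
printf '%s\n' "$result"
```

Against a live Fleet instance, the helper could be fed from the command above: kubectl amc appstatus alertrules -n kube-system --tree --detail | all_updated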