Configure different alert rules for multiple clusters - Container Service for Kubernetes

The multi-cluster alert management feature allows you to create or modify alert rules on a Fleet instance. By default, the Fleet instance propagates the same alert rules to every cluster that is associated with it. If your clusters need different alert rules to meet business requirements, you can override the alerting configurations for specific clusters. This topic describes how to do so.

Prerequisites

The Fleet management feature is enabled. For more information, see Enable Fleet management.
Multiple clusters are associated with the Fleet instance. For more information, see Associate clusters with a Fleet instance.
Components required for alert management are installed in the clusters that you want to manage. For more information, see Install and update the components.

Background information

The multi-cluster management feature allows you to create KubeVela override policies on a Fleet instance to override alerting configurations or application configurations. You can create alert rules on a Fleet instance and then create an override policy to override the alert rules of specific clusters. For example, you can create an override policy to enable GPU alerting, set different alert thresholds, and specify different contacts. After you complete the configuration, you can use the Fleet instance to propagate the alert rules to the associated clusters and then apply the override policy.

The following figure shows how alerting configurations are overridden for specific clusters. An override policy is created on the Fleet instance and applied to ACK Cluster 2 to override its alerting configurations. ACK Cluster 1 still uses the original alerting configurations.

Override alerting configurations

Step 1: Create a contact and a contact group

Create a contact and a contact group. For more information, see Step 1: Create a contact and a contact group.

Step 2: Obtain the contact group ID

Obtain the contact group ID. For more information, see Step 2: Obtain the contact group ID.

Step 3: Create alert rules

Create alert rules. For more information, see Step 3: Create an alert rule.

Step 4: Create an override policy and apply the policy to override the alert rules

Use KubeVela to create an override policy on the Fleet instance, and then apply the policy from the Fleet instance to override the alert rules. To do this, perform the following steps.

Run the following command to query the IDs of the clusters to which you want to propagate the alert rules:

kubectl get managedcluster

Expected output:

NAME            HUB ACCEPTED   MANAGED CLUSTER URLS   JOINED   AVAILABLE   AGE
c565e4****      true                                  True     True        12d
cbaa12****      true                                  True     True        12d
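
If the cluster IDs are consumed by a script, the NAME column can be extracted from the tabular output. The following is a minimal sketch, not part of the product tooling: the helper name cluster_ids is hypothetical, and it assumes the layout shown in the expected output above, with a header row and the cluster ID in the first column.

```shell
# cluster_ids: hypothetical helper (not part of kubectl or ACK).
# Reads `kubectl get managedcluster` table output on stdin and prints the
# NAME column, one cluster ID per line.
# Assumption: line 1 is the header row and the ID is the first column.
cluster_ids() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}
```

A possible invocation is `kubectl get managedcluster | cluster_ids`.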

Note

You can also select clusters by specifying cluster labels. For more information, see Method 2: Specify a label as the cluster ID.

Create a file named ackalertrule-app-override.yaml with the following content to define the configurations to override.

In this example, ack-cluster-1 is a CPU-based cluster and ack-cluster-2 is a GPU-accelerated cluster. The example overrides the alert rules of ack-cluster-2 only: the override policy enables GPU alerting, modifies an alert threshold, and changes the contact group.

apiVersion: core.oam.dev/v1alpha1  # Specify the cluster to which the alert rules are propagated by cluster ID. 
kind: Policy
metadata:
  name: cluster-cpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-1>"] # Replace <ack-cluster-1> with the cluster ID of ack-cluster-1. 
---
apiVersion: core.oam.dev/v1alpha1 # Specify the cluster to which the alert rules are propagated by cluster ID. 
kind: Policy
metadata:
  name: cluster-gpu
  namespace: kube-system
type: topology
properties:
  clusters: ["<ack-cluster-2>"] # Replace <ack-cluster-2> with the cluster ID of ack-cluster-2. 
---
apiVersion: core.oam.dev/v1alpha1 # Define an override policy. 
kind: Policy
metadata:
  name: override-gpu
  namespace: kube-system
type: override
properties:
  components:
  - name: ackalertrules  # The component name in the associated application. 
    traits:
    - type: alert-rule   # alert-rule trait is used to modify the alert rules. 
      properties:
        groups:           # The configurations to override, whose structure is the same as that of the alert rules. You can define multiple groups and alert rules to override. 
        - name: res-exceptions      # Specify the alert group to override. 
          rules:
          - contactGroups:           # Override the contact group. 
            - arms_contact_group_id: "12345"
              cms_contact_group_name: ack_Default Contact Group
              id: "1234"
            enable: enable           # Change the value to enable. 
            name: node_cpu_util_high # Specify the name of the alert to override.
            thresholds:              # Modify the threshold. 
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: "60"
        - name: cluster-error    # Specify the alert group to override. 
          rules:
          - enable: enable       # Change the value to enable. 
            name: gpu-xid-error  # Specify the name of the alert to override. 
---
apiVersion: core.oam.dev/v1alpha1  # Define a KubeVela workflow. 
kind: Workflow
metadata:
  name: deploy-ackalertrules
  namespace: kube-system
steps:
  - type: deploy
    name: deploy-cpu
    properties:
      policies: ["cluster-cpu"]   # Deploy the alert rules to cluster-cpu. 
  - type: deploy
    name: deploy-gpu
    properties:
      policies: ["override-gpu", "cluster-gpu"]  # Apply the override policy to override the alert rules of cluster-gpu. 
---
apiVersion: core.oam.dev/v1beta1   # Define a KubeVela application. 
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1  # To repropagate the alert rules after resources are updated, change the value of publishVersion. 
spec:
  components:
    - name: ackalertrules
      type: ref-objects
      properties:
        objects:
          - resource: ackalertrules    # Reference the alert rules created in Step 3. 
            name: default
  workflow:
    ref: deploy-ackalertrules  # Use the propagation rules defined in the workflow to propagate the alert rules.
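
The two cluster ID placeholders in the file can be filled in by hand or with a small script. The sketch below shows one way to do it with sed. The helper name render_override is hypothetical, and it assumes that cluster IDs contain only letters and digits, so they need no escaping in the sed replacement.

```shell
# render_override: hypothetical helper (not part of kubectl or KubeVela).
# Prints TEMPLATE to stdout with the <ack-cluster-1> and <ack-cluster-2>
# placeholders replaced by real cluster IDs.
# Usage: render_override TEMPLATE CPU_CLUSTER_ID GPU_CLUSTER_ID
# Assumption: the IDs are alphanumeric (no "|", "&", or backslash).
# Placeholders inside YAML comments are replaced too, which is harmless.
render_override() {
  sed -e "s|<ack-cluster-1>|$2|g" \
      -e "s|<ack-cluster-2>|$3|g" "$1"
}
```

For example, if the content above is saved under a hypothetical template name such as override-template.yaml, `render_override override-template.yaml "$CPU_ID" "$GPU_ID" > ackalertrule-app-override.yaml` writes the filled-in file. Write to a different file than the one being read, because redirecting to the input file would truncate it.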

Run the following command to apply the override policy and override the alert rules:
```
kubectl apply -f ackalertrule-app-override.yaml
```
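
As the comment on the app.oam.dev/publishVersion annotation in the file notes, the annotation value must change before the alert rules are repropagated after a later update. The following sketch rewrites that value in a copy of the file. The helper name set_publish_version is hypothetical, and it assumes the annotation sits on a single line, as in the example file.

```shell
# set_publish_version: hypothetical helper (not part of kubectl or KubeVela).
# Prints FILE to stdout with the value of the app.oam.dev/publishVersion
# annotation replaced by NEW_VERSION. Any trailing YAML comment on that
# line is dropped; all other lines are printed unchanged.
# Usage: set_publish_version FILE NEW_VERSION
# Assumption: NEW_VERSION contains no "|", "&", or backslash.
set_publish_version() {
  sed -e "s|^\([[:space:]]*app\.oam\.dev/publishVersion:\).*|\1 $2|" "$1"
}
```

For example, `set_publish_version ackalertrule-app-override.yaml version2 > updated.yaml` produces a copy (updated.yaml is an illustrative name) that can then be applied with kubectl apply -f.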

Run the following command to view the propagation progress of the alert rules:

kubectl amc appstatus alertrules -n kube-system --tree --detail

Expected output:

CLUSTER                       NAMESPACE       RESOURCE             STATUS    APPLY_TIME          DETAIL
c565e4**** (ack-cluster-1)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **
cbaa12**** (ack-cluster-2)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **
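
To check the result in a script rather than by eye, the status table can be scanned for rows that are not yet updated. This is a rough sketch under stated assumptions: the helper name all_updated is hypothetical, the first line is taken to be the header row, and updated is taken to be the status that signals successful propagation, as in the expected output above. The exact output layout may differ between versions of the tool.

```shell
# all_updated: hypothetical helper (not part of kubectl or the amc plugin).
# Reads the status table on stdin. Returns 0 only if there is at least one
# resource row and every row contains the field "updated"; returns 1 otherwise.
# Assumption: line 1 is the header, and "updated" means success.
all_updated() {
  awk 'BEGIN { rows = 0; bad = 0 }
       NR > 1 && NF > 0 {
         rows++; ok = 0
         for (i = 1; i <= NF; i++) if ($i == "updated") ok = 1
         if (!ok) bad++
       }
       END { exit (rows > 0 && bad == 0) ? 0 : 1 }'
}
```

A possible invocation is `kubectl amc appstatus alertrules -n kube-system --tree --detail | all_updated && echo "propagation complete"`.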