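The unified multi-cluster alert management feature lets you configure or modify alert rules on a Fleet instance. The Fleet instance distributes the alert rules to the specified associated clusters and keeps the rules consistent across those clusters. When a new cluster is associated, the Fleet instance automatically synchronizes the alert rules to it. This topic describes how to use unified alert management across multiple clusters in a fleet.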
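Prerequisites
Background information
In multi-cluster management scenarios where all clusters share the same alert rule configuration, the traditional approach of logging on to the console and modifying each cluster one by one is tedious and risks inconsistent configurations. With the unified multi-cluster alert management feature of the Fleet instance, administrators can define alert rules in one central place, including the anomaly types that trigger alerts and the contacts to notify. For more information, see Container Service alert management. The following figure shows the architecture of unified alert management for multiple clusters: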
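Step 1: Create alert contacts and contact groups
Follow these steps to create alert contacts and contact groups. You need to create them only once; they are shared by all Container Service clusters.
Log on to the Container Service management console. In the left-side navigation pane, choose Clusters.
On the Clusters page, click the name of the target cluster, and then open the alert configuration page from the left-side navigation pane.
On the alert configuration page, click Start Installation. The console automatically checks the prerequisites and installs or upgrades the required components.
On the alert configuration page, create a contact and a contact group as follows.
Click the Contact Management tab, and then click Create.
On the Create Contact page, enter the name, phone number, and email address, and then click OK.
After the contact is created, you receive an activation text message or an activation email. Follow the instructions in the message to activate the contact.
Click the Contact Group Management tab, and then click Create.
On the Create Group page, enter a group name, select the contacts for the group, and then click OK.
When you select group contacts, you can add contacts from the available list to the selected list, or remove contacts that are already selected.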
Step 2: Obtain the alert contact group ID
Run the following Aliyun CLI command to query contact groups and obtain the internal IDs that the contact group uses in other cloud services. You need these IDs later when you configure alert rules.
aliyun cs GET /alert/contact_groups
Expected output:
{
  "contact_groups": [
    {
      "ali_uid": 14783****,
      "binding_info": "{\"sls_id\":\"ack_14783****_***\",\"cms_contact_group_name\":\"ack_Default Contact Group\",\"arms_id\":\"1****\"}",
      "contacts": null,
      "created": "2021-07-21T12:18:34+08:00",
      "group_contact_ids": [
        2***
      ],
      "group_name": "Default Contact Group",
      "id": 3***,
      "updated": "2022-09-19T19:23:57+08:00"
    }
  ],
  "page_info": {
    "page_number": 1,
    "page_size": 100,
    "total_count": 1
  }
}
Extract the following values from the query result to build contactGroups.
contactGroups:
- arms_contact_group_id: "1****"  # From contact_groups.binding_info.arms_id in the query result.
  cms_contact_group_name: ack_Default Contact Group  # From contact_groups.binding_info.cms_contact_group_name in the query result.
  id: "3***"  # From contact_groups.id in the query result.
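The extraction above can also be scripted. The following Python sketch is one possible way to do it, not an official tool: it assumes the response has the shape shown in the expected output, and in particular that binding_info is a JSON document stored as a string. The function name build_contact_groups and every value in SAMPLE are hypothetical and exist only for illustration.

```python
import json


def build_contact_groups(response_text):
    """Build contactGroups entries from the contact group query response."""
    response = json.loads(response_text)
    entries = []
    for group in response["contact_groups"]:
        # binding_info is itself JSON encoded as a string, so decode it again.
        binding = json.loads(group["binding_info"])
        entries.append({
            "arms_contact_group_id": str(binding["arms_id"]),
            "cms_contact_group_name": binding["cms_contact_group_name"],
            # The template quotes the ID, so convert the number to a string.
            "id": str(group["id"]),
        })
    return entries


# Hypothetical sample with made-up IDs, shaped like the expected output above.
SAMPLE = json.dumps({
    "contact_groups": [{
        "binding_info": json.dumps({
            "sls_id": "ack_100_example",
            "cms_contact_group_name": "ack_Default Contact Group",
            "arms_id": "12345",
        }),
        "group_name": "Default Contact Group",
        "id": 3001,
    }],
})

print(json.dumps(build_contact_groups(SAMPLE), indent=2))
```

In practice you would pass the text printed by aliyun cs GET /alert/contact_groups instead of SAMPLE, then copy the resulting entries into the contactGroups field of the alert rule template.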
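Step 3: Create alert rules
Use the following template to create alert rules. The template contains all alert rules supported by Container Service ACK. The steps below use the error-events alert rule as an example to show how to enable a rule.
The alert rule resource must be named default, and its namespace must be kube-system. For a detailed description of the rules, see Container Service alert management.
Alert rules that you create on the Fleet instance do not take effect on their own. You must also create a distribution rule that distributes the alert rules to the associated clusters so that the rules take effect in each associated cluster.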
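The naming constraint above is easy to check before you apply a manifest. The sketch below is a minimal illustration that works on a manifest already loaded into a Python dict (for example with a YAML parser of your choice). The helper name check_alert_rule_metadata is hypothetical; only the required values default and kube-system come from this topic.

```python
REQUIRED_NAME = "default"
REQUIRED_NAMESPACE = "kube-system"


def check_alert_rule_metadata(manifest):
    """Return a list of problems found in an AckAlertRule manifest's metadata."""
    problems = []
    if manifest.get("kind") != "AckAlertRule":
        problems.append("kind must be AckAlertRule")
    metadata = manifest.get("metadata") or {}
    if metadata.get("name") != REQUIRED_NAME:
        problems.append("metadata.name must be " + REQUIRED_NAME)
    if metadata.get("namespace") != REQUIRED_NAMESPACE:
        problems.append("metadata.namespace must be " + REQUIRED_NAMESPACE)
    return problems


valid = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default", "namespace": "kube-system"},
}
# Hypothetical invalid manifest: wrong name and no namespace.
invalid = {"kind": "AckAlertRule", "metadata": {"name": "my-rules"}}

print(check_alert_rule_metadata(valid))
print(check_alert_rule_metadata(invalid))
```

An empty list means the metadata satisfies the constraint; each string in a non-empty list names one field to fix.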
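In the template, set rules.enable to enable for the error-events alert rule, and add the contactGroups field that you built in the previous step. Save the modified alert rule template as ackalertrule.yaml.
Run the kubectl apply -f ackalertrule.yaml command to create the alert rules on the Fleet instance. The alert rule template is as follows: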
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
  namespace: kube-system
spec:
  groups:
  - name: error-events
    rules:
    - enable: enable
      contactGroups:
      - arms_contact_group_id: "1****"
        cms_contact_group_name: ack_Default Contact Group
        id: "3***"
      expression: sls.app.ack.error
      name: error-event
      notification:
        message: kubernetes cluster error event.
      type: event
  - name: warn-events
    rules:
    - enable: disable
      expression: sls.app.ack.warn
      name: warn-event
      notification:
        message: kubernetes cluster warn event.
      type: event
  - name: cluster-core-error
    rules:
    - enable: disable
      expression: prom.apiserver.notHealthy.down
      name: apiserver-unhealthy
      notification:
        message: "Cluster APIServer not healthy. \nPromQL: ((sum(up{job=\"apiserver\"}) <= 0) or (absent(sum(up{job=\"apiserver\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.etcd.notHealthy.down
      name: etcd-unhealthy
      notification:
        message: "Cluster ETCD not healthy. \nPromQL: ((sum(up{job=\"etcd\"}) <= 0) or (absent(sum(up{job=\"etcd\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.scheduler.notHealthy.down
      name: scheduler-unhealthy
      notification:
        message: "Cluster Scheduler not healthy. \nPromQL: ((sum(up{job=\"ack-scheduler\"}) <= 0) or (absent(sum(up{job=\"ack-scheduler\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.kcm.notHealthy.down
      name: kcm-unhealthy
      notification:
        message: "Custer kube-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-kube-controller-manager\"}) <= 0) or (absent(sum(up{job=\"ack-kube-controller-manager\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.ccm.notHealthy.down
      name: ccm-unhealthy
      notification:
        message: "Cluster cloud-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-cloud-controller-manager\"}) <= 0) or (absent(sum(up{job=\"ack-cloud-controller-manager\"})))) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.coredns.notHealthy.requestdown
      name: coredns-unhealthy-requestdown
      notification:
        message: "Cluster CoreDNS not healthy, continuously request down. \nPromQL: (sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0)"
      type: metric-prometheus
    - enable: disable
      expression: prom.coredns.notHealthy.panic
      name: coredns-unhealthy-panic
      notification:
        message: "Cluster CoreDNS not healthy, continuously panic. \nPromQL: sum(rate(coredns_panic_count_total{}[3m])) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.ingress.request.errorRateHigh
      name: ingress-err-request
      notification:
        message: Cluster Ingress Controller request error rate high (default error rate is 85%).
      type: metric-prometheus
    - enable: disable
      expression: prom.ingress.ssl.expire
      name: ingress-ssl-expire
      notification:
        message: "Cluster Ingress Controller SSL will expire in a few days (default 14 days). \nPromQL: ((nginx_ingress_controller_ssl_expire_time_seconds - time()) / 24 / 3600) < 14"
      type: metric-prometheus
  - name: cluster-error
    rules:
    - enable: disable
      expression: sls.app.ack.docker.hang
      name: docker-hang
      notification:
        message: kubernetes node docker hang.
      type: event
    - enable: disable
      expression: sls.app.ack.eviction
      name: eviction-event
      notification:
        message: kubernetes eviction event.
      type: event
    - enable: disable
      expression: sls.app.ack.gpu.xid_error
      name: gpu-xid-error
      notification:
        message: kubernetes gpu xid error event.
      type: event
    - enable: disable
      expression: sls.app.ack.image.pull_back_off
      name: image-pull-back-off
      notification:
        message: kubernetes image pull back off event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.down
      name: node-down
      notification:
        message: kubernetes node down event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.restart
      name: node-restart
      notification:
        message: kubernetes node restart event.
      type: event
    - enable: disable
      expression: sls.app.ack.ntp.down
      name: node-ntp-down
      notification:
        message: kubernetes node ntp down.
      type: event
    - enable: disable
      expression: sls.app.ack.node.pleg_error
      name: node-pleg-error
      notification:
        message: kubernetes node pleg error event.
      type: event
    - enable: disable
      expression: sls.app.ack.ps.hang
      name: ps-hang
      notification:
        message: kubernetes ps hang event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.fd_pressure
      name: node-fd-pressure
      notification:
        message: kubernetes node fd pressure event.
      type: event
    - enable: disable
      expression: sls.app.ack.node.pid_pressure
      name: node-pid-pressure
      notification:
        message: kubernetes node pid pressure event.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.del_node_failed
      name: node-del-err
      notification:
        message: kubernetes delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.add_node_failed
      name: node-add-err
      notification:
        message: kubernetes add node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.run_command_fail
      name: nlc-run-cmd-err
      notification:
        message: kubernetes node pool nlc run command failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.empty_task_cmd
      name: nlc-empty-cmd
      notification:
        message: kubernetes node pool nlc delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.url_mode_unimpl
      name: nlc-url-m-unimp
      notification:
        message: kubernetes nodde pool nlc delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.op_not_found
      name: nlc-opt-no-found
      notification:
        message: kubernetes node pool nlc delete node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.destroy_node_fail
      name: nlc-des-node-err
      notification:
        message: kubernetes node pool nlc destory node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.drain_node_fail
      name: nlc-drain-node-err
      notification:
        message: kubernetes node pool nlc drain node failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.restart_ecs_wait_fail
      name: nlc-restart-ecs-wait
      notification:
        message: kubernetes node pool nlc restart ecs wait timeout.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.restart_ecs_fail
      name: nlc-restart-ecs-err
      notification:
        message: kubernetes node pool nlc restart ecs failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.reset_ecs_fail
      name: nlc-reset-ecs-err
      notification:
        message: kubernetes node pool nlc reset ecs failed.
      type: event
    - enable: disable
      expression: sls.app.ack.nlc.repair_fail
      name: nlc-sel-repair-err
      notification:
        message: kubernetes node pool nlc self repair failed.
      type: event
  - name: res-exceptions
    rules:
    - enable: disable
      expression: cms.host.cpu.utilization
      name: node_cpu_util_high
      notification:
        message: kubernetes cluster node cpu utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.memory.utilization
      name: node_mem_util_high
      notification:
        message: kubernetes cluster node memory utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.disk.utilization
      name: node_disk_util_high
      notification:
        message: kubernetes cluster node disk utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.public.network.utilization
      name: node_public_net_util_high
      notification:
        message: kubernetes cluster node public network utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.host.fs.inode.utilization
      name: node_fs_inode_util_high
      notification:
        message: kubernetes cluster node file system inode utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.qps.utilization
      name: slb_qps_util_high
      notification:
        message: kubernetes cluster slb qps utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.traffic.tx.utilization
      name: slb_traff_tx_util_high
      notification:
        message: kubernetes cluster slb traffic utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.max.connection.utilization
      name: slb_max_con_util_high
      notification:
        message: kubernetes cluster max connection utilization too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: "85"
      type: metric-cms
    - enable: disable
      expression: cms.slb.drop.connection
      name: slb_drop_con_high
      notification:
        message: kubernetes cluster drop connection count per second too high.
      thresholds:
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: count
        value: "1"
      type: metric-cms
    - enable: disable
      expression: sls.app.ack.node.disk_pressure
      name: node-disk-pressure
      notification:
        message: kubernetes node disk pressure event.
      type: event
    - enable: disable
      expression: sls.app.ack.resource.insufficient
      name: node-res-insufficient
      notification:
        message: kubernetes node resource insufficient.
      type: event
    - enable: disable
      expression: sls.app.ack.ip.not_enough
      name: node-ip-pressure
      notification:
        message: kubernetes ip not enough event.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.no_enough_disk_space
      name: disk_space_press
      notification:
        message: kubernetes csi not enough disk space.
      type: event
  - name: cluster-scale
    rules:
    - enable: disable
      expression: sls.app.ack.autoscaler.scaleup_group
      name: autoscaler-scaleup
      notification:
        message: kubernetes autoscaler scale up.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaledown
      name: autoscaler-scaledown
      notification:
        message: kubernetes autoscaler scale down.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaleup_timeout
      name: autoscaler-scaleup-timeout
      notification:
        message: kubernetes autoscaler scale up timeout.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaledown_empty
      name: autoscaler-scaledown-empty
      notification:
        message: kubernetes autoscaler scale down empty node.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaleup_group_failed
      name: autoscaler-up-group-failed
      notification:
        message: kubernetes autoscaler scale up failed.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.cluster_unhealthy
      name: autoscaler-cluster-unhealthy
      notification:
        message: kubernetes autoscaler error, cluster not healthy.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.delete_started_timeout
      name: autoscaler-del-started
      notification:
        message: kubernetes autoscaler delete node started long ago.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.delete_unregistered
      name: autoscaler-del-unregistered
      notification:
        message: kubernetes autoscaler delete unregistered node.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.scaledown_failed
      name: autoscaler-scale-down-failed
      notification:
        message: kubernetes autoscaler scale down failed.
      type: event
    - enable: disable
      expression: sls.app.ack.autoscaler.instance_expired
      name: autoscaler-instance-expired
      notification:
        message: kubernetes autoscaler scale down instance expired.
      type: event
  - name: workload-exceptions
    rules:
    - enable: disable
      expression: prom.job.failed
      name: job-failed
      notification:
        message: "Cluster Job failed. \nPromQL: kube_job_status_failed{job=\"_kube-state-metrics\"} > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.deployment.replicaError
      name: deployment-rep-err
      notification:
        message: "Cluster Deployment replication status error. \nPromQL: kube_deployment_spec_replicas{job=\"_kube-state-metrics\"} != kube_deployment_status_replicas_available{job=\"_kube-state-metrics\"}"
      type: metric-prometheus
    - enable: disable
      expression: prom.daemonset.scheduledError
      name: daemonset-status-err
      notification:
        message: "Cluster Daemonset pod status or scheduled error. \nPromQL: ((100 - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{} * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{})) > 0"
      type: metric-prometheus
    - enable: disable
      expression: prom.daemonset.misscheduled
      name: daemonset-misscheduled
      notification:
        message: "Cluster Daemonset misscheduled. \nPromQL: kube_daemonset_status_number_misscheduled{job=\"_kube-state-metrics\"} \ > 0"
      type: metric-prometheus
  - name: pod-exceptions
    rules:
    - enable: disable
      expression: sls.app.ack.pod.oom
      name: pod-oom
      notification:
        message: kubernetes pod oom event.
      type: event
    - enable: disable
      expression: sls.app.ack.pod.failed
      name: pod-failed
      notification:
        message: kubernetes pod start failed event.
      type: event
    - enable: disable
      expression: prom.pod.status.notHealthy
      name: pod-status-err
      notification:
        message: 'Pod status exception. \nPromQL: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed", job="_kube-state-metrics"})[${mins}m:1m]) > 0'
      type: metric-prometheus
    - enable: disable
      expression: prom.pod.status.crashLooping
      name: pod-crashloop
      notification:
        message: 'Pod status exception. \nPromQL: sum_over_time(increase(kube_pod_container_status_restarts_total{job="_kube-state-metrics"}[1m])[${mins}m:1m]) > 3'
      type: metric-prometheus
  - name: cluster-storage-err
    rules:
    - enable: disable
      expression: sls.app.ack.csi.invalid_disk_size
      name: csi_invalid_size
      notification:
        message: kubernetes csi invalid disk size.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.disk_not_portable
      name: csi_not_portable
      notification:
        message: kubernetes csi not protable.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.deivce_busy
      name: csi_device_busy
      notification:
        message: kubernetes csi disk device busy.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.no_ava_disk
      name: csi_no_ava_disk
      notification:
        message: kubernetes csi no available disk.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.disk_iohang
      name: csi_disk_iohang
      notification:
        message: kubernetes csi ioHang.
      type: event
    - enable: disable
      expression: sls.app.ack.csi.latency_too_high
      name: csi_latency_high
      notification:
        message: kubernetes csi pvc latency load too high.
      type: event
    - enable: disable
      expression: prom.pv.failed
      name: pv-failed
      notification:
        message: 'Cluster PersistentVolume failed. \nPromQL: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="_kube-state-metrics"} > 0'
      type: metric-prometheus
  - name: cluster-network-err
    rules:
    - enable: disable
      expression: sls.app.ack.ccm.no_ava_slb
      name: slb-no-ava
      notification:
        message: kubernetes slb not available.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.sync_slb_failed
      name: slb-sync-err
      notification:
        message: kubernetes slb sync failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.del_slb_failed
      name: slb-del-err
      notification:
        message: kubernetes slb delete failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.create_route_failed
      name: route-create-err
      notification:
        message: kubernetes create route failed.
      type: event
    - enable: disable
      expression: sls.app.ack.ccm.sync_route_failed
      name: route-sync-err
      notification:
        message: kubernetes sync route failed.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.invalid_resource
      name: terway-invalid-res
      notification:
        message: kubernetes terway have invalid resource.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.alloc_ip_fail
      name: terway-alloc-ip-err
      notification:
        message: kubernetes terway allocate ip error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.parse_fail
      name: terway-parse-err
      notification:
        message: kubernetes terway parse k8s.aliyun.com/ingress-bandwidth annotation error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.allocate_failure
      name: terway-alloc-res-err
      notification:
        message: kubernetes parse resource error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.dispose_failure
      name: terway-dispose-err
      notification:
        message: kubernetes dispose resource error.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.virtual_mode_change
      name: terway-virt-mod-err
      notification:
        message: kubernetes virtual mode changed.
      type: event
    - enable: disable
      expression: sls.app.ack.terway.config_check
      name: terway-ip-check
      notification:
        message: kubernetes terway execute pod ip config check.
      type: event
    - enable: disable
      expression: sls.app.ack.ingress.err_reload_nginx
      name: ingress-reload-err
      notification:
        message: kubernetes ingress reload config error.
      type: event
  - name: security-err
    rules:
    - enable: disable
      expression: sls.app.ack.si.config_audit_high_risk
      name: si-c-a-risk
      notification:
        message: kubernetes high risks have be found after running config audit.
      type: event
  ruleVersion: v1.0.9
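The template edit described in this step (set enable to enable and attach contactGroups) can be sketched in Python as follows. This is only an illustration on a manifest already loaded into a dict, with a trimmed two-group TEMPLATE standing in for the full template. The helper name enable_group_rules and the IDs in CONTACT_GROUPS are hypothetical.

```python
import copy


def enable_group_rules(manifest, group_name, contact_groups):
    """Return a copy of the manifest with every rule in one group enabled."""
    updated = copy.deepcopy(manifest)  # leave the caller's template untouched
    matched = False
    for group in updated["spec"]["groups"]:
        if group["name"] != group_name:
            continue
        matched = True
        for rule in group["rules"]:
            rule["enable"] = "enable"
            rule["contactGroups"] = copy.deepcopy(contact_groups)
    if not matched:
        raise KeyError("no rule group named " + group_name)
    return updated


# Trimmed stand-in for the full template; only two groups are kept.
TEMPLATE = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default", "namespace": "kube-system"},
    "spec": {"groups": [
        {"name": "error-events", "rules": [
            {"enable": "disable", "expression": "sls.app.ack.error",
             "name": "error-event", "type": "event"}]},
        {"name": "warn-events", "rules": [
            {"enable": "disable", "expression": "sls.app.ack.warn",
             "name": "warn-event", "type": "event"}]},
    ]},
}

# Hypothetical contact group values; use the ones obtained in Step 2.
CONTACT_GROUPS = [{
    "arms_contact_group_id": "12345",
    "cms_contact_group_name": "ack_Default Contact Group",
    "id": "3001",
}]

result = enable_group_rules(TEMPLATE, "error-events", CONTACT_GROUPS)
print(result["spec"]["groups"][0]["rules"][0]["enable"])
```

Returning a modified copy rather than editing in place keeps the original template available for comparison; rule groups other than the one named stay disabled.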
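Step 4: Distribute alert rules to associated clusters
An alert rule is itself a Kubernetes resource. Alert rules are distributed in the same way as applications: the open source KubeVela project distributes Kubernetes resources from the Fleet instance to the associated clusters. To distribute alert rules, perform the following steps:
Use one of the following templates to create a distribution rule file named ackalertrule-app.yaml.
Method 1: Distribute the alert rules to associated clusters that carry the label production=true.
Run the following commands to label the associated clusters.
kubectl get managedclusters  # Obtain the cluster IDs of the associated clusters.
kubectl label managedclusters <clusterid> production=true
Distribute the alert rules to the associated clusters labeled production=true.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1
spec:
  components:
  - name: alertrules
    type: ref-objects
    properties:
      objects:
      - resource: ackalertrules
        name: default
  policies:
  - type: topology
    name: prod-clusters
    properties:
      clusterSelector:
        production: "true"  # Select clusters by label.
Method 2: Specify cluster IDs directly to distribute the alert rules to specific associated clusters.
In the following template, replace <clusterid1> and <clusterid2> with the IDs of the associated clusters that you want to distribute the rules to.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: alertrules
  namespace: kube-system
  annotations:
    app.oam.dev/publishVersion: version1
spec:
  components:
  - name: alertrules
    type: ref-objects
    properties:
      objects:
      - resource: ackalertrules
        name: default
  policies:
  - type: topology
    name: prod-clusters
    properties:
      clusters: ["<clusterid1>", "<clusterid2>"]  # Select clusters by cluster ID.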
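The two templates differ only in the properties of the topology policy. The Python sketch below generates either variant from one function and prints it as JSON, which kubectl apply -f also accepts. The function name build_distribution_app and the sample cluster IDs are hypothetical; the field names and values are copied from the templates above.

```python
import json


def build_distribution_app(publish_version, cluster_ids=None, cluster_selector=None):
    """Build the Application manifest that distributes the default AckAlertRule."""
    # Exactly one targeting method must be chosen, matching Method 1 or Method 2.
    if bool(cluster_ids) == bool(cluster_selector):
        raise ValueError("pass exactly one of cluster_ids or cluster_selector")
    if cluster_ids:
        target = {"clusters": list(cluster_ids)}
    else:
        target = {"clusterSelector": dict(cluster_selector)}
    return {
        "apiVersion": "core.oam.dev/v1beta1",
        "kind": "Application",
        "metadata": {
            "name": "alertrules",
            "namespace": "kube-system",
            "annotations": {"app.oam.dev/publishVersion": publish_version},
        },
        "spec": {
            "components": [{
                "name": "alertrules",
                "type": "ref-objects",
                "properties": {
                    "objects": [{"resource": "ackalertrules", "name": "default"}],
                },
            }],
            "policies": [{
                "type": "topology",
                "name": "prod-clusters",
                "properties": target,
            }],
        },
    }


by_label = build_distribution_app("version1", cluster_selector={"production": "true"})
# Hypothetical cluster IDs, for illustration only.
by_id = build_distribution_app("version1", cluster_ids=["c-example-1", "c-example-2"])
print(json.dumps(by_label, indent=2))
```

Rejecting calls that pass both or neither targeting option keeps the generated manifest aligned with one of the two documented methods.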
Run the following command to create the distribution rule.
kubectl apply -f ackalertrule-app.yaml
Run the following command to check the distribution status.
kubectl amc appstatus alertrules -n kube-system --tree --detail
Expected output:
CLUSTER NAMESPACE RESOURCE STATUS APPLY_TIME DETAIL
c565e4**** (cluster1)─── kube-system─── AckAlertRule/default updated 2022-**-** **:**:** Age: **
cbaa12**** (cluster2)─── kube-system─── AckAlertRule/default updated 2022-**-** **:**:** Age: **
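Modify alert rules
Follow these steps to modify alert rules.
Modify the alert rule template ackalertrule.yaml, and then run kubectl apply -f ackalertrule.yaml to apply the updated alert rules to the Fleet instance.
Modify the distribution template ackalertrule-app.yaml to update the value of annotations: app.oam.dev/publishVersion, and then run kubectl apply -f ackalertrule-app.yaml to distribute the alert rules.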
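If you script the redistribution step, the publishVersion value has to change on every update. The Python sketch below shows one hypothetical convention, incrementing a trailing number as in version1 to version2. This topic only requires that the annotation be updated; the numbering scheme and the helper name bump_publish_version are assumptions made for illustration.

```python
import re


def bump_publish_version(current):
    """Return the next value for names that end in a number, such as 'version1'."""
    match = re.fullmatch(r"(.*?)(\d+)", current)
    if match is None:
        raise ValueError("publishVersion does not end in a number: " + current)
    prefix, number = match.groups()
    return prefix + str(int(number) + 1)


def with_bumped_version(app):
    """Return a shallow copy of an Application manifest with a new publishVersion."""
    annotations = dict(app["metadata"]["annotations"])
    key = "app.oam.dev/publishVersion"
    annotations[key] = bump_publish_version(annotations[key])
    metadata = dict(app["metadata"], annotations=annotations)
    return dict(app, metadata=metadata)


app = {
    "metadata": {
        "name": "alertrules",
        "annotations": {"app.oam.dev/publishVersion": "version1"},
    },
}
print(with_bumped_version(app)["metadata"]["annotations"])
```

The updated manifest can then be written back to ackalertrule-app.yaml and applied with the kubectl apply command given in this section.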