Use Terraform to Implement SLS Alert

This article describes how to use Terraform to configure SLS alerts automatically, without going through the console.

By Huolang, from Alibaba Cloud Storage

Preface

Terraform is an open source infrastructure orchestration tool from HashiCorp that lets you write, plan, and create infrastructure as code. Its command line interface (CLI) provides a simple mechanism for deploying configuration files to Alibaba Cloud or any other supported cloud and for keeping those configurations under version control.

SLS Alert is a comprehensive intelligent O&M platform for alert monitoring, noise reduction, incident management, and notification dispatch. It includes modules such as log and time series storage, alert monitoring, alert management, and notification management. A feature set this broad benefits from automated configuration. This article describes how to use Terraform to configure alerts automatically, without going through the console.

Install and Configure Terraform

See the official Alibaba Cloud Terraform documentation for how to install and configure Terraform. The Terraform command line is also integrated into Cloud Shell.

An Introduction to Resources Related to SLS Alert

SLS Alert mainly involves three operations:

  • Initialize alert resources
  • Manage the rules of alert monitoring
  • Manage alert policy and resource data

Initialize Alert Resources

  • Initialize Alert Resources

    • Central Project: The name is sls-alert-{uid}-{region}, where uid is the Alibaba Cloud primary account, and region is the region of the central project specified by the user.
    • Central Logstore: The name is internal-alert-center-log. This logstore is mounted to the central project and is free of charge. It is mainly used to store the execution history and diagnostic information during alert execution.
    • Built-In Alert Dashboard: It includes a global alert troubleshooting center, a global alert link center, a global alert rule center, and an open alert center.
    • Each Alibaba Cloud primary account only needs to be initialized once. The operation is idempotent, so running it multiple times is harmless.
  • Initialize Alert Resources of Project

    • Rules of alert monitoring must be mounted to a project of SLS. You need to initialize the alert resources under the project before creating an alert rule under a project.
    • Logstore of Alert History Statistics: The name is internal-alert-history. It is a free logstore that stores the evaluation history of all alert rules in the current project, including the status of each evaluation and the status of alert triggering.
    • Built-In Dashboard of Alert History Statistics: The name is internal-alert-analysis. It is a built-in dashboard that shows the success rate of executing rules of alert monitoring.
    • Each project only needs to be initialized once. The operation is idempotent, so running it multiple times is harmless.

Manage Rules of Alert Monitoring

An alert monitoring rule defines how a data source (such as time series or logs) is monitored. Its settings include collaborative monitoring, group evaluation, trigger conditions, severity, no-data alerts, alert recovery, and other conditional parameters.

Manage Alert Resource Data

In SLS Alert, after a monitoring rule is triggered, the resulting alert is matched against the preset alert policy. The alert policy applies noise reduction, such as merging, silencing, and suppression. After noise reduction, the alert is sent to the specified action policy, which can be thought of as a notification channel.

Notification channels include text messages, voice messages, emails, webhooks, DingTalk, WeChat, Feishu, Function Compute, and EventBridge. Managing alert resource data involves the management of users, user groups, and webhooks.

The preceding alert policy, action policy, users, user groups, and webhooks are collectively referred to as alert resource data in SLS.

Use Terraform to Manage SLS Alert

Configure Identity Information and the Central Region for Alerts

export ALICLOUD_ACCESS_KEY="LTAIUrZCw3********"
export ALICLOUD_SECRET_KEY="zfwwWAMWIAiooj14GQ2*************"
export ALICLOUD_REGION="cn-heyuan"
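
With these variables exported, the provider block itself does not need to carry any credentials. The following is a minimal sketch, assuming the aliyun/alicloud provider from the Terraform Registry; pin a provider version that suits your environment.

```hcl
terraform {
  required_providers {
    alicloud = {
      source = "aliyun/alicloud"
    }
  }
}

# Credentials and region are read from ALICLOUD_ACCESS_KEY,
# ALICLOUD_SECRET_KEY, and ALICLOUD_REGION, so nothing is hard-coded here.
provider "alicloud" {}
```

Keeping the keys in environment variables also keeps them out of any .tf file that is committed to version control.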

Initialize Alibaba Cloud Alert Resources

The following configuration creates resources under the ALICLOUD_REGION:

  • Project: The format of the name is sls-alert-{uid}-{region}.
  • Logstore: internal-alert-center-log (This logstore is free of charge.)
  • Built-In Dashboard of Project: Global alert troubleshooting center, global alert link center, global alert rule center, and open alert center
  • Please refer to alicloud_log_alert_resource for the meaning of specific parameters.
data "alicloud_log_alert_resource" "example" {
  type          = "user"
  lang          = "cn"
}

Initialize Alert Resources of Project

The following configuration creates resources in the test-project:

  • Logstore: internal-alert-history (This logstore is free of charge.)
  • Alert Dashboard
  • Note: The test-project is required to be in the region of ALICLOUD_REGION.
  • Please refer to alicloud_log_alert_resource for the meaning of specific parameters.
data "alicloud_log_alert_resource" "example" {
  type          = "project"
  project       = "test-project"
}

Create Alert Rules

The following configurations will create the rules of alert monitoring, including the following contents:

  • Alert name, timing policy, and no-data alerts
  • Query List: You can specify the logstore and metricstore queries.
  • Label, label configuration, group evaluation, severity configuration
  • The configuration of alert policies and action policies
  • Please refer to alicloud_log_alert for the meaning of specific parameters.
resource "alicloud_log_alert" "example" {
  version           = "2.0"
  type              = "default"
  project_name      = "test-project"
  alert_name        = "tf-test-alert-2"
  alert_displayname = "tf-test-alert-displayname-2"
  dashboard         = "tf-test-dashboard"
  mute_until        = "1632486684"
  no_data_fire      = "false"
  no_data_severity  = 8
  send_resolved     = true
  schedule_interval = "5m"
  schedule_type     = "FixedRate"
  query_list {
    store       = "tf-test-logstore"
    store_type  = "log"
    project     = "test-project"
    region      = "cn-heyuan"
    chart_title = "chart_title"
    start       = "-60s"
    end         = "20s"
    query       = "* AND aliyun | select count(1) as cnt"
    time_span_type = "Custom"
  }
  query_list {
    store       = "tf-test-logstore-5"
    store_type  = "log"
    project     = "test-project"
    region      = "cn-heyuan"
    chart_title = "chart_title"
    start       = "-60s"
    end         = "20s"
    query       = "error | select count(1) as error_cnt"
    time_span_type = "Custom"
  }
  join_configurations {
    type      = "cross_join"
    condition = ""
  }
  labels {
    key = "env"
    value = "test"
  }
  labels {
    key = "env1"
    value = "test1"
  }
  annotations {
    key = "title"
    value = "alert title-1"
  }
  annotations {
    key = "desc"
    value = "alert desc"
  }
  annotations {
    key = "test_key"
    value = "test value"
  }
  group_configuration {
    type   = "custom"
    fields = ["a", "b", "d"]
  }
  severity_configurations {
    severity = 8
    eval_condition = {
      condition = "cnt > 3"
      count_condition = "__count__ > 3"
    }
  }
  severity_configurations {
    severity = 6
    eval_condition = {
      condition = ""
      count_condition = "__count__ > 0"
    }
  }
  severity_configurations {
    severity = 2
    eval_condition = {
      condition = ""
      count_condition = ""
    }
  }
  
  policy_configuration {
    alert_policy_id  = "sls.builtin.dynamic"
    action_policy_id = "sls_test_action"
    repeat_interval  = "1m"
  }
}

Create Alert Resources

Alert resources mainly include users, user groups, on-duty groups, webhook integration, alert policies, action policies, content templates, default calendars, and channel quotas. This section uses user creation as an example to introduce the Terraform format, followed by a list of the related resource types and their value structures.

User Creation

  • The resource_name is sls.common.user, as given in the list of related resources.
  • The record_id is the ID of the user.
  • The tag is the user name.
  • The value is a JSON string. See the value structure examples in the list of related resources.
resource "alicloud_log_resource_record" "user" {
  resource_name         = "sls.common.user"
  record_id             = "test_tf_user"
  tag                   = "test tf user" 
  value                 = "{\n\t\"user_name\": \"test tf user\", \n\t\"sms_enabled\": true, \n\t\"phone\": \"18888888889\", \n\t\"voice_enabled\": false, \n\t\"email\": [\n\t\t\"test@qq.com\"\n\t], \n\t\"enabled\": true, \n\t\"user_id\": \"test_tf_user\", \n\t\"country_code\": \"86\"\n}"
}
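
The hand-escaped JSON string above is easy to get wrong. As a sketch of an alternative, Terraform's built-in jsonencode function can build the same value from an HCL object. The resource label user_jsonencode is hypothetical, and this block is meant to replace the resource above rather than sit beside it, because both use the same record_id.

```hcl
resource "alicloud_log_resource_record" "user_jsonencode" {
  resource_name = "sls.common.user"
  record_id     = "test_tf_user"
  tag           = "test tf user"

  # jsonencode turns the HCL object into a JSON string,
  # so no manual quoting or \n\t escapes are needed.
  value = jsonencode({
    user_id       = "test_tf_user"
    user_name     = "test tf user"
    email         = ["test@qq.com"]
    country_code  = "86"
    phone         = "18888888889"
    enabled       = true
    sms_enabled   = true
    voice_enabled = false
  })
}
```

jsonencode emits compact JSON, so the stored string differs in whitespace from the hand-written one. Run terraform plan after switching to confirm how the provider treats the change.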

List of Related Resources

Resource Type: Users

resource_name: sls.common.user

record_id: The value is the same as user_id.

Tag: The value is the same as user_name.

Example of value structure:

{
    "user_id": "xiaoming",
    "user_name": "Xiaoming",
    "email": [
        "xiaoming@example.com"
    ],
    "country_code": "86",
    "phone": "13334567890",
    "enabled": true,
    "sms_enabled": true,
    "voice_enabled": true
}

Resource Type: User group

resource_name: sls.common.user_group

record_id: The value is the same as user_group_id.

Tag: The value is the same as user_group_name.

Example of value structure:

{ 
    "user_group_id": "group-xiaoming",
    "user_group_name": "Group-Xiaoming",
    "enabled": true,
    "members": [
        "xiaoming"
    ]
}

Remarks:

  • record_id: user_id
  • tag: user_name
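
A user group record follows the same pattern as a user record. The sketch below assumes the resource_name is sls.common.user_group and that a user with the ID xiaoming already exists. The resource label and the local names are illustrative only.

```hcl
locals {
  # Hypothetical names used only in this sketch.
  group_id   = "group-xiaoming"
  group_name = "Group-Xiaoming"
}

resource "alicloud_log_resource_record" "user_group" {
  resource_name = "sls.common.user_group" # assumed resource type name
  record_id     = local.group_id          # same as user_group_id
  tag           = local.group_name        # same as user_group_name

  value = jsonencode({
    user_group_id   = local.group_id
    user_group_name = local.group_name
    enabled         = true
    members         = ["xiaoming"] # user_id values of existing users
  })
}
```

Defining the ID and name once in locals keeps record_id and tag in step with the fields inside value.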

Resource Type: On-Duty Group

resource_name: sls.alert.oncall_group

record_id: The value is the same as oncall_id.

Tag: The value is the same as oncall_name.

Example of value structure:

{
    "oncall_id": "default_oncall",
    "oncall_name": "default oncall",
    "enabled": true,
    "overrides": [],
    "rotations": [
        {
            "targets": [
                {
                    "type": "user",
                    "target_id": "jizhi"
                },
                {
                    "type": "user_group",
                    "target_id": "alert-dev"
                }
            ],
            "end_time": 0,
            "shift_day": "",
            "shift_time": "12:00",
            "shift_type": "day",
            "start_time": 1633017600,
            "shift_minute": 0,
            "end_time_type": "none",
            "shift_interval": 1,
            "shift_week_custom": null,
            "restriction_date_type": "workday",
            "restriction_time_type": "allday",
            "restriction_week_range": null,
            "restriction_time_custom_range": null
        }
    ],
    "calendar_id": "default_calendar"
}

Resource Type: Webhook Integration

resource_name: sls.alert.action_webhook

record_id: The value is the same as the id.

Tag: The value is the same as the name.

Example of value structure:

{
    "id": "custom-test",
    "name": "customized webhook test",
    "type": "custom",
    "url": "http://localhost:9099/data/webhook",
    "method": "POST",
    "headers": [
        {
            "key": "Content-Type",
            "value": "application/json"
        },
        {
            "key": "Foo",
            "value": "bar"
        }
    ]
}

Remarks:

Valid values of type:

  • dingtalk
  • wechat
  • lark
  • slack
  • custom

For every type except custom, the method field is fixed as POST and headers is fixed as an empty array.

Resource Type: Alert Policy

resource_name: sls.alert.alert_policy

record_id: The value is the same as policy_id.

Tag: The value is the same as policy_name.

Example of value structure:

{
    "policy_id": "sls.builtin",
    "policy_name": "built-in alert policy",
    "parent_id": "sls.root",
    "is_default": false,
    "group_script": "fire(action_policy=\"sls.builtin\", group={\"project\": \"__a__\", \"uid\": alert.aliuid}, group_wait=\"5s\", group_interval=\"2m\", repeat_interval=\"2m\")\nstop()\nfire(action_policy=\"sls.builtin\", group={\"alert_id\": alert.alert_id}, group_wait=\"5s\", group_interval=\"10s\", repeat_interval=\"2m\")\nif alert.labels.name ~= \"^\\\\w+s$\":\n\tfire(action_policy=\"sls.builtin\", group={\"product\": \"xxs\"}, group_wait=\"5s\", group_interval=\"10s\", repeat_interval=\"2m\")\n\tstop()\nstop()\nfire(action_policy=\"sls.builtin\", group={\"label_name\": alert.labels.name}, group_wait=\"10s\", group_interval=\"10s\", repeat_interval=\"2m\")",
    "inhibit_script": "if alert.severity >= 8:\n    silence alert.severity < 6",
    "silence_script": ""
}

Remarks:

  • The is_default is fixed as false.
  • The group_script is the route consolidation policy.
  • The inhibit_script is the suppression policy.
  • The silence_script is a silence policy.
  • When configured through Terraform, group_script, inhibit_script, and silence_script contain only DSL script information and no UI configuration information, so the corresponding graphical configuration is not displayed on the console.

Resource Type: Action Policy

resource_name: sls.alert.action_policy

record_id: The value is the same as action_policy_id.

Tag: The value is the same as action_policy_name.

Example of value structure:

{
    "action_policy_id": "sls.builtin",
    "action_policy_name": "default action policy",
    "labels": {},
    "is_default": false,
    "primary_policy_script": "fire(type=\"webhook_integration\", integration_type=\"dingtalk\", webhook_id=\"dingtalk-test\", template_id=\"default-template\", period=\"any\")",
    "secondary_policy_script": "fire(type=\"voice\", users=[\"jizhi\"], groups=[\"group-jizhi\"], template_id=\"default-template\")",
    "escalation_start_enabled": false,
    "escalation_start_timeout": "10s",
    "escalation_inprogress_enabled": false,
    "escalation_inprogress_timeout": "10s",
    "escalation_enabled": false,
    "escalation_timeout": "4h0m0s"
}

Remarks:

  • The is_default is fixed as false.
  • The labels are reserved fields and are fixed as {}.
  • The primary_policy_script is the first action policy.
  • The secondary_policy_script is the second action policy.
  • The escalation_* fields control whether the second action policy is enabled. Please see the configuration items on the console for more information.
  • Through Terraform configuration, primary_policy_script and secondary_policy_script only contain DSL script information but no UI configuration information, so the corresponding graphics are not displayed on the console.
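
The alert rule shown earlier refers to an action policy with the ID sls_test_action. As a hedged sketch, that policy could be declared as a resource record as follows. The scripts are copied from the structure example above, so the webhook dingtalk-test, the user jizhi, the user group group-jizhi, and the template default-template are all assumed to exist already. The policy name is a placeholder.

```hcl
resource "alicloud_log_resource_record" "action_policy" {
  resource_name = "sls.alert.action_policy"
  record_id     = "sls_test_action"    # same as action_policy_id
  tag           = "test action policy" # same as action_policy_name (placeholder)

  value = jsonencode({
    action_policy_id   = "sls_test_action"
    action_policy_name = "test action policy"
    labels             = {} # reserved field, fixed as {}
    is_default         = false

    # DSL scripts taken from the structure example; the IDs they
    # mention must refer to records that already exist.
    primary_policy_script   = "fire(type=\"webhook_integration\", integration_type=\"dingtalk\", webhook_id=\"dingtalk-test\", template_id=\"default-template\", period=\"any\")"
    secondary_policy_script = "fire(type=\"voice\", users=[\"jizhi\"], groups=[\"group-jizhi\"], template_id=\"default-template\")"

    escalation_start_enabled      = false
    escalation_start_timeout      = "10s"
    escalation_inprogress_enabled = false
    escalation_inprogress_timeout = "10s"
    escalation_enabled            = false
    escalation_timeout            = "4h0m0s"
  })
}
```

With all escalation flags set to false, only the primary script is expected to take effect, matching the remarks above.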

Resource Type: Content Template

resource_name: sls.alert.content_template

record_id: The value is the same as template_id.

Tag: The value is the same as template_name.

Example of value structure:

{
    "template_id": "default-template",
    "template_name": "default template",
    "is_default": false,
    "templates": {
        "fc": {
            "limit": 0,
            "locale": "zh-CN",
            "content": "",
            "send_type": "merged"
        },
        "sms": {
            "locale": "zh-CN",
            "content": ""
        },
        "lark": {
            "title": "Alerthub alert test ${alert_name}",
            "locale": "zh-CN",
            "content": ""
        },
        "email": {
            "locale": "zh-CN",
            "content": "",
            "subject": "SLS alert test -jizhi-test"
        },
        "slack": {
            "title": "Alerthub alert test ${alert_name}",
            "locale": "zh-CN",
            "content": ""
        },
        "voice": {
            "locale": "zh-CN",
            "content": ""
        },
        "wechat": {
            "title": "Alerthub alert test ${alert_name}",
            "locale": "zh-CN",
            "content": ""
        },
        "webhook": {
            "limit": 0,
            "locale": "zh-CN",
            "content": "",
            "send_type": "merged"
        },
        "dingtalk": {
            "title": "Alerthub alert test ${alert_name}",
            "locale": "zh-CN",
            "content": ""
        },
        "event_bridge": {
            "locale": "zh-CN",
            "content": "",
            "subject": "wkb-test"
        },
        "message_center": {
            "locale": "zh-CN",
            "content": ""
        }
    }
}

Remarks:

  • The is_default is fixed as false.
  • Templates contain the template configuration of each channel. If the content of a channel is empty, the default template configuration of the system is used. Please refer to Default Content Templates for more information.
  • The locale value is zh-CN or en-US.
  • The send_type value of the webhook and fc channels is single or merged, which indicates one-by-one notifications or merged notifications.

Resource Type: Default Calendar

resource_name: sls.common.calendar

record_id: The value is the same as calendar_id.

Tag: The value is the same as calendar_name.

Example of value structure:

{
    "calendar_id": "default_calendar",
    "calendar_name": "default calendar",
    "timezone": "Asia/Shanghai",
    "workdays": [
        1,
        2,
        3,
        4,
        5
    ],
    "worktime": [
        {
            "end_time": "21:00",
            "start_time": "09:00"
        }
    ],
    "reset_days": [],
    "holiday_sync": "china"
}

Remarks:

  • The id of the default calendar is fixed as default_calendar.
  • The calendar configuration is complicated. We recommend using the console to operate it.

Resource Type: Channel Quota

resource_name: sls.alert.channel_quota

record_id: The value is the same as the id.

Tag: The value is empty.

Example of value structure:

{
    "id": "default",
    "quota_script": "if user in [\"jizhi\"]:\n    set_limit(sms=5, voice=5, email=5)\nset_limit(sms=100, voice=100, email=100)"
}

Remarks:

  • The id is fixed as default.
  • The quota_script is a DSL configuration. If it is configured through Terraform, the UI configuration information will be lost. Therefore, the corresponding graphics are not displayed on the console.
  • We recommend using the console to implement the configuration.

Common Commands of Terraform

  • Create the terraform.tf file, enter the preceding content, and save the file in the current working directory.
  • terraform init: Initialize the Terraform configuration.
  • terraform plan: Preview the changes between the configuration in terraform.tf and what has already been applied. The result is displayed as a diff.
  • terraform apply: Create and update the resources defined in terraform.tf.
  • terraform destroy: Destroy the resources.
  • terraform import: Import existing resources that were created and managed outside Terraform.
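
In Terraform 1.5 and later, an import can also be declared in configuration instead of on the command line. The sketch below assumes that the import ID of alicloud_log_alert has the form project:alert_name; check the provider documentation for the exact format before relying on it.

```hcl
# Declarative import (Terraform 1.5+). The ID format "project:alert_name"
# is an assumption and should be verified against the provider documentation.
import {
  to = alicloud_log_alert.example
  id = "test-project:tf-test-alert-2"
}
```

After the block is added, terraform plan shows the pending import and terraform apply records the existing alert in the state.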
