After you enable monitoring for data transformation jobs, Log Service sends alert notifications if exceptions occur during data transformation. This helps you handle exceptions at the earliest opportunity. This topic describes how to enable monitoring for data transformation jobs.

Prerequisites

A data transformation job is created. For more information, see Create a data transformation job.

Background information

  • After you create a data transformation job, Log Service automatically creates a dashboard named Data Transformation Troubleshooting for the data transformation job. We recommend that you take note of the following metrics on the Data Transformation Troubleshooting dashboard:
    • System metrics: the data consumption delay and relevant exceptions.
    • Application metrics: the number of received logs and number of delivered logs.
    For more information, see Data transformation dashboard.
  • Log Service provides built-in alert monitoring rules, a built-in action policy, and built-in alert templates for data transformation. You can use these built-in resources as follows:
    • You can enable the alert instances of built-in alert monitoring rules without the need to write SQL statements. For example, you can enable the rule that triggers an alert when delay, exceptions, or failures occur during data transformation. For more information, see Monitoring rules for data transformation.
    • You can specify notification methods and alert templates in the built-in action policy for data transformation.
    • You can specify the content of alert notifications in a built-in alert template for data transformation.

Step 1: Configure an action policy

By default, built-in alert monitoring rules for data transformation are associated with a built-in action policy whose ID is sls.app.etl.builtin. Before you enable the alert instances of built-in alert monitoring rules for data transformation, you must specify one or more notification methods in the action policy.

  1. Log on to the Log Service console.
  2. Go to the Action Policy tab.
    1. In the Projects section, click the name of the project that you want to view.
    2. In the left-side navigation pane, click Alerts.
    3. On the Alert Center page, choose Alert Management > Action Policy.
  3. On the Action Policy tab, find the built-in action policy whose ID is sls.app.etl.builtin and click Edit in the Actions column.
  4. In the Edit Action Policy dialog box, click the Primary Action Policy tab. On the Primary Action Policy tab, set the Request URL parameter in the DingTalk-Custom section to the webhook URL of your DingTalk chatbot. Use the default settings for other parameters and click OK.
    For information about how to obtain the webhook URL of a DingTalk chatbot, see DingTalk-Custom. You can use other alert notification methods based on your business requirements. For more information, see Notification methods.
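
Before you paste a webhook URL into the action policy, it can help to confirm that the chatbot accepts messages. The following Python sketch builds a plain text test message for a DingTalk custom chatbot. It is an illustration only: the function name build_dingtalk_test_request is hypothetical, the URL is a placeholder, and the payload assumes the chatbot accepts the standard text message type. The request is built but not sent unless you uncomment the last line.

```python
import json
import urllib.request


def build_dingtalk_test_request(webhook_url: str, content: str) -> urllib.request.Request:
    """Build (but do not send) a text-message POST for a DingTalk custom chatbot."""
    payload = {"msgtype": "text", "text": {"content": content}}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: replace it with the webhook URL of your own chatbot.
request = build_dingtalk_test_request(
    "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN",
    "Test message before configuring sls.app.etl.builtin",
)

# To actually send the message, uncomment the next line (requires network access).
# print(urllib.request.urlopen(request, timeout=10).read().decode("utf-8"))
```

If the chatbot is configured with security settings such as keywords or signatures, the test message must also satisfy those settings.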

Step 2: Enable an alert instance

Log Service provides built-in alert monitoring rules. You can enable the alert instances of the related alert monitoring rules based on your business requirements.

  1. On the Alert Center page, click Alert Rules/Incidents.
  2. On the Alert Rules/Incidents tab, click SLS Data Transformation.
  3. In the alert monitoring rule list, find the alert monitoring rule that you want to use and click Enable in the Actions column.
    After you enable an alert instance, Log Service monitors all data transformation jobs in real time.
    • To enable multiple alert instances, click Add.
    • If you want to monitor only specific data transformation jobs, click Settings and specify the IDs of the data transformation jobs that you want to monitor.

    For information about the parameters of alert monitoring rules, see Monitoring rules for data transformation.

Related operations

  • Configure allowlists: You can configure allowlists for specific alert monitoring rules. This way, alerts are not triggered by the specified data transformation jobs based on these rules.
  • Add alert instances: You can add an alert instance for an alert monitoring rule. You can also configure the alert instance to monitor specific data transformation jobs.
  • Disable alert instances: If you disable an alert instance, the value in the Status column of the alert instance changes to Not Enabled, and no more alerts are triggered based on the alert instance. The configurations of the alert instance are not deleted. If you want to re-enable the alert instance to monitor data, you do not need to reconfigure the parameters of the alert instance.
  • Pause alert instances: If you pause an alert instance, no alerts are triggered based on the alert instance within a specified period of time.
  • Resume alert instances: You can resume paused alert instances.
  • Delete alert instances: If you delete an alert instance, the value in the Status column of the alert instance changes to Not Created. The configurations of the alert instance, such as the IDs of data transformation jobs, are deleted. If you want to recreate the alert instance to monitor data, you must reconfigure the parameters of the alert instance.
  • Modify alert instances: You can modify the parameters of an alert instance, such as the alert name, the IDs of data transformation jobs that you want to monitor, threshold, action policy, and severity.

Monitoring rules for data transformation

Log Service provides the following built-in monitoring rules for data transformation. For information about how to manage alert monitoring rules, see Related operations.

The following tables describe the functionalities, parameters, and associated dashboard metrics of the built-in monitoring rules that are provided by Log Service for data transformation. The tables also describe the handling methods that are used to clear alerts.

  • Data Transformation Delay Monitor rule
    Item Description
    Rule name Data Transformation Delay Monitor
    Functionality This rule monitors the latency that occurs when data is consumed from shards in data transformation jobs. If the latency during data transformation exceeds the value of the Threshold parameter, an alert is triggered.
    Parameters
    • Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372.

      Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).

    • Threshold: If the latency of a data transformation job exceeds the value of this parameter, an alert is triggered. Default value: 300. Unit: seconds.
    • Action Policy: the action policy that is associated with the current alert monitoring rule. Log Service sends alert notifications to the specified users based on the action policy. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
    • Severity: the severity of an alert.
    • Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends only one alert notification even if the same alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes.
    Associated dashboard Data Transformation Troubleshooting > shard consumption delay (seconds)
    Handling method You can clear triggered alerts based on the following rules:
    1. If the data volume in the source Logstore significantly increases, perform the following operations based on your business requirements:
      • If the value of the Transform speed (lines/s) metric increases and the value of the shard consumption delay (seconds) metric decreases, the data transformation job is automatically scaling up resources due to the increasing data volume in the source Logstore. In this case, wait for 5 minutes and then check whether the latency is less than the specified threshold. If not, proceed to the next step.
      • If the value of the Transform speed (lines/s) metric does not increase or the value of the shard consumption delay (seconds) metric continues to increase, the number of shards in the source Logstore may be insufficient, which limits how far the resources for data transformation can be scaled out. In this case, you must split the shards in the source Logstore. For more information, see Split a shard. After you split the shards, wait for 5 minutes and then check whether the latency is less than the specified threshold. If not, proceed to the next step.
    2. If alerts are triggered based on the Data Transformation Error Monitor rule, you must clear the alerts first. After you clear the alerts, wait for 5 minutes and then check whether the latency is less than the specified threshold. If not, proceed to the next step.
    3. If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
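
The way the Job ID and Threshold parameters of this rule combine can be sketched in Python as follows. This is an illustration, not the implementation used by Log Service: it assumes that the Job ID value is treated as a regular expression that must match the whole job ID, and the function names and the second job ID are hypothetical.

```python
import re


def job_is_monitored(job_id: str, job_id_pattern: str = ".*") -> bool:
    """Return True if the job ID is covered by the Job ID parameter.

    Assumption: the pattern must match the whole job ID, and multiple
    job IDs are joined with vertical bars (|).
    """
    return re.fullmatch(job_id_pattern, job_id) is not None


def delay_alert(job_id: str, delay_seconds: float,
                job_id_pattern: str = ".*", threshold: float = 300) -> bool:
    """Return True if a monitored job exceeds the delay threshold (seconds)."""
    return job_is_monitored(job_id, job_id_pattern) and delay_seconds > threshold


# The first ID is the example from this topic; the second one is made up.
pattern = "dd2de8e7e23f3e42ffbb32fe05710372|0123456789abcdef0123456789abcdef"

print(delay_alert("dd2de8e7e23f3e42ffbb32fe05710372", 420, pattern))  # monitored job, delay above 300
print(delay_alert("ffffffffffffffffffffffffffffffff", 420, pattern))  # job not in the pattern
```

With the default pattern .*, every job ID matches, which corresponds to monitoring all data transformation jobs.
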
  • Data Transformation Error Monitor rule
    Item Description
    Rule name Data Transformation Error Monitor
    Functionality This rule monitors exceptions in data transformation jobs. If an exception occurs during data transformation, an alert is triggered.
    Parameters
    • Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372.

      Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).

    • Action Policy: the action policy that is associated with the current alert monitoring rule. Log Service sends alert notifications to the specified users based on the action policy. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
    • Severity: the severity of an alert.
    • Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends only one alert notification even if the same alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes.
    Associated dashboard Data Transformation Troubleshooting > Exception detail
    Handling method Fix exceptions based on the related error messages.
    • If the error message contains Unauthorized, InvalidAccessKeyId, or SignatureNotMatch, the data transformation job does not have the required permissions to read data from the source Logstore or write data to the destination Logstore. For more information, see Authorization overview.
    • If the error message contains ProjectNotExist or LogStoreNotExist, the related project or Logstore of the data transformation job does not exist. In this case, log on to the Log Service console to identify and fix the error.
    • If the error message contains SettingError, the configurations of the data transformation job are invalid. For example, if the specified parameter in a function is invalid or the configuration of an external Alibaba Cloud resource such as an Object Storage Service (OSS) bucket or ApsaraDB RDS for MySQL instance is invalid, an error occurs. For more information, see Function overview.
    • If the error message contains TransformError, the raw data in the source Logstore does not meet the logic of the current data transformation job. This error may occur when new types of data are imported to the source Logstore. In this case, locate the raw data based on the error message, update the data transformation job, and then try again. For more information, see Manage a data transformation job.
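
The keyword checks above can be collected into a small lookup for triage scripts. The following Python sketch is an illustration only: the ERROR_HINTS table and the suggest_fix function are hypothetical names, and the sample error message is made up. The keywords and causes are taken from the list above.

```python
# Each entry maps the keywords listed above to the cause described in this topic.
ERROR_HINTS = [
    (("Unauthorized", "InvalidAccessKeyId", "SignatureNotMatch"),
     "The job lacks permissions to read the source Logstore or write to the destination Logstore."),
    (("ProjectNotExist", "LogStoreNotExist"),
     "The related project or Logstore does not exist."),
    (("SettingError",),
     "The configurations of the data transformation job are invalid."),
    (("TransformError",),
     "The raw data in the source Logstore does not meet the logic of the job."),
]


def suggest_fix(error_message: str) -> str:
    """Return the first matching hint for an error message."""
    for keywords, hint in ERROR_HINTS:
        if any(keyword in error_message for keyword in keywords):
            return hint
    return "No known keyword found. Read the full error message."


# Hypothetical error message for illustration.
print(suggest_fix("ErrorCode: LogStoreNotExist, logstore my-target does not exist"))
```
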
  • Data Transformation Flow (Absolute Value) Monitor rule
    Item Description
    Rule name Data Transformation Flow (Absolute Value) Monitor
    Functionality This rule monitors the average number of logs that are transformed by data transformation jobs within 5 minutes. If the average number of transformed logs is less than the value of the Threshold parameter, an alert is triggered.
    Parameters
    • Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372.

      Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).

    • Threshold: If the average number of transformed logs is less than the value of this parameter, an alert is triggered. Default value: 40000. Unit: lines/s.
    • Action Policy: the action policy that is associated with the current alert monitoring rule. Log Service sends alert notifications to the specified users based on the action policy. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
    • Severity: the severity of an alert.
    • Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends only one alert notification even if the same alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes.
    Associated dashboard Data Transformation Troubleshooting > Transform speed (lines/s)
    Handling method You can clear triggered alerts based on the following rules:
    1. If the Transform speed (lines/s) metric rises and falls together with the data volume in the source Logstore, the number of transformed logs is limited by the data volume in the source Logstore. If not, proceed to the next step.
    2. If alerts are triggered based on the Data Transformation Delay Monitor rule, you must clear the alerts first. After you clear the alerts, wait for 15 minutes. If the latency is less than 1 minute but the trend in the amount of the transformed data is inconsistent with the increase or decrease trend in the data volume in the source Logstore, proceed to the next step.
    3. If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
  • Data Transformation Flow (Daily Compare) Monitor rule
    Item Description
    Rule name Data Transformation Flow (Daily Compare) Monitor
    Functionality This rule monitors the increase rate and decrease rate of the transformed data in data transformation jobs within 5 minutes compared with the same period of the previous day. If the increase rate is greater than the value of the Asc Threshold parameter or the decrease rate is greater than the value of the Desc Threshold parameter, an alert is triggered.
    Parameters
    • Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372.

      Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).

    • Asc Threshold: If the daily increase rate of transformed data is greater than the value of this parameter, an alert is triggered. Default value: 40%.
    • Desc Threshold: If the daily decrease rate of transformed data is greater than the value of this parameter, an alert is triggered. Default value: 20%.
    • Action Policy: the action policy that is used to send alert notifications. The action policy contains notification methods and alert templates. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
    • Severity: the severity of an alert.
    • Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends only one alert notification even if the same alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes.
    Associated dashboard Data Transformation Troubleshooting > Transform speed (lines/s)
    Handling method You can clear triggered alerts based on the following rules:
    1. If the Transform speed (lines/s) metric rises and falls together with the data volume in the source Logstore, the number of transformed logs is limited by the data volume in the source Logstore. If not, proceed to the next step.
    2. If alerts are triggered based on the Data Transformation Delay Monitor rule, you must clear the alerts first. After you clear the alerts, wait for 15 minutes. If the latency is less than 1 minute but the trend in the amount of the transformed data is inconsistent with the increase or decrease trend in the data volume in the source Logstore, proceed to the next step.
    3. If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
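
The comparison that this rule performs can be sketched in Python as follows. This is an illustration under stated assumptions, not the query that Log Service runs: the function name daily_compare_alert is hypothetical, the rates are computed relative to the value from the same period of the previous day, and the thresholds are written as fractions (0.40 for 40%, 0.20 for 20%).

```python
from typing import Optional


def daily_compare_alert(current: float, previous_day: float,
                        asc_threshold: float = 0.40,
                        desc_threshold: float = 0.20) -> Optional[str]:
    """Compare the transformed data volume with the same period of the previous day.

    Returns "increase" or "decrease" if the corresponding threshold is
    exceeded, otherwise None.
    """
    if previous_day <= 0:
        raise ValueError("previous_day must be positive to compute a rate")
    change = (current - previous_day) / previous_day
    if change > asc_threshold:
        return "increase"
    if -change > desc_threshold:
        return "decrease"
    return None


print(daily_compare_alert(150, 100))  # 50% increase, above the 40% Asc Threshold
print(daily_compare_alert(75, 100))   # 25% decrease, above the 20% Desc Threshold
print(daily_compare_alert(110, 100))  # 10% increase, within both thresholds
```

Because the default Desc Threshold is lower than the default Asc Threshold, a drop in transformed data triggers an alert sooner than an equally large rise.
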
  • Data Transformation Failed Lines Monitor rule
    Item Description
    Rule name Data Transformation Failed Lines Monitor
    Functionality This rule monitors the number of logs that fail to be transformed by data transformation jobs within 15 minutes. If the number of logs that fail to be transformed during data transformation exceeds the value of the Threshold parameter, an alert is triggered.
    Parameters
    • Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372.

      Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).

    • Threshold: If the number of logs that fail to be transformed exceeds the value of this parameter, an alert is triggered. Default value: 10.
    • Action Policy: the action policy that is used to send alert notifications. The action policy contains notification methods and alert templates. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
    • Severity: the severity of an alert.
    • Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends only one alert notification even if the same alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes.
    Associated dashboard Data Transformation Troubleshooting > Total logs failed
    Handling method You can clear triggered alerts based on the following rules:
    1. Clear the alerts by using the handling method that is described for the Data Transformation Error Monitor rule. If no error message is reported, proceed to the next step.
    2. If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
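
The Repeat Interval parameter that all of the preceding rules share can be sketched in Python as follows. This is an illustration of the suppression behavior described in this topic, not the implementation used by Log Service: the names parse_interval and RepeatSuppressor are hypothetical, and only the d, h, and m suffixes from the examples are handled.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_interval(text: str) -> timedelta:
    """Parse values such as 1d, 2h, or 3m into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhm])", text)
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


class RepeatSuppressor:
    """Allow at most one notification per alert within each repeat interval."""

    def __init__(self, repeat_interval: str):
        self.interval = parse_interval(repeat_interval)
        self.last_sent = {}

    def should_notify(self, alert_key: str, now: datetime) -> bool:
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.interval:
            return False  # still inside the interval, suppress the repeat
        self.last_sent[alert_key] = now
        return True


suppressor = RepeatSuppressor("2h")
start = datetime(2024, 1, 1, 0, 0)
print(suppressor.should_notify("delay:job-a", start))                       # first alert
print(suppressor.should_notify("delay:job-a", start + timedelta(hours=1)))  # repeat within 2 hours
print(suppressor.should_notify("delay:job-a", start + timedelta(hours=2)))  # interval has elapsed
```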