After you enable monitoring for data transformation jobs, Log Service sends alert notifications if exceptions occur during data transformation. This helps you handle exceptions at the earliest opportunity. This topic describes how to enable monitoring for data transformation jobs.
Prerequisites
Background information
- After you create a data transformation job, Log Service automatically creates a dashboard named Data Transformation Troubleshooting for the data transformation job. We recommend that you take note of the following metrics on the Data Transformation Troubleshooting dashboard:
- System metrics: the data consumption delay and relevant exceptions.
- Application metrics: the number of received logs and number of delivered logs.
- Log Service provides built-in alert monitoring rules, a built-in action policy, and built-in alert templates for data transformation. You can use these built-in resources as follows:
- You can enable the alert instances of built-in alert monitoring rules without the need to write SQL statements. For example, you can enable the rule that triggers an alert when delay, exceptions, or failures occur during data transformation. For more information, see Monitoring rules for data transformation.
- You can specify notification methods and alert templates in the built-in action policy for data transformation.
- You can specify the content of alert notifications in a built-in alert template for data transformation.
Step 1: Configure an action policy
By default, built-in alert monitoring rules for data transformation are associated with a built-in action policy whose ID is sls.app.etl.builtin. Before you enable the alert instances of built-in alert monitoring rules for data transformation, you must specify one or more notification methods in the action policy.
Step 2: Enable an alert instance
Log Service provides built-in alert monitoring rules. You can enable the alert instances of the related alert monitoring rules based on your business requirements.
Related operations
Operation | Description |
---|---|
Configure allowlists | You can configure an allowlist for an alert monitoring rule. The rule does not trigger alerts for the data transformation jobs in the allowlist. |
Add alert instances | You can add an alert instance for an alert monitoring rule. You can also configure the alert instance to monitor specific data transformation jobs. |
Disable alert instances | If you disable an alert instance, the value in the Status column of the alert instance changes to Not Enabled, and no more alerts are triggered based on the alert instance. The configurations of the alert instance are not deleted. If you want to re-enable the alert instance to monitor data, you do not need to reconfigure the parameters of the alert instance. |
Pause alert instances | If you pause an alert instance, no alerts are triggered based on the alert instance within a specified period of time. |
Resume alert instances | You can resume paused alert instances. |
Delete alert instances | If you delete an alert instance, the value in the Status column of the alert instance changes to Not Created. The configurations of the alert instance, such as the IDs of data transformation jobs, are deleted. If you want to recreate the alert instance to monitor data, you must reconfigure the parameters of the alert instance. |
Modify alert instances | You can modify the parameters of an alert instance, such as the alert name, the IDs of data transformation jobs that you want to monitor, threshold, action policy, and severity. |
Monitoring rules for data transformation
The following tables describe the functionalities, parameters, and associated dashboard metrics of the built-in monitoring rules that Log Service provides for data transformation. The tables also describe the handling methods that are used to clear alerts.
- Data Transformation Delay Monitor rule
Rule name: Data Transformation Delay Monitor
Functionality: This rule monitors the latency that occurs when data is consumed from shards in data transformation jobs. If the latency during data transformation exceeds the value of the Threshold parameter, an alert is triggered.
Parameters:
- Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372. Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).
- Threshold: If the latency of a data transformation job exceeds the value of this parameter, an alert is triggered. Default value: 300. Unit: seconds.
- Action Policy: the action policy that is associated with the current alert monitoring rule. Log Service sends alert notifications to the specified users based on the action policy. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
- Severity: the severity of an alert.
- Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends at most one notification, even if the alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes, respectively.
Associated dashboard: Data Transformation Troubleshooting > shard consumption delay (seconds)
Handling method: You can clear triggered alerts based on the following rules:
- If the data volume in the source Logstore significantly increases, perform the following operations based on your business requirements:
- If the value of the Transform speed (lines/s) metric increases and the value of the shard consumption delay (seconds) metric decreases, the data transformation job is automatically scaling up resources due to the increasing data volume in the source Logstore. In this case, wait for 5 minutes and then check whether the latency is less than the specified threshold. If not, proceed to the next step.
- If the value of the Transform speed (lines/s) metric does not increase or the value of the shard consumption delay (seconds) metric continues to increase, the source Logstore may have too few shards, which limits how far the resources for data transformation can scale out. In this case, you must split the shards in the source Logstore. For more information, see Split a shard. After you split the shards, wait for 5 minutes and then check whether the latency is less than the specified threshold. If not, proceed to the next step.
- If alerts are triggered based on the Data Transformation Error Monitor rule, you must clear the alerts first. After you clear the alerts, wait for 5 minutes and then check whether the latency is less than the specified threshold. If not, proceed to the next step.
- If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
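The first handling step for the Data Transformation Delay Monitor rule compares how two dashboard metrics move. The following Python sketch illustrates that comparison under stated assumptions: the function name, the return labels, and the idea of comparing two samples taken a few minutes apart are hypothetical and are not part of Log Service.

```python
def diagnose_delay(speed_before: float, speed_now: float,
                   delay_before: float, delay_now: float) -> str:
    """Classify a delay alert from two samples of the dashboard metrics.

    speed_*: Transform speed (lines/s); delay_*: shard consumption delay (seconds).
    The labels returned here are illustrative names only.
    """
    speed_up = speed_now > speed_before
    if speed_up and delay_now < delay_before:
        # The job appears to be scaling up: wait 5 minutes, then re-check latency.
        return "scaling_up"
    if not speed_up or delay_now > delay_before:
        # Speed is flat or delay keeps growing: consider splitting shards.
        return "split_shards"
    # Speed rises but delay is unchanged: the text above does not cover this case.
    return "inconclusive"


print(diagnose_delay(1000, 1500, 400, 350))
print(diagnose_delay(1000, 1000, 400, 450))
```

If the result is still above the threshold after the 5-minute wait, continue with the remaining steps in the list above.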
- Data Transformation Error Monitor rule
Rule name: Data Transformation Error Monitor
Functionality: This rule monitors exceptions in data transformation jobs. If an exception occurs during data transformation, an alert is triggered.
Parameters:
- Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372. Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).
- Action Policy: the action policy that is associated with the current alert monitoring rule. Log Service sends alert notifications to the specified users based on the action policy. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
- Severity: the severity of an alert.
- Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends at most one notification, even if the alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes, respectively.
Associated dashboard: Data Transformation Troubleshooting > Exception detail
Handling method: Fix exceptions based on the related error messages.
- If the error message contains Unauthorized, InvalidAccessKeyId, or SignatureNotMatch, the data transformation job does not have the required permissions to read data from the source Logstore or write data to the destination Logstore. For more information, see Authorization overview.
- If the error message contains ProjectNotExist or LogStoreNotExist, the related project or Logstore of the data transformation job does not exist. In this case, log on to the Log Service console to identify and fix the error.
- If the error message contains SettingError, the configurations of the data transformation job are invalid. For example, if the specified parameter in a function is invalid or the configuration of an external Alibaba Cloud resource such as an Object Storage Service (OSS) bucket or ApsaraDB RDS for MySQL instance is invalid, an error occurs. For more information, see Function overview.
- If the error message contains TransformError, the raw data in the source Logstore does not meet the logic of the current data transformation job. This error may occur when new types of data are imported to the source Logstore. In this case, locate the raw data based on the error message, update the data transformation job, and then try again. For more information, see Manage a data transformation job.
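The keyword-to-cause mapping above can be expressed as a small lookup. The following Python sketch is only an illustration: the function name, the category labels, and the sample error messages are hypothetical, and only the keywords themselves come from the list above.

```python
# Keywords from the handling method above, grouped by the cause they indicate.
ERROR_CATEGORIES = [
    (("Unauthorized", "InvalidAccessKeyId", "SignatureNotMatch"), "permissions"),
    (("ProjectNotExist", "LogStoreNotExist"), "missing_resource"),
    (("SettingError",), "invalid_configuration"),
    (("TransformError",), "data_mismatch"),
]


def classify_error(message: str) -> str:
    """Return an illustrative category label for an error message."""
    for keywords, category in ERROR_CATEGORIES:
        if any(keyword in message for keyword in keywords):
            return category
    return "unknown"


# Hypothetical messages, used only to exercise the lookup.
print(classify_error("ErrorCode: SignatureNotMatch, request rejected"))
print(classify_error("TransformError: field 'status' is not a number"))
```

A message that matches none of the keywords falls through to "unknown", which corresponds to investigating the Exception detail chart manually.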
- Data Transformation Flow (Absolute Value) Monitor rule
Rule name: Data Transformation Flow (Absolute Value) Monitor
Functionality: This rule monitors the average number of logs that are transformed by data transformation jobs within 5 minutes. If the average number of transformed logs is less than the value of the Threshold parameter, an alert is triggered.
Parameters:
- Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372. Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).
- Threshold: If the average number of transformed logs is less than the value of this parameter, an alert is triggered. Default value: 40000. Unit: lines/s.
- Action Policy: the action policy that is associated with the current alert monitoring rule. Log Service sends alert notifications to the specified users based on the action policy. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
- Severity: the severity of an alert.
- Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends at most one notification, even if the alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes, respectively.
Associated dashboard: Data Transformation Troubleshooting > Transform speed (lines/s)
Handling method: You can clear triggered alerts based on the following rules:
- If the value change trend in the Transform speed (lines/s) metric is consistent with the increase or decrease trend in the data volume in the source Logstore, the number of transformed logs is limited by the data volume in the source Logstore. If not, proceed to the next step.
- If alerts are triggered based on the Data Transformation Delay Monitor rule, you must clear the alerts first. After you clear the alerts, wait for 15 minutes. If the latency is less than 1 minute but the trend in the amount of the transformed data is inconsistent with the increase or decrease trend in the data volume in the source Logstore, proceed to the next step.
- If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
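The Repeat Interval parameter used by this rule and the other rules can be illustrated with a short simulation. The following Python sketch assumes that the interval is measured from the last notification that was sent, and that only the d, h, and m suffixes shown in the examples are accepted. Both assumptions, and the function names, are hypothetical and are not confirmed behavior of Log Service.

```python
import re

# Seconds per unit for the suffixes that appear in the examples (1d, 2h, 3m).
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60}


def interval_seconds(text: str) -> int:
    """Convert a value such as '2h' to seconds."""
    match = re.fullmatch(r"(\d+)([dhm])", text)
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def notifications_sent(alert_times: list[int], repeat_interval: str) -> list[int]:
    """Return the alert timestamps (in seconds) that would produce a notification."""
    window = interval_seconds(repeat_interval)
    sent: list[int] = []
    for t in sorted(alert_times):
        # Suppress alerts that fall inside the window opened by the last notification.
        if not sent or t - sent[-1] >= window:
            sent.append(t)
    return sent


# Five repeated alerts within a few minutes, with Repeat Interval set to 3m.
print(notifications_sent([0, 60, 120, 180, 200], "3m"))
```

In this simulation only the alerts at 0 and 180 seconds produce notifications; the others are suppressed as repeats.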
- Data Transformation Flow (Daily Compare) Monitor rule
Rule name: Data Transformation Flow (Daily Compare) Monitor
Functionality: This rule monitors the increase rate and decrease rate of the transformed data in data transformation jobs within 5 minutes compared with the same period of the previous day. If the increase rate is greater than the value of the Asc Threshold parameter or the decrease rate is greater than the value of the Desc Threshold parameter, an alert is triggered.
Parameters:
- Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372. Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).
- Asc Threshold: If the daily increase rate of transformed data is greater than the value of this parameter, an alert is triggered. Default value: 40%.
- Desc Threshold: If the daily decrease rate of transformed data is greater than the value of this parameter, an alert is triggered. Default value: 20%.
- Action Policy: the action policy that is used to send alert notifications. The action policy contains notification methods and alert templates. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
- Severity: the severity of an alert.
- Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends at most one notification, even if the alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes, respectively.
Associated dashboard: Data Transformation Troubleshooting > Transform speed (lines/s)
Handling method: You can clear triggered alerts based on the following rules:
- If the value change trend in the Transform speed (lines/s) metric is consistent with the increase or decrease trend in the data volume in the source Logstore, the number of transformed logs is limited by the data volume in the source Logstore. If not, proceed to the next step.
- If alerts are triggered based on the Data Transformation Delay Monitor rule, you must clear the alerts first. After you clear the alerts, wait for 15 minutes. If the latency is less than 1 minute but the trend in the amount of the transformed data is inconsistent with the increase or decrease trend in the data volume in the source Logstore, proceed to the next step.
- If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
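The daily comparison performed by this rule can be sketched as a relative change against the previous day. The following Python example assumes that the rate is calculated as (today - yesterday) / yesterday over the same 5-minute period. The exact formula that Log Service uses is not stated in this topic, and the function name and return labels are hypothetical.

```python
from typing import Optional


def daily_compare_alert(today: float, yesterday: float,
                        asc_threshold: float = 0.40,
                        desc_threshold: float = 0.20) -> Optional[str]:
    """Return 'increase', 'decrease', or None for one 5-minute comparison.

    today / yesterday: transformed log counts for the same period on each day.
    The defaults mirror the default Asc Threshold (40%) and Desc Threshold (20%).
    """
    if yesterday <= 0:
        raise ValueError("the previous day's value must be positive")
    change = (today - yesterday) / yesterday
    if change > asc_threshold:
        return "increase"
    if -change > desc_threshold:
        return "decrease"
    return None


print(daily_compare_alert(1500, 1000))  # 50% increase
print(daily_compare_alert(750, 1000))   # 25% decrease
print(daily_compare_alert(1100, 1000))  # 10% increase, within both thresholds
```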
- Data Transformation Failed Lines Monitor rule
Rule name: Data Transformation Failed Lines Monitor
Functionality: This rule monitors the number of logs that fail to be transformed by data transformation jobs within 15 minutes. If the number of logs that fail to be transformed during data transformation exceeds the value of the Threshold parameter, an alert is triggered.
Parameters:
- Job ID: the ID of the data transformation job that you want to monitor. Example: dd2de8e7e23f3e42ffbb32fe05710372. Default value: .*. This value indicates that all data transformation jobs are monitored. Separate multiple job IDs with vertical bars (|).
- Threshold: If the number of logs that fail to be transformed exceeds the value of this parameter, an alert is triggered. Default value: 10.
- Action Policy: the action policy that is used to send alert notifications. The action policy contains notification methods and alert templates. Default value: sls.app.etl.builtin. This value indicates that alert notifications are sent by using the webhook URL of a DingTalk chatbot.
- Severity: the severity of an alert.
- Repeat Interval: the minimum interval between alert notifications for repeated alerts. Within each interval, Log Service sends at most one notification, even if the alert is triggered repeatedly. For example, if you set the Repeat Interval parameter to 1d, 2h, or 3m, Log Service sends only one alert notification within 1 day, 2 hours, or 3 minutes, respectively.
Associated dashboard: Data Transformation Troubleshooting > Total logs failed
Handling method: You can clear triggered alerts based on the following rules:
- Clear the alerts by using the method that is provided by the Data Transformation Error Monitor rule. If no error message is reported, proceed to the next step.
- If the alerts persist, prepare the information about the related project, Logstore, and data transformation job ID, and then submit a ticket to contact Alibaba Cloud technical support.
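Every rule above accepts a Job ID parameter whose default value is .* and whose multiple values are separated by vertical bars (|). This suggests, but this topic does not confirm, that the value is treated as a regular expression matched against the whole job ID. The following Python sketch illustrates that assumption; the function names and the second job ID are hypothetical.

```python
import re
from typing import Optional, Sequence


def job_id_pattern(job_ids: Optional[Sequence[str]] = None) -> str:
    """Build a Job ID value: '.*' for all jobs, or IDs joined with '|'."""
    if not job_ids:
        return ".*"
    return "|".join(job_ids)


def is_monitored(pattern: str, job_id: str) -> bool:
    """Check a job ID against the pattern, assuming a full regular-expression match."""
    return re.fullmatch(pattern, job_id) is not None


# The first ID is the example from this topic; the second one is made up.
pattern = job_id_pattern([
    "dd2de8e7e23f3e42ffbb32fe05710372",
    "0123456789abcdef0123456789abcdef",
])
print(pattern)
print(is_monitored(pattern, "0123456789abcdef0123456789abcdef"))
print(is_monitored(job_id_pattern(), "dd2de8e7e23f3e42ffbb32fe05710372"))
```

Job IDs in the example are hexadecimal strings, so joining them without escaping is safe in this sketch; IDs that contain regular-expression metacharacters would need escaping.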