The intelligent monitoring system monitors the status of nodes. When the conditions of the global rules, custom alert rules, or intelligent baselines that you configured are met, the system sends alert notifications by using the notification methods that you specified, such as emails, text messages, phone calls, DingTalk group messages, and webhooks. This way, you can identify and handle exceptions at the earliest opportunity.

Background information

In addition to the basic monitoring operations that are supported by a traditional monitoring system, DataWorks provides the following capabilities:
  • Accurately identifies the nodes that you want to monitor based on your business requirements.

    DataWorks runs a large number of nodes, and the dependencies between the nodes are complex. This makes it difficult to find and monitor all the ancestor nodes of a node, even if you know which node is the most important. If you monitor all nodes instead, so many alerts are triggered that you can miss the alerts of the nodes that matter, which makes monitoring ineffective.

  • Provides different alerting methods for monitored nodes.

    For example, some monitoring tasks require the monitored nodes to run for more than 1 hour before alerts are triggered. Other monitoring tasks require the monitored nodes to run for more than 2 hours before alerts are triggered. A long period of time is required to configure alert rules for each node, and it is difficult to determine the alert threshold for each node.

  • Provides a switch to disable alerting.

    The intelligent monitoring system provides a switch to disable alerting.

The intelligent monitoring system provides comprehensive monitoring and alerting logic. You only need to specify the names of the important nodes in your business. The intelligent monitoring system then automatically monitors the entire process of your nodes and generates standard alert rules for them. To create custom alert rules, you can configure the basic alert rule settings.

The full-path monitoring feature of the intelligent monitoring system ensures the overall data output of the important business within Alibaba Group. The intelligent monitoring system allows you to analyze ancestor and descendant node paths to identify risks and provide O&M information for business departments. The features that are provided by the intelligent monitoring system ensure the high stability of business within Alibaba Group.

Limits

The following features are supported only in DataWorks Standard Edition and more advanced editions: Baseline Instance, Baseline Management, and Event Management.

Considerations

If you want to receive alert notifications by text messages or phone calls as a RAM user, you must log on to the RAM console by using your Alibaba Cloud account. Then, enter the required information about this RAM user such as the phone number and email address. For more information about how to enter the personal information of a RAM user, see Modify the basic information about a RAM user.

Monitoring methods

The intelligent monitoring system provides the following monitoring methods: intelligent baseline and custom alert rules. This section describes the monitoring principles and configuration rules of the two monitoring methods.
  • Intelligent baseline
    • Monitoring scope

      A baseline is a group of nodes. You can monitor nodes by baseline.

      After you configure a baseline as a monitored object, all nodes in the baseline and the ancestor nodes of the nodes are monitored. After you configure a monitored baseline, the intelligent monitoring system does not monitor all nodes. A descendant node of a monitored node is monitored only if the descendant node is added to the monitored baseline. If the descendant node is not added to the monitored baseline, the intelligent monitoring system does not report an alert even if the descendant node fails.

      In the preceding figure, DataWorks has six nodes, but only Nodes D and E are added to the monitored baseline. The intelligent monitoring system monitors Nodes D and E and all the ancestor nodes of these two nodes. If an exception such as an error or slowdown occurs on Node A, B, D, or E, the intelligent monitoring system detects the exception. However, the intelligent monitoring system does not monitor Node C or F.
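
The scope rule above can be sketched as a walk over ancestor nodes. This is only an illustration and not DataWorks code: the dependency edges in `EDGES` are assumed for the example because the figure is not reproduced here, and `monitored_nodes` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical dependency edges (parent -> children). The real edges come
# from the figure, so treat these as an assumption for illustration only.
EDGES = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": [],
    "D": [],
    "E": ["F"],
    "F": [],
}

def monitored_nodes(edges, baseline_nodes):
    """Return the baseline nodes plus all of their ancestor nodes."""
    # Invert the graph so that we can walk from a node to its parents.
    parents = {node: set() for node in edges}
    for parent, children in edges.items():
        for child in children:
            parents[child].add(parent)

    monitored = set(baseline_nodes)
    queue = deque(baseline_nodes)
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in monitored:
                monitored.add(parent)
                queue.append(parent)
    return monitored

scope = monitored_nodes(EDGES, ["D", "E"])
print(sorted(scope))  # ['A', 'B', 'D', 'E']
```

Under these assumed edges, Nodes C and F are outside the scope because neither is in the baseline and neither is an ancestor of Node D or E.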

    • Node capturing
      If an exception occurs on a node that is within the specified monitoring scope, the intelligent monitoring system generates an alert event and reports an alert based on the analysis of the alert event. Two types of node exceptions are monitored. You can go to the Event Management page to view the node exceptions. For more information, see Manage events.
      • Error: indicates that a node fails to run.
      • Slow: indicates that the running time of a node is significantly longer than the average running time of the node in the previous periods.
      Note If a node times out and then encounters an error, two events are generated.
    • Alerting time judgment
      Margin is the maximum period of time that a node can wait before it must start to run without delaying the baseline. The latest start time of a node is calculated by using the following formula: Baseline time - Average running time of the node and of the descendant nodes on the path to the baseline.

      In the preceding figure, to make sure that the baseline time of Baseline A is 05:00, you must set the latest start time of Node E to 04:10. The time 04:10 is obtained by subtracting the average running time of Node F and the average running time of Node E from the baseline time 05:00. In this example, the average running time of Node F is 20 minutes, and the average running time of Node E is 30 minutes. This time is the latest completion time of Node B in Baseline A.

      In the preceding figure, to make sure that the baseline time of Baseline B is 06:00, you must configure the latest completion time of Node B to 04:00. The time 04:00, which is earlier than 04:10, is obtained by subtracting the average running time of Node D from the baseline time 06:00. In this example, the average running time of Node D is 2 hours. To meet the baseline time of both Baseline A and Baseline B, you must set the latest completion time of Node B to 04:00.

      The latest completion time of Node A is 02:00, which is the average running time of Node B subtracted from 04:00. In this example, the average running time of Node B is 2 hours. The latest start time of Node A is 01:50, which is the average running time of Node A subtracted from 02:00. In this example, the average running time of Node A is 10 minutes. If Node A has not started to run by 01:50, Baseline A may enter the Overtime state. If Node A fails at 01:00, the margin of Node A is 50 minutes, which is the difference between 01:00 and 01:50. As shown in this example, margin indicates how much time remains to handle a node exception, and therefore how urgent the exception is.
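
The calculation in this example can be reproduced with a short backward pass over the dependency graph. The sketch below is illustrative only: the edges in `CHILDREN` are inferred from the example (A before B, B before D and E, E before F), the calendar date is arbitrary, and the function names are hypothetical rather than part of DataWorks.

```python
from datetime import datetime, timedelta

# Average running time per node, in minutes (values from the example above).
AVG_MINUTES = {"A": 10, "B": 120, "D": 120, "E": 30, "F": 20}

# Dependency edges inferred from the example (parent -> children).
CHILDREN = {"A": ["B"], "B": ["D", "E"], "D": [], "E": ["F"], "F": []}

DAY = datetime(2024, 1, 1)  # arbitrary date; only the time of day matters
BASELINE_TIME = {
    "F": DAY.replace(hour=5),  # Baseline A must finish by 05:00
    "D": DAY.replace(hour=6),  # Baseline B must finish by 06:00
}

def latest_completion(node):
    """Latest time a node may finish without delaying any baseline."""
    candidates = []
    if node in BASELINE_TIME:
        candidates.append(BASELINE_TIME[node])
    # A node must finish before the latest start time of each child.
    for child in CHILDREN[node]:
        candidates.append(latest_start(child))
    return min(candidates)

def latest_start(node):
    """Latest completion time minus the node's average running time."""
    return latest_completion(node) - timedelta(minutes=AVG_MINUTES[node])

def margin(node, now):
    """Time left between now and the latest start time of the node."""
    return latest_start(node) - now

print(latest_start("E").strftime("%H:%M"))       # 04:10
print(latest_completion("B").strftime("%H:%M"))  # 04:00
print(latest_start("A").strftime("%H:%M"))       # 01:50
print(margin("A", DAY.replace(hour=1)))          # 0:50:00
```

Taking the minimum over all children is what makes the stricter requirement of Baseline B (04:00) override the requirement of Baseline A (04:10) for Node B.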

    • Baseline alerting

      The baseline alerting feature provides alerts for baselines for which baseline monitoring is enabled. You must set the Margin Threshold and Committed Time parameters for each baseline. If baseline alerting is enabled and the baseline completion time that is estimated by the intelligent monitoring system exceeds the alert margin, the specified alert recipient is notified three times at intervals of 30 minutes.
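
As a rough illustration of this behavior, the following sketch decides whether a baseline alert is due and lists the three notification times. It rests on an assumption that the text above does not state: the alert margin is treated as the committed time minus the margin threshold. The function name and parameters are hypothetical and do not correspond to a DataWorks API.

```python
from datetime import datetime, timedelta

def baseline_alert_times(committed_time, margin_threshold, estimated_completion,
                         now, repeats=3, interval=timedelta(minutes=30)):
    """Return the times at which the recipient would be notified."""
    # Assumption: alert margin = committed time - margin threshold.
    alert_deadline = committed_time - margin_threshold
    if estimated_completion <= alert_deadline:
        return []  # the baseline is still expected to finish early enough
    # Three notifications at 30-minute intervals, starting at detection time.
    return [now + i * interval for i in range(repeats)]

day = datetime(2024, 1, 1)  # arbitrary date for the example
times = baseline_alert_times(
    committed_time=day.replace(hour=5),
    margin_threshold=timedelta(minutes=30),
    estimated_completion=day.replace(hour=4, minute=45),
    now=day.replace(hour=2, minute=10),
)
print([t.strftime("%H:%M") for t in times])  # ['02:10', '02:40', '03:10']
```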

    • Notification method

      By default, baseline alerts are sent to the baseline owner. You can change the notification method based on your business requirements. To perform the change, go to the Rule Management page, find Global Baseline Alert Rule, and then click View Details in the Actions column. For more information, see Configure alert details.

    • Gantt chart
      The Gantt chart feature shows the key path of a node. The Gantt chart feature is provided by the baseline instances of the intelligent monitoring system.
  • Custom alert rules
    To create a trigger for a custom alert, configure the following parameters based on your business requirements:
    • Object: You can specify nodes, baselines, workspaces, and workflows as objects.
    • Trigger Condition: Valid values are Completed, Uncompleted, Error, Uncompleted in Cycle, Overtime, and The error persists after the node automatically reruns.
    • Notification Method: Valid values are SMS, Email, and Phone.
    • Maximum Alerts: the maximum number of times that alerts can be reported. If the maximum number is exceeded, no alerts are reported.
    • Minimum Alert Interval: the time interval at which DataWorks reports alerts.
    • Quiet Hours: the period of time during which no alerts are reported.
    • Recipient: You can set this parameter to the node owner or another recipient.
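
To show how Maximum Alerts, Minimum Alert Interval, and Quiet Hours interact, the sketch below models them in memory. `AlertRule` and `should_send` are hypothetical names invented for this illustration and are not the DataWorks API. The order in which the three checks are applied is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional

@dataclass
class AlertRule:
    """Hypothetical in-memory model of a custom alert rule."""
    max_alerts: int          # Maximum Alerts
    min_interval: timedelta  # Minimum Alert Interval
    quiet_start: time        # start of Quiet Hours
    quiet_end: time          # end of Quiet Hours

def in_quiet_hours(rule: AlertRule, now: datetime) -> bool:
    t = now.time()
    if rule.quiet_start <= rule.quiet_end:
        return rule.quiet_start <= t < rule.quiet_end
    # The quiet window wraps past midnight, for example 22:00 to 06:00.
    return t >= rule.quiet_start or t < rule.quiet_end

def should_send(rule: AlertRule, now: datetime, sent_count: int,
                last_sent: Optional[datetime]) -> bool:
    if sent_count >= rule.max_alerts:
        return False  # maximum number of alerts already reported
    if in_quiet_hours(rule, now):
        return False  # no alerts are reported during quiet hours
    if last_sent is not None and now - last_sent < rule.min_interval:
        return False  # too soon after the previous alert
    return True

rule = AlertRule(max_alerts=3, min_interval=timedelta(minutes=30),
                 quiet_start=time(0, 0), quiet_end=time(8, 0))
day = datetime(2024, 1, 1)  # arbitrary date for the example
print(should_send(rule, day.replace(hour=9), 0, None))                            # True
print(should_send(rule, day.replace(hour=9, minute=10), 1, day.replace(hour=9)))  # False
print(should_send(rule, day.replace(hour=3), 0, None))                            # False
```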
    The following information shows the trigger conditions for custom alert rules:
    • Completed
      You can configure an alert rule for nodes, baselines, and workflows. If all nodes of the specified object are completed within the specified time, an alert is reported.
      Note
      • If you configure an alert rule for a baseline, an alert is reported when all nodes of the baseline are completed.
      • If you configure an alert rule for a workflow, an alert is reported when all nodes of the workflow are completed.
      • If you configure an alert rule for multiple nodes, an alert is reported when all the configured nodes are completed.
      • You cannot configure such an alert for a workspace.
    • Uncompleted
      You can configure an alert rule for nodes, baselines, workspaces, and workflows. If one node of the specified object is not completed within the specified time, an alert is reported. The intelligent monitoring system starts to monitor the object that you specify in the Create Custom Rule dialog box from the time at which the object starts to run until the time specified by the Alert At parameter in the same dialog box.
      Note If you configure an alert rule for a node that is scheduled by hour or minute, an alert is reported if one of the instances of the node is not completed within the time specified by the Alert At parameter on the current day. Example:
      • You set Trigger Condition to Uncompleted and Alert At to 10:00 for a baseline. If a node in the baseline is not completed at 10:00, an alert that contains a list of the uncompleted nodes is reported.
      • If a node is scheduled to run on an hourly basis from 00:00 to 23:59 every day and you set the Alert At parameter to 12:00, an alert is triggered every day, because the instances that are scheduled to run after 12:00 are never completed at that time. For a node that is scheduled by hour or minute, we recommend that you set Trigger Condition to Uncompleted in Cycle in the Create Custom Rule dialog box. This way, an alert is reported only if one of the instances of the node is not completed within its own cycle.
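
The hourly example can be checked with a small sketch. It assumes, for illustration only, that each hourly instance takes 10 minutes and that an instance counts as uncompleted if it has not finished by the Alert At time. The helper name `uncompleted_at` is hypothetical.

```python
from datetime import datetime, timedelta

DAY = datetime(2024, 1, 1)  # arbitrary date for the example

def uncompleted_at(alert_at, instances):
    """Return the scheduled start times of instances not finished by alert_at.

    instances is a list of (scheduled_start, finish_time) pairs, where
    finish_time is None for an instance that has not finished at all.
    """
    return [start for start, finish in instances
            if finish is None or finish > alert_at]

# Hourly instances at 00:00, 01:00, ..., 23:00, each assumed to take 10 minutes.
hourly = [(DAY + timedelta(hours=h), DAY + timedelta(hours=h, minutes=10))
          for h in range(24)]

pending = uncompleted_at(DAY.replace(hour=12), hourly)
print(len(pending))                   # 12
print(pending[0].strftime("%H:%M"))   # 12:00
```

Even when every instance runs on time, the instances from 12:00 onward are still uncompleted at 12:00, so the Uncompleted condition is met every day.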
    • Error

      You can configure an alert rule for nodes, baselines, workspaces, and workflows. If an error is reported for a node, an alert that contains detailed information about the error is sent to the recipient.

      Note
      • If you configure an alert rule for a baseline, workspace, or workflow, an alert is reported when a node in the baseline, workspace, or workflow fails to run.
      • After you set Trigger Condition to Error, an alert is reported every time the node fails. For example, if the node is rerun three times and fails each time, the alert is triggered three times.
    • Uncompleted in Cycle

      If node instances continue to run at the end of the specified cycle, an alert is reported. In most cases, you can configure this alert rule for node instances that are scheduled by hour or minute. For example, Node A is scheduled to run every 2 hours, and the time to complete each run is 25 minutes. If Node A starts to run at 00:00 every day, Node A runs 12 times within 24 hours. The first cycle starts at 00:00, the second cycle starts at 02:00, and so on. The twelfth cycle starts at 22:00. If Node A runs as expected, the running of Node A is complete at the specified point in time in each cycle, such as 00:25 or 02:25. If the node continues to run at the end of a cycle, an alert is reported.

      Note By default, frozen instances are in the Completed state.
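
The Node A example can be sketched as follows. The code assumes that a cycle ends when the next cycle starts, which the text implies but does not state, and the late run at 02:00 is invented to show a triggered alert. The function names are hypothetical.

```python
from datetime import datetime, timedelta

DAY = datetime(2024, 1, 1)  # arbitrary date for the example
CYCLE = timedelta(hours=2)

def cycle_windows(first_start, cycle, count):
    """Return (cycle_start, cycle_end) pairs; a cycle ends when the next starts."""
    return [(first_start + i * cycle, first_start + (i + 1) * cycle)
            for i in range(count)]

def overrun_cycles(windows, finish_times):
    """Return the start times of cycles whose instance is still running at cycle end."""
    return [start for (start, end), finish in zip(windows, finish_times)
            if finish is None or finish > end]

windows = cycle_windows(DAY, CYCLE, 12)
# Normal case: every run takes 25 minutes.
finishes = [start + timedelta(minutes=25) for start, _ in windows]
# Invented exception: the run that starts at 02:00 finishes at 04:15.
finishes[1] = windows[1][0] + timedelta(hours=2, minutes=15)

late = overrun_cycles(windows, finishes)
print(windows[-1][0].strftime("%H:%M"))      # 22:00
print([s.strftime("%H:%M") for s in late])   # ['02:00']
```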
    • Overtime
      You can configure an alert rule for nodes, baselines, workspaces, and workflows. An alert is reported if a node of the monitored object runs longer than the specified execution duration and has not yet succeeded.
      Note The execution duration is specified by the Timer parameter in the Create Custom Rule dialog box.
    • The error persists after the node automatically reruns

      You can configure an alert rule for nodes, baselines, workspaces, and workflows. If an error persists after a node of the monitored object automatically reruns, an alert is reported.

Features

To reduce your configuration costs and highlight the alerts of important nodes that you want to monitor, the intelligent monitoring system provides the following features:
  • Baseline Instance: allows you to view the information about a baseline. For more information, see Manage baseline instances.
  • Baseline Management: allows you to create and define a baseline. For more information, see Manage baselines.
  • Event Management: allows you to view all alert events that are related to slowdown or errors. For more information, see Manage events.
  • Rule Management: allows you to create custom alert rules, add a DingTalk chatbot, and obtain a webhook URL. The intelligent monitoring system monitors the status of your nodes based on your custom alert rules. This way, you can identify and handle exceptions at the earliest opportunity. For more information, see Manage custom alert rules.
  • Alert Management: allows you to view details about all alerts. This way, you can identify and handle exceptions at the earliest opportunity. For more information, see Manage alerts.
  • Schedule: allows you to use the shift schedule feature of DataWorks to specify a shift schedule for alert recipients during the O&M of nodes. DataWorks sends an alert notification to a recipient of a custom alert rule based on the shift schedule that you specify. This way, the alert recipient can identify and handle exceptions at the earliest opportunity. For more information, see Create and manage a shift schedule.
For answers to frequently asked questions about the intelligent monitoring system, see FAQ.