The intelligent monitoring system monitors the status of nodes. When the conditions of the global rules, custom alert rules, or intelligent baselines that you configured are met, the system sends alert notifications by using the notification methods that you specified, such as emails, text messages, phone calls, and DingTalk group messages. This way, you can detect and handle exceptions at the earliest opportunity.

Background information

In addition to the basic monitoring operations that are supported by a traditional monitoring system, DataWorks provides the following capabilities:
  • Accurately identifies the nodes that you want to monitor based on your business requirements.

    DataWorks runs a large number of nodes, and the dependencies between the nodes are complex. Therefore, it is difficult for you to find all the ancestor nodes of a node and monitor them all even if you know which node is the most important. If you monitor all nodes, a large number of alerts are triggered, and the alerts of the nodes that you want to monitor may be missed. This leads to poor monitoring performance.

  • Provides different alerting methods for monitored nodes.

    For example, alerts for some nodes must be triggered only after the nodes run for more than 1 hour, whereas alerts for other nodes must be triggered only after the nodes run for more than 2 hours. It is time-consuming to configure alert rules for each node and difficult to determine the alert threshold for each node.

  • Supports different alerting times for monitored nodes.

    For example, alerts for less important nodes can be held until the alert contact that you configured starts work in the morning, but alerts for important nodes must be reported immediately after they are triggered. It is difficult for a conventional monitoring system to distinguish between important nodes and less important nodes.

  • Provides a switch to disable alerting.

The intelligent monitoring system provides comprehensive monitoring and alerting logic. You need to provide only the names of important nodes in your business. Then, the intelligent monitoring system automatically monitors the entire process of your nodes and generates standard alert rules for them. To create custom alert rules, you can configure the basic alert rule settings.

The full-path monitoring feature of the intelligent monitoring system safeguards the data output of all important business lines of Alibaba Group. The system analyzes ancestor and descendant node paths to detect risks and provides O&M information to business departments. These features help keep the business of Alibaba Group highly stable.

Limits

The following features are supported only in DataWorks Standard Edition and more advanced editions: Baseline Instance, Baseline Management, and Event Management.

Considerations

If you want to receive alert notifications by text message or phone call as a RAM user, you must log on to the RAM console by using your Alibaba Cloud account and provide the required information about the RAM user, such as the phone number and email address. For more information about how to provide the personal information of a RAM user, see Modify the basic information about a RAM user.

Monitoring methods

The intelligent monitoring system provides the following monitoring methods: intelligent baseline and custom alert rules. This section describes the monitoring principles and configuration rules of the two monitoring methods.
  • Intelligent baseline
    • Monitoring scope

      A baseline is a group of nodes. You can monitor nodes by baseline.

      After you configure a baseline as a monitored object, all nodes in the baseline and the ancestor nodes of these nodes are monitored. The intelligent monitoring system therefore does not monitor all nodes. A descendant node of a monitored node is monitored only if the descendant node is added to a monitored baseline. If a descendant node is not added to a monitored baseline, the intelligent monitoring system does not report an alert even if that node fails.

      Figure: Monitoring scope

      In the preceding figure, DataWorks has six nodes and Nodes D and E are added to a monitored baseline. The intelligent monitoring system monitors Nodes D and E and all the ancestor nodes of these two nodes. That is, if an exception (error or slowdown) occurs on Node A, B, D, or E, the intelligent monitoring system will detect the exception. However, the intelligent monitoring system does not monitor Node C or F.

    • Node capturing
      If an exception occurs on a node that is within the specified monitoring scope, the intelligent monitoring system generates an alert event and reports an alert based on the analysis of the alert event. Two types of node exceptions are monitored. You can go to the Event Management page to view them. For more information, see Manage events.
      • Error: indicates that a node fails to run.
      • Slow: indicates that the running time of a node is significantly longer than the average running time of the node in the previous periods.
      Note: If a node times out and then encounters an error, two events are generated.
    • Alerting time judgment
      Margin is the maximum period of time by which the start of a node can be delayed, as allowed by the intelligent monitoring system. The latest start time of a node is calculated by using the following formula: Latest start time = Baseline time - Total average running time of the node and of its descendant nodes on the path to the baseline.

      Figure: Alerting time judgment

      In the preceding figure, to make sure that Baseline A meets its baseline time of 05:00, Node E must start no later than 04:10. This time is calculated by subtracting the average running time of Node F (20 minutes) and that of Node E (30 minutes) from the baseline time 05:00. For Baseline A, this time is also the latest completion time of Node B.

      To make sure that Baseline B meets its baseline time of 06:00, Node B must be completed no later than 04:00. This time, which is earlier than 04:10, is calculated by subtracting the average running time of Node D (2 hours) from the baseline time 06:00. To meet the baseline time of both Baseline A and Baseline B, the latest completion time of Node B must therefore be 04:00.

      The latest completion time of Node A is 02:00, which is calculated by subtracting the average running time of Node B (2 hours) from 04:00. The latest start time of Node A is 01:50, which is calculated by subtracting the average running time of Node A (10 minutes) from 02:00. If Node A does not start to run by 01:50, Baseline A may enter the broken state. If Node A fails at 01:00, its margin is 50 minutes, which is the difference between 01:00 and 01:50. As this example shows, the margin indicates how urgent a node exception is.

    • Baseline alerting

      The baseline alerting feature reports alerts for baselines for which baseline monitoring is enabled. You must set the Margin Threshold and Committed Time parameters for each baseline. If the baseline completion time estimated by the intelligent monitoring system exceeds the alert margin, the system notifies the preset alert recipient three times at intervals of 30 minutes.

    • Notification method

      By default, baseline alerts are sent to the baseline owner. You can change the notification method and the alerting action based on your business requirements. To perform the change, go to the Rule Management page, find Global Baseline Alert Rule, and then click View Details in the Actions column. For more information, see Configure alert details.

    • Gantt chart

      The Gantt chart feature, which is provided by the baseline instances of the intelligent monitoring system, shows the key path of a node.

  • Custom alert rules
    To create a trigger for a custom alert, configure the following parameters based on your business requirements:
    • Object: You can specify nodes, baselines, and workspaces as objects.
    • Trigger Condition: Valid values are Completed, Uncompleted, Error, Uncompleted in Cycle, and Overtime.
    • Notification Method: Valid values are SMS, Email, and Phone.
    • Maximum Alerts: the maximum number of times that an alert can be reported. After this number is reached, no more alerts are reported.
    • Minimum Alert Interval: the minimum time interval between two consecutive alerts that DataWorks reports.
    • Quiet Hours: the period of time during which no alerts are reported.
    • Recipient: You can set this parameter to the node owner or another recipient.
    The following information shows the trigger conditions for custom alert rules:
    • Completed

      You can configure alerts for nodes, baselines, or workspaces that are completed at a specific point in time. After all nodes of the preset object are completed, an alert is reported. For example, if you configure an alert for a baseline, an alert is reported when all nodes of the baseline are completed.

    • Uncompleted

      You can configure alerts for nodes, baselines, or workspaces that are not completed at a specific point in time. For example, you require that a baseline be completed at 10:00. If a node in the baseline is not completed at 10:00, an alert that contains a list of uncompleted nodes is reported.

    • Error

      You can configure alerts for nodes, baselines, or workspaces for which errors are reported. If an error is reported for a node, an alert that contains detailed information about the error is sent to the recipient.

    • Uncompleted in Cycle

      For nodes that are scheduled by hour, you can configure alerts that are triggered if a node is not completed by a specific point in time in each scheduling cycle.

    • Overtime

      You can configure alerts for nodes, baselines, or workspaces where a timeout occurs. If a monitored node of the preset object is not completed within the specified period of time, an alert is reported.
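
The arithmetic in the Alerting time judgment example can be reproduced with a short script. The following Python sketch is for illustration only. The dependency graph (A → B, B → D, B → E, E → F), the dictionary names, and the helper functions are assumptions reconstructed from the example text. They are not part of any DataWorks API.

```python
from datetime import datetime, timedelta

# Assumed graph, reconstructed from the example text: A -> B, B -> D, B -> E, E -> F.
DOWNSTREAM = {"A": ["B"], "B": ["D", "E"], "D": [], "E": ["F"], "F": []}

# Average running time of each node, in minutes (values taken from the example).
AVG_RUNTIME_MIN = {"A": 10, "B": 120, "D": 120, "E": 30, "F": 20}

# Baseline time of the last node of each baseline: Baseline A ends at F, Baseline B at D.
BASELINE_TIME = {"F": "05:00", "D": "06:00"}


def parse(hhmm):
    """Convert an HH:MM string into a datetime so that times can be subtracted."""
    return datetime.strptime(hhmm, "%H:%M")


def latest_completion(node):
    """Latest time at which `node` can finish while every downstream baseline is still met."""
    candidates = []
    if node in BASELINE_TIME:
        candidates.append(parse(BASELINE_TIME[node]))
    for child in DOWNSTREAM[node]:
        # A node must finish before each of its descendants has to start.
        candidates.append(latest_start(child))
    # The strictest (earliest) constraint wins. Every leaf in this example has a baseline time.
    return min(candidates)


def latest_start(node):
    """Latest completion time minus the average running time of the node."""
    return latest_completion(node) - timedelta(minutes=AVG_RUNTIME_MIN[node])


def margin_minutes(node, event_time):
    """Minutes between an exception at `event_time` and the latest start time of `node`."""
    return int((latest_start(node) - parse(event_time)).total_seconds() // 60)


if __name__ == "__main__":
    for name in "ABDEF":
        print(name, latest_start(name).strftime("%H:%M"), latest_completion(name).strftime("%H:%M"))
    print("Margin of Node A at 01:00:", margin_minutes("A", "01:00"), "minutes")
```

Taking the minimum over all downstream constraints is what makes 04:00 (from Baseline B) override 04:10 (from Baseline A) as the latest completion time of Node B.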
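
The following Python sketch shows one way in which the Maximum Alerts, Minimum Alert Interval, and Quiet Hours parameters of a custom alert rule could interact. The AlertRule class, the should_send function, and the order in which the checks run are hypothetical. The sketch only illustrates the parameter descriptions and does not reflect how DataWorks implements them.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical container for three custom alert rule parameters."""
    max_alerts: int            # Maximum Alerts
    min_interval: timedelta    # Minimum Alert Interval
    quiet_start: time          # Quiet Hours: start of the window
    quiet_end: time            # Quiet Hours: end of the window


def in_quiet_hours(rule: AlertRule, now: datetime) -> bool:
    """Return True if `now` falls in the quiet window, which may cross midnight."""
    t = now.time()
    if rule.quiet_start <= rule.quiet_end:
        return rule.quiet_start <= t < rule.quiet_end
    return t >= rule.quiet_start or t < rule.quiet_end


def should_send(rule: AlertRule, now: datetime, sent_count: int,
                last_sent: Optional[datetime]) -> bool:
    """Decide whether one more alert may be reported at `now`."""
    if sent_count >= rule.max_alerts:
        return False  # the maximum number of alerts has been reached
    if in_quiet_hours(rule, now):
        return False  # no alerts are reported during quiet hours
    if last_sent is not None and now - last_sent < rule.min_interval:
        return False  # the previous alert was reported too recently
    return True


if __name__ == "__main__":
    rule = AlertRule(3, timedelta(minutes=30), time(0, 0), time(8, 0))
    print(should_send(rule, datetime(2024, 1, 1, 9, 0), 0, None))
```

In this sketch, an alert is suppressed as soon as any one of the three limits applies, so the order of the checks does not change the result.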

Features

To reduce your configuration costs and highlight the alerts of important nodes that you want to monitor, the intelligent monitoring system provides the following features:
  • Baseline Instance: allows you to view the information about a baseline. For more information, see Manage baseline instances.
  • Baseline Management: allows you to create and define a baseline. For more information, see Manage baselines.
  • Event Management: allows you to view all alert events that are related to slowdown or errors. For more information, see Manage events.
  • Rule Management: allows you to create custom alert rules, add a DingTalk chatbot, and obtain a webhook URL. The intelligent monitoring system monitors the status of your nodes based on your custom alert rules. This way, you can detect and handle exceptions at the earliest opportunity. For more information, see Manage custom alert rules.
  • Alert Management: allows you to view details about all alerts. This way, you can detect and handle exceptions at the earliest opportunity. For more information, see Manage alerts.
  • Schedule: allows you to use the shift schedule feature of DataWorks to specify a shift schedule for alert recipients during the O&M of nodes. DataWorks sends an alert notification to a recipient of a custom alert rule based on the shift schedule that you specify. This way, the alert recipient can detect and handle exceptions at the earliest opportunity.
For answers to frequently asked questions about the intelligent monitoring system, see FAQ.