DataWorks allows you to backfill data for an auto triggered node to run the node in a specified time range. You can backfill data for an auto triggered node and its descendant nodes. You can view the status of the generated data backfill instances, and stop, rerun, or unfreeze these instances on the Patch Data page in Operation Center. This topic describes how to backfill data for auto triggered nodes and manage data backfill instances.

Background information

After an auto triggered node is developed, committed, and deployed to the scheduling system, the scheduling system runs the node based on the scheduling configurations of the node. If you want to run the auto triggered node in a specified time range, you can backfill data for the node. For more information about how to backfill data for an auto triggered node, see Backfill data in this topic. The following data backfill modes are supported:
  • Backfill Data for Current Node: This mode allows you to backfill data for the current node.
  • Current and Descendent Nodes Retroactively: This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a small number of descendant nodes, we recommend that you use this mode. In this mode, you can specify the descendant nodes for which you want to backfill data.
  • Backfill Data for Massive Nodes: This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a large number of descendant nodes, we recommend that you use this mode. In this mode, you can filter the descendant nodes by workspace. You can configure a whitelist to backfill data for some nodes that are not in the selected workspaces. You can also configure a blacklist to avoid backfilling data for some nodes that are in the selected workspaces.
  • Advanced Mode: This mode allows you to backfill data for multiple nodes at a time. In this mode, you can select nodes that do not have dependencies with each other. You can select nodes for which you want to backfill data in the directed acyclic graph (DAG) of an auto triggered node or in the node list on the Cycle Task page.
    • In the DAG, you can use the node aggregation feature to group nodes by workspace, owner, or priority. This way, you can backfill data for multiple nodes at a time by specifying a node group. For more information about a DAG, see Manage instances in a DAG.
    • You can also select nodes in the node list on the Cycle Task page. You can filter nodes based on specific conditions and select the nodes for which you want to backfill data.
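
The way Backfill Data for Massive Nodes mode combines selected workspaces, a whitelist, and a blacklist can be pictured as a set computation. The following Python sketch is illustrative only: the `Node` type, its fields, and the `select_nodes` function are hypothetical names introduced for this example and are not part of any DataWorks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a descendant node; not a DataWorks type."""
    node_id: int
    workspace: str


def select_nodes(descendants, selected_workspaces, whitelist=(), blacklist=()):
    """Return the sorted IDs of the nodes for which data would be backfilled.

    Assumed semantics, taken from the mode description: nodes in the selected
    workspaces are included, whitelisted nodes outside those workspaces are
    added, and blacklisted nodes inside those workspaces are removed.
    """
    known = {n.node_id for n in descendants}
    chosen = {n.node_id for n in descendants if n.workspace in selected_workspaces}
    chosen |= set(whitelist) & known  # add nodes from other workspaces
    chosen -= set(blacklist)          # drop nodes that must be skipped
    return sorted(chosen)


# Hypothetical descendant nodes in two workspaces.
descendants = [
    Node(101, "sales"),
    Node(102, "sales"),
    Node(201, "finance"),
    Node(202, "finance"),
]
result = select_nodes(descendants, {"sales"}, whitelist=[201], blacklist=[102])
print(result)
```

In this sketch, node 201 is added even though its workspace is not selected, and node 102 is skipped even though its workspace is selected.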

Limits

  • You can use the advanced mode only in workspaces that reside in the China (Shenzhen) and UAE (Dubai) regions.
  • Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.
  • Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).
  • Instances that run on exclusive resource groups for scheduling are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).
  • If the logs of an instance in the Complete state exceed 3 MB in size, the system clears the logs on a daily basis.
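
As a rough illustration of the retention periods listed above, the following Python sketch estimates when an instance and its logs would expire. It assumes that retention is counted from the date on which the instance runs, which the limits do not state explicitly, and the `expiry_dates` function and the resource group keys are hypothetical names.

```python
from datetime import date, timedelta

# Retention periods in days, as listed in the limits above.
RETENTION_DAYS = {
    "shared": {"instance": 30, "log": 7},
    "exclusive": {"instance": 30, "log": 30},
}


def expiry_dates(run_date, resource_group):
    """Return (instance_expiry, log_expiry) for an instance run on run_date.

    Assumption: the retention period starts on the run date.
    """
    days = RETENTION_DAYS[resource_group]
    return (
        run_date + timedelta(days=days["instance"]),
        run_date + timedelta(days=days["log"]),
    )


instance_expiry, log_expiry = expiry_dates(date(2021, 8, 1), "shared")
print(instance_expiry, log_expiry)
```

Because the validity period is described as approximate, treat the computed dates as estimates rather than exact deletion times.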

Usage notes

  • When DataWorks backfills data for a node for a specific time range, if an instance generated for the node fails on a day within the time range, the status of the data backfill instance of the node for that day is also set to failed. In this case, DataWorks does not run the instances generated for this node for the next day. DataWorks runs the instances generated for a node on a day only after all instances generated for the node on the previous day are successfully run.
  • When you backfill data for a node that is scheduled by hour or minute, whether the instances generated for a given day run in parallel depends on whether the node is self-dependent. If the node is self-dependent and the first data backfill instance depends on an auto triggered instance from the previous day that has not been run, the data backfill instance cannot be triggered. If the first instance for which data needs to be backfilled does not depend on an instance generated on the previous day, the data backfill instance of the node is directly run.
  • If both an auto triggered node instance and a data backfill instance are running for a node, you must stop the data backfill instance to ensure that the auto triggered node instance can run as expected.
  • If you backfill data for multiple instances or run a large number of data backfill instances in parallel, scheduling resources may be insufficient. Make sure that your configurations are appropriate based on your business requirements.
  • To prevent data backfill instances from consuming excessive resources and affecting auto triggered node instances, DataWorks applies the following priority rules to data backfill instances:
    • If you set the data timestamp to the previous day (T-1), the priority of the data backfill instance is determined by the baseline priority of the node.
    • If you set the data timestamp to a historical date (T-2), the priority of the data backfill instance is downgraded based on the following rules:
      • The priority of level 7 and level 8 nodes is reduced to level 3.
      • The priority of level 5 and level 3 nodes is reduced to level 2.
      • The priority of level 1 nodes remains unchanged.
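
The priority rules above amount to a small lookup. The following Python sketch restates them under two assumptions that the rules do not spell out: a historical date means T-2 or any earlier date, and data timestamps later than T-1 are out of scope. The `backfill_priority` function is a hypothetical name, not a DataWorks API.

```python
from datetime import date

# Downgraded priority for each baseline priority level, per the rules above.
DOWNGRADED = {8: 3, 7: 3, 5: 2, 3: 2, 1: 1}


def backfill_priority(baseline_priority, data_timestamp, today):
    """Return the effective priority of a data backfill instance."""
    if baseline_priority not in DOWNGRADED:
        raise ValueError(f"unexpected priority level: {baseline_priority}")
    days_back = (today - data_timestamp).days
    if days_back == 1:
        # T-1: the baseline priority is kept.
        return baseline_priority
    if days_back >= 2:
        # Assumption: T-2 and any earlier date are downgraded the same way.
        return DOWNGRADED[baseline_priority]
    raise ValueError("the rules cover only data timestamps of T-1 or earlier")


today = date(2021, 8, 24)
print(backfill_priority(8, date(2021, 8, 23), today))  # data timestamp is T-1
print(backfill_priority(8, date(2021, 8, 22), today))  # data timestamp is T-2
```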

Backfill data

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace that you want to manage resides. Find the workspace and click DataStudio in the Actions column.
  2. On the DataStudio page, click the Cycle Task icon in the upper-left corner and choose All Products > Operation Center.
  3. In the left-side navigation pane of the Operation Center page, choose Cycle Task Maintenance > Cycle Task.
  4. Backfill data for nodes.
    1. On the Cycle Task page, find the desired auto triggered node and click the node name to open the DAG of the node.
      You can also click the Show icon to show the node list. Then, find the desired node and click DAG in the Actions column to open the DAG of the node.
    2. In the DAG, right-click the node for which you want to backfill data. In the shortcut menu that appears, move the pointer over Run and select a data backfill mode. In the dialog box that appears, configure the parameters.
    Note You can also perform the following steps on the Cycle Task page to backfill data for an auto triggered node: Click the Show icon to show the auto triggered node list. Find the desired node, click Backfill Data in the Actions column, and then select a data backfill mode.
    You can configure the parameters required for each data backfill mode based on the following descriptions:
    • Backfill data in Backfill Data for Current Node mode.
      The following table describes the parameters required for this mode.
      Parameter Description
      Data Backfill Instance Name

      DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements.

      Node

      The name of the node for which you want to backfill data.

      Data Timestamp
      The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
      • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
      • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

        For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

      Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
      Parallelism
      Specifies whether to run multiple data backfill instances in parallel.
      • If you do not select Parallelism, the data backfill instances are run in sequence based on the data timestamps.
      • If you select Parallelism, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. The number of data backfill instances is specified by the Number of Concurrent Nodes parameter. Data backfill instances with different data timestamps can be run at the same time.
      Number of Concurrent Nodes
      The number of data backfill instances that are generated and run in parallel during data backfill.
      Note This parameter is required if Parallelism is selected.
      You can set the Number of Concurrent Nodes parameter to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
      • If the number of data timestamps is less than the value of the Number of Concurrent Nodes parameter, the data backfill instances are run in parallel. For example, the data timestamps are from January 11 to January 13, and you set the Number of Concurrent Nodes parameter to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
      • If the number of data timestamps is greater than the value of the Number of Concurrent Nodes parameter, the system runs some data backfill instances in sequence and the other data backfill instances in parallel based on the data timestamps. For example, the data timestamps are from January 11 to January 13, and you set the Number of Concurrent Nodes parameter to 2. In this case, two data backfill instances are generated and run in parallel. One of the two instances covers two data timestamps and runs them in sequence.
      Order

      Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

    • Backfill data in Current and Descendent Nodes Retroactively mode.
      The following table describes the parameters required for this mode.
      Parameter Description
      Data Backfill Instance Name

      DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements.

      Data Timestamp
      The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
      • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
      • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

        For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

      Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
      Parallelism
      Specifies whether to run multiple data backfill instances in parallel.
      • If you do not select Parallelism, the data backfill instances are run in sequence based on the data timestamps.
      • If you select Parallelism, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. The number of data backfill instances is specified by the Number of Concurrent Nodes parameter. Data backfill instances with different data timestamps can be run at the same time.
      Number of Concurrent Nodes
      You can set the Number of Concurrent Nodes parameter to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
      • If the number of data timestamps is less than the value of the Number of Concurrent Nodes parameter, the data backfill instances are run in parallel. For example, the data timestamps are from January 11 to January 13, and you set the Number of Concurrent Nodes parameter to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
      • If the number of data timestamps is greater than the value of the Number of Concurrent Nodes parameter, the system runs some data backfill instances in sequence and the other data backfill instances in parallel based on the data timestamps. For example, the data timestamps are from January 11 to January 13, and you set the Number of Concurrent Nodes parameter to 2. In this case, two data backfill instances are generated and run in parallel. One of the two instances covers two data timestamps and runs them in sequence.
      Order

      Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

      Nodes
      You can filter nodes by name and level and select the nodes for which you want to backfill data.
      Note
      • Fuzzy match is supported when you search for the desired node by node name. After you enter a keyword, all nodes whose names contain the keyword are displayed in the table below the search box.
      • The search scope includes the current node and its descendant nodes of all levels. You can select the current node and some or all of its descendant nodes.
    • Backfill data in Backfill Data for Massive Nodes mode.
      The following table describes the parameters required for this mode.
      Parameter Description
      Data Backfill Instance Name

      DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements.

      Data Timestamp
      The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
      • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
      • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

        For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

      Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
      Order

      Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

      Select Nodes Requiring Data Backfill by Workspace
      You can select workspaces in the Available Workspaces section and add them to the Selected Workspaces section. This way, you can backfill data for all the descendant nodes of the current node that are in the workspaces you select.
      Note
      • Fuzzy match is supported when you search for the desired workspace by keyword. After you enter a keyword, all workspaces whose names contain the keyword are displayed in both sections.
      • You can select only workspaces that reside in the current region.
      • You can configure a whitelist to backfill data for some nodes that are not in the selected workspaces. You can also configure a blacklist to avoid backfilling data for some nodes that are in the selected workspaces.
      • You can specify whether to backfill data for the current node.
        • If you select Current Node, the system backfills data for the current node and its descendant nodes.
        • If you clear Current Node, the system performs a dry run on the current node and backfills data for only the descendant nodes of the current node.
      Node Whitelist
      The nodes that are not in the selected workspaces but for which you want to backfill data.
      Note You can search for nodes only by node ID.
      Node Blacklist
      The nodes that are in the selected workspaces but for which you do not want to backfill data.
      Note You can search for nodes only by node ID.
    • Backfill data in Advanced Mode.
      In advanced mode, you can use the node aggregation feature provided by the DAG of an auto triggered node to group nodes by a condition such as workspace, owner, or priority. You can also backfill data for nodes that do not depend on each other.
      Figure: Backfill data in advanced mode
      To backfill data in advanced mode, perform the following steps:
      1. Select the nodes for which you want to backfill data.
        • In the DAG of an auto triggered node, you can click the Not Aggregate, Aggregate By Workspace, Aggregate By Owner, or Aggregate By Priority icon in Section 1 to use the node aggregation feature. This way, you can group nodes by workspace, owner, or priority. You can select the check box in the upper-right corner of a group to select all the nodes in the group in Section 2. For more information about the node aggregation feature of a DAG, see Manage instances in a DAG.
        • You can also select nodes in the node list on the Cycle Task page. You can search for the desired nodes based on different conditions such as the node name, node type, owner, and resource group for scheduling in Section 3. You can select the auto triggered nodes for which you want to backfill data in Section 4 and click Add in the lower part of the page.
          Note If you add an auto triggered node this way, the system backfills data for the node and all its descendant nodes. If you want to backfill data for only some of the descendant nodes, click the name of the auto triggered node to open its DAG and select the descendant nodes for which you want to backfill data.
      2. View the selected nodes.
        After the nodes for which you want to backfill data are selected, you can view the selected nodes in the Run dialog box in Section 5. You can also perform the following operations:
        • Click the Locate icon next to the name of a node to open the DAG of the node. You can re-select the nodes for which you want to backfill data.
        • Click the Delete icon next to the name of a node to remove the node.
      3. In the Run dialog box in Section 5, click Configure to configure the parameters for data backfill. The following table describes the parameters required for this mode.
        Parameter Description
        Data Backfill Instance Name

        DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements.

        Selected Nodes
        The number of nodes for which you want to backfill data. You can click Change to change the nodes for which you want to backfill data.
        Data Timestamp
        The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
        • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
        • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

          For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

        Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
        Parallelism
        Specifies whether to run multiple data backfill instances in parallel.
        • If you do not select Parallelism, the data backfill instances are run in sequence based on the data timestamps.
        • If you select Parallelism, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. The number of data backfill instances is specified by the Number of Concurrent Nodes parameter. Data backfill instances with different data timestamps can be run at the same time.
        Number of Concurrent Nodes
        You can set the Number of Concurrent Nodes parameter to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
        • If the number of data timestamps is less than the value of the Number of Concurrent Nodes parameter, the data backfill instances are run in parallel. For example, the data timestamps are from January 11 to January 13, and you set the Number of Concurrent Nodes parameter to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
        • If the number of data timestamps is greater than the value of the Number of Concurrent Nodes parameter, the system runs some data backfill instances in sequence and the other data backfill instances in parallel based on the data timestamps. For example, the data timestamps are from January 11 to January 13, and you set the Number of Concurrent Nodes parameter to 2. In this case, two data backfill instances are generated and run in parallel. One of the two instances covers two data timestamps and runs them in sequence.
        Order

        Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

  5. Click OK to start to backfill data.
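
The interaction between the Data Timestamp, Parallelism, Number of Concurrent Nodes, and Order parameters can be sketched as follows. This is a minimal Python illustration, not the scheduler's actual algorithm: the `plan_backfill` function is a hypothetical name, and the way data timestamps are divided among instances (contiguous chunks, with earlier instances taking the extra timestamps) is an assumption chosen to match the examples in the parameter descriptions. The year 2021 is also arbitrary, because the examples do not state a year.

```python
from datetime import date, timedelta


def plan_backfill(start, end, concurrency=None, descending=False):
    """Group data timestamps into data backfill instances.

    Each inner list stands for one instance. Instances run in parallel with
    each other, and the timestamps inside one instance run in sequence.
    Pass concurrency=None to model the case where Parallelism is not selected.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    if descending:
        days.reverse()
    if concurrency is None:
        return [days]  # everything runs in sequence
    if not 2 <= concurrency <= 10:
        raise ValueError("Number of Concurrent Nodes must be an integer from 2 to 10")
    groups = min(concurrency, len(days))
    # Assumption: contiguous chunks; earlier instances take the extra timestamps.
    base, extra = divmod(len(days), groups)
    plan, index = [], 0
    for group in range(groups):
        size = base + (1 if group < extra else 0)
        plan.append(days[index:index + size])
        index += size
    return plan


start, end = date(2021, 1, 11), date(2021, 1, 13)
for instance in plan_backfill(start, end, concurrency=2):
    print([day.isoformat() for day in instance])
```

With a concurrency of 4, the same range yields three instances with one data timestamp each. With a concurrency of 2, it yields two instances, one of which covers two data timestamps.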

Manage data backfill instances

After you make the preceding configurations, data backfill instances are generated. Then, you can view the details and status of a data backfill instance, and stop or rerun a data backfill instance on the Patch Data page of Operation Center. For more information about how to go to Operation Center, see the steps described in the Backfill data section. The following table describes the operations that you can perform in the sections shown in the following figure.
Figure: Manage data backfill instances
Section Description
1
In this section, you can specify filter conditions to search for a data backfill instance.

For example, you can search for a data backfill instance by node name, node ID, or one or more of the following conditions: Retroactive Instance Name, Created By, Creation Date, Status, Data Timestamp, My Nodes, and Initiated by Me.

Note
  • You can click Show Search Options if you want to specify more filter conditions such as Node Type, Scheduling Resource Group, and Engine Instance.
  • Fuzzy match is supported when you search for the desired node by node name. After you enter a keyword, all nodes whose names contain the keyword are displayed.
2
In this section, you can view the following information about a data backfill instance:
  • Node Name: the name of the data backfill instance. Click the Show icon before the name of the data backfill instance and view the information about the instance in Section 3, such as the date when the data backfill instance is run and details about the nodes for which the instance is generated.
  • Check Status: the check status of the data backfill instance.
  • Running status: the status of the data backfill instance. The data backfill instance can be in the state of running, not running, waiting for resources, exception, or stopped.
  • Created By: the Alibaba Cloud account within which the data backfill instance is generated.
  • Creation Date: the date when the data backfill instance is generated.
  • Nodes: the number of nodes for which the data backfill instance is generated.
  • Data Timestamp: the date when the data backfill instance is run.
In this section, you can also perform the following operations on data backfill instances:
  • Stop: You can stop multiple data backfill instances that are running or waiting for resources at a time. After you perform this operation, the status of the instances is set to failed.
    Note
    • Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.
    • You cannot stop data backfill instances that failed, are not running, or are successfully run.
  • Batch Rerun: You can rerun multiple data backfill instances at a time.
    Note You can batch rerun only data backfill instances that failed.
  • Reuse: You can reuse a data backfill instance. This way, you can quickly determine the nodes for which you want to backfill data.
    Note Data backfill instances that are generated for nodes whose data is backfilled in Backfill Data for Massive Nodes mode cannot be reused.
3
In this section, you can view the following information about each node for which the data backfill instance is generated:
  • Name: the name of the node. You can click the node name to open the DAG of the node and view the details about the node.
  • Owner: the owner of the workspace to which the node belongs.
  • Schedule: the scheduling time of the node.
  • Start run time: the time when the node starts to run.
  • End Time: the time when the node stops running.
  • Runtime: the time consumed to run the node.
In this section, you can also perform the following operations on a node:
  • Stop: If the node is running or waiting for resources, you can stop the node. Then, the status of the node is set to failed.
    Note You cannot stop nodes that failed, are not running, or are successfully run.
  • Rerun: You can rerun the node.
    Note You can rerun only nodes that failed or are successfully run.
  • Rerun Descendant Nodes: You can rerun the descendant nodes of the node.
  • Set Status to Successful: You can set the status of the node to successful.
  • Freeze: You can freeze the node to pause the scheduling of the node.
  • Unfreeze: If the node is frozen, you can unfreeze the node to resume the scheduling of the node.
  • View Lineage: You can view the lineage of the node.
4
You can select multiple nodes in Section 3 and click Stop or Rerun in Section 4 to stop or rerun the selected nodes at a time.
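
The notes on the Stop and Rerun operations for nodes describe a small set of status checks, which the following Python sketch restates. The status strings and the `can_perform` and `status_after_stop` functions are hypothetical names used for illustration, and only the two operations whose preconditions are stated in the notes are covered.

```python
# Statuses in which each node-level operation is allowed, per the notes above.
ALLOWED_STATUSES = {
    "stop": {"running", "waiting for resources"},
    "rerun": {"failed", "successful"},
}


def can_perform(operation, status):
    """Return True if the operation is allowed for a node in the given status."""
    if operation not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported operation: {operation}")
    return status in ALLOWED_STATUSES[operation]


def status_after_stop(status):
    """Stopping a running or waiting node sets its status to failed."""
    if not can_perform("stop", status):
        raise ValueError(f"a node in status {status!r} cannot be stopped")
    return "failed"


print(can_perform("stop", "running"))
print(can_perform("rerun", "running"))
print(status_after_stop("waiting for resources"))
```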

Instance status

No. Status
1 Succeeded
2 Not Running
3 Run failed
4 Running
5 Waiting time
6 Freeze