You can backfill data for an auto triggered node over a past or future time range so that the data is written to the corresponding time-based partitions. If the node code uses scheduling parameters, they are automatically replaced with specific values based on the data timestamp that you select for the backfill. Which partitions the data is written to depends on the logic and content of the node code. This topic describes how to backfill data for an auto triggered node and manage the data backfill instances that are generated for the node.
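
To make the substitution concrete, the following is a minimal sketch in Python of how a placeholder in node code might resolve for each data timestamp in a backfill range. The placeholder name `${bizdate}`, its `yyyymmdd` format, the table names, and the `render_node_code` helper are assumptions for illustration only. The scheduling parameters that are actually replaced depend on how your node is configured.

```python
from datetime import date, timedelta


def data_timestamps(start: date, end: date) -> list[date]:
    """Return every data timestamp from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


def render_node_code(code: str, data_timestamp: date) -> str:
    """Replace the hypothetical ${bizdate} placeholder with a yyyymmdd value."""
    return code.replace("${bizdate}", data_timestamp.strftime("%Y%m%d"))


# Hypothetical node code that writes to a date partition named ds.
NODE_CODE = (
    "INSERT OVERWRITE TABLE sales_daily PARTITION (ds='${bizdate}') "
    "SELECT * FROM sales_src WHERE ds='${bizdate}';"
)

# One rendered statement per data timestamp in the backfill range.
rendered = [
    render_node_code(NODE_CODE, ts)
    for ts in data_timestamps(date(2021, 1, 11), date(2021, 1, 13))
]
```

Each rendered statement targets a different partition, which is why the partitions that receive data depend on both the data timestamp and the node code.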

Background information

After an auto triggered node is developed, committed, and deployed to the scheduling system, the scheduling system runs the node based on the scheduling configurations of the node. If you want to run the auto triggered node in a specified time range, you can backfill data for the node. For more information about how to backfill data for an auto triggered node, see Backfill data in this topic. The following data backfill modes are supported:
  • Backfill Data for Current Node: This mode allows you to backfill data for the current node.
  • Current and Descendent Nodes Retroactively: This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a small number of descendant nodes, we recommend that you use this mode. In this mode, you can specify the descendant nodes for which you want to backfill data.
  • Backfill Data for Massive Nodes: This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a large number of descendant nodes, we recommend that you use this mode. In this mode, you can filter the descendant nodes by workspace. You can configure a whitelist to backfill data for some nodes that are not in the selected workspaces. You can also configure a blacklist to avoid backfilling data for some nodes that are in the selected workspaces.
  • Advanced Mode: This mode allows you to backfill data for multiple nodes at a time. In this mode, you can select nodes that do not depend on each other. You can select the nodes for which you want to backfill data in the directed acyclic graph (DAG) of an auto triggered node or in the node list on the Cycle Task page.
    • In the DAG, you can use the node aggregation feature to group nodes by workspace, owner, or priority. This way, you can backfill data for multiple nodes at a time by specifying a node group. For more information about a DAG, see Appendix: Use the features provided in a DAG.
    • You can also select nodes in the node list on the Cycle Task page. You can filter nodes based on specific conditions and select the nodes for which you want to backfill data.
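
The workspace filter, whitelist, and blacklist of the Backfill Data for Massive Nodes mode can be read as a set operation. The sketch below is an illustration only: the `Node` type, the node IDs, the workspace names, and the `select_nodes` helper are hypothetical, and it assumes that whitelist entries are chosen from the descendant nodes of the current node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    workspace: str


def select_nodes(descendants, selected_workspaces,
                 whitelist=frozenset(), blacklist=frozenset()):
    """Return the IDs of the descendant nodes for which data is backfilled."""
    selected = set()
    for node in descendants:
        in_selected_workspace = node.workspace in selected_workspaces
        if in_selected_workspace and node.node_id not in blacklist:
            # In a selected workspace and not excluded by the blacklist.
            selected.add(node.node_id)
        elif not in_selected_workspace and node.node_id in whitelist:
            # Outside the selected workspaces but included by the whitelist.
            selected.add(node.node_id)
    return selected


descendants = [
    Node(101, "ws_a"), Node(102, "ws_a"),
    Node(201, "ws_b"), Node(202, "ws_b"),
]
chosen = select_nodes(descendants, {"ws_a"}, whitelist={201}, blacklist={102})
```

In this example, node 102 is dropped by the blacklist even though its workspace is selected, and node 201 is added by the whitelist even though its workspace is not.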

Limits

  • You can use the advanced mode only in workspaces that reside in the China (Shenzhen) and UAE (Dubai) regions.
  • Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.
  • Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).
  • Instances that run on exclusive resource groups for scheduling are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).
  • Every day, the system clears excess run logs when the run logs generated for auto triggered node instances that have finished running exceed 3 MB in size.
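
As a quick reference, the retention periods above can be expressed as a lookup. This is a sketch only: the dictionary keys and the `expiry_dates` helper are hypothetical names, and it assumes that retention is counted from the date on which the instance is created. The documented periods are approximate, so treat the computed dates as estimates.

```python
from datetime import date, timedelta

# Retention in days by type of resource group for scheduling.
RETENTION_DAYS = {
    "shared": {"instance": 30, "log": 7},
    "exclusive": {"instance": 30, "log": 30},
}


def expiry_dates(created: date, resource_group_type: str) -> dict[str, date]:
    """Estimate when an instance and its logs are cleared."""
    periods = RETENTION_DAYS[resource_group_type]
    return {kind: created + timedelta(days=days) for kind, days in periods.items()}


shared = expiry_dates(date(2021, 8, 24), "shared")
exclusive = expiry_dates(date(2021, 8, 24), "exclusive")
```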

Precautions

  • When DataWorks backfills data for a node for a specified time range, if an instance generated for the node fails on a day within the time range, the status of the data backfill instance of the node for that day is also set to failed. In this case, DataWorks does not run the instances generated for this node for the next day. DataWorks runs the instances generated for a node on a day only after all instances generated for the node on the previous day are successfully run.
  • If you backfill data for a node that is scheduled by hour or minute for a specific day, whether the instances scheduled to run for the node on that day are run in parallel with the data backfill instances depends on whether the node is configured to depend on its own instance in the previous cycle. If this dependency is configured and the first instance to be backfilled depends on an instance from the previous day that failed, the data backfill instance cannot be triggered to run. If the first instance to be backfilled does not depend on an instance from the previous day, the data backfill instance is run directly.
  • If both an auto triggered node instance and a data backfill instance are running for a node, you must stop the data backfill instance to ensure that the auto triggered node instance can be run as expected.
  • If you backfill data for multiple instances or run a large number of data backfill instances in parallel, scheduling resources may be insufficient. Make sure that your configurations are appropriate based on your business requirements.
  • To prevent data backfill instances from occupying large amounts of resources and affecting the running of auto triggered node instances, the following rules apply to data backfill instances:
    • If you backfill data for a node whose data timestamp is the previous day, the priority of a data backfill instance generated for the node is determined by the priority of the baseline to which the node belongs.
    • If you backfill data for a node whose data timestamp is the day before the previous day, the priority of the node is downgraded based on the following rules:
      • If the priority of the node is 7 or 8, the priority is downgraded to 3.
      • If the priority of the node is 3 or 5, the priority is downgraded to 2.
      • If the priority of the node is 1, the priority remains unchanged.
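
The priority rules above can be summarized as a small mapping. The following sketch is illustrative: the `backfill_priority` function and its parameter names are hypothetical, and it covers only the two cases that the rules describe (a data timestamp of the previous day, or of the day before the previous day). Priorities that the rules do not list are rejected rather than guessed.

```python
# Downgrade table for a data timestamp of the day before the previous day.
DOWNGRADED_PRIORITY = {8: 3, 7: 3, 5: 2, 3: 2, 1: 1}


def backfill_priority(baseline_priority: int, is_previous_day: bool) -> int:
    """Return the priority used for a data backfill instance.

    is_previous_day=True: the data timestamp is the previous day, so the
    baseline priority is used as-is.
    is_previous_day=False: the data timestamp is the day before the previous
    day, so the priority is downgraded according to the table.
    """
    if is_previous_day:
        return baseline_priority
    if baseline_priority not in DOWNGRADED_PRIORITY:
        raise ValueError(f"no downgrade rule listed for priority {baseline_priority}")
    return DOWNGRADED_PRIORITY[baseline_priority]
```

For example, `backfill_priority(8, is_previous_day=False)` returns 3, while `backfill_priority(8, is_previous_day=True)` keeps the baseline priority of 8.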

Backfill data

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace that you want to manage resides. Find the workspace and click DataStudio in the Actions column.
  2. In the left-side navigation pane of the Operation Center page, choose Cycle Task Maintenance > Cycle Task.
  3. Backfill data for nodes.
    1. On the Cycle Task page, find the desired auto triggered node and click the node name to open the DAG of the node.
      You can also click the Show icon to show the node list. Then, find the desired node and click DAG in the Actions column to open the DAG of the node.
    2. In the DAG, right-click the node for which you want to backfill data. In the shortcut menu that appears, move the pointer over Run and select a data backfill mode. In the dialog box that appears, configure the parameters.
    Note You can also perform the following steps on the Cycle Task page to backfill data for an auto triggered node: Click the Show icon to show the auto triggered node list. Find the desired node, click Backfill Data in the Actions column, and then select a data backfill mode.
    Figure: Data backfill modes
    You can configure the parameters required for each data backfill mode based on the following descriptions:
    • Backfill data in Backfill Data for Current Node mode.
      Figure: Backfill data for the current node
      The following table describes the parameters required for this mode.
      Parameter | Description
      Data Backfill Instance Name

      DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements.

      Node

      The name of the node for which you want to backfill data.

      Data Timestamp
      The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
      • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
      • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

        For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

      Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
      Concurrency
      Specifies whether to run multiple data backfill instances in parallel.
      • If you set Concurrency to No, the data backfill instances are run in sequence based on the data timestamps.
      • If you set Concurrency to Yes, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. Data backfill instances with different data timestamps can be run at the same time.
      Number of data backfill instances run in parallel
      The number of data backfill instances that are generated and run in parallel during data backfill.
      Note You must configure the number of data backfill instances that are run in parallel if you set Concurrency to Yes.
      You can set the number of data backfill instances that are run in parallel to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
      • If the number of data timestamps is less than the number of data backfill instances that are run in parallel, the data backfill instances are run in parallel. For example, the data timestamps are from January 11 to January 13, and you set the number of data backfill instances that are run in parallel to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
      • If the number of data timestamps is greater than the number of data backfill instances that are run in parallel, the system runs some data backfill instances in parallel and the remaining data timestamps in sequence. For example, the data timestamps are from January 11 to January 13, and you set the number of data backfill instances that are run in parallel to 2. In this case, two data backfill instances are generated and run in parallel. One of the two instances covers two data timestamps, so it runs a second time on its own for its remaining data timestamp.
      Order

      Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

      Resource Group for Scheduling
      Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances.
      • If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance.
      • If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance.
      Execution Period
      The period of time during which a data backfill instance is run.
      • If you set this parameter to Yes, a time picker appears and you can select a cycle based on which a data backfill instance is run and a specific point in time to start to run the data backfill instance.
      • If you set this parameter to No, the data backfill instance is immediately run in most cases. If you set Data Timestamp for the data backfill instance to the current date or a date later than the current date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled.
    • Backfill data in Current and Descendent Nodes Retroactively mode.
      Figure: Backfill data for the current node and its descendant nodes
      The following table describes the parameters required for this mode.
      Parameter | Description
      Data Backfill Instance Name

      DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements.

      Data Timestamp
      The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
      • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
      • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

        For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

      Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
      Concurrency
      Specifies whether to run multiple data backfill instances in parallel.
      • If you set Concurrency to No, the data backfill instances are run in sequence based on the data timestamps.
      • If you set Concurrency to Yes, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. Data backfill instances with different data timestamps can be run at the same time.
      Number of data backfill instances run in parallel
      You can set the number of data backfill instances that are run in parallel to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
      • If the number of data timestamps is less than the number of data backfill instances that are run in parallel, the data backfill instances are run in parallel. For example, the data timestamps are from January 11 to January 13, and you set the number of data backfill instances that are run in parallel to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
      • If the number of data timestamps is greater than the number of data backfill instances that are run in parallel, the system runs some data backfill instances in parallel and the remaining data timestamps in sequence. For example, the data timestamps are from January 11 to January 13, and you set the number of data backfill instances that are run in parallel to 2. In this case, two data backfill instances are generated and run in parallel. One of the two instances covers two data timestamps, so it runs a second time on its own for its remaining data timestamp.
      Order

      Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

      Nodes
      You can filter nodes by name and level and select the nodes for which you want to backfill data.
      Note
      • A fuzzy search is supported when you search for the desired node by node name. After you enter a keyword, all nodes whose names contain the keyword are displayed in the table below the search box.
      • The search scope includes the current node and its descendant nodes of all levels. You can select the current node and some or all of its descendant nodes.
      Resource Group for Scheduling
      Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances.
      • If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance.
      • If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance.
      Execution Period
      The period of time during which a data backfill instance is run.
      • If you set this parameter to Yes, a time picker appears and you can select a cycle based on which a data backfill instance is run and a specific point in time to start to run the data backfill instance.
      • If you set this parameter to No, the data backfill instance is immediately run in most cases. If you set Data Timestamp for the data backfill instance to the current date or a date later than the current date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled.
    • Backfill data in Backfill Data for Massive Nodes mode.
      Figure: Backfill data for a large number of nodes
      The following table describes the parameters required for this mode.
      Parameter | Description
      Data Backfill Instance Name

      DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements.

      Data Timestamp
      The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
      • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
      • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

        For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

      Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
      Order

      Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

      Select Nodes Requiring Data Backfill by Workspace
      You can select workspaces in the Available Workspaces section and add them to the Selected Workspaces section. This way, you can backfill data for all the descendant nodes of the current node that are in the workspaces you select.
      Note
      • A fuzzy search is supported when you search for the desired workspace by keyword. After you enter a keyword, all workspaces whose names contain the keyword are displayed in both sections.
      • You can select only workspaces that reside in the current region.
      • You can configure a whitelist to backfill data for some nodes that are not in the selected workspaces. You can also configure a blacklist to avoid backfilling data for some nodes that are in the selected workspaces.
      • You can specify whether to backfill data for the current node.
        • If you select Current Node, the system backfills data for the current node and its descendant nodes.
        • If you clear Current Node, the current node performs a dry run, and the system backfills data for the descendant nodes of the current node.
        For information about dry-run instances, see Dry-run instances.
      Node Whitelist
      You can select the nodes that are not in the selected workspaces to backfill data for the nodes.
      Note You can search for nodes only by node ID.
      Node Blacklist
      You can select the nodes for which you do not want to backfill data in the selected workspaces.
      Note You can search for nodes only by node ID.
      Resource Group for Scheduling
      Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances.
      • If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance.
      • If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance.
      Execution Period
      The period of time during which a data backfill instance is run.
      • If you set this parameter to Yes, a time picker appears and you can select a cycle based on which a data backfill instance is run and a specific point in time to start to run the data backfill instance.
      • If you set this parameter to No, the data backfill instance is immediately run in most cases. If you set Data Timestamp for the data backfill instance to the current date or a date later than the current date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled.
    • Backfill data in Advanced Mode.
      In advanced mode, you can use the node aggregation feature provided by the DAG of an auto triggered node to group nodes by workspace, owner, or priority. You can backfill data for nodes that do not depend on each other.
      Figure: Backfill data in advanced mode
      To backfill data in advanced mode, perform the following steps:
      1. Select the nodes for which you want to backfill data.
        • In the DAG of an auto triggered node, you can click the Not Aggregate, Aggregate By Workspace, Aggregate By Owner, or Aggregate By Priority icon in the area marked with 1 to use the node aggregation feature. This way, you can group nodes by workspace, owner, or priority. You can select the check box in the upper-right corner of a group to select all the nodes in the group in the area marked with 2. For more information about the node aggregation feature of a DAG, see Appendix: Use the features provided in a DAG.
        • You can also select nodes in the node list on the Cycle Task page. You can search for the desired nodes based on different conditions such as the node name, node type, owner, and resource group for scheduling in the area marked with 3. You can select the auto triggered nodes for which you want to backfill data in the area marked with 4 and click Add in the lower part of the page.
          Note This way, the system generates data backfill instances for all the selected auto triggered nodes. If you want to generate data backfill instances for a specific auto triggered node, click the name of the node in the node list to open the DAG of the node. In the DAG, right-click the node and select a data backfill mode to backfill data for the node based on your business requirements.
      2. View the selected nodes.
        After the nodes for which you want to backfill data are selected, you can view the selected nodes in the Run dialog box in the area marked with 5. You can also perform the following operations:
        • Click the Locate icon next to the name of a node to open the DAG of the node. You can re-select the nodes for which you want to backfill data.
        • Click the Delete icon next to the name of a node to remove the node.
      3. In the Run dialog box in the area marked with 5, click Configure to configure the parameters for data backfill.
        Figure: Advanced mode
        The following table describes the parameters required for this mode.
        Parameter | Description
        Data Backfill Instance Name

        DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements.

        Selected Nodes
        The number of nodes for which you want to backfill data. You can click Change to change the nodes for which you want to backfill data.
        Data Timestamp
        The data timestamp of the data backfill instance. A data timestamp specifies a specific date.
        • If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
        • If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system runs the data backfill instance immediately after the data timestamp passes.

          For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021.

        Note We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
        Concurrency
        Specifies whether to run multiple data backfill instances in parallel.
        • If you set Concurrency to No, the data backfill instances are run in sequence based on the data timestamps.
        • If you set Concurrency to Yes, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. Data backfill instances with different data timestamps can be run at the same time.
        Number of data backfill instances run in parallel
        You can set the number of data backfill instances that are run in parallel to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
        • If the number of data timestamps is less than the number of data backfill instances that are run in parallel, the data backfill instances are run in parallel. For example, the data timestamps are from January 11 to January 13, and you set the number of data backfill instances that are run in parallel to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
        • If the number of data timestamps is greater than the number of data backfill instances that are run in parallel, the system runs some data backfill instances in parallel and the remaining data timestamps in sequence. For example, the data timestamps are from January 11 to January 13, and you set the number of data backfill instances that are run in parallel to 2. In this case, two data backfill instances are generated and run in parallel. One of the two instances covers two data timestamps, so it runs a second time on its own for its remaining data timestamp.
        Order

        Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps.

        Resource Group for Scheduling
        Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances.
        • If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance.
        • If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance.
        Execution Period
        The period of time during which a data backfill instance is run.
        • If you set this parameter to Yes, a time picker appears and you can select a cycle based on which a data backfill instance is run and a specific point in time to start to run the data backfill instance.
        • If you set this parameter to No, the data backfill instance is immediately run in most cases. If you set Data Timestamp for the data backfill instance to the current date or a date later than the current date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled.
  4. Click OK to start to backfill data.
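
The rules described for the Number of data backfill instances run in parallel parameter can be sketched as follows. This is a hedged illustration, not the scheduler's actual algorithm: the `plan_parallel_backfill` function is hypothetical, and it assumes that data timestamps are split into contiguous groups of near-equal size, which the parameter description does not specify. Each group stands for one data backfill instance. Groups run in parallel, and the data timestamps inside a group run in sequence.

```python
from datetime import date, timedelta


def plan_parallel_backfill(timestamps: list[date], parallelism: int) -> list[list[date]]:
    """Split data timestamps into groups, one group per data backfill instance."""
    if not 2 <= parallelism <= 10:
        raise ValueError("parallelism must be an integer from 2 to 10")
    if not timestamps:
        return []
    # Never generate more instances than there are data timestamps.
    group_count = min(parallelism, len(timestamps))
    base, extra = divmod(len(timestamps), group_count)
    groups, start = [], 0
    for index in range(group_count):
        # The first `extra` groups take one additional data timestamp.
        size = base + (1 if index < extra else 0)
        groups.append(timestamps[start:start + size])
        start += size
    return groups


# January 11 to January 13; the year 2021 is an arbitrary choice for the example.
days = [date(2021, 1, 11) + timedelta(days=offset) for offset in range(3)]
four_way = plan_parallel_backfill(days, 4)  # three groups of one data timestamp
two_way = plan_parallel_backfill(days, 2)   # one group of two, one group of one
```

With a parallelism of 4, each of the three data timestamps gets its own instance. With a parallelism of 2, one instance holds two data timestamps and therefore runs twice, which matches the examples in the parameter description.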

Manage data backfill instances

After you make the preceding configurations, data backfill instances are generated. Then, you can view the details and status of a data backfill instance, and stop or rerun a data backfill instance on the Patch Data page of Operation Center. For more information about how to go to Operation Center, see the steps described in the Backfill data section. The following table describes the operations that you can perform in different sections shown in the following figure.
Figure: Manage data backfill instances
Area | Description
1
In this area, you can specify filter conditions to search for a data backfill instance.

For example, you can search for a data backfill instance by node name, node ID, or one or more of the following conditions: Retroactive Instance Name, Created By, Creation Date, Status, Data Timestamp, My Nodes, and Initiated by Me.

Note
  • You can click Show Search Options if you want to specify more filter conditions such as Node Type, Scheduling Resource Group, and Engine Instance.
  • Fuzzy match is supported when you search for the desired node by node name. After you enter a keyword, all nodes whose names contain the keyword are displayed.
2
In this area, you can view the following information about a data backfill instance:
  • Node Name: the name of the data backfill instance. Click the Show icon before the name of the data backfill instance and view the information about the instance in the area marked with 3, such as the date when the data backfill instance is run and details about the nodes for which the instance is generated.
  • Check Status: the check status of the data backfill instance.
  • Running status: the status of the data backfill instance. The data backfill instance can be in the state of running, not running, waiting for resources, exception, or stopped.
  • Created By: the Alibaba Cloud account within which the data backfill instance is generated.
  • Creation Date: the date when the data backfill instance is generated.
  • Nodes: the number of nodes for which the data backfill instance is generated.
  • Data Timestamp: the date when the data backfill instance is run.
In this area, you can also perform the following operations on data backfill instances:
  • Stop: You can stop multiple data backfill instances that are running or waiting for resources at a time. After you perform this operation, the status of the instances is set to failed.
    Note
    • Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.
    • You cannot stop data backfill instances that failed, are not running, or are successfully run.
  • Batch Rerun: You can rerun multiple data backfill instances at a time.
    Note You can batch rerun only data backfill instances that failed.
  • Reuse: You can reuse a data backfill instance. This way, you can quickly determine the nodes for which you want to backfill data.
    Note Data backfill instances that are generated for nodes whose data is backfilled in Backfill Data for Massive Nodes mode cannot be reused.
3
In this area, you can view the following information about each node for which the data backfill instance is generated:
  • Name: the name of the node. You can click the node name to open the DAG of the node and view the details about the node.
  • Owner: the owner of the workspace to which the node belongs.
  • Schedule: the scheduling time of the node.
  • Start run time: the time when the node starts to run.
  • End Time: the time when the node stops running.
  • Runtime: the time consumed to run the node.
In this area, you can also perform the following operations on a node:
  • Stop: If the node is running or waiting for resources, you can stop the node. Then, the status of the node is set to failed.
    Note You cannot stop nodes that failed, are not running, or are successfully run.
  • Rerun: You can rerun the node.
    Note You can rerun only nodes that failed or are successfully run.
  • Rerun Descendant Nodes: You can rerun the descendant nodes of the node.
  • Set Status to Successful: You can set the status of the node to successful.
  • Freeze: You can freeze the node to pause the scheduling of the node.
  • Unfreeze: If the node is frozen, you can unfreeze the node to resume the scheduling of the node.
  • View Lineage: You can view the lineage of the node.
4
You can select multiple nodes in the area marked with 3 and click Stop or Rerun in the area marked with 4 to stop or rerun the selected nodes at a time.
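
The notes on the Stop and Rerun operations for nodes amount to a status check. The sketch below encodes only what those notes state. The status strings and the `allowed_operations` helper are hypothetical names chosen for the example, and operations whose preconditions are not described (such as Freeze or Set Status to Successful) are left out.

```python
# Statuses in which each operation is allowed, per the notes above.
STOPPABLE = {"running", "waiting_for_resources"}
RERUNNABLE = {"failed", "succeeded"}


def allowed_operations(status: str) -> set[str]:
    """Return which of Stop and Rerun are allowed for a node in this status."""
    operations = set()
    if status in STOPPABLE:
        operations.add("stop")
    if status in RERUNNABLE:
        operations.add("rerun")
    return operations
```

For example, a node that is waiting for resources can be stopped but not rerun, and a node that is not running can be neither stopped nor rerun.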

Instance status

No. | Status
1 | Succeeded
2 | Not Running
3 | Run failed
4 | Running
5 | Waiting time
6 | Freeze

FAQ

For answers to frequently asked questions about data backfill, see Data backfill.