You can backfill data for an auto triggered node over a past or future time range to write the data to time-based partitions. If the node code uses scheduling parameters, the parameters are automatically replaced with specific values based on the data timestamp that you configure for the data backfill. The data that corresponds to the data timestamp is written to the partitions determined by the logic and content of the node code. This topic describes how to backfill data for an auto triggered node and how to manage the data backfill instances generated for the node.
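The substitution described above can be pictured with a minimal Python sketch. The `${bizdate}` placeholder name, the `yyyymmdd` output format, the sample SQL statement, and the `render_node_code` helper are assumptions made for illustration only; they are not taken from this topic, and DataWorks performs the actual replacement internally.

```python
from datetime import date

def render_node_code(code: str, data_timestamp: date) -> str:
    """Replace the hypothetical ${bizdate} placeholder with the data timestamp.

    The placeholder name and the yyyymmdd format are assumptions for this sketch.
    """
    return code.replace("${bizdate}", data_timestamp.strftime("%Y%m%d"))

# Hypothetical node code that writes to a date-based partition.
node_code = "INSERT OVERWRITE TABLE sales PARTITION (ds='${bizdate}') SELECT * FROM src;"

# Backfilling for the data timestamp September 17, 2021 targets partition ds='20210917'.
rendered = render_node_code(node_code, date(2021, 9, 17))
print(rendered)
```

Each data timestamp in the backfill range yields its own rendered statement, which is why one backfill run can populate several partitions.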
Background information
The following data backfill modes are available:
- Backfill Data for Current Node: This mode allows you to backfill data only for the current node.
- Current and Descendent Nodes Retroactively: This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a small number of descendant nodes, we recommend that you use this mode. In this mode, you can specify the descendant nodes for which you want to backfill data.
- Backfill Data for Massive Nodes: This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a large number of descendant nodes, we recommend that you use this mode. In this mode, you can filter the descendant nodes by workspace. You can configure a whitelist to backfill data for some nodes that are not in the selected workspaces. You can also configure a blacklist to avoid backfilling data for some nodes that are in the selected workspaces.
- Advanced Mode: This mode allows you to backfill data for multiple nodes at a time, including nodes that do not depend on each other. You can select the nodes in the directed acyclic graph (DAG) of an auto triggered node or in the node list on the Cycle Task page.
  - In the DAG, you can use the node aggregation feature to group nodes by workspace, owner, or priority. This way, you can backfill data for multiple nodes at a time by specifying a node group. For more information about a DAG, see Appendix: Use the features provided in a DAG.
  - In the node list on the Cycle Task page, you can filter nodes based on specific conditions and select the nodes for which you want to backfill data.
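The workspace, whitelist, and blacklist filtering of the Backfill Data for Massive Nodes mode can be sketched as a small set computation. This is only an illustration under stated assumptions: the `Node` type and `select_nodes` function are hypothetical, and letting the blacklist take precedence when a node appears in both lists is an assumption that this topic does not confirm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: int
    workspace: str

def select_nodes(candidates, selected_workspaces, whitelist_ids=frozenset(), blacklist_ids=frozenset()):
    """Return the sorted IDs of the nodes for which data would be backfilled.

    A node is in scope if it is in a selected workspace or on the whitelist.
    Assumption: a blacklisted node is always excluded, even if whitelisted.
    """
    chosen = set()
    for node in candidates:
        in_scope = node.workspace in selected_workspaces or node.node_id in whitelist_ids
        if in_scope and node.node_id not in blacklist_ids:
            chosen.add(node.node_id)
    return sorted(chosen)

# Hypothetical descendant nodes spread across three workspaces.
nodes = [Node(1, "sales"), Node(2, "sales"), Node(3, "finance"), Node(4, "ops")]
result = select_nodes(nodes, {"sales"}, whitelist_ids={3}, blacklist_ids={2})
print(result)  # [1, 3]
```

Node 2 is dropped although its workspace is selected, and node 3 is kept although its workspace is not, which mirrors the purpose of the two lists.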
Limits
- You can use the advanced mode only in workspaces that reside in the China (Shenzhen) and UAE (Dubai) regions.
- Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you no longer need a data backfill instance, you can freeze it.
  - Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).
  - Instances that run on exclusive resource groups for scheduling are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).
- The system regularly clears excess run logs every day when the size of run logs generated for the auto triggered node instances that finish running exceeds 3 MB.
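The retention periods above can be turned into a quick lookup for estimating when an instance and its logs are cleared. This is a sketch only: the `RETENTION_DAYS` table and the `expiry_dates` helper are hypothetical names, and because the validity period is described as approximate, the computed dates are estimates rather than guarantees.

```python
from datetime import date, timedelta

# Retention periods in days, as listed in the limits above.
RETENTION_DAYS = {
    "shared": {"instance": 30, "log": 7},
    "exclusive": {"instance": 30, "log": 30},
}

def expiry_dates(created_on: date, resource_group_type: str) -> dict:
    """Estimate the dates after which the instance and its logs are no longer retained."""
    days = RETENTION_DAYS[resource_group_type]
    return {kind: created_on + timedelta(days=n) for kind, n in days.items()}

shared = expiry_dates(date(2021, 8, 24), "shared")
print(shared["instance"], shared["log"])  # 2021-09-23 2021-08-31
```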
Precautions
- When DataWorks backfills data for a node for a specified time range, if an instance generated for the node fails on a day within the time range, the status of the data backfill instance of the node for that day is also set to failed. In this case, DataWorks does not run the instances generated for this node for the next day. DataWorks runs the instances generated for a node on a day only after all instances generated for the node on the previous day are successfully run.
- If you backfill data for a node scheduled by hour or minute for a specific day, whether instances scheduled to run on that day for the node are run in parallel with data backfill instances for the node depends on whether you configure an instance of the node in the current cycle to depend on the instance of the node in the previous cycle. If you configure an instance of the node in the current cycle to depend on the instance of the node in the previous cycle, and the first instance for which data needs to be backfilled depends on an instance generated on the previous day but the instance failed to run on the previous day, the node for which you backfill data cannot be triggered to run. If the first instance for which data needs to be backfilled does not depend on an instance generated on the previous day, the data backfill instance of the node is directly run.
- If both an auto triggered node instance and a data backfill instance are running for a node, you must stop the data backfill instance to ensure that the auto triggered node instance can be run as expected.
- If you backfill data for multiple instances or run a large number of data backfill instances in parallel, scheduling resources may be insufficient. Make sure that your configurations are appropriate based on your business requirements.
- To prevent data backfill instances from occupying large amounts of resources and affecting the running of auto triggered node instances, the following rules apply to data backfill instances:
  - If you backfill data for a node whose data timestamp is the previous day, the priority of a data backfill instance generated for the node is determined by the priority of the baseline to which the node belongs.
  - If you backfill data for a node whose data timestamp is the day before the previous day, the priority of the node must be downgraded based on the following rules:
    - If the priority of the node is 7 or 8, downgrade the priority of the node to 3.
    - If the priority of the node is 3 or 5, downgrade the priority of the node to 2.
    - If the priority of the node is 1, keep the priority unchanged.
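The downgrade rules above amount to a small lookup table. The following Python sketch encodes only the priorities that are listed; the `downgraded_priority` function is a hypothetical helper, and it deliberately raises an error for any priority that the rules do not mention instead of guessing a value.

```python
# Mapping from the original node priority to the downgraded priority.
DOWNGRADE = {8: 3, 7: 3, 5: 2, 3: 2, 1: 1}

def downgraded_priority(priority: int) -> int:
    """Return the downgraded priority for a node whose data timestamp is the day before the previous day."""
    try:
        return DOWNGRADE[priority]
    except KeyError:
        # The rules above do not cover this priority, so no value is assumed.
        raise ValueError(f"no downgrade rule listed for priority {priority}") from None

for p in (8, 7, 5, 3, 1):
    print(p, "->", downgraded_priority(p))
```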
Backfill data
- Go to the DataStudio page.
  - Log on to the DataWorks console.
  - In the left-side navigation pane, click Workspaces.
  - In the top navigation bar, select the region in which the workspace that you want to manage resides. Find the workspace and click DataStudio in the Actions column.
- In the left-side navigation pane of the Operation Center page, choose .
- Backfill data for nodes.
  Note: You can also perform the following steps on the Cycle Task page to backfill data for an auto triggered node: Click the icon to show the auto triggered node list. Find the desired node, click Backfill Data in the Actions column, and then select a data backfill mode.
You can configure the parameters required for each data backfill mode based on the following descriptions:
- Backfill data in Backfill Data for Current Node mode.
The following table describes the parameters required for this mode.
Parameter | Description |
---|---|
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements. |
Node | The name of the node for which you want to backfill data. |
Data Timestamp | The data timestamp of the data backfill instance. A data timestamp is a specific date. To backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps. If the data timestamp that you specify is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system then runs the data backfill instance immediately after the data timestamp passes. For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Concurrency | Specifies whether to run multiple data backfill instances in parallel. If you set Concurrency to No, the data backfill instances are run in sequence based on the data timestamps. If you set Concurrency to Yes, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. Data backfill instances with different data timestamps can be run at the same time. |
Number of data backfill instances run in parallel | The number of data backfill instances that are generated and run in parallel during data backfill. Note: You must configure this parameter if you set Concurrency to Yes. Valid values: integers from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel. If the number of data timestamps is less than the value of this parameter, one data backfill instance is generated for each data timestamp, and all the instances are run in parallel. For example, if the data timestamps are from January 11 to January 13 and you set this parameter to 4, three data backfill instances are generated, one for each data timestamp, and run in parallel. If the number of data timestamps is greater than the value of this parameter, the system processes some data timestamps in sequence and the others in parallel. For example, if the data timestamps are from January 11 to January 13 and you set this parameter to 2, two data backfill instances are generated and run in parallel. One of the instances covers two data timestamps and is run a second time for its second data timestamp. |
Order | Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, a time picker appears, and you can select the cycle based on which the data backfill instance is run and the point in time at which the instance starts to run. If you set this parameter to No, the data backfill instance is run immediately in most cases. However, if you set Data Timestamp to the current date or a later date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled. |
- Backfill data in Current and Descendent Nodes Retroactively mode.
The following table describes the parameters required for this mode.
Parameter | Description |
---|---|
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements. |
Data Timestamp | The data timestamp of the data backfill instance. A data timestamp is a specific date. To backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps. If the data timestamp that you specify is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system then runs the data backfill instance immediately after the data timestamp passes. For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Concurrency | Specifies whether to run multiple data backfill instances in parallel. If you set Concurrency to No, the data backfill instances are run in sequence based on the data timestamps. If you set Concurrency to Yes, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. Data backfill instances with different data timestamps can be run at the same time. |
Number of data backfill instances run in parallel | Valid values: integers from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel. If the number of data timestamps is less than the value of this parameter, one data backfill instance is generated for each data timestamp, and all the instances are run in parallel. For example, if the data timestamps are from January 11 to January 13 and you set this parameter to 4, three data backfill instances are generated, one for each data timestamp, and run in parallel. If the number of data timestamps is greater than the value of this parameter, the system processes some data timestamps in sequence and the others in parallel. For example, if the data timestamps are from January 11 to January 13 and you set this parameter to 2, two data backfill instances are generated and run in parallel. One of the instances covers two data timestamps and is run a second time for its second data timestamp. |
Order | Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps. |
Nodes | You can filter nodes by name and level and select the nodes for which you want to backfill data. Note: A fuzzy search is supported when you search for a node by node name. After you enter a keyword, all nodes whose names contain the keyword are displayed in the table below the search box. The search scope includes the current node and its descendant nodes of all levels. You can select the current node and some or all of its descendant nodes. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, a time picker appears, and you can select the cycle based on which the data backfill instance is run and the point in time at which the instance starts to run. If you set this parameter to No, the data backfill instance is run immediately in most cases. However, if you set Data Timestamp to the current date or a later date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled. |
- Backfill data in Backfill Data for Massive Nodes mode.
The following table describes the parameters required for this mode.
Parameter | Description |
---|---|
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements. |
Data Timestamp | The data timestamp of the data backfill instance. A data timestamp is a specific date. To backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps. If the data timestamp that you specify is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system then runs the data backfill instance immediately after the data timestamp passes. For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Order | Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps. |
Select Nodes Requiring Data Backfill by Workspace | You can select workspaces in the Available Workspaces section and add them to the Selected Workspaces section. This way, you can backfill data for all the descendant nodes of the current node that are in the selected workspaces. Note: A fuzzy search is supported when you search for a workspace by keyword. After you enter a keyword, all workspaces whose names contain the keyword are displayed in both sections. You can select only workspaces that reside in the current region. You can configure a whitelist to backfill data for some nodes that are not in the selected workspaces, and a blacklist to avoid backfilling data for some nodes that are in the selected workspaces. You can also specify whether to backfill data for the current node: if you select Current Node, the system backfills data for the current node and its descendant nodes; if you clear Current Node, the current node performs a dry run, and the system backfills data only for its descendant nodes. |
Node Whitelist | The nodes that are not in the selected workspaces but for which you want to backfill data. Note: You can search for nodes only by node ID. |
Node Blacklist | The nodes in the selected workspaces for which you do not want to backfill data. Note: You can search for nodes only by node ID. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, a time picker appears, and you can select the cycle based on which the data backfill instance is run and the point in time at which the instance starts to run. If you set this parameter to No, the data backfill instance is run immediately in most cases. However, if you set Data Timestamp to the current date or a later date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled. |
- Backfill data in Advanced Mode. In advanced mode, you can use the node aggregation feature provided by the DAG of an auto triggered node to group nodes by workspace, owner, or priority. You can also backfill data for nodes that do not depend on each other.
To backfill data in advanced mode, perform the following steps:
- Select the nodes for which you want to backfill data.
- In the DAG of an auto triggered node, you can click the Not Aggregate, Aggregate By Workspace, Aggregate By Owner, or Aggregate By Priority icon in the area marked with 1 to use the node aggregation feature. This way, you can group nodes by workspace, owner, or priority. You can select the check box in the upper-right corner of a group to select all the nodes in the group in the area marked with 2. For more information about the node aggregation feature of a DAG, see Appendix: Use the features provided in a DAG.
- You can also select nodes in the node list on the Cycle Task page. You can search for the desired nodes based on different conditions such as the node name, node type, owner, and resource group for scheduling in the area marked with 3. You can select the auto triggered nodes for which you want to backfill data in the area marked with 4 and click Add in the lower part of the page.
  Note: This way, the system generates data backfill instances for all the selected auto triggered nodes. If you want to generate data backfill instances for a specific auto triggered node, click the name of the node in the node list to open the DAG of the node. In the DAG, right-click the node and select a data backfill mode to backfill data for the node based on your business requirements.
- View the selected nodes. After you select the nodes for which you want to backfill data, you can view them in the Run dialog box in the area marked with 5. You can also perform the following operations:
  - Click the icon next to the name of a node to open the DAG of the node. You can then re-select the nodes for which you want to backfill data.
  - Click the icon next to the name of a node to remove the node.
- In the Run dialog box in the area marked with 5, click Configure to configure the parameters for data backfill.
The following table describes the parameters required for this mode.
Parameter | Description |
---|---|
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can change the name based on your business requirements. |
Selected Nodes | The number of nodes for which you want to backfill data. You can click Change to change the nodes for which you want to backfill data. |
Data Timestamp | The data timestamp of the data backfill instance. A data timestamp is a specific date. To backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps. If the data timestamp that you specify is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system then runs the data backfill instance immediately after the data timestamp passes. For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Concurrency | Specifies whether to run multiple data backfill instances in parallel. If you set Concurrency to No, the data backfill instances are run in sequence based on the data timestamps. If you set Concurrency to Yes, a specific number of data backfill instances are generated based on the data timestamps and run in parallel. Data backfill instances with different data timestamps can be run at the same time. |
Number of data backfill instances run in parallel | Valid values: integers from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel. If the number of data timestamps is less than the value of this parameter, one data backfill instance is generated for each data timestamp, and all the instances are run in parallel. For example, if the data timestamps are from January 11 to January 13 and you set this parameter to 4, three data backfill instances are generated, one for each data timestamp, and run in parallel. If the number of data timestamps is greater than the value of this parameter, the system processes some data timestamps in sequence and the others in parallel. For example, if the data timestamps are from January 11 to January 13 and you set this parameter to 2, two data backfill instances are generated and run in parallel. One of the instances covers two data timestamps and is run a second time for its second data timestamp. |
Order | Valid values: Ascending by Business Date and Descending by Business Date. You can backfill data in the ascending or descending order of data timestamps. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, a time picker appears, and you can select the cycle based on which the data backfill instance is run and the point in time at which the instance starts to run. If you set this parameter to No, the data backfill instance is run immediately in most cases. However, if you set Data Timestamp to the current date or a later date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled. |
- Click OK to start backfilling data.
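The interaction between the data timestamps and the Number of data backfill instances run in parallel parameter can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's actual algorithm: the `plan_instances` function is hypothetical, the year 2021 is chosen arbitrarily for the January dates, and assigning contiguous data timestamps to each instance (with earlier instances receiving the extra timestamp) is an assumption, because this topic states only that one instance covers two data timestamps.

```python
from datetime import date, timedelta

def plan_instances(timestamps, parallelism):
    """Split data timestamps into at most `parallelism` groups, one group per instance.

    Instances run in parallel; the timestamps inside one group run in sequence.
    """
    if not 2 <= parallelism <= 10:
        raise ValueError("parallelism must be an integer from 2 to 10")
    if not timestamps:
        return []
    count = min(parallelism, len(timestamps))
    base, extra = divmod(len(timestamps), count)
    groups, start = [], 0
    for i in range(count):
        size = base + (1 if i < extra else 0)  # assumption: earlier groups take the remainder
        groups.append(timestamps[start:start + size])
        start += size
    return groups

# Data timestamps from January 11 to January 13 (the year is arbitrary).
days = [date(2021, 1, 11) + timedelta(days=i) for i in range(3)]

print([len(g) for g in plan_instances(days, 4)])  # [1, 1, 1]
print([len(g) for g in plan_instances(days, 2)])  # [2, 1]
```

With a parallelism of 4, each of the three data timestamps gets its own instance. With a parallelism of 2, two instances are generated and one of them handles two data timestamps, which matches the examples in the parameter descriptions.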
Manage data backfill instances

Area | Description |
---|---|
1 | In this area, you can specify filter conditions to search for a data backfill instance. For example, you can search for a data backfill instance by node name, node ID, or one or more of the following conditions: Retroactive Instance Name, Created By, Creation Date, Status, Data Timestamp, My Nodes, and Initiated by Me. |
2 | In this area, you can view information about a data backfill instance and perform operations on data backfill instances. |
3 | In this area, you can view information about each node for which the data backfill instance is generated and perform operations on a node. |
4 | You can select multiple nodes in the area marked with 3 and click Stop or Rerun in the area marked with 4 to stop or rerun the selected nodes at a time. |
Instance status
No. | Status |
---|---|
1 | Succeeded |
2 | Not Running |
3 | Run failed |
4 | Running |
5 | Waiting time |
6 | Freeze |
FAQ
For answers to frequently asked questions about data backfill, see Data backfill.