You can backfill data for an auto triggered node over a past or future time range so that the data is written to the corresponding time-based partitions. If scheduling parameters are used in the node code, they are automatically replaced with specific values based on the data timestamp that you specify for the backfill. The data for each data timestamp is then written to partitions as defined by the business code, so the target partitions depend on the logic and content of the node code. This topic describes how to backfill data for an auto triggered node and manage the data backfill instances generated for the node.
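The following Python sketch illustrates the idea of replacing a scheduling parameter with a value derived from each data timestamp. It is not DataWorks code: the `${bizdate}` placeholder, its `yyyymmdd` format, the table name, and both function names are assumptions made for illustration only.

```python
from datetime import date, timedelta


def render_node_code(template: str, data_timestamp: date) -> str:
    """Replace the hypothetical ${bizdate} placeholder with the data timestamp (yyyymmdd)."""
    return template.replace("${bizdate}", data_timestamp.strftime("%Y%m%d"))


def render_backfill_range(template: str, start: date, end: date) -> list[str]:
    """Render the node code once for each data timestamp from start to end, inclusive."""
    day_count = (end - start).days + 1
    return [render_node_code(template, start + timedelta(days=i)) for i in range(day_count)]


# Hypothetical node code that writes to a partition named after the data timestamp.
NODE_CODE = "INSERT OVERWRITE TABLE sales PARTITION (ds='${bizdate}') SELECT * FROM staging;"

statements = render_backfill_range(NODE_CODE, date(2024, 1, 11), date(2024, 1, 13))
for statement in statements:
    print(statement)
```

Each rendered statement targets a different partition, which is why the partitions that receive data depend on how the node code uses the parameter.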
Background information
After an auto triggered node is developed, committed, and deployed to the scheduling system, the scheduling system runs the node based on the scheduling configurations of the node. If you want to run the auto triggered node in a specified time range, you can backfill data for the node. For information about how to backfill data for an auto triggered node, see Backfill data for the node in this topic. The following table describes the supported data backfill modes.
Data backfill mode | Description |
Backfill Data for Current Node | This mode allows you to backfill data for the current node. |
Current and Descendent Nodes Retroactively | This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a small number of descendant nodes, we recommend that you use this mode. In this mode, you can specify the descendant nodes for which you want to backfill data. |
Backfill Data for Massive Nodes | This mode allows you to backfill data for the current node and its descendant nodes at a time. If the current node has a large number of descendant nodes, we recommend that you use this mode. In this mode, you can filter the descendant nodes for which you want to backfill data by workspace. You can configure a whitelist to backfill data for some nodes that are not in the selected workspaces. You can also configure a blacklist to avoid backfilling data for some nodes that are in the selected workspaces. |
Advanced Mode | This mode allows you to backfill data for multiple nodes at a time. In this mode, you can select nodes that do not have dependencies with each other. You can select nodes for which you want to backfill data in the directed acyclic graph (DAG) of an auto triggered node or in the node list on the Cycle Task page. |
Limits
You can use the advanced mode only in workspaces that reside in the China (Shenzhen) and UAE (Dubai) regions.
Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.
Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).
Instances that run on exclusive resource groups for scheduling are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).
The system regularly clears excess run logs every day when the size of run logs generated for the auto triggered node instances that finish running exceeds 3 MB.
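As a rough illustration of the retention periods above, the following Python sketch computes when an instance and its logs would be cleared. The dictionary keys, the function name, and the choice to count the period from the run date are assumptions. The source states only the approximate durations.

```python
from datetime import date, timedelta

# Retention periods in days, taken from the limits above.
RETENTION_DAYS = {
    "shared": {"instance": 30, "log": 7},
    "exclusive": {"instance": 30, "log": 30},
}


def retention_deadlines(run_date: date, resource_group_type: str) -> dict[str, date]:
    """Return the approximate dates after which the instance and its logs are cleared.

    Assumption: the retention period is counted from the run date.
    """
    rules = RETENTION_DAYS[resource_group_type]
    return {kind: run_date + timedelta(days=days) for kind, days in rules.items()}


shared = retention_deadlines(date(2024, 1, 1), "shared")
exclusive = retention_deadlines(date(2024, 1, 1), "exclusive")
print(shared, exclusive)
```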
Precautions
When DataWorks backfills data for a node for a specified time range, if an instance generated for the node fails on a day within the time range, the status of the data backfill instance of the node for that day is also set to failed. In this case, DataWorks does not run the instances generated for this node for the next day. DataWorks runs the instances generated for a node on a day only after all instances generated for the node on the previous day are successfully run.
If you backfill data of a specific day for a node scheduled by hour or minute, whether the instances of the node for that day (both the instances scheduled to run on that day and the data backfill instances) are run in parallel depends on whether self-dependency is configured for the node.
If both an auto triggered node instance and a data backfill instance are running for a node, you must stop the data backfill instance to ensure that the auto triggered node instance can be run as expected.
If you backfill data for multiple instances or run a large number of data backfill instances in parallel, scheduling resources may be insufficient. Make sure that your configurations are appropriate based on your business requirements.
To prevent data backfill instances from occupying large amounts of resources and affecting the running of auto triggered node instances, you must abide by the following rules that are formulated for data backfill instances:
If you backfill data for a node whose data timestamp is the previous day, the priority of a data backfill instance generated for the node is determined by the priority of the baseline to which the node belongs.
If you backfill data for a node whose data timestamp is the day before the previous day, you must abide by the following rules to downgrade the priority of the node:
If the priority of the node is 7 or 8, downgrade the priority of the node to 3.
If the priority of the node is 3 or 5, downgrade the priority of the node to 2.
If the priority of the node is 1, keep the priority unchanged.
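The downgrade rules above amount to a small lookup table. The following Python sketch shows one way to express them. The function name is hypothetical, and priorities not listed in the rules are rejected because the source does not say how they are handled.

```python
# Mapping from the original node priority to the downgraded priority,
# as listed in the rules above.
DOWNGRADED_PRIORITY = {8: 3, 7: 3, 5: 2, 3: 2, 1: 1}


def downgrade_priority(node_priority: int) -> int:
    """Return the priority used when the data timestamp is the day before the previous day."""
    if node_priority not in DOWNGRADED_PRIORITY:
        # The source lists only priorities 1, 3, 5, 7, and 8.
        raise ValueError(f"no downgrade rule is documented for priority {node_priority}")
    return DOWNGRADED_PRIORITY[node_priority]


for priority in (8, 7, 5, 3, 1):
    print(priority, "->", downgrade_priority(priority))
```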
Go to the Patch Data page
Go to the Operation Center page.
Log on to the DataWorks console. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.
In the left-side navigation pane of the Operation Center page, choose .
Backfill data for the desired node.
Open the DAG of the desired node.
You can use one of the following methods to open the DAG of the desired node:
Method 1: Click the name of the desired node in the node list to open the DAG of the node.
Method 2: Click the icon to show the node list. Click DAG in the Actions column of the desired node to open the DAG of the node.
In the DAG, right-click the desired node. In the shortcut menu that appears, move the pointer over Run and select a data backfill mode.
Backfill data for the node
After you select a data backfill mode, configure the parameters in the Backfill Data dialog box and click OK.
Backfill data for the current node
The following table describes the parameters required for this mode.
Parameter | Description |
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements. |
Node | The name of the node for which you want to backfill data. |
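Data Timestamp | The data timestamp of the data backfill instance. A data timestamp specifies a specific date. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Concurrency | Specifies whether to run multiple data backfill instances in parallel. Valid values: No (the data backfill instances are run in sequence based on the data timestamps) and Yes (a specified number of data backfill instances are generated based on the data timestamps and run in parallel). |
Number of data backfill instances run in parallel | The number of data backfill instances that are generated and run in parallel during data backfill. You can set this parameter to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel: If the number of data timestamps is less than the specified number, one data backfill instance is generated for each data timestamp, and all the instances are run in parallel. If the number of data timestamps is greater than the specified number, the system runs some data backfill instances in sequence and the others in parallel based on the data timestamps. Note: You must configure this parameter if you set Concurrency to Yes. |
Alert for Data Backfill | Specifies whether to enable the alerting feature for data backfill. Valid values: Is (an alert is generated for data backfill if the trigger condition is met) and No (the alerting feature is disabled for data backfill). |
Trigger Condition | The trigger condition of an alert for data backfill. Valid values: Alert on Failure or Success, Alert on Success, and Alert on Failure. Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. |
Alert Notification Method | The notification method for an alert. The alert recipient must be the initiator for data backfill. Valid values: Text Message and Email, Text Message, Email. Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. |
Order | The sequence based on which data backfill instances are run. Valid values: Ascending by Business Date and Descending by Business Date. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, you can select a resource group for scheduling from the drop-down list that appears. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, you can use the time picker that appears to select the cycle based on which the data backfill instance is run and the point in time at which it starts to run. If you set this parameter to No, the data backfill instance is immediately run in most cases. |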
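The following Python sketch simulates how the Order parameter and the precaution about failed days interact when instances are run in sequence: data timestamps are processed in ascending or descending order, and processing stops at the first failure. The function and parameter names are hypothetical, `run_for_day` stands in for whatever actually runs the node, and applying the stop-on-failure rule to descending order is an assumption.

```python
from datetime import date, timedelta
from typing import Callable


def run_sequential_backfill(
    start: date,
    end: date,
    run_for_day: Callable[[date], bool],
    descending: bool = False,
) -> dict[date, str]:
    """Run one backfill per data timestamp in order and stop after the first failure."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    if descending:
        days.reverse()  # Descending by Business Date

    results: dict[date, str] = {}
    for day in days:
        succeeded = run_for_day(day)
        results[day] = "succeeded" if succeeded else "failed"
        if not succeeded:
            break  # later data timestamps are not run after a failure
    return results


# Hypothetical runner that fails on January 12.
def fake_runner(day: date) -> bool:
    return day != date(2024, 1, 12)


outcome = run_sequential_backfill(date(2024, 1, 11), date(2024, 1, 13), fake_runner)
print(outcome)
```

In this run, January 13 never starts because January 12 failed, which mirrors the behavior described in the Precautions section.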
Backfill data for the current node and its descendant nodes
The following table describes the parameters required for this mode.
Parameter | Description |
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements. |
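Data Timestamp | The data timestamp of the data backfill instance. A data timestamp specifies a specific date. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Concurrency | Specifies whether to run multiple data backfill instances in parallel. Valid values: No (the data backfill instances are run in sequence based on the data timestamps) and Yes (a specified number of data backfill instances are generated based on the data timestamps and run in parallel). |
Number of data backfill instances run in parallel | The number of data backfill instances that are generated and run in parallel during data backfill. You can set this parameter to an integer from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel: If the number of data timestamps is less than the specified number, one data backfill instance is generated for each data timestamp, and all the instances are run in parallel. If the number of data timestamps is greater than the specified number, the system runs some data backfill instances in sequence and the others in parallel based on the data timestamps. Note: You must configure this parameter if you set Concurrency to Yes. |
Alert for Data Backfill | Specifies whether to enable the alerting feature for data backfill. Valid values: Is (an alert is generated for data backfill if the trigger condition is met) and No (the alerting feature is disabled for data backfill). |
Trigger Condition | The trigger condition of an alert for data backfill. Valid values: Alert on Failure or Success, Alert on Success, and Alert on Failure. Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. |
Alert Notification Method | The notification method for an alert. The alert recipient must be the initiator for data backfill. Valid values: Text Message and Email, Text Message, Email. Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. |
Order | The sequence based on which data backfill instances are run. Valid values: Ascending by Business Date and Descending by Business Date. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, you can select a resource group for scheduling from the drop-down list that appears. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, you can use the time picker that appears to select the cycle based on which the data backfill instance is run and the point in time at which it starts to run. If you set this parameter to No, the data backfill instance is immediately run in most cases. |
Nodes | You can filter nodes by name and level and select the nodes for which you want to backfill data. |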
Backfill data for a large number of nodes
The following table describes the parameters required for this mode.
Parameter | Description |
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements. |
Data Timestamp | The data timestamp of the data backfill instance. A data timestamp specifies a specific date. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Alert for Data Backfill | Specifies whether to enable the alerting feature for data backfill. Valid values: Is (an alert is generated for data backfill if the trigger condition is met) and No (the alerting feature is disabled for data backfill). |
Trigger Condition | The trigger condition of an alert for data backfill. Valid values: Alert on Failure or Success, Alert on Success, and Alert on Failure. Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. |
Alert Notification Method | The notification method for an alert. The alert recipient must be the initiator for data backfill. Valid values: Text Message and Email, Text Message, Email. Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. |
Order | The sequence based on which data backfill instances are run. Valid values: Ascending by Business Date and Descending by Business Date. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, you can select a resource group for scheduling from the drop-down list that appears. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, you can use the time picker that appears to select the cycle based on which the data backfill instance is run and the point in time at which it starts to run. If you set this parameter to No, the data backfill instance is immediately run in most cases. |
Select Nodes Requiring Data Backfill by Workspace | You can select workspaces in the Available Workspaces section and add them to the Selected Workspaces section. This way, you can backfill data for the desired nodes in the selected workspaces. |
Node Whitelist | You can select the nodes that are not in the selected workspaces to backfill data for the nodes. Note: You can search for nodes only by node ID. |
Node Blacklist | You can select the nodes for which you do not want to backfill data in the selected workspaces. Note: You can search for nodes only by node ID. |
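The combination of selected workspaces, whitelist, and blacklist can be pictured as set operations on node IDs. The following Python sketch is a minimal model of that selection. The data layout (a mapping from node ID to workspace name), the function name, and the rule that the blacklist takes precedence over the whitelist are assumptions.

```python
def select_nodes_for_backfill(
    candidate_nodes: dict[int, str],
    selected_workspaces: set[str],
    whitelist: set[int] = frozenset(),
    blacklist: set[int] = frozenset(),
) -> list[int]:
    """Return the IDs of the nodes for which data is backfilled.

    candidate_nodes maps a node ID to the workspace that the node belongs to.
    """
    # Start with every candidate node in the selected workspaces.
    chosen = {node_id for node_id, ws in candidate_nodes.items() if ws in selected_workspaces}
    # The whitelist adds nodes that are outside the selected workspaces.
    chosen |= {node_id for node_id in whitelist if node_id in candidate_nodes}
    # The blacklist removes nodes (assumed to take precedence over the whitelist).
    chosen -= set(blacklist)
    return sorted(chosen)


# Hypothetical node IDs and workspace names.
nodes = {1001: "ws_sales", 1002: "ws_sales", 2001: "ws_finance", 3001: "ws_ops"}
picked = select_nodes_for_backfill(nodes, {"ws_sales"}, whitelist={2001}, blacklist={1002})
print(picked)
```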
Backfill data in advanced mode
In advanced mode, you can use the node aggregation feature provided by the DAG of an auto triggered node to group nodes by conditions such as node type or owner. You can backfill data for nodes that have no dependencies on each other. To backfill data in advanced mode, perform the following steps:
Select the nodes for which you want to backfill data.
In the DAG of an auto triggered node, you can click the Not Aggregate, Aggregate By Workspace, Aggregate By Owner, or Aggregate By Priority icon in the area marked with 1 to use the node aggregation feature. This way, you can group nodes by workspace, owner, or priority. You can select the check box in the upper-right corner of a group to select all the nodes in the group in the area marked with 2. For more information about the node aggregation feature of a DAG, see Appendix: Use the features provided in a DAG.
You can also select nodes in the node list on the Cycle Task page. You can search for the desired nodes based on different conditions such as the node name, node type, owner, and resource group for scheduling in the area marked with 3. You can select the auto triggered nodes for which you want to backfill data in the area marked with 4 and click Add in the lower part of the page.
Note: This way, the system generates data backfill instances for all the selected auto triggered nodes. If you want to generate data backfill instances for a specific auto triggered node, click the name of the node in the node list to open the DAG of the node. In the DAG, right-click the node and select a data backfill mode to backfill data for the node based on your business requirements.
View the selected nodes.
After the nodes for which you want to backfill data are selected, you can view the selected nodes in the Run dialog box in the area marked with 5. You can also perform the following operations:
Click the icon next to the name of a node to open the DAG of the node. You can re-select the nodes for which you want to backfill data.
Click the icon next to the name of a node to remove the node.
In the Run dialog box in the area marked with 5, click Configure to configure the parameters for data backfill.
The following table describes the parameters required for this mode.
Parameter | Description |
Data Backfill Instance Name | DataWorks automatically generates a data backfill instance name. You can modify the name based on your business requirements. |
Selected Nodes | The number of nodes for which you want to backfill data. You can click Change to change the nodes for which you want to backfill data. |
Data Timestamp | The data timestamp of the data backfill instance. A data timestamp specifies a specific date. If you want to backfill data for the node for multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps. If the data timestamp that you specify for a data backfill instance is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system then runs the data backfill instance immediately after the data timestamp passes. For example, if the current date is August 24, 2021, the data timestamp of a data backfill instance is September 17, 2021, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on September 18, 2021. Note: We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources. |
Concurrency | Specifies whether to run multiple data backfill instances in parallel. If you set Concurrency to No, the data backfill instances are run in sequence based on the data timestamps. If you set Concurrency to Yes, a specified number of data backfill instances are generated based on the data timestamps and run in parallel. Data backfill instances with different data timestamps are run at the same time. Note: If you backfill data of a specific day for a node scheduled by hour or minute, whether the instances of the node for that day (both the instances scheduled to run on that day and the data backfill instances) are run in parallel depends on whether self-dependency is configured for the node. |
Number of data backfill instances run in parallel | The number of data backfill instances that are generated and run in parallel during data backfill. You can set this parameter to an integer from 2 to 10. Note: You must configure this parameter if you set Concurrency to Yes. The following rules apply when multiple data backfill instances are run in parallel: If the number of data timestamps is less than the specified number, one data backfill instance is generated for each data timestamp, and all the instances are run in parallel. For example, the data timestamps are from January 11 to January 13, and you set this parameter to 4. In this case, a data backfill instance is generated for each of the three data timestamps, and the three data backfill instances are run in parallel. If the number of data timestamps is greater than the specified number, the system runs some data backfill instances in sequence and the others in parallel based on the data timestamps. For example, the data timestamps are from January 11 to January 13, and you set this parameter to 2. In this case, two data backfill instances are generated and run in parallel. One of them covers two data timestamps and runs its second data timestamp after the first one is complete. |
Alert for Data Backfill | Specifies whether to enable the alerting feature for data backfill. Is: An alert is generated for data backfill if the trigger condition is met. No: The alerting feature is disabled for data backfill. |
Trigger Condition | The trigger condition of an alert for data backfill. Valid values: Alert on Failure or Success (an alert is generated when data backfill succeeds or fails), Alert on Success (an alert is generated when data backfill succeeds), and Alert on Failure (an alert is generated when data backfill fails). Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. |
Alert Notification Method | The notification method for an alert. The alert recipient must be the initiator for data backfill. Valid values: Text Message and Email, Text Message, Email. Note: This parameter is required only if you select Is for the Alert for Data Backfill parameter. You can click Inspection contact information to check whether the mobile phone number or email address of the alert recipient is registered. If not, you can refer to Configure and view alert contacts to configure an alert recipient. |
Order | The sequence based on which data backfill instances are run. Valid values: Ascending by Business Date and Descending by Business Date. |
Resource Group for Scheduling | Specifies whether to select another resource group for scheduling to run a data backfill instance. If you use another resource group for scheduling to run a data backfill instance, the data backfill instance does not need to compete for resources with auto triggered node instances. If you set this parameter to Yes, a drop-down list appears, and you can select a resource group for scheduling to run the data backfill instance. If you set this parameter to No, the resource group for scheduling that is configured for the current node is used to run the data backfill instance. |
Execution Period | The period of time during which a data backfill instance is run. If you set this parameter to Yes, a time picker appears, and you can select the cycle based on which the data backfill instance is run and the point in time at which it starts to run. If you set this parameter to No, the data backfill instance is immediately run in most cases. However, if you set Data Timestamp for the data backfill instance to the current date or a later date and you do not select Run Retroactive Instances Scheduled to Run after the Current Time, the data backfill instance is run as scheduled. |
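The parallelism rules for the Number of data backfill instances run in parallel parameter can be sketched in Python as follows. The sketch reproduces the two January 11 to January 13 examples: each returned group represents one data backfill instance, groups run in parallel, and the data timestamps inside a group run in sequence. The function names and the way data timestamps are divided among groups (contiguous, near-equal chunks) are assumptions. The source does not specify the exact assignment or the year.

```python
from datetime import date, timedelta


def date_range(start: date, end: date) -> list[date]:
    """Return every data timestamp from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def plan_parallel_groups(timestamps: list[date], parallelism: int) -> list[list[date]]:
    """Split data timestamps into groups that are run in parallel."""
    if not 2 <= parallelism <= 10:
        raise ValueError("parallelism must be an integer from 2 to 10")
    if not timestamps:
        return []

    # Never create more groups than there are data timestamps.
    group_count = min(len(timestamps), parallelism)
    base, extra = divmod(len(timestamps), group_count)

    groups, start = [], 0
    for index in range(group_count):
        size = base + (1 if index < extra else 0)  # the first `extra` groups take one more
        groups.append(timestamps[start:start + size])
        start += size
    return groups


days = date_range(date(2024, 1, 11), date(2024, 1, 13))
print(plan_parallel_groups(days, 4))  # three groups, one data timestamp each
print(plan_parallel_groups(days, 2))  # two groups, one of which holds two data timestamps
```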
Manage data backfill instances
After you configure the preceding settings, data backfill instances are generated. Then, you can view the details and status of a data backfill instance, and stop or rerun a data backfill instance on the Patch Data page in Operation Center.
Area | Description |
1 | In this area, you can specify filter conditions to search for a data backfill instance. For example, you can search for a data backfill instance by node name, node ID, or one or more of the following conditions: Retroactive Instance Name, Created By, Creation Date, Status, Data Timestamp, My Nodes, and Initiated by Me. You can also terminate multiple running data backfill instances at a time. |
2 | In this area, you can view information about a data backfill instance and perform operations on data backfill instances. |
3 | In this area, you can view information about each node for which the data backfill instance is generated and perform operations on a node. |
4 | You can select multiple nodes in the area marked with 3 and click Stop or Rerun in the area marked with 4 to stop or rerun the selected nodes at a time. |
Instance status
No. | Status |
1 | Succeeded |
2 | Not Running |
3 | Run failed |
4 | Running |
5 | Waiting time |
6 | Freeze |