You can backfill data of a historical or future period of time for an auto triggered task to write the data to time-based partitions. Scheduling parameters that are used in the task code are automatically replaced with specific values based on the data timestamp that you configure to backfill data for the task. The data that corresponds to the data timestamp is written to specific partitions based on the business code. The partitions to which the data is written are related to the logic and content of the task code. This topic describes how to backfill data for an auto triggered task and manage data backfill instances generated for the task on the Data Backfill page.
Background information
After an auto triggered task is developed, committed, and deployed to the scheduling system, the scheduling system runs the task based on the scheduling configurations of the task. If you want to run the auto triggered task in a specified time range, you can backfill data for the task. The following table describes the methods of selecting tasks for which you want to backfill data.
Task selection method | Description | Scenario |
Manually Select | Select one or more tasks as root tasks. This way, you can manually select specific descendant tasks of the root tasks for which you want to backfill data. | |
Select by Link | Select a start task as the root task and one or more end tasks. Then, the system automatically determines that all tasks from the start task to the end task require data backfilling. | This method can be used to perform end-to-end data backfilling for tasks for which complex dependencies are configured. |
Select by Workspace | Select a task as the root task, and determine the tasks for which you want to backfill data based on the workspaces to which descendant tasks of the root task belong. | This method is suitable for scenarios in which descendant tasks of the current task belong to different workspaces and you want to backfill data for the descendant tasks. |
Specify Task and All Descendant Tasks | Select a root task. Then, the system automatically determines that the root task and all its descendant tasks require data backfilling. Important: You can view the tasks that are triggered to run only if the data backfill task is running. Proceed with caution. | This method can be used to backfill data for a root task and all its descendant tasks. |
Limits
Instance cleanup principles
Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.
Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).
Instances that run on exclusive resource groups for scheduling are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).
The system regularly clears excess run logs every day when the size of run logs generated for all the auto triggered task instances that finish running exceeds 3 MB.
Limits on permissions
For root tasks or their descendant tasks for which you want to backfill data, if you do not have required permissions on the workspaces to which the tasks belong, you cannot backfill data for these tasks. If a task in a workspace is an intermediate task for data backfilling, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks. Proceed with caution. If data needs to be backfilled for both ancestor tasks and descendant tasks of a task, the task is considered an intermediate task.
Precautions
Instance running
When DataWorks backfills data of a specified time range for a task, if an instance generated for the task fails on a day, the status of the other data backfill instances of the task for that day is also set to failed. In this case, DataWorks does not run the instances generated for this task on the next day. DataWorks runs the instances generated for a task on a day only after all instances generated for the task on the previous day are successfully run.
If you backfill data of a specific day for a task scheduled by hour or minute, whether all instances generated to run on that day for the task are run in parallel depends on whether you configure the self-dependency for the task.
If both an auto triggered task instance and a data backfill instance are triggered to run for a task, you must stop the data backfill instance to ensure that the auto triggered task instance can be run as expected.
You can add tasks that do not require data backfilling to a blacklist. If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.
Scheduling resources
If a large number of data backfill instances are run or a high data backfilling parallelism is configured, scheduling resources may be insufficient. Make sure that your configurations meet your business requirements.
To prevent data backfill instances from occupying large amounts of resources and affecting the running of auto triggered task instances, you must abide by the following rules that are formulated for data backfill instances:
If you backfill data for a task whose data timestamp is the previous day, the priority of a data backfill task created for the task is determined by the priority of the baseline to which the task belongs.
If you backfill data for a task whose data timestamp is the day before the previous day, you must abide by the following rules to downgrade the priority of the task:
If the priority of the task is 7 or 8, downgrade the priority of the task to 3.
If the priority of the task is 3 or 5, downgrade the priority of the task to 2.
If the priority of the task is 1, keep the priority unchanged.
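The downgrade rules above amount to a small lookup table. The following Python sketch is illustrative only: the names `DOWNGRADE` and `downgraded_priority` are hypothetical, and the handling of priorities that the rules do not mention (they are returned unchanged) is an assumption, not documented behavior.

```python
# Priority mapping for backfills whose data timestamp is the day before
# the previous day, taken from the rules listed above.
DOWNGRADE = {8: 3, 7: 3, 5: 2, 3: 2, 1: 1}


def downgraded_priority(priority: int) -> int:
    """Return the priority after the downgrade rules are applied.

    Assumption: a priority that the rules do not name is returned unchanged.
    """
    return DOWNGRADE.get(priority, priority)


if __name__ == "__main__":
    for p in (8, 7, 5, 3, 1):
        print(p, "->", downgraded_priority(p))
```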
Go to the Data Backfill page
Go to the Operation Center page.
Log on to the DataWorks console. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.
In the left-side navigation pane, choose
.
If you want to backfill data for a single auto triggered task, you can also perform the following operations: In the left-side navigation pane of the Operation Center page, choose
. On the page that appears, find the desired auto triggered task and click Backfill Data in the Actions column.
Step 1: Create a data backfill task
On the Data Backfill page, click Create Data Backfill Task and configure parameters based on your business requirements.
Configure parameters in the Basic Information section.
DataWorks automatically generates a data backfill task name. You can change the name based on your business requirements.
Configure parameters in the Tasks That Require Data Backfill section.
You can backfill data for tasks on which you have required permissions by using one of the following methods: Manually Select, Select by Link, Select by Workspace, and Specify Task and All Descendant Tasks. You can also select other tasks for which you want to backfill data based on the tasks. The parameters that you can configure vary based on the method that you select.
Manually Select
Select one or more tasks as root tasks. This way, you can manually select specific descendant tasks of the root tasks for which you want to backfill data. The original plans of backfilling data for the current task, backfilling data for the current task and its descendant tasks, and backfilling data in advanced mode are compatible with this method.
The following table describes the parameters.
Parameter | Description |
Task Selection Method | Select Manually Select. |
Add Root Tasks | You can search for and add a root task by task name or ID. You can also click Batch Add and specify specific conditions such as resource group, scheduling cycle, and workspace, to add multiple root tasks at a time. Note: You can select only tasks in the workspaces to which you are added as a member. |
Selected Root Tasks | The tasks for which you want to backfill data. The list displays the added root tasks. You can select descendant tasks for which you want to backfill data based on the root tasks. Note: You can filter descendant tasks based on dependency levels. Direct descendant tasks of root tasks are listed at Level 1. You can select up to 500 root tasks and up to 2,000 total tasks. |
Task Blacklist | If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist. Note: You can add only root tasks to the blacklist. If data does not need to be backfilled for descendant tasks of root tasks, remove the descendant tasks from the Selected Root Tasks list. If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks. |
Select by Link
Select a start task as the root task and one or more end tasks. Then, the system automatically determines that all tasks from the start task to the end task require data backfilling.
The following table describes the parameters.
Parameter | Description |
Task Selection Method | Select Select by Link. |
Select Tasks | Enter a task name or task ID to search for and add a start task and use the same method to add one or more end tasks. Then, the system identifies intermediate tasks based on the start task and end tasks. An intermediate task serves as a direct or indirect descendant task of a start task and serves as a direct or indirect ancestor task of an end task. |
Intermediate Tasks | The list of intermediate tasks that are automatically identified by the system based on the start task and end tasks. Note: The list can display up to 2,000 tasks. Extra tasks are not displayed in the list, but data is backfilled for all the tasks as expected. |
Task Blacklist | If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist. Note: If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks. |
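The intermediate-task definition used by Select by Link (a descendant of the start task that is also an ancestor of an end task) can be illustrated as a graph computation. The following Python sketch is not how DataWorks implements the feature. The dependency map, the task names, and the function names `reachable` and `intermediate_tasks` are all hypothetical.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start by following graph edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def intermediate_tasks(downstream, start, ends):
    """Tasks that descend from start and are ancestors of at least one end."""
    # Build the reverse (child -> parents) map to walk upstream from the ends.
    upstream = {}
    for parent, children in downstream.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    descendants = reachable(downstream, start)
    ancestors = set()
    for end in ends:
        ancestors |= reachable(upstream, end)
    return descendants & ancestors


if __name__ == "__main__":
    # Hypothetical dependency map: task -> direct descendant tasks.
    deps = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["F"], "X": ["D"]}
    print(sorted(intermediate_tasks(deps, "A", ["F"])))
```

In this hypothetical graph, C and E descend from A but do not lead to F, and X leads to F but does not descend from A, so only B and D qualify as intermediate tasks.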
Select by Workspace
Select a task as the root task, and determine the tasks for which you want to backfill data based on the workspaces to which descendant tasks of the root task belong. The original plan of backfilling data for massive nodes is compatible with this method.
Parameter | Description |
Task Selection Method | Select Select by Workspace. |
Add Root Tasks | You can search for and add a root task by task name or ID. Data is backfilled for tasks in the workspaces to which descendant tasks of the root task belong. Note: You can select only tasks in the workspaces to which you are added as a member. |
Include Root Node | Specifies whether to backfill data for the root task. |
Workspaces for Data Backfill | Select the workspaces in which tasks require data backfilling based on the workspaces to which descendant tasks of the root task belong. Note: You can select only workspaces that reside in the current region. After you select a workspace, data will be backfilled for all tasks in the workspace by default. You can specify a custom task blacklist or whitelist based on your business requirements. |
Add to Whitelist | Add other tasks for which you want to backfill data to the whitelist, in addition to the tasks in the workspaces that you selected. |
Task Blacklist | Add the tasks that do not require data backfilling in the selected workspaces to the blacklist. |
Specify Task and All Descendant Tasks
Select a root task. Then, the system automatically determines that the root task and all its descendant tasks require data backfilling.
Important: You can view the tasks that are triggered to run only if the data backfill task is running. Proceed with caution.
The following table describes the parameters.
Parameter | Description |
Task Selection Method | Select Specify Task and All Descendant Tasks. |
Add Root Task | You can search for and add a root task by task name or ID. Data will be backfilled for the selected root task and all its descendant tasks. Note: You can select only tasks in the workspaces to which you are added as a member. If no task depends on the selected root task, data is backfilled for only the root task after you submit the data backfill task. |
Task Blacklist | If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist. Note: If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks. |
Configure parameters in the Data Backfill Policy section.
Configure information, including the running time of the data backfill task, whether to allow parallelism, whether to trigger an alert, and the resource group to be used, based on your business requirements.
The following table describes the parameters.
Parameter
Description
Data Timestamp
Specifies the data timestamp of data to be backfilled for selected tasks. The value of this parameter is accurate to the day.
If you want to backfill data of multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
If the specified data timestamp is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system immediately runs the data backfill instance after the data timestamp elapses.
For example, if the current date is March 12, 2024, the data timestamp is March 17, 2024, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on March 18, 2024.
Note: In batch processing, the most common scenario is to process data that was generated on the previous day on the current day. The previous day is the data timestamp. In the process of backfilling data for a task, DataWorks generates instances for the task based on the data timestamp that you selected. This way, you can backtrack the data at the specified time.
We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
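The relationship between a data timestamp and the day on which its instance runs, as described for the Data Timestamp parameter, can be sketched in a few lines of Python. This is an illustration under the previous-day convention stated in the note. The function names `data_timestamps` and `run_date` are hypothetical and are not part of any DataWorks API.

```python
from datetime import date, timedelta


def data_timestamps(start: date, end: date) -> list[date]:
    """Expand an inclusive data timestamp range into individual days."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def run_date(data_timestamp: date) -> date:
    """Day on which the instance for a data timestamp runs.

    Assumption: data generated on a day is processed on the following day.
    """
    return data_timestamp + timedelta(days=1)


if __name__ == "__main__":
    for ts in data_timestamps(date(2024, 3, 15), date(2024, 3, 17)):
        print("data timestamp", ts, "runs on", run_date(ts))
```

Each day in the range produces its own set of instances, which is why a long range can delay data backfill instances when resources are limited.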
Time Range
Specify the time period during which the selected tasks need to be run. Instances whose scheduling time is within the time period can be generated and run. You can configure this parameter to allow tasks that are scheduled by hour or minute to backfill only data in the specified time period. Default value: 00:00 to 23:59.
Note: Instances whose scheduling time is not within the time period are not generated. For example, if tasks scheduled by day depend on tasks scheduled by hour, isolated task instances may be generated, and task running is blocked.
We recommend that you modify this parameter only if data that is within a specific time period needs to be backfilled for tasks that are scheduled by hour or minute.
Parallelism
If you want to backfill data of multiple data timestamps for a task, you can set this parameter to Yes and specify the number of groups. Valid values:
Yes: The system will generate data backfill instances based on the specified number of groups and run the data backfill instances for different data timestamps in parallel.
No: Data backfill instances are run in sequence based on the data timestamps.
Note: If you backfill data of a specific day for a task scheduled by hour or minute, whether instances for the task are run in parallel depends on whether you configure the self-dependency for the task.
The number of groups that you can specify ranges from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
If the number of data timestamps is less than the number of groups, all the data backfill instances are run in parallel.
For example, the data timestamps are January 11 to January 13, and you set the number of groups to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
If the number of data timestamps is greater than the number of groups, the system runs some tasks in sequence and the other tasks in parallel based on the data timestamps.
For example, the data timestamps are January 11 to January 13, and you set the number of groups to 2. In this case, two data backfill instances are generated and run in parallel. One of the data backfill instances has two data timestamps, and the tasks that correspond to the two data timestamps are run in sequence.
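The grouping rules for the Parallelism parameter can be illustrated with a short Python sketch. The function name `split_into_groups` is hypothetical. The source does not state which group receives the extra data timestamps, so the choice to give them to the earlier groups in contiguous chunks is an assumption.

```python
from datetime import date, timedelta


def split_into_groups(timestamps, groups):
    """Split ordered data timestamps into at most `groups` contiguous chunks.

    Chunks run in parallel with each other. Timestamps inside one chunk
    run in sequence. Assumption: earlier chunks receive any extra timestamps.
    """
    if not 2 <= groups <= 10:
        raise ValueError("the number of groups must be from 2 to 10")
    if not timestamps:
        return []
    count = min(groups, len(timestamps))
    base, extra = divmod(len(timestamps), count)
    result, pos = [], 0
    for i in range(count):
        size = base + (1 if i < extra else 0)
        result.append(timestamps[pos:pos + size])
        pos += size
    return result


if __name__ == "__main__":
    days = [date(2024, 1, 11) + timedelta(days=d) for d in range(3)]
    print(split_into_groups(days, 4))  # three chunks of one day each
    print(split_into_groups(days, 2))  # two chunks, one of which holds two days
```

With three data timestamps and four groups, every timestamp gets its own instance. With two groups, two instances are produced and one of them processes two timestamps in sequence, which matches the two examples above.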
Alert for Data Backfill
Specifies whether to enable the alerting feature for data backfill.
Yes: An alert is generated for data backfill if the trigger condition is met.
No: The alerting feature is disabled for data backfill.
Trigger Condition
The trigger condition of an alert for data backfill. Valid values:
Alert on Failure or Success: An alert is generated regardless of whether data backfill is successful or fails.
Alert on Success: An alert is generated if data backfill is successful.
Alert on Failure: An alert is generated if data backfill fails.
Note: This parameter is required only if you select Yes for the Alert for Data Backfill parameter.
Alert Notification Method
The notification method for an alert. The alert recipient is the initiator of the data backfill operation. Valid values: Text Message and Email; Text Message; Email.
Note: This parameter is required only if you select Yes for the Alert for Data Backfill parameter.
You can click Check Contact Information to check whether the mobile phone number or email address of the alert recipient is registered. If not, you can refer to Configure and view alert contacts to configure the information.
Order
The sequence based on which data backfill instances are run. Valid values: Ascending by Business Date and Descending by Business Date.
Resource Group for Scheduling
Specifies whether to select another resource group for scheduling to run a data backfill instance.
Follow Task Configuration: The resource group for scheduling that is configured for the current auto triggered task is used to run the data backfill instance.
Specify Resource Group for Scheduling: Select a resource group for scheduling to run the data backfill instance. This prevents the data backfill instance from competing for resources with auto triggered task instances.
Note: Make sure that network connections are established for the resource group. Otherwise, tasks may fail to run. If the specified resource group is not associated with the desired workspace, the resource group that is used to run the auto triggered task is used.
Execution Period
Specifies the time period during which data is backfilled. Valid values:
Follow Task Configuration: Data is backfilled when the scheduling time of data backfill instances arrives.
Specify Time Period: Data is backfilled within a specified time period. Specify a time period based on the number of tasks for which you want to backfill data.
Note: Data is not backfilled for the tasks that are in the Not Run state when the time period elapses. Data is continuously backfilled for the tasks that are in the Running state when the time period elapses.
Configure parameters in the Data Backfill Task Verification section.
Configure the Terminate Task Running upon Verification Failure parameter to determine whether to terminate task running if the data backfill task verification fails. The system checks the basic information about the data backfill task and also checks potential risk items.
Basic information: Check the number of tasks involved in the data backfill operation, the number of generated instances, whether a task dependency loop is formed, whether task isolation occurs, and whether you have required permissions on workspaces.
Risk items: Check whether a task dependency loop is formed or whether task isolation occurs. If the risk detection fails, a task running exception will occur. You can enable the system to terminate the data backfill task when the risk detection fails.
Click Submit. A data backfill task is created.
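One of the checks listed for the verification step is whether a task dependency loop is formed. As a rough illustration of what such a check involves, the following Python sketch detects a loop in a dependency map by using Kahn's topological sort. It is not the DataWorks implementation, and the function name `has_dependency_loop` and the task names are hypothetical.

```python
from collections import deque


def has_dependency_loop(downstream):
    """Return True if the task -> descendant tasks map contains a cycle."""
    nodes = set(downstream)
    for children in downstream.values():
        nodes.update(children)
    # Count incoming edges for every task.
    indegree = {node: 0 for node in nodes}
    for children in downstream.values():
        for child in children:
            indegree[child] += 1
    # Repeatedly remove tasks that have no remaining ancestors.
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in downstream.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Tasks that could never be removed are part of, or behind, a loop.
    return visited != len(nodes)


if __name__ == "__main__":
    print(has_dependency_loop({"A": ["B"], "B": ["C"]}))
    print(has_dependency_loop({"A": ["B"], "B": ["C"], "C": ["A"]}))
```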
Step 2: Run the data backfill task
When the scheduling time of the data backfill task arrives and no exception occurs, the data backfill task is automatically triggered to run.
A data backfill task cannot be run if one of the following conditions is met:
The verification feature is enabled for the data backfill task and the verification fails. For more information, see Step 4 in the "Step 1: Create a data backfill task" section in this topic.
The extension-based check feature is enabled for the data backfill task, and the check fails. For more information, see Overview.
Manage data backfill instances
After you configure the preceding settings, data backfill instances are generated. Then, you can view the basic information and running details of the data backfill instances, and perform related operations on the data backfill instances. For example, you can terminate, rerun, or reuse a data backfill instance.
Area | Description |
1 | In this area, you can click Show Search Options and specify filter conditions, such as Retroactive Instance Name, Status, and Node Type, to search for data backfill instances. You can also terminate multiple running data backfill tasks at a time. |
2 | In this area, you can view information about a data backfill instance and perform operations on data backfill instances. |
3 | In this area, you can view information about each task for which the data backfill instance is generated and perform operations on a task. |
4 | You can select multiple tasks in the area marked with 3 and click Stop or Rerun in this area to terminate or rerun the selected tasks at a time. |
Instance status
Status | Icon |
Successful | |
Not run | |
Failed | |
Running | |
Waiting | |
Frozen | |