
DataWorks:Backfill data and view data backfill instances (new version)

Last Updated: Apr 26, 2024

You can backfill data of a historical or future period of time for an auto triggered task to write the data to time-based partitions. Scheduling parameters that are used in the task code are automatically replaced with specific values based on the data timestamp that you configure for the data backfill operation. The data that corresponds to the data timestamp is then written to the destination partitions, which are determined by the logic and content of the task code. This topic describes how to backfill data for an auto triggered task and manage the data backfill instances generated for the task on the Data Backfill page.
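
The following minimal Python sketch, for illustration only, shows how a backfill data timestamp range could map to the values that replace a scheduling parameter such as $bizdate and, assuming the task writes to a daily partition named pt=<bizdate>, to the partitions that receive the backfilled data. The partition naming is a hypothetical example; the actual partitions depend on your task code.

# A minimal sketch, not DataWorks internals: enumerate the data timestamps in a
# backfill range and the value that replaces the $bizdate scheduling parameter
# for each generated instance. The pt=<bizdate> partition name is a hypothetical
# example; the real destination partitions depend on your task code.
from datetime import date, timedelta

def data_timestamps(start: date, end: date):
    """Yield every data timestamp (business date) in the backfill range as yyyymmdd."""
    current = start
    while current <= end:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=1)

# Backfilling January 11 to January 13 generates one instance per data timestamp.
for bizdate in data_timestamps(date(2024, 1, 11), date(2024, 1, 13)):
    print(f"instance data timestamp {bizdate} -> task code runs with $bizdate={bizdate}, "
          f"writing partition pt={bizdate}")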

Background information

After an auto triggered task is developed, committed, and deployed to the scheduling system, the scheduling system runs the task based on the scheduling configurations of the task. If you want to run the auto triggered task in a specified time range, you can backfill data for the task. The following table describes the methods of selecting tasks for which you want to backfill data.

Task selection method

Description

Scenario

Manually Select

Select one or more tasks as root tasks. This way, you can manually select specific descendant tasks of the root tasks for which you want to backfill data.

Note
  • This method covers the original data backfill options: backfilling data for the current task, backfilling data for the current task and its descendant tasks, and backfilling data in advanced mode.

  • You can select up to 500 root tasks and up to 2,000 total tasks. The total tasks consist of root tasks and their descendant tasks.

  • This method can be used to backfill data for the current task and its descendant tasks at a time.

  • This method can be used to backfill data for multiple tasks that may not have dependencies with each other at a time.

Select by Link

Select a start task as the root task and one or more end tasks. Then, the system automatically determines that all tasks from the start task to the end task require data backfilling.

This method can be used to perform end-to-end data backfilling for tasks for which complex dependencies are configured.

Select by Workspace

Select a task as the root task, and determine the tasks for which you want to backfill data based on the workspaces to which descendant tasks of the root task belong.

Note
  • This method covers the original option of backfilling data for a large number of tasks.

  • You cannot configure a task blacklist.

This method is suitable for scenarios in which descendant tasks of the current task belong to different workspaces and you want to backfill data for the descendant tasks.

Specify Task and All Descendant Tasks

Select a root task. Then, the system automatically determines that the root task and all its descendant tasks require data backfilling.

Important

You can view the tasks that are triggered to run only after the data backfill task starts running. Proceed with caution.

This method can be used to backfill data for a root task and all its descendant tasks.

Limits

  • Instance cleanup principles

    • Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.

    • Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).

    • Instances that run on exclusive resource groups for scheduling are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).

    • The system clears excess run logs on a daily basis if the total size of the run logs that are generated for all auto triggered task instances that have finished running exceeds 3 MB.

  • Limits on permissions

    For root tasks or their descendant tasks for which you want to backfill data, if you do not have required permissions on the workspaces to which the tasks belong, you cannot backfill data for these tasks. If a task in a workspace is an intermediate task for data backfilling, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks. Proceed with caution. If data needs to be backfilled for both ancestor tasks and descendant tasks of a task, the task is considered an intermediate task.

Precautions

  • Instance running

    • When DataWorks backfills data of a specified time range for a task, if an instance generated for the task fails on a day, the status of the other data backfill instances of the task for that day is also set to failed. In this case, DataWorks does not run the instances generated for this task on the next day. DataWorks runs the instances generated for a task on a day only after all instances generated for the task on the previous day are successfully run.

    • If you backfill data of a specific day for a task scheduled by hour or minute, whether all instances generated to run on that day for the task are run in parallel depends on whether you configure the self-dependency for the task.

    • If both an auto triggered task instance and a data backfill instance are triggered to run for a task, you must stop the data backfill instance to ensure that the auto triggered task instance can be run as expected.

    • You can add tasks that do not require data backfilling to a blacklist. If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.

  • Scheduling resources

    • If a large number of data backfill instances are run or a high data backfilling parallelism is configured, scheduling resources may be insufficient. Make sure that your configurations meet your business requirements.

    • To prevent data backfill instances from occupying large amounts of resources and affecting the running of auto triggered task instances, you must abide by the following rules that are formulated for data backfill instances:

      • If you backfill data for a task whose data timestamp is the previous day, the priority of a data backfill task created for the task is determined by the priority of the baseline to which the task belongs.

      • If you backfill data for a task whose data timestamp is the day before the previous day, you must abide by the following rules to downgrade the priority of the task (see the sketch after this list):

        • If the priority of the task is 7 or 8, downgrade the priority of the task to 3.

        • If the priority of the task is 3 or 5, downgrade the priority of the task to 2.

        • If the priority of the task is 1, keep the priority unchanged.
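
The following minimal Python sketch restates the priority downgrade mapping described above; it is for illustration only and is not a DataWorks API.

# A minimal sketch of the priority downgrade rules for data backfill tasks whose
# data timestamp is the day before the previous day. This only restates the
# documented mapping; it is not a DataWorks API call.
def downgraded_priority(priority: int) -> int:
    if priority in (7, 8):
        return 3
    if priority in (3, 5):
        return 2
    if priority == 1:
        return 1  # priority 1 stays unchanged
    return priority  # no documented rule for other values; left unchanged

for p in (8, 7, 5, 3, 1):
    print(f"baseline priority {p} -> data backfill priority {downgraded_priority(p)}")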

Go to the Data Backfill page

  1. Go to the Operation Center page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

  2. In the left-side navigation pane, choose O&M Assistant > Data Backfill.

Note

If you want to backfill data for a single auto triggered task, you can also perform the following operations: In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Nodes. On the page that appears, find the desired auto triggered task and click Backfill Data in the Actions column.

Step 1: Create a data backfill task

On the Data Backfill page, click Create Data Backfill Task and configure parameters based on your business requirements.

  1. Configure parameters in the Basic Information section.

    DataWorks automatically generates a data backfill task name. You can change the name based on your business requirements.

  2. Configure parameters in the Tasks That Require Data Backfill section.

    You can backfill data only for tasks on which you have the required permissions. Select tasks by using one of the following methods: Manually Select, Select by Link, Select by Workspace, and Specify Task and All Descendant Tasks. Based on the selected tasks, you can further select other related tasks for which you want to backfill data. The parameters that you can configure vary based on the method that you select.

    Manually Select

    Select one or more tasks as root tasks. This way, you can manually select specific descendant tasks of the root tasks for which you want to backfill data. This method covers the original data backfill options: backfilling data for the current task, backfilling data for the current task and its descendant tasks, and backfilling data in advanced mode.

    The following table describes the parameters.

    Parameter

    Description

    Task Selection Method

    Select Manually Select.

    Add Root Tasks

    You can search for and add a root task by task name or ID. You can also click Batch Add and specify conditions, such as the resource group, scheduling cycle, and workspace, to add multiple root tasks at a time.

    Note

    You can select only tasks in the workspaces to which you are added as a member.

    Selected Root Tasks

    The tasks for which you want to backfill data. The list displays the added root tasks. You can select descendant tasks for which you want to backfill data based on the root tasks.

    Note
    • You can filter descendant tasks based on dependency levels. Direct descendant tasks of root tasks are listed at Level 1.

    • You can select up to 500 root tasks and up to 2,000 total tasks.

    Task Blacklist

    If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist.

    Note
    • You can add only root tasks to the blacklist. If data does not need to be backfilled for descendant tasks of root tasks, remove the descendant tasks from the Selected Root Tasks list.

    • If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.

    Select by Link

    Select a start task as the root task and one or more end tasks. Then, the system automatically determines that all tasks from the start task to the end task require data backfilling.

    The following table describes the parameters.

    Parameter

    Description

    Task Selection Method

    Select Select by Link.

    Select Tasks

    Enter a task name or task ID to search for and add a start task. Use the same method to add one or more end tasks. Then, the system identifies intermediate tasks based on the start task and end tasks. An intermediate task is a direct or indirect descendant task of the start task and a direct or indirect ancestor task of an end task.

    Intermediate Tasks

    The list of intermediate tasks that are automatically identified by the system based on the start task and end tasks.

    Note

    The list can display up to 2,000 tasks. Extra tasks are not displayed in the list, but data is backfilled for all the tasks as expected.

    Task Blacklist

    If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist.

    Note

    If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.

    Select by Workspace

    Select a task as the root task, and determine the tasks for which you want to backfill data based on the workspaces to which descendant tasks of the root task belong. This method covers the original option of backfilling data for a large number of tasks.

    The following table describes the parameters.

    Parameter

    Description

    Task Selection Method

    Select Select by Workspace.

    Add Root Tasks

    You can search for and add a root task by task name or ID. Data is backfilled for tasks in the workspaces to which descendant tasks of the root task belong.

    Note

    You can select only tasks in the workspaces to which you are added as a member.

    Include Root Node

    Specifies whether to backfill data for the root task.

    Workspaces for Data Backfill

    Select the workspaces in which tasks require data backfilling based on the workspaces to which descendant tasks of the root task belong.

    Note
    • You can select only workspaces that reside in the current region.

    • After you select a workspace, data will be backfilled for all tasks in the workspace by default. You can specify a custom task blacklist or whitelist based on your business requirements.

    Add to Whitelist

    Add other tasks for which you want to backfill data to the whitelist, in addition to the tasks in the workspaces that you selected.

    Task Blacklist

    Add the tasks that do not require data backfilling in the selected workspaces to the blacklist.

    Specify Task and All Descendant Tasks

    Select a root task. Then, the system automatically determines that the root task and all its descendant tasks require data backfilling.

    Important

    You can view the tasks that are triggered to run only after the data backfill task starts running. Proceed with caution.


    The following table describes the parameters.

    Parameter

    Description

    Task Selection Method

    Select Specify Task and All Descendant Tasks.

    Add Root Task

    You can search for and add a root task by task name or ID. Data will be backfilled for the selected root task and all its descendant tasks.

    Note
    • You can select only tasks in the workspaces to which you are added as a member.

    • If no task depends on the selected root task, data is backfilled for only the root task after you submit the data backfill task.

    Task Blacklist

    If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist.

    Note

    If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.

  3. Configure parameters in the Data Backfill Policy section.

    Configure information, including the running time of the data backfill task, whether to allow parallelism, whether to trigger an alert, and the resource group to be used, based on your business requirements.

    The following table describes the parameters.

    Parameter

    Description

    Data Timestamp

    Specifies the data timestamp of data to be backfilled for selected tasks. The value of this parameter is accurate to the day.

    • If you want to backfill data of multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.

    • If the specified data timestamp is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system immediately runs the data backfill instance after the data timestamp elapses.

      For example, if the current date is March 12, 2024, the data timestamp is March 17, 2024, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on March 18, 2024.

    Note
    • In batch processing, the most common scenario is to process, on the current day, the data that was generated on the previous day. The previous day is the data timestamp. When you backfill data for a task, DataWorks generates instances for the task based on the data timestamps that you selected. This way, you can reprocess the data of the specified time range.

    • We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.

    Time Range

    Specify the time period during which the selected tasks need to be run. Instances whose scheduling time is within the time period can be generated and run. You can configure this parameter to allow tasks that are scheduled by hour or minute to backfill only data in the specified time period. Default value: 00:00 to 23:59.

    Note
    • Instances whose scheduling time is not within the time period are not generated. For example, if tasks scheduled by day depend on tasks scheduled by hour and some instances of the hourly tasks are not generated, isolated task instances may be generated and task running may be blocked.

    • We recommend that you modify this parameter only if data that is within a specific time period needs to be backfilled for tasks that are scheduled by hour or minute.

    Parallelism

    If you want to backfill data of multiple data timestamps for a task, you can set this parameter to Yes and specify the number of groups. Valid values:

    • Yes: The system will generate data backfill instances based on the specified number of groups and run the data backfill instances for different data timestamps in parallel.

    • No: Data backfill instances are run in sequence based on the data timestamps.

      Note

      If you backfill data of a specific day for a task scheduled by hour or minute, whether instances for the task are run in parallel depends on whether you configure the self-dependency for the task.

    The number of groups that you can specify ranges from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel (see the grouping sketch after this table):

    • If the number of data timestamps is less than the number of groups, all the data backfill instances are run in parallel.

      For example, the data timestamps are January 11 to January 13, and you set the number of groups to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.

    • If the number of data timestamps is greater than the number of groups, the system runs some tasks in sequence and the other tasks in parallel based on the data timestamps.

      For example, the data timestamps are January 11 to January 13, and you set the number of groups to 2. In this case, two data backfill instances are generated and run in parallel. One of the data backfill instances has two data timestamps, and the tasks that correspond to the two data timestamps are run in sequence.

    Alert for Data Backfill

    Specifies whether to enable the alerting feature for data backfill.

    • Yes: An alert is generated for data backfill if the trigger condition is met.

    • No: The alerting feature is disabled for data backfill.

    Trigger Condition

    The trigger condition of an alert for data backfill. Valid values:

    • Alert on Failure or Success: An alert is generated regardless of whether data backfill is successful or fails.

    • Alert on Success: An alert is generated if data backfill is successful.

    • Alert on Failure: An alert is generated if data backfill fails.

    Note

    This parameter is required only if you select Yes for the Alert for Data Backfill parameter.

    Alert Notification Method

    The notification method for an alert. The alert recipient is the initiator of the data backfill operation. Valid values: Text Message and Email, Text Message, and Email.

    Note
    • This parameter is required only if you select Yes for the Alert for Data Backfill parameter.

    • You can click Check Contact Information to check whether the mobile phone number or email address of the alert recipient is registered. If not, you can refer to Configure and view alert contacts to configure the information.

    Order

    The sequence based on which data backfill instances are run. Valid values: Ascending by Business Date and Descending by Business Date.

    Resource Group for Scheduling

    Specifies whether to select another resource group for scheduling to run a data backfill instance.

    • Follow Task Configuration: The resource group for scheduling that is configured for the current auto triggered task is used to run the data backfill instance.

    • Specify Resource Group for Scheduling: Select a resource group for scheduling to run the data backfill instance. This prevents the data backfill instance from competing for resources with auto triggered task instances.

    Note

    Make sure that network connections are established for the resource group. Otherwise, tasks may fail to run. If the specified resource group is not associated with the desired workspace, the resource group that is used to run the auto triggered task is used.

    Execution Period

    Specifies the time period during which data is backfilled. Valid values:

    • Follow Task Configuration: Data is backfilled when the scheduling time of data backfill instances arrives.

    • Specify Time Period: Data is backfilled within a specified time period. Specify a time period based on the number of tasks for which you want to backfill data.

      Note

      Data is not backfilled for the tasks that are in the Not Run state when the time period elapses. Data is continuously backfilled for the tasks that are in the Running state when the time period elapses.
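
    The following minimal Python sketch illustrates how data timestamps might be distributed across the configured number of groups when Parallelism is set to Yes. The round-robin assignment is an illustrative assumption; the exact assignment strategy is not described in this topic.

# A minimal sketch, assuming a round-robin assignment of data timestamps to the
# configured number of groups. Groups run in parallel; the data timestamps
# inside one group run in sequence. The real assignment strategy may differ.
from datetime import date, timedelta

def split_into_groups(start: date, end: date, groups: int):
    days = [(start + timedelta(days=i)).isoformat()
            for i in range((end - start).days + 1)]
    # Never create more groups than there are data timestamps.
    buckets = [[] for _ in range(min(groups, len(days)))]
    for index, day in enumerate(days):
        buckets[index % len(buckets)].append(day)
    return buckets

# January 11 to January 13 with 2 groups: one group holds two data timestamps
# that run in sequence, and the two groups run in parallel.
for number, bucket in enumerate(split_into_groups(date(2024, 1, 11), date(2024, 1, 13), 2), start=1):
    print(f"group {number}: {bucket}")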

  4. Configure parameters in the Data Backfill Task Verification section.

    Configure the Terminate Task Running upon Verification Failure parameter to determine whether to terminate task running if the data backfill task verification fails. The system checks the basic information about the data backfill task and also checks potential risk items.

    • Basic information: Check the number of tasks involved in the data backfill operation, the number of generated instances, whether a task dependency loop is formed, whether task isolation occurs, and whether you have required permissions on workspaces.

    • Risk items: Check whether a task dependency loop is formed or whether task isolation occurs. If the risk detection fails, a task running exception will occur. You can enable the system to terminate the data backfill task when the risk detection fails.

  5. Click Submit. A data backfill task is created.
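
If you prefer to create data backfill tasks programmatically, the DataWorks OpenAPI provides an operation for triggering data backfill for an auto triggered task and its descendant tasks. The following Python sketch uses the generic CommonRequest client from the Alibaba Cloud Python SDK core library. The action name RunCycleDagNodes and the parameter names follow the 2020-05-18 API version and should be treated as assumptions; verify them against the DataWorks API reference before use.

# A hedged sketch of creating a data backfill task through the DataWorks OpenAPI
# instead of the console. Requires: pip install aliyun-python-sdk-core
# The action name (RunCycleDagNodes) and the parameter names below follow the
# 2020-05-18 API version and are assumptions; verify them against the DataWorks
# API reference for your region before use.
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient("<your_access_key_id>", "<your_access_key_secret>", "cn-shanghai")

request = CommonRequest()
request.set_domain("dataworks.cn-shanghai.aliyuncs.com")
request.set_version("2020-05-18")
request.set_action_name("RunCycleDagNodes")                     # data backfill operation (assumed)
request.set_method("POST")
request.add_query_param("ProjectEnv", "PROD")                   # run in the production scheduling system
request.add_query_param("RootNodeId", "700000000001")           # placeholder root task ID
request.add_query_param("Name", "P_backfill_demo")              # data backfill task name
request.add_query_param("StartBizDate", "2024-03-10 00:00:00")  # first data timestamp
request.add_query_param("EndBizDate", "2024-03-12 00:00:00")    # last data timestamp
request.add_query_param("Parallelism", "false")                 # run data timestamps in sequence

response = client.do_action_with_exception(request)
print(response)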

Step 2: Run the data backfill task

When the scheduling time of the data backfill task arrives and no exception occurs, the data backfill task is automatically triggered to run.

A data backfill task cannot be run if one of the following conditions is met:

  • The verification feature is enabled for the data backfill task and the verification fails. For more information, see Step 4 in the "Step 1: Create a data backfill task" section in this topic.

  • The extension-based check feature is enabled for the data backfill task, and the check fails. For more information, see Overview.

Manage data backfill instances

After you configure the preceding settings, data backfill instances are generated. Then, you can view the basic information and running details of the data backfill instances, and perform related operations on the data backfill instances. For example, you can terminate, rerun, or reuse a data backfill instance.


Area

Description

1

In this area, you can click Show Search Options and specify filter conditions, such as Retroactive Instance Name, Status, and Node Type, to search for data backfill instances. You can also terminate multiple running data backfill tasks at a time.

2

In this area, you can view the following information about a data backfill instance:

  • Node Name: the name of the data backfill instance. Click the expand icon before the name of the data backfill instance to view information about the instance in this area, such as the date when the data backfill instance is run, the status of the data backfill instance, and details of the tasks for which the instance is generated.

  • Check Status: the check status of the data backfill instance.

  • Running Status: the status of the data backfill instance. Valid values: Succeeded, Run failed, Waiting for resources, and Pending. You can troubleshoot issues based on an abnormal state.

  • Nodes: the number of tasks for which the data backfill instance is generated.

  • Data Timestamp: the data timestamp (business date) of the data backfill instance.

  • View Task Analysis Results: You can view the estimated number of instances to be generated, the running dates, and the risk detection results, and handle blocked tasks at the earliest opportunity.

In this area, you can perform the following operations on data backfill instances:

  • Stop: Terminate multiple data backfill instances that are in the running state at a time. After you perform this operation, the status of the related instances is set to failed.

    Note
    • Instance cleanup principles

      • Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.

      • Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).

      • Instances that run on exclusive resource groups for scheduling are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).

      • The system clears excess run logs on a daily basis if the total size of the run logs that are generated for all auto triggered task instances that have finished running exceeds 3 MB.

    • You cannot terminate data backfill instances that are not run, are successfully run, or failed to run.

  • Batch Rerun: Rerun multiple data backfill instances at a time.

    Note
    • Only data backfill instances that are successfully run or failed to run can be rerun.

    • If you perform this operation, selected data backfill instances are immediately rerun at the same time. The scheduling dependencies between the instances are not considered. If you want to rerun data backfill instances in sequence, you can select Rerun Descendant Nodes or perform the data backfill operation again.

  • Reuse: Reuse a group of tasks for which data is backfilled. This allows you to quickly select tasks for which you want to backfill data.

3

In this area, you can view the following information about each task for which the data backfill instance is generated:

  • Name: the name of the task for which the data backfill instance is generated. You can click the task name to open the directed acyclic graph (DAG) of the task and view the details of the task.

  • Scheduling Time: the scheduling time of the task.

  • Start run time: the time when the task starts to run.

  • End Time: the time when the task finishes running.

  • Runtime: the time consumed to run the task.

In this area, you can also perform the following operations on a task:

  • DAG: View the DAG of the task to identify ancestor and descendant tasks of the task. For more information, see Appendix: Use the features provided in a DAG.

  • Stop: Stop the task. You can stop tasks that are in the running state. After you perform this operation, the status of the task is set to failed.

    Note
    • You cannot stop a task that is not run, is successfully run, or failed to run.

    • This operation causes the instance that is generated for the task to fail and blocks the running of the descendant instances of that instance. Exercise caution when you perform this operation.

  • Rerun: Rerun the task.

    Note

    You can rerun only tasks that failed to run or are successfully run.

  • More > Rerun Descendant Nodes: Rerun the descendant tasks of the task.

  • More > Set Status to Successful: Set the status of the task to successful.

  • More > Freeze: Freeze the task to pause the scheduling of the task.

    Note

    You cannot freeze a task that is in the waiting for resources, waiting for scheduling time to arrive, or running state. If the code of the task is being run or data quality of the task is being checked, the task cannot be frozen.

  • More > Unfreeze: Unfreeze the task to resume the scheduling of the task.

  • More > View Lineage: View the lineage of the task.

4

You can select multiple tasks in the area marked with 3 and click Stop or Rerun in this area to terminate or rerun the selected tasks at a time.

Instance status

A data backfill instance can be in one of the following states, each indicated by its own icon in the console: Successful, Not run, Failed, Running, Waiting, and Frozen.

FAQ