The data backfill feature for recurring tasks lets you refresh data for a specified range of historical data timestamps. After a recurring task is developed and published, it runs according to the configured schedule. To run a recurring task for a specific time period or to refresh data for a historical time range, you can use the data backfill feature. The scheduling parameters used by the node are automatically replaced with values derived from the business time selected for the data backfill. This topic describes how to perform data backfill for recurring tasks.
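To illustrate how scheduling parameters map to backfill business times, the following Python sketch substitutes a placeholder for each day in a backfill range. The ${bizdate} placeholder name, the yyyymmdd value format, and the substitution code are illustrative assumptions, not Dataphin's actual implementation.

```python
from datetime import date, timedelta

def backfill_business_dates(start: date, end: date):
    """Yield every data timestamp (business date) in the backfill range, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

# Hypothetical task code that references a scheduling parameter (assumed placeholder name).
task_sql_template = "INSERT OVERWRITE TABLE dws_sales PARTITION (ds='${bizdate}') SELECT ... ;"

# For each business date in the backfill range, the placeholder is replaced with a
# concrete value before the corresponding instance runs (illustrative only).
for biz_date in backfill_business_dates(date(2024, 1, 11), date(2024, 1, 13)):
    instance_sql = task_sql_template.replace("${bizdate}", biz_date.strftime("%Y%m%d"))
    print(instance_sql)
```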
Scenarios
The data backfill feature is commonly used in the following scenarios:
Newly developed recurring tasks can only be scheduled starting from the next day. To immediately view historical partition data, you can perform a data backfill operation.
When upstream dependent tasks are rerun or backfilled and their historical partition data is refreshed, you can use the data backfill feature to refresh the historical partition data of downstream tasks.
When historical business data is missing and needs to be refreshed, you can backfill the affected data timestamps.
Data backfill modes
Currently, the Operation Center data backfill feature supports backfilling the current task only, or backfilling the current task together with its upstream and downstream tasks. The two modes are described below:
Backfill Current Task: This refers to performing data backfill operations on the current task. This is suitable for the following scenarios:
When you need to refresh data for the current node without updating downstream node data.
When the computation logic of the current task changes, you can first perform data backfill on the current task to verify the correctness of the computation logic, and then refresh data for downstream tasks.
Backfill Current And Upstream/Downstream Tasks: This refers to refreshing the current task and its upstream/downstream tasks. It is suitable for scenarios where data for the entire task chain needs to be refreshed.
Data backfill operation entry
In the top menu bar of the Dataphin homepage, choose Development > Task O&M.
In the navigation pane on the left, choose Task O&M > Recurring Task.
In the top navigation bar, select the production or development environment.
On the Integration And Computing Tasks or Modeling Task tab, click the Actions icon in the Data Backfill column of the target task, and then select Backfill Current Task or Backfill Current And Upstream/Downstream Tasks.
Note: Data backfill operations can also be performed in the DAG graph of a recurring task. For more information, see DAG graph of recurring tasks.
Backfill data for the current task
In the Data Backfill - Backfill Current Task dialog box, configure the data backfill task.
Step 1: Data backfill configuration
Basic information
Data Backfill Instance Name
The system automatically generates a name in the format of Node Name_Run Date_Instantiation Number. You can also manually change it.
Runtime
Supports Run Immediately or Custom run.
Run Immediately: After the configuration is complete, the data backfill instance is immediately generated to perform the data backfill task.
Custom: You can customize the specific time when the data backfill instance runs. The data backfill instance will start scheduling at the customized time.
When the system time zone (the time zone in the User Center) is different from the scheduling time zone, the system will display both the system time zone and the scheduling time zone. After selecting the run time, the system automatically calculates the corresponding scheduling time zone time.
Note: The custom run time must be later than the current time.
After you configure a custom run time, the latest selectable Data Timestamp is the day of the custom run date.
Data backfill tasks with a custom run time generate pending instances at 23:00 on the day before the scheduled run time.
Data Timestamp
Select the data timestamp range for which you need to perform data backfill (calculated according to the configured scheduling time zone). Configure the data timestamp based on the task's scheduling cycle:
For tasks with daily, weekly, or monthly scheduling cycles, you can select By Range, By Cycle, or Custom data timestamps. The application scenarios for each option are as follows:
By Range: Suitable for scenarios where you need to refresh data for multiple consecutive data timestamps. You must select a start time and an end time. The time span cannot exceed one year.
Note: If you only need to perform data backfill for a single day, you can select the same date for both the start time and the end time.
By Cycle: Used to refresh data for specific days of the week or dates of the month within multiple consecutive data timestamps. You must first select a time range. The time span cannot exceed one year.
Weekly: Refreshes the selected days of the week within the continuous time period.
Monthly: Refreshes the selected dates of each month within the continuous time period.
Note: Month-end refers to the last day of each month.
Custom: Suitable for scenarios where you need to refresh data for multiple non-consecutive data timestamps. You can manually enter data timestamps from 1900-01-01 to the present. The format must be YYYY-MM-DD. To refresh data for multiple data timestamps, separate them with line breaks (see the illustrative sketch after this parameter's description).
For tasks with hourly or minute scheduling cycles, first select the data timestamp, and then select the data backfill time range accurate to the minute. Together, these define the data timestamp and time range for the data backfill.
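As a rough illustration of the Custom input format described above, the following sketch parses a line-break-separated list of data timestamps and checks that each value is in the YYYY-MM-DD format and falls between 1900-01-01 and today. It is an assumption-based illustration, not part of Dataphin.

```python
from datetime import date, datetime

def parse_custom_data_timestamps(raw_input: str) -> list[date]:
    """Parse line-break-separated data timestamps and validate the documented constraints."""
    timestamps = []
    for line in raw_input.strip().splitlines():
        value = line.strip()
        if not value:
            continue
        parsed = datetime.strptime(value, "%Y-%m-%d").date()  # must be YYYY-MM-DD
        if not (date(1900, 1, 1) <= parsed <= date.today()):
            raise ValueError(f"{value} is outside the allowed range (1900-01-01 to today)")
        timestamps.append(parsed)
    return timestamps

# Three non-consecutive data timestamps, one per line.
print(parse_custom_data_timestamps("2024-01-11\n2024-01-15\n2024-02-01"))
```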
Select Field
If you are performing data backfill for a modeling task, you must select the fields to backfill:
If the primary key or source table has changed, to ensure data consistency and correctness, only the full table data backfill mode is supported.
If the primary key or source table has not changed, you can choose full table data backfill mode or specified field data backfill mode:
Full Table: Suitable for scenarios where all fields in the data table need data backfill.
Note: This does not include registered fields.
Specified Fields: Suitable for scenarios where you need to customize the fields for data backfill. After you select the backfill fields, the following fields are automatically selected as well:
Fields in the same materialization node as the selected fields.
Fields that are required by the system implementation, for example, when the scheduling cycle of a metric is modified and refreshed but its materialization remains unchanged.
Other Configurations
Single Instance Data Backfill
This option is supported only for logical fact tables.
A single data backfill instance can update data for all selected dates (within the range) of the event logical fact table. Compared with regular multi-instance concurrent data backfill, this can save computing resources and significantly reduce data backfill time.
Number Of Concurrent Groups
The number of concurrent groups is used to control how many data backfill processes run simultaneously. You can select the number of concurrent groups. The system supports a minimum of 1 group and a maximum of 12 groups running concurrently.
If the number of days in the data timestamp range is less than the number of concurrent groups, the actual number of parallel groups equals the number of days in the range.
If the number of days in the data timestamp range is greater than the number of concurrent groups, execution is partly serial and partly parallel. Instances in the same group run in data timestamp order, while instances in different groups run in parallel. For example, if the data timestamp range is January 11 to January 13 and the number of concurrent groups is 2, January 11 and January 12 form one group and January 13 forms another. The instances for January 11 and January 13 start running at the same time, while the instance for January 12 starts running after the instance for January 11 completes.
Note: Concurrent execution is not supported when there are cross-cycle dependencies among the selected nodes.
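To make the grouping behavior described above concrete, the following sketch splits a range of data timestamps into contiguous concurrent groups, matching the January 11 to January 13 example. It illustrates the documented semantics only and is not the actual scheduling implementation; the year is chosen arbitrarily.

```python
from datetime import date, timedelta
from math import ceil

def split_into_concurrent_groups(start: date, end: date, groups: int):
    """Split consecutive data timestamps into contiguous groups.

    Instances within a group run serially in data timestamp order;
    the groups themselves start in parallel (illustrative only).
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    groups = min(groups, len(days))   # never more groups than days in the range
    size = ceil(len(days) / groups)   # contiguous chunk size per group
    return [days[i:i + size] for i in range(0, len(days), size)]

# Example from the text: January 11 to January 13 with 2 concurrent groups.
for i, group in enumerate(split_into_concurrent_groups(date(2024, 1, 11), date(2024, 1, 13), 2), start=1):
    print(f"group {i}: " + " -> ".join(d.isoformat() for d in group))
# group 1: 2024-01-11 -> 2024-01-12   (these two instances run serially)
# group 2: 2024-01-13                 (starts in parallel with group 1)
```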
Data Backfill Order
You can choose to perform data backfill in ascending or descending order based on the business time.
Note: Data backfill in descending order of data timestamps is not supported when there are cross-cycle dependencies or self-dependencies among the selected nodes.
Instances Of Paused Scheduled Tasks
Configure the running status of data backfill instances generated by paused scheduled tasks:
Pause Running (May Block Data Backfill Process): This means that all data backfill instances generated by paused tasks are paused from running, which will block downstream instances from running normally.
Note: This is suitable for scenarios where neither the current task nor its downstream tasks need to run.
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected paused tasks will directly succeed in dry-run mode.
Note: This is suitable for scenarios where the current task does not need to run, but downstream tasks need to run normally according to the scheduling configuration.
Run Normally: All data backfill instances generated by paused tasks run normally.
Note: This is suitable for scenarios where the node is set to pause scheduling but needs to run normally for the selected data timestamps.
Instances Of Dry-Run Scheduled Tasks
Configure the running status of data backfill instances generated by dry-run scheduled tasks:
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected dry-run scheduled tasks will directly succeed in dry-run mode.
Run Normally: All data backfill instances generated by dry-run tasks run normally.
Specify Temporary Schedule Resource Group
If you have enabled the custom resource group feature, you can specify a temporary resource group for this data backfill operation to meet temporary resource consumption needs. For more information, see Resource group overview. If no temporary schedule resource group is specified, the task schedule resource group configured for each task will be used for scheduling and running.
Note: Only resource groups whose application scenarios include batch operations can be selected.
Backfill Current Task for modeling tasks does not support this configuration.
Step 2: Select recommended fields
If the Specified Fields of a modeling task have associated fields, you can select the optional associated fields in the Select Recommended Fields step so that they are backfilled together. Recommendation reasons include Fields With Changed Calculation Logic, Primary Key Of The Primary Table With Changed Calculation Logic, and Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed.
Fields With Changed Calculation Logic: The current calculation logic of the field is different from the calculation logic in the historical partitions within the selected data timestamps, meaning the calculation logic of the field has changed. Such fields can be included in the data backfill.
Primary Key Of The Primary Table With Changed Calculation Logic: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range. The primary key fields of the primary table can be included in the data backfill.
Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range (affecting the output of the child table). The primary key fields of the child table can be included in the data backfill.
Click OK to complete the data backfill operation for the current task.
Backfill data for the current and upstream/downstream tasks
In the Data Backfill - Backfill Current And Upstream/Downstream Tasks dialog box, configure the data backfill task.
Step 1: Basic information configuration
Data Backfill Instance Name
The system automatically generates a name in the format of Node Name_Run Date_Instantiation Number. You can also manually change it.
Runtime
Supports Run Immediately or Custom run.
Run Immediately: After the configuration is complete, the data backfill instance is immediately generated to perform the data backfill task.
Custom: You can customize the specific time when the data backfill instance runs. The data backfill instance will start scheduling at the customized time.
When the system time zone (the time zone in the User Center) is different from the scheduling time zone, the system will display both the system time zone and the scheduling time zone. After selecting the run time, the system automatically calculates the corresponding scheduling time zone time.
Note: The custom run time must be later than the current time.
After you configure a custom run time, the latest selectable Data Timestamp is the day of the custom run date.
Data backfill tasks with a custom run time generate pending instances at 23:00 on the day before the scheduled run time.
Data Timestamp
Select the data timestamp range for which you need to perform data backfill (calculated according to the configured scheduling time zone). Configure the data timestamp based on the task's scheduling cycle:
For tasks with daily, weekly, or monthly scheduling cycles, you can select By Range, By Cycle, or Custom data timestamps. The application scenarios for each option are as follows:
By Range: Suitable for scenarios where data needs to be refreshed for multiple consecutive data timestamps.
Note: If you only need to perform data backfill for a single day, you can select the same date for both the start time and the end time.
By Cycle: Used for refreshing data for specific days of the week or specific dates of the month within multiple consecutive data timestamps.
Weekly: Refreshes the selected days of the week within the continuous time period.
Monthly: Refreshes the selected dates of each month within the continuous time period.
Note: Month-end refers to the last day of each month.
Custom: Suitable for scenarios where data needs to be refreshed for multiple non-consecutive data timestamps. You can manually enter the data timestamps in the YYYY-MM-DD format. To refresh data for multiple data timestamps, separate them with line breaks.
For tasks with hourly or minute scheduling cycles, first select the data timestamp, and then select the data backfill time range accurate to the minute. Together, these define the data timestamp and time range for the data backfill.
Select Field
If you are performing data backfill for a modeling task, you must select the fields to backfill:
If the primary key or source table has changed, to ensure data consistency and correctness, only the full table data backfill mode is supported.
If the primary key or source table has not changed, you can choose full table data backfill mode or specified field data backfill mode:
Full Table: Suitable for scenarios where all fields in the data table need data backfill.
Specified Fields: Suitable for scenarios where you need to customize the fields for data backfill. After you select the backfill fields, the following fields are automatically selected as well:
Fields in the same materialization node as the selected fields.
Fields that are required by the system implementation, for example, when the scheduling cycle of a metric is modified and refreshed but its materialization remains unchanged.
Step 2: Select recommended fields
If the Specified Fields of a modeling task have associated fields, you can select the optional associated fields in the Select Recommended Fields step so that they are backfilled together. Recommendation reasons include Fields With Changed Calculation Logic, Primary Key Of The Primary Table With Changed Calculation Logic, and Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed.
Fields With Changed Calculation Logic: The current calculation logic of the field is different from the calculation logic in the historical partitions within the selected data timestamps, meaning the calculation logic of the field has changed. Such fields can be included in the data backfill.
Primary Key Of The Primary Table With Changed Calculation Logic: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range. The primary key fields of the primary table can be included in the data backfill.
Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range (affecting the output of the child table). The primary key fields of the child table can be included in the data backfill.
Data backfill configuration
Data Backfill Range: You can select the upstream and downstream tasks that need data backfill through List Mode or Mass Mode.
Important: Cross-node parameter instructions: When selecting nodes, we recommend that you select all upstream nodes from which the node references cross-node parameters. When a downstream (Down) node references cross-node output parameters from an upstream (Up) node, if you perform a data backfill operation on the downstream (Down) node without selecting the upstream (Up) node in the same data backfill instance, the cross-node input parameters of the downstream (Down) node take their values from the running records of the upstream (Up) node within the most recent N days. If there are no running records, or if the records are older than N days, the default value is used. N defaults to 15 days and may be subject to change. We recommend selecting both the upstream (Up) and downstream (Down) nodes. For more information, see Parameter configuration and using node parameters.
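The fallback behavior described in the note above can be summarized as follows. The sketch is an assumption-based illustration of the documented resolution order (value from the same backfill instance, then the most recent upstream run within N days, then the default value); the function name and inputs are hypothetical and not a Dataphin API.

```python
from datetime import date, timedelta
from typing import Optional

N_DAYS = 15  # documented default lookback window; may be subject to change

def resolve_cross_node_param(
    upstream_in_same_backfill: bool,
    backfill_value: Optional[str],
    recent_runs: list[tuple[date, str]],  # (run date, upstream output parameter value)
    default_value: str,
    today: date,
) -> str:
    """Illustrative resolution order for a downstream node's cross-node input parameter."""
    # 1. Upstream node selected in the same backfill instance: use its fresh output value.
    if upstream_in_same_backfill and backfill_value is not None:
        return backfill_value
    # 2. Otherwise, use the most recent upstream run record within the last N days.
    cutoff = today - timedelta(days=N_DAYS)
    for run_date, value in sorted(recent_runs, reverse=True):
        if run_date >= cutoff:
            return value
    # 3. No usable run record: fall back to the parameter's default value.
    return default_value

# Upstream not selected, but a run record exists within the 15-day window -> its value is used.
print(resolve_cross_node_param(False, None, [(date(2024, 1, 1), "v_20240101")], "default", date(2024, 1, 10)))
```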
List mode
Suitable for selecting upstream and downstream tasks at any level. You can quickly select task dependencies from 1 to 10 levels, or all levels. The list can display up to 2000 nodes; if this limit is exceeded, select Mass Mode. You can also click the icon in the list to filter nodes by Node Type, Project, and O&M Owner.
Note: If the starting task is a logical table, the display range of downstream tasks depends on the logical table fields selected for data backfill.
The display range of upstream and downstream tasks includes all upstream and downstream tasks of the selected fields of the current table, including associated required fields, but not including associated recommended fields.
Filter Paused Tasks And Their Downstream:
Selected by default. When selected, the list does not display paused-scheduling nodes or any of their downstream nodes under the specified level and filter conditions, and any paused tasks that are already selected are deselected.
For logical tables that contain paused fields, the paused fields are filtered out, and all downstream tasks of logical table fields marked as paused in the downstream dependency list are also filtered out.
Note: Downstream logical table fields can only be selected for data backfill as a whole; paused fields cannot be filtered out separately.
Mass mode
If list mode cannot meet your requirements for selecting downstream nodes (for example, if there are too many nodes, or to batch select certain specific nodes), you can choose mass mode. Mass mode will search for tasks within the selected range from the current node downward according to filtering conditions, and arrange them based on dependencies. This is suitable for scenarios requiring global data backfill. Mass mode also supports the following filtering parameters:
Coverage Range: You can specify the range through Specify Project, Specify Node Output Name, All Downstream Of Current Node, Specify Level-1 Child Nodes And All Their Downstream, Specify End Point, Specify Node Name, Specify Starting Point, All Upstream And Downstream Of Current Node, All Upstream Of Current Node.
Specify Project: Specify the data backfill range by specifying projects.
Specify Node Output Name: Specify the data backfill range by entering node output names. When entering multiple nodes, use line breaks to separate them. You can enter up to 1000 nodes.
All Downstream/Upstream of the current node: Backfill data for all upstream/downstream nodes of the current node.
Specify Level-1 Child Nodes And All Their Downstream: Backfill data for several level-1 child nodes of the current node and all their downstream nodes.
Specify End Point: Backfill data for all nodes on the chain from the starting point to the end point. The starting point defaults to the current node and cannot be modified. You can select multiple end point nodes.
Specify Starting Point: Backfill data for all nodes on the chain from the starting point to the end point. The end point defaults to the current node and cannot be modified. You can select multiple starting point nodes.
Specify Node Name: Backfill data for specified node names downstream of the current node. Multiple nodes should be separated by line breaks, with a maximum of 5000 characters. When a node name corresponds to multiple tasks, you can click Select Data Backfill Nodes in the prompt message. In the Nodes With Duplicate Node Names dialog box, select the corresponding nodes to confirm which nodes need data backfill.
Note: If the selected end point node is not a downstream node of the starting point, data backfill will only be performed on the two isolated nodes: the starting point and the end point.
End points can be searched by ID/Node Name, with the search scope covering all nodes within the current tenant.
Logical table task end points only support selection of the full table (all fields).
All Upstream And Downstream Of Current Node: Backfill data for all upstream and downstream nodes of the current node.
Exclude Within Selected Range: Specify the Node Output Names or Node Names to be excluded from the coverage range. Exclude Paused Nodes And Their Downstream is selected by default, which works similarly to Filter Paused Tasks And Their Downstream in list mode.
Note: After certain tasks within the selected range are excluded, isolated task nodes may appear in the DAG graph of the data backfill instance.
This is suitable for scenarios where data backfill is only needed for one downstream task node.
In mass mode, you can view the Selected Node List to confirm the data backfill nodes, or click Export Selected Node List to export the list as a local file in CSV format.
Other Configurations
Number Of Concurrent Groups
The number of concurrent groups is used to control how many data backfill processes run simultaneously. You can select the number of concurrent groups. The system supports a minimum of 1 group and a maximum of 12 groups running concurrently.
If the number of days in the data timestamp range is less than the number of concurrent groups, the actual number of parallel groups equals the number of days in the range.
If the number of days in the data timestamp range is greater than the number of concurrent groups, execution is partly serial and partly parallel. Instances in the same group run in data timestamp order, while instances in different groups run in parallel. For example, if the data timestamp range is January 11 to January 13 and the number of concurrent groups is 2, January 11 and January 12 form one group and January 13 forms another. The instances for January 11 and January 13 start running at the same time, while the instance for January 12 starts running after the instance for January 11 completes.
Note: Concurrent execution is not supported when there are cross-cycle dependencies among the selected nodes.
Data Backfill Order
You can choose to perform data backfill in ascending or descending order based on the business time.
Note: Data backfill in descending order of data timestamps is not supported when there are cross-cycle dependencies among the selected nodes.
Dry-Run This Node
Select whether this task needs to be dry-run:
Yes: The data backfill instance corresponding to the current task runs in dry-run mode, meaning it returns success directly when scheduled to this task without actually executing the task.
Note: This is suitable for scenarios where the current node does not need data backfill, but downstream nodes selected from the current node need data backfill.
No: This node runs normally.
Instances Of Paused Scheduled Tasks
Configure the running status of data backfill instances generated by paused scheduled tasks:
Pause Running (May Block Data Backfill Process): This means that all data backfill instances generated by paused tasks are paused from running, which will block downstream instances from running normally.
Note: This is suitable for scenarios where neither the current task nor its downstream tasks need to run.
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected paused tasks will directly succeed in dry-run mode.
Note: This is suitable for scenarios where the current task does not need to run, but downstream tasks need to run normally according to the scheduling configuration.
Run Normally: All data backfill instances generated by paused tasks run normally.
Note: This is suitable for scenarios where the node is set to pause scheduling but needs to run normally for the selected data timestamps.
Instances Of Dry-Run Scheduled Tasks
Configure the running status of data backfill instances generated by dry-run scheduled tasks:
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected dry-run scheduled tasks will directly succeed in dry-run mode.
Run Normally: All data backfill instances generated by dry-run tasks run normally.
Hourly Range Impact Scope
For hourly or minute tasks, you also need to configure the effective range:
Do Not Affect Daily/Weekly/Monthly Scheduled Tasks (Run When Selected): This means that downstream tasks are not affected by the hourly range selection and all run.
Daily/Weekly/Monthly Scheduled Tasks Only Run If Their Scheduled Run Time Is Within The Selected Hourly Range: This means that downstream tasks are affected by the hourly range, and only run if their scheduled run time is within the selected hourly range.
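The second option above amounts to a time-window check on each downstream daily, weekly, or monthly instance. The following sketch illustrates that check with assumed inputs; it is not Dataphin's actual scheduling logic.

```python
from datetime import time

def runs_in_selected_hourly_range(scheduled_run_time: time, range_start: time, range_end: time) -> bool:
    """Return True if a daily/weekly/monthly instance's scheduled run time falls
    inside the hourly range selected for the backfill (illustrative only)."""
    return range_start <= scheduled_run_time <= range_end

# A downstream task scheduled at 02:30 runs only if the selected range covers that time.
print(runs_in_selected_hourly_range(time(2, 30), time(0, 0), time(6, 0)))   # True: task runs
print(runs_in_selected_hourly_range(time(2, 30), time(8, 0), time(12, 0)))  # False: task is skipped
```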
Specify Temporary Schedule Resource Group
If you have enabled the custom resource group feature, you can specify a temporary resource group for this data backfill operation to meet temporary resource consumption needs. For more information, see Resource group overview. If no temporary schedule resource group is specified, the task schedule resource group configured for each task will be used for scheduling and running.
Note: Only resource groups whose application scenarios include batch operations can be selected.
Click OK to complete the data backfill operation for the current task and its upstream/downstream tasks.
What to do next
After submitting the data backfill operation, you can perform operations management on the data backfill instances, such as viewing running logs, viewing node code, and stopping instance execution. For more information, see Data backfill instance O&M overview.