The data backfill feature for recurring tasks lets you refresh data for a specified range of historical data timestamps. After a recurring task is developed and published, it runs according to the configured schedule. To run a recurring task for a specific time period or to refresh data for a historical time range, you can use the data backfill feature. The scheduling parameters used by the node are automatically replaced with values derived from the business time selected for the data backfill. This topic describes how to perform data backfill for recurring tasks.
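To illustrate how scheduling parameters map to backfill business times, the following Python sketch substitutes a placeholder for each day in a backfill range. The ${bizdate} placeholder name, the yyyymmdd value format, and the substitution code are illustrative assumptions, not Dataphin's actual implementation.

```python
from datetime import date, timedelta

def backfill_business_dates(start: date, end: date):
    """Yield every data timestamp (business date) in the backfill range, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

# Hypothetical task code that references a scheduling parameter (assumed placeholder name).
task_sql_template = "INSERT OVERWRITE TABLE dws_sales PARTITION (ds='${bizdate}') SELECT ... ;"

# For each business date in the backfill range, the placeholder is replaced with a
# concrete value before the corresponding instance runs (illustrative only).
for biz_date in backfill_business_dates(date(2024, 1, 11), date(2024, 1, 13)):
    instance_sql = task_sql_template.replace("${bizdate}", biz_date.strftime("%Y%m%d"))
    print(instance_sql)
```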
Scenarios
The data backfill feature is commonly used in the following scenarios:
Newly developed recurring tasks can only be scheduled starting from the next day. To immediately view historical partition data, you can perform a data backfill operation.
When upstream dependent tasks are rerun or backfilled and their historical partition data is refreshed, you can use the data backfill feature to refresh the historical partition data of downstream tasks.
When historical business data is missing and needs to be refreshed, you can backfill the affected data timestamps.
Data backfill modes
Currently, the Operation Center data backfill feature supports backfilling the current task only, or backfilling the current task together with its upstream and downstream tasks. The two modes are described below:
Backfill Current Task: This refers to performing data backfill operations on the current task. This is suitable for the following scenarios:
When you need to refresh data for the current node without updating downstream node data.
When the computation logic of the current task changes, you can first perform data backfill on the current task to verify the correctness of the computation logic, and then refresh data for downstream tasks.
Backfill Current And Upstream/Downstream Tasks: This refers to refreshing the current task and its upstream/downstream tasks. It is suitable for scenarios where data for the entire task chain needs to be refreshed.
Data backfill operation entry
In the top menu bar of the Dataphin homepage, choose Development > Task O&M.
In the navigation pane on the left, choose Task O&M > Recurring Task.
In the top navigation bar, select the production or development environment.
On the Integration And Computing Tasks or Modeling Task tab, click the Actions icon in the Data Backfill column of the target task, and then select Backfill Current Task or Backfill Current And Upstream/Downstream Tasks.
Note: Data backfill operations can also be performed in the DAG graph of a recurring task. For more information, see DAG graph of recurring tasks.
Backfill data for the current task
In the Data Backfill - Backfill Current Task dialog box, configure the data backfill task.
Step 1: Data backfill configuration
Basic information
Data Backfill Instance Name
The system automatically generates a name in the format of Node Name_Run Date_Instantiation Number. You can also manually change it.
Runtime
Supports Run Immediately or Custom run.
Run Immediately: After the configuration is complete, the data backfill instance is immediately generated to perform the data backfill task.
Custom: You can customize the specific time when the data backfill instance runs. The data backfill instance will start scheduling at the customized time.
When the system time zone (the time zone in the User Center) is different from the scheduling time zone, the system will display both the system time zone and the scheduling time zone. After selecting the run time, the system automatically calculates the corresponding scheduling time zone time.
Note: The custom run time must be later than the current time.
After you configure a custom run time, the latest selectable Data Timestamp is the day of the custom run date.
Data backfill tasks with a custom run time generate pending instances at 23:00 on the day before the scheduled run time.
Data Timestamp
Select the data timestamp range for which you need to perform data backfill (calculated according to the configured scheduling time zone). Configure the data timestamp based on the task's scheduling cycle:
For tasks with daily, weekly, or monthly scheduling cycles, you can select By Range, By Cycle, or Custom data timestamps. The application scenarios for each option are as follows:
By Range: Suitable for scenarios where you need to refresh data for multiple consecutive data timestamps. You must select a start time and an end time. The time span cannot exceed one year.
Note: If you only need to perform data backfill for a single day, you can select the same date for both the start time and the end time.
By Cycle: Used to refresh data for specific days of the week or dates of the month within multiple consecutive data timestamps. You must first select a time range. The time span cannot exceed one year.
Weekly: Refreshes the selected days of the week within the continuous time period.
Monthly: Refreshes the selected dates of each month within the continuous time period.
Note: Month-end refers to the last day of each month.
Custom: Suitable for scenarios where you need to refresh data for multiple non-consecutive data timestamps. You can manually enter data timestamps from 1900-01-01 to the present. The format must be YYYY-MM-DD. To refresh data for multiple data timestamps, separate them with line breaks (see the illustrative sketch after this parameter's description).
For tasks with hourly or minute scheduling cycles, first select the data timestamp, and then select the data backfill time range accurate to the minute. Together, these define the data timestamp and time range for the data backfill.
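As a rough illustration of the Custom input format described above, the following sketch parses a line-break-separated list of data timestamps and checks that each value is in the YYYY-MM-DD format and falls between 1900-01-01 and today. It is an assumption-based illustration, not part of Dataphin.

```python
from datetime import date, datetime

def parse_custom_data_timestamps(raw_input: str) -> list[date]:
    """Parse line-break-separated data timestamps and validate the documented constraints."""
    timestamps = []
    for line in raw_input.strip().splitlines():
        value = line.strip()
        if not value:
            continue
        parsed = datetime.strptime(value, "%Y-%m-%d").date()  # must be YYYY-MM-DD
        if not (date(1900, 1, 1) <= parsed <= date.today()):
            raise ValueError(f"{value} is outside the allowed range (1900-01-01 to today)")
        timestamps.append(parsed)
    return timestamps

# Three non-consecutive data timestamps, one per line.
print(parse_custom_data_timestamps("2024-01-11\n2024-01-15\n2024-02-01"))
```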
Select Field
If you are performing data backfill for a modeling task, you must select the fields to backfill:
If the primary key or source table has changed, to ensure data consistency and correctness, only the full table data backfill mode is supported.
If the primary key or source table has not changed, you can choose full table data backfill mode or specified field data backfill mode:
Full Table: Suitable for scenarios where all fields in the data table need data backfill.
Note: This does not include registered fields.
Specified Fields: Suitable for scenarios where you need to customize the fields for data backfill. After you select the backfill fields, the following fields are automatically selected as well:
Fields in the same materialization node as the selected fields.
Fields that are required by the system implementation, for example, when the scheduling cycle of a metric is modified and refreshed but its materialization remains unchanged.
Other Configurations
Single Instance Data Backfill
This option is supported only for logical fact tables.
A single data backfill instance can update data for all selected dates (within the range) of the event logical fact table. Compared with regular multi-instance concurrent data backfill, this can save computing resources and significantly reduce data backfill time.
Number Of Concurrent Groups
The number of concurrent groups is used to control how many data backfill processes run simultaneously. You can select the number of concurrent groups. The system supports a minimum of 1 group and a maximum of 12 groups running concurrently.
If the number of days in the data timestamp range is less than the number of concurrent groups, the actual number of parallel groups equals the number of days in the range.
If the number of days in the data timestamp range is greater than the number of concurrent groups, execution is partly serial and partly parallel. Instances in the same group run in data timestamp order, while instances in different groups run in parallel. For example, if the data timestamp range is January 11 to January 13 and the number of concurrent groups is 2, January 11 and January 12 form one group and January 13 forms another. The instances for January 11 and January 13 start running at the same time, while the instance for January 12 starts running after the instance for January 11 completes.
Note: Concurrent execution is not supported when there are cross-cycle dependencies among the selected nodes.
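To make the grouping behavior described above concrete, the following sketch splits a range of data timestamps into contiguous concurrent groups, matching the January 11 to January 13 example. It illustrates the documented semantics only and is not the actual scheduling implementation; the year is chosen arbitrarily.

```python
from datetime import date, timedelta
from math import ceil

def split_into_concurrent_groups(start: date, end: date, groups: int):
    """Split consecutive data timestamps into contiguous groups.

    Instances within a group run serially in data timestamp order;
    the groups themselves start in parallel (illustrative only).
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    groups = min(groups, len(days))   # never more groups than days in the range
    size = ceil(len(days) / groups)   # contiguous chunk size per group
    return [days[i:i + size] for i in range(0, len(days), size)]

# Example from the text: January 11 to January 13 with 2 concurrent groups.
for i, group in enumerate(split_into_concurrent_groups(date(2024, 1, 11), date(2024, 1, 13), 2), start=1):
    print(f"group {i}: " + " -> ".join(d.isoformat() for d in group))
# group 1: 2024-01-11 -> 2024-01-12   (these two instances run serially)
# group 2: 2024-01-13                 (starts in parallel with group 1)
```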
Data Backfill Order
You can choose to perform data backfill in ascending or descending order based on the business time.
Note: Data backfill in descending order of data timestamps is not supported when there are cross-cycle dependencies or self-dependencies among the selected nodes.
Instances Of Paused Scheduled Tasks
Configure the running status of data backfill instances generated by paused scheduled tasks:
Pause Running (May Block Data Backfill Process): This means that all data backfill instances generated by paused tasks are paused from running, which will block downstream instances from running normally.
Note: This is suitable for scenarios where neither the current task nor its downstream tasks need to run.
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected paused tasks will directly succeed in dry-run mode.
Note: This is suitable for scenarios where the current task does not need to run, but downstream tasks need to run normally according to the scheduling configuration.
Run Normally: All data backfill instances generated by paused tasks run normally.
Note: This is suitable for scenarios where the node is set to pause scheduling but needs to run normally for the selected data timestamps.
Instances Of Dry-Run Scheduled Tasks
Configure the running status of data backfill instances generated by dry-run scheduled tasks:
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected dry-run scheduled tasks will directly succeed in dry-run mode.
Run Normally: All data backfill instances generated by dry-run tasks run normally.
Specify Temporary Schedule Resource Group
If you have enabled the custom resource group feature, you can specify a temporary resource group for this data backfill operation to meet temporary resource consumption needs. For more information, see Resource group overview. If no temporary schedule resource group is specified, the task schedule resource group configured for each task will be used for scheduling and running.
Note: Only resource groups whose application scenarios include batch operations can be selected.
Backfill Current Task for modeling tasks does not support this configuration.
Step 2: Select recommended fields
If the Specified Fields of a modeling task have associated fields, you can select the optional associated fields in the Select Recommended Fields step so that they are backfilled together. Recommendation reasons include Fields With Changed Calculation Logic, Primary Key Of The Primary Table With Changed Calculation Logic, and Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed.
Fields With Changed Calculation Logic: The current calculation logic of the field is different from the calculation logic in the historical partitions within the selected data timestamps, meaning the calculation logic of the field has changed. Such fields can be included in the data backfill.
Primary Key Of The Primary Table With Changed Calculation Logic: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range. The primary key fields of the primary table can be included in the data backfill.
Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range (affecting the output of the child table). The primary key fields of the child table can be included in the data backfill.
Click OK to complete the data backfill operation for the current task.
Backfill data for the current and upstream/downstream tasks
In the Data Backfill - Backfill Current And Upstream/Downstream Tasks dialog box, configure the data backfill task.
Step 1: Basic information configuration
Data Backfill Instance Name
The system automatically generates a name in the format of Node Name_Run Date_Instantiation Number. You can also manually change it.
Runtime
Supports Run Immediately or Custom run.
Run Immediately: After the configuration is complete, the data backfill instance is immediately generated to perform the data backfill task.
Custom: You can customize the specific time when the data backfill instance runs. The data backfill instance will start scheduling at the customized time.
When the system time zone (the time zone in the User Center) is different from the scheduling time zone, the system will display both the system time zone and the scheduling time zone. After selecting the run time, the system automatically calculates the corresponding scheduling time zone time.
Note: The custom run time must be later than the current time.
After you configure a custom run time, the latest selectable Data Timestamp is the day of the custom run date.
Data backfill tasks with a custom run time generate pending instances at 23:00 on the day before the scheduled run time.
Data Timestamp
Select the data timestamp range for which you need to perform data backfill (calculated according to the configured scheduling time zone). Configure the data timestamp based on the task's scheduling cycle:
For tasks with daily, weekly, or monthly scheduling cycles, you can select By Range, By Cycle, or Custom data timestamps. The application scenarios for each option are as follows:
By Range: Suitable for scenarios where data needs to be refreshed for multiple consecutive data timestamps.
Note: If you only need to perform data backfill for a single day, you can select the same date for both the start time and the end time.
By Cycle: Used for refreshing data for specific days of the week or specific dates of the month within multiple consecutive data timestamps.
Weekly: Refreshes the selected days of the week within the continuous time period.
Monthly: Refreshes the selected dates of each month within the continuous time period.
Note: Month-end refers to the last day of each month.
Custom: Suitable for scenarios where data needs to be refreshed for multiple non-consecutive data timestamps. You can manually enter the data timestamps in the YYYY-MM-DD format. To refresh data for multiple data timestamps, separate them with line breaks.
For tasks with hourly or minute scheduling cycles, first select the data timestamp, and then select the data backfill time range accurate to the minute. Together, these define the data timestamp and time range for the data backfill.
Select Field
If you are performing data backfill for a modeling task, you must select the fields to backfill:
If the primary key or source table has changed, to ensure data consistency and correctness, only the full table data backfill mode is supported.
If the primary key or source table has not changed, you can choose full table data backfill mode or specified field data backfill mode:
Full Table: Suitable for scenarios where all fields in the data table need data backfill.
Specified Fields: Suitable for scenarios where you need to customize the fields for data backfill. After you select the backfill fields, the following fields are automatically selected as well:
Fields in the same materialization node as the selected fields.
Fields that are required by the system implementation, for example, when the scheduling cycle of a metric is modified and refreshed but its materialization remains unchanged.
Step 2: Select recommended fields
If the Specified Fields of a modeling task have associated fields, you can select the optional associated fields in the Select Recommended Fields step so that they are backfilled together. Recommendation reasons include Fields With Changed Calculation Logic, Primary Key Of The Primary Table With Changed Calculation Logic, and Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed.
Fields With Changed Calculation Logic: The current calculation logic of the field is different from the calculation logic in the historical partitions within the selected data timestamps, meaning the calculation logic of the field has changed. Such fields can be included in the data backfill.
Primary Key Of The Primary Table With Changed Calculation Logic: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range. The primary key fields of the primary table can be included in the data backfill.
Primary Key Of The Child Table Where The Primary Table's Primary Key Has Changed: The calculation logic of the primary key in the primary table of the parent-child dimensional table has changed within the selected data timestamp range (affecting the output of the child table). The primary key fields of the child table can be included in the data backfill.
Data backfill configuration
Data Backfill Range: You can select the upstream and downstream tasks that need data backfill through List Mode or Mass Mode.
Important: Cross-node parameter instructions: When selecting nodes, we recommend that you select all upstream nodes from which the node references cross-node parameters. When a downstream (Down) node references cross-node output parameters from an upstream (Up) node, if you perform a data backfill operation on the downstream (Down) node without selecting the upstream (Up) node in the same data backfill instance, the cross-node input parameters of the downstream (Down) node take their values from the running records of the upstream (Up) node within the most recent N days. If there are no running records, or if the records are older than N days, the default value is used. N defaults to 15 days and may be subject to change. We recommend selecting both the upstream (Up) and downstream (Down) nodes. For more information, see Parameter configuration and using node parameters.
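The fallback behavior described in the note above can be summarized as follows. The sketch is an assumption-based illustration of the documented resolution order (value from the same backfill instance, then the most recent upstream run within N days, then the default value); the function name and inputs are hypothetical and not a Dataphin API.

```python
from datetime import date, timedelta
from typing import Optional

N_DAYS = 15  # documented default lookback window; may be subject to change

def resolve_cross_node_param(
    upstream_in_same_backfill: bool,
    backfill_value: Optional[str],
    recent_runs: list[tuple[date, str]],  # (run date, upstream output parameter value)
    default_value: str,
    today: date,
) -> str:
    """Illustrative resolution order for a downstream node's cross-node input parameter."""
    # 1. Upstream node selected in the same backfill instance: use its fresh output value.
    if upstream_in_same_backfill and backfill_value is not None:
        return backfill_value
    # 2. Otherwise, use the most recent upstream run record within the last N days.
    cutoff = today - timedelta(days=N_DAYS)
    for run_date, value in sorted(recent_runs, reverse=True):
        if run_date >= cutoff:
            return value
    # 3. No usable run record: fall back to the parameter's default value.
    return default_value

# Upstream not selected, but a run record exists within the 15-day window -> its value is used.
print(resolve_cross_node_param(False, None, [(date(2024, 1, 1), "v_20240101")], "default", date(2024, 1, 10)))
```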
List mode
Suitable for selecting upstream and downstream tasks at any level. You can quickly select task dependencies from 1 to 10 levels, or all levels. The list can display up to 2000 nodes; if this limit is exceeded, select Mass Mode. You can also click the icon in the list to filter nodes by Node Type, Project, and O&M Owner.
Note: If the starting task is a logical table, the display range of downstream tasks depends on the logical table fields selected for data backfill.
The display range of upstream and downstream tasks includes all upstream and downstream tasks of the selected fields of the current table, including associated required fields, but not including associated recommended fields.
Filter Paused Tasks And Their Downstream:
Selected by default. When selected, the list does not display paused-scheduling nodes or any of their downstream nodes under the specified level and filter conditions, and any paused tasks that are already selected are deselected.
For logical tables that contain paused fields, the paused fields are filtered out, and all downstream tasks of logical table fields marked as paused in the downstream dependency list are also filtered out.
Note: Downstream logical table fields can only be selected for data backfill as a whole; paused fields cannot be filtered out separately.
Mass mode
If list mode cannot meet your requirements for selecting downstream nodes (for example, if there are too many nodes, or to batch select certain specific nodes), you can choose mass mode. Mass mode will search for tasks within the selected range from the current node downward according to filtering conditions, and arrange them based on dependencies. This is suitable for scenarios requiring global data backfill. Mass mode also supports the following filtering parameters:
Coverage Range: You can specify the range through Specify Project, Specify Node Output Name, All Downstream Of Current Node, Specify Level-1 Child Nodes And All Their Downstream, Specify End Point, Specify Node Name, Specify Starting Point, All Upstream And Downstream Of Current Node, All Upstream Of Current Node.
Specify Project: Specify the data backfill range by specifying projects.
Specify Node Output Name: Specify the data backfill range by entering node output names. When entering multiple nodes, use line breaks to separate them. You can enter up to 1000 nodes.
All Downstream/Upstream of the current node: Backfill data for all upstream/downstream nodes of the current node.
Specify Level-1 Child Nodes And All Their Downstream: Backfill data for several level-1 child nodes of the current node and all their downstream nodes.
Specify End Point: Backfill data for all nodes on the chain from the starting point to the end point. The starting point defaults to the current node and cannot be modified. You can select multiple end point nodes.
Specify Starting Point: Backfill data for all nodes on the chain from the starting point to the end point. The end point defaults to the current node and cannot be modified. You can select multiple starting point nodes.
Specify Node Name: Backfill data for specified node names downstream of the current node. Multiple nodes should be separated by line breaks, with a maximum of 5000 characters. When a node name corresponds to multiple tasks, you can click Select Data Backfill Nodes in the prompt message. In the Nodes With Duplicate Node Names dialog box, select the corresponding nodes to confirm which nodes need data backfill.
Note: If the selected end point node is not a downstream node of the starting point, data backfill will only be performed on the two isolated nodes: the starting point and the end point.
End points can be searched by ID/Node Name, with the search scope covering all nodes within the current tenant.
Logical table task end points only support selection of the full table (all fields).
All Upstream And Downstream Of Current Node: Backfill data for all upstream and downstream nodes of the current node.
Exclude Within Selected Range: Specify the Node Output Names or Node Names to be excluded from the coverage range. Exclude Paused Nodes And Their Downstream is selected by default, which works similarly to Filter Paused Tasks And Their Downstream in list mode.
Note: After certain tasks within the selected range are excluded, isolated task nodes may appear in the DAG graph of the data backfill instance.
This is suitable for scenarios where data backfill is only needed for one downstream task node.
In mass mode, you can view the Selected Node List to confirm the data backfill nodes, or click Export Selected Node List to export the list as a local file in CSV format.
Other Configurations
Number Of Concurrent Groups
The number of concurrent groups is used to control how many data backfill processes run simultaneously. You can select the number of concurrent groups. The system supports a minimum of 1 group and a maximum of 12 groups running concurrently.
If the number of days in the data timestamp range is less than the number of concurrent groups, the actual number of parallel groups equals the number of days in the range.
If the number of days in the data timestamp range is greater than the number of concurrent groups, execution is partly serial and partly parallel. Instances in the same group run in data timestamp order, while instances in different groups run in parallel. For example, if the data timestamp range is January 11 to January 13 and the number of concurrent groups is 2, January 11 and January 12 form one group and January 13 forms another. The instances for January 11 and January 13 start running at the same time, while the instance for January 12 starts running after the instance for January 11 completes.
Note: Concurrent execution is not supported when there are cross-cycle dependencies among the selected nodes.
Data Backfill Order
You can choose to perform data backfill in ascending or descending order based on the business time.
Note: Data backfill in descending order of data timestamps is not supported when there are cross-cycle dependencies among the selected nodes.
Dry-Run This Node
Select whether this task needs to be dry-run:
Yes: The data backfill instance corresponding to the current task runs in dry-run mode, meaning it returns success directly when scheduled to this task without actually executing the task.
Note: This is suitable for scenarios where the current node does not need data backfill, but downstream nodes selected from the current node need data backfill.
No: This node runs normally.
Instances Of Paused Scheduled Tasks
Configure the running status of data backfill instances generated by paused scheduled tasks:
Pause Running (May Block Data Backfill Process): This means that all data backfill instances generated by paused tasks are paused from running, which will block downstream instances from running normally.
Note: This is suitable for scenarios where neither the current task nor its downstream tasks need to run.
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected paused tasks will directly succeed in dry-run mode.
Note: This is suitable for scenarios where the current task does not need to run, but downstream tasks need to run normally according to the scheduling configuration.
Run Normally: All data backfill instances generated by paused tasks run normally.
Note: This is suitable for scenarios where the node is set to pause scheduling but needs to run normally for the selected data timestamps.
Instances Of Dry-Run Scheduled Tasks
Configure the running status of data backfill instances generated by dry-run scheduled tasks:
Dry-Run: If dry-run is selected, the data backfill instances generated by the selected dry-run scheduled tasks will directly succeed in dry-run mode.
Run Normally: All data backfill instances generated by dry-run tasks run normally.
Hourly Range Impact Scope
For hourly or minute tasks, you also need to configure the effective range:
Do Not Affect Daily/Weekly/Monthly Scheduled Tasks (Run When Selected): This means that downstream tasks are not affected by the hourly range selection and all run.
Daily/Weekly/Monthly Scheduled Tasks Only Run If Their Scheduled Run Time Is Within The Selected Hourly Range: This means that downstream tasks are affected by the hourly range, and only run if their scheduled run time is within the selected hourly range.
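The second option above amounts to a time-window check on each downstream daily, weekly, or monthly instance. The following sketch illustrates that check with assumed inputs; it is not Dataphin's actual scheduling logic.

```python
from datetime import time

def runs_in_selected_hourly_range(scheduled_run_time: time, range_start: time, range_end: time) -> bool:
    """Return True if a daily/weekly/monthly instance's scheduled run time falls
    inside the hourly range selected for the backfill (illustrative only)."""
    return range_start <= scheduled_run_time <= range_end

# A downstream task scheduled at 02:30 runs only if the selected range covers that time.
print(runs_in_selected_hourly_range(time(2, 30), time(0, 0), time(6, 0)))   # True: task runs
print(runs_in_selected_hourly_range(time(2, 30), time(8, 0), time(12, 0)))  # False: task is skipped
```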
Specify Temporary Schedule Resource Group
If you have enabled the custom resource group feature, you can specify a temporary resource group for this data backfill operation to meet temporary resource consumption needs. For more information, see Resource group overview. If no temporary schedule resource group is specified, the task schedule resource group configured for each task will be used for scheduling and running.
Note: Only resource groups whose application scenarios include batch operations can be selected.
Click OK to complete the data backfill operation for the current task and its upstream/downstream tasks.
What to do next
After submitting the data backfill operation, you can perform operations management on the data backfill instances, such as viewing running logs, viewing node code, and stopping instance execution. For more information, see Data backfill instance O&M overview.