The data backfill feature for auto triggered nodes enables the refreshing of data within a specified historical business time. After an auto triggered node is developed and published, it runs periodically as per the schedule configuration. To execute the auto triggered node for a specific time period or to refresh data for a historical time range, the data backfill feature can be utilized. During data backfill, the node's scheduling parameters are automatically replaced with values corresponding to the selected business time. This topic outlines the process for performing data backfill for auto triggered nodes.
Scenarios
Common scenarios for using the data backfill feature include:
For a newly developed auto triggered node that is scheduled to start from the next day, data backfill can be used to immediately view historical partition data.
When historical partition data is refreshed due to rerunning or data backfill of upstream dependent nodes, the data backfill feature can refresh the historical partition data of downstream nodes.
When historical business data has omissions, data backfill can be used to refresh it on a regular basis.
Data Backfill Modes
The Operation Center supports two data backfill modes: for the current node only, and for both the current and downstream nodes. The following details each mode:
Backfill Current Node: This mode is used to perform data backfill operations solely on the current node. It is applicable in scenarios such as:
Refreshing data for the current node without affecting downstream nodes.
Verifying the correctness of the current node's calculation logic by performing data backfill before refreshing data for downstream nodes.
Backfill Current And Downstream Nodes: This mode involves the current node and its downstream nodes and is suitable for scenarios requiring a complete data trace refresh.
Data Backfill Operation Entry
Navigate to the Dataphin home page, click Development in the top menu bar, then select Task Operation And Maintenance.
Follow the steps below to choose the appropriate Data Backfill mode for auto triggered nodes:
Select Project (in Dev-Prod mode, also select the environment) > click Recurring Task > locate the task to backfill and click the icon > click Data Backfill.
Note: Data backfill can also be performed from the DAG graph of auto triggered nodes. For more information, see DAG Graph of Auto Triggered Nodes.
Data Backfill for Current Node
Open the Data Backfill - Backfill Current Node dialog box and configure the data backfill task as follows:
Step 1: Data Backfill Configuration.
Parameter
Description
Basic Information
Data Backfill Instance Name
The system automatically generates the instance name using the format Node Name_Run Date_Instance Number. You can also manually edit it.
Run Time
Choose between Immediate Run and Custom run options.
Immediate Run: The data backfill instance is created immediately after configuration to perform the task.
Custom: Set a custom run time for the data backfill instance, which will be scheduled at the specified time.
Note: The custom run time must be set for a future date and time.
After setting the custom run time, the Business Date can be selected up to the custom date.
Pending data backfill tasks are generated at 23:00 the day before the scheduled run time.
Business Date
Select the business date range for data backfill. Configure the business date according to the task's scheduling cycle. Options include:
For tasks scheduled daily, weekly, or monthly, choose from By Interval, By Cycle, or Custom business dates. Each option suits different scenarios:
By Interval: Ideal for refreshing data across multiple consecutive business dates.
Note: For single-day data backfill, select the same start and end date.
By Cycle: Refreshes data on specific days within a continuous date range.
Weekly: Refreshes data on selected weekdays within the date range.
Monthly: Refreshes data on selected dates within the date range.
Note: Month-end refers to the last day of each month.
Custom: Suitable for refreshing data on multiple non-consecutive business dates. Manually enter dates in the format YYYY-MM-DD, separated by line breaks for multiple dates.
For tasks scheduled by the hour or minute, you must select the business date and specify the data backfill time range down to the minute to establish the business date and time range for data backfill.
Select Field
When performing data backfill for detail and aggregate table tasks, it is necessary to select the fields to be backfilled.
The following details apply:
If the primary key or source table changes, only full table data backfill mode is supported to ensure data consistency and correctness.
If the primary key or source table remains unchanged, choose between full table data backfill mode or specified field data backfill mode:
Full Table: Appropriate for scenarios requiring data backfill for all table fields.
Note: Excludes registered hanging fields.
Specified Field: Suitable for custom field selection for data backfill. After you select fields, the system automatically adds the following fields:
Fields within the same materialization node as the selected fields.
Fields that are mandatory due to system implementation, such as when the scheduling cycle model of an indicator is modified but the materialization has not changed.
Other Configuration
Single Instance Data Backfill
Supports logical fact tables for selection.
A single data backfill instance can cover and update data for all selected dates (within the range) of this event's logical fact table. This approach can save computing resources and significantly reduce data backfill time compared to ordinary multi-instance concurrent data backfill.
Number Of Concurrent Running Groups
Control the number of data backfill processes running simultaneously by selecting the number of concurrent running groups. The system supports a range from 1 group to a maximum of 12 groups running concurrently.
If the business date span is less than the number of concurrent groups, the actual number of parallel groups equals the number of business date days.
If the business date span exceeds the number of concurrent groups, a mix of serial and parallel execution occurs. Instances within the same group run sequentially by business date, while instances across different groups run in parallel. For example, if the business date spans from January 11 to January 13 and there are 2 concurrent groups, January 11 and 12 form one group, and January 13 forms another. Instances for January 11 and 13 begin simultaneously, with January 12's instance starting after January 11's completion.
Note: Concurrent execution is not supported for nodes with cross-cycle dependencies.
Data Backfill Order
Opt to perform data backfill in either ascending or descending order based on business time.
Note: Descending order based on business date is not supported for nodes with cross-cycle dependencies or self-dependencies.
Skip Execution For Scheduled Tasks
Set the run status for data backfill instances generated by tasks with paused scheduling:
Pause Running (may Block Data Backfill Process): Instances from paused tasks are also paused, potentially blocking downstream operations.
Note: Appropriate when neither the current task nor downstream tasks need to run.
Dry-run: Selected paused tasks' instances will succeed in dry-run without actual execution.
Note: Suitable when the current task doesn't need to run, but downstream tasks must proceed as scheduled.
Normal Running: Instances from paused tasks run as usual.
Note: Ideal for scenarios where a node with paused scheduling must run normally on selected business dates for data backfill.
Dry-run For Scheduled Tasks
Determine the run status for data backfill instances generated by tasks with dry-run scheduling:
Dry-run: Instances from selected dry-run tasks will succeed in dry-run without actual execution.
Normal Running: Instances from dry-run tasks execute normally.
Specify Temporary Scheduling Resource Group
When the custom resource group feature is enabled, you may temporarily assign a specific resource group for data backfill operations to accommodate short-term resource needs. For more information, see Overview of Resource Groups. If a temporary scheduling resource group is not specified, the system will default to using the task's configured scheduling resource group for both scheduling and execution.
Note: The selected resource group must support the batch operations scenario.
This configuration is not supported for backfilling current tasks (detail and aggregate table tasks).
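The grouping behavior described under Number Of Concurrent Running Groups can be illustrated with a short Python sketch. This is not Dataphin code: the helper name `split_into_groups` is hypothetical, the year 2024 is chosen only for illustration, and the way dates are distributed when they do not divide evenly (earlier groups receive the extra date) is an assumption that happens to match the January 11 to January 13 example above.

```python
from datetime import date, timedelta

MAX_GROUPS = 12  # upper limit stated in the text


def split_into_groups(business_dates, requested_groups):
    """Partition business dates into contiguous groups (hypothetical helper).

    Dates inside one group would run sequentially; groups run in parallel.
    """
    if not 1 <= requested_groups <= MAX_GROUPS:
        raise ValueError("requested_groups must be between 1 and 12")
    dates = sorted(business_dates)
    if not dates:
        return []
    # Fewer business dates than groups: one group per business date.
    group_count = min(requested_groups, len(dates))
    # Assumption: earlier groups absorb the remainder.
    base, extra = divmod(len(dates), group_count)
    groups, start = [], 0
    for index in range(group_count):
        size = base + (1 if index < extra else 0)
        groups.append(dates[start:start + size])
        start += size
    return groups


# January 11 to January 13 with 2 concurrent groups (year is illustrative).
dates = [date(2024, 1, 11) + timedelta(days=offset) for offset in range(3)]
groups = split_into_groups(dates, 2)
```

Here `groups[0]` holds January 11 and 12 and `groups[1]` holds January 13, so the instances for January 11 and January 13 could start together while January 12 waits for January 11.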
Step 2: Select Recommended Fields.
When the Specified Field selection for detail and aggregate table tasks has optional associated fields, you can select them for backfill in the Select Recommended Fields step. Fields are recommended for one of the following reasons:
Fields With Changes In Calculation Logic: Indicates that the current calculation logic differs from the historical partition's logic within the selected business date range, suggesting changes in the field's calculation logic. These fields can be backfilled together.
Primary Key Of The Main Table With Changes In Calculation Logic: The calculation logic of the primary key of the main table within the master-child dimension table has changed within the selected business date range. The primary key fields of the main table can be backfilled together.
Primary Key Of The Child Table With Changes In The Primary Key Of The Main Table: The calculation logic of the primary key of the main table within the master-child dimension table has changed within the selected business date range (affecting the output of the child table). The primary key fields of the child table can be backfilled together.
Click Confirm to finalize the data backfill operation for the current node.
Data Backfill for Current and Downstream Nodes
Open the Data Backfill - Backfill Current And Downstream Nodes dialog box and configure the data backfill task as follows:
Step 1: Basic Information Configuration.
Parameter
Description
Data Backfill Instance Name
The system automatically generates the instance name using the format Node Name_Run Date_Instance Number. You can also manually edit it.
Run Time
Choose between Immediate Run and Custom run options.
Immediate Run: The data backfill instance is created immediately after configuration to perform the task.
Custom: Set a custom run time for the data backfill instance, which will be scheduled at the specified time.
Note: The custom run time must be set for a future date and time.
After setting the custom run time, the Business Date can be selected up to the custom date.
Pending data backfill tasks are generated at 23:00 the day before the scheduled run time.
Business Date
Select the business date range for data backfill. Configure the business date according to the task's scheduling cycle. Options include:
For tasks scheduled daily, weekly, or monthly, choose from By Interval, By Cycle, or Custom business dates. Each option suits different scenarios:
By Interval: Ideal for refreshing data across multiple consecutive business dates.
Note: For single-day data backfill, select the same start and end date.
By Cycle: Used for refreshing data on specific days within a continuous date range.
Weekly: Refreshes data on selected weekdays within the date range.
Monthly: Refreshes data on selected dates within the date range.
Note: Month-end refers to the last day of each month.
Custom: Suitable for refreshing data on multiple non-consecutive business dates. Manually enter dates in the format YYYY-MM-DD, separated by line breaks for multiple dates.
For tasks scheduled by the hour or minute, you must first choose the business date, followed by specifying the data backfill time range down to the minute. This process sets the business date and precise time range for data backfilling.
Select Field
When performing data backfill for detail and aggregate table tasks, you must select the fields to be backfilled.
The following details apply:
If the primary key or source table changes, only full table data backfill mode is supported to ensure data consistency and correctness.
If neither the primary key nor the source table has changed, you have the option to select either full table data backfill mode or specified field data backfill mode:
Full Table: Appropriate for scenarios requiring data backfill for all table fields.
Specified Field: Suitable for custom field selection for data backfill. After selecting fields, the system automatically selects fields in the same materialization node and fields required by system implementation. The rules are as follows:
Fields within the same materialization node as the selected fields.
Fields that are mandatory due to system implementation, such as when the scheduling cycle model of an indicator is modified but the materialization has not changed.
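The Business Date options in Step 1 (By Interval, By Cycle, and Custom) each resolve to a set of business dates. The Python sketch below shows one way to enumerate them. The function names are hypothetical and the code is only an illustration of the selection rules, not how Dataphin implements them. Weekdays are expressed as ISO numbers (1 = Monday to 7 = Sunday), which is an assumption made for this sketch.

```python
import calendar
from datetime import date, timedelta


def by_interval(start, end):
    """Every business date from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def by_cycle_weekly(start, end, weekdays):
    """Dates in the range whose ISO weekday (1 = Monday) is selected."""
    return [d for d in by_interval(start, end) if d.isoweekday() in weekdays]


def by_cycle_monthly(start, end, days, month_end=False):
    """Dates in the range that fall on the selected days of the month.

    month_end=True also selects the last day of each month.
    """
    selected = []
    for d in by_interval(start, end):
        last_day = calendar.monthrange(d.year, d.month)[1]
        if d.day in days or (month_end and d.day == last_day):
            selected.append(d)
    return selected


def custom(text):
    """Parse YYYY-MM-DD dates separated by line breaks."""
    return [date.fromisoformat(line.strip())
            for line in text.splitlines() if line.strip()]


single_day = by_interval(date(2024, 1, 11), date(2024, 1, 11))
mondays = by_cycle_weekly(date(2024, 1, 1), date(2024, 1, 14), {1})
monthly = by_cycle_monthly(date(2024, 1, 30), date(2024, 3, 1), {1}, month_end=True)
picked = custom("2024-01-05\n2024-02-10\n")
```

Passing the same start and end date to `by_interval` yields a single business date, which mirrors the single-day backfill note above.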
Step 2: Select Recommended Fields.
When the Specified Field selection for detail and aggregate table tasks has optional associated fields, you can select them for backfill in the Select Recommended Fields step. Fields are recommended for one of the following reasons:
Fields With Changes In Calculation Logic: Indicates that the current calculation logic differs from the historical partition's logic within the selected business date range, suggesting changes in the field's calculation logic. These fields can be backfilled together.
Primary Key Of The Main Table With Changes In Calculation Logic: The calculation logic of the primary key of the main table within the master-child dimension table has changed within the selected business date range. The primary key fields of the main table can be backfilled together.
Primary Key Of The Child Table With Changes In The Primary Key Of The Main Table: The calculation logic of the primary key of the main table within the master-child dimension table has changed within the selected business date range (affecting the output of the child table). The primary key fields of the child table can be backfilled together.
Step 3: Data Backfill Configuration.
Parameter
Description
Data Backfill Scope
Downstream Node Selection
Select downstream nodes for data backfill using List Mode or Massive Mode:
Important: Cross-node parameters. When selecting nodes, also select every upstream node whose cross-node parameters are referenced by the selected nodes. If a downstream node references a cross-node output parameter of an upstream node and that upstream node is not included in the same data backfill instance, the cross-node input parameter of the downstream node takes its value from the upstream node's run records of the most recent N days. If there are no run records, or the records are older than N days, the default value is used. N defaults to 15 days and can be changed, so selecting both the upstream and downstream nodes is recommended. For more information, see Parameter Configuration and Use of Node Parameters.
List Mode: Suitable for selecting downstream nodes by dependency level. You can quickly select dependencies from level 1 to level 10, or all levels. The list displays up to 2000 nodes. If this limit is exceeded, use Massive Mode. You can also click the list icon to filter nodes by Node Type, Project, and Operation Owner.
Note: If the starting task is a logical table, the display range of downstream tasks depends on the logical table fields selected for data backfill.
The display range of downstream tasks includes all downstream tasks of the selected fields of the current table, including mandatory associated fields, but not recommended associated fields.
Filter Paused Tasks And Their Downstream:
Selected by default. When selected, the list hides nodes with paused scheduling and all their downstream nodes under the specified level and filter conditions, and deselects any paused tasks that were already selected.
For logical tables, if they contain paused fields, they will be filtered. All downstream tasks of logical tables marked as paused in the downstream dependency list will also be filtered.
Note: Downstream logical table fields can only be selected for data backfill as a whole. Paused fields cannot be filtered individually.
Massive Mode: If the list mode cannot meet your requirements for selecting downstream nodes (for example, if there are too many nodes or you need to batch select certain specified nodes), you can choose massive mode. Massive mode will search for tasks within the selected range from the current node downwards based on the filter conditions and orchestrate them according to dependencies. It is suitable for scenarios where global data backfill is required. Massive mode also supports the following filter parameters:
Coverage Range: Supports specifying the range through Specified Project, Specified Node Output Name, All Downstream Of Current Node, Specified First-level Child Node And All Its Downstream, Specified Endpoint, and Specified Node Name.
Specified Project: Specify the data backfill range by specifying the project.
Specified Node Output Name: Specify the data backfill range by entering the node output name. When entering multiple names, separate them with line breaks. A maximum of 1000 entries can be made.
All Downstream Of Current Node: Backfill data for all downstream nodes of the current node.
Specified First-level Child Node And All Downstream: Backfill data for several first-level child nodes of the current node and all their downstream nodes.
Specified Endpoint: Backfill data for all nodes on the trace from the starting point to the endpoint. The starting point defaults to the current node and cannot be modified. Multiple endpoint nodes can be selected.
Specified Node Name: Backfill data for the specified node names downstream of the current node. Multiple nodes are separated by line breaks, with a maximum input of 5000 characters. If a node name has multiple tasks, you can click Select Data Backfill Node in the prompt message, and in the Nodes With Duplicate Node Names dialog box, select the corresponding node to confirm the nodes that need data backfill.
Note: If the selected endpoint node is not a downstream node of the starting point, only the isolated nodes of the starting point and endpoint will be backfilled.
The endpoint can be searched by id/node name, and the search range is all nodes within the current tenant.
Logical table task endpoints only support selecting the full table (all fields).
Exclude Within Selected Range: Specify the Node Output Name or Node Name to exclude within the coverage range. By default, Exclude Paused Nodes And Their Downstream is selected, which behaves the same as Filter Paused Tasks And Their Downstream in list mode.
Note: After excluding certain tasks within the selected range, isolated task nodes may be generated on the DAG graph of the data backfill instance.
Suitable for scenarios where only one downstream task node needs data backfill.
Selected Node List: In massive mode, you can view the selected node list to confirm the data backfill nodes, or click Export Selected Node List to export it as a local CSV file.
Other Configuration
Number Of Concurrent Running Groups
The number of concurrent running groups is used to control how many data backfill processes are running simultaneously. You can select the number of concurrent running groups. The system supports a minimum of 1 group and a maximum of 12 groups running concurrently.
If the business date span is less than the number of concurrent groups, the actual number of parallel groups equals the number of business date days.
If the business date span exceeds the number of concurrent groups, a mix of serial and parallel execution occurs. Instances within the same group run sequentially by business date, while instances across different groups run in parallel. For example, if the business date spans from January 11 to January 13 and there are 2 concurrent groups, January 11 and 12 form one group, and January 13 forms another. Instances for January 11 and 13 begin simultaneously, with January 12's instance starting after January 11's completion.
Note: Concurrent execution is not supported for nodes with cross-cycle dependencies.
Data Backfill Order
You can choose to perform data backfill in ascending or descending order based on business time.
Note: Data backfill in descending order based on business date is not supported when there are cross-cycle dependencies in the selected nodes.
Is This Node A Dry-run
Choose whether this task needs a dry-run:
Yes: The data backfill instance corresponding to the current task runs as a dry-run, meaning that once scheduled to this task, it directly returns success without actually executing the task.
Note: Suitable for scenarios where the current node does not need data backfill but serves as the starting point for selecting the downstream nodes to backfill.
No: This node runs normally.
Skip Execution For Scheduled Tasks
Configure the run status of data backfill instances generated by tasks with paused scheduling:
Pause Running (may Block Data Backfill Process): Data backfill instances generated by tasks with paused scheduling are all paused, blocking the normal operation of downstream instances.
Note: Suitable for scenarios where neither the current task nor its downstream tasks need to run.
Dry-run: If dry-run is selected, the data backfill instances generated by the selected paused tasks will directly succeed in dry-run.
Note: Suitable for scenarios where the current task does not need to run, but downstream tasks need to run normally according to the schedule configuration.
Normal Running: Data backfill instances generated by tasks in a paused state run normally.
Note: Suitable for scenarios where the current node is set to paused scheduling and needs to run normally on the selected data backfill business dates.
Dry-run For Scheduled Tasks
Configure the run status of data backfill instances generated by tasks with dry-run scheduling:
Dry-run: If dry-run is selected, the data backfill instances generated by the selected dry-run scheduling tasks will directly succeed in dry-run.
Normal Running: Data backfill instances generated by tasks in a dry-run state run normally.
Hour Interval Impact Range
For tasks scheduled by the hour or minute, you also need to configure the impact range:
Does Not Affect Day/week/month Scheduled Tasks (selected To Run): Downstream tasks are not affected by the hour interval selection and all run.
Day/week/month Scheduled Tasks Only Run When The Scheduled Run Time Is Within The Selected Hour Interval: Downstream tasks are affected by the hour interval and only run when the scheduled run time is within the selected hour interval.
Specify Temporary Scheduling Resource Group
Should the custom resource group feature be enabled, you may temporarily assign a specific resource group for a data backfill operation to satisfy short-term resource demands. For more information, see Overview of Resource Groups. In the absence of a specified temporary scheduling resource group, the system will default to the task scheduling resource group set for each task for both scheduling and execution.
Note: The selected resource group must support the batch operations scenario.
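The cross-node parameter fallback described under Downstream Node Selection can be sketched in Python as follows. The function name, the dictionary of run records, and the parameter values are all hypothetical. The sketch also assumes that the latest run record inside the lookback window is the one used and that the window boundary is inclusive; the source text does not state either detail.

```python
from datetime import date, timedelta

DEFAULT_LOOKBACK_DAYS = 15  # default N stated in the text; it can be changed


def resolve_cross_node_input(upstream_runs, today, default_value,
                             lookback_days=DEFAULT_LOOKBACK_DAYS):
    """Pick the value a downstream node would receive when its upstream node
    is not part of the same data backfill instance (hypothetical helper).

    upstream_runs maps a run date to the output parameter value of that run.
    """
    cutoff = today - timedelta(days=lookback_days)
    recent = [run_date for run_date in upstream_runs
              if cutoff <= run_date <= today]
    if not recent:
        # No run record inside the window: fall back to the default value.
        return default_value
    # Assumption: the most recent record inside the window wins.
    return upstream_runs[max(recent)]


runs = {date(2024, 1, 1): "v1", date(2024, 1, 10): "v2"}
within_window = resolve_cross_node_input(runs, date(2024, 1, 20), "default")
outside_window = resolve_cross_node_input(runs, date(2024, 2, 20), "default")
```

Selecting the upstream node in the same data backfill instance avoids this fallback entirely, which is why the text recommends selecting both the upstream and downstream nodes.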
Click Confirm to complete the data backfill operation for the current and downstream nodes.
What to do next
After submitting the data backfill operation, you can manage the data backfill instances by viewing run logs, inspecting node code, terminating instance runs, and more. For further details, see Overview of Data Backfill Instance Operation and Maintenance.