A workflow automates processes based on task dependencies. A periodic workflow handles recurring data processing, such as daily or monthly runs, and automatically generates task instances on a predefined schedule. It operates on a dual-trigger logic: a task instance runs only when its scheduled time arrives and all its upstream dependencies are met. This ensures that complex data pipelines execute reliably, in the correct order, and end to end. Common use cases for periodic workflows include:
-
Run data processing tasks on a fixed schedule: Automate tasks like data synchronization, data cleansing, or aggregation to run daily, hourly, or weekly.
-
Build complex DAG workflows: A visual drag-and-drop interface lets you integrate various node types (such as MaxCompute SQL, Hologres, EMR, and Python), establish upstream and downstream dependencies between tasks, and enable automated scheduling.
-
Centrally schedule and manage multiple subtasks: Group logically related tasks into a single workflow that is scheduled as one unit, simplifying maintenance and monitoring.
Quick start
This feature is available in the new version of DataWorks Data Studio. To distinguish between the old and new versions, see FAQ.
This section provides a basic, ready-to-run example of a periodic workflow. You will build a simple pipeline consisting of a virtual node (the starting point) and a MaxCompute SQL node (for data processing), and then configure the workflow to run on a schedule. Use case: Automatically calculate the total number of orders from the previous day and write the results to a table every day after midnight.
Step 1: Prepare compute engine and data
-
In your target workspace, bind MaxCompute as a compute engine.
-
In MaxCompute, create a table with the following structure to store the results.
-- Create a simple result table.
CREATE TABLE IF NOT EXISTS dw_order_count_test (
    order_date STRING,
    total_count BIGINT
)
PARTITIONED BY (ds STRING); -- Use ds as the partition key to store the daily statistics.
Step 2: Create a periodic workflow
-
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and open the new version of Data Studio from the Actions column.
The button labeled "Data Development" opens an older version of Data Studio. Do not click it.
-
Click the icon in the left navigation bar, and then click the icon to the right of Project Directory to go to the Create Workflow page.
-
In the Create Workflow dialog box, set Scheduling Type to Periodic Scheduling, and enter a name such as minimal_daily_demo.
Step 3: Orchestrate the workflow
-
On the workflow canvas, drag a Zero-Load Node from the component panel on the left and name it start_node. This virtual node only defines the starting point of a business process and does not perform any actual execution.
-
Drag a MaxCompute SQL node onto the canvas and name it count_orders.
-
Click the dot at the bottom of the start_node node and drag a line to the dot at the top of the count_orders node. This connects the two nodes.
Step 4: Develop the node code
Enable DataWorks Copilot for intelligent code completion suggestions and improved development efficiency.
-
Double-click the count_orders node to open the code editor.
-
Write the business logic for the node. This example uses simulated data.
-- bizdate is a scheduling variable that is defined in the scheduling configuration.
INSERT OVERWRITE TABLE dw_order_count_test PARTITION (ds='${bizdate}')
SELECT '${bizdate}' AS order_date, COUNT(*) AS total_count
FROM (SELECT 1 AS id UNION ALL SELECT 2 AS id) t; -- Simulated data
For more information about node development, see MaxCompute SQL node.
-
Click the Save button at the top of the node editor.
Step 5: Configure scheduling and parameters
-
Return to the workflow canvas. In the panel on the right, click the Scheduling Settings > Scheduling time tab:
-
Set Scheduling Frequency to Day.
-
Set Scheduling time to 00:05. This schedules the workflow to run at 00:05 every day.
-
-
In the editor for the count_orders node, go to the scheduling configuration section on the right. Add a parameter with the Parameter name bizdate and the Parameter Value $[yyyymmdd-1]. This expression represents the day before the current date.
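To make the expression concrete, the following Python sketch mimics the date arithmetic behind $[yyyymmdd-1]. The function name resolve_bizdate is hypothetical and exists only for illustration; DataWorks performs the actual substitution when it generates the instance.

```python
from datetime import date, timedelta


def resolve_bizdate(scheduled_date: date) -> str:
    """Return the day before scheduled_date, formatted as yyyymmdd."""
    return (scheduled_date - timedelta(days=1)).strftime("%Y%m%d")


# An instance scheduled on 2026-01-20 processes the 20260119 partition.
print(resolve_bizdate(date(2026, 1, 20)))  # 20260119
```

The same arithmetic applies across month and year boundaries, so an instance scheduled on the first day of a year resolves to the last day of the previous year.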
Step 6: Debug the node and workflow
-
Debug the count_orders node:
-
Configure debug parameters: On the right of the node editor, click Debug Configuration.
-
In the Compute Resource section, select the MaxCompute compute engine that you prepared in Step 1.
-
In the Script Parameters section, specify a value for Value Used in This Run. The default value is the day before the current date.
-
-
Run the debug task: Click the Run button in the toolbar. This runs the node with the debug parameters that you specified in Debug Configuration.
-
After you confirm that the node runs as expected, click Sync to Scheduling in the upper-right corner.
-
-
Debug the entire workflow:
-
Return to the workflow canvas and click the Run icon in the top toolbar.
-
In the dialog box that appears, enter a Value Used in This Run for the workflow variables. For example, if the current date is 20260120, DataWorks replaces the bizdate variable with 20260119.
-
Step 7: Deploy to production
-
On the workflow canvas, click the publish button in the top toolbar.
-
In the deployment panel, the system checks dependencies and configurations. After the checks pass, click Start Release Production and select Full Publishing.
-
After the deployment is successful, go to Operation and Maintenance Center and verify that the workflow appears in the list of periodic tasks.
You have developed a basic periodic workflow that will run automatically every day after midnight.
Core design and configuration
Workflow orchestration is centered on a visual DAG canvas. You use control flow nodes, such as merge and branch nodes, and interaction nodes, such as HTTP triggers, to structure tasks. By defining scheduling dependencies to set the execution order and passing context through scheduling parameters, you can automate complex data processing pipelines.
Node and workflow orchestration
Simple orchestration
DataWorks uses visual orchestration to break down complex logic into distinct nodes and create a standardized processing flow. This DAG-based orchestration enables state-driven automation: when an upstream node succeeds, it triggers its downstream tasks, ensuring the pipeline remains linear and stable. In its simplest form, this orchestration is a static, linear, and unidirectional DAG.
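The state-driven behavior described above can be sketched in a few lines of Python. This is a simplified model and not the DataWorks scheduler: the function run_pipeline and the status strings are hypothetical, and the node names are borrowed from the quick start example.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(dependencies: dict[str, set[str]],
                 run_node: Callable[[str], bool]) -> dict[str, str]:
    """Run nodes in dependency order; a node runs only if all upstreams succeeded."""
    status: dict[str, str] = {}
    # static_order() yields each node after all of its upstream nodes.
    for node in TopologicalSorter(dependencies).static_order():
        upstreams = dependencies.get(node, set())
        if all(status[up] == "succeeded" for up in upstreams):
            status[node] = "succeeded" if run_node(node) else "failed"
        else:
            status[node] = "not_run"
    return status


# Each node maps to the set of upstream nodes it depends on.
deps = {"start_node": set(), "count_orders": {"start_node"}}

print(run_pipeline(deps, lambda node: True))
print(run_pipeline(deps, lambda node: node != "start_node"))
```

In the second call, start_node fails, so count_orders is never triggered.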
Complex orchestration: Flow control
The branch node, merge node, for-each node, and do-while node are supported only in DataWorks Standard Edition and later.
Control flow nodes are core components that elevate development from simple task integration to complex business orchestration. They overcome the linear-dependency limitations of traditional DAGs by introducing advanced logical control into your data workflows.
| Node type | Description |
| --- | --- |
| virtual node | A no-op node that performs no computation. It organizes subtasks and can serve as a workflow's starting point, for example in a product order analysis workflow. Note: If you develop a standalone node, you must use the workspace root node as the initial dependency. |
| branch node | Executes a branch based on an upstream node's result. For example, you can check the total daily order amount. If the total is 0, the workflow triggers an alert node and stops further computation. Otherwise, it proceeds to the report generation node. |
| merge node | Merges execution paths from multiple branches to resolve downstream dependencies. For example, a financial settlement task might depend on a "normal settlement" branch and an "account reconciliation logic" branch. The merge node ensures that after either branch completes its logic, the final report archiving task triggers. |
| for-each node | Iterates over a result set passed from an assignment node and executes downstream operations for each element. For example, for a list of 31 province names returned by an assignment node, the workflow runs a data cleansing task 31 times, processing the data partition for one province in each iteration. |
| do-while node | Creates a loop that runs until a specified condition is met, at which point the workflow proceeds to the next step. For example, you can call an external API every 10 minutes to check a data synchronization status. If the API returns "Processing", the loop continues. If it returns "Completed", the loop exits and subsequent processing begins. |
For more information, see General-purpose nodes in the new Data Studio.
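As a rough mental model, the for-each and do-while nodes behave like the loops in the following Python sketch. The helper names for_each and do_while are hypothetical, and the max_iterations cap is an assumption added so that the sketch always terminates; it is not a statement about how DataWorks limits loops.

```python
from typing import Callable, Iterable


def for_each(items: Iterable[str], task: Callable[[str], None]) -> int:
    """Run task once per element, like a for-each node over an assignment result."""
    count = 0
    for item in items:
        task(item)
        count += 1
    return count


def do_while(poll: Callable[[], str], max_iterations: int) -> bool:
    """Call poll at least once and repeat until it returns "Completed".

    Returns True if "Completed" was seen within max_iterations calls, else False.
    """
    for _ in range(max_iterations):
        if poll() == "Completed":
            return True
    return False


cleaned: list[str] = []
print(for_each(["province_a", "province_b"], cleaned.append))  # 2

responses = iter(["Processing", "Processing", "Completed"])
print(do_while(lambda: next(responses), max_iterations=10))  # True
```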
Complex orchestration: State awareness and external interaction
The Check node is supported only in DataWorks Professional Edition and later. Other nodes are supported only in DataWorks Enterprise Edition.
These nodes are core components for state awareness and conditional checks. They not only verify whether the physical resources or prerequisite tasks that a workflow depends on are ready but also bridge the communication between the data platform and external business systems.
| Node type | Description |
| --- | --- |
| HTTP trigger | Receives HTTP requests from external systems to trigger DataWorks task runs. For example, after an upstream business system completes its daily closing, it can call an HTTP API to trigger a T+1 data processing workflow in DataWorks. |
| Check node | Monitors whether external resources, such as OSS files or MaxCompute partitions, are ready. Once ready, it triggers downstream tasks. For example, a Check node can wait for the daily log files to be uploaded to OSS. After it detects that the files exist, it starts a log parsing task. |
| dependency check node | Uses active polling, logical combinations, and checks for cross-cycle or cross-workspace dependencies to verify if dependencies are ready. Once dependencies are met, it triggers downstream tasks, enabling precise control over complex scheduling conditions. For example, a daily-scheduled node can wait for all 24 hourly tasks from the previous day to complete. |
For more information, see General-purpose nodes in the new Data Studio.
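The dependency check example in the table (a daily node waiting for all 24 hourly tasks) boils down to an "all upstream instances succeeded" test. The following Python sketch shows that condition only; the function name and the status strings are hypothetical and do not correspond to a DataWorks API.

```python
def hourly_upstreams_ready(hourly_status: dict[int, str]) -> bool:
    """Return True only if hours 0 through 23 are all present and succeeded."""
    return all(hourly_status.get(hour) == "succeeded" for hour in range(24))


statuses = {hour: "succeeded" for hour in range(24)}
print(hourly_upstreams_ready(statuses))  # True

statuses[23] = "running"  # the last hourly instance has not finished yet
print(hourly_upstreams_ready(statuses))  # False
```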
Encapsulate sub-workflows to reuse logic
In a periodic workflow, you can use a SUB_PROCESS node to encapsulate a common task pipeline into a reusable module, called a sub-workflow, that other workflows can reference. For example, e-commerce, advertising, and IoT business lines might share a standardized processing flow that includes data cleansing, statistical aggregation, and data quality checks. Encapsulating this flow as a sub-workflow improves development efficiency.
Procedure:
-
In the workflow you plan to reference, such as child_workflow, go to the workflow's General settings and select Can be cited. This action converts the workflow into a sub-workflow.
-
In the main workflow, drag a SUB_PROCESS node onto the canvas and select child_workflow as the referenced workflow.
Sub-workflows have the following constraints:
-
Internal-only dependencies: The sub-workflow and all its internal nodes cannot have dependencies on any external tasks.
-
Isolation: The sub-workflow cannot be set as a direct dependency by any external task.
-
Passive triggering: After being published, a sub-workflow does not automatically generate scheduled instances. It runs only when called by a SUB_PROCESS node in another workflow.
For more information, see SUB_PROCESS node.
Recommendations for splitting and modularizing workflows
To maintain performance and maintainability, split workflows that have over 100 nodes:
-
Split by business domain: Separate processing pipelines for different business subjects, such as transactions, users, and products, into independent workflows.
-
Use SUB_PROCESS to encapsulate common logic: Package common, reusable processing steps, such as data cleansing and formatting, into referable sub-workflows.
Configure scheduling dependencies
A scheduling dependency is a core mechanism that ensures the sequential execution, consistency, and accuracy of data processing. It connects isolated tasks into an orderly data pipeline by requiring that both the scheduled time is reached and all upstream tasks have completed successfully. When you orchestrate nodes within a workflow, scheduling dependencies are created between them by default. In addition to these default dependencies, you can configure more complex dependencies.
For more information, see Scheduling dependency.
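The dual condition (scheduled time reached and all upstream tasks succeeded) can be expressed as a single predicate. The following Python sketch is illustrative only; the function can_run and the status strings are hypothetical names, not part of DataWorks.

```python
from datetime import datetime


def can_run(now: datetime, scheduled_time: datetime,
            upstream_states: list[str]) -> bool:
    """An instance runs only when its time has arrived AND every upstream succeeded."""
    time_reached = now >= scheduled_time
    # all() of an empty list is True, so a node without upstreams waits on time alone.
    upstreams_ok = all(state == "succeeded" for state in upstream_states)
    return time_reached and upstreams_ok


scheduled = datetime(2026, 1, 20, 0, 5)

print(can_run(datetime(2026, 1, 20, 0, 10), scheduled, ["succeeded"]))  # True
print(can_run(datetime(2026, 1, 20, 0, 10), scheduled, ["running"]))    # False
print(can_run(datetime(2026, 1, 20, 0, 1), scheduled, ["succeeded"]))   # False
```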
Workflow-level dependency: Use a workflow-level dependency when an entire workflow, treated as a single unit, must wait for other tasks (such as another workflow or a standalone node) to complete before it can start. This is suitable for treating a workflow as an independent business module. For example, you can configure a sales business workflow to wait for a foundational data workflow to finish.
Node-level dependency: Use a node-level dependency when a specific node within a workflow needs to wait for an external task that is not in the same workflow to complete. This allows for granular cross-workflow orchestration. For example, an aggregation node in a reporting workflow might need to wait for output from a specific node in an external financial system. We recommend configuring an upstream dependency check node to ensure the dependent task completes on time.
Cross-cycle dependency: A cross-cycle dependency is when a task's current-cycle instance depends on an instance from a different cycle. This is supported for self-dependencies on the same node and for dependencies between different nodes.
Cross-workspace dependency: When you need to depend on a task in another DataWorks workspace, you can uniquely link to the node by using its workspace name and the node's Output Name, Name, or ID. This is useful for cross-department or cross-project data collaboration. For example, a task in a marketing workspace might need to reference key data from an accounting workspace.
Parameter design and flow
Workspace parameters are supported only in DataWorks Professional Edition and later.
The new version of Data Studio in DataWorks supports a four-level parameter passing mechanism and allows for flexible, dynamic data transfer across nodes by using context parameters. The levels are ordered from the narrowest to the broadest scope:
Node parameters: In batch processing tasks, when the same SQL code needs to process different partitioned data each day, you can use node parameters to dynamically define dates in the code. Node parameters support constants, built-in variables, and custom time expressions. For more information, see Sources and expressions for scheduling parameters.
Context parameters: Pass dynamic values from an upstream node's output parameters to a downstream node. Context parameters support constants, variables, and upstream run results. For more information, see Configure and use node context parameters.
Workflow parameters: A workflow can contain dozens of nodes that need to share specific business identifiers. Manually editing parameters for each node is error-prone. You can use workflow parameters to define these values once and have them take effect for all nodes within the workflow. For more information, see Workflow parameters. Note: Sub-workflows within a workflow can directly reference the workflow's parameters.
Workspace parameters: When you debug code in a development environment and run it in a production environment, the database names and resource paths you use are often different. You can define these environment-specific configurations at the workspace level, and they will apply to all nodes within the workspace. For more information, see Use workspace parameters.
Configure the scheduling cycle
The key difference between a periodic workflow and the legacy business process is that a workflow's scheduling cycle is configured as a single unit. In contrast, a business process is a collection of tasks whose scheduling cycles must be configured individually.
The scheduling cycle can be set only at the workflow level. Nodes inside a workflow can only have a delayed execution time set, which is calculated as an offset from the workflow's defined scheduled time.
| Dimension | Workflow-level configuration | Internal node behavior |
| --- | --- | --- |
| Time property | Absolute time (for example, 02:00) | Relative time (a delay based on the workflow's scheduled time) |
| Cycle property | Defines a daily, hourly, minutely, weekly, monthly, or yearly cycle. | Inherits the workflow cycle and cannot be modified. |
| Trigger logic | Physical time is reached + Upstream tasks succeed. | Workflow scheduled time + Delay + Upstream tasks succeed. |
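The relative-time rule for internal nodes amounts to simple date arithmetic. In the following Python sketch, the function name node_earliest_start is hypothetical, and expressing the delay in minutes is an assumption made for illustration.

```python
from datetime import datetime, timedelta


def node_earliest_start(workflow_scheduled: datetime, delay_minutes: int) -> datetime:
    """Earliest time an internal node may start: workflow time plus its delay."""
    return workflow_scheduled + timedelta(minutes=delay_minutes)


workflow_time = datetime(2026, 1, 20, 2, 0)  # workflow scheduled at 02:00
print(node_earliest_start(workflow_time, 30))  # 2026-01-20 02:30:00
```

The node still waits for its upstream tasks to succeed; the computed time is only the earliest moment at which it becomes eligible to run.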
Debug and run
After developing your nodes and workflow, debug them individually and together to ensure they work correctly.
Single-node debugging
Single-node debugging validates a single node's code logic, such as an SQL statement, a Python script, or a data integration sync task. This mode runs only the selected node and does not trigger its upstream or downstream dependencies.
-
In the Debug Configuration panel on the right side of the node, configure the following parameters:
| Parameter | Description |
| --- | --- |
| Compute engine | Select a bound compute engine. If none are available, select Create Compute Engine from the drop-down list. Important: Ensure network connectivity between the compute engine and the resource group. For more information, see Network connectivity solutions. |
| Resource group | Select a resource group that passed connectivity tests when the compute engine was bound. Some nodes support configuring a custom image on the resource group to extend the runtime environment. |
| (Optional) Dataset | Some nodes, such as Shell and Python, support using datasets to access unstructured data stored in OSS or NAS. |
| (Optional) Script parameters | When you configure the content of a node, define variables by using the ${parameter name} format. You must configure the Parameter name and Parameter Value in the Script Parameters section. When the task runs, the variable is dynamically replaced with its actual value. For more information, see Source and expressions of scheduling parameters. |
| (Optional) Associated role | Some nodes, such as Shell and Python, support configuring an associated role to access resources from other cloud products. |
-
In the toolbar above the node editor, click Save and then Run to run the node task.
-
View the run log and check the results at the bottom of the node editor.
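To see what the script parameter replacement does to your code, the following Python sketch performs a comparable ${name} substitution on a string. It is a simplified stand-in for the platform's behavior: render_script is a hypothetical helper, and the orders table in the sample query is made up for the example.

```python
import re


def render_script(script: str, params: dict[str, str]) -> str:
    """Replace each ${name} placeholder with its value from params."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            # Fail loudly instead of leaving an unresolved placeholder in the code.
            raise KeyError(f"no value configured for parameter {name!r}")
        return params[name]

    return re.sub(r"\$\{(\w+)\}", substitute, script)


sql = "SELECT COUNT(*) FROM orders WHERE ds = '${bizdate}';"
print(render_script(sql, {"bizdate": "20260119"}))
# SELECT COUNT(*) FROM orders WHERE ds = '20260119';
```

Raising an error for a missing value is a design choice of this sketch; it makes a forgotten debug parameter visible before the query runs.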
Workflow debugging
Workflow debugging focuses on validating data dependencies, parameter passing, and execution order across multiple nodes in an end-to-end pipeline. After you finish single-node debugging, you can debug part of the pipeline or the entire workflow.
-
On the workflow's DAG canvas, click the Run button in the top toolbar. Alternatively, to validate a partial pipeline, right-click a node and select Run to this node or Run from this node.
-
In the dialog box, assign temporary values to all node variables in the workflow, such as ${bizdate}.
-
The system executes the nodes sequentially, following the dependencies defined in the DAG. You can monitor node statuses in real time on the canvas and click any node to view its run log.
-
Click the Return button on the left side of the canvas to return to the workflow development view.
The run history on the right displays all debug run records for the workflow.
All general-purpose nodes that involve flow control, such as loops, branches, and merges, must be placed within a workflow and run with their upstream and downstream nodes to function correctly.
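When you validate a partial pipeline, it helps to reason about which nodes a partial run covers. The following Python sketch assumes that "Run from this node" covers the selected node plus everything downstream of it, and that "Run to this node" covers the selected node plus everything upstream of it. That interpretation, the helper names, and the export_report node are assumptions for illustration.

```python
def reachable(edges: dict[str, set[str]], start: str) -> set[str]:
    """Return start plus every node reachable by following edges from it."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def invert(edges: dict[str, set[str]]) -> dict[str, set[str]]:
    """Flip every upstream -> downstream edge into downstream -> upstream."""
    inverted: dict[str, set[str]] = {}
    for upstream, downstreams in edges.items():
        for downstream in downstreams:
            inverted.setdefault(downstream, set()).add(upstream)
    return inverted


# Each node maps to its direct downstream nodes.
downstream = {"start_node": {"count_orders"}, "count_orders": {"export_report"}}

run_from = reachable(downstream, "count_orders")
run_to = reachable(invert(downstream), "count_orders")
print(sorted(run_from))  # ['count_orders', 'export_report']
print(sorted(run_to))    # ['count_orders', 'start_node']
```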
Management and operations
Manage workflow nodes
To enhance data development flexibility and efficiency, DataWorks lets you import existing standalone nodes into a workflow for orchestration. You can also move nodes out of a workflow or between different workflows. These features make node reuse, workflow refactoring, and modular management simpler and more effective.
Import an existing node into a workflow
The Import Node feature lets you add existing standalone nodes from your project to the current workflow canvas for rapid reuse.
-
Double-click the target workflow to open the canvas editor.
-
In the left component panel, switch to the Import Node tab.
-
This panel lists all available standalone nodes. You can filter and search by Node Type, Path, or Node Name to quickly find the node you need.
-
After you find the node, drag it onto the canvas to complete the import.
Move a node out of a workflow
You can move a node out of a workflow to make it a standalone node or move it directly to another workflow.
-
Move out of a workflow to make it a standalone node
Use this feature when you need to decouple a node from a workflow.
-
In Workspace Directories or on the workflow canvas, right-click the target node.
-
From the context menu, select Move out of Workflow.
-
In the confirmation dialog box, select the destination path for the node and confirm.
-
-
Move to another workflow
Use this feature when you need to refactor your business process by moving a node to a different workflow.
-
In Workspace Directories or on the workflow canvas, right-click the target node.
-
From the context menu, select Move to another workflow.
-
In the list that appears, select the target workflow and confirm.
You can also move a node by dragging it directly to the target workflow.
-
Before moving a node, carefully assess the potential impact on your business processes and reconfigure any dependencies in the new location.
-
When a node is removed from or moved to a different workflow, its upstream and downstream dependencies within the original workflow are disconnected.
-
Any workflow parameters configured on the node will become invalid.
Clone a workflow
Cloning a workflow creates a new, independent, and complete copy of an existing workflow, including all its internal nodes, code, and dependencies. In Project Directory, right-click the target workflow and select Cloning.
Cloning can introduce significant risks because it duplicates nearly all configurations. Before you commit and publish a cloned workflow, perform an "environment isolation" check by verifying the following configurations:
-
Output tables and target data sources: The cloned workflow's code will write to the exact same target tables as the original. You must modify the target table names in your code (for example, from ods_user_table to dev_ods_user_table) or update the node's output configuration. This prevents multiple tasks from writing to the same table, which can cause data conflicts.
-
Upstream dependencies: Check the workflow's upstream dependencies. By default, the cloned workflow still depends on the original upstream tasks. Ensure this is the intended behavior to prevent a disorganized data pipeline or unnecessary task runs.
-
Parameter configuration: Check the custom parameters of the workflow and its internal nodes, particularly those for date partitions (such as ${bizdate}) and input/output paths, to ensure they are correct in the new environment.
You can also clone individual nodes within a workflow, but copying nodes across different workflows is not supported.
Version management
Publishing an internal node separately also creates a new version of the parent workflow.
Version management automatically records every change to a workflow, letting you view and compare historical versions and quickly revert to any previous state. This is a key feature for ensuring task stability and supporting team collaboration.
The Version panel is on the right side of the workflow canvas. This panel displays two types of records. Understanding their differences is crucial:
-
Development Records: A new record is created each time you click Save on the canvas. It acts as a snapshot during the development process to prevent accidental code loss and does not affect any tasks running in production.
-
Publishing Records: A new record is created after the workflow is successfully published to the production environment. This is the version that the scheduling system actually executes. When issues occur with production tasks, you typically use Publishing Records to perform a rollback.
Typical use cases
-
Quick failure rollback: When a production task fails due to a code change, you can use the revert feature to instantly restore the workflow to the last stable Publishing Record, minimizing downtime.
-
Change auditing and tracing: To determine when a piece of logic was modified and by whom, use the version list and compare features to trace every code change.
-
Code recovery: If you accidentally delete some logic without saving, you can revert to the most recent Development Record to recover it.
Important notes
-
Reverting overwrites the current development area: A revert operation overwrites all code and configurations on the current canvas with the content from the selected historical version. Before you revert, back up any uncommitted changes by copying them to a local text editor.
-
Republish after reverting: A revert operation only restores the code to the development environment. The restored version is not automatically published to production. You must manually publish the workflow for the rollback to take effect in the production environment.
Publishing and operations
-
Publish a node or workflow: After you finish developing your workflow, publish it to the production environment. The nodes in your development environment will generate corresponding periodic tasks in production. For details, see Publish a node or workflow.
-
Node and workflow operations: In the production environment, a workflow runs periodically according to its scheduling configuration. Go to Operation and Maintenance Center to monitor the scheduling status of the periodic workflow and perform related operational tasks. For details, see: Basic operational tasks for periodic tasks, Operational tasks for Data Backfill instances.
Task execution mechanism
In a scheduling scenario, a workflow instance succeeds or fails based on the final states of its internal nodes:
-
Failed: If any critical node fails, the entire workflow instance is typically marked as Failed.
-
Frozen/Suspended: If a node within the workflow is manually frozen or suspended and it is an upstream dependency for subsequent nodes, the pipeline is interrupted, and the entire workflow is also set to Failed.
-
merge node: If a workflow uses a merge node, it can still return Succeeded even if a node in an upstream branch fails, as long as the merge node's logic evaluates to success. Design the upstream checks for the merge node carefully to prevent silent failures.
Special scenarios:
-
If you freeze a Data Backfill instance of a workflow task, the workflow instance is set to Succeeded.
-
In a Data Backfill scenario, if the system determines that a task cannot be executed, the workflow is set to Failed.
There may be a delay between when a failure actually occurs and when the instance status is updated.
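The basic rules above can be condensed into a small status function. The following Python sketch is a simplified model under stated assumptions: it covers only the failed and frozen or suspended rules, ignores merge nodes and Data Backfill, and treats a frozen node with no downstream nodes as non-blocking. The function name and status strings are hypothetical.

```python
def workflow_status(node_states: dict[str, str],
                    downstream: dict[str, set[str]]) -> str:
    """Derive a workflow instance status from the final states of its nodes."""
    for node, state in node_states.items():
        if state == "failed":
            return "Failed"
        # A frozen or suspended node interrupts the pipeline if other nodes depend on it.
        if state in ("frozen", "suspended") and downstream.get(node):
            return "Failed"
    return "Succeeded"


edges = {"start_node": {"count_orders"}}  # start_node is upstream of count_orders

print(workflow_status({"start_node": "succeeded", "count_orders": "succeeded"}, edges))  # Succeeded
print(workflow_status({"start_node": "succeeded", "count_orders": "failed"}, edges))     # Failed
print(workflow_status({"start_node": "frozen", "count_orders": "not_run"}, edges))       # Failed
```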
Quotas and limitations
-
Node limit: A single workflow can support a maximum of 400 internal nodes. However, to ensure optimal canvas loading performance and maintainability, we recommend keeping the number of nodes under 100. For more complex scenarios, consider splitting the workflow into modular components.
-
Sub-workflow limits:
-
A workflow enabled for referencing, along with its internal nodes, cannot depend on any external task and cannot be directly depended on by any external task. Attempting to do so will cause an error on publication.
-
By default, this type of workflow does not automatically generate scheduling instances when published to the production environment. It runs as a subtask only when referenced by another workflow through a SUB_PROCESS node.
-
-
Maximum parallel instances: Periodic workflows do not support setting the maximum parallel instances at the workflow level. This setting is supported only for individual tasks within a workflow. To limit task concurrency, you can configure the Max Parallel Instances in the scheduling configuration for an individual node within the workflow. For more information, see Node scheduling configuration.
-
Unsupported node types: The EMR Spark Streaming, Flink SQL Streaming, Flink JAR Streaming, and Flink Python Streaming node types are not supported within workflows. They can only be developed and run as standalone nodes.
FAQ
-
Q: What is the difference between a workflow and a business process?
A: A workflow is a single, schedulable unit that runs as a whole. In contrast, a business process is simply a folder used to organize nodes and cannot be scheduled.
-
Q: Why does my task run successfully in debug mode but fail during scheduled runs?
A: This issue is usually caused by an environment mismatch. Check the following:
-
Resource group mismatch: Your debug runs might use a personal or development resource group, while scheduled runs are configured with a production resource group. Verify that the production resource group is valid, has sufficient capacity, and has the correct permissions.
-
Permission mismatch: The account used for production runs might lack the necessary permissions to access certain tables, functions, or resources.
-
Dependency mismatch: The dependencies in the production environment may differ from those in the development environment, or the output from an upstream dependency might not exist in production.
-
-
Q: Why is my instance stuck in a 'waiting' or 'not running' state?
A: An instance runs only when all its trigger conditions are met, including its scheduled time, the availability of scheduling resources, and the successful completion of all upstream node dependencies.
-
Upstream dependency not met: Go to the Operation Center and check the instance's Dependency view to identify which upstream task has not completed successfully.
-
Waiting for resources: The queue for the compute resource group is full, and the task is waiting for resources to become available.
-
Workflow or node is frozen: Check whether the workflow or any of its upstream nodes has been manually set to 'Frozen' or 'Suspended'.
-
Instance not generated: Confirm that the workflow's scheduled time has passed. For newly published tasks, the first instance is generated only at the next scheduling cycle. For more information, see Instance generation method: Instant generation after publishing.
-
-
Q: Why is the entire workflow marked as 'Failed' when some of its nodes succeeded?
A: The workflow's state is evaluated using the following rules:
-
Failure on the critical path: If any node with downstream dependencies fails, the entire workflow is marked as 'Failed', even if other parallel branches have completed successfully.
-
A node is frozen: If an internal node is in a 'Frozen' state, the entire workflow is marked as 'Failed'.
-