Managing complex data pipelines—where dozens of tasks must run in the right order, at the right time, and only after their inputs are ready—requires more than a job scheduler. A periodic workflow in DataWorks solves this by combining visual DAG orchestration with a dual-trigger execution model: a task instance runs only when its scheduled time arrives and all upstream dependencies succeed. This ensures your data pipelines run in a stable, orderly, and self-contained manner.
Use periodic workflows to:
Automate data processing on a fixed schedule: Run data synchronization, cleaning, or aggregation tasks at daily, hourly, or weekly intervals.
Build complex directed acyclic graph (DAG) dependency flows: Use a visual drag-and-drop canvas to connect MaxCompute SQL, Hologres, E-MapReduce (EMR), Python, and other node types with upstream and downstream dependencies.
Centrally schedule and manage multiple sub-tasks: Group logically related tasks into a single workflow that is scheduled and monitored as one unit.
Workflows are available only in the new version of DataWorks Data Studio. To check which version you're using, see FAQ.
Quick start
This section walks you through building a simple periodic workflow from scratch. The goal: automatically calculate the total number of orders from the previous day and write the result to a table every day at 00:05.
The pipeline consists of two nodes:
A virtual node (`start_node`) — the control starting point
A MaxCompute SQL node (`count_orders`) — the data processing step
Step 1: Prepare the compute engine and result table
This step sets up the infrastructure your workflow needs.
In your target workspace, bind MaxCompute as a compute engine.
In MaxCompute, create the result table:
```sql
-- Create a result table to store daily order counts
CREATE TABLE IF NOT EXISTS dw_order_count_test (
    order_date STRING,
    total_count BIGINT
)
PARTITIONED BY (ds STRING); -- Partitioned by date
```
Step 2: Create a periodic workflow
This step creates the workflow container that holds all your nodes and scheduling configuration.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the target workspace and choose Shortcuts > Data Studio in the Actions column.
> If the button shows "Data Development", it opens the old version. Do not click it.
In the left-side navigation pane, click the icon to open Workspace Directories. To the right of Workspace Directories, click the create icon and select Create Workflow.
In the New Workflow dialog box, set Scheduling Type to Periodic Scheduling, enter a name (for example, `minimal_daily_demo`), and complete the creation.
Step 3: Orchestrate the workflow
This step defines the pipeline structure using the DAG canvas.
On the workflow canvas, drag a virtual node from the left component panel and name it `start_node`.
> A virtual node defines the starting point of a workflow and performs no computation.

Drag a MaxCompute SQL node onto the canvas and name it `count_orders`.
Click the dot at the bottom of `start_node` and drag a line to the top of `count_orders` to create a dependency.
Step 4: Write the node code
This step adds the business logic that runs each day.
Enable DataWorks Copilot for intelligent code completion.
Double-click `count_orders` to open the node code editor.
Write the SQL logic. This example uses simulated data:

```sql
-- bizdate is a scheduling parameter injected at runtime.
-- $[yyyymmdd-1] resolves to the previous day's date (e.g., 20260119 when today is 20260120).
INSERT OVERWRITE TABLE dw_order_count_test PARTITION (ds='${bizdate}')
SELECT
    '${bizdate}' AS order_date,
    COUNT(*)     AS total_count
FROM (
    SELECT 1 AS id
    UNION ALL
    SELECT 2 AS id
) t; -- Simulated data
```

> For details on node development, see MaxCompute SQL node.
Click Save.
Step 5: Configure the scheduling cycle and parameters
This step tells the workflow when to run and what date value to pass to the SQL.
Return to the workflow canvas. In the right-side panel, go to Scheduling > Scheduling Time:
Set Scheduling Cycle to Daily.
Set Schedule Time to `00:05`.
In the `count_orders` node editor, go to Scheduling Configuration > Scheduling Parameters and add:

Parameter Name: `bizdate`
Parameter Value: `$[yyyymmdd-1]` (the previous day's date)
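To build intuition for how a `$[yyyymmdd-1]`-style expression resolves, the sketch below reimplements the date arithmetic in plain Python. This is only an illustration of the offset logic; the actual substitution is performed by DataWorks when each instance is generated, and `resolve_yyyymmdd` is a hypothetical helper name.

```python
from datetime import date, timedelta

def resolve_yyyymmdd(run_date: date, offset_days: int = 0) -> str:
    """Illustrative stand-in for a $[yyyymmdd±N] expression:
    the scheduled run date shifted by N days, formatted as yyyymmdd."""
    return (run_date + timedelta(days=offset_days)).strftime("%Y%m%d")

# For an instance scheduled on 2026-01-20, $[yyyymmdd-1] resolves to 20260119,
# so bizdate='20260119' and the SQL writes to partition ds='20260119'.
print(resolve_yyyymmdd(date(2026, 1, 20), -1))  # 20260119
```

Offsets also cross month and year boundaries naturally, e.g. an instance on 2026-01-01 resolves `$[yyyymmdd-1]` to 20251231.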
Step 6: Debug the node and workflow
Debugging validates your code logic before the workflow goes to production.
Debug the node:
In the node editor, click Debugging Configurations on the right side.
Under Compute engine, select the MaxCompute engine from Step 1.
Under Script Parameters, set Value Used in This Run (defaults to yesterday's date).
Click Run in the toolbar. The node runs with the parameters you specified.
Once the results look correct, click Sync to Scheduling to apply the debug configuration to the scheduling configuration.
Debug the entire workflow:
On the workflow canvas, click Run in the top toolbar.
In the dialog box, enter the Current Run Value for the workflow variables. For example, if today is 20260120, set `bizdate` to `20260119`.
Step 7: Publish to production
Publishing creates a periodic task in the production environment that runs on your defined schedule.
Return to the workflow and click the publish button in the top toolbar.
In the publishing panel, the system runs dependency and configuration checks. After the checks pass, click Start Deployment to Production Environment and select Full Deployment.
Go to the Operation Center to confirm the workflow appears in the periodic task list.
The workflow now runs every day at 00:05.
Node types
All orchestration in DataWorks is built from nodes connected on a DAG canvas. The following table summarizes all available node types and when to use each one.
| Node | Edition required | What it does | When to use it |
|---|---|---|---|
| Virtual node | All editions | A no-op placeholder that serves as the start of a workflow | Every workflow needs one as its root node |
| Branch node | Standard and later | Routes execution to different downstream paths based on upstream results | When different outcomes (e.g., "data exists" vs. "no data") require different processing |
| Merge node | Standard and later | Waits for one of several upstream branches to complete, then continues downstream | When multiple branches converge and you need only one to succeed before proceeding |
| For-each node | Standard and later | Iterates over a result set from an assignment node and runs downstream tasks once per element | When processing a dynamic list (e.g., 31 provinces, multiple tables) |
| Do-while node | Standard and later | Loops until a condition is met, then exits | When polling an external status (e.g., check every 10 minutes until a sync completes) |
| HTTP trigger | Enterprise Edition | Receives HTTP requests from external systems to trigger DataWorks tasks | When an upstream business system needs to kick off a DataWorks pipeline |
| Check node | Professional and later | Monitors external resources (OSS files, MaxCompute partitions) and triggers downstream tasks once the resources are ready | When downstream tasks depend on files or partitions produced by an external process |
| Dependency check node | Enterprise Edition | Polls cross-cycle or cross-workspace dependencies and triggers downstream tasks once they are met | When a daily task must wait for all 24 previous hourly tasks to finish |
| SUB_PROCESS node | All editions | References a reusable sub-workflow | When multiple workflows share the same processing logic (e.g., cleaning, formatting) |
For the complete node reference, see General nodes in the new Data Studio.
Design and configuration
Workflow orchestration
Simple orchestration
DataWorks lets you break down complex data pipelines—from multi-source ingestion to layered modeling—into distinct nodes connected on a visual DAG canvas. When an upstream node succeeds, it immediately triggers its downstream tasks, keeping the entire pipeline stable and orderly. The flow is static, linear, and unidirectional.
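The "upstream success unlocks downstream" rule is exactly a topological traversal of the DAG. The sketch below models it with Kahn's algorithm on a toy pipeline; the node names are hypothetical and the dict is just an adjacency list, not a DataWorks API.

```python
from collections import deque

# Toy DAG: each node maps to its downstream nodes (hypothetical names).
edges = {
    "start_node": ["clean", "sync"],
    "clean": ["aggregate"],
    "sync": ["aggregate"],
    "aggregate": [],
}

def run_order(edges):
    """Kahn's algorithm: a node becomes runnable only after all of its
    upstream nodes have finished, mirroring DAG-triggered scheduling."""
    indegree = {n: 0 for n in edges}
    for downs in edges.values():
        for d in downs:
            indegree[d] += 1
    ready = deque(n for n, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)           # "run" the node
        for d in edges[node]:        # on success, notify downstream nodes
            indegree[d] -= 1
            if indegree[d] == 0:     # all upstreams succeeded -> trigger
                ready.append(d)
    return order

print(run_order(edges))  # ['start_node', 'clean', 'sync', 'aggregate']
```

Note that `aggregate` runs only after both `clean` and `sync` complete, which is why a fan-in node never starts on a partially finished upstream set.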
Sub-workflows for logic reuse
When multiple workflows share the same processing steps—such as data cleaning, summary statistics, and quality checks—encapsulate those steps as a reusable sub-workflow using a SUB_PROCESS node. This avoids duplicating logic across workflows: when the shared steps change, you update only the sub-workflow.
To create and use a sub-workflow:
In the `child_workflow` workflow, set Properties to Referable. The workflow becomes a sub-workflow.
In the main workflow, drag a SUB_PROCESS node onto the canvas and select `child_workflow` as the referenced workflow.
Sub-workflows have the following constraints:
Internal-only dependencies: The sub-workflow and its internal nodes cannot depend on any external tasks.
Isolation: The sub-workflow cannot be a direct dependency for any external task.
Passive trigger: After publishing, no scheduling instances are generated automatically. The sub-workflow executes only when called by a SUB_PROCESS node.
> For details, see SUB_PROCESS node.
Splitting large workflows
Split workflows that contain more than 100 nodes to maintain performance and readability:
Split by business domain: Separate processing pipelines for transactions, users, products, and other business subjects into independent workflows.
Encapsulate common logic with SUB_PROCESS nodes: Wrap reusable steps (data cleaning, formatting) into referable workflows.
Scheduling dependencies
Scheduling dependencies connect isolated nodes into an ordered data pipeline. A node runs only when its scheduled time arrives and all upstream tasks complete successfully.
For the full reference, see Scheduling dependency.
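The dual-trigger rule can be stated as a single predicate: both the time condition and the dependency condition must hold. The sketch below is a minimal model of that check, assuming a simple string-based state for upstream instances; it is an illustration, not DataWorks' scheduler code.

```python
from datetime import datetime

def can_run(now: datetime, scheduled_at: datetime, upstream_states: dict) -> bool:
    """Dual-trigger rule: an instance starts only when its scheduled time
    has arrived AND every upstream instance has succeeded."""
    time_reached = now >= scheduled_at
    upstream_ok = all(state == "succeeded" for state in upstream_states.values())
    return time_reached and upstream_ok

# Scheduled time has passed, but one upstream is still running,
# so the instance must keep waiting.
print(can_run(datetime(2026, 1, 20, 0, 10),
              datetime(2026, 1, 20, 0, 5),
              {"start_node": "succeeded", "count_orders": "running"}))  # False
```

This is also why an instance can sit in a waiting state long after its scheduled time: the clock condition is satisfied, but some upstream has not yet succeeded.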
When you connect nodes on the canvas, DataWorks creates scheduling dependencies between them automatically. For more complex cross-workflow or cross-cycle dependencies, configure the following dependency types:
| Dependency type | When to use it | Example |
|---|---|---|
| Workflow-level dependency | An entire workflow must wait for another task (workflow or standalone node) to finish before starting | A sales workflow waits for a foundational data workflow to complete |
| Node-level dependency | A specific node within a workflow must wait for an external task not in the same workflow | A summary node in a reporting workflow waits for a specific node in a financial system |
| Cross-cycle dependency | A task's current cycle instance depends on its instance from a different cycle (same or different node) | Today's instance waits for yesterday's to succeed—useful for INSERT OVERWRITE on partitioned tables or cumulative calculations |
| Cross-workspace dependency | A task depends on a task in another DataWorks workspace, linked by workspace name and node output name, name, or ID | A task in a marketing workspace references key data from an accounting workspace |
Parameter design
Workspace parameters are available only in DataWorks Professional Edition and later.
DataWorks supports a four-level parameter system, ordered from narrowest to broadest scope. Use the appropriate level to avoid duplicating configuration across nodes.
| Level | Type | Scope | Use when |
|---|---|---|---|
| 1 (narrowest) | Node parameters | Single node | Injecting dynamic dates into SQL code at runtime; supports constants, built-in variables, and custom time expressions. See Sources and expressions for scheduling parameters. |
| 2 | Context parameters | Node-to-node | Passing values from an upstream node's output to a downstream node; supports constants, variables, and run results. See Configure and use node context parameters. |
| 3 | Workflow parameters | All nodes in a workflow | Managing shared business identifiers (e.g., region code, table prefix) across dozens of nodes; sub-workflows can reference the parent workflow's parameters. See Workflow parameters. |
| 4 (broadest) | Workspace parameters | All nodes in a workspace | Defining environment-specific values (e.g., database names, resource paths) that differ between development and production. See Use workspace parameters. |
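One common way to reason about such layered scopes is that a narrower scope shadows a broader one for the same key. The sketch below models that generic pattern in Python; it is an assumption for illustration only, not a statement of DataWorks' documented parameter-resolution order, and all parameter names are hypothetical.

```python
def resolve_params(workspace: dict, workflow: dict, node: dict) -> dict:
    """Generic layered lookup: later (narrower) scopes shadow earlier
    (broader) ones. Illustrative pattern, not DataWorks' actual rules."""
    merged = {}
    for scope in (workspace, workflow, node):  # broadest -> narrowest
        merged.update(scope)
    return merged

params = resolve_params(
    workspace={"db_name": "prod_db"},                  # environment-wide value
    workflow={"region_code": "cn-east"},               # shared across the workflow
    node={"bizdate": "20260119", "db_name": "dev_db"}, # node-level value wins
)
print(params["db_name"])  # dev_db
```

Keeping environment-specific values at the broadest applicable level, and overriding only where a node truly differs, avoids duplicating the same setting across dozens of nodes.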
Scheduling cycles
The key difference between a periodic workflow and the old business process is that a workflow's scheduling cycle is configured as a single unit. A business process is merely a folder for organizing tasks and cannot be scheduled as a whole.
The scheduling cycle is set at the workflow level. Nodes inside a workflow cannot have their own independent cycle—they inherit the workflow's cycle and can only specify a Delayed Execution Time as an offset from the workflow's scheduled time.
| Dimension | Workflow-level configuration | Internal node behavior |
|---|---|---|
| Time property | Absolute time (e.g., 02:00) | Relative time (a delay from the workflow's scheduled time) |
| Cycle property | Daily, hourly, minutely, weekly, monthly, or yearly | Inherits the workflow cycle; cannot be modified |
| Trigger logic | Physical time is reached + upstream tasks succeed | Physical time + delay time + upstream tasks succeed |
Debug and run
Single-node debugging
Single-node debugging validates the internal logic of one node—such as an SQL statement, Python script, or data integration task—without triggering any upstream or downstream dependencies.
In the right-side panel of the node, configure Debugging Configurations:
| Parameter | Description |
|---|---|
| Compute engine | Select a bound compute engine. If none are available, select Create Compute Engine from the drop-down list. Ensure the compute engine and Resource Group are connected. See Network connectivity solutions. |
| Resource Group | Select a Resource Group that passed connectivity tests when the compute engine was bound. Some nodes support a custom image on the Resource Group to extend the runtime environment. |
| Dataset (optional) | Shell and Python nodes support using datasets to access unstructured data in OSS. |
| Script parameters (optional) | Define variables in node code using `${parameter_name}`, then set the Parameter Name and Parameter Value here. For details, see Scheduling parameter sources and their expressions. |
| Associated role (optional) | Shell and Python nodes support configuring an associated role to access resources from other cloud products. |

Click Save, then click Run.
View run logs and results at the bottom of the node editor.
Workflow debugging
Workflow debugging validates data dependencies, parameter passing, and execution order across the entire pipeline. After debugging individual nodes, debug the full workflow—or a partial pipeline.
On the workflow's DAG canvas, click Run in the top toolbar. To validate a subset, right-click a node and select Run to This Node or Run from This Node.
In the dialog box, assign a temporary value for all node variables (such as `${bizdate}`) for this debugging session.
The system executes nodes sequentially from top to bottom, following the dependencies defined in the DAG. Monitor node statuses on the canvas in real time and click any node to view its run log.
Click Back on the left side of the canvas to return to the development state. The run history on the right shows all debugging run records.
Control flow nodes (loops, branches, merges) must be placed within a workflow and debugged together with upstream and downstream nodes to work correctly.
Management and operations
Workflow node management
DataWorks lets you import standalone nodes into a workflow, move nodes out of a workflow, or transfer nodes between workflows—without rebuilding them from scratch.
Import an existing node into a workflow
Double-click the target workflow to open its canvas editor.
In the left component panel, switch to the Import Node tab.
Filter by Node Type, Path, or Node Name to find the target node.
Drag the node onto the canvas.
Move a node out of a workflow
Convert to a standalone node: Right-click the node in the directory tree or canvas, select Move out of Workflow, choose the target path, and confirm.
Move to another workflow: Right-click the node, select Move to another workflow, select the target workflow, and confirm. You can also drag the node directly to the target workflow.
Before moving nodes, assess the impact on existing pipelines and reconfigure dependencies in the new location. Moving a node breaks its upstream and downstream dependencies in the original workflow, and any workflow parameters configured on the node become invalid.
Clone a workflow
Cloning a workflow creates a complete, independent copy including all internal nodes, code, and dependencies. In Workspace Directories, right-click the target workflow and select Clone.
Before committing and publishing a cloned workflow, review the following to prevent data conflicts and pipeline issues:
Output table and target data source: A cloned workflow writes to the same target table as the original. Update the table name in the code (for example, from `ods_user_table` to `dev_ods_user_table`) to prevent multiple tasks from writing to the same table.
Upstream dependencies: The cloned workflow still depends on the original upstream tasks by default. Confirm whether this is the intended behavior.
Parameter configuration: Check custom parameters—especially those involving date partitions (such as `${bizdate}`) and input/output paths—to make sure they are correct in the new environment.
You can also clone individual nodes within a workflow, but you cannot copy nodes across different workflows.
Version management
Publishing an internal node separately also creates a new version of the parent workflow.
Version management automatically records every change to a workflow. From the Versions panel on the right side of the workflow canvas, you can view, compare, and revert to any previous version.
The Versions panel shows two types of records:
Development Records: Created each time you click Save. This is a work-in-progress snapshot and does not affect production.
Publishing Records: Created when the workflow is successfully published to production. This is what the production scheduling system executes.
Typical use cases
| Scenario | Action |
|---|---|
| A production task fails after a code change | Use Revert to restore the workflow to the last stable Publishing Record, then republish |
| You need to trace when and by whom logic was modified | Use the version list and compare feature to view every code change |
| You accidentally delete code that hasn't been saved | Revert from the most recent Development Record |
Before reverting:
Reverting overwrites the current development area: All code and configurations on the canvas are replaced with the historical version. If you have unsaved changes, back them up locally before reverting.
Republish after reverting: Reverting only restores the development state. Manually publish the workflow for the rollback to take effect in production.
Publishing and operations
Publish: After developing a workflow, publish it to the production environment. The nodes in the development environment generate corresponding periodic tasks in production. For details, see Publish a node or workflow.
Operations: The workflow runs periodically based on its scheduling configuration. Go to the Operation Center to monitor scheduling status and manage tasks. For details, see Basic operational tasks for periodic tasks and Operational tasks for Data Backfill instances.
Workflow instance states
A workflow's overall status reflects the final state of its internal nodes. Understanding how the system determines workflow state helps you diagnose issues faster.
| State | Trigger condition | What to check |
|---|---|---|
| Running | Scheduled time reached + upstream tasks succeeded | — |
| Succeeded | All critical nodes completed successfully | — |
| Failed | A single critical node failed | Check the failed node's run log in Operation Center |
| Failed (frozen node) | An internal node is frozen or suspended and is an upstream dependency | Unfreeze the node or reconfigure the dependency |
| Succeeded (merge node exception) | A merge node evaluates to success even if some upstream branches failed | Review merge node upstream checks to prevent silent failures |
Special scenarios:
Freezing a Data Backfill instance of a workflow task sets the workflow instance to Succeeded.
In a Data Backfill scenario, if the system determines a task cannot be executed, the workflow is set to Failed.
There is a delay between when an actual failure event occurs and when the instance status is updated.
Limitations
| Item | Limit |
|---|---|
| Max nodes per workflow | 400 nodes (keep under 100 for optimal canvas performance and maintainability; split large workflows into smaller ones) |
| Parallel instances | Periodic workflows do not support a max parallel instance limit at the workflow level. Set this limit at the individual node level via Scheduling Configuration. See Node scheduling configuration. |
| Sub-workflow dependencies | A sub-workflow (with "Can Be Referenced" enabled) and its internal nodes cannot depend on any external tasks, and no external task can directly depend on it. Violating this constraint causes a publishing error. |
| Sub-workflow scheduling | After publishing, a sub-workflow does not generate scheduling instances automatically. It runs only when called by a SUB_PROCESS node. |
Frequently asked questions
What is the difference between a workflow and a business process?
A workflow is a schedulable unit—it has its own scheduling cycle and generates periodic task instances. A business process is just a folder for organizing nodes; it cannot be scheduled as a whole.
My task runs successfully in debug mode but fails in scheduled runs. What's wrong?
This is almost always an environment mismatch. Check:
Resource Group: Debug runs often use a personal Resource Group, while scheduled runs use a production Resource Group. Confirm the production Resource Group is active, has capacity, and has the correct permissions.
Permissions: The account used for production scheduled runs may lack access to certain tables, functions, or resources.
Dependencies: The production environment may have different upstream dependencies than the development environment, or the upstream output may not exist in production.
My instance is stuck in "waiting" state. What should I check?
An instance runs only when its scheduled time arrives, resources are available, and all upstream dependencies have succeeded. Check in this order:
Upstream dependency not complete: In the Operation Center, open the instance's Dependency view to see which upstream task hasn't succeeded yet.
Resource queue full: The compute engine's Resource Group queue is at capacity. The task is waiting for resources.
Workflow or node is frozen: Check if the workflow or any upstream node is manually frozen or suspended.
Instance not yet generated: Confirm the workflow's scheduled time has passed. Newly published workflows generate their first instance at the next scheduling cycle. See Instance generation method: Instant generation after publishing.
The entire workflow shows as "failed" even though some nodes succeeded. Why?
Critical path failure: If any node with downstream dependencies fails, the entire workflow is marked Failed—even if other parallel branches succeeded.
Frozen node: A frozen internal node causes the entire workflow to fail.