
DataWorks:Workflow

Last Updated: Mar 26, 2026

Managing complex data pipelines—where dozens of tasks must run in the right order, at the right time, and only after their inputs are ready—requires more than a job scheduler. A periodic workflow in DataWorks solves this by combining visual DAG orchestration with a dual-trigger execution model: a task instance runs only when its scheduled time arrives and all upstream dependencies succeed. This ensures your data pipelines run in a stable, orderly, and self-contained manner.

Use periodic workflows to:

  • Automate data processing on a fixed schedule: Run data synchronization, cleaning, or aggregation tasks at daily, hourly, or weekly intervals.

  • Build complex directed acyclic graph (DAG) dependency flows: Use a visual drag-and-drop canvas to connect MaxCompute SQL, Hologres, E-MapReduce (EMR), Python, and other node types with upstream and downstream dependencies.

  • Centrally schedule and manage multiple sub-tasks: Group logically related tasks into a single workflow that schedules and monitors as one unit.

Important

Workflows are available only in the new version of DataWorks Data Studio. To check which version you're using, see FAQ.

Quick start

This section walks you through building a simple periodic workflow from scratch. The goal: automatically calculate the total number of orders from the previous day and write the result to a table every day at 00:05.

The pipeline consists of two nodes:

  • A virtual node (start_node) — the control starting point

  • A MaxCompute SQL node (count_orders) — the data processing step

Step 1: Prepare the compute engine and result table

This step sets up the infrastructure your workflow needs.

  1. In your target workspace, bind MaxCompute as a compute engine.

  2. In MaxCompute, create the result table:

    -- Create a result table to store daily order counts
    CREATE TABLE IF NOT EXISTS dw_order_count_test (
        order_date STRING,
        total_count BIGINT
    )
    PARTITIONED BY (ds STRING); -- Partitioned by date

Step 2: Create a periodic workflow

This step creates the workflow container that holds all your nodes and scheduling configuration.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the target workspace and choose Shortcuts > Data Studio in the Actions column. Note: If the button is labeled "Data Development", it opens the old version; do not click it.

  2. In the left-side navigation pane, open Workspace Directories. To its right, click the create icon and choose Create Workflow.

  3. In the New Workflow dialog box, set Scheduling Type to Periodic Scheduling, enter a name (for example, minimal_daily_demo), and complete the creation.

Step 3: Orchestrate the workflow

This step defines the pipeline structure using the DAG canvas.

  1. On the workflow canvas, drag a virtual node from the left component panel and name it start_node. Note: A virtual node defines the starting point of a workflow and performs no computation.

  2. Drag a MaxCompute SQL node onto the canvas and name it count_orders.

  3. Click the dot at the bottom of start_node and drag a line to the top of count_orders to create a dependency.

Step 4: Write the node code

This step adds the business logic that runs each day.

Important

Enable DataWorks Copilot for intelligent code completion.

  1. Double-click count_orders to open the node code editor.

  2. Write the SQL logic. This example uses simulated data:

    -- bizdate is a scheduling parameter injected at runtime.
    -- $[yyyymmdd-1] resolves to the previous day's date (e.g., 20260119 when today is 20260120).
    INSERT OVERWRITE TABLE dw_order_count_test PARTITION (ds='${bizdate}')
    SELECT
        '${bizdate}' AS order_date,
        COUNT(*) AS total_count
    FROM (SELECT 1 AS id UNION ALL SELECT 2 AS id) t; -- Simulated data

    Note: For details on node development, see MaxCompute SQL node.

  3. Click Save.

Step 5: Configure the scheduling cycle and parameters

This step tells the workflow when to run and what date value to pass to the SQL.

  1. Return to the workflow canvas. In the right-side panel, go to Scheduling > Scheduling Time:

    • Set Scheduling Cycle to Daily.

    • Set Schedule Time to 00:05.

  2. In the count_orders node editor, go to Scheduling Configuration > Scheduling Parameters and add:

    • Parameter Name: bizdate

    • Parameter Value: $[yyyymmdd-1] (the previous day's date)
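
The `$[yyyymmdd-1]` expression is resolved by the scheduler before the node's SQL runs. As a rough illustration of the underlying date arithmetic (this is a Python sketch, not the actual DataWorks resolver, and `resolve_bizdate` is a hypothetical helper name):

```python
from datetime import date, timedelta

def resolve_bizdate(scheduled_date: date, offset_days: int = -1) -> str:
    """Illustrative stand-in for $[yyyymmdd-1]: the scheduled date
    shifted by offset_days, formatted as yyyymmdd."""
    return (scheduled_date + timedelta(days=offset_days)).strftime("%Y%m%d")

# A run scheduled on 2026-01-20 receives bizdate = 20260119
print(resolve_bizdate(date(2026, 1, 20)))  # 20260119
```

Every occurrence of ${bizdate} in the node code is replaced with the resolved string before execution.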

Step 6: Debug the node and workflow

Debugging validates your code logic before the workflow goes to production.

  1. Debug the node:

    1. In the node editor, click Debugging Configurations on the right side.

      • Under Compute engine, select the MaxCompute engine from Step 1.

      • Under Script Parameters, set Value Used in This Run (defaults to yesterday's date).

    2. Click Run in the toolbar. The node runs with the parameters you specified.

    3. Once the results look correct, click Sync to Scheduling to apply the debug configuration to the scheduling configuration.

  2. Debug the entire workflow:

    1. On the workflow canvas, click Run in the top toolbar.

    2. In the dialog box, enter the Current Run Value for the workflow variables. For example, if today is 20260120, set bizdate to 20260119.

Step 7: Publish to production

Publishing creates a periodic task in the production environment that runs on your defined schedule.

  1. Return to the workflow and click the Publish button in the top toolbar.

  2. In the publishing panel, the system runs dependency and configuration checks. After the checks pass, click Start Deployment to Production Environment and select Full Deployment.

  3. Go to the Operation Center to confirm the workflow appears in the periodic task list.

The workflow now runs every day at 00:05.

Node types

All orchestration in DataWorks is built from nodes connected on a DAG canvas. The following table summarizes all available node types and when to use each one.

| Node | Edition required | What it does | When to use it |
| --- | --- | --- | --- |
| Virtual node | All editions | A no-op placeholder that serves as the start of a workflow | Every workflow needs one as its root node |
| Branch node | Standard and later | Routes execution to different downstream paths based on upstream results | When different outcomes (e.g., "data exists" vs. "no data") require different processing |
| Merge node | Standard and later | Waits for one of several upstream branches to complete, then continues downstream | When multiple branches converge and you need only one to succeed before proceeding |
| For-each node | Standard and later | Iterates over a result set from an assignment node and runs downstream tasks once per element | When processing a dynamic list (e.g., 31 provinces, multiple tables) |
| Do-while node | Standard and later | Loops until a condition is met, then exits | When polling an external status (e.g., check every 10 minutes until a sync completes) |
| HTTP trigger | Enterprise Edition | Receives HTTP requests from external systems to trigger DataWorks tasks | When an upstream business system needs to kick off a DataWorks pipeline |
| Check node | Professional and later | Monitors external resources (OSS files, MaxCompute partitions) and triggers downstream tasks once the resources are ready | When downstream tasks depend on files or partitions produced by an external process |
| Dependency check node | Enterprise Edition | Polls cross-cycle or cross-workspace dependencies and triggers downstream tasks once they are met | When a daily task must wait for all 24 previous hourly tasks to finish |
| SUB_PROCESS node | All editions | References a reusable sub-workflow | When multiple workflows share the same processing logic (e.g., cleaning, formatting) |

For the complete node reference, see General nodes in the new Data Studio.

Design and configuration

Workflow orchestration

Simple orchestration

DataWorks lets you break down complex data pipelines—from multi-source ingestion to layered modeling—into distinct nodes connected on a visual DAG canvas. When an upstream node succeeds, it immediately triggers its downstream tasks, keeping the entire pipeline stable and orderly. The flow is static and unidirectional: the graph is fixed at design time and contains no cycles.

Sub-workflows for logic reuse

When multiple workflows share the same processing steps—such as data cleaning, summary statistics, and quality checks—encapsulate those steps as a reusable sub-workflow using a SUB_PROCESS node. This avoids duplicating logic across workflows: when the shared steps change, you update only the sub-workflow.

To create and use a sub-workflow:

  1. In the workflow that you want to reuse (child_workflow in this example), set Properties to Referable. The workflow becomes a sub-workflow.

  2. In the main workflow, drag a SUB_PROCESS node and select child_workflow as the referenced workflow.

Sub-workflows have the following constraints:

  • Internal-only dependencies: The sub-workflow and its internal nodes cannot depend on any external tasks.

  • Isolation: The sub-workflow cannot be a direct dependency for any external task.

  • Passive trigger: After publishing, no scheduling instances are generated automatically. The sub-workflow executes only when called by a SUB_PROCESS node. For details, see SUB_PROCESS node.


Splitting large workflows

Split workflows that contain more than 100 nodes to maintain performance and readability:

  • Split by business domain: Separate processing pipelines for transactions, users, products, and other business subjects into independent workflows.

  • Encapsulate common logic with SUB_PROCESS nodes: Wrap reusable steps (data cleaning, formatting) into referable workflows.

Scheduling dependencies

Scheduling dependencies connect isolated nodes into an ordered data pipeline. A node runs only when its scheduled time arrives and all upstream tasks complete successfully.

For the full reference, see Scheduling dependency.

When you connect nodes on the canvas, DataWorks creates scheduling dependencies between them automatically. For more complex cross-workflow or cross-cycle dependencies, configure the following dependency types:

| Dependency type | When to use it | Example |
| --- | --- | --- |
| Workflow-level dependency | An entire workflow must wait for another task (workflow or standalone node) to finish before starting | A sales workflow waits for a foundational data workflow to complete |
| Node-level dependency | A specific node within a workflow must wait for an external task that is not in the same workflow | A summary node in a reporting workflow waits for a specific node in a financial system |
| Cross-cycle dependency | A task's current cycle instance depends on its instance from a different cycle (same or different node) | Today's instance waits for yesterday's to succeed—useful for INSERT OVERWRITE on partitioned tables or cumulative calculations |
| Cross-workspace dependency | A task depends on a task in another DataWorks workspace, identified by workspace name and the node's output name or ID | A task in a marketing workspace references key data from an accounting workspace |

Parameter design

Workspace parameters are available only in DataWorks Professional Edition and later.

DataWorks supports a four-level parameter system, ordered from narrowest to broadest scope. Use the appropriate level to avoid duplicating configuration across nodes.

| Level | Type | Scope | Use when |
| --- | --- | --- | --- |
| 1 (narrowest) | Node parameters | Single node | Injecting dynamic dates into SQL code at runtime; supports constants, built-in variables, and custom time expressions. See Sources and expressions for scheduling parameters. |
| 2 | Context parameters | Node-to-node | Passing values from an upstream node's output to a downstream node; supports constants, variables, and run results. See Configure and use node context parameters. |
| 3 | Workflow parameters | All nodes in a workflow | Managing shared business identifiers (e.g., region code, table prefix) across dozens of nodes; sub-workflows can reference the parent workflow's parameters. See Workflow parameters. |
| 4 (broadest) | Workspace parameters | All nodes in a workspace | Defining environment-specific values (e.g., database names, resource paths) that differ between development and production. See Use workspace parameters. |

Scheduling cycles

Important

The key difference between a periodic workflow and the old business process is that a workflow's scheduling cycle is configured as a single unit. A business process is merely a folder for organizing tasks and cannot be scheduled as a whole.

The scheduling cycle is set at the workflow level. Nodes inside a workflow cannot have their own independent cycle—they inherit the workflow's cycle and can only specify a Delayed Execution Time as an offset from the workflow's scheduled time.

| Dimension | Workflow-level configuration | Internal node behavior |
| --- | --- | --- |
| Time property | Absolute time (e.g., 02:00) | Relative time (a delay from the workflow's scheduled time) |
| Cycle property | Daily, hourly, minutely, weekly, monthly, or yearly | Inherits the workflow cycle; cannot be modified |
| Trigger logic | Physical time is reached + upstream tasks succeed | Physical time + delay time + upstream tasks succeed |
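
The trigger logic described above can be restated as a simple predicate. The sketch below is purely illustrative (the names `node_can_start`, `delay`, and `upstream_states` are assumptions for this example, not DataWorks APIs): a node instance starts only when physical time has passed the workflow's scheduled time plus the node's delay, and every upstream task has succeeded.

```python
from datetime import datetime, timedelta

def node_can_start(now: datetime,
                   workflow_scheduled_time: datetime,
                   delay: timedelta,
                   upstream_states: list[str]) -> bool:
    """Illustrative dual-trigger check: time condition AND dependency condition."""
    time_reached = now >= workflow_scheduled_time + delay
    upstream_ok = all(state == "SUCCEEDED" for state in upstream_states)
    return time_reached and upstream_ok

# Workflow scheduled at 00:05, node delayed by 10 minutes: at 00:12 the node
# is not runnable even though all upstream tasks have already succeeded.
t0 = datetime(2026, 1, 20, 0, 5)
print(node_can_start(datetime(2026, 1, 20, 0, 12), t0,
                     timedelta(minutes=10), ["SUCCEEDED"]))  # False
```

Note that both conditions are necessary: satisfying the time condition alone (or the dependency condition alone) is not enough for the instance to start.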

Debug and run

Single-node debugging

Single-node debugging validates the internal logic of one node—such as an SQL statement, Python script, or data integration task—without triggering any upstream or downstream dependencies.

  1. In the right-side panel of the node, configure Debugging Configurations:

    | Parameter | Description |
    | --- | --- |
    | Compute engine | Select a bound compute engine. If none are available, select Create Compute Engine from the drop-down list. Ensure the compute engine and Resource Group are connected. See Network connectivity solutions. |
    | Resource Group | Select a Resource Group that passed connectivity tests when the compute engine was bound. Some nodes support a custom image on the Resource Group to extend the runtime environment. |
    | Dataset (optional) | Shell and Python nodes support using datasets to access unstructured data in OSS. |
    | Script parameters (optional) | Define variables in node code using ${parameter_name}, then set the Parameter Name and Parameter Value here. For details, see Scheduling parameter sources and their expressions. |
    | Associated role (optional) | Shell and Python nodes support configuring an associated role to access resources from other cloud products. |
  2. Click Save, then click Run.

  3. View run logs and results at the bottom of the node editor.

Workflow debugging

Workflow debugging validates data dependencies, parameter passing, and execution order across the entire pipeline. After debugging individual nodes, debug the full workflow—or a partial pipeline.

  1. On the workflow's DAG canvas, click Run in the top toolbar. To validate a subset, right-click a node and select Run to This Node or Run from This Node.

  2. In the dialog box, assign a temporary value for all node variables (such as ${bizdate}) for this debugging session.

  3. The system executes nodes sequentially from top to bottom, following the dependencies defined in the DAG. Monitor node statuses on the canvas in real time and click any node to view its run log.

  4. Click Back on the left side of the canvas to return to the development state. The run history on the right shows all debugging run records.

Important

Control flow nodes (loops, branches, merges) must be placed within a workflow and debugged together with upstream and downstream nodes to work correctly.

Management and operations

Workflow node management

DataWorks lets you import standalone nodes into a workflow, move nodes out of a workflow, or transfer nodes between workflows—without rebuilding them from scratch.

Import an existing node into a workflow

  1. Double-click the target workflow to open its canvas editor.

  2. In the left component panel, switch to the Import Node tab.

  3. Filter by Node Type, Path, or Node Name to find the target node.

  4. Drag the node onto the canvas.

Move a node out of a workflow

  • Convert to a standalone node: Right-click the node in the directory tree or canvas, select Move out of Workflow, choose the target path, and confirm.

  • Move to another workflow: Right-click the node, select Move to another workflow, select the target workflow, and confirm. You can also drag the node directly to the target workflow.

Important

Before moving nodes, assess the impact on existing pipelines and reconfigure dependencies in the new location. Moving a node breaks its upstream and downstream dependencies in the original workflow, and any workflow parameters configured on the node become invalid.

Clone a workflow

Cloning a workflow creates a complete, independent copy including all internal nodes, code, and dependencies. In Workspace Directories, right-click the target workflow and select Clone.

Before committing and publishing a cloned workflow, review the following to prevent data conflicts and pipeline issues:

  1. Output table and target data source: A cloned workflow writes to the same target table as the original. Update the table name in the code (for example, from ods_user_table to dev_ods_user_table) to prevent multiple tasks from writing to the same table.

  2. Upstream dependencies: The cloned workflow still depends on the original upstream tasks by default. Confirm whether this is the intended behavior.

  3. Parameter configuration: Check custom parameters—especially those involving date partitions (such as ${bizdate}) and input/output paths—to make sure they are correct in the new environment.

You can also clone individual nodes within a workflow, but you cannot copy nodes across different workflows.

Version management

Important

Publishing an internal node separately also creates a new version of the parent workflow.

Version management automatically records every change to a workflow. From the Versions panel on the right side of the workflow canvas, you can view, compare, and revert to any previous version.

The Versions panel shows two types of records:

  • Development Records: Created each time you click Save. This is a work-in-progress snapshot and does not affect production.

  • Publishing Records: Created when the workflow is successfully published to production. This is what the production scheduling system executes.

Typical use cases

| Scenario | Action |
| --- | --- |
| A production task fails after a code change | Use Revert to restore the workflow to the last stable Publishing Record, then republish |
| You need to trace when and by whom logic was modified | Use the version list and compare feature to view every code change |
| You accidentally delete code that hasn't been saved | Revert from the most recent Development Record |

Before reverting:

  • Reverting overwrites the current development area: All code and configurations on the canvas are replaced with the historical version. If you have unsaved changes, back them up locally before reverting.

  • Republish after reverting: Reverting only restores the development state. Manually publish the workflow for the rollback to take effect in production.

Publishing and operations

Workflow instance states

A workflow's overall status reflects the final state of its internal nodes. Understanding how the system determines workflow state helps you diagnose issues faster.

| State | Trigger condition | What to check |
| --- | --- | --- |
| Running | Scheduled time reached + upstream tasks succeeded | |
| Succeeded | All critical nodes completed successfully | |
| Failed | A single critical node failed | Check the failed node's run log in Operation Center |
| Failed (frozen node) | An internal node is frozen or suspended and is an upstream dependency | Unfreeze the node or reconfigure the dependency |
| Succeeded (merge node exception) | A merge node evaluates to success even if some upstream branches failed | Review merge node upstream checks to prevent silent failures |
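
The first three states above can be read as a simple aggregation over the states of the workflow's critical nodes. The sketch below illustrates that reading only; it is not the actual DataWorks state machine, and it deliberately ignores the frozen-node and merge-node special cases:

```python
def workflow_state(critical_node_states: list[str]) -> str:
    """Aggregate critical-node states into an overall workflow state
    (illustrative only; ignores frozen nodes and merge-node exceptions)."""
    if any(s == "FAILED" for s in critical_node_states):
        return "FAILED"     # a single critical node failing fails the workflow
    if all(s == "SUCCEEDED" for s in critical_node_states):
        return "SUCCEEDED"  # all critical nodes completed successfully
    return "RUNNING"        # otherwise work is still in progress

print(workflow_state(["SUCCEEDED", "FAILED", "RUNNING"]))  # FAILED
```

This is why a workflow can show as Failed even when parallel branches succeeded: a single critical-node failure dominates the aggregate.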

Special scenarios:

  • Freezing a Data Backfill instance of a workflow task sets the workflow instance to Succeeded.

  • In a Data Backfill scenario, if the system determines a task cannot be executed, the workflow is set to Failed.

Important

There is a delay between when an actual failure event occurs and when the instance status is updated.


Limitations

| Item | Limit |
| --- | --- |
| Max nodes per workflow | 400 nodes (keep under 100 for optimal canvas performance and maintainability; split large workflows into smaller ones) |
| Parallel instances | Periodic workflows do not support a max parallel instance limit at the workflow level. Set this limit at the individual node level via Scheduling Configuration. See Node scheduling configuration. |
| Sub-workflow dependencies | A sub-workflow (with "Can Be Referenced" enabled) and its internal nodes cannot depend on any external tasks, and no external task can directly depend on it. Violating this constraint causes a publishing error. |
| Sub-workflow scheduling | After publishing, a sub-workflow does not generate scheduling instances automatically. It runs only when called by a SUB_PROCESS node. |

Frequently asked questions

What is the difference between a workflow and a business process?

A workflow is a schedulable unit—it has its own scheduling cycle and generates periodic task instances. A business process is just a folder for organizing nodes; it cannot be scheduled as a whole.

My task runs successfully in debug mode but fails in scheduled runs. What's wrong?

This is almost always an environment mismatch. Check:

  • Resource Group: Debug runs often use a personal Resource Group, while scheduled runs use a production Resource Group. Confirm the production Resource Group is active, has capacity, and has the correct permissions.

  • Permissions: The account used for production scheduled runs may lack access to certain tables, functions, or resources.

  • Dependencies: The production environment may have different upstream dependencies than the development environment, or the upstream output may not exist in production.

My instance is stuck in "waiting" state. What should I check?

An instance runs only when its scheduled time arrives, resources are available, and all upstream dependencies have succeeded. Check in this order:

  1. Upstream dependency not complete: In the Operation Center, open the instance's Dependency view to see which upstream task hasn't succeeded yet.

  2. Resource queue full: The compute engine's Resource Group queue is at capacity. The task is waiting for resources.

  3. Workflow or node is frozen: Check if the workflow or any upstream node is manually frozen or suspended.

  4. Instance not yet generated: Confirm the workflow's scheduled time has passed. Newly published workflows generate their first instance at the next scheduling cycle. See Instance generation method: Instant generation after publishing.

The entire workflow shows as "failed" even though some nodes succeeded. Why?

  • Critical path failure: If any node with downstream dependencies fails, the entire workflow is marked Failed—even if other parallel branches succeeded.

  • Frozen node: A frozen internal node causes the entire workflow to fail.