DataWorks: Node development

Last Updated: Jun 23, 2026

DataWorks Data Studio provides nodes for different data processing tasks: Data Integration nodes for synchronization; compute engine nodes such as MaxCompute SQL, Hologres SQL, and EMR Hive for data cleansing; and general-purpose nodes such as zero load (virtual) nodes and do-while nodes for complex logic. You combine these nodes in workflows to build data processing pipelines.

Supported node types

The following table lists the node types supported by periodic scheduling. Supported node types for manual tasks or manually triggered workflows may differ. For the most up-to-date list, refer to the UI.

Note
  • Node availability varies by edition and region. For the most accurate information, see the UI.

  • Some nodes cannot be run in a workflow. See the node details for specifics.

Node type | Node name | Description | Node code | TaskType

Data Integration

batch synchronization

Synchronizes data in recurring batches between various data sources, supporting data synchronization across multiple heterogeneous data sources in complex scenarios.

For more information about the data sources supported by batch synchronization, see Supported data sources and synchronization solutions.

23

DI

real-time synchronization

Synchronizes data changes from a source to a destination database in real time. You can synchronize a single table or an entire database to maintain data consistency.

For more information about the data sources supported by real-time synchronization, see Supported data sources and synchronization solutions.

900

RI

Notebook

Notebook

Notebook provides an interactive and flexible data processing and analysis platform. By enhancing intuitiveness, modularity, and interactive experience, it makes data processing, exploration, visualization, and model building more efficient and convenient.

1323

NOTEBOOK

MaxCompute

MaxCompute SQL

Supports periodic scheduling of MaxCompute SQL tasks. MaxCompute SQL uses SQL-like syntax and is suitable for distributed processing scenarios that involve large-scale data (TB-level) but do not require high real-time performance.

10

ODPS_SQL

SQL component

An SQL component is a reusable SQL code template with multiple input and output parameters. It can process data by filtering, joining, and aggregating data source tables to generate result tables. During data development, you can create SQL component nodes and use these predefined components to quickly build data processing pipelines, significantly improving development efficiency.

1010

COMPONENT_SQL

MaxCompute Script

Allows you to combine multiple SQL statements into a single script for unified compilation and execution. This is ideal for complex query scenarios such as nested subqueries or multi-step operations. By submitting the entire script at once and generating a unified execution plan, the job only needs to be queued and executed once, making resource utilization more efficient.

24

ODPS_SQL_SCRIPT

PyODPS 2

By integrating the MaxCompute Python SDK, you can write and edit Python code directly on PyODPS 2 nodes to conveniently perform data processing and analysis tasks in MaxCompute.

221

PY_ODPS

PyODPS 3

With PyODPS 3 nodes, you can write MaxCompute jobs directly in Python code and configure these jobs for periodic scheduling.

1221

PYODPS3

MaxCompute Spark

Supports running MaxCompute-based Spark batch jobs (cluster mode) on the DataWorks platform.

225

ODPS_SPARK

MaxCompute MR

By creating a MaxCompute MR node and submitting it for task scheduling, you can use the MapReduce Java API to write MapReduce programs for processing large-scale datasets in MaxCompute.

11

ODPS_MR

Map metadata to Hologres

When you need to accelerate queries on MaxCompute data in Hologres, you can use the MaxCompute metadata mapping feature of Data Catalog to map MaxCompute table metadata to Hologres, enabling accelerated queries on MaxCompute data through Hologres external tables.

-

-

Synchronize data to Hologres

Supports synchronizing single-table data from MaxCompute to Hologres for efficient big data analysis and real-time queries.

-

-

Hologres

Hologres SQL

Hologres SQL nodes support querying data in Hologres instances. In addition, Hologres and MaxCompute are seamlessly connected at the underlying level, allowing you to use standard PostgreSQL statements in Hologres SQL nodes to query and analyze large-scale data in MaxCompute without migrating data, delivering fast query results.

1093

HOLOGRES_SQL

Synchronize data to MaxCompute

Supports migrating single-table data from Hologres to MaxCompute.

1070

HOLOGRES_SYNC_DATA_TO_MC

One-click MaxCompute table schema synchronization

Provides a one-click table schema import feature to quickly create Hologres external tables in batches that are consistent with MaxCompute table schemas.

1094

HOLOGRES_SYNC_DDL

One-click MaxCompute data synchronization

Provides a one-click MaxCompute data synchronization node to quickly synchronize data from MaxCompute to a Hologres database.

1095

HOLOGRES_SYNC_DATA

Serverless Spark

Serverless Spark Batch

A Spark node based on Serverless Spark, suitable for large-scale data processing.

2100

SERVERLESS_SPARK_BATCH

Serverless Spark SQL

An SQL query node based on Serverless Spark that supports standard SQL syntax and provides high-performance data analysis capabilities.

2101

SERVERLESS_SPARK_SQL

Serverless Kyuubi node

Connects to Serverless Spark through the Kyuubi JDBC/ODBC interface, providing multi-tenant Spark SQL services.

2103

SERVERLESS_KYUUBI

Serverless StarRocks

Serverless StarRocks SQL

An SQL node based on EMR Serverless StarRocks that is compatible with open-source StarRocks SQL syntax, providing ultra-fast OLAP query analysis and lakehouse query analysis.

2104

SERVERLESS_STARROCKS

LLM

Large language model node

Features a built-in powerful data processing and analysis engine that intelligently performs data cleansing and mining based on your natural language instructions.

2200

LLM_NODE

Flink

Flink SQL Streaming

Supports defining real-time task processing logic using standard SQL statements. It offers ease of use, rich SQL support, powerful state management and fault tolerance, compatibility with event time and processing time, and flexible scalability. This node integrates easily with systems such as Kafka and HDFS, and provides comprehensive logging and performance monitoring tools.

2012

FLINK_SQL_STREAM

Flink SQL Batch

Allows you to define and execute data processing tasks using standard SQL statements. It is suitable for analysis and transformation of large datasets, including data cleansing and aggregation. This node supports visual configuration and provides an efficient and flexible large-scale batch data processing solution.

2011

FLINK_SQL_BATCH

EMR

EMR Hive

Allows you to use SQL-like statements to read, write, and manage large datasets, enabling efficient analysis and development of massive log data.

227

EMR_HIVE

EMR Impala

A fast, real-time interactive SQL query engine for PB-scale big data.

260

EMR_IMPALA

EMR MR

Breaks down large-scale datasets into multiple parallel Map tasks to significantly improve data processing efficiency.

230

EMR_MR

EMR Presto

A flexible, scalable distributed SQL query engine that supports interactive analysis and querying of big data using standard SQL query syntax.

259

EMR_PRESTO

EMR Shell

Allows you to write and execute custom Shell scripts for advanced features such as data processing, invoking Hadoop components, and file operations.

257

EMR_SHELL

EMR Spark

A general-purpose big data analytics engine known for its high performance, ease of use, and broad applicability. It supports complex in-memory computing and is ideal for building large-scale, low-latency data analytics applications.

228

EMR_SPARK

EMR Spark SQL

Processes structured data using a distributed SQL query engine to improve job execution efficiency.

229

EMR_SPARK_SQL

EMR Spark Streaming

Processes high-throughput real-time streaming data with fault tolerance mechanisms that can quickly recover from data stream errors.

264

EMR_SPARK_STREAMING

EMR Trino

A distributed SQL query engine suitable for interactive analysis and querying across multiple data sources.

267

EMR_TRINO

EMR Kyuubi

A distributed and multi-tenant gateway that provides SQL query services for data lake query engines such as Spark, Flink, and Trino.

268

EMR_KYUUBI

ADB

ADB for PostgreSQL

Supports the development and periodic scheduling of AnalyticDB for PostgreSQL tasks.

1000090

-

ADB for MySQL

Supports the development and periodic scheduling of AnalyticDB for MySQL tasks.

1000126

-

ADB Spark

Supports the development and periodic scheduling of AnalyticDB Spark tasks.

1990

ADB_SPARK

ADB Spark SQL

Supports the development and periodic scheduling of AnalyticDB Spark SQL tasks.

1991

ADB_SPARK_SQL

CDH

CDH Hive

For users who have deployed a CDH cluster and want to run Hive tasks through DataWorks.

270

CDH_HIVE

CDH Spark

A general-purpose big data analytics engine known for its high performance, ease of use, and broad applicability. It can be used for complex in-memory analysis and building large-scale, low-latency data analytics applications.

271

CDH_SPARK

CDH Spark SQL

Processes structured data using a distributed SQL query engine to improve job execution efficiency.

272

CDH_SPARK_SQL

CDH MR

Processes ultra-large-scale datasets.

273

CDH_MR

CDH Presto

Provides a distributed SQL query engine that further enhances the data analysis capabilities of the CDH environment.

278

CDH_PRESTO

CDH Impala

CDH Impala nodes allow you to write and execute Impala SQL scripts, providing faster query performance.

279

CDH_IMPALA

Lindorm

Lindorm Spark

Supports the development and periodic scheduling of Lindorm Spark tasks.

1800

LINDORM_SPARK

Lindorm Spark SQL

Supports the development and periodic scheduling of Lindorm Spark SQL tasks.

1801

LINDORM_SPARK_SQL

ClickHouse

ClickHouse SQL

Supports distributed SQL queries and structured data processing to improve job execution efficiency.

1301

CLICK_SQL

Data Quality

Quality monitoring

Allows you to configure data quality monitoring rules to monitor the data quality of related data source tables (for example, checking for dirty data). You can also customize scheduling policies to periodically run monitoring tasks for data validation.

1333

DATA_QUALITY_MONITOR

Data comparison

The comparison node supports multiple methods for comparing data across different tables.

1331

DATA_SYNCHRONIZATION_QUALITY_CHECK

General

Zero load node

A zero load node (also called a virtual node) is a control node that performs a dry run and generates no data. It is typically used as the root node of a workflow, which makes nodes and workflows easier to manage.

99

VIRTUAL

Assignment node

Used for parameter passing. It uses its built-in output to pass the last query or output result of the assignment node to downstream nodes through the node context feature, enabling cross-node parameter passing.

1100

CONTROLLER_ASSIGNMENT

Shell node

Shell nodes support standard Shell syntax but do not support interactive syntax.

6

DIDE_SHELL

Parameter node

Used for aggregating parameters from upstream nodes and distributing them downstream.

1115

PARAM_HUB

OSS object check

Triggers downstream node execution by monitoring OSS objects.

239

OSS_INSPECT

Python node

Supports Python 3. The node can read upstream parameters through the scheduling parameters in its schedule settings, apply custom parameters, and pass its own output to downstream nodes as parameters.

1322

PYTHON

Merge node

Used for merging the running status of upstream nodes, resolving dependency mounting and run triggering issues for nodes downstream of branch nodes.

1102

CONTROLLER_JOIN

Branch node

Used for evaluating upstream results and directing different outcomes to different branch logic. You can use it together with assignment nodes.

1101

CONTROLLER_BRANCH

for-each node

Used for iterating over the result set passed by an assignment node.

1106

CONTROLLER_TRAVERSE

Do-while node

Used for executing a subset of node logic in a loop. You can also use it together with assignment nodes to loop through the results passed by an assignment node.

1103

CONTROLLER_CYCLE

Check node

Checks whether a target object is available. When the target object meets the check policy, the node returns a success status and, if downstream dependencies exist, triggers the downstream tasks. Supported target objects:

  • MaxCompute partitioned table

  • FTP file

  • OSS file

  • HDFS

  • OSS-HDFS

241

CHECK_NODE

Function Compute

Used for periodically scheduling and processing event functions.

1330

FUNCTION_COMPUTE

HTTP trigger

If you want tasks on other scheduling systems to trigger DataWorks tasks upon completion, you can use this node.

Note

DataWorks no longer supports creating cross-tenant nodes. If you are using cross-tenant nodes, we recommend that you switch to HTTP trigger nodes, which provide the same capabilities.

1114

SCHEDULER_TRIGGER

SSH

Allows DataWorks to remotely access a host connected through a specified SSH data source and trigger script execution on the remote host.

1321

SSH

Data Push

A Data Push node can push data query results generated by other nodes in a Data Studio workflow to DingTalk groups, Lark groups, WeCom groups, Teams, and email by creating data push targets.

1332

DATA_PUSH

Database nodes

MySQL node

MySQL nodes support the development and periodic scheduling of MySQL tasks.

1000125

-

SQL Server

SQL Server nodes support the development and periodic scheduling of SQL Server tasks.

10001

-

Oracle node

Oracle nodes support the development and periodic scheduling of Oracle tasks.

10002

-

PostgreSQL node

PostgreSQL nodes support the development and periodic scheduling of PostgreSQL tasks.

10003

-

StarRocks node

Supports the development and periodic scheduling of StarRocks tasks.

10004

-

DRDS node

Supports the development and periodic scheduling of DRDS tasks.

10005

-

PolarDB MySQL node

Supports the development and periodic scheduling of PolarDB MySQL tasks.

10006

-

PolarDB PostgreSQL node

PolarDB PostgreSQL nodes support the development and periodic scheduling of PolarDB PostgreSQL tasks.

10007

-

Doris node

Doris nodes support the development and periodic scheduling of Doris tasks.

10008

-

MariaDB node

MariaDB nodes support the development and periodic scheduling of MariaDB tasks.

10009

-

SelectDB node

SelectDB nodes support the development and periodic scheduling of SelectDB tasks.

10010

-

Redshift node

Redshift nodes support the development and periodic scheduling of Redshift tasks.

10011

-

Saphana node

Saphana nodes support the development and periodic scheduling of SAP HANA tasks.

10012

-

Vertica node

Vertica nodes support the development and periodic scheduling of Vertica tasks.

10013

-

DM (Dameng) node

DM nodes support the development and periodic scheduling of DM tasks.

10014

-

KingbaseES node

KingbaseES nodes support the development and periodic scheduling of KingbaseES tasks.

10015

-

OceanBase node

OceanBase nodes support the development and periodic scheduling of OceanBase tasks.

10016

-

DB2 node

DB2 nodes support the development and periodic scheduling of DB2 tasks.

10017

-

GBase 8a node

GBase 8a nodes support the development and periodic scheduling of GBase 8a tasks.

10018

-

Algorithm

PAI Designer

PAI's visual modeling tool, Designer, for implementing end-to-end machine learning development workflows with visual modeling.

1117

PAI_STUDIO

PAI DLC

PAI's container-based training service, DLC, for distributed execution of training tasks.

1119

PAI_DLC

PAI Flow

Used for PAI knowledge base index workflows; the workflow is generated as a PAIFlow node in DataWorks.

1250

PAI_FLOW

Logic node

SUB_PROCESS node

The SUB_PROCESS node consolidates multiple workflows into a unified whole for management and scheduling.

1122

SUB_PROCESS
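
The Node code and TaskType columns are the identifiers you are most likely to need when you refer to a node type from a script. As a minimal sketch, the following Python snippet copies a few rows from the table above into a lookup dictionary. The names `NODE_TYPES` and `task_type_for` are hypothetical and are not part of any DataWorks SDK.

```python
from typing import Dict, Optional

# A few (node code -> TaskType) pairs copied from the table above.
# None marks nodes whose TaskType is listed as "-".
NODE_TYPES: Dict[int, Optional[str]] = {
    23: "DI",
    900: "RI",
    10: "ODPS_SQL",
    24: "ODPS_SQL_SCRIPT",
    1221: "PYODPS3",
    1093: "HOLOGRES_SQL",
    227: "EMR_HIVE",
    99: "VIRTUAL",
    1100: "CONTROLLER_ASSIGNMENT",
    1000125: None,  # MySQL node
}


def task_type_for(node_code: int) -> Optional[str]:
    """Return the TaskType for a node code, or None if it is unknown or unlisted."""
    return NODE_TYPES.get(node_code)


print(task_type_for(10))       # ODPS_SQL
print(task_type_for(1000125))  # None
```

Because node availability varies by edition and region, treat such a dictionary as a convenience copy and check the UI for the current list.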

Create nodes

Create nodes for scheduled workflows

If your tasks need to run automatically at specified intervals (such as hourly, daily, or weekly), you can create scheduled task nodes in the following ways: create a scheduled task node, add internal nodes to a scheduled workflow, or clone an existing node to create a new one.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left-side navigation pane, click the corresponding icon to go to the Data Studio page.

Create a scheduled task node

  1. Click the icon on the right side of the project directory, select New Node, and then select the desired node type.

    Important

    The system provides a Common Nodes list and an All Nodes list. Select All Nodes at the bottom to view all available node types. Use the search box to quickly find nodes, or use category filters (such as MaxCompute, Data Integration, and General) to locate and create the desired node.

    You can create directories in advance to organize and manage nodes.
  2. Set the node name and save it. The node editing page then appears.

Create internal nodes in a scheduled workflow

  1. Create a scheduled workflow.

  2. On the workflow canvas, click New Node in the toolbar at the top, select the desired node type based on the task you need to develop, and drag it onto the canvas.

  3. Set the node name and save it.

Create a node by cloning

Use the clone feature to create a new node from an existing one. The cloned content includes the node's Scheduling Settings (Scheduling Parameters, Scheduling Time, and Scheduling Dependency).

  1. In the left-side Project Directory, right-click the node you want to clone and select Cloning from the context menu.

  2. In the dialog, modify the node Name and Path (or keep the default values), and click Confirm to start cloning.

  3. After cloning is complete, view the newly created node in the Project Directory.

Create nodes for manually triggered workflows

If your tasks do not need to run periodically but need to be deployed to the production environment and run manually when needed, you can create internal nodes in a manually triggered workflow.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left-side navigation pane, click the corresponding icon to go to the manually triggered workflow page.

    1. Create a manually triggered workflow.

    2. On the toolbar at the top of the manually triggered workflow editing page, click New Internal Node, and select the desired node type based on the task you need to develop.

    3. Set the node name and save it.

Create manual task nodes

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left-side navigation pane, click the corresponding icon to go to the manual task page.

  3. In the lower section, click the icon on the right side of Manually Triggered Task, select New Node, and then select the desired node type.

    Note

    Manual tasks support only the following node types: Offline synchronization, Notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python, and Shell.

  4. Set the node name and save it. The node editing page then appears.
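
Before you move existing work into manual tasks, it can help to check which of your planned node types fall outside the supported list in the note above. The following Python sketch does this with a simple case-insensitive lookup. The function name `unsupported_for_manual_task` is hypothetical, and the set only mirrors the note, so the UI remains the authoritative source.

```python
from typing import List

# Node types that the note above lists as supported for manual tasks.
MANUAL_TASK_NODE_TYPES = frozenset({
    "Offline synchronization",
    "Notebook",
    "MaxCompute SQL",
    "MaxCompute Script",
    "PyODPS 2",
    "MaxCompute MR",
    "Hologres SQL",
    "Python",
    "Shell",
})


def unsupported_for_manual_task(planned: List[str]) -> List[str]:
    """Return the planned node types that manual tasks do not support.

    The comparison ignores case, so "maxcompute sql" matches "MaxCompute SQL".
    """
    supported = {name.lower() for name in MANUAL_TASK_NODE_TYPES}
    return [name for name in planned if name.lower() not in supported]


print(unsupported_for_manual_task(["MaxCompute SQL", "PyODPS 3", "EMR Hive"]))
# ['PyODPS 3', 'EMR Hive']
```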

Batch editing of nodes

When a workflow contains a large number of nodes, opening them one by one for editing is inefficient. DataWorks provides an Internal Node List feature that displays all nodes in a list on the right side of the canvas for quick preview, search, and batch editing.

Usage

  1. On the toolbar at the top of the workflow canvas, click the Show Internal Node List button to open the feature panel on the right side of the canvas.

  2. After the panel opens, all nodes in the current workflow are displayed in a list.

    • Code preview and sorting:

      • Nodes that support code editing (such as MaxCompute SQL) expand the code editor by default.

      • Nodes that do not support code editing (such as virtual nodes) are displayed as cards and are automatically arranged at the bottom of the list.

    • Quick search and navigation:

      • Search: Enter keywords in the search box at the top to perform a fuzzy search on node names.

      • Linkage: Bidirectional linkage is available between the canvas and the sidebar. Selecting a node on the canvas highlights the corresponding node in the sidebar, and vice versa.

    • Online editing:

      • Actions: The upper-right corner of each node card provides quick actions such as Load Latest Code, Open Node, and Edit.

      • Auto-save: After you enter the editing state, changes are automatically saved when the mouse focus leaves the code block area.

      • Conflict detection: If the code is updated by another user during editing, a save failure notification is triggered to prevent accidental overwrites.

    • Focus mode:

      • Select a node and click the icon in the upper-right corner of the floating window to enable Focus Mode. The sidebar displays only the currently selected node, providing more space for code editing.
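
The conflict detection described above resembles optimistic concurrency control: a save succeeds only if the stored code has not changed since the editor loaded it. The following Python sketch illustrates that general technique with a version counter. It is an assumption about how such a check can work, not the DataWorks implementation, and `StoredNode`, `SaveConflict`, and `save` are hypothetical names.

```python
from dataclasses import dataclass


class SaveConflict(Exception):
    """Raised when the stored code changed after the editor loaded it."""


@dataclass
class StoredNode:
    code: str
    version: int = 1


def save(node: StoredNode, new_code: str, loaded_version: int) -> int:
    """Save new_code only if nobody else saved since loaded_version was read."""
    if node.version != loaded_version:
        raise SaveConflict(
            f"loaded version {loaded_version}, but stored version is {node.version}"
        )
    node.code = new_code
    node.version += 1
    return node.version


node = StoredNode(code="SELECT 1;")
alice_version = node.version  # both editors load version 1
bob_version = node.version

save(node, "SELECT 2;", bob_version)  # Bob saves first; version becomes 2
try:
    save(node, "SELECT 3;", alice_version)  # Alice's copy is now stale
except SaveConflict as exc:
    print("save failed:", exc)
print(node.code)  # SELECT 2;
```

Rejecting the stale save, rather than silently overwriting, is what prevents the accidental overwrite: the second editor has to reload the latest code before saving again.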

Version management

The system supports restoring nodes to a specified historical version through version management. It also provides version viewing and comparison features to help you analyze differences and make adjustments.

  1. In the left-side Project Directory, double-click the target node name to go to the node editing page.

  2. Click Version on the right side of the node editing page. On the Version page, view and manage Developer Record and Publish Record information.

    • View a version:

      1. On the Developer Record or Publish Record tab, find the node version you want to view.

      2. Click View in the Operation column to go to the details page where you can view the node code content and Scheduling Settings information.

        Note

        Scheduling Settings information can be viewed in Script Mode or Visual Mode. You can switch between the viewing modes in the upper-right corner of the Scheduling Settings tab.

    • Compare versions:

      On the Developer Record or Publish Record tab, you can compare different versions of a node. The following example uses the Developer Record tab to demonstrate the comparison.

      • Compare within the Developer Record or the Publish Record: On the Developer Record tab, select two versions and click Select Comparison at the top to compare the node code and Scheduling Settings between the two versions.

      • Compare across the Developer Record and the Publish Record or Build Records:

        1. On the Developer Record tab, locate the desired version of the node.

        2. Click Compare in the Operation column, and on the details page, select a version from Publish Record or Build Records to compare.

    • Restore a version:

      You can only restore nodes from the Developer Record to a specified historical version. On the Developer Record tab, find the target version and click Restore in the Operation column to restore the node's code and Scheduling Settings information to the target version.
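
If you keep copies of two versions of a node's code outside Data Studio, you can reproduce a similar comparison locally. The following Python sketch uses the standard library `difflib` module to produce a unified diff. The `compare_versions` helper and the sample SQL are illustrative assumptions and are unrelated to how the Version page computes its comparison.

```python
import difflib


def compare_versions(old_code: str, new_code: str) -> str:
    """Return a unified diff between two versions of a node's code."""
    diff = difflib.unified_diff(
        old_code.splitlines(),
        new_code.splitlines(),
        fromfile="version 1",
        tofile="version 2",
        lineterm="",
    )
    return "\n".join(diff)


v1 = "SELECT id, name\nFROM users;"
v2 = "SELECT id, name\nFROM users\nWHERE active = 1;"
print(compare_versions(v1, v2))
```

Lines prefixed with `-` exist only in the older version and lines prefixed with `+` only in the newer one; identical inputs produce an empty string.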

FAQ

Can I download node code (such as SQL or Python) to my local machine?

  • Answer: Data Studio does not provide a direct download feature. You can copy the code to your local machine during development, or you can develop in the personal directory in Data Studio and then submit the code to the project directory. In the latter case, your code is saved locally.