
DataWorks: Node development

Last Updated: Apr 22, 2025

The Data Studio service of DataWorks allows you to create various types of nodes, such as data synchronization nodes, nodes of compute engine types for data cleansing, and general nodes for complex logic processing, to meet your different data processing requirements. Nodes of compute engine types include MaxCompute SQL, Hologres SQL, and E-MapReduce (EMR) Hive nodes. General nodes include zero load nodes and do-while nodes. Nodes of different types work together to effectively address various data processing challenges.

Supported node types

The following tables describe the node types that are supported for periodic scheduling, grouped by node type. The node types supported by manually triggered tasks or workflows may differ.

Note

Supported node types vary based on the DataWorks edition and the region in which DataWorks resides. You can view the node types that are available to you in the DataWorks console.

Notebook

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| Notebook | Provides an interactive, flexible platform for data processing and analysis, with strong support for modularity and visualization. It helps you perform data processing, exploration, visualization, and model building in an efficient and convenient manner. | 1323 | NOTEBOOK |

Data Integration

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| Batch Synchronization | Periodically synchronizes offline data, including between heterogeneous data sources in complex scenarios. For the data source types that support batch synchronization, see Supported data source types and synchronization operations. | 23 | DI2 |
| Real-time Synchronization | Synchronizes data changes in a source table or database to a destination table or database in real time to keep the source and destination consistent. For the data source types that support real-time synchronization, see Supported data source types and synchronization operations. | 900 | RI |

MaxCompute

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| MaxCompute SQL | Schedules MaxCompute SQL tasks on a regular basis and can be scheduled jointly with other node types. MaxCompute SQL uses an SQL-like syntax to process terabytes of data in distributed scenarios that do not require real-time processing. | 10 | ODPS_SQL |
| SQL Script Template | Filters, joins, and aggregates source table data to generate a result table. A script template defines an SQL code process with multiple input and output parameters. You can create SQL Script Template nodes in Data Studio to build a data processing process, which significantly improves development efficiency. | 1010 | COMPONENT_SQL |
| MaxCompute Script | Integrates multiple SQL statements into a single script for compilation and execution. Script mode suits complex queries, such as nested subqueries, and scenarios that require step-by-step operations. After you submit a script, a unified execution plan is generated, so the job queues and runs only once, which improves resource utilization. | 24 | ODPS_SCRIPT |
| PyODPS 2 | Integrates MaxCompute SDK for Python. You can edit Python code on PyODPS 2 nodes in the DataWorks console to process and analyze data in MaxCompute (see the sketch after this table). | 221 | PY_ODPS |
| PyODPS 3 | Lets you write Python code for MaxCompute jobs and schedule those jobs on a regular basis. | 1221 | PY_ODPS3 |
| MaxCompute Spark | Runs Spark on MaxCompute tasks offline in cluster mode in DataWorks and integrates the tasks with other node types for scheduling. | 225 | SPARK |
| MaxCompute MR | Calls the MapReduce Java API to write MapReduce programs that process large datasets in MaxCompute. | 11 | ODPS_MR |
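
The PyODPS 2 and PyODPS 3 rows above describe nodes that embed Python code against MaxCompute. The following is a minimal sketch of what such a node body can look like. It relies on the MaxCompute entry point `o` that DataWorks preconfigures in PyODPS nodes; the table name `demo_sales` is a hypothetical placeholder.

```python
# Minimal PyODPS sketch (illustrative only). In a DataWorks PyODPS
# node, the MaxCompute entry point `o` is preconfigured, so no
# explicit connection setup is needed. `demo_sales` is hypothetical.
sql = 'SELECT region, COUNT(*) AS cnt FROM demo_sales GROUP BY region'

# Run the statement on MaxCompute and iterate over the result set.
with o.execute_sql(sql).open_reader() as reader:
    for record in reader:
        print(record['region'], record['cnt'])
```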

Hologres

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| Hologres SQL | Queries data in Hologres instances. Hologres and MaxCompute are seamlessly connected at the underlying layer, so a Hologres SQL node can query and analyze large-scale MaxCompute data by executing standard PostgreSQL statements, without the need to migrate data, and return results efficiently (see the sketch after this table). | 1093 | HOLOGRES_SQL |
| One-click MaxCompute Table Schema Synchronization (Metadata Mapping Between MaxCompute and Hologres) | Provides the one-click table schema import feature, which quickly creates Hologres external tables that have the same schemas as MaxCompute tables. | 1094 | HOLOGRES_SYNC_DDL |
| One-click MaxCompute Data Synchronization (Data Synchronization from MaxCompute to Hologres) | Provides the one-click data synchronization feature, which quickly synchronizes data from MaxCompute to Hologres databases. | 1095 | HOLOGRES_SYNC_DATA |
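
Because a Hologres SQL node executes standard PostgreSQL statements, the same SQL can also be run from any PostgreSQL-compatible client outside DataWorks. The following Python sketch uses psycopg2 purely as an illustration; the host, database, credentials, and table are all hypothetical placeholders.

```python
import psycopg2  # Hologres speaks the PostgreSQL wire protocol

# Illustrative sketch only; all connection details are hypothetical
# placeholders, not real endpoints or credentials.
conn = psycopg2.connect(
    host='demo-instance.hologres.example.com',
    port=80,
    dbname='demo_db',
    user='demo_user',
    password='demo_password',
)
try:
    with conn, conn.cursor() as cur:
        # The same standard SQL that a Hologres SQL node would execute.
        cur.execute('SELECT region, COUNT(*) FROM demo_orders GROUP BY region')
        for region, cnt in cur.fetchall():
            print(region, cnt)
finally:
    conn.close()
```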

Flink

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| Flink SQL Streaming | Defines the processing logic of real-time tasks by using standard SQL statements. These nodes are easy to use, support rich SQL syntax, and provide powerful state management and fault tolerance. They support both event time and processing time, scale flexibly, integrate easily with services such as Kafka and Hadoop Distributed File System (HDFS), and provide detailed logs and performance monitoring tools. | 2012 | FLINK_SQL_STREAM |
| Flink SQL Batch | Defines and runs data processing tasks by using standard SQL statements. These nodes suit the analysis and transformation of large datasets, including data cleansing and aggregation, and can be configured in a visualized manner to provide efficient, flexible batch processing for large-scale data. | 2011 | FLINK_SQL_BATCH |

EMR

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| EMR Hive | Uses SQL-like statements to read data from, write data to, and manage large datasets, so you can analyze and develop large amounts of log data efficiently. | 227 | EMR_HIVE |
| EMR Impala | Performs fast, real-time interactive SQL queries on petabytes of data. | 260 | EMR_IMPALA |
| EMR MR | Processes a large dataset by using multiple parallel map tasks, which significantly improves data processing efficiency. | 230 | EMR_MR |
| EMR Presto | Presto is a flexible, scalable distributed SQL query engine that executes standard SQL statements for interactive analytic queries of big data. | 259 | EMR_PRESTO |
| EMR Shell | Runs custom Shell scripts to use advanced features such as data processing, Hadoop component calls, and file management. | 257 | EMR_SHELL |
| EMR Spark | Spark is a general-purpose big data analytics engine known for high performance, ease of use, and wide adoption. You can use Spark to perform complex in-memory computing and build large, low-latency data analysis applications (see the sketch after this table). | 228 | EMR_SPARK |
| EMR Spark SQL | Uses a distributed SQL query engine to process structured data, which improves job efficiency. | 229 | EMR_SPARK_SQL |
| EMR Spark Streaming | Processes streaming data with high throughput and supports fault tolerance, which helps you quickly recover data streams on which errors occur. | 264 | SPARK_STREAMING |
| EMR Trino | Trino is a distributed SQL query engine designed for interactive analytic queries across various data sources. | 267 | EMR_TRINO |
| EMR Kyuubi | Apache Kyuubi is a distributed, multi-tenant gateway that provides query services, such as SQL queries, for data lake query engines such as Spark, Flink, and Trino. | 268 | EMR_KYUUBI |
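
To make the EMR Spark row above concrete, the following is a minimal PySpark sketch of the kind of job such a node might run. The application name and input path are hypothetical; an EMR Spark node submits the job to the cluster rather than running it locally.

```python
from pyspark.sql import SparkSession

# Minimal PySpark sketch (illustrative only); names and paths are
# hypothetical placeholders.
spark = SparkSession.builder.appName('demo_wordcount').getOrCreate()

# Classic word count: an in-memory transformation pipeline.
lines = spark.read.text('hdfs:///tmp/demo_input.txt')
counts = (
    lines.rdd.flatMap(lambda row: row.value.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, cnt in counts.take(10):
    print(word, cnt)

spark.stop()
```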

ADB

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| ADB for PostgreSQL | Develops and periodically schedules AnalyticDB for PostgreSQL tasks and integrates them with other types of tasks. | 1000024 | - |
| ADB for MySQL | Develops and periodically schedules AnalyticDB for MySQL tasks and integrates them with other types of tasks. | 1000036 | - |
| ADB Spark | Develops and periodically schedules AnalyticDB Spark tasks and integrates them with other types of tasks. | 1990 | ADB Spark |
| ADB Spark SQL | Develops and periodically schedules AnalyticDB Spark SQL tasks and integrates them with other types of tasks. | 1991 | ADB Spark SQL |

CDH

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| CDH Hive | Runs Hive tasks in DataWorks if you have deployed a Cloudera's Distribution Including Apache Hadoop (CDH) cluster. | 270 | CDH_HIVE |
| CDH Spark | Spark is a general-purpose big data analytics engine known for high performance, ease of use, and wide adoption. You can use Spark to perform complex in-memory analysis and build large, low-latency data analysis applications. | 271 | CDH_SPARK |
| CDH Spark SQL | Uses a distributed SQL query engine to process structured data, which improves job efficiency. | 272 | CDH_SPARK_SQL |
| CDH MR | Processes data in ultra-large datasets. | 273 | CDH_MR |
| CDH Presto | Uses a distributed SQL query engine to analyze real-time data, which further enhances data analysis capabilities in the CDH environment. | 278 | CDH_PRESTO |
| CDH Impala | Writes and runs Impala SQL scripts. CDH Impala nodes provide higher query performance than CDH Hive nodes. | 279 | CDH_IMPALA |

ClickHouse

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| ClickHouse SQL | Uses a distributed SQL query engine to process structured data, which improves job efficiency. | - | - |

General

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| Zero load node | A control node that supports dry-run scheduling and does not generate data. In most cases, a zero load node serves as the root node of a workflow and makes nodes and workflows easier to manage. | 99 | VIRTUAL_NODE |
| Assignment | Passes the output of the last row of its code to its descendant nodes through its outputs parameter. | 1100 | CONTROLLER_ASSIGNMENT |
| Shell | Supports the standard Shell syntax. The interactive syntax is not supported. | 6 | SHELL2 |
| Parameter node | Aggregates parameters of its ancestor nodes and distributes parameters to its descendant nodes. | 1115 | PARAM_HUB |
| OSS object inspection | Triggers a descendant node to run after Object Storage Service (OSS) objects are generated. | 239 | OSS |
| Python | Supports the Python 3 syntax. You can configure scheduling parameters on the Properties tab to obtain parameters from ancestor nodes and define custom parameters, and the output of this node can be passed to a descendant node as parameters (see the sketch after this table). | 1322 | PYTHON |
| Merge node | Merges the status of its ancestor nodes and prevents dry runs of its descendant nodes. | 1102 | CONTROLLER_JOIN |
| Branch node | Routes results based on logical conditions. You can use a branch node together with an assignment node. | 1101 | CONTROLLER_BRANCH |
| for-each node | Traverses the result set of an assignment node. | 1106 | CONTROLLER_TRAVERSE |
| Do-while node | Executes the logic of specific nodes in loops. You can use a do-while node together with an assignment node to loop over the data that the assignment node passes to its descendant nodes. | 1103 | CONTROLLER_CYCLE |
| Check node | Checks the availability of an object based on a check policy. If the running of a task depends on an object, configure the task as a descendant of a Check node; when the condition specified in the check policy is met, the Check node task succeeds and its descendant task is triggered to run. Supported objects: MaxCompute partitioned tables, FTP files, OSS objects, HDFS, and OSS-HDFS. | 241 | - |
| Cross-tenant collaboration node | Pairs a node for data sending with a node for data reception in a different tenant to run tasks across tenants. | 1089 | CROSS |
| Function Compute node | Periodically schedules and processes event functions and integrates with other node types for joint scheduling. | 1330 | FUNCTION_COMPUTE |
| HTTP Trigger node | Triggers nodes in DataWorks to run after nodes in other scheduling systems finish running. Note: DataWorks no longer allows you to create cross-tenant collaboration nodes. If your business uses one, we recommend that you replace it with an HTTP Trigger node, which provides the same capabilities. | 1114 | SCHEDULER_TRIGGER |
| SSH node | Uses an SSH data source to remotely access the host that is connected to the data source and trigger script runs on that host. | 1321 | SSH |
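
As a rough sketch of a Python node body (see the Python row above): the snippet assumes that scheduling parameters configured on the Properties tab reach the script as positional command-line arguments, which is a common pattern for script nodes but is an assumption here; check the Python node documentation for the exact mechanism.

```python
import sys

# Illustrative Python node sketch (Python 3). Assumes scheduling
# parameters arrive as positional command-line arguments; the
# parameter `bizdate` is a hypothetical example.
bizdate = sys.argv[1] if len(sys.argv) > 1 else 'unknown'

# Do some processing, then print a result. As noted in the table,
# the node's output can be passed to a descendant node as parameters.
print(f'processed partition: {bizdate}')
```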

Database

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| MySQL | Develops and periodically schedules MySQL tasks and integrates them with other types of tasks. | 1000039 | - |
| SQL Server | Develops and periodically schedules SQL Server tasks and integrates them with other types of tasks. | 10001 | - |
| Oracle | Develops and periodically schedules Oracle tasks and integrates them with other types of tasks. | 10002 | - |
| PostgreSQL | Develops and periodically schedules PostgreSQL tasks and integrates them with other types of tasks. | 10003 | - |
| PolarDB PostgreSQL | Develops and periodically schedules PolarDB for PostgreSQL tasks and integrates them with other types of tasks. | 10007 | - |
| Doris | Develops and periodically schedules Doris tasks and integrates them with other types of tasks. | 10008 | - |
| MariaDB | Develops and periodically schedules MariaDB tasks and integrates them with other types of tasks. | 10009 | - |
| SelectDB | Develops and periodically schedules SelectDB tasks and integrates them with other types of tasks. | 10010 | - |
| Redshift | Develops and periodically schedules Redshift tasks and integrates them with other types of tasks. | 10011 | - |
| SAP HANA | Develops and periodically schedules SAP HANA tasks and integrates them with other types of tasks. | 10012 | - |
| Vertica | Develops and periodically schedules Vertica tasks and integrates them with other types of tasks. | 10013 | - |
| DM | Develops and periodically schedules DM tasks and integrates them with other types of tasks. | 10014 | - |
| KingbaseES | Develops and periodically schedules KingbaseES tasks and integrates them with other types of tasks. | 10015 | - |
| OceanBase | Develops and periodically schedules OceanBase tasks and integrates them with other types of tasks. | 10016 | - |
| DB2 | Develops and periodically schedules Db2 tasks and integrates them with other types of tasks. | 10017 | - |
| GBase 8a | Develops and periodically schedules GBase 8a tasks and integrates them with other types of tasks. | 10018 | - |

Algorithm

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| PAI Designer | Machine Learning Designer is a visualized modeling tool provided by Platform for AI (PAI) for end-to-end machine learning development. | - | - |
| PAI DLC | Deep Learning Containers (DLC) of PAI runs training tasks in a distributed manner. | 1119 | PAI_DLC |

Logical node

| Node name | Description | Node code | Task type (specified by TaskType) |
| --- | --- | --- | --- |
| SUB_PROCESS node | Integrates multiple workflows as a whole for management and scheduling. | 1122 | - |

Create an auto triggered node

If your task needs to run automatically on a recurring schedule within a specified period of time, for example by hour, day, or week, you can create an auto triggered node or create a node in an auto triggered workflow.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left-side navigation pane of the Data Studio page, click the icon that opens the DATA STUDIO pane.

    • Create an auto triggered node

      Directly create an auto triggered node

      1. Select a node type.

        Click the icon to the right of the Workspace Directories section in the DATA STUDIO pane, select Create Node, and then select a desired node type.

        DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.

        Note

        The first time you perform operations in the Workspace Directories section of the DATA STUDIO pane, you can directly click Create Node to create a node.

      2. Create a node.

        In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.

      Create an auto triggered node in a directory

      1. Create a directory.

        Click the icon to the right of the Workspace Directories section in the DATA STUDIO pane and select Create Directory. In the Create Directory dialog box, specify a directory name and click OK.

      2. Select a node type.

        Right-click the name of the created directory, select Create Node, and then select a node type.

        DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.

      3. Create a node.

        In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.

    • Create a node in an auto triggered workflow

      1. Create an auto triggered workflow.

      2. Select a node type.

        On the left side of the configuration tab of the workflow, select a node type based on the type of task that you want to develop, and drag the node type to the canvas on the right.

        DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.

      3. In the Create Node dialog box, specify a node name and click OK.

Create a manually triggered node

If your task does not need to run on a regular basis but must be deployed to the production environment for running, you can create a manually triggered node or create a node in a manually triggered workflow.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left-side navigation pane of the Data Studio page, click the icon that opens the MANUALLY TRIGGERED OBJECTS pane.

    • Create a manually triggered node

      Directly create a manually triggered node

      1. Select a node type.

        Click the icon to the right of the Manually Triggered Tasks section in the MANUALLY TRIGGERED OBJECTS pane, select Create Node, and then select a desired node type.

        Note

        You can create only the following types of manually triggered nodes: Batch Synchronization, Notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python, and Shell. For more information about nodes, see Supported node types.

      2. Create a node.

        In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.

      Create a manually triggered node in a directory

      1. Create a directory.

        Click the icon to the right of the Manually Triggered Tasks section in the MANUALLY TRIGGERED OBJECTS pane and select Create Directory. In the Create Directory dialog box, specify a directory name and click OK.

      2. Select a node type.

        Right-click the name of the created directory, select Create Node, and then select a node type.

        Note

        You can create only the following types of manually triggered nodes: Batch Synchronization, Notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python, and Shell. For more information about nodes, see Supported node types.

      3. Create a node.

        In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.

    • Create a node in a manually triggered workflow

      1. Create a manually triggered workflow.

      2. Select a node type.

        In the top toolbar of the configuration tab of the created manually triggered workflow, click Create Internal Node. In the popover that appears, select a node type based on the type of task that you want to develop.

        DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.

      3. Specify a node name and press Enter.
