
DataWorks: DataWorks nodes

Last Updated: Jul 18, 2024

The DataStudio service of DataWorks allows you to create various types of nodes, such as Data Integration nodes used for data synchronization, compute engine nodes used for data cleansing, and general nodes used together with compute engine nodes to process complex logic. Compute engine nodes include ODPS SQL nodes, Hologres SQL nodes, and E-MapReduce (EMR) Hive nodes. General nodes include zero load nodes that can be used to manage multiple other nodes and do-while nodes that can run node code in loops. You can combine these types of nodes in your business to meet your different data processing requirements.

Node types supported by DataStudio

The following table describes the node types that are supported by DataStudio.

| Node type | Description |
| --- | --- |
| Data synchronization nodes | DataWorks Data Integration supports data synchronization in complex network environments. You can create a batch synchronization node to periodically synchronize offline data or create a real-time synchronization node to synchronize incremental data from a single table or database in real time. You can create this type of node on the DataStudio page. |
| Compute engine nodes | DataWorks encapsulates the capabilities of compute engines. You can create and configure nodes of a compute engine type to develop data, and the system can periodically schedule the different types of nodes in DataWorks without the need to use complex command lines. You can create nodes of the following compute engine types: MaxCompute, Hologres, EMR, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, MySQL, ClickHouse, Cloudera's Distribution including Apache Hadoop (CDH), and Platform for AI (PAI). |
| General nodes | You can use general nodes together with nodes of a specific compute engine type in DataWorks to process complex logic. General nodes include do-while nodes that are used to run node code in loops, for-each nodes that are used to traverse the outputs of assignment nodes in loops and judge the outputs, and branch nodes. |

Note

This topic describes the node code that corresponds to a node type. This code is used when you call an API operation to perform node-related operations, such as obtaining node information. You can also call the ListFileType operation to query the node code.
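When you call node-related API operations, the numeric node code identifies the node type. As a minimal sketch, the codes listed in the tables of this topic can be kept in a small lookup table in your own tooling; the key names below are illustrative labels chosen for this example, not official DataWorks identifiers.

```python
# Lookup table of node codes taken from the tables in this topic.
# The keys are illustrative labels for this sketch, not official
# DataWorks identifiers; the values are the documented node codes.
NODE_CODES = {
    "shell": 6,
    "odps_sql": 10,
    "odps_mr": 11,
    "batch_sync": 24,
    "zero_load": 99,
    "pyodps_2": 221,
    "emr_hive": 227,
    "emr_spark": 228,
    "realtime_sync": 900,
    "hologres_sql": 1093,
    "branch": 1101,
    "do_while": 1103,
    "for_each": 1106,
}

def node_code(label: str) -> int:
    """Return the numeric node code for an illustrative label."""
    return NODE_CODES[label.lower()]
```

Such a table is only a convenience for readability; the ListFileType operation remains the authoritative way to query node codes.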

Data synchronization nodes

Data synchronization nodes are used to synchronize data. The following table describes different types of data synchronization nodes.

| Type | Description | Node code |
| --- | --- | --- |
| Batch synchronization node | This type of node is used to periodically synchronize offline data and to synchronize data between heterogeneous data sources in complex scenarios. For information about the data source types that support batch synchronization, see Supported data source types and read and write operations. | 24 |
| Real-time synchronization node | This type of node is used to synchronize incremental data in real time. A real-time synchronization node uses three basic plug-ins to read, convert, and write data. These plug-ins interact with each other based on an intermediate data format that is defined by the plug-ins. For information about the data source types that support real-time synchronization, see Supported data source types and read and write operations. | 900 |

Note

In addition to the nodes that can be created on the DataStudio page, DataWorks also allows you to create different types of synchronization tasks in Data Integration. For example, you can create a synchronization task in Data Integration that synchronizes full data at a time and then incremental data in real time or a synchronization task that synchronizes all data in a database in offline mode. For more information, see Overview of the full and incremental synchronization feature. In most cases, the node code of a task that is created in Data Integration is 24.

Compute engine nodes

In a workflow, you can create nodes of a specific compute engine type, use the nodes to develop data, and commit the code to the corresponding compute engine for execution. The following table describes the different types of compute engine nodes.

Note

Before you create compute engine nodes, you must activate the corresponding services and add data sources of the required compute engine types to your DataWorks workspace. DataWorks accesses the data of these compute engines and performs development operations based on the added data sources. For information about how to add a data source, see Add and manage data sources.

| Compute engine integrated with DataWorks | Encapsulated engine capability | Node code |
| --- | --- | --- |
| MaxCompute | Develop a MaxCompute SQL task | 10 |
| MaxCompute | Develop a MaxCompute Spark task | 225 |
| MaxCompute | Develop a PyODPS 2 node | 221 |
| MaxCompute | Develop a PyODPS 3 node | 1221 |
| MaxCompute | Develop a MaxCompute script task | 24 |
| MaxCompute | Develop a MaxCompute MR task | 11 |
| MaxCompute | Reference a script template | 1010 |
| EMR | Create an EMR Hive node | 227 |
| EMR | Create an EMR MR node | 230 |
| EMR | Create an EMR Spark SQL node | 229 |
| EMR | Create an EMR Spark node | 228 |
| EMR | Create an EMR Shell node | 257 |
| EMR | Create an EMR Presto node | 259 |
| EMR | Create an EMR Spark Streaming node | 264 |
| EMR | Create an EMR Kyuubi node | 268 |
| EMR | Create an EMR Trino node | 267 |
| CDH | Create a CDH Hive node | 270 |
| CDH | Create a CDH Spark node | 271 |
| CDH | Create a CDH MR node | 273 |
| CDH | Create a CDH Presto node | 278 |
| CDH | Create a CDH Impala node | 279 |
| AnalyticDB for PostgreSQL | Create and use AnalyticDB for PostgreSQL nodes | - |
| AnalyticDB for MySQL | Create an AnalyticDB for MySQL node | - |
| Hologres | Create a Hologres SQL node | 1093 |
| Hologres | Create a node to synchronize schemas of MaxCompute tables with a few clicks | 1094 |
| Hologres | Create a node to synchronize MaxCompute data with a few clicks | - |
| ClickHouse | Configure a ClickHouse SQL node | - |
| StarRocks | Configure a StarRocks node | 10004 |
| PAI | Create and use a PAI Studio node | - |
| PAI | Use DataWorks tasks to schedule pipelines in Machine Learning Designer | - |
| PAI | Create and use a PAI DLC node | - |
| Database | Create and use a MySQL node | 1000039 |
| Database | Configure an SQL Server node | 10001 |
| Database | Configure an Oracle node | 10002 |
| Database | Configure a PostgreSQL node | 10003 |
| Database | Configure a DRDS node | 10005 |
| Database | Configure a PolarDB for MySQL node | 10006 |
| Database | Configure a PolarDB for PostgreSQL node | 10007 |
| Database | Configure a Doris node | 10008 |
| Database | Configure a MariaDB node | 10009 |
| Database | Configure a Redshift node | 10011 |
| Database | Configure a SAP HANA node | - |
| Database | Configure a Vertica node | 10013 |
| Database | Configure a DM node | 10014 |
| Database | Configure a KingbaseES node | 10015 |
| Database | Configure an OceanBase node | 10016 |
| Database | Configure a Db2 node | 10017 |
| Database | Configure a GBase 8a node | - |
| Others | Data Lake Analytics node | 1000023 |

General nodes

In a workflow, you can create a general node and use the node together with compute engine nodes to process complex logic. The following table describes the different types of general nodes.

| Scenario | Node type | Node code | Description |
| --- | --- | --- | --- |
| Business management | Zero load node | 99 | A zero load node is a control node that supports dry-run scheduling and does not generate data. In most cases, a zero load node serves as the root node of a workflow and allows you to easily manage nodes and workflows. |
| Event-based trigger | HTTP Trigger node | 1114 | You can use this type of node if you want to trigger nodes in DataWorks to run after nodes in other scheduling systems finish running. |
| Event-based trigger | OSS object inspection node | 239 | You can use this type of node if you want to trigger a descendant node to run after Object Storage Service (OSS) objects are generated. |
| Event-based trigger | FTP Check node | 1320 | You can use this type of node if you want to trigger a descendant node to run after File Transfer Protocol (FTP) files are generated. |
| Event-based trigger | Check node | 241 | You can use this type of node to check the availability of MaxCompute partitioned tables, FTP files, and OSS objects based on check policies. If the running of a task depends on an object, you can use a Check node to check the availability of the object and configure the task as a descendant task of the Check node. When the condition that is specified in the check policy is met, the task on the Check node is successfully run and its descendant task is triggered to run. |
| Parameter value assignment and parameter passing | Assignment node | 1100 | You can use this type of node if you want to use the outputs parameter of an assignment node to pass the data from the output of the last row of the code for the assignment node to its descendant nodes. |
| Parameter value assignment and parameter passing | Parameter node | 1115 | You can use this type of node to aggregate parameters of its ancestor nodes and distribute parameters to its descendant nodes. |
| Control | For-each node | 1106 | You can use this type of node to traverse the result set of an assignment node. |
| Control | Do-while node | 1103 | You can use this type of node to execute the logic of specific nodes in loops. You can also use this type of node together with an assignment node to generate the data that is passed to a descendant node of the assignment node in loops. |
| Control | Branch node | 1101 | You can use this type of node to route results based on logical conditions. You can also use this type of node together with an assignment node. |
| Control | Merge node | 1102 | You can use this type of node to merge the status of its ancestor nodes and prevent dry run of its descendant nodes. |
| Others | Shell node | 6 | Shell nodes support the standard Shell syntax. The interactive syntax is not supported. |
| Others | Function Compute node | 1330 | You can use this type of node to periodically schedule and process event functions and complete integration and joint scheduling with other types of nodes. |