The Data Studio service of DataWorks allows you to create various types of nodes, such as data synchronization nodes, compute engine nodes for data cleansing, and general nodes for complex logic processing, to meet different data processing requirements. Compute engine nodes include MaxCompute SQL, Hologres SQL, and E-MapReduce (EMR) Hive nodes. General nodes include zero load nodes and do-while nodes. Nodes of different types work together to address various data processing challenges.
Supported node types
The following table describes the node types supported in periodic scheduling. The node types supported by manually triggered tasks or workflows may differ.
Supported node types vary based on the DataWorks edition and the region in which DataWorks resides. You can view the supported node types in the DataWorks console.
Node type | Node name | Description | Node code | Task type (specified by TaskType) |
Notebook | This type of node provides an interactive and flexible platform for data processing and analysis. Notebooks are intuitive, modular, and interactive, and help you perform data processing, exploration, visualization, and model building in a more efficient and convenient manner. | 1323 | NOTEBOOK | |
Data Integration | This type of node is used to periodically synchronize offline data and to synchronize data between heterogeneous data sources in complex scenarios. For information about the data source types that support batch synchronization, see Supported data source types and synchronization operations. | 23 | DI2 | |
Real-time Synchronization | This type of node allows you to synchronize data changes in a source table or database to a destination table or database in real time to ensure data consistency at the source and destination. For information about the data source types that support real-time synchronization, see Supported data source types and synchronization operations. | 900 | RI | |
MaxCompute | This type of node allows you to schedule MaxCompute SQL tasks on a regular basis and integrate them with other types of nodes for joint scheduling. MaxCompute SQL tasks use an SQL-like syntax to process terabytes of data in distributed scenarios that do not require real-time processing. | 10 | ODPS_SQL | |
This type of node is used to filter source table data, join source tables, and aggregate source table data to generate a result table. A script template defines an SQL code process that includes multiple input and output parameters. You can create SQL Script Template nodes in Data Studio to build a data processing process. This helps significantly improve development efficiency. | 1010 | COMPONENT_SQL | ||
This type of node allows you to integrate multiple SQL statements into a script for compilation and execution. The script mode is suitable for processing complex queries, such as nested subqueries, or scenarios that require step-by-step operations. After you submit a script, a unified execution plan is generated. In this case, the related job needs to queue and be run only once. This helps improve resource utilization. | 24 | ODPS_SCRIPT | ||
This type of node is integrated with MaxCompute SDK for Python. You can edit Python code on PyODPS 2 nodes in the DataWorks console to process and analyze data in MaxCompute. | 221 | PY_ODPS | ||
This type of node allows you to directly write Python code for MaxCompute jobs and schedule the jobs on a regular basis (see the PyODPS sketch after this table). | 1221 | PY_ODPS3 | |
This type of node allows you to run offline Spark on MaxCompute tasks in cluster mode in DataWorks to integrate the tasks with other types of nodes for scheduling. | 225 | SPARK | ||
You can create and commit MaxCompute MR nodes that call the MapReduce Java API to write MapReduce programs and process large datasets in MaxCompute. | 11 | ODPS_MR | ||
Hologres | This type of node allows you to query data in Hologres instances. Hologres and MaxCompute are seamlessly connected at the underlying layer. This allows you to use a Hologres SQL node to query and analyze large-scale data in MaxCompute by executing standard PostgreSQL statements, without the need to migrate data. You can obtain query results in an efficient manner. | 1093 | HOLOGRES_SQL | |
One-click MaxCompute Table Schema Synchronization (Metadata Mapping Between MaxCompute and Hologres) | DataWorks provides the one-click table schema import feature that allows you to quickly create Hologres external tables that have the same schemas as MaxCompute tables. | 1094 | HOLOGRES_SYNC_DDL | |
One-click MaxCompute Data Synchronization (Data Synchronization from MaxCompute to Hologres) | DataWorks provides the one-click data synchronization feature that allows you to quickly synchronize data from MaxCompute to Hologres databases. | 1095 | HOLOGRES_SYNC_DATA | |
Flink | This type of node allows you to use standard SQL statements to define the processing logic of real-time tasks. Flink SQL Streaming nodes are easy to use, support a variety of SQL syntax, and provide powerful state management and fault tolerance capabilities. In addition, Flink SQL Streaming nodes are compatible with event time and processing time and can be flexibly expanded. Flink SQL Streaming nodes are easy to integrate with services, such as Kafka and Hadoop Distributed File System (HDFS), and provide detailed logs and performance monitoring tools. | 2012 | FLINK_SQL_STREAM | |
This type of node allows you to define and run data processing tasks by using standard SQL statements. Flink SQL Batch nodes are suitable for the analysis and transformation of large datasets, including data cleansing and aggregation. Flink SQL Batch nodes can be configured in a visualized manner to provide efficient and flexible batch processing solutions for large-scale data. | 2011 | FLINK_SQL_BATCH | ||
EMR | This type of node allows you to use SQL-like statements to read data from and write data to large datasets and manage the large datasets. This way, you can analyze and develop large amounts of log data in an efficient manner. | 227 | EMR_HIVE | |
This type of node allows you to perform fast and real-time interactive SQL queries on petabytes of data. | 260 | EMR_IMPALA | ||
This type of node allows you to process a large dataset by using multiple parallel map tasks. EMR MR nodes help significantly improve data processing efficiency. | 230 | EMR_MR | ||
Presto is a flexible and scalable distributed SQL query engine that allows you to execute standard SQL statements to perform interactive analytic queries of big data. | 259 | EMR_PRESTO | ||
This type of node allows you to specify custom Shell scripts and run the scripts to use advanced features such as data processing, Hadoop component calling, and file management. | 257 | EMR_SHELL | ||
Spark is a general-purpose big data analytics engine. Spark features high performance, ease of use, and widespread use. You can use Spark to perform complex memory computing and build large, low-latency data analysis applications. | 228 | EMR_SPARK | ||
This type of node allows you to use a distributed SQL query engine to process structured data. This improves the running efficiency of jobs. | 229 | EMR_SPARK_SQL | ||
This type of node can be used to process streaming data with high throughput. This type of node supports fault tolerance, which helps you quickly restore data streams on which errors occur. | 264 | SPARK_STREAMING | ||
Trino is a distributed SQL query engine designed to run interactive analytic queries of various data sources. | 267 | EMR_TRINO | ||
Apache Kyuubi is a distributed and multi-tenant gateway that provides query services such as SQL queries for data lake query engines. The data lake query engines include Spark, Flink, and Trino. | 268 | EMR_KYUUBI | ||
ADB | This type of node allows you to develop and periodically schedule AnalyticDB for PostgreSQL tasks and integrate AnalyticDB for PostgreSQL tasks with other types of tasks. | 1000024 | - | |
This type of node allows you to develop and periodically schedule AnalyticDB for MySQL tasks and integrate AnalyticDB for MySQL tasks with other types of tasks. | 1000036 | - | ||
This type of node allows you to develop and periodically schedule AnalyticDB Spark tasks and integrate AnalyticDB Spark tasks with other types of tasks. | 1990 | ADB Spark | ||
This type of node allows you to develop and periodically schedule AnalyticDB Spark SQL tasks and integrate AnalyticDB Spark SQL tasks with other types of tasks. | 1991 | ADB Spark SQL | ||
CDH | You can use Cloudera's Distribution Including Apache Hadoop (CDH) Hive nodes in DataWorks to run Hive tasks if you have deployed a CDH cluster. | 270 | CDH_HIVE | |
Spark is a general-purpose big data analytics engine. Spark features high performance, ease of use, and widespread use. You can use Spark to perform complex memory analysis and build large, low-latency data analysis applications. | 271 | CDH_SPARK | ||
This type of node allows you to use a distributed SQL query engine to process structured data. This improves the running efficiency of jobs. | 272 | CDH_SPARK_SQL | ||
This type of node allows you to process data in ultra-large datasets. | 273 | CDH_MR | ||
This type of node allows you to use a distributed SQL query engine to analyze real-time data. This further enhances the data analysis capabilities in the CDH environment. | 278 | CDH_PRESTO | ||
This type of node allows you to write and run Impala SQL scripts. CDH Impala nodes provide higher query performance than CDH Hive nodes. | 279 | CDH_IMPALA | ||
ClickHouse | This type of node allows you to use a distributed SQL query engine to process structured data. This improves the running efficiency of jobs. | - | - | |
General | A zero load node is a control node that supports dry-run scheduling and does not generate data. In most cases, a zero load node serves as the root node of a workflow and allows you to easily manage nodes and workflows. | 99 | VIRTUAL_NODE | |
Assignment | This type of node passes the output of the last row of its code to its descendant nodes through the outputs parameter (see the assignment node sketch after this table). | 1100 | CONTROLLER_ASSIGNMENT | |
This type of node supports the standard Shell syntax. The interactive syntax is not supported. | 6 | SHELL2 | ||
This type of node can be used to aggregate parameters of its ancestor nodes and distribute parameters to its descendant nodes. | 1115 | PARAM_HUB | ||
OSS object inspection | This type of node can be used if you want to trigger a descendant node to run after Object Storage Service (OSS) objects are generated. | 239 | OSS | |
This type of node supports the Python 3.0 syntax and allows you to configure scheduling parameters on the Properties tab to obtain parameters from its ancestor nodes and configure custom parameters. In addition, the output of this type of node can be passed to a descendant node as parameters. | 1322 | PYTHON | ||
This type of node allows you to merge the status of its ancestor nodes and prevent dry run of its descendant nodes. | 1102 | CONTROLLER_JOIN | ||
This type of node allows you to route results based on logical conditions. You can also use this type of node together with an assignment node. | 1101 | CONTROLLER_BRANCH | ||
This type of node allows you to traverse the result set of an assignment node. | 1106 | CONTROLLER_TRAVERSE | ||
This type of node allows you to execute the logic of specific nodes in loops. You can also use this type of node together with an assignment node to generate the data that is passed to a descendant node of the assignment node in loops. | 1103 | CONTROLLER_CYCLE | ||
This type of node allows you to check the availability of MaxCompute partitioned tables, FTP files, and OSS objects based on check policies. If the running of a task depends on such an object, you can configure the task as a descendant task of a Check node. When the condition that is specified in the check policy is met, the task on the Check node is successfully run and its descendant task is then triggered to run. | 241 | - | |
You can configure a data sending node and a data receiving node that belong to different tenants to run tasks across tenants. | 1089 | CROSS | |
This type of node allows you to periodically schedule and process event functions and complete integration and joint scheduling with other types of nodes. | 1330 | FUNCTION_COMPUTE | ||
This type of node can be used if you want to trigger nodes in DataWorks to run after nodes in other scheduling systems finish running. Note: DataWorks no longer allows you to create cross-tenant collaboration nodes. If you have used a cross-tenant collaboration node in your business, we recommend that you replace the cross-tenant collaboration node with an HTTP Trigger node. An HTTP Trigger node provides the same capabilities as a cross-tenant collaboration node. | 1114 | SCHEDULER_TRIGGER | |
In DataWorks, you can create an SSH node and use the SSH node based on a specific SSH data source to remotely access a host that is connected to the data source and trigger script running on the host. | 1321 | SSH | ||
MySQL | This type of node allows you to develop and periodically schedule MySQL tasks and integrate MySQL tasks with other types of tasks. | 1000039 | - | |
SQL Server | This type of node allows you to develop and periodically schedule SQL Server tasks and integrate SQL Server tasks with other types of tasks. | 10001 | - | |
Oracle | This type of node allows you to develop and periodically schedule Oracle tasks and integrate Oracle tasks with other types of tasks. | 10002 | - | |
PostgreSQL | This type of node allows you to develop and periodically schedule PostgreSQL tasks and integrate PostgreSQL tasks with other types of tasks. | 10003 | - | |
PolarDB PostgreSQL | This type of node allows you to develop and periodically schedule PolarDB for PostgreSQL tasks and integrate PolarDB for PostgreSQL tasks with other types of tasks. | 10007 | - | |
Doris | This type of node allows you to develop and periodically schedule Doris tasks and integrate Doris tasks with other types of tasks. | 10008 | - | |
MariaDB | This type of node allows you to develop and periodically schedule MariaDB tasks and integrate MariaDB tasks with other types of tasks. | 10009 | - | |
SelectDB | This type of node allows you to develop and periodically schedule SelectDB tasks and integrate SelectDB tasks with other types of tasks. | 10010 | - | |
Redshift | This type of node allows you to develop and periodically schedule Redshift tasks and integrate Redshift tasks with other types of tasks. | 10011 | - | |
SAP HANA | This type of node allows you to develop and periodically schedule SAP HANA tasks and integrate SAP HANA tasks with other types of tasks. | 10012 | - | |
Vertica | This type of node allows you to develop and periodically schedule Vertica tasks and integrate Vertica tasks with other types of tasks. | 10013 | - | |
DM | This type of node allows you to develop and periodically schedule DM tasks and integrate DM tasks with other types of tasks. | 10014 | - | |
KingbaseES | This type of node allows you to develop and periodically schedule KingbaseES tasks and integrate KingbaseES tasks with other types of tasks. | 10015 | - | |
OceanBase | This type of node allows you to develop and periodically schedule OceanBase tasks and integrate OceanBase tasks with other types of tasks. | 10016 | - | |
DB2 | This type of node allows you to develop and periodically schedule Db2 tasks and integrate Db2 tasks with other types of tasks. | 10017 | - | |
GBase 8a | This type of node allows you to develop and periodically schedule GBase 8a tasks and integrate GBase 8a tasks with other types of tasks. | 10018 | - | |
Algorithm | Machine Learning Designer is a visualized modeling tool that is provided by Platform for AI (PAI) to implement end-to-end machine learning development. | - | - | |
Deep Learning Containers (DLC) of PAI is used to run training tasks in a distributed manner. | 1119 | PAI_DLC | ||
Logical node | This type of node allows you to integrate multiple workflows as a whole for management and scheduling. | 1122 | - |
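The following is a minimal PyODPS sketch that shows how Python code in a PyODPS 2 or PyODPS 3 node can read MaxCompute data. The table name, partition column, and the bizdate scheduling parameter are hypothetical; the sketch assumes that, as documented for PyODPS nodes, DataWorks provides the MaxCompute entry object o and the scheduling-parameter dictionary args, so no additional setup code is required.

```python
# Minimal PyODPS sketch for a PyODPS node (table and parameter names are hypothetical).
# In a PyODPS node, DataWorks injects the MaxCompute entry object `o` and the
# dict `args`, which contains the scheduling parameters configured for the node.
bizdate = args['bizdate']                # value of the hypothetical bizdate scheduling parameter, e.g. 20250101
table = o.get_table('ods_sales')         # hypothetical MaxCompute table in the current project
with table.open_reader(partition='ds=' + bizdate) as reader:
    # Print the row count of the daily partition as a simple data check.
    print('rows in partition ds=%s: %d' % (bizdate, reader.count))
```

Because the sketch uses only the PyODPS API and the print function, it applies to both PyODPS 2 and PyODPS 3 nodes.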
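The following is a minimal sketch of Python code for an assignment node. The values are hypothetical; the sketch assumes that, as described in the table above, only the output of the last row of the code is written to the outputs parameter of the assignment node.

```python
# Minimal assignment node sketch in Python. The region values below are hypothetical.
# Only the output of the last row of code is passed to descendant nodes
# through the outputs parameter of the assignment node.
regions = ['cn-hangzhou', 'cn-shanghai', 'cn-beijing']
print(','.join(regions))  # this last output becomes the value of the outputs parameter
```

A descendant node, such as a for-each node, can then reference the outputs parameter of the assignment node to traverse the comma-separated values.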
Create an auto triggered node
If your task needs to be automatically run on a regular basis within a specified period of time, you can create an auto triggered node or create a node in an auto triggered workflow. For example, your task can be scheduled to run by hour, day, or week.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and go to Data Studio from the Actions column.
In the left-side navigation pane of the Data Studio page, click the icon that opens the DATA STUDIO pane.
Create an auto triggered node
Directly create an auto triggered node
Select a node type.
Click the + icon to the right of the Workspace Directories section in the DATA STUDIO pane, select Create Node, and then select a desired node type.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
Note: The first time you perform operations in the Workspace Directories section of the DATA STUDIO pane, you can directly click Create Node to create a node.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create an auto triggered node in a directory
Create a directory.
Click the + icon to the right of the Workspace Directories section in the DATA STUDIO pane and select Create Directory. In the Create Directory dialog box, specify a directory name and click OK.
Select a node type.
Right-click the name of the created directory, select Create Node, and then select a node type.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create a node in an auto triggered workflow
Create an auto triggered workflow.
Select a node type.
On the left side of the configuration tab of the workflow, select a node type based on the type of task that you want to develop, and drag the node type to the canvas on the right.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
In the Create Node dialog box, specify a node name and click OK.
Create a manually triggered node
If your task does not need to be run on a regular basis, but needs to be deployed to the production environment for running, you can create a manually triggered node or create a node in a manually triggered workflow.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and go to Data Studio from the Actions column.
In the left-side navigation pane of the Data Studio page, click the icon that opens the MANUALLY TRIGGERED OBJECTS pane.
Create a manually triggered node
Directly create a manually triggered node
Select a node type.
Click the + icon to the right of the Manually Triggered Tasks section in the MANUALLY TRIGGERED OBJECTS pane, select Create Node, and then select a desired node type.
Note: You can create manually triggered nodes of only the following types: batch synchronization, notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python, and Shell. For more information about nodes, see Supported node types.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create a manually triggered node in a directory
Create a directory.
Click the + icon to the right of the Manually Triggered Tasks section in the MANUALLY TRIGGERED OBJECTS pane and select Create Directory. In the Create Directory dialog box, specify a directory name and click OK.
Select a node type.
Right-click the name of the created directory, select Create Node, and then select a node type.
Note: You can create manually triggered nodes of only the following types: batch synchronization, notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python, and Shell. For more information about nodes, see Supported node types.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create a node in a manually triggered workflow
Create a manually triggered workflow.
Select a node type.
In the top toolbar of the configuration tab of the created manually triggered workflow, click Create Internal Node. In the popover that appears, select a node type based on the type of task that you want to develop.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
Specify a node name and press Enter.
References
For more information about node development in an auto triggered workflow or a manually triggered workflow, see Auto triggered workflow or Manually triggered workflow.
After you create and develop a node, you can deploy the node to the production environment. For more information, see Scheduling configuration and Node or workflow deployment.