The Data Studio service of DataWorks allows you to create various types of nodes, such as data synchronization nodes, compute engine nodes for data cleansing, and general nodes for complex logic processing, to meet different data processing requirements. Compute engine nodes include MaxCompute SQL, Hologres SQL, and E-MapReduce (EMR) Hive nodes. General nodes include zero load nodes and do-while nodes. Nodes of different types work together to address various data processing challenges.
Supported node types
The following table describes the node types supported in periodic scheduling. The node types supported by manually triggered tasks or workflows may differ.
Supported node types vary based on the DataWorks edition and the region in which DataWorks resides. You can view the supported node types in the DataWorks console.
Node type | Node name | Description | Node code | Task type (specified by TaskType) |
Notebook | This type of node provides an interactive and flexible platform for data processing and analysis. Notebooks are intuitive, modular, and interactive, and help you perform data processing, exploration, visualization, and model building in a more efficient and convenient manner. | 1323 | NOTEBOOK | |
Data Integration | This type of node is used to periodically synchronize offline data and to synchronize data between heterogeneous data sources in complex scenarios. For information about the data source types that support batch synchronization, see Supported data source types and synchronization operations. | 23 | DI2 | |
Real-time Synchronization | This type of node allows you to synchronize data changes in a source table or database to a destination table or database in real time to ensure data consistency at the source and destination. For information about the data source types that support real-time synchronization, see Supported data source types and synchronization operations. | 900 | RI | |
MaxCompute | This type of node allows you to schedule MaxCompute SQL tasks on a regular basis and integrate them with other types of nodes for joint scheduling. MaxCompute SQL tasks use an SQL-like syntax to process terabytes of data in distributed scenarios that do not require real-time processing. | 10 | ODPS_SQL | |
This type of node is used to filter source table data, join source tables, and aggregate source table data to generate a result table. A script template defines an SQL code process that includes multiple input and output parameters. You can create SQL Script Template nodes in Data Studio to build a data processing process. This helps significantly improve development efficiency. | 1010 | COMPONENT_SQL | ||
This type of node allows you to integrate multiple SQL statements into a script for compilation and execution. The script mode is suitable for processing complex queries, such as nested subqueries, or scenarios that require step-by-step operations. After you submit a script, a unified execution plan is generated. In this case, the related job needs to queue and be run only once. This helps improve resource utilization. | 24 | ODPS_SCRIPT | ||
This type of node is integrated with MaxCompute SDK for Python. You can edit Python code on PyODPS 2 nodes in the DataWorks console to process and analyze data in MaxCompute. | 221 | PY_ODPS | ||
This type of node allows you to directly write Python code for MaxCompute jobs and schedule the jobs on a regular basis (see the PyODPS sketch after this table). | 1221 | PY_ODPS3 | |
This type of node allows you to run offline Spark on MaxCompute tasks in cluster mode in DataWorks to integrate the tasks with other types of nodes for scheduling. | 225 | SPARK | ||
You can create and commit MaxCompute MR nodes that call the MapReduce Java API to write MapReduce programs and process large datasets in MaxCompute. | 11 | ODPS_MR | ||
Hologres | This type of node allows you to query data in Hologres instances. Hologres and MaxCompute are seamlessly connected at the underlying layer. This allows you to use a Hologres SQL node to query and analyze large-scale data in MaxCompute by executing standard PostgreSQL statements, without the need to migrate data. You can obtain query results in an efficient manner. | 1093 | HOLOGRES_SQL | |
One-click MaxCompute Table Schema Synchronization (Metadata Mapping Between MaxCompute and Hologres) | DataWorks provides the one-click table schema import feature that allows you to quickly create Hologres external tables that have the same schemas as MaxCompute tables. | 1094 | HOLOGRES_SYNC_DDL | |
One-click MaxCompute Data Synchronization (Data Synchronization from MaxCompute to Hologres) | DataWorks provides the one-click data synchronization feature that allows you to quickly synchronize data from MaxCompute to Hologres databases. | 1095 | HOLOGRES_SYNC_DATA | |
Flink | This type of node allows you to use standard SQL statements to define the processing logic of real-time tasks. Flink SQL Streaming nodes are easy to use, support a variety of SQL syntax, and provide powerful state management and fault tolerance capabilities. In addition, Flink SQL Streaming nodes are compatible with event time and processing time and can be flexibly expanded. Flink SQL Streaming nodes are easy to integrate with services, such as Kafka and Hadoop Distributed File System (HDFS), and provide detailed logs and performance monitoring tools. | 2012 | FLINK_SQL_STREAM | |
This type of node allows you to define and run data processing tasks by using standard SQL statements. Flink SQL Batch nodes are suitable for the analysis and transformation of large datasets, including data cleansing and aggregation. Flink SQL Batch nodes can be configured in a visualized manner to provide efficient and flexible batch processing solutions for large-scale data. | 2011 | FLINK_SQL_BATCH | ||
EMR | This type of node allows you to use SQL-like statements to read data from and write data to large datasets and manage the large datasets. This way, you can analyze and develop large amounts of log data in an efficient manner. | 227 | EMR_HIVE | |
This type of node allows you to perform fast and real-time interactive SQL queries on petabytes of data. | 260 | EMR_IMPALA | ||
This type of node allows you to process a large dataset by using multiple parallel map tasks. EMR MR nodes help significantly improve data processing efficiency. | 230 | EMR_MR | ||
Presto is a flexible and scalable distributed SQL query engine that allows you to execute standard SQL statements to perform interactive analytic queries of big data. | 259 | EMR_PRESTO | ||
This type of node allows you to specify custom Shell scripts and run the scripts to use advanced features such as data processing, Hadoop component calling, and file management. | 257 | EMR_SHELL | ||
Spark is a general-purpose big data analytics engine. Spark features high performance, ease of use, and widespread use. You can use Spark to perform complex memory computing and build large, low-latency data analysis applications. | 228 | EMR_SPARK | ||
This type of node allows you to use a distributed SQL query engine to process structured data. This improves the running efficiency of jobs. | 229 | EMR_SPARK_SQL | ||
This type of node can be used to process streaming data with high throughput. This type of node supports fault tolerance, which helps you quickly restore data streams on which errors occur. | 264 | SPARK_STREAMING | ||
Trino is a distributed SQL query engine designed to run interactive analytic queries of various data sources. | 267 | EMR_TRINO | ||
Apache Kyuubi is a distributed and multi-tenant gateway that provides query services such as SQL queries for data lake query engines. The data lake query engines include Spark, Flink, and Trino. | 268 | EMR_KYUUBI | ||
ADB | This type of node allows you to develop and periodically schedule AnalyticDB for PostgreSQL tasks and integrate AnalyticDB for PostgreSQL tasks with other types of tasks. | 1000024 | - | |
This type of node allows you to develop and periodically schedule AnalyticDB for MySQL tasks and integrate AnalyticDB for MySQL tasks with other types of tasks. | 1000036 | - | ||
This type of node allows you to develop and periodically schedule AnalyticDB Spark tasks and integrate AnalyticDB Spark tasks with other types of tasks. | 1990 | ADB Spark | ||
This type of node allows you to develop and periodically schedule AnalyticDB Spark SQL tasks and integrate AnalyticDB Spark SQL tasks with other types of tasks. | 1991 | ADB Spark SQL | ||
CDH | You can use Cloudera's Distribution Including Apache Hadoop (CDH) Hive nodes in DataWorks to run Hive tasks if you have deployed a CDH cluster. | 270 | CDH_HIVE | |
Spark is a general-purpose big data analytics engine. Spark features high performance, ease of use, and widespread use. You can use Spark to perform complex memory analysis and build large, low-latency data analysis applications. | 271 | CDH_SPARK | ||
This type of node allows you to use a distributed SQL query engine to process structured data. This improves the running efficiency of jobs. | 272 | CDH_SPARK_SQL | ||
This type of node allows you to process data in ultra-large datasets. | 273 | CDH_MR | ||
This type of node allows you to use a distributed SQL query engine to analyze real-time data. This further enhances the data analysis capabilities in the CDH environment. | 278 | CDH_PRESTO | ||
This type of node allows you to write and run Impala SQL scripts. CDH Impala nodes provide higher query performance than CDH Hive nodes. | 279 | CDH_IMPALA | ||
ClickHouse | This type of node allows you to use a distributed SQL query engine to process structured data. This improves the running efficiency of jobs. | - | - | |
General | A zero load node is a control node that supports dry-run scheduling and does not generate data. In most cases, a zero load node serves as the root node of a workflow and allows you to easily manage nodes and workflows. | 99 | VIRTUAL_NODE | |
Assignment | This type of node passes the output of the last row of its code to its descendant nodes through the outputs parameter (see the assignment node sketch after this table). | 1100 | CONTROLLER_ASSIGNMENT | |
This type of node supports the standard Shell syntax. The interactive syntax is not supported. | 6 | SHELL2 | ||
This type of node can be used to aggregate parameters of its ancestor nodes and distribute parameters to its descendant nodes. | 1115 | PARAM_HUB | ||
OSS object inspection | This type of node can be used if you want to trigger a descendant node to run after Object Storage Service (OSS) objects are generated. | 239 | OSS | |
This type of node supports the Python 3.0 syntax and allows you to configure scheduling parameters on the Properties tab to obtain parameters from its ancestor nodes and configure custom parameters. In addition, the output of this type of node can be passed to a descendant node as parameters. | 1322 | PYTHON | ||
This type of node allows you to merge the status of its ancestor nodes and prevent dry run of its descendant nodes. | 1102 | CONTROLLER_JOIN | ||
This type of node allows you to route results based on logical conditions. You can also use this type of node together with an assignment node. | 1101 | CONTROLLER_BRANCH | ||
This type of node allows you to traverse the result set of an assignment node. | 1106 | CONTROLLER_TRAVERSE | ||
This type of node allows you to execute the logic of specific nodes in loops. You can also use this type of node together with an assignment node to generate the data that is passed to a descendant node of the assignment node in loops. | 1103 | CONTROLLER_CYCLE | ||
This type of node allows you to check the availability of MaxCompute partitioned tables, FTP files, and OSS objects based on check policies. If the running of a task depends on such an object, you can configure the task as a descendant task of a Check node. When the condition that is specified in the check policy is met, the task on the Check node is successfully run and its descendant task is then triggered to run. | 241 | - | |
You can configure a data sending node and a data receiving node that belong to different tenants to run tasks across tenants. | 1089 | CROSS | |
This type of node allows you to periodically schedule and process event functions and complete integration and joint scheduling with other types of nodes. | 1330 | FUNCTION_COMPUTE | ||
This type of node can be used if you want to trigger nodes in DataWorks to run after nodes in other scheduling systems finish running. Note: DataWorks no longer allows you to create cross-tenant collaboration nodes. If you have used a cross-tenant collaboration node in your business, we recommend that you replace the cross-tenant collaboration node with an HTTP Trigger node. An HTTP Trigger node provides the same capabilities as a cross-tenant collaboration node. | 1114 | SCHEDULER_TRIGGER | |
In DataWorks, you can create an SSH node and use the SSH node based on a specific SSH data source to remotely access a host that is connected to the data source and trigger script running on the host. | 1321 | SSH | ||
MySQL | This type of node allows you to develop and periodically schedule MySQL tasks and integrate MySQL tasks with other types of tasks. | 1000039 | - | |
SQL Server | This type of node allows you to develop and periodically schedule SQL Server tasks and integrate SQL Server tasks with other types of tasks. | 10001 | - | |
Oracle | This type of node allows you to develop and periodically schedule Oracle tasks and integrate Oracle tasks with other types of tasks. | 10002 | - | |
PostgreSQL | This type of node allows you to develop and periodically schedule PostgreSQL tasks and integrate PostgreSQL tasks with other types of tasks. | 10003 | - | |
PolarDB PostgreSQL | This type of node allows you to develop and periodically schedule PolarDB for PostgreSQL tasks and integrate PolarDB for PostgreSQL tasks with other types of tasks. | 10007 | - | |
Doris | This type of node allows you to develop and periodically schedule Doris tasks and integrate Doris tasks with other types of tasks. | 10008 | - | |
MariaDB | This type of node allows you to develop and periodically schedule MariaDB tasks and integrate MariaDB tasks with other types of tasks. | 10009 | - | |
SelectDB | This type of node allows you to develop and periodically schedule SelectDB tasks and integrate SelectDB tasks with other types of tasks. | 10010 | - | |
Redshift | This type of node allows you to develop and periodically schedule Redshift tasks and integrate Redshift tasks with other types of tasks. | 10011 | - | |
SAP HANA | This type of node allows you to develop and periodically schedule SAP HANA tasks and integrate SAP HANA tasks with other types of tasks. | 10012 | - | |
Vertica | This type of node allows you to develop and periodically schedule Vertica tasks and integrate Vertica tasks with other types of tasks. | 10013 | - | |
DM | This type of node allows you to develop and periodically schedule DM tasks and integrate DM tasks with other types of tasks. | 10014 | - | |
KingbaseES | This type of node allows you to develop and periodically schedule KingbaseES tasks and integrate KingbaseES tasks with other types of tasks. | 10015 | - | |
OceanBase | This type of node allows you to develop and periodically schedule OceanBase tasks and integrate OceanBase tasks with other types of tasks. | 10016 | - | |
DB2 | This type of node allows you to develop and periodically schedule Db2 tasks and integrate Db2 tasks with other types of tasks. | 10017 | - | |
GBase 8a | This type of node allows you to develop and periodically schedule GBase 8a tasks and integrate GBase 8a tasks with other types of tasks. | 10018 | - | |
Algorithm | Machine Learning Designer is a visualized modeling tool that is provided by Platform for AI (PAI) to implement end-to-end machine learning development. | - | - | |
Deep Learning Containers (DLC) of PAI is used to run training tasks in a distributed manner. | 1119 | PAI_DLC | ||
Logical node | This type of node allows you to integrate multiple workflows as a whole for management and scheduling. | 1122 | - |
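The following is a minimal PyODPS sketch that shows how Python code in a PyODPS 2 or PyODPS 3 node can read MaxCompute data. The table name, partition column, and the bizdate scheduling parameter are hypothetical; the sketch assumes that, as documented for PyODPS nodes, DataWorks provides the MaxCompute entry object o and the scheduling-parameter dictionary args, so no additional setup code is required.

```python
# Minimal PyODPS sketch for a PyODPS node (table and parameter names are hypothetical).
# In a PyODPS node, DataWorks injects the MaxCompute entry object `o` and the
# dict `args`, which contains the scheduling parameters configured for the node.
bizdate = args['bizdate']                # value of the hypothetical bizdate scheduling parameter, e.g. 20250101
table = o.get_table('ods_sales')         # hypothetical MaxCompute table in the current project
with table.open_reader(partition='ds=' + bizdate) as reader:
    # Print the row count of the daily partition as a simple data check.
    print('rows in partition ds=%s: %d' % (bizdate, reader.count))
```

Because the sketch uses only the PyODPS API and the print function, it applies to both PyODPS 2 and PyODPS 3 nodes.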
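The following is a minimal sketch of Python code for an assignment node. The values are hypothetical; the sketch assumes that, as described in the table above, only the output of the last row of the code is written to the outputs parameter of the assignment node.

```python
# Minimal assignment node sketch in Python. The region values below are hypothetical.
# Only the output of the last row of code is passed to descendant nodes
# through the outputs parameter of the assignment node.
regions = ['cn-hangzhou', 'cn-shanghai', 'cn-beijing']
print(','.join(regions))  # this last output becomes the value of the outputs parameter
```

A descendant node, such as a for-each node, can then reference the outputs parameter of the assignment node to traverse the comma-separated values.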
Create an auto triggered node
If your task needs to be automatically run on a regular basis within a specified period of time, you can create an auto triggered node or create a node in an auto triggered workflow. For example, your task can be scheduled to run by hour, day, or week.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and go to Data Studio from the Actions column.
In the left-side navigation pane of the Data Studio page, click the icon that opens the DATA STUDIO pane.
Create an auto triggered node
Directly create an auto triggered node
Select a node type.
Click the + icon to the right of the Workspace Directories section in the DATA STUDIO pane, select Create Node, and then select a desired node type.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
Note: The first time you perform operations in the Workspace Directories section of the DATA STUDIO pane, you can directly click Create Node to create a node.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create an auto triggered node in a directory
Create a directory.
Click the + icon to the right of the Workspace Directories section in the DATA STUDIO pane and select Create Directory. In the Create Directory dialog box, specify a directory name and click OK.
Select a node type.
Right-click the name of the created directory, select Create Node, and then select a node type.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create a node in an auto triggered workflow
Create an auto triggered workflow.
Select a node type.
On the left side of the configuration tab of the workflow, select a node type based on the type of task that you want to develop, and drag the node type to the canvas on the right.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
In the Create Node dialog box, specify a node name and click OK.
Create a manually triggered node
If your task does not need to be run on a regular basis, but needs to be deployed to the production environment for running, you can create a manually triggered node or create a node in a manually triggered workflow.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and go to Data Studio from the Actions column.
In the left-side navigation pane of the Data Studio page, click the icon that opens the MANUALLY TRIGGERED OBJECTS pane.
Create a manually triggered node
Directly create a manually triggered node
Select a node type.
Click the + icon to the right of the Manually Triggered Tasks section in the MANUALLY TRIGGERED OBJECTS pane, select Create Node, and then select a desired node type.
Note: You can create manually triggered nodes of only the following types: batch synchronization, notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python, and Shell. For more information about nodes, see Supported node types.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create a manually triggered node in a directory
Create a directory.
Click the + icon to the right of the Manually Triggered Tasks section in the MANUALLY TRIGGERED OBJECTS pane and select Create Directory. In the Create Directory dialog box, specify a directory name and click OK.
Select a node type.
Right-click the name of the created directory, select Create Node, and then select a node type.
Note: You can create manually triggered nodes of only the following types: batch synchronization, notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python, and Shell. For more information about nodes, see Supported node types.
Create a node.
In the Create Node dialog box, specify a node name and click OK. The configuration tab of the node appears.
Create a node in a manually triggered workflow
Create a manually triggered workflow.
Select a node type.
In the top toolbar of the configuration tab of the created manually triggered workflow, click Create Internal Node. In the popover that appears, select a node type based on the type of task that you want to develop.
DataWorks provides various node types. You can select a node type based on your business requirements. For more information, see Supported node types.
Specify a node name and press Enter.
References
For more information about node development in an auto triggered workflow or a manually triggered workflow, see Auto triggered workflow or Manually triggered workflow.
After you create and develop a node, you can deploy the node to the production environment. For more information, see Scheduling configuration and Node or workflow deployment.