Task type overview

Last Updated: Dec 08, 2017

DataWorks provides seven types of nodes, each suited to different use cases.

Task type

OPEN_MR task

An OPEN_MR task periodically runs a data processing program written against the MaxCompute MapReduce API (Java API). For examples, see OPEN_MR task.

MaxCompute provides a MapReduce API, and you can use its Java API to write MapReduce programs that process MaxCompute data. Package the program as a JAR or another type of resource, upload the package to DataWorks, and then configure the OPEN_MR node task.
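For readers new to the programming model, the map, shuffle, and reduce phases of a MapReduce job can be sketched as a word count. This is a conceptual illustration in plain Python only: it does not use the MaxCompute Java API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over in-memory records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["a b a", "b c"])
print(counts)
```

In a real OPEN_MR or ODPS_MR task, the equivalent mapper and reducer would be written in Java against the MaxCompute MapReduce API, and would read from and write to MaxCompute tables rather than in-memory lists.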

ODPS_MR task

MaxCompute provides a MapReduce API, and you can use its Java API to write MapReduce programs that process MaxCompute data. You can create ODPS_MR nodes and use them in Task Scheduling. For examples, see ODPS_MR task.

ODPS_SQL task

An ODPS_SQL task lets you edit and maintain SQL code directly on the web, and supports debugging, running, and collaborative development. DataWorks also provides features such as code version management and automatic parsing of upstream and downstream dependencies. For examples, see Create a task.

By default, DataWorks uses a MaxCompute project as the space for development and production, so the code in ODPS_SQL nodes follows MaxCompute SQL syntax. MaxCompute SQL syntax is similar to that of Hive and can be regarded as a subset of standard SQL. However, MaxCompute SQL is not equivalent to a database, because it lacks many database characteristics, such as transactions, primary key constraints, and indexes.

For more information about MaxCompute SQL syntax, see SQL overview.

Data synchronization task

The data synchronization node task is a stable, efficient, and elastic data synchronization cloud service that the Alibaba Cloud DTplus platform provides to the public. With the data synchronization node, you can easily synchronize data from a business system to MaxCompute. For details, see Data synchronization task.

Machine Learning task

The Machine Learning node calls tasks created in the Machine Learning platform and schedules them for production according to the node configuration. For details, see Machine Learning.

Note:

Only experiments that have been created and saved in the Machine Learning platform can be selected in the Machine Learning nodes of DataWorks.

SHELL task

The SHELL node supports standard shell syntax, but does not support interactive syntax. For details, see SHELL task.

Virtual node task

A virtual node is a control node that does not generate any data. It is generally used as the root node that coordinates the other nodes in a workflow. For more information, see Virtual node task.

Note:

When the final output table of a workflow is built from multiple branch input tables, a virtual node is usually used if those input tables do not depend on each other.

Example:

An ODPS_SQL task processes the source tables imported by three data synchronization tasks and generates an output table. The three data synchronization tasks do not depend on each other, but the ODPS_SQL task depends on all three. The workflow is as follows:

A virtual node serves as the root node where the workflow starts. The three data synchronization tasks depend on the virtual node, and the ODPS_SQL processing task depends on the three data synchronization tasks.
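The dependency structure of this example can be modeled as a small directed acyclic graph. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). It is not how DataWorks schedules tasks internally, and the node names (`virtual_root`, `sync_task_1`, and so on) are hypothetical, chosen only to mirror the example.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on.
# Node names are hypothetical and mirror the example workflow.
dependencies = {
    "virtual_root": set(),
    "sync_task_1": {"virtual_root"},
    "sync_task_2": {"virtual_root"},
    "sync_task_3": {"virtual_root"},
    "odps_sql_task": {"sync_task_1", "sync_task_2", "sync_task_3"},
}


def execution_waves(deps):
    """Group nodes into waves; nodes in the same wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # get_ready() returns every node whose dependencies are all done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The virtual node forms the first wave on its own, the three synchronization tasks form the second wave because they depend only on the root, and the ODPS_SQL task runs last. Without the virtual node, the three synchronization tasks would have no common starting point in the graph.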
