This topic introduces the basic concepts in DataWorks, including workspace, workflow, solution, SQL script template, node, instance, commit operation, script, resource, function, and output name.
Workspace
You can bind multiple compute engines, such as MaxCompute, E-MapReduce, and Realtime Compute, to a single workspace. Then, you can configure and schedule nodes in the workspace.
Workflow
A workflow provides the following features:
- Allows you to organize nodes by type.
- Supports a hierarchical directory structure. We recommend that you create a maximum of four levels of subdirectories for a workflow.
- Allows you to view and optimize the workflow from the business perspective.
- Allows you to deploy and manage the workflow as a whole.
- Allows you to view the workflow on a dashboard to develop code with improved efficiency.
Solution
A solution contains one or more workflows and provides the following benefits:
- A solution can contain multiple workflows.
- A workflow can be used in multiple solutions.
- Workspace members can collaboratively develop and manage solutions in a workspace.
SQL script template
SQL script templates are blocks of general-purpose logic abstracted from SQL scripts. You can reuse them to improve the efficiency of code development.
Each SQL script template involves one or more source tables. You can filter source table data, join source tables, and aggregate source tables to generate a result table based on your business requirements. An SQL script template includes multiple input and output parameters.
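The parameterization described above can be sketched with plain string templating. This is a minimal illustration, not the DataWorks template mechanism; the table names, parameter names, and the `render` helper are all hypothetical:

```python
# Hypothetical sketch of an SQL script template: reusable SQL logic whose
# input parameters (source table, filter threshold) and output parameter
# (result table) are filled in when the template is applied.
from string import Template

# Filters a source table and aggregates it into a result table.
# ${src_table}, ${dst_table}, and ${min_amount} are the template's parameters.
AGG_TEMPLATE = Template(
    "INSERT OVERWRITE TABLE ${dst_table} "
    "SELECT region, SUM(amount) AS total "
    "FROM ${src_table} WHERE amount > ${min_amount} "
    "GROUP BY region;"
)

def render(src_table: str, dst_table: str, min_amount: int) -> str:
    """Substitute concrete values to produce a runnable SQL script."""
    return AGG_TEMPLATE.substitute(
        src_table=src_table, dst_table=dst_table, min_amount=min_amount
    )

print(render("ods_sales", "dws_sales_agg", 0))
```

The same template can be rendered against different source and result tables, which is what makes the abstracted logic reusable across scripts.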
Node
Each node has zero or more input tables or datasets and generates one or more output tables or datasets. For example:
- A sync node is used to synchronize data from ApsaraDB for RDS to MaxCompute.
- An ODPS SQL node is used to transform data by executing SQL statements that are supported by MaxCompute.
|Node type|Description|
|---|---|
|Node task|A node task is used to perform a data operation. You can configure dependencies between a node task and other node tasks or flow tasks to form a directed acyclic graph (DAG).|
|Flow task|A flow task contains a group of inner nodes that process a workflow. We recommend that you create fewer than 10 flow tasks. Inner nodes in a flow task cannot be depended on by other flow tasks or node tasks. You can configure dependencies between a flow task and other flow tasks or node tasks to form a DAG. Note: In DataWorks V2.0 and later, you can view the flow tasks that were created in DataWorks V1.0 but cannot create new flow tasks. Instead, you can create workflows to perform similar operations.|
|Inner node|An inner node is a node within a flow task. Its features are basically the same as those of a node task. You can configure dependencies between inner nodes in a flow task by performing drag-and-drop operations. However, you cannot configure a recurrence for inner nodes because they follow the recurrence configured for the flow task.|
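The dependency model described above can be sketched as a small DAG. The node names are hypothetical, and Python's `graphlib` stands in for the DataWorks scheduling system, which runs a node only after all of its ancestors have finished:

```python
# Minimal sketch (not the DataWorks API): tasks and their dependencies form
# a directed acyclic graph, and a valid run order is any topological order.
from graphlib import TopologicalSorter

# Each key depends on the tasks in its set; names are illustrative.
dependencies = {
    "odps_sql_transform": {"sync_from_rds"},   # transform waits for the sync node
    "export_report": {"odps_sql_transform"},   # export waits for the transform
}

# static_order() yields tasks so that every ancestor precedes its descendants.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

If the configured dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why the graph must remain acyclic.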
Instance
An instance is a snapshot of a node at a specific point in time. An instance is generated each time a node is run, whether as scheduled by the scheduling system or manually triggered. An instance contains information such as the time at which the node was run, the running status of the node, and operational logs.
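The snapshot relationship between a node and its instances can be sketched as follows. This is an illustrative model only; the class and field names are assumptions, not the DataWorks data model:

```python
# Illustrative sketch: an instance is a snapshot of one run of a node,
# carrying the run time, running status, and logs for that run.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class NodeInstance:
    node_name: str
    triggered_at: datetime          # the point in time at which the node is run
    status: str = "PENDING"         # e.g. PENDING, RUNNING, SUCCEEDED, FAILED
    logs: list[str] = field(default_factory=list)

# Each scheduled or manually triggered run of the same node yields a
# separate instance, so runs can be inspected and rerun independently.
run1 = NodeInstance("sync_from_rds", datetime(2024, 1, 1, 0, 0))
run2 = NodeInstance("sync_from_rds", datetime(2024, 1, 2, 0, 0))
print(run1.triggered_at < run2.triggered_at)
```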
Script
A script stores code for data analysis. The code in a script can be used only to query and analyze data. It cannot be committed to the scheduling system for scheduling.
Resource and function
You can manage resources and functions in the DataWorks console. However, you cannot query resources or functions in DataWorks if they were uploaded by using other services, such as MaxCompute.
Output name
Under an Alibaba Cloud account, each node has an output name. The output name is used to connect the node to its descendant nodes.