This topic introduces the basic concepts related to DataWorks, including the workflow, solution, SQL script template, node, instance, commit operation, script, resource, function, and output name.

Workflow

Workflows are abstractions of your business processes. They help you manage and develop code based on business demands and improve the efficiency of node management.
Note A workflow can be added to multiple solutions.
Workflows have the following features:
  • Nodes in a workflow are organized by type.
  • A hierarchical directory structure is supported. We recommend that you create a maximum of four levels of subfolders.
  • You can view and optimize each workflow from a business perspective.
  • You can deploy and manage each workflow as a whole.
  • You can view each workflow on a dashboard to efficiently develop code.

Solution

A solution contains one or more workflows.

Solutions have the following benefits:
  • A solution can contain multiple workflows.
  • A workflow can be added to multiple solutions.
  • Workspace members can collaboratively develop and manage all solutions in a workspace.

SQL script template

SQL script templates are general logic chunks abstracted from SQL scripts. They can be reused to enhance the efficiency of code development.

Each SQL script template involves one or more source tables. You can filter source table data, join source tables, and aggregate the results to generate a result table that meets your business requirements. An SQL script template can include multiple input and output parameters.
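The parameterized structure described above can be sketched as follows. This is a minimal, hypothetical illustration of how an SQL script template separates reusable logic from its input parameters (source table, filter condition) and output parameter (result table); the table names and the `render` helper are illustrative and not part of DataWorks.

```python
# Hypothetical sketch: an SQL script template is a reusable chunk of SQL
# logic. Input parameters (source table, filter, group key) and the output
# parameter (result table) are filled in at reuse time.
TEMPLATE = """\
INSERT OVERWRITE TABLE {result_table}
SELECT {group_key}, COUNT(*) AS cnt
FROM {source_table}
WHERE {filter_condition}
GROUP BY {group_key};"""

def render(template: str, **params: str) -> str:
    """Fill the template's parameters to produce a concrete SQL script."""
    return template.format(**params)

sql = render(
    TEMPLATE,
    source_table="ods_user_log",        # illustrative input parameter
    filter_condition="dt = '20240101'", # illustrative input parameter
    group_key="user_id",
    result_table="dws_user_cnt",        # illustrative output parameter
)
print(sql)
```

The same template can be rendered again with a different source table or filter, which is what makes the abstracted logic reusable across business scenarios.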

Node

Nodes represent operations on data. For example:
  • A sync node is used to synchronize data from ApsaraDB for RDS to MaxCompute.
  • An ODPS SQL node is used to run MaxCompute SQL for data conversion.

Each node has zero or more input tables or datasets and generates one or more output tables or datasets.

Nodes are classified into node tasks, flow tasks, and inner nodes:
  • Node task: A node task is a data operation. You can configure dependencies between a node task and other node tasks or flow tasks to form a directed acyclic graph (DAG).
  • Flow task: A flow task contains a group of inner nodes that process a workflow. We recommend that you create fewer than 10 flow tasks. Inner nodes in a flow task cannot be used as dependencies of other flow tasks or node tasks. You can configure dependencies between a flow task and other flow tasks or node tasks to form a DAG.
    Note In DataWorks V2.0 and later, you can view the flow tasks created in DataWorks V1.0 but cannot create new flow tasks. You can create workflows instead to perform similar operations.
  • Inner node: An inner node is a node within a flow task. It has basically the same features as a node task. You can configure dependencies between inner nodes in a flow task by using drag-and-drop operations. However, you cannot configure a recurrence for an inner node, because inner nodes follow the recurrence configuration of their flow task.
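The dependency structure described above can be sketched with a small topological sort. This is a hypothetical illustration of how node dependencies form a DAG and yield a valid scheduling order; the node names are invented for the example and do not come from any DataWorks API.

```python
# Hypothetical sketch: node tasks and their dependencies form a directed
# acyclic graph (DAG). Each key depends on the nodes in its value set.
from graphlib import TopologicalSorter

dependencies = {
    "sync_rds_to_maxcompute": set(),                       # a sync node
    "odps_sql_clean": {"sync_rds_to_maxcompute"},          # an ODPS SQL node
    "odps_sql_aggregate": {"odps_sql_clean"},
    "export_report": {"odps_sql_aggregate", "odps_sql_clean"},
}

# A valid DAG yields a scheduling order; a cycle would raise CycleError.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because the graph is acyclic, every node runs only after all of its ancestors, which is exactly the property the scheduling system relies on.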

Instance

An instance is a snapshot of a node at a certain time point. It is generated when a node is scheduled by the scheduling system or triggered manually. The instance contains information such as the time when the node is run, the running status of the node, and operational logs.

Assume that Node 1 is configured to run at 02:00 every day. At 23:30 every day, the scheduling system automatically generates an instance of Node 1 that will run at 02:00 the next day. At 02:00 the next day, if the scheduling system detects that all ancestor instances have finished running, it automatically runs the instance of Node 1.
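The example above can be sketched as a snapshot record. This is a minimal, hypothetical model of instance generation, using the 23:30 generation time and 02:00 run time from the example; the function and field names are illustrative, not DataWorks APIs.

```python
# Hypothetical sketch: at 23:30, the scheduler snapshots a daily node as an
# instance that will run at 02:00 the next day.
from datetime import datetime, timedelta

def generate_instance(node_name: str, business_date: datetime) -> dict:
    """Return a snapshot (instance) of the node for one scheduling cycle."""
    generated_at = business_date.replace(hour=23, minute=30)
    run_at = (business_date + timedelta(days=1)).replace(hour=2, minute=0)
    return {
        "node": node_name,
        "generated_at": generated_at,
        "scheduled_run_at": run_at,
        "status": "pending",  # runs once all ancestor instances finish
    }

inst = generate_instance("Node 1", datetime(2024, 1, 1))
print(inst["scheduled_run_at"])
```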
Note You can query the instance information on the Cycle Instance page of Operation Center.

Commit

You can commit nodes and workflows from the development environment to the scheduling system. The scheduling system runs the code specified in the committed nodes and workflows according to their configurations.
Note The scheduling system runs nodes and workflows only after you commit them.

Script

A script stores code for data analysis. The code in a script can be used only for data query and analysis; it cannot be committed to the scheduling system or scheduled.

Resource and function

Resources and functions are concepts in MaxCompute. For more information, see Resource and Function.

The DataWorks console allows you to manage resources and functions. However, you cannot query resources or functions in DataWorks if they are uploaded by using other services, such as the MaxCompute client.

Output name

Under an Alibaba Cloud account, each node has an output name that is used to connect to its descendant nodes.

You must configure dependencies for a node based on the output name of the node instead of the node name or node ID. After you configure the dependencies, the output name of the node also serves as the input name of its descendant nodes.
Note Each output name distinguishes a node from other nodes under an Alibaba Cloud account. By default, an output name is in the format Workspace name.Randomly generated nine-digit number_out. You can also customize the output name of a node, but the output name of each node must be unique under an Alibaba Cloud account.
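The default format described in the note can be sketched as follows. This is a hypothetical illustration of the "Workspace name.Randomly generated nine-digit number_out" pattern; the function is invented for the example and is not how DataWorks itself generates output names.

```python
# Hypothetical sketch of the default output-name format:
#   <workspace name>.<random nine-digit number>_out
import random

def default_output_name(workspace: str) -> str:
    """Build an output name in the default format described above."""
    nine_digits = random.randint(100_000_000, 999_999_999)
    return f"{workspace}.{nine_digits}_out"

name = default_output_name("my_workspace")
print(name)
```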