This topic introduces the basic concepts in DataWorks, including workspace, workflow, solution, SQL script template, node, instance, commit operation, script, resource, function, and output name.

Workspace

Workspaces are basic units for managing nodes, members, roles, and permissions in DataWorks. A workspace administrator can add members to the workspace and assign the workspace administrator, developer, administration expert, deployment expert, security expert, or visitor role to each member. In this way, workspace members with different roles can collaborate with each other.
Note We recommend that you create workspaces to isolate resources by department or business unit.

You can bind multiple compute engines such as MaxCompute, E-MapReduce, and Realtime Compute to a single workspace. Then, you can configure and schedule nodes in the workspace.

Workflow

Workflows are abstracted from business processes to help you manage and develop code based on business demands and to improve the efficiency of node management.
Note A workflow can be used in multiple solutions.
A workflow has the following features:
  • Allows you to organize nodes by type.
  • Supports a hierarchical directory structure. We recommend that you create a maximum of four levels of subdirectories for a workflow.
  • Allows you to view and optimize the workflow from the business perspective.
  • Allows you to deploy and manage the workflow as a whole.
  • Allows you to view the workflow on a dashboard so that you can develop code more efficiently.

Solution

A solution contains one or more workflows.

Solutions have the following benefits:
  • A solution can contain multiple workflows.
  • A workflow can be used in multiple solutions.
  • Workspace members can collaboratively develop and manage solutions in a workspace.

SQL script template

SQL script templates are reusable chunks of general logic that are abstracted from SQL scripts. Reusing these templates improves the efficiency of code development.

Each SQL script template involves one or more source tables. You can filter, join, and aggregate the source tables to generate a result table that fits your business requirements. An SQL script template accepts multiple input and output parameters.
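
For example, the core logic of a template might look like the following MaxCompute SQL sketch. The table names sales and stores and the ${input_date} placeholder are illustrative assumptions, not fixed DataWorks syntax:

    -- Hypothetical template logic: filter two assumed source tables by date,
    -- join them, and aggregate the joined data into a result set. In an
    -- actual template, the date value and the table names would typically be
    -- exposed as input parameters, and the result would be written to the
    -- table that is specified by an output parameter.
    SELECT   s.store_id,
             st.region,
             SUM(s.amount) AS total_amount
    FROM     sales s
    JOIN     stores st
    ON       s.store_id = st.store_id
    WHERE    s.ds = '${input_date}'
    GROUP BY s.store_id, st.region;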

Node

Each type of node is used to perform a specific data operation. For example:
  • A sync node is used to synchronize data from ApsaraDB for RDS to MaxCompute.
  • An ODPS SQL node is used to convert data by executing SQL statements that are supported by MaxCompute.

Each node has zero or more input tables or datasets and generates one or more output tables or datasets.
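
For example, an ODPS SQL node might convert data by running a statement similar to the following sketch. The tables ods_user_log (input) and dws_user_pv_1d (output) are assumed names, and ${bizdate} is a scheduling parameter that resolves to the business date:

    -- Hypothetical node logic: read one input table and generate one
    -- output table by aggregating the data of the business date.
    INSERT OVERWRITE TABLE dws_user_pv_1d PARTITION (ds = '${bizdate}')
    SELECT   user_id,
             COUNT(*) AS pv
    FROM     ods_user_log
    WHERE    ds = '${bizdate}'
    GROUP BY user_id;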

Nodes are classified into the following types:
  • Node task: A node task is used to perform a data operation. You can configure dependencies between a node task and other node tasks or flow tasks to form a directed acyclic graph (DAG).
  • Flow task: A flow task contains a group of inner nodes that process a workflow. We recommend that you create fewer than 10 flow tasks. Inner nodes in a flow task cannot serve as dependencies for other flow tasks or node tasks. You can configure dependencies between a flow task and other flow tasks or node tasks to form a DAG.
    Note In DataWorks V2.0 and later, you can view the flow tasks that were created in DataWorks V1.0 but cannot create new flow tasks. Instead, you can create workflows to perform similar operations.
  • Inner node: An inner node is a node within a flow task. Its features are basically the same as those of a node task. You can configure dependencies between inner nodes in a flow task by performing drag-and-drop operations. However, you cannot configure a recurrence for inner nodes because they follow the recurrence configuration of their flow task.

Instance

An instance is a snapshot of a node at a specific time point. An instance is generated every time a node is run as scheduled by the scheduling system or manually triggered. An instance contains information such as the time point at which the node is run, the running status of the node, and operational logs.

Assume that Node 1 is configured to run at 02:00 every day. The scheduling system automatically generates an instance of Node 1 at 23:30 every day. At 02:00 the next day, if the scheduling system verifies that all the ancestor instances of Node 1 have finished running, it automatically runs the instance of Node 1.
Note You can query the instance information on the Cycle Instance page of Operation Center.

Commit

You can commit nodes and workflows from the development environment to the scheduling system. The scheduling system runs the code in the committed nodes and workflows as configured.
Note The scheduling system runs nodes and workflows only after you commit them.

Script

A script stores code that is used only for data query and analysis. The code in a script cannot be committed to the scheduling system for scheduling.

Resource and function

Resources and functions are concepts in MaxCompute. For more information, see Resource and Function.

The DataWorks console allows you to manage resources and functions. However, you cannot query resources or functions in DataWorks if they are uploaded by using tools other than DataWorks, such as the MaxCompute client.
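
For example, after you upload a JAR file as a resource, one way to register a function that references the resource is to run MaxCompute SQL similar to the following sketch. The names my_lower, com.example.Lower, and my_udf.jar are assumptions for illustration:

    -- Hypothetical example: register a MaxCompute UDF whose implementation
    -- class is packaged in a previously uploaded JAR resource.
    CREATE FUNCTION my_lower AS 'com.example.Lower' USING 'my_udf.jar';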

Output name

Under an Alibaba Cloud account, each node has an output name that is used to connect to its descendant nodes.

When you configure dependencies for a node, you must use its output name instead of its node name or node ID. After you configure the dependencies, the output name of the node serves as the input name of its descendant nodes.
Note Each output name distinguishes a node from other nodes under the same Alibaba Cloud account. By default, an output name is in the format Workspace name.Randomly generated nine-digit number_out, for example, test_project.123456789_out. You can customize the output name of a node, but the output name of each node must be unique under the Alibaba Cloud account.