This topic introduces the basic concepts in DataWorks, including workspace, workflow, solution, SQL script template, node, instance, commit operation, script, resource, function, output name, metadata, and retroactive data generation.

Workspace

Workspaces are basic units for managing nodes, members, roles, and permissions in DataWorks. A workspace administrator can add members to the workspace and assign the workspace administrator, developer, administration expert, deployment expert, security expert, or visitor role to each member. This way, workspace members with different roles can collaborate with each other.
Note We recommend that you create workspaces to isolate resources by department or business unit.

You can bind instances of multiple compute engines such as MaxCompute, E-MapReduce (EMR), and Realtime Compute to a single workspace. After you bind a compute engine instance to a workspace, you can configure and schedule the corresponding type of nodes in the workspace.

Workflow

Workflows are abstracted from business processes to help you manage and develop code based on business requirements and to make node management more efficient.
Note A workflow can be used in multiple solutions.
Workflows provide the following benefits:
  • A workflow allows you to organize nodes by type.
  • A workflow supports a hierarchical directory structure. We recommend that you create a maximum of four levels of subdirectories for a workflow.
  • You can view and optimize a workflow from the business perspective.
  • You can deploy and manage nodes in a workflow as a whole.
  • A workflow provides a dashboard for you to develop code with improved efficiency.

Solution

You can include one or more workflows in a solution.

Solutions have the following benefits:
  • A solution can contain multiple workflows.
  • A workflow can be used in multiple solutions.
  • A solution can organize various types of nodes in one place, which improves user experience.

SQL script template

SQL script templates are general-purpose logic blocks that are abstracted from SQL scripts. They help you reuse code.

Each SQL script template involves one or more source tables. You can filter source table data, join source tables, and aggregate source tables to generate a result table based on your business requirements. An SQL script template includes multiple input and output parameters.
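The idea of a parameterized template can be sketched in Python as follows. This is only an illustration under stated assumptions: the `${...}` placeholder syntax comes from Python's `string.Template`, not from DataWorks, and the template, table, and column names are hypothetical.

```python
from string import Template

# Hypothetical template: one input table, one output table, two column parameters.
# The ${...} syntax belongs to Python's string.Template, not to DataWorks.
AGG_TEMPLATE = Template(
    "INSERT OVERWRITE TABLE ${output_table}\n"
    "SELECT ${group_key}, SUM(${metric}) AS total\n"
    "FROM ${input_table}\n"
    "GROUP BY ${group_key};"
)


def render(template: Template, **params: str) -> str:
    """Fill every placeholder; raises KeyError if a parameter is missing."""
    return template.substitute(**params)


sql = render(
    AGG_TEMPLATE,
    input_table="orders",
    output_table="orders_by_region",
    group_key="region",
    metric="amount",
)
print(sql)
```

The same template can be rendered again with different input and output parameters, which is the kind of reuse that a template is meant to provide.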

Node

Each type of node is used to perform a specific data operation. Examples:
  • A sync node is used to synchronize data from ApsaraDB RDS to MaxCompute.
  • An ODPS SQL node is used to convert data by executing SQL statements that are supported by MaxCompute.

Each node has zero or more input tables or datasets and generates one or more output tables or datasets.

Nodes are classified into node tasks, flow tasks, and inner nodes.
  • Node task: A node task is used to perform a data operation. You can configure dependencies between a node task and other node tasks or flow tasks to form a directed acyclic graph (DAG).
  • Flow task: A flow task contains a group of inner nodes that process a workflow. We recommend that you create fewer than 10 flow tasks. Inner nodes in a flow task cannot be depended upon by other flow tasks or node tasks. You can configure dependencies between a flow task and other flow tasks or node tasks to form a DAG.
    Note In DataWorks V2.0 and later, you can still find flow tasks that were created in DataWorks V1.0, but you cannot create new flow tasks. Instead, you can create workflows to perform similar operations.
  • Inner node: An inner node is a node within a flow task. The features of an inner node are basically the same as those of a node task. You can drag lines between inner nodes in a flow task to configure dependencies. However, you cannot configure a recurrence for inner nodes because they follow the recurrence configuration of the flow task.
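To illustrate how dependencies between nodes form a DAG, the following minimal sketch uses Python's standard `graphlib` module to derive one valid execution order. The node names are hypothetical, and the sketch does not reflect how the DataWorks scheduling system is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical node names; each key maps to the set of nodes it depends on.
dependencies = {
    "sync_orders": set(),
    "sync_users": set(),
    "join_orders_users": {"sync_orders", "sync_users"},
    "daily_report": {"join_orders_users"},
}


def run_order(deps):
    """Return one valid execution order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A configuration in which two nodes depend on each other is not acyclic, and `run_order` rejects it by raising `CycleError`.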

Instance

An instance is a snapshot of a node at a specific point in time. An instance is generated every time a node is run as scheduled by the scheduling system or is manually triggered. An instance contains information such as the time at which the node is run, the running status of the node, and operational logs.

Assume that Node 1 is scheduled to run at 02:00 every day. The scheduling system automatically generates an instance of Node 1 at 23:30 every day based on the time that is defined for the auto triggered node. At 02:00 the next day, if the scheduling system verifies that all ancestor instances have been run, it automatically runs the instance of Node 1.
Note You can query the instance information on the Operation Center > Cycle Instance page.
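The Node 1 example can be modeled with a small Python sketch. It assumes a simplified instance that has only two states and that runs once its scheduled time has arrived and all of its ancestor instances have succeeded. The class, field, and status names are hypothetical and are not part of DataWorks.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Instance:
    """Simplified model of a node instance (hypothetical fields and statuses)."""
    node: str
    scheduled_time: datetime
    status: str = "waiting"  # becomes "success" after the instance is run
    ancestors: List["Instance"] = field(default_factory=list)

    def try_run(self, now: datetime) -> bool:
        """Run only if the scheduled time has arrived and all ancestors succeeded."""
        ready = now >= self.scheduled_time and all(
            a.status == "success" for a in self.ancestors
        )
        if ready:
            self.status = "success"
        return ready


upstream = Instance("upstream", datetime(2024, 1, 2, 1, 0))
node_1 = Instance("node_1", datetime(2024, 1, 2, 2, 0), ancestors=[upstream])

# At 02:00 the ancestor has not run yet, so Node 1 keeps waiting.
first_try = node_1.try_run(datetime(2024, 1, 2, 2, 0))
# After the ancestor succeeds, Node 1 can run.
upstream.try_run(datetime(2024, 1, 2, 2, 5))
second_try = node_1.try_run(datetime(2024, 1, 2, 2, 5))
print(first_try, second_try, node_1.status)
```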

Commit

A commit operation refers to the process of committing a node or workflow from the development environment to the scheduling system in DataWorks. When the node or workflow is committed, all of the code and scheduling configurations are also committed to the scheduling system. The scheduling system runs the node or workflow as configured.
Note The scheduling system runs nodes and workflows only after they are committed.

Script

A script stores code for data analysis. The code in a script can be used only for data query and analysis. It cannot be committed to the scheduling system for scheduling.

Resource and function

Resources and functions are concepts in MaxCompute. For more information, see Resource and Function.

The DataWorks console allows you to manage resources and functions. If resources or functions are uploaded by using other services such as MaxCompute, you cannot query them in DataWorks.

Output name

Each node has an output name. When you configure dependencies under an Alibaba Cloud account, the output name of a node is used to connect to its descendant nodes.

When you configure dependencies between a node and other nodes, you must use its output name instead of its node name or node ID. After you configure the dependencies, the output name of the node serves as the input name of its descendant nodes.
Note Each output name distinguishes a node from other nodes under the same Alibaba Cloud account. By default, the output name of each node is in the following format: <Workspace name>.<Randomly generated nine-digit number>_out. You can customize the output name for a node, but make sure that the output name is unique within your Alibaba Cloud account.
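The default format and the uniqueness requirement can be sketched in Python as follows. The function and class names are hypothetical, and the sketch assumes that the nine-digit number may contain leading zeros. It illustrates the naming rule only, not how DataWorks generates or stores output names.

```python
import random
import re


def default_output_name(workspace: str, rng: random.Random) -> str:
    """Build a name shaped like '<workspace>.<nine-digit number>_out'."""
    # Assumption: the number is zero-padded so that it always has nine digits.
    return f"{workspace}.{rng.randrange(10**9):09d}_out"


class OutputRegistry:
    """Maps each output name to the node that owns it and enforces uniqueness."""

    def __init__(self) -> None:
        self._owner = {}

    def register(self, node: str, output_name: str) -> None:
        if output_name in self._owner:
            raise ValueError(f"output name already in use: {output_name}")
        self._owner[output_name] = node

    def ancestor_of(self, input_name: str) -> str:
        """A descendant's input name is the ancestor's output name."""
        return self._owner[input_name]


rng = random.Random(0)
registry = OutputRegistry()
name = default_output_name("my_workspace", rng)
registry.register("node_a", name)
print(name, "->", registry.ancestor_of(name))
```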

Metadata

Metadata is data that describes other data. It can describe attributes such as the name, size, and data type; structure such as fields, types, and lengths; or related information such as the location, owner, output node, and access permissions. In DataWorks, metadata refers to information about tables or databases. Data Map is the main application for metadata management.

Retroactive data generation

After an auto triggered node is developed, committed, and deployed to the scheduling system, the system runs the node as scheduled. If you want to perform computing on historical data in a time period, you can generate retroactive data for the node. The generated retroactive instance is run based on the specified data timestamp.
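As a rough sketch, the following Python function lists the data timestamps for which retroactive instances would be generated. It assumes a node that is scheduled by day and one retroactive instance per day in an inclusive date range. The function name and the dates are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def data_timestamps(start: date, end: date) -> List[date]:
    """Return every data timestamp from start to end, inclusive."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# Hypothetical range: historical data for the first three days of a month.
for ts in data_timestamps(date(2024, 1, 1), date(2024, 1, 3)):
    print("retroactive instance for data timestamp", ts.isoformat())
```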