DataWorks:Terms

Last Updated: Apr 07, 2024

This topic describes the terms that are related to DataWorks, including workspace, workflow, solution, SQL script template, task, instance, data timestamp, scheduling time, commit operation, script, resource, function, output name, metadata, and data backfill.

workspace

A workspace is a basic unit for managing tasks, members, roles, and permissions in DataWorks. The administrator of a workspace can add users to the workspace as members and assign the Workspace Administrator, Development, O&M, Deploy, Security Manager, or Visitor role to each member. This way, members that are assigned different roles can collaborate in the same workspace.

Note

We recommend that you create workspaces by department or business unit to isolate resources.

You can add multiple types of data sources to a workspace. After you add MaxCompute, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, ClickHouse, E-MapReduce (EMR), and CDH/CDP data sources to a workspace and associate these data sources with DataStudio, you can develop and schedule tasks of these compute engine types in the workspace.

workflow

A workflow is abstracted from business to help you manage and develop code based on your business requirements and improve task management efficiency.

Note

A workflow can be added to multiple solutions.

A workflow has the following features:

  • Allows you to develop and manage code by task type.

  • Supports a hierarchical directory structure. We recommend that you create a maximum of four levels of subdirectories for a workflow.

  • Allows you to view and optimize a workflow from the business perspective.

  • Allows you to deploy and perform O&M operations on tasks in a workflow as a whole.

  • Provides a dashboard for you to develop code with improved efficiency.

solution

A solution contains one or more workflows.

Solutions have the following benefits:

  • A solution can contain multiple workflows.

  • Multiple solutions can use the same workflow.

  • An organizational solution can contain various types of nodes. This improves user experience.

SQL script template

SQL script templates are general logic chunks that are abstracted from SQL scripts and can facilitate the reuse of code.

Each SQL script template involves one or more source tables. You can filter source table data, join source tables, and aggregate source tables to generate a result table based on your business requirements. An SQL script template contains multiple input and output parameters.
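The idea of a reusable logic chunk with input and output parameters can be sketched in Python. This is only an illustration: the placeholder syntax comes from Python's `string.Template`, not from DataWorks, and the table and parameter names (`orders`, `top_customers`, `min_amount`) are hypothetical.

```python
from string import Template

# Hypothetical template. ${source_table} and ${min_amount} act as input
# parameters, ${result_table} as the output parameter. This placeholder
# syntax belongs to Python's string.Template, not to DataWorks.
TOP_CUSTOMERS = Template(
    "INSERT OVERWRITE TABLE ${result_table}\n"
    "SELECT customer_id, SUM(amount) AS total_amount\n"
    "FROM ${source_table}\n"
    "WHERE amount >= ${min_amount}\n"
    "GROUP BY customer_id;"
)


def render(template: Template, **params: str) -> str:
    """Fill in every parameter; raises KeyError if one is missing."""
    return template.substitute(**params)


sql = render(
    TOP_CUSTOMERS,
    source_table="orders",
    min_amount="100",
    result_table="top_customers",
)
print(sql)
```

The same template can be rendered again with a different source table or threshold, which is the reuse that a SQL script template is meant to provide.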

task

A task is used to perform a specific data operation.

  • A data synchronization node task is used to synchronize data from a source to a destination, such as from ApsaraDB RDS to MaxCompute.

  • An ODPS SQL node task is used to convert data by executing SQL statements that are supported by MaxCompute.

Each task has zero or more input tables or datasets and generates one or more output tables or datasets.

Tasks are classified into node tasks, flow tasks, and inner nodes.

Task type

Description

Node task

A node task is used to perform a data operation. You can configure dependencies between a node task and flow tasks or other node tasks to form a directed acyclic graph (DAG).

Flow task

A flow task contains a group of inner nodes that are used in the same business scenario. We recommend that you create fewer than 10 flow tasks in a workspace.

Inner nodes in a flow task cannot be configured as the dependencies of node tasks or other flow tasks. You can configure dependencies between a flow task and node tasks or other flow tasks to form a DAG.

Note

In DataWorks V2.0 and later, the flow tasks that are created in DataWorks V1.0 are retained, but you can no longer create flow tasks. Instead, you can create workflows to perform similar operations.

Inner node

An inner node is a node within a flow task. The features of an inner node are basically the same as those of a node task. You can drag lines between inner nodes in a flow task to configure dependencies. However, you cannot configure scheduling properties for inner nodes because these nodes use the scheduling configurations of the flow task.
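The way task dependencies form a DAG can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). This is a minimal sketch, not how the DataWorks scheduling system is implemented, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names. Each key maps a task to the set of tasks
# that it depends on (its ancestors).
dependencies = {
    "sync_orders": set(),
    "clean_orders": {"sync_orders"},
    "daily_report": {"clean_orders"},
}


def run_order(deps):
    """Return one valid execution order.

    Raises CycleError if the dependencies contain a cycle, which means
    that they do not form a DAG.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)  # ['sync_orders', 'clean_orders', 'daily_report']
```

A dependency loop, such as two tasks that depend on each other, causes `run_order` to raise `CycleError` instead of returning an order.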

instance

An instance is a snapshot of a task at a specific point in time. An instance is generated every time a task is run as scheduled by the scheduling system or is manually triggered. An instance contains information such as the time at which the task is run, the running status of the task, and run logs.

For example, Task 1 is an auto triggered task that is scheduled to run at 02:00 every day. The scheduling system automatically generates an instance for Task 1 at 23:30 every day based on the scheduling time of Task 1. At 02:00 the next day, if the scheduling system determines that the ancestor instance has run successfully, the scheduling system automatically runs the instance of Task 1.

Note

You can query the instance information on the Cycle Instance page in Operation Center.
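The Task 1 example can be sketched as a readiness check. This is a simplified model under stated assumptions: the `Instance` class, the status labels, and the rule that every ancestor must have the `succeeded` status are illustrative and do not reflect the scheduling system's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Instance:
    """Hypothetical snapshot of a task for one scheduled run."""
    task_name: str
    scheduling_time: datetime
    status: str = "not_run"  # assumed labels: not_run, succeeded
    ancestors: list = field(default_factory=list)


def is_ready(instance: Instance, now: datetime) -> bool:
    """An instance can run once its scheduling time has arrived and
    every ancestor instance has succeeded (assumed rule)."""
    time_reached = now >= instance.scheduling_time
    ancestors_done = all(a.status == "succeeded" for a in instance.ancestors)
    return time_reached and ancestors_done


# The ancestor runs at 01:00, and Task 1 is scheduled for 02:00.
upstream = Instance("upstream_task", datetime(2024, 4, 7, 1, 0))
task_1 = Instance("task_1", datetime(2024, 4, 7, 2, 0), ancestors=[upstream])

print(is_ready(task_1, datetime(2024, 4, 7, 2, 0)))  # False: ancestor not done
upstream.status = "succeeded"
print(is_ready(task_1, datetime(2024, 4, 7, 2, 0)))  # True
```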

data timestamp/scheduling time

  • Data timestamp

    The data timestamp is the date on which the business data is generated. For example, if you collect statistical data on the turnover of the previous day on the current day, the previous day is the date on which the business transaction is conducted and represents the data timestamp.

  • Scheduling time

    The scheduling time is the time at which a node is configured to run. It can be different from the actual time at which the node starts to run because the actual running time is affected by multiple factors, such as whether ancestor instances have finished running.
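The relationship in the turnover example, where a task that runs today processes the data of the previous day, can be expressed as simple date arithmetic. The function name and the one-day default offset are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta


def data_timestamp(scheduling_time: datetime, offset_days: int = 1) -> date:
    """Return the business date that a run processes.

    Assumes the common case from the example: a run scheduled on day T
    processes the data that was generated on day T-1.
    """
    return scheduling_time.date() - timedelta(days=offset_days)


# A run scheduled for 02:00 on April 7 processes the data of April 6.
print(data_timestamp(datetime(2024, 4, 7, 2, 0)))  # 2024-04-06
```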

commit

You can commit node tasks and workflows from the development environment to the scheduling system. The scheduling system runs the code of the committed node tasks and workflows based on the related scheduling configurations.

Note

The scheduling system runs only committed node tasks and workflows.

script

A script stores code for data analysis. The code in a script can be used only to query and analyze data. The code cannot be committed to the scheduling system for scheduling or used to configure scheduling parameters.

resource and function

Resources and functions are terms in MaxCompute and refer to resources and functions that are used by a MaxCompute compute engine. For more information, see Resource and Function.

output name

An output name is the name of the output that is generated by a task. Each task has an output name. When you configure dependencies between tasks within an Alibaba Cloud account, the output name of a task is used to connect to its descendant tasks.

When you configure dependencies for a task, you must use the output name of the task instead of the node name or ID. After you configure the dependencies, the output name of the task serves as the input name of its descendant tasks.

Note

The output name of a task distinguishes the task from other tasks within the same Alibaba Cloud account. By default, the output name of a task is in the following format: Workspace name.Randomly generated nine-digit number.out. You can customize the output name for a task. You must make sure that the output name of the task is unique within your Alibaba Cloud account.
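The default format and the uniqueness requirement can be sketched as follows. The helper names are hypothetical, and whether the nine-digit number can start with zeros is an assumption. DataWorks generates and validates output names itself, so this sketch only illustrates the rule.

```python
import random
import re

# Matches "<workspace name>.<nine digits>.out".
OUTPUT_NAME = re.compile(r"^(?P<workspace>.+)\.(?P<number>\d{9})\.out$")


def default_output_name(workspace: str, rng=random) -> str:
    """Build a name in the default format (leading zeros are assumed to be allowed)."""
    return f"{workspace}.{rng.randrange(10**9):09d}.out"


def register(name: str, existing: set) -> None:
    """Reject an output name that is already used within the account."""
    if name in existing:
        raise ValueError(f"output name already in use: {name}")
    existing.add(name)


account_output_names = set()
name = default_output_name("my_workspace")
register(name, account_output_names)
print(name, bool(OUTPUT_NAME.match(name)))
```

A custom output name passes through the same `register` check, which reflects the requirement that a customized name must remain unique within the Alibaba Cloud account.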

metadata

Metadata describes data attributes, data structures, and other relevant information. Data attributes include the name, size, and data type. Data structures include the field, type, and length. Other relevant information includes the location, owner, output task, and access permissions. In DataWorks, metadata refers to information about tables or databases. Data Map is the main service used to manage metadata.

data backfill

After an auto triggered task is developed, committed, and deployed to the scheduling system, the scheduling system runs the task as scheduled. If you want to perform computing on data that is generated in a historical period of time, you can use the data backfill feature. The generated data backfill instance is run based on the specified data timestamp.
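Conceptually, backfilling a historical period means running the task once for each data timestamp in that period. The following sketch only enumerates those timestamps. The function name is hypothetical, and the assumption of one instance per day applies to a task that is scheduled by day.

```python
from datetime import date, timedelta


def backfill_timestamps(start: date, end: date) -> list:
    """List every data timestamp from start to end, inclusive.

    Assumes a task that is scheduled by day, so that each date
    corresponds to one data backfill instance.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


for ts in backfill_timestamps(date(2024, 4, 1), date(2024, 4, 3)):
    print(ts)  # 2024-04-01, 2024-04-02, 2024-04-03 on separate lines
```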