This topic describes the terms that are related to DataWorks, including workspace, workflow, solution, SQL script template, node, instance, data timestamp, scheduling time, commit operation, script, resource, function, output name, metadata, and data backfill.
workspace
A workspace is a basic unit for organizing and managing nodes, members, and permissions in DataWorks. You can associate compute engines such as MaxCompute, E-MapReduce (EMR), and Realtime Compute for Apache Flink with a workspace as compute engine instances. You can then use these compute engine instances to develop and schedule nodes in the workspace.
workflow
A workflow organizes different types of nodes from the business perspective. Workflows provide the following benefits:
- Allows you to develop and manage code by node type.
- Supports a hierarchical directory structure. We recommend that you create a maximum of four levels of subdirectories for a workflow.
- Allows you to view and optimize a workflow from the business perspective.
- Allows you to deploy and manage nodes in a workflow as a whole.
- Provides a dashboard for you to develop code with improved efficiency.
solution
A solution contains one or more workflows. Solutions provide the following benefits:
- A solution can contain multiple workflows.
- Multiple solutions can use the same workflow.
- A solution can contain various types of nodes. This improves user experience.
SQL script template
An SQL script template is a general, reusable piece of logic that is abstracted from SQL scripts to facilitate code reuse.
Each SQL script template involves one or more source tables. You can filter source table data, join source tables, and aggregate source tables to generate a result table based on your business requirements. An SQL script template supports multiple input and output parameters.
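For illustration, the following MaxCompute SQL sketch shows the kind of logic that an SQL script template can encapsulate: it filters and joins two source tables and aggregates the result. The table names, columns, and the @@{bizdate} parameter reference are hypothetical placeholders, not code from an actual template.

```sql
-- Hypothetical template logic: filter, join, and aggregate two source
-- tables into a result table. @@{bizdate} stands for an input parameter.
INSERT OVERWRITE TABLE shop_daily_gmv PARTITION (pt = '@@{bizdate}')
SELECT  o.shop_id,
        s.shop_name,
        SUM(o.order_amount) AS gmv           -- aggregate the joined rows
FROM    sales_order o                        -- source table 1
JOIN    dim_shop s                           -- source table 2
        ON o.shop_id = s.shop_id             -- join the source tables
WHERE   o.pt = '@@{bizdate}'                 -- filter by the input parameter
GROUP BY o.shop_id, s.shop_name;
```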
node
A node is a basic unit for performing data operations in DataWorks. For example:
- A data synchronization node is used to synchronize data from a source to a destination, such as data synchronization from ApsaraDB RDS to MaxCompute.
- An ODPS SQL node is used to convert data by executing SQL statements that are supported by MaxCompute, as shown in the example below.
Each node has zero or more input tables or datasets and generates one or more output tables or datasets.
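As a sketch of the ODPS SQL node mentioned above, the following statement uses MaxCompute SQL to convert raw data into a cleansed table. The table and column names are hypothetical, and ${bizdate} is a DataWorks scheduling parameter that is replaced at run time.

```sql
-- Hypothetical ODPS SQL node: one input table, one output table.
INSERT OVERWRITE TABLE dwd_clean_log PARTITION (pt = '${bizdate}')
SELECT  user_id,
        LOWER(event_name) AS event_name      -- normalize event names
FROM    ods_raw_log
WHERE   pt = '${bizdate}'
AND     user_id IS NOT NULL;                 -- drop invalid rows
```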
The following table describes the node types.

| Node type | Description |
| --- | --- |
| Node task | A node task is used to perform a data operation. You can configure dependencies between a node task and flow tasks or other node tasks to form a directed acyclic graph (DAG). |
| Flow task | A flow task contains a group of inner nodes that process a workflow. We recommend that you create fewer than 10 flow tasks in a workspace. Inner nodes in a flow task cannot be configured as the dependencies of node tasks or other flow tasks. You can configure dependencies between a flow task and node tasks or other flow tasks to form a DAG. Note: In DataWorks V2.0 and later, the flow tasks that were created in DataWorks V1.0 are retained, but you can no longer create flow tasks. Instead, you can create workflows to perform similar operations. |
| Inner node | An inner node is a node within a flow task. The features of an inner node are basically the same as those of a node task. You can drag lines between inner nodes in a flow task to configure dependencies. However, you cannot configure scheduling properties for inner nodes because these nodes use the scheduling configurations of their flow task. |
instance
An instance is a snapshot of a node at a specific point in time. An instance is generated every time a node is run as scheduled by the scheduling system or is manually triggered. An instance contains information such as the time at which the node is run, the running status of the node, and run logs.
data timestamp/scheduling time
- Data timestamp
A data timestamp indicates the date on which business data is generated. In most cases, the data timestamp of a node that is scheduled by day is one day before the date on which the node is run. For example, if you collect statistics on the turnover of the previous day on the current day, the previous day is the date on which the business transactions were conducted and represents the data timestamp, as the example after this list shows.
- Scheduling time
The scheduling time is the point in time at which the scheduling system expects to run a node. The scheduling time can differ from the actual time at which the node runs, because the actual run time is affected by factors such as whether the ancestor nodes of the node have finished running and whether scheduling resources are available.
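The following sketch shows how the two concepts relate in a daily node that collects the turnover of the previous day. The table name is a placeholder, and ${bizdate} is the DataWorks scheduling parameter that resolves to the data timestamp.

```sql
-- For an instance whose scheduling time is 2024-06-02 00:30:00, the data
-- timestamp ${bizdate} resolves to 20240601: the day on which the
-- business transactions were conducted.
SELECT  SUM(order_amount) AS turnover
FROM    ods_order                            -- placeholder source table
WHERE   pt = '${bizdate}';                   -- partition of the previous day
```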
commit
A commit is the operation that you perform to submit a node to the scheduling system. A committed node can then be deployed, after which the scheduling system runs the node based on its scheduling configurations.
script
A script stores code for data analysis. The code in a script can be used only to query and analyze data. The code cannot be committed to the scheduling system for scheduling, and scheduling parameters cannot be configured for the code.
resource and function
Resources and functions are concepts of MaxCompute. They refer to the resources and functions that are used by MaxCompute compute engines. For more information, see Resource and Function.
output name
An output name is the name of the output that is generated by a node. Each node has an output name. When you configure dependencies between nodes within an Alibaba Cloud account, the output name of a node is used to connect the node to its descendant nodes.
metadata
Metadata describes data attributes, data structures, and other related information. Data attributes include the name, size, and data type of the data. Data structures include the fields, field types, and field lengths of the data. Other related information includes the storage location, owner, output node, and access permissions of the data. In DataWorks, metadata refers to information about tables or databases. DataMap is the main service that is used to manage metadata.
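As a small illustration, the DESC command in MaxCompute returns the metadata of a table, such as its owner, size, creation time, and the name and type of each column. DataMap collects and displays this kind of information. The table name in the following example is a placeholder.

```sql
-- Returns table metadata: owner, creation time, size, and the name,
-- type, and comment of each column.
DESC dwd_clean_log;
```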
data backfill
After an auto triggered node is developed, committed, and deployed to the scheduling system, the scheduling system runs the node as scheduled. If you want to perform computations on data that was generated in a historical period of time, you can use the data backfill feature. The data backfill instances that are generated are run based on the specified data timestamps.
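Under the assumption that a node writes to a partition that is named after the data timestamp, the same node code serves both scheduled runs and backfill runs. For example, if you backfill data for the data timestamps 20240601 to 20240603, three instances are generated, and each instance substitutes its own value for ${bizdate}. The table names below are placeholders.

```sql
-- Each data backfill instance runs this code with its own ${bizdate}
-- value, which recomputes the partition for that historical day.
INSERT OVERWRITE TABLE dws_shop_gmv PARTITION (pt = '${bizdate}')
SELECT  shop_id,
        SUM(order_amount) AS gmv
FROM    dwd_order                            -- placeholder source table
WHERE   pt = '${bizdate}'
GROUP BY shop_id;
```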