This topic introduces the basic concepts in DataWorks, including workspace, workflow, solution, SQL script template, node, instance, commit operation, script, resource, function, and output name.
Workspace

You can bind instances of multiple compute engines, such as MaxCompute, E-MapReduce (EMR), and Realtime Compute, to a single workspace. After you bind a compute engine instance to a workspace, you can configure and schedule nodes of the corresponding type in the workspace.
Workflow

- A workflow allows you to organize nodes by type.
- A workflow supports a hierarchical directory structure. We recommend that you create a maximum of four levels of subdirectories for a workflow.
- You can view and optimize a workflow from the business perspective.
- You can deploy and manage nodes in a workflow as a whole.
- A workflow provides a dashboard that helps you develop code more efficiently.
Solution

You can include one or more workflows in a solution.
- A solution can contain multiple workflows.
- A workflow can be used in multiple solutions.
- A solution can organize various types of nodes, which improves the user experience.
SQL script template
SQL script templates are general logic blocks that are abstracted from SQL scripts. They help you reuse code.
Each SQL script template involves one or more source tables. You can filter source table data, join source tables, and aggregate source tables to generate a result table based on your business requirements. An SQL script template includes multiple input and output parameters.
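As an illustration of the idea, the following Python sketch renders a parameterized SQL template. The table names, parameter names, and the template itself are hypothetical examples, not actual DataWorks template syntax:

```python
from string import Template

# Hypothetical SQL script template: the source tables, result table, and
# data date are the template's input and output parameters.
SQL_TEMPLATE = Template("""
INSERT OVERWRITE TABLE ${result_table}
SELECT o.customer_id, SUM(o.amount) AS total_amount
FROM ${orders_table} o
JOIN ${customers_table} c ON o.customer_id = c.customer_id
WHERE o.ds = '${data_date}'
GROUP BY o.customer_id;
""")

def render(result_table, orders_table, customers_table, data_date):
    """Fill in the template's parameters to produce a concrete SQL script."""
    return SQL_TEMPLATE.substitute(
        result_table=result_table,
        orders_table=orders_table,
        customers_table=customers_table,
        data_date=data_date,
    )

sql = render("dws_customer_total", "ods_orders", "ods_customers", "20240101")
print(sql)
```

The same template can be reused across workflows by supplying different source tables, result tables, and data dates.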
Node

A node performs a specific data operation. For example:

- A sync node is used to synchronize data from ApsaraDB RDS to MaxCompute.
- An ODPS SQL node is used to convert data by executing SQL statements that are supported by MaxCompute.
Each node has zero or more input tables or datasets and generates one or more output tables or datasets.
| Type | Description |
|------|-------------|
| Node task | A node task is used to perform a data operation. You can configure dependencies between a node task and other node tasks or flow tasks to form a directed acyclic graph (DAG). |
| Flow task | A flow task contains a group of inner nodes that process a workflow. We recommend that you create no more than 10 flow tasks. Inner nodes in a flow task cannot be depended upon by other flow tasks or node tasks. You can configure dependencies between a flow task and other flow tasks or node tasks to form a DAG. Note: In DataWorks V2.0 and later, you can view the flow tasks that were created in DataWorks V1.0 but cannot create new flow tasks. Instead, you can create workflows to perform similar operations. |
| Inner node | An inner node is a node within a flow task. The features of an inner node are basically the same as those of a node task. You can draw lines between inner nodes in a flow task to configure dependencies. However, you cannot configure a recurrence for inner nodes because they follow the recurrence configuration of the flow task. |
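The dependencies described above form a DAG, and a scheduling system derives a run order from it by running every ancestor before its descendants. The following Python sketch illustrates this with hypothetical node names using the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependencies: each key maps a node to the set of nodes
# it depends on (its ancestors).
dependencies = {
    "sync_orders": set(),                            # sync node, no ancestors
    "sync_customers": set(),
    "odps_sql_join": {"sync_orders", "sync_customers"},
    "odps_sql_report": {"odps_sql_join"},
}

# A topological order guarantees every node appears after its ancestors,
# which is the order a scheduler could run the nodes in.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

`TopologicalSorter` also rejects cyclic dependencies, which is why the graph must be acyclic for scheduling to be possible.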
Instance

An instance is a snapshot of a node at a specific point in time. An instance is generated every time a node is run as scheduled by the scheduling system or is manually triggered. An instance contains information such as the time at which the node is run, the running status of the node, and operational logs.
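The pieces of information an instance carries can be pictured as a small record. This is an illustrative sketch only; the field names are hypothetical, not the actual DataWorks schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical instance record: one is created per scheduled or manual run.
@dataclass
class Instance:
    node_name: str
    data_timestamp: str          # the date the run processes, e.g. "20240101"
    started_at: datetime         # when the node was run
    status: str = "PENDING"      # e.g. PENDING, RUNNING, SUCCEEDED, FAILED
    logs: list[str] = field(default_factory=list)

run = Instance("odps_sql_join", "20240101", datetime(2024, 1, 2, 0, 5))
run.status = "SUCCEEDED"
run.logs.append("job finished in 42s")
```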
Script

A script stores code for data analysis. The code in a script can be used only for data query and analysis; it cannot be committed to the scheduling system for scheduling.
Resource and function
The DataWorks console allows you to manage resources and functions. If resources or functions are uploaded by using other services such as MaxCompute, you cannot query them in DataWorks.
Output name

Each node has an output name. When you configure dependencies for nodes within an Alibaba Cloud account, the output name of a node is used to connect the node to its descendant nodes.
Metadata

Metadata is data that describes other data. It can describe attributes of data, such as the name, size, and data type; the structure of data, such as fields, field types, and field lengths; or related information, such as the location, owner, output node, and access permissions. In DataWorks, metadata refers to information about tables or databases. Data Map is the main application for metadata management.
Retroactive data generation
After an auto triggered node is developed, committed, and deployed to the scheduling system, the system runs the node as scheduled. If you want to perform computations on historical data within a specific time period, you can generate retroactive data for the node. Each generated retroactive instance is run based on its specified data timestamp.
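Conceptually, a retroactive run enumerates one data timestamp per day in the chosen historical period and generates one instance per timestamp. A minimal Python sketch of that enumeration, with an assumed `YYYYMMDD` timestamp format:

```python
from datetime import date, timedelta

def data_timestamps(start: date, end: date):
    """Yield one data timestamp (YYYYMMDD) per day in [start, end]."""
    day = start
    while day <= end:
        yield day.strftime("%Y%m%d")
        day += timedelta(days=1)

# One retroactive instance would be generated for each of these timestamps.
stamps = list(data_timestamps(date(2024, 1, 1), date(2024, 1, 3)))
print(stamps)  # ['20240101', '20240102', '20240103']
```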