Create and deploy an auto triggered node for data development - DataWorks

Prerequisites

Before you begin, make sure you have:

Completed the environment setup described in Data development: Developers
Added a MaxCompute data source to your workspace (required to create an ODPS SQL node)
An Alibaba Cloud account or a RAM user assigned the Workspace Administrator or Develop role

How it works

DataWorks organizes development work into workflows, where each node represents a unit of computation. DataStudio provides a visual development interface for nodes that run on various compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), and CDH. For more information, see Overview.

The typical data flow is:

A data synchronization node ingests raw business data into a source table.
A compute node cleanses and transforms the source data, then writes results to a result table.

In this tutorial, you skip the synchronization node by uploading data directly from your local machine. You then create a compute node to process the data.

Step 1: Create a workflow

Workflows are the organizational unit in DataStudio. All nodes live inside a workflow, so create one before doing any development.

Log on to the DataWorks console. In the top navigation bar, select your region. In the left-side navigation pane, choose Data Development and Governance > Data Development. Select your workspace and click Go to Data Development.
In the Scheduled Workflow pane, create a workflow using one of these methods:
- Move the pointer over the Create icon and click Create Workflow.
- Right-click Business Flow and select Create Workflow.
In the Create Workflow dialog box, enter a workflow name and description, then click Create. In this example, the workflow name is Create the first auto triggered node.

For more information, see Create a workflow.

Step 2: Create tables

Before writing any node code, create the tables that will hold your raw and processed data. This example uses two MaxCompute tables.

Create and define the tables

In the Scheduled Workflow pane, click Business Flow, find your workflow, right-click MaxCompute, and select Create Table.

In the Create Table dialog box, set the Engine Instance and Name parameters. Create the following two tables:

Table name	Description
`bank_data`	Stores raw business data
`result_table`	Stores the cleansed and processed data

On the configuration tab of each table, switch to DDL mode and paste the DDL statement to generate the table schema. Set the Display Name parameter in the General section.

DDL for bank_data:

CREATE TABLE IF NOT EXISTS bank_data
(
 age             BIGINT COMMENT 'Age',
 job             STRING COMMENT 'Job type',
 marital         STRING COMMENT 'Marital status',
 education       STRING COMMENT 'Education level',
 default         STRING COMMENT 'Credit card',
 housing         STRING COMMENT 'Mortgage',
 loan            STRING COMMENT 'Loan',
 contact         STRING COMMENT 'Contact information',
 month           STRING COMMENT 'Month',
 day_of_week     STRING COMMENT 'Day of the week',
 duration        STRING COMMENT 'Duration',
 campaign        BIGINT COMMENT 'Number of contacts during the campaign',
 pdays           DOUBLE COMMENT 'Interval from the last contact',
 previous        DOUBLE COMMENT 'Number of contacts with the customer',
 poutcome        STRING COMMENT 'Result of the previous marketing campaign',
 emp_var_rate    DOUBLE COMMENT 'Employment change rate',
 cons_price_idx  DOUBLE COMMENT 'Consumer price index',
 cons_conf_idx   DOUBLE COMMENT 'Consumer confidence index',
 euribor3m       DOUBLE COMMENT 'Euro deposit rate',
 nr_employed     DOUBLE COMMENT 'Number of employees',
 y               BIGINT COMMENT 'Time deposit available or not'
);

DDL for result_table:

CREATE TABLE IF NOT EXISTS result_table
(
 education STRING COMMENT 'Education level',
 num       BIGINT COMMENT 'Number of persons'
)
PARTITIONED BY
(
 day  STRING,
 hour STRING
);

In the top toolbar, click Commit to Development Environment, then click Commit to Production Environment.

Table creation and updates take effect in the compute engine only after they are committed to the target environment. For more information, see Table creation statements and Create tables.

Upload data

Upload the sample file banking.txt from your local machine to the bank_data table.

For detailed steps, see Upload a file from your on-premises machine to the bank_data table.
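
Before uploading, it can help to check that each row of the local file fits the bank_data schema. The following is a minimal sketch, not part of DataWorks: it assumes the file is comma-delimited with no header row, and the helper name `find_bad_rows` and the sample row are hypothetical. The column names and types are taken from the DDL above.

```python
import csv
import io

# Column names copied from the bank_data DDL, in declaration order.
BANK_DATA_COLUMNS = [
    "age", "job", "marital", "education", "default", "housing", "loan",
    "contact", "month", "day_of_week", "duration", "campaign", "pdays",
    "previous", "poutcome", "emp_var_rate", "cons_price_idx",
    "cons_conf_idx", "euribor3m", "nr_employed", "y",
]

# Columns declared BIGINT or DOUBLE in the DDL; all others are STRING.
BIGINT_COLUMNS = {"age", "campaign", "y"}
DOUBLE_COLUMNS = {"pdays", "previous", "emp_var_rate", "cons_price_idx",
                  "cons_conf_idx", "euribor3m", "nr_employed"}


def find_bad_rows(text, delimiter=","):
    """Return (line_number, reason) pairs for rows that do not fit the schema.

    Assumption: one record per line, comma-delimited, no header row.
    """
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_number, row in enumerate(reader, start=1):
        if len(row) != len(BANK_DATA_COLUMNS):
            problems.append(
                (line_number,
                 f"expected {len(BANK_DATA_COLUMNS)} fields, got {len(row)}"))
            continue
        for name, value in zip(BANK_DATA_COLUMNS, row):
            try:
                if name in BIGINT_COLUMNS:
                    int(value)
                elif name in DOUBLE_COLUMNS:
                    float(value)
            except ValueError:
                problems.append((line_number, f"{name}={value!r} is not numeric"))
    return problems


# Hypothetical sample row with 21 fields, for illustration only.
sample = ("44,blue-collar,married,basic.4y,unknown,yes,no,cellular,aug,thu,"
          "210,1,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,0\n")
print(find_bad_rows(sample))              # []
print(find_bad_rows("44,blue-collar\n"))  # [(1, 'expected 21 fields, got 2')]
```

To check the real file, pass its contents to the function, for example `find_bad_rows(open("banking.txt").read())`. Adjust the delimiter if your copy of the file uses a different one.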

Step 3: Create a node

Nodes are the building blocks of data development in DataWorks. Choose the node type based on the compute engine you want to use.

Create a node using one of these methods:

From the Scheduled Workflow pane: Right-click the compute engine, move the pointer over Create Node, and select the node type.
From the workflow canvas: Double-click the workflow name to open the canvas, then click or drag the required node type from the left-side section.

In the Create Node dialog box, set the Engine Instance and Name parameters.

This example creates an ODPS SQL node named result_table — the same name as the result table. Naming the node after its output table makes it easy to trace which node produced a given table.

Step 4: Configure the node

Open the node by double-clicking its name. Write the node code using the syntax of the target compute engine.

This example reads data from bank_data, aggregates it by education level, and writes the results to a partition in result_table. The partition is defined by the day and hour variables.

INSERT OVERWRITE TABLE result_table partition (day='${day}', hour='${hour}')
SELECT education
, COUNT(marital) AS num
FROM bank_data
GROUP BY education;
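
To make the aggregation concrete, the sketch below mimics the query above in plain Python on a few in-memory rows. It is an illustration only: the function name `count_by_education` and the sample rows are hypothetical, and NULL values are modeled as `None`. As in SQL, `COUNT(marital)` counts only rows where `marital` is not NULL.

```python
from collections import Counter


def count_by_education(rows):
    """Mimic: SELECT education, COUNT(marital) AS num ... GROUP BY education."""
    counts = Counter()
    for row in rows:
        # Adding 0 creates the group even if none of its rows end up counted.
        counts[row["education"]] += 0
        # COUNT(column) skips NULLs, modeled here as None.
        if row["marital"] is not None:
            counts[row["education"]] += 1
    return dict(counts)


# Hypothetical rows, for illustration only.
rows = [
    {"education": "university.degree", "marital": "married"},
    {"education": "university.degree", "marital": None},
    {"education": "high.school", "marital": "single"},
]
print(count_by_education(rows))
# {'university.degree': 1, 'high.school': 1}
```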

To use variables in scheduling scenarios, define them in the code as ${variable_name} and assign values in step 5.
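
The effect of assigning values to `${variable_name}` placeholders can be sketched with Python's `string.Template`, which happens to use the same `${name}` syntax. This only illustrates text substitution; it is not how DataWorks implements it, and the values `20220906` and `14` are example constants.

```python
from string import Template

# The node code from this step, with its two placeholders.
NODE_SQL = """\
INSERT OVERWRITE TABLE result_table partition (day='${day}', hour='${hour}')
SELECT education
, COUNT(marital) AS num
FROM bank_data
GROUP BY education;
"""


def render(sql, **values):
    """Replace each ${name} placeholder with the value assigned to it."""
    # substitute() raises KeyError if a placeholder has no assigned value,
    # which surfaces an unassigned variable instead of leaving it in the SQL.
    return Template(sql).substitute(**values)


rendered = render(NODE_SQL, day="20220906", hour="14")
print(rendered.splitlines()[0])
# INSERT OVERWRITE TABLE result_table partition (day='20220906', hour='14')
```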

For supported scheduling parameter formats, see Supported formats of scheduling parameters. For node code syntax, see Create and use nodes.

Step 5: Configure scheduling properties

Scheduling properties control when and how often DataWorks runs your node. Click the Properties tab in the right-side pane of the node configuration tab.

Important

Configure rerun settings and ancestor nodes before committing the node in step 7.

Section	What to configure
General	Automatically populated with node name, ID, type, and owner. Modify the owner if needed — only workspace members can be set as owner. Note: the node ID is automatically generated after the node is committed.
Scheduling parameter	Assign values to the variables defined in step 4. In this example, assign `${yyyymmdd}` to `day` and `$[hh24]` to `hour`. Each hourly run then writes its results to the `result_table` partition for that day and hour. Because `bank_data` is not partitioned, every run aggregates the full table.
Schedule	Set the scheduling cycle, start time, rerun settings, and timeout. In this example, the node runs every hour starting at `00:00`.
Resource group	Select the resource group for scheduling. By default, a serverless resource group is provided when you activate DataWorks. See Create and use a serverless resource group.
Dependencies	Configure the ancestor nodes that must complete before this node runs. If the node queries data generated by other nodes, configure the ancestor node using one of these methods: (a) If the ancestor node is outside the current workflow, enter the output name of the ancestor node in the Parent Nodes table. (b) If the ancestor node is inside the current workflow, configure the dependency by drawing lines on the workflow canvas. In this example, the `result_table` node reads from `bank_data`, which is not generated by another node in the workflow. Set the workspace root node as the ancestor node.
(Optional) Input and output parameters	Configure parameters passed between nodes. Required only when using assignment nodes.
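
The following sketch approximates how the two scheduling parameters in this example resolve for a single instance. It rests on two assumptions that you should confirm in Supported formats of scheduling parameters: `${yyyymmdd}` is based on the data timestamp (the day before the scheduled run), and `$[hh24]` is based on the scheduled time. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def resolve_parameters(scheduled_time):
    """Approximate the values assigned to day and hour for one instance."""
    # Assumption: ${yyyymmdd} uses the data timestamp, one day before the run.
    data_timestamp = scheduled_time - timedelta(days=1)
    return {
        "day": data_timestamp.strftime("%Y%m%d"),
        # Assumption: $[hh24] is the two-digit hour of the scheduled time.
        "hour": scheduled_time.strftime("%H"),
    }


def hourly_schedule(day_start):
    """Scheduled times for a node that runs every hour starting at 00:00."""
    return [day_start + timedelta(hours=h) for h in range(24)]


print(resolve_parameters(datetime(2022, 9, 7, 14, 0)))
# {'day': '20220906', 'hour': '14'}
print(len(hourly_schedule(datetime(2022, 9, 7))))
# 24
```

Under these assumptions, the 24 hourly instances of one day write to 24 partitions that share the same `day` value and differ only in `hour`.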

Step 6: Debug the node

Before committing, verify that your code runs correctly. The recommended path for this tutorial is Run with Parameters, which lets you assign test values to the variables defined in step 4.

In the top toolbar, click the Run with Parameters icon.
In the dialog box, assign constant values to the variables defined in step 4.
Review the output to confirm the results are correct.

In this example, the test run uses 2022.09.07 14:00 as the test timestamp.

If you need a different debugging approach:

Debug feature	Best for
Quick run	Running a selected code snippet quickly
Run	Full-code debugging with saved variable assignments (saved after first run)
Run with Parameters	Full-code debugging when you need to change variable values each time

Step 7: Save and commit the node

After debugging, save the node and commit it to the development environment.

Important

Before committing, confirm that you have configured rerun settings and ancestor nodes in step 5.

Click the Save icon in the top toolbar to save the node.
Click the Commit icon in the top toolbar to commit the node to the development environment.

Step 8: Perform smoke testing

Smoke testing validates that the scheduling parameters are configured correctly before you deploy to production. Run it in the development environment after committing the node.

Click the smoke testing icon in the top toolbar and specify the data timestamp for the test.
After the test completes, click the icon for viewing smoke testing records to view the results.

In this example, the result_table node runs hourly from 00:00 to 23:59. Smoke testing generates two instances with scheduling times of 00:00 and 01:00.

Auto triggered instances are snapshots generated for a node each time it is scheduled. For hourly nodes, specify both the start and end timestamps when running the smoke test. For more information, see Perform smoke testing.
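
The relationship between the start and end timestamps and the number of instances can be sketched as follows. This is an illustration under the assumption that both timestamps are inclusive, which matches the two instances in this example. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def hourly_instance_times(start, end):
    """List the hourly scheduling times from start to end, both inclusive."""
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += timedelta(hours=1)
    return times


instances = hourly_instance_times(datetime(2022, 9, 7, 0, 0),
                                  datetime(2022, 9, 7, 1, 0))
print([t.strftime("%H:%M") for t in instances])
# ['00:00', '01:00']
```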

Step 9: Deploy the node

DataWorks schedules only nodes that are deployed to the production environment.

Basic mode workspaces: The node is periodically scheduled as soon as it is committed.
Standard mode workspaces: Committed changes enter a pending state. Click Deploy to open the Create Deploy Task page and push changes to production.

Click Deploy, review the pending operations (additions, updates, and deletions), and confirm the deployment. For detailed steps, see Deploy nodes.

Deployment detail	Description
Deployment control	Developers can create deployment packages. Deploying them requires O&M permissions. Check the deployment status on the Deployment Packages page.
Instance generation timing	If you deploy between `23:30` and `24:00`, instances take effect on the third day. This applies to nodes with the instance generation mode set to Next Day or Immediately After Deployment. See Configure immediate instance generation for a task.
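
The timing rule in the table can be sketched as a small date calculation. This is a rough illustration with loudly stated assumptions: it models only the Next Day instance generation mode, it treats "third day" as counting the deployment day as day one, and it assumes that deployments outside the 23:30 to 24:00 window take effect the following day. The function name is hypothetical.

```python
from datetime import date, datetime, time, timedelta

# Start of the late deployment window described in the table.
LATE_WINDOW_START = time(23, 30)


def first_effective_date(deployed_at):
    """Estimate the first date on which instances take effect (Next Day mode)."""
    # From the table: deployments between 23:30 and 24:00 take effect on the
    # third day. Assumption: the deployment day counts as day one, so +2 days.
    # Assumption: any other deployment takes effect the following day.
    offset = 2 if deployed_at.time() >= LATE_WINDOW_START else 1
    return deployed_at.date() + timedelta(days=offset)


print(first_effective_date(datetime(2022, 9, 7, 10, 0)))   # 2022-09-08
print(first_effective_date(datetime(2022, 9, 7, 23, 45)))  # 2022-09-09
```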

For differences between basic mode and standard mode workspaces, see Differences between workspaces in basic mode and workspaces in standard mode.

What's next

Go to Operation Center and open the Auto Triggered Tasks page to view your deployed node and perform O&M operations. For more information, see Perform basic O&M operations on auto triggered nodes.