
DataWorks: Data development: Developers

Last Updated: Oct 11, 2023

This topic describes how developers can create an auto triggered node in DataStudio. The example in this topic uses a MaxCompute compute engine instance to run MaxCompute jobs in DataWorks, which helps you quickly understand the basic usage of the DataStudio module.

Prerequisites

The environment preparations are complete. For more information, see Prepare an environment.

Note
  • In this example, an ODPS SQL node is used. Before you create the node in a workspace, you must associate a MaxCompute compute engine instance with the workspace.

  • The account that you use to perform operations must be granted data development permissions. You can use an Alibaba Cloud account or a RAM user to perform operations. If you use a RAM user to perform operations, the RAM user must be assigned the Workspace Manager or Development role.

Background information

DataStudio provides a visualized development interface for nodes of various types of compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), and CDH. You can use the visualized development interface to perform intelligent code development, data cleansing and processing, and standardized node development and deployment. This helps ensure efficient and stable data development. For more information about how to use the DataStudio module, see Overview.

The procedure that is used to write raw business data to DataWorks and obtain a final result table consists of the following steps:

  1. Create multiple tables in DataWorks. Example:

    • Source table: stores data that is synchronized from other data sources.

    • Result table: stores data that is cleansed and processed in DataWorks.

  2. Create a data synchronization node to synchronize business data to the preceding source table.

  3. Create a compute node to cleanse the data in the source table, process the data at each layer, and then write the results of each layer to the result table.

You can also upload data from your on-premises machine to the source table. Then, you can use a compute node to cleanse and process the data, and store the processed data in the result table. In this example, data is uploaded from an on-premises machine to a source table and a compute node is used to cleanse and process the data.

Go to the DataStudio page

  1. Log on to the DataWorks console.

  2. In the left-side navigation pane, click Workspaces.

  3. In the top navigation bar, select the region in which the desired workspace resides. On the Workspaces page, find the workspace and click DataStudio in the Actions column.

Procedure

  1. Step 1: Create a workflow

    Code is developed based on a workflow in DataStudio. Before you perform development operations, you must create a workflow.

  2. Step 2: Create tables

    DataWorks allows you to create tables in a visualized manner and displays tables in a directory structure. Before data development, you must create a table in the MaxCompute compute engine instance to store the data processing results.

  3. Step 3: Create a node

    Data development in DataWorks is based on nodes, and tasks of different types of compute engines are encapsulated into different types of nodes in DataWorks. You can select a suitable node type for node development based on your business requirements.

  4. Step 4: Configure the node

    You can write code for the node on the node configuration tab based on the syntax that is supported by the related database.

  5. Step 5: Configure scheduling properties for the node

    You can configure scheduling properties for the node so that the node is scheduled and run at regular intervals.

  6. Step 6: Debug the code of the node

    You can use the quick run feature for code snippets, the Run feature, or the Run with Parameters feature to debug the code of the node and check its logic.

  7. Step 7: Save and commit the node

    After the node is debugged, you must save and commit the node.

  8. Step 8: Perform smoke testing

    To ensure efficient running of a node in the production environment and prevent the waste of computing resources, you can commit the node to the development environment and perform smoke testing in the development environment before you deploy the node. This helps ensure the correctness of the code of the node.

  9. Step 9: Deploy the node

    DataWorks can schedule only nodes that are deployed to the production environment. After the node passes the smoke testing, you must deploy the node to the production environment to enable DataWorks to schedule the node at regular intervals.

Step 1: Create a workflow

DataWorks organizes data development processes by using workflows. DataWorks provides dashboards for different types of nodes in each workflow, on which you can use tools to optimize and manage the nodes. This facilitates data development and management. You can place nodes of the same type in one workflow based on your business requirements.

  1. Go to the DataStudio page.

  2. Create a workflow.

    You can use one of the following methods to create a workflow:

    • Method 1: Move the pointer over the Create icon and click Create Workflow.

    • Method 2: Right-click Business Flow in the Scheduled Workflow pane and select Create Workflow.

  3. Configure the Workflow Name parameter and the Description parameter for the workflow, and click Create.

    In this example, the Workflow Name parameter is set to Create the first auto triggered node. You can configure the Workflow Name parameter based on your business requirements in actual data development scenarios.

    Note

    For more information about how to use workflows, see Create and manage workflows.

Step 2: Create tables

A data development node of DataWorks cleanses and processes source data. Before data development, you must create a table in the required compute engine instance to store the data cleansing results and define the table schema.

  1. Create tables.

    1. Click Business Flow in the Scheduled Workflow pane. Find the workflow that is created in Step 1, click the workflow name, right-click MaxCompute, and then select Create Table.

    2. Configure the Engine Type, Path, and Name parameters.

    In this example, the following tables are created:

    • bank_data: used to store raw business data.

    • result_table: used to store the data cleansing results.

    Note
    • For information about table creation statements, see Table creation statements.

    • For information about how to create tables in different compute engine instances in a visualized manner, such as creating a MaxCompute table or an EMR table, see Create tables.

  2. Generate table schemas.

    Go to the editing pages of the tables, switch to the DDL mode, and use DDL statements to generate schemas for the tables. After the table schemas are generated, configure the Display Name parameter in the General section, click Commit to Development Environment, and then click Commit to Production Environment in the top toolbar. For information about how to view the compute engines that are associated with workspaces in different environments, see Manage MaxCompute projects.

    Note
    • Operations such as table creation and table update can take effect in the related compute engine instances only after they are committed to the required environment.

    • You can also follow the on-screen instructions that are displayed in the DataWorks console to configure the table schemas in a visualized manner based on your business requirements. For more information about how to create a table in a visualized manner, see Create and manage MaxCompute tables.

    In this example, the following statement is used to generate the schema of the bank_data table:

    CREATE TABLE IF NOT EXISTS bank_data
    (
     age             BIGINT COMMENT 'Age',
     job             STRING COMMENT 'Job type',
     marital         STRING COMMENT 'Marital status',
     education       STRING COMMENT 'Education level',
     default         STRING COMMENT 'Credit card',
     housing         STRING COMMENT 'Mortgage',
     loan            STRING COMMENT 'Loan',
     contact         STRING COMMENT 'Contact information',
     month           STRING COMMENT 'Month',
     day_of_week     STRING COMMENT 'Day of the week',
     duration        STRING COMMENT 'Duration',
     campaign        BIGINT COMMENT 'Number of contacts during the campaign',
     pdays           DOUBLE COMMENT 'Interval from the last contact',
     previous        DOUBLE COMMENT 'Number of contacts with the customer',
     poutcome        STRING COMMENT 'Result of the previous marketing campaign',
     emp_var_rate    DOUBLE COMMENT 'Employment change rate',
     cons_price_idx  DOUBLE COMMENT 'Consumer price index',
     cons_conf_idx   DOUBLE COMMENT 'Consumer confidence index',
     euribor3m       DOUBLE COMMENT 'Euro deposit rate',
     nr_employed     DOUBLE COMMENT 'Number of employees',
     y               BIGINT COMMENT 'Time deposit available or not'
    );

    In this example, the following statement is used to generate the schema of the result_table table:

    CREATE TABLE IF NOT EXISTS result_table
    (
     education       STRING COMMENT 'Education level',
     num             BIGINT COMMENT 'Number of persons'
    )
    PARTITIONED BY
    (
     day             STRING,
     hour            STRING
    );
  3. Upload data.

    Upload raw business data to a table in DataWorks. In this example, a file named banking.txt is uploaded from an on-premises machine to the bank_data table. For more information about how to upload data, see Upload a file from your on-premises machine to the bank_data table.
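
    After the upload is complete, you can run a few ad-hoc queries to confirm that the uploaded data is in the bank_data table. The following queries are a minimal sketch; run them on an ad-hoc query tab or in a temporary node and adjust them to your own data.

    -- Confirm that the committed schema matches the DDL statement.
    DESC bank_data;

    -- Confirm that the uploaded rows are present and spot-check a few records.
    SELECT COUNT(*) FROM bank_data;
    SELECT * FROM bank_data LIMIT 10;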

Step 3: Create a node

Select a suitable node type for node development based on your business requirements.

Note

Nodes in DataWorks can be classified into data synchronization nodes and compute nodes. In most data development scenarios, you need to use a batch synchronization node to synchronize data from a business database to a data warehouse, and then use a compute node to cleanse and process the data in the data warehouse.

  1. Create a node.

    You can use one of the following methods to create a node:

    • Method 1: Create a node in the Scheduled Workflow pane

      1. In the Scheduled Workflow pane of the DataStudio page, click Business Flow, find the workflow that you created, and then click the name of the workflow.

      2. Right-click the compute engine that you want to use, move the pointer over Create Node, and then select a suitable node type.

    • Method 2: Create a node on the configuration tab of the workflow

      1. In the Scheduled Workflow pane of the DataStudio page, click Business Flow, find the workflow that you created, and then click the name of the workflow.

      2. Double-click the name of the workflow to go to the configuration tab of the workflow.

      3. In the left-side section of the configuration tab, click the required node type or drag the required node type to the canvas on the right side.

  2. Configure the Engine Instance, Node Type, Path, and Name parameters for the node.

    In this example, an ODPS SQL node named result_table is created. The name of the node is the same as the name of the result table that is created in Step 2.

    Note

    When you use DataWorks for data development, you need to use a compute node to cleanse the data and then store the cleansing results in a result table. We recommend that you use the name of the result table as the name of the node to quickly locate the table data that is generated by the node.


Step 4: Configure the node

Find the node that you created in Step 3, and double-click the name of the node to go to the node configuration tab. On the node configuration tab, write the code of the node based on the syntax that is supported by the related database.

In this example, the result_table node is used to write data from the bank_data table to the specified partition of the result_table table. The partition to which the data is written is defined by the day and hour variables.

Note
  • If you want to use variables to dynamically replace parameters in scheduling scenarios during code development, you can define the variables in the code in the ${Custom variable name} format and assign values to the variables when you configure scheduling properties for the node in Step 5.

  • For more information about scheduling parameters, see Supported formats of scheduling parameters.

  • For more information about the code syntax for each type of node, see Create and use nodes.

Sample code:

INSERT OVERWRITE TABLE result_table PARTITION (day='${day}', hour='${hour}')
SELECT education,
       COUNT(marital) AS num
FROM bank_data
GROUP BY education;

Step 5: Configure scheduling properties for the node

You can configure scheduling properties for a node to enable periodic scheduling for the node. In the right-side navigation pane of the node configuration tab, click the Properties tab. You can configure the following sections of the tab based on your business requirements.

General

In this section, the node name, node ID, node type, and owner of the node are automatically displayed. You do not need to configure additional settings.

Note
  • By default, the owner of the node is the current user. You can modify the owner of the node based on your business requirements. You can select only a member in the current workspace as the owner of the node.

  • An ID is automatically generated after the node is committed.

Parameters

In this section, you can configure the scheduling parameters that are used to define how the node is scheduled.

DataWorks provides scheduling parameters that can be classified into custom parameters and built-in variables based on their value assignment methods. Scheduling parameters support dynamic parameter settings for node scheduling. If a variable is defined in the code of the node in Step 4, you can assign a value to the variable in the Parameters section.

In this example, the following values are assigned to the variables that are defined in Step 4. This way, the data that is generated in each of the 24 hours of the previous day in the bank_data table is written to the related hourly partition of the result_table table.

  • Assign ${yyyymmdd} to day as the value.

  • Assign $[hh24] to hour as the value.

For an example of how these values resolve at run time, see the sample statement at the end of this step.

Schedule

In this section, you can configure time properties for the node, such as the instance generation mode, the scheduling cycle, the point in time when you want to schedule the node to start, the rerun settings, and the timeout period.

Note
  • You can commit the node only after you configure the rerun settings.

  • The scheduling time that you specify for a node takes effect only on the node itself. The point in time at which the node starts to run also depends on the ancestor node of the node. Even if the scheduling time of the node is earlier than the scheduling time of the ancestor node, the node can start to run only after the ancestor node is successfully run.

In this example, the result_table node is scheduled to run at an interval of 1 hour from 00:00. Each hour, the data that was generated in the corresponding hour of the previous day in the bank_data table is written to the related hourly partition of the result_table table.

Resource Group

In this section, you can select the resource group for scheduling that you want to use to deploy the node to the production environment. When you activate DataWorks, the service provides the shared resource group for scheduling. In this example, the shared resource group for scheduling is used.

Note

If a large number of nodes need to run in parallel, exclusive computing resources are required to ensure that the nodes can be run as scheduled. In this case, we recommend that you use an exclusive resource group for scheduling. For more information about exclusive resource groups for scheduling, see Exclusive resource groups for scheduling.

Dependencies

In this section, you can configure scheduling dependencies for the node. We recommend that you configure scheduling dependencies for the node based on the lineage of the node. If the ancestor node of the current node is successfully run, the table data that the current node needs to use is generated. This way, the current node can obtain the table data.

Note
  • If a SELECT statement is specified in the code of the current node to query the table data that is not generated by an auto triggered node, you can disable the automatic parsing feature and use the root node of the workspace to schedule the current node.

  • If a SELECT statement is specified in the code of the current node to query the table data that is generated by other nodes, you can use the following methods to configure the ancestor node that is used by the current node as a dependency. The ancestor node generates the table data that is queried by the current node, and is used to schedule the current node.

    • If the ancestor node does not belong to the current workflow or workspace, specify the value of the output name parameter of the ancestor node in the Parent Nodes table.

    • If the ancestor node belongs to the current workflow, configure scheduling dependencies for the current node by drawing lines in the canvas of the workflow.

In this example, the data in the bank_data table that the result_table node queries is not generated by a node in the current workflow. Therefore, configure the root node of the workspace as the ancestor node of the result_table node, and use the root node to schedule the result_table node.

(Optional) Parameters

In this section, you can configure input parameters and output parameters for the node. The configurations in this section are optional. A node can obtain the values of parameters that are configured for its ancestor nodes by using the input parameters that are configured in this section.

Note

In most cases, this process requires assignment nodes or scheduling parameters.
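
To see how the values that are assigned in the Parameters section take effect, the following sketch shows what the statement from Step 4 might look like after the scheduling parameters are replaced at run time. It assumes an instance whose scheduled time is 01:00 on February 15, 2023, and the common behavior in which ${yyyymmdd} resolves to the data timestamp (the day before the scheduled run) and $[hh24] resolves to the hour of the scheduled run time. Verify the resolved values in your own environment.

-- Resolved form of the Step 4 statement for an instance scheduled at 2023-02-15 01:00 (assumed example date).
INSERT OVERWRITE TABLE result_table PARTITION (day='20230214', hour='01')
SELECT education,
       COUNT(marital) AS num
FROM bank_data
GROUP BY education;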

Step 6: Debug the code of the node

You can use one of the following features to debug the code logic to ensure that the code you write is correct.

• Quick run (used to debug a code snippet)

  Description: You can quickly run the code snippet that you select on the configuration tab of the node.

  Suggestion: Use this feature to quickly run a code snippet of a node.

• Run (in the top toolbar)

  Description: You can assign constants to the variables that are defined in the code in specific test scenarios.

  Note: The first time you click the Run icon to run a new node, you must manually assign constants to the variables that are defined in the code of the node in the dialog box that appears. The assignments are recorded by the system, and you do not need to repeat the operation for subsequent runs of the node.

  Suggestion: Use this feature to frequently debug the full code of a node.

• Run with Parameters (in the top toolbar)

  Description: You must assign constants to the variables that are defined in the code in specific test scenarios each time you click this icon.

  Suggestion: Use this feature to modify the values that are assigned to the variables in the code.

For example, you can run the node at 2022.09.07 14:00 in a Run with Parameters test and check the results that are returned.
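
After a debug run completes, you can query the output partition directly to check the result. The following query is a minimal sketch; it assumes that you assigned day=20220906 and hour=14 as constants for the test run mentioned above, so replace the values with the constants that you actually assigned in the dialog box.

-- Check the rows that the debug run wrote to the assumed partition (day=20220906, hour=14).
SELECT education, num
FROM result_table
WHERE day = '20220906' AND hour = '14';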

Step 7: Save and commit the node

After node configuration and testing are complete, save the node configuration, and then commit the node to the development environment.

Note

You can commit the node to the development environment only after you configure rerun settings and ancestor nodes for the node in Step 5.

  1. Click the Save icon in the top toolbar to save the node.

  2. Click the Submit icon in the top toolbar to commit the node to the development environment.

Step 8: Perform smoke testing

To ensure that the node that you developed runs efficiently and does not waste computing resources, we recommend that you perform smoke testing on the node before you deploy it. Smoke testing is performed in the development environment, and you must commit the node to the development environment before you perform smoke testing on the node.

  1. Click the Smoke Testing icon in the top toolbar. In the smoke testing dialog box, specify the data timestamp of the node.

  2. After the smoke testing is complete, click the View smoke testing records icon in the top toolbar to view the test results.

In this example, smoke testing is performed to check whether the configured scheduling parameters meet your requirements. The result_table node is scheduled to run at an interval of 1 hour from 00:00 to 23:59. When the smoke testing is performed on the node, two instances are generated. The scheduling times of the instances are 00:00 and 01:00.

Note
  • Auto triggered node instances are snapshots that are generated for an auto triggered node when the node is scheduled to run based on the specified scheduling cycle.

  • The result_table node is scheduled by hour. You must specify the data timestamp of the node for the smoke testing. You must also select the start time and end time of the test.

  • For more information about how to perform smoke testing in the development environment, see Perform smoke testing.

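You can also confirm the output of the smoke-testing instances from an ad-hoc query tab. The following queries are a minimal sketch; the day value 20220906 is only an assumed data timestamp, so replace it with the data timestamp that you selected for the test.

-- List the partitions that now exist in the result table.
SHOW PARTITIONS result_table;

-- Count the rows that were written by the two smoke-testing instances (scheduled at 00:00 and 01:00).
SELECT day, hour, COUNT(*) AS row_count
FROM result_table
WHERE day = '20220906' AND hour IN ('00', '01')
GROUP BY day, hour;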

Step 9: Deploy the node

If the workspace is in basic mode, the node can be scheduled at regular intervals after the node is committed. If the workspace is in standard mode, the node is in the pending state after the node is committed. You must perform the operations that are described in this step to deploy the node. The node can be scheduled at regular intervals only after the node is deployed.

Note
  • DataWorks can automatically schedule only the nodes that are deployed to the production environment. After smoke testing is complete, commit and deploy the node to the production environment to enable DataWorks to schedule the node at regular intervals.

  • For more information about workspaces in basic mode and workspaces in standard mode, see Differences between workspaces in basic mode and workspaces in standard mode.

In a workspace in standard mode, the operations that are committed on the DataStudio page, including addition, update, and deletion of data development nodes, resources, and functions, are in the pending state on the Create Deploy Task page. You can click Deploy to go to the Create Deploy Task page, and deploy the related operations to the production environment. The operations take effect only after they are deployed to the production environment. For more information, see Deploy nodes.

The following items are related to the deployment procedure:

Deployment control

Whether the deployment operation succeeds depends on the permissions of the role of the user who performs the operation and on the deployment procedure that is used.

Note
  • After you deploy a node, you can view the deployment record and status of the node on the Deploy Tasks page.

  • Developers can only create deployment tasks. If you want to deploy a node, you must have O&M permissions.

Instance generation mode

If you create or update a node and deploy the node in the time range of 23:30 to 24:00, instances that are generated for the node take effect on the third day.

Note

This limit takes effect on nodes for which the Instance Generation Mode parameter is set to Next Day or Immediately After Deployment. For more information about the instance generation mode, see Configure immediate instance generation for a node.

What to do next

You can go to Operation Center, view the auto triggered node that is deployed to the production environment on the Cycle Task page, and perform the related O&M operations on the node. For more information, see Perform basic O&M operations on auto triggered nodes.