DataWorks Data Development Case

This tutorial demonstrates how to analyze home buyer groups to help you master DataWorks Data Development and Data Analysis.

Case introduction

This tutorial analyzes purchasing behavior based on home buyer data. You will use DataWorks to upload local data to a MaxCompute bank_data table, analyze user groups using a MaxCompute SQL node to generate a result_table, and visualize the results to create group profiles.

Note

This tutorial uses simulated data. In actual scenarios, replace this with your own business data.

The following flowchart illustrates the data flow and development process.

The analysis yields the following profile: Single home buyers with loans primarily hold university.degree or high.school diplomas.

Prerequisites

Activate DataWorks

This tutorial uses the Singapore region. Log in to the DataWorks Console, switch to the Singapore region, and check if DataWorks is activated in that region.

Note

Singapore is used here only as an example. Choose your own region as follows:

If your business data resides in other Alibaba Cloud services, select the same region.
If your business is on-premises and requires access via the public network, select a region geographically closer to you to reduce access latency.

New user

New users will see the following prompt. Click Purchase Product Portfolio for Free.

Configure the parameters on the combination purchase page.

Parameter	Description	Example
Region	Select the target region.	Singapore
DataWorks Edition	Select the DataWorks edition to purchase. Note: This tutorial uses Basic Edition as an example, but all editions support the features used here. To choose an edition that fits your business needs, see Features of DataWorks editions.	Basic Edition

Click Confirm Order and Pay to complete the payment.

Activated but expired

If you previously activated DataWorks in the Singapore region but the edition has expired, the following prompt appears. Click Purchase Edition.

Configure the parameters on the purchase page.

Parameter	Description	Example
Edition	Select the DataWorks edition to purchase. Note: This tutorial uses Basic Edition as an example, but all editions support the features used here. To choose an edition that fits your business needs, see Features of DataWorks editions.	Basic Edition
Region and Zone	Select the region where you want to activate DataWorks.	Singapore

Click Buy Now to complete the payment.

Important

If the edition does not appear after you purchase it, try the following:

Wait a few minutes and refresh the page. System updates may be delayed.
Check that the current region matches the region in which you purchased the edition.

Activated

If you have already activated DataWorks in the Singapore region, the DataWorks overview page opens and you can proceed directly to the next step.

Create a workspace

On the DataWorks Workspace List page, select Singapore and click Create Workspace.
On the Create Workspace page, enter a custom Workspace Name, enable Use Data Studio (New Version), and click Create Workspace.
Note
After February 18, 2025, new workspaces created by primary accounts in Singapore enable the new DataStudio by default. The Use Data Studio (New Version) option will not appear.

Create and associate resources

Go to the DataWorks Resource Group List page, switch to the Singapore region, and click Create Resource Group.

On the resource group purchase page, configure the following parameters.

Parameter	Description
Resource Group Name	Custom.
VPC, vSwitch	Select an existing VPC and vSwitch. If there are none in the current region, click the console link in the parameter description to create them.
Service-linked Role	Follow the on-screen instructions to create a service-linked role.

Click Buy Now to complete the payment.
Go to the DataWorks Resource Groups page, switch to the Singapore region, find the created resource group, and click Associate Workspace in the Actions column.
On the Associate Workspace page, find the created DataWorks workspace and click Associate in its Actions column.

Associate MaxCompute resources

Create a MaxCompute project and associate it with DataWorks for data ingestion and analysis.

Go to the DataWorks Workspace List page, switch to the Singapore region, find the created workspace, and click the workspace name to enter the Workspace Details page.

In the left navigation pane, click Computing Resource to enter the computing resources page. Click Associate Computing Resource and select the MaxCompute type. Configure the following key parameters to create a MaxCompute project and associate it as a DataWorks computing resource.

Note

Keep the default values for parameters not mentioned in the table.

Parameter	Description
MaxCompute Project	Click Create in the drop-down list and configure the following parameters. Project Name: Enter a custom name. The name must be globally unique. Billing Method: Select Pay-as-you-go. Note: If pay-as-you-go cannot be selected, click Activate to activate the MaxCompute service. Default Quota: Select an existing default quota from the drop-down list.
Default Access Identity	Select Alibaba Cloud Account.
Computing Resource Instance Name	The name that identifies this computing resource when tasks run. This tutorial uses `MaxCompute_Source`.

Click OK.

Procedure

In this tutorial, you will use DataWorks to upload test data to a MaxCompute project. Then, you will create a DataStudio workflow to clean and write data, debug the workflow, and verify the results using SQL.

Step 1: Create a table

First, use Data Catalog in DataWorks to create a bank_data table in MaxCompute.

Log on to the DataWorks Console. Switch to the target region, click Data Development and Operations > Data Development in the left navigation pane, select the corresponding workspace from the drop-down list, and then click Go to Data Studio.
Click the icon in the left navigation pane to go to the Data Catalog page.
(Optional) If your MaxCompute project is missing from Data Catalog, click the icon, go to DataWorks Data Sources, and add the project.
Click to expand the MaxCompute directory, select the target MaxCompute project, and create a MaxCompute table in the Table folder.
Note
- If the schema feature is enabled for your MaxCompute project, you must select the target schema after selecting the project to create the MaxCompute table in the Table folder.
- This example uses a standard mode workspace. Create the bank_data table in the development environment only. If you are using a simple mode workspace, you only need to create the bank_data table in the MaxCompute project corresponding to the production environment.

Click the icon to open the table editing page.

Enter the following SQL statement in the DDL section. The system will automatically generate the table information.

CREATE TABLE IF NOT EXISTS bank_data (
    age             BIGINT   COMMENT 'Age',
    job             STRING   COMMENT 'Job type',
    marital         STRING   COMMENT 'Marital status',
    education       STRING   COMMENT 'Education level',
    `default`       STRING   COMMENT 'Has credit card',
    housing         STRING   COMMENT 'Housing loan',
    loan            STRING   COMMENT 'Loan',
    contact         STRING   COMMENT 'Contact method',
    month           STRING   COMMENT 'Month',
    day_of_week     STRING   COMMENT 'Day of week',
    duration        STRING   COMMENT 'Duration',
    campaign        BIGINT   COMMENT 'Number of contacts in this campaign',
    pdays           DOUBLE   COMMENT 'Interval from last contact',
    previous        DOUBLE   COMMENT 'Number of previous contacts',
    poutcome        STRING   COMMENT 'Outcome of previous marketing campaign',
    emp_var_rate    DOUBLE   COMMENT 'Employment variation rate',
    cons_price_idx  DOUBLE   COMMENT 'Consumer price index',
    cons_conf_idx   DOUBLE   COMMENT 'Consumer confidence index',
    euribor3m       DOUBLE   COMMENT 'Euribor 3 month rate',
    nr_employed     DOUBLE   COMMENT 'Number of employees',
    y               BIGINT   COMMENT 'Has term deposit'
);

On the editing page, click Deploy to create the bank_data table in the MaxCompute project corresponding to the development environment.
After the bank_data table is created, you can click the table name in the Data Catalog to view the table details.
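
Before uploading, it can help to confirm locally that a file fits this schema. The following is a minimal Python sketch, not part of DataWorks. It assumes a comma-delimited file with no header row whose fields appear in the same order as the 21 columns in the DDL. The function name `find_bad_rows` and the sample rows are hypothetical.

```python
import csv
import io

# Column names in the order declared by the bank_data DDL.
BANK_DATA_COLUMNS = [
    "age", "job", "marital", "education", "default", "housing", "loan",
    "contact", "month", "day_of_week", "duration", "campaign", "pdays",
    "previous", "poutcome", "emp_var_rate", "cons_price_idx",
    "cons_conf_idx", "euribor3m", "nr_employed", "y",
]
# Columns declared as BIGINT in the DDL; their values must parse as integers.
BIGINT_COLUMNS = {"age", "campaign", "y"}


def find_bad_rows(lines):
    """Return (row_number, reason) pairs for rows that do not fit the schema."""
    expected = len(BANK_DATA_COLUMNS)
    problems = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected:
            problems.append(
                (row_number, f"expected {expected} fields, got {len(row)}")
            )
            continue
        for name, value in zip(BANK_DATA_COLUMNS, row):
            if name in BIGINT_COLUMNS:
                try:
                    int(value)
                except ValueError:
                    problems.append((row_number, f"{name} is not an integer"))
    return problems


# Made-up rows for illustration only; they are not taken from banking.csv.
sample = io.StringIO(
    "44,blue-collar,married,basic.4y,unknown,yes,no,cellular,aug,thu,"
    "210,1,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,0\n"
    "abc,technician,single,university.degree,no,yes,no,cellular,nov,fri,"
    "138,1,6,1,success,-0.1,93.2,-42.0,4.021,5195.8,1\n"
    "53,technician,married\n"
)
print(find_bad_rows(sample))
# [(2, 'age is not an integer'), (3, 'expected 21 fields, got 3')]
```

If your file has a header row, skip the first line before calling the function.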

Step 2: Upload data

Download the banking.csv file. Use the DataWorks upload feature to upload it to the bank_data table.

Important

Ensure that a Scheduling Resource Group and a Data Integration Resource Group are configured before uploading. For details, see Data upload Limitations.

Click the icon and choose All Products > Data Integration > Upload and Download to go to the Upload & Download page.

Click Upload Data and configure the following settings:

Parameter		Description
Data Source		Local file.
Specify Data to Be Uploaded	Data Source Type	Upload the local `banking.csv` file.
Configure Destination Table	Target Engine	MaxCompute
	MaxCompute Project Name	Select the project containing the `bank_data` table.
	Select Destination Table	Select the `bank_data` table as the target table.
Preview Data of Uploaded File		Click Mapping by Order to map data to table fields.

Note

Local files support .csv, .xls, .xlsx, and .json formats.
For spreadsheet files, the first sheet is uploaded by default.
The maximum size for .csv files is 5 GB. For other file types, the limit is 100 MB.
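
The format and size limits in this note can be checked locally before you open the upload page. The following Python sketch is illustrative only. The function name `check_upload_candidate` is hypothetical, and the byte values assume that 1 GB means 1024³ bytes and 1 MB means 1024² bytes, which the note does not specify.

```python
import os
import tempfile

# Limits quoted in the note above: 5 GB for .csv, 100 MB for other formats.
# Assumption: binary units (1 GB = 1024**3 bytes, 1 MB = 1024**2 bytes).
MAX_CSV_BYTES = 5 * 1024**3
MAX_OTHER_BYTES = 100 * 1024**2
SUPPORTED_EXTENSIONS = {".csv", ".xls", ".xlsx", ".json"}


def check_upload_candidate(path):
    """Return reasons the file would likely be rejected (empty list if none)."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return [f"unsupported format: {extension or '(none)'}"]
    limit = MAX_CSV_BYTES if extension == ".csv" else MAX_OTHER_BYTES
    if os.path.getsize(path) > limit:
        return [f"file exceeds {limit} bytes"]
    return []


# Try the check on two small temporary files.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "banking.csv")
    with open(csv_path, "w") as handle:
        handle.write("44,blue-collar,married\n")
    txt_path = os.path.join(folder, "notes.txt")
    with open(txt_path, "w") as handle:
        handle.write("not uploadable\n")
    print(check_upload_candidate(csv_path))  # []
    print(check_upload_candidate(txt_path))  # ['unsupported format: .txt']
```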

Click Upload Data to upload the data from the downloaded CSV file to the bank_data table in the MaxCompute computing resource.
Verify the upload.
Verify the data in the bank_data table via SQL Query (Legacy).
1. Click the icon in the upper-left corner, and click All Products > Data Analysis > SQL Query in the pop-up page.
2. Click > Create File next to My Files, customize the File Name, and click OK.
3. On the SQL Query page, enter the following SQL statement.
```
SELECT * FROM bank_data LIMIT 10;
```
4. Select the workspace and MaxCompute data source where the bank_data table resides in the upper-right corner, and then click OK.
  Note
  This example uses a standard mode workspace and the bank_data table is created only in the development environment. Therefore, you must select the MaxCompute data source for the development environment. If you are using a simple mode workspace, you can select the MaxCompute data source for the production environment.
5. Click Run (confirm cost estimation if prompted). The bottom pane displays the first 10 records, confirming the upload.

Step 3: Process data

Use a MaxCompute SQL node to filter the bank_data table for the education levels of single home buyers with loans, and then write the results to the result_table.

Build the data processing pipeline

Click the icon in the upper-left corner and choose All Products > Data Development and O&M > Data Development.
At the top of the page, switch to the workspace created in this tutorial. Click the icon in the left navigation pane to go to Data Studio.
In Workspace Directories, click > Create Workflow. Name it dw_basic_case and click OK.

Drag a Zero Load Node and two MaxCompute SQL nodes onto the canvas, and rename them as listed in the following table:

Type	Name	Function
Zero Load	`workshop_start`	Manages the workflow structure. This is a no-op task requiring no code.
MaxCompute SQL	`ddl_result_table`	Creates the result_table to store the cleaned data from bank_data.
MaxCompute SQL	`insert_result_table`	Filters the bank_data and writes the results to the result_table.

Connect the nodes so that workshop_start runs first, followed by ddl_result_table and then insert_result_table.
Note
Workflows support configuring upstream/downstream dependencies via manual connection or by automatically identifying dependencies through code parsing. This tutorial uses the manual connection method. For more information, see Automatic dependency parsing.
Click Save in the node toolbar.
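
Upstream and downstream dependencies determine the order in which nodes run. The following Python sketch illustrates the idea with the standard-library `graphlib` module (Python 3.9 or later). It is not how DataWorks schedules nodes internally, and it assumes a linear chain in which the table is created before data is written to it.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of upstream nodes it depends on.
# Assumption: workshop_start -> ddl_result_table -> insert_result_table.
dependencies = {
    "workshop_start": set(),
    "ddl_result_table": {"workshop_start"},
    "insert_result_table": {"ddl_result_table"},
}

# A topological order lists every node after all of its upstream nodes.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
# ['workshop_start', 'ddl_result_table', 'insert_result_table']
```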

Configure data processing nodes

Configure ddl_result_table node

This node creates result_table to store the analysis results.

Open the ddl_result_table node.

Paste the following code into the node editing page.

CREATE TABLE IF NOT EXISTS result_table(
  education STRING COMMENT 'Education level',
  num       BIGINT COMMENT 'Count'
);

Configure debug parameters.
Click Running Configurations on the right side of the MaxCompute SQL node editing page:
- Set Computing Resource to the MaxCompute computing resource that you associated in the Prerequisites section.
- Set Resource Group to the Serverless resource group that you purchased in the Prerequisites section.
Click Save in the node toolbar.

Configure insert_result_table node

This node queries bank_data and populates result_table.

On the workflow editing page, hover over the insert_result_table node and click Open Node.

Paste the following code into the node editing page.

INSERT OVERWRITE TABLE result_table --Insert data into result_table.
SELECT
  education,
  COUNT(marital) AS num
FROM bank_data
WHERE 
  housing = 'yes'
  AND marital = 'single'
GROUP BY
  education;

Configure debug parameters.
Click Running Configurations on the right side of the MaxCompute SQL node editing page:
- Set Computing Resource to the MaxCompute computing resource that you associated in the Prerequisites section.
- Set Resource Group to the Serverless resource group that you purchased in the Prerequisites section.
Click Save in the node toolbar.
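
To see what the insert_result_table query computes without running it in MaxCompute, you can reproduce the logic locally. The following Python sketch uses the standard-library `sqlite3` module as a stand-in. SQLite has no `INSERT OVERWRITE`, so the sketch emulates it by clearing the target table before inserting. The rows are made up, and only the three columns that the query reads are modeled.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bank_data (marital TEXT, education TEXT, housing TEXT)"
)
connection.execute("CREATE TABLE result_table (education TEXT, num INTEGER)")

# Made-up rows for illustration; only the columns used by the query are kept.
rows = [
    ("single", "university.degree", "yes"),
    ("single", "university.degree", "yes"),
    ("single", "high.school", "yes"),
    ("single", "high.school", "no"),          # excluded: no housing loan
    ("married", "university.degree", "yes"),  # excluded: not single
]
connection.executemany("INSERT INTO bank_data VALUES (?, ?, ?)", rows)


def refresh_result_table(conn):
    """Emulate INSERT OVERWRITE: clear the target, then insert the aggregate."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("DELETE FROM result_table")
        conn.execute(
            """
            INSERT INTO result_table
            SELECT education, COUNT(marital) AS num
            FROM bank_data
            WHERE housing = 'yes' AND marital = 'single'
            GROUP BY education
            """
        )


refresh_result_table(connection)
refresh_result_table(connection)  # rerunning does not duplicate rows
print(connection.execute(
    "SELECT education, num FROM result_table ORDER BY education"
).fetchall())
# [('high.school', 1), ('university.degree', 2)]
```

Because the target is overwritten rather than appended to, rerunning the node during debugging leaves one row per education level.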

Step 4: Debug and run

Click the icon to execute the workflow. Check the logs if any failures occur.

Step 5: Data query and display

Data processing is complete. Query the result_table and analyze the data in SQL Query (Legacy).

Click the icon in the upper-left corner, and click All Products > Data Analysis > SQL Query in the pop-up page.
Click > Create File next to My Files, customize the File Name, and click OK.
On the SQL Query page, enter the following SQL statement.
```
SELECT * FROM result_table;
```
In the upper-right corner, select the workspace and MaxCompute data source where result_table resides, and then click OK.
Note
This example uses a standard mode workspace. result_table exists only in the development environment, so select the corresponding data source. If you are using a simple mode workspace, you can select the MaxCompute data source for the production environment.
Click the Run button at the top. On the cost estimation page, click Run.
Click the icon in the query results to view the results as a chart. To customize the chart style, click the icon in the upper-right corner of the chart.
You can also click Save in the upper-right corner of the chart to save the chart as a card, and then click Card in the left navigation pane to view it.
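
A group profile is easier to read when each count is also expressed as a share of the total. The following Python sketch ranks education levels by count and computes their shares. The function name `education_shares` is hypothetical, and the counts are invented for illustration. They are not real output from result_table.

```python
def education_shares(rows):
    """Return (education, num, share) tuples sorted by num, largest first."""
    total = sum(num for _, num in rows)
    if total == 0:
        return []
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(education, num, round(num / total, 3)) for education, num in ranked]


# Hypothetical counts, not real output from result_table.
example_rows = [("high.school", 30), ("university.degree", 50), ("basic.9y", 20)]
for education, num, share in education_shares(example_rows):
    print(f"{education:<20}{num:>5}{share:>8.1%}")
```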

Next steps

For details on modules and parameters, see Data Studio (new version) and Data Analysis.
Beyond the modules covered in this tutorial, DataWorks provides modules such as Data Modeling, Data Quality, Data Security Guard, DataService Studio, Data Integration, and Node scheduling configuration for end-to-end data monitoring and O&M.
For more hands-on tutorials, see More use cases and tutorials.

Resource release and cleanup

To release resources:

Stop auto-triggered tasks.
1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.
2. In Auto Triggered Node O&M > Auto Triggered Nodes, select all the periodic tasks that you created (you do not need to undeploy the workspace root node), and then click More Actions > Undeploy at the bottom.
Delete nodes and unbind MaxCompute resources.
1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.
2. In the left navigation pane of DataStudio, click the icon to enter the data development page. Then, in the Workspace Directories area, find the created workflow, right-click the workflow, and click Delete.
3. In the left navigation pane, click > Computing Resources, find the associated MaxCompute computing resource, and click Disassociate. In the confirmation window, check the options and follow the instructions.
Delete the MaxCompute project.
Go to the MaxCompute Project Management page, find the created MaxCompute project, click Delete in the Actions column, and follow the instructions to complete the deletion.
Delete the DataWorks workspace.
1. In the DataWorks Console, find the workspace and click Actions > Delete Workspace.
2. In the Delete Workspace dialog box, click OK to delete the workspace.